Re: the historical accident of little-endian: historically, numerals in almost all languages are little-endian, from "thirteen" (3+10) to "five and twenty".

"Character" is a very slippery definition. It is language-sensitive ("rijstafel" is nine letters long if you're English, only eight if you're Dutch), and it doesn't always correspond to the number of symbols the user sees anyway: when the five codes 's','o','e','u','r' are rendered as the four glyphs "sœur", how many characters are really in the string? (Both answers are equally right and wrong, by the way.) Any argument to the contrary is based on the misplaced notion that somehow a byte is a character (I blame C).

UTF-8 strings will sort in codepoint order if you give them to strcmp(), which is as good and as bad as its behaviour in ASCII. However, anyone who thinks that strcmp() sorts strings in "alphabetical" order is at best living in a dream-world or, at worst, a hopeless xenophobe. As for the other "problem", that of length: strlen() returns the length of a UTF-8 string in bytes, and outside of font rendering engines, that is all you ever need to know to write proper text-processing code.

Perl's original UTF-8 implementation ("utf8") was created before the format was standardized. Broadly speaking, it follows Postel's interoperability principle and allows many sequences that were forbidden by the standard when it was finalized. That made it easier for people to start using UTF-8 with Perl. Those sequences, such as non-minimal encodings, have bad security implications: they make it too easy to slip malicious data past poorly designed filters (i.e., most filters), for example. The later UTF-8 implementation follows the spec. It's good to have an implementation that follows the spec, and it's especially good when that implementation is a lot safer than the overly permissive one it supersedes. But if Perl had simply dropped "utf8", it would have broken at least some old programs, and if it had made "utf8" a synonym for "UTF-8", some old data would have been rejected.
Yes, that was my reaction too, but then I admit near-total ignorance on the subject, beyond what I've just read. Simplistically, it seems the best solution is to implement a sufficiently large encoding length to accommodate all possible characters, which was supposedly the goal of UTF-16, except Becker naively assumed that "16 bits ought to be enough for anyone" (to paraphrase a well-known fallacy). Again, simplistically, the answer to these "enough for anyone" fallacies would seem to be dynamic allocation, as in dynamic arrays or linked lists, which is in fact what UTF-8 does, although in its case the dynamic allocation pertains to the encoding length of each character rather than the overall length of the array, if I'm reading the descriptions correctly. UTF-16 does that too, apparently, thus defeating its original objective, and it suffers from ASCII and endian compatibility issues and, probably more than anything else, from Microsoft's typically botched implementation.

I'm no fan of Perl, but the utf8 / UTF-8 distinction is probably the best solution to a real problem.