The Absolute Minimum I Know About Unicode And Character Sets.

Reminder for myself from this great article.
  • "plain text = ascii = characters are 8 bits" is wrong.
  • Back in the semi-olden days, when Unix was being invented, the only characters that mattered were good old unaccented English letters, and we had a code for them called ASCII.
  • ASCII was able to represent every printable character using a number between 32 and 127 (codes below 32 were reserved for control characters).
  • Space was 32, the letter "A" was 65, etc. 
  • This could conveniently be stored in 7 bits.
  • Because bytes have room for up to eight bits, lots of people got to thinking, "gosh, we can use the codes 128-255 for our own purposes." 
  • The trouble was, lots of people had this idea at the same time, and they had their own ideas of what should go where in the space from 128 to 255.
  • Unicode was an effort to create a single Character Set that included every reasonable writing system on the planet.
  • Until now, we've assumed that a letter maps to some bits which you can store on disk or in memory: A -> 0100 0001.
  • In Unicode, a letter maps to something called a Code Point which is still just a theoretical concept.
  • Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium. 
  • Magic Numbers are written like this: U+0639.  
  • This magic number is called a Code Point.
  • The U+ means "Unicode" and the numbers are hexadecimal. 
  • U+0639 is the Arabic letter Ain. 
  • The English letter A would be U+0041.
  • Hello corresponds to these five Code Points: U+0048 U+0065 U+006C U+006C U+006F.
  • We haven't yet said anything about how to store this in memory.
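
To make the Code Point idea concrete, here is a minimal Python sketch. It relies only on the built-ins ord() and chr(); the helper name code_points is made up for these notes. Nothing here decides how the text is stored; it only looks up the magic numbers.

```python
def code_points(text):
    """Return the U+XXXX notation for each character in text."""
    return [f"U+{ord(ch):04X}" for ch in text]

# "Hello" maps to the five code points listed above.
print(code_points("Hello"))

# The Arabic letter Ain, written as an escape so the source stays ASCII.
print(code_points("\u0639"))

# Going the other way: from a code point back to a character.
print(chr(0x0041))
```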

Encodings

  • The earliest idea for Unicode Encoding was: "hey, let's just store those numbers in two bytes each". 
  • So Hello becomes: 00 48 00 65 00 6C 00 6C 00 6F
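  • Right? Not so fast! Couldn't it also be: 48 00 65 00 6C 00 6C 00 6F 00 ?
  • Both are valid. They differ only in byte order (high byte first versus low byte first), so a reader has to know which order the writer used.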
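
A quick way to see both byte orders is to ask Python for them. This sketch assumes Python 3.8 or later (for bytes.hex with a separator) and uses the utf-16-be and utf-16-le codecs as stand-ins for the two-byte scheme; for these five letters they produce exactly the bytes shown above.

```python
text = "Hello"

# High byte first (big-endian).
big = text.encode("utf-16-be").hex(" ").upper()

# Low byte first (little-endian).
little = text.encode("utf-16-le").hex(" ").upper()

print(big)
print(little)
```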

UTF-8

  • UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes.
  • In UTF-8, every code point from 0-127 is stored in a single byte. 
  • Only code points 128 and above are stored using 2, 3, or 4 bytes. (The original UTF-8 design allowed up to 6 bytes, but the current standard, RFC 3629, caps it at 4.)
  • This has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII.
  • Specifically, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, will be stored as 48 65 6C 6C 6F, which is the same as it was stored in ASCII.
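
The variable-length behaviour is easy to check in Python. This is only an illustrative sketch (Python 3.8+ assumed, helper name utf8_hex invented for these notes): plain English letters come out as one byte each, while code points above 127 take two or more.

```python
def utf8_hex(text):
    """Return the UTF-8 bytes of text as space-separated hex."""
    return text.encode("utf-8").hex(" ").upper()

# English text: one byte per character, identical to ASCII.
print(utf8_hex("Hello"))
print("Hello".encode("utf-8") == "Hello".encode("ascii"))

# Code points above 127 need more bytes.
print(utf8_hex("\u0639"))      # Arabic letter Ain, 2 bytes
print(utf8_hex("\u20ac"))      # Euro sign, 3 bytes
print(utf8_hex("\U0001F600"))  # an emoji outside the first 65,536 code points, 4 bytes
```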

Other Popular Encodings

  • There are hundreds of traditional encodings which can only store some Code Points correctly and change all the other code points into question marks.
  • Some popular encodings of English text are Windows-1252 (the Windows 9x standard for Western European languages) and ISO-8859-1, aka Latin-1 (also useful for any Western European language).
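
The question-mark behaviour can be reproduced with Python's built-in codecs. A small sketch, assuming errors="replace" as the lossy policy (Python's default is to raise an exception instead); the sample string is made up.

```python
text = "caf\u00e9 \u0639"  # "cafe" with an accented e, a space, then Arabic Ain

# Latin-1 knows the accented e but not the Arabic letter.
latin1_bytes = text.encode("latin-1", errors="replace")

# ASCII knows neither.
ascii_bytes = text.encode("ascii", errors="replace")

print(latin1_bytes)
print(ascii_bytes)

# Windows-1252 and Latin-1 are close but not identical:
# the Euro sign exists in the former and not in the latter.
euro_cp1252 = "\u20ac".encode("cp1252")
euro_latin1 = "\u20ac".encode("latin-1", errors="replace")
print(euro_cp1252, euro_latin1)
```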

Summary / Conclusion

It does not make sense to have a string without knowing what encoding it uses. 

There Ain't No Such Thing As Plain Text.

If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.

There are over a hundred encodings, and above code point 127 all bets are off.
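
To see why "all bets are off", here is a sketch that decodes the same bytes under different assumptions. It uses Python's cp437 (IBM PC), cp862 (Hebrew DOS) and latin-1 codecs; the helper names() is invented for these notes and prints Unicode character names so the output stays readable on any terminal.

```python
import unicodedata

def names(text):
    """Return the official Unicode name of each character in text."""
    return [unicodedata.name(ch) for ch in text]

# One byte, two legacy code pages, two different letters.
mystery = bytes([0x82])
print(names(mystery.decode("cp437")))
print(names(mystery.decode("cp862")))

# UTF-8 bytes misread as Latin-1: one accented letter becomes two junk characters.
raw = "caf\u00e9".encode("utf-8")
right = raw.decode("utf-8")
wrong = raw.decode("latin-1")
print(names(right[-1:]))
print(names(wrong[-2:]))
```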

For a web page, the original idea was that the web server would return a Content-Type HTTP header (like the one used in email) along with the web page itself -- not in the HTML itself, but as one of the response headers that are sent before the HTML page.
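
As a rough illustration of reading the encoding from such a header, the sketch below leans on the standard library's email.message.Message, which already knows how to parse Content-Type parameters. The function name charset_from_content_type and the sample header values are assumptions for these notes.

```python
from email.message import Message

def charset_from_content_type(header_value, default=None):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset in lowercase, or default if absent.
    return msg.get_content_charset(default)

print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```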

This causes problems. Suppose you have a big web server with lots of sites and hundreds of pages contributed by lots of people in lots of different languages and all using whatever encoding their copy of Microsoft FrontPage saw fit to generate. The web server itself wouldn't really know what encoding each file was written in, so it couldn't send the Content-Type header.

It would be convenient if you could put the Content-Type of the HTML file right in the HTML file itself, using some kind of special tag. 

You can always get this far on the HTML page without starting to use funny letters:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

But that meta tag really has to be the very first thing in the <head> section because as soon as the web browser sees this tag it's going to stop parsing the page and start over after reinterpreting the whole page using the encoding you specified.
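
The "read a little, then start over" idea can be sketched as below. This is a simplification for these notes, not what real browsers do (they follow a more elaborate sniffing algorithm); the regular expression, the 1024-byte limit and the function name sniff_meta_charset are all assumptions. It works because the tag itself is plain ASCII, so it can be found before the encoding is known.

```python
import re

# Look for charset=... inside a <meta ...> tag, working on raw bytes.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)

def sniff_meta_charset(raw, limit=1024):
    """Return the charset declared in an early meta tag, or None."""
    match = META_CHARSET.search(raw[:limit])
    return match.group(1).decode("ascii").lower() if match else None

page = (
    b'<html>\n<head>\n'
    b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n'
    b'<title>caf\xc3\xa9</title>\n'
)

charset = sniff_meta_charset(page)
print(charset)

# "Start over": reinterpret the whole page using the declared encoding.
html = page.decode(charset or "windows-1252")
print(len(page), len(html))
```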