UTF Matters for the last time.

Yes, the last time.

These are my sources:

And this is my summary:

Unicode now has three encoding forms: UTF-8, UTF-16, and UTF-32.

Characters vs. code points vs. code units
In principle, the goal of Unicode is pretty simple: to assign each character in the world a number, called a code point (or, strictly speaking, a scalar value: any code point other than the surrogates reserved for UTF-16).

Given this goal, we can just say that Unicode encodes characters, right? Wrong.

A particular encoding will represent code points as a sequence of one or more code units.

A code unit is a unit of memory: 8, 16, or 32 bits.

Each UTF-n represents a code point as a sequence of one or more code units, where each code unit occupies n bits.
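To make the code point vs. code unit distinction concrete, here is a minimal Python sketch that counts how many code units each form needs for a few sample code points. The helper name code_units is my own, and the count is simply the encoded byte length divided by the unit size, so treat it as an illustration rather than a reference implementation.

```python
def code_units(text):
    """Count the code units needed to represent text in each encoding form."""
    return {
        "UTF-8": len(text.encode("utf-8")),            # 8-bit units, 1 byte each
        "UTF-16": len(text.encode("utf-16-le")) // 2,  # 16-bit units, 2 bytes each
        "UTF-32": len(text.encode("utf-32-le")) // 4,  # 32-bit units, 4 bytes each
    }

# One code point each: U+0041, U+00E9, U+20AC, and U+1F600 (outside the BMP).
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}", code_units(ch))
```

The last sample is the interesting one: a single code point that takes four code units in UTF-8, two in UTF-16 (a surrogate pair), and one in UTF-32.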

Serialized formats
Serialization is the process of converting a sequence of code units into a sequence of bytes for storage or transmission. There are two complications with serialization, endianness and encoding signatures:

If a code unit is not a single byte, it can be written in two ways because of differences in machine architectures: big endian (most significant byte first) or little endian (least significant byte first). With today's microprocessor speed this is not a big deal, but at the time Unicode was being adopted it was felt that both BE and LE formats were required.

If a system does not tag files with the character encoding, then it might know that the file contains text, but not know which encoding is used.

To meet these two requirements (from an unnamed, but rather influential company), the character ZERO WIDTH NO-BREAK SPACE (U+FEFF) can be used as a signature in the initial few bytes of a file. When the character has that usage, it is called a byte order mark (BOM). The BOM has the special feature that its byte-swapped counterpart (U+FFFE) is defined to never be a valid Unicode character, so it also serves to indicate the endianness. This signature is not part of the content -- think of it as a mini-header -- and must be stripped when processing. Otherwise, blindly concatenating two files, for example, will give an incorrect result: the second file's signature ends up in the middle of the text.
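The signature handling can be sketched in Python as follows. The BOM constants come from the standard codecs module. The SIGNATURES table, the decode_with_signature helper, and the two in-memory "files" are hypothetical names I made up for the illustration, and the UTF-8 fallback for unsigned data is an assumption.

```python
import codecs

# Signatures to look for, UTF-32 first: the UTF-32 LE signature (FF FE 00 00)
# begins with the same two bytes as the UTF-16 LE one (FF FE), so order matters.
SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def decode_with_signature(data, default="utf-8"):
    """Decode bytes, using and removing a leading signature if one is present."""
    for bom, encoding in SIGNATURES:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)
    # No signature: fall back to an assumed default encoding.
    return data.decode(default)

# Hypothetical file contents: each "file" starts with its own signature.
file_a = codecs.BOM_UTF16_LE + "abc".encode("utf-16-le")
file_b = codecs.BOM_UTF16_LE + "def".encode("utf-16-le")

# Blind concatenation: only the first signature is consumed by the decoder,
# so the second one survives as a U+FEFF character in the middle of the text.
blind = (file_a + file_b).decode("utf-16")
print(repr(blind))

# Stripping each signature before joining gives the intended text.
clean = decode_with_signature(file_a) + decode_with_signature(file_b)
print(repr(clean))
```

One caveat with this kind of sniffing: a UTF-16 LE file whose first character after the signature is U+0000 starts with the same four bytes as a UTF-32 LE signature, so the detection is a heuristic, not a guarantee.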