What is ASCII and Unicode and a Character Set and this and that and Encoding and a Character...

I am still not sure myself!

These are my notes from various sources I found on the Internet. I think the Terminology is not used in a Standard way, which leads to a lot of confusion. The sources I have used are linked at the very bottom of this article.

For example, some articles will say that ASCII is a Character Set, others will say it is an Encoding and others will say it is a Standard. There are several divergences like this.

History Time

Morse Code
Morse invented the code he used to send his historic message in 1838. Like the binary system used in modern computers, it is based on combinations of two possible values: in the case of Morse code, a dot or a dash. However, unlike the character codes used in modern computers, the combinations of the two values used to represent characters in Morse code vary in length.

The dots and dashes that form the pattern for an individual character are separated by an interval equivalent to one dot, the individual characters of a word are separated by an interval equivalent to three dots, and the words in a message are separated by an interval equivalent to six dots.

The next great leap in telegraph technology was a primitive printing telegraph, or "teleprinter," patented by Jean-Maurice-Émile Baudot.

Like Morse's telegraph, it involved the creation of a new character code, the 5-bit Baudot code, which was also the world's first binary character code for processing textual data.

Being a 5-bit character code, Baudot code only has room for handling 32 elements (2^5 = 32 code points). This is not enough to handle both the letters of the Latin alphabet plus Arabic numerals and punctuation marks, so Baudot code employs a "locking shift scheme" to switch between two planes of 32 elements each.

ASCII Character Codes
As a result of the rapid development and spread of communications and data processing technologies in the United States in the first half of the 20th century, it became apparent there was a need for a standard character code for interchanging data that could handle the full character set of an English-language typewriter. The American National Standards Institute (ANSI) began studying this problem in the late 1950s, and it eventually decided that a 7-bit code that did not require shifting in the manner of Baudot code would be sufficient.

In 1963, ANSI announced the American Standard Code for Information Interchange (ASCII). However, ASCII as it was announced in 1963 left many positions, such as those for the lower case Latin letters, unallocated. It wasn't until 1968 that the currently used ASCII standard of 32 control characters and 96 printing characters was defined.

Moreover, in spite of the fact that ASCII was devised to avoid shifting, it included control characters for shifting, i.e., SHIFT IN (SI) and SHIFT OUT (SO) for Baudot-style locking shift, and ESCAPE (ESC) for non-locking shift. These control characters were later used to extend ASCII code into 8-bit codes with 190 printing characters.

Because U.S. computer vendors were also the largest computer vendors in the world at the time, the ASCII character code immediately became a de facto "international character code standard." Naturally, that made it necessary to adapt ASCII code to other languages that use the Latin alphabet, in particular the languages of Western Europe. This work was carried out by the International Organization for Standardization (ISO) in Geneva, Switzerland, which in 1967 issued ISO Recommendation 646.

ASCII code was also used as the basis for creating 7-bit character codes for languages that did not employ the Latin alphabet, such as Arabic and Greek, and in 1969 it was incorporated into the JIS character code of Japan. To date, a total of 180 character codes based on extensions of ASCII have been registered with the ISO.

Of course, the vast majority of people who come into contact with ASCII code today use it in the form of a "personal computer manufacturer's ASCII-based character set," which is usually an extended version of 7-bit ASCII designed for use throughout a region rather than in a single country.

Moreover, it should be pointed out that these 8-bit extensions of 7-bit ASCII code are not complete. For example, they do not include all the currency symbols used throughout a particular region.

ISO 8859-1 (Latin-1)
While 7-bit character codes such as ASCII are sufficient for processing English-language data, specifically modern English-language data, they are not adequate for processing data written in most of the Latin-based scripts of Europe, which employ various accent marks together with the letters of the Latin script. Moreover, in Europe there are also non-Latin, native scripts, such as the Greek alphabet, in addition to scripts from nations around the periphery of Europe, such as Arabic, Cyrillic, and Hebrew, which also have occasion to be used within the borders of Europe.

Accordingly, after ASCII was standardized, it became necessary to go well beyond it and create a number of new character codes to handle these European data processing needs. To that end, the International Organization for Standardization first created a standard called ISO 2022, which outlines how 7-bit and 8-bit character codes are to be structured and extended. This standard was later applied to create the standard unofficially known as "Latin-1" (ISO 8859-1).

ISO 10646 and Unicode
As mentioned above, U.S. computer firms began work in the first half of the 1980s on multilingual character sets and multilingual character encoding systems, and Xerox Corporation and IBM Corporation successfully implemented computer systems based on their research results. The Xerox researchers then proselytized their work to other U.S. software firms, and they were eventually successful in launching a U.S. industry project called Unification Code, or Unicode, the goal of which was to unify all of the world's character sets into a single large character set.

Unicode was also to be simple for computers to process, and to that end it had two important design goals: (1) avoiding the use of escape sequences to switch planes of characters, and (2) limiting the character space to 16 bits, or a maximum of 65,536 characters, and giving each character a fixed-length code (16 bits, or two bytes).

However, the International Organization for Standardization (ISO) in Geneva, Switzerland, the creator of those pesky escape sequences in the widely used ISO 2022 standard, also wanted to create a multilingual character code and encoding system. Unlike Unicode, however, it aimed at creating a 32-bit Universal Coded Character Set (UCS), which would use escape sequences to switch between large planes of characters that together would have enough space for as many as 4,294,967,296 characters--in other words, a character code that was for all practical purposes unlimited in nature.

This ISO multilingual standard, which went by the name of ISO/IEC DIS 10646 Version 1, was supported by Japanese and European researchers. However, ISO/IEC DIS 10646 Version 1 was not supported by the American computer firms, which were doing parallel research on Unicode and had even gone so far as to create a Unicode Consortium to conduct that research. They deemed Unicode to be superior to ISO/IEC DIS 10646 Version 1, since it was simpler. For that reason, they counter-proposed that Unicode be made the "Basic Multilingual Plane" [2] of the ISO's multilingual standard. Of course, since they were also the developers of the world's leading operating systems, they were in a position to create a parallel, de facto alternative to any multilingual character set and encoding system that the ISO might develop. Accordingly, they were successful in persuading the ISO that ISO/IEC DIS 10646 Version 1 should be dropped. It was, and a Unicode-based multilingual scheme called ISO/IEC 10646 Version 2 came into being. In essence, Unicode had swallowed the ISO standard, which is now called ISO/IEC 10646-1:1993.

    So what is the difference between ISO 10646 and Unicode, again?
    In the late 1980s, there were two independent attempts to create a single unified character set. One was the ISO 10646 project of the International Organization for Standardization (ISO); the other was the Unicode Project organized by a consortium of (initially mostly US) manufacturers of multi-lingual software. Fortunately, the participants of both projects realized around 1991 that two different unified character sets are not exactly what the world needs. They joined their efforts and worked together on creating a single code table. Both projects still exist and publish their respective standards independently; however, the Unicode Consortium and ISO/IEC JTC1/SC2 have agreed to keep the code tables of the Unicode and ISO 10646 standards compatible, and they closely coordinate any further extensions. Unicode 1.1 corresponded to ISO 10646-1:1993, Unicode 3.0 corresponded to ISO 10646-1:2000, Unicode 3.2 added ISO 10646-2:2001, Unicode 4.0 corresponds to ISO 10646:2003, and Unicode 5.0 corresponds to ISO 10646:2003 plus its amendments 1–3. All Unicode versions since 2.0 are compatible: only new characters will be added, and no existing characters will be removed or renamed in the future.

    The Unicode Standard defines in addition much more semantics associated with some of the characters and is in general a better reference for implementors of high-quality typographic publishing systems. Unicode specifies algorithms for rendering presentation forms of some scripts (say Arabic), handling of bi-directional texts that mix for instance Latin and Hebrew, algorithms for sorting and string comparison, and much more.


A character is a minimal unit of text that has semantic value.

A Glyph is a particular image that represents a Character or part of a Character.

Character Set
Maybe a better term is: Character Repertoire.

The phrase character set is used in a variety of meanings. It might denote just a character repertoire, but it may also refer to a character code, and quite often a particular character encoding is implied too. Unfortunately this causes much confusion. It is even the official term to be used in several contexts by Internet protocols, in MIME headers.

No specific internal presentation in computers or data transfer is assumed. The repertoire per se does not even define an ordering for the characters; ordering for sorting and other purposes is to be specified separately. A character repertoire is usually defined by specifying names of characters and a sample (or reference) presentation of characters in visible form. Notice that a character repertoire may contain characters which look the same in some presentations but are regarded as logically distinct, such as Latin uppercase A, Cyrillic uppercase A, and Greek uppercase alpha.

The biggest Character Set is the Universal Character Set, with 1,114,112 code points (17 planes of 65,536 code points each).

Code Points
Also used: Character Code.
A mapping, often presented in tabular form, which defines a one-to-one correspondence between characters in a character repertoire and a set of non-negative integers. That is, it assigns a unique numerical code, a code position, to each character in the repertoire. In addition to being often presented as one or more tables, the code as a whole can be regarded as a single table and the code positions as indexes.
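
To make this mapping concrete, here is a minimal sketch in Java (the language of the "Java Related" section below); the lines can be pasted into jshell, and the particular characters are just examples:

    // The character code assigns each character in the repertoire a non-negative integer.
    "A".codePointAt(0)          // 65   -> U+0041
    "\u0F00".codePointAt(0)     // 3840 -> U+0F00
    Character.getName(65)       // "LATIN CAPITAL LETTER A"
    Character.getName(3840)     // "TIBETAN SYLLABLE OM"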

"How" the Characters are stored in memory. There can be more than one Encoding for a given Character Repertoire.

The technical term used to denote a character encoding in the Internet media type context is "character set", abbreviated "charset". This has caused a lot of confusion, since "set" can easily be understood as repertoire!

The official registry of "charset" (i.e., character encoding) names, with references to documents defining their meanings, is kept by IANA at this page.

Specifically, when data is sent in MIME format, the media type and encoding are specified in a manner illustrated by the following example:
Content-Type: text/html; charset=iso-8859-1
This specifies, in addition to saying that the media type is text and subtype is html, that the character encoding is ISO 8859-1.

From Unicode's point of view, text is stored on a computer as a series of numbers, one per character. There are many different ways to arrange these numbers in memory (or in a network transmission), some straightforward and efficient, some less so. These are called “encodings”. Unicode itself defines several different encoding schemes, the two best known of which are UTF-8 and UTF-16.
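
A small Java sketch of that idea, showing one and the same character turning into different byte sequences under different encodings (jshell-friendly; the letter é, U+00E9, is just an arbitrary example):

    import java.nio.charset.StandardCharsets;

    String s = "\u00E9";                              // é, code point U+00E9
    s.getBytes(StandardCharsets.ISO_8859_1).length    // 1 byte : E9
    s.getBytes(StandardCharsets.UTF_8).length         // 2 bytes: C3 A9
    s.getBytes(StandardCharsets.UTF_16BE).length      // 2 bytes: 00 E9

The code point is the same in every case; only its representation in memory differs.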

Code Units
A particular Encoding will represent Code Points as a sequence of one or more Code Units.

A Code Unit is a unit of memory: 8, 16, or 32 bits.

A code unit in US-ASCII consists of 7 bits.
A code unit in UTF-8 consists of 8 bits.
A code unit in UTF-16 consists of 16 bits.
A code unit in UTF-32 consists of 32 bits.

Basic Multilingual Plane
The range of valid code points for the Unicode standard is U+0000 to U+10FFFF, inclusive. Characters in the range U+0000 to U+FFFF make up the Basic Multilingual Plane (BMP). Characters in the range U+10000 to U+10FFFF are called supplementary characters.

The Universal Character Set uses surrogates to address characters outside the initial Basic Multilingual Plane without resorting to code units of more than 16 bits. In this way, UCS has a built-in 16-bit Encoding capability in the form of UTF-16.

Surrogate Pair
Code Points outside the BMP are encoded in UTF-16 as two 16-bit Code Units, called a Surrogate Pair.
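
A short Java sketch showing a surrogate pair in action; U+1F600 is just one arbitrary example of a code point outside the BMP:

    String s = new String(Character.toChars(0x1F600));  // one code point above U+FFFF
    s.length()                          // 2 -> stored as two UTF-16 code units
    Integer.toHexString(s.charAt(0))    // "d83d" -> high (leading) surrogate
    Integer.toHexString(s.charAt(1))    // "de00" -> low (trailing) surrogate
    s.codePointCount(0, s.length())     // 1 -> still a single code point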


ASCII (American Standard Code for Information Interchange)
ASCII denotes an old Character Repertoire, Character Code, and Encoding.

ASCII has been used and is used so widely that often the word ASCII refers to "text" or "plain text" in general, even if the character code is something else! The words "ASCII file" quite often mean any text file as opposed to a binary file.

ASCII is one of the most commonly known and frequently misunderstood character encodings. Contrary to popular belief, it is only 7-bit: there are no ASCII characters above 127. If anyone says that they wish to encode (for example) "ASCII 154", they may well not know exactly which encoding they actually mean. If pressed, they're likely to say it's "extended ASCII". There is no encoding scheme called "extended ASCII". There are many 8-bit encodings which are supersets of ASCII, and usually it is one of these which is meant - commonly whatever Windows Code Page is the default for their computer.

ISO 8859-1 (ISO Latin 1)
The ISO 8859-1 standard (which is part of the ISO 8859 family of standards) defines a character repertoire identified as "Latin alphabet No. 1", commonly called "ISO Latin 1", as well as a character code for it. The repertoire contains the ASCII repertoire as a subset, and the code numbers for those characters are the same as in ASCII. The standard also specifies an encoding, which is similar to that of ASCII: each code number is presented simply as one byte.

In addition to the ASCII characters, ISO Latin 1 contains various accented characters and other letters needed for writing languages of Western Europe, and some special characters. These characters occupy code positions 160 to 255.

ISO-8859-1 is (according to the standards at least) the default encoding of documents delivered via HTTP with a MIME type beginning with "text/" (however, the HTML5 specification requires that documents advertised as ISO-8859-1 actually be parsed with the Windows-1252 encoding). It is the default encoding of the values of certain descriptive HTTP headers, and it defines the repertoire of characters allowed in HTML 3.2 documents (HTML 4.0, however, is based on Unicode). It and Windows-1252 are often assumed to be the encoding of text on Unix and Microsoft Windows in the absence of locale or other information; this is only gradually being replaced with Unicode encodings such as UTF-8 or UTF-16.

Windows-1252 (CP-1252)
Windows-1252 or CP-1252 is a character encoding of the Latin alphabet, used by default in the legacy components of Microsoft Windows in English and some other Western languages. It is one version within the group of Windows code pages.
This character encoding is a superset of ISO 8859-1 in terms of printable characters, but differs from the IANA's ISO-8859-1 by using displayable characters rather than control characters in the 80 to 9F (hex) range. Notable additional characters are curly quotation marks, the Euro sign, and all the printable characters that are in ISO 8859-15. It is known to Windows by the code page number 1252, and by the IANA-approved name "windows-1252".
It is very common to mislabel Windows-1252 text with the charset label ISO-8859-1. A common result was that all the quotes and apostrophes (produced by "smart quotes" in word-processing software) were replaced with question marks or boxes on non-Windows operating systems, making text difficult to read. Most modern web browsers and e-mail clients treat the MIME charset ISO-8859-1 as Windows-1252 to accommodate such mislabeling. This is now standard behavior in the HTML5 specification, which requires that documents advertised as ISO-8859-1 actually be parsed with the Windows-1252 encoding.
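
A small Java sketch of why the mislabeling matters, assuming a JDK that ships the windows-1252 charset (standard JDKs do). Bytes 0x93 and 0x94 are the Windows-1252 "smart quotes":

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    byte[] bytes = { (byte) 0x93, (byte) 0x94 };
    new String(bytes, Charset.forName("windows-1252"))  // the curly quotes U+201C and U+201D
    new String(bytes, StandardCharsets.ISO_8859_1)      // two invisible C1 control characters (U+0093, U+0094)

Decoding the same two bytes with the wrong charset silently produces control characters instead of quotation marks, which is exactly the mojibake described above.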

Unicode
Unicode is not an Encoding. Unicode refers to the abstract Character Set itself, not to any particular Encoding.

First Unicode Proposal is here, if you are interested.

Unicode is a computing industry standard developed in conjunction with the Universal Coded Character Set (UCS) standard and published as The Unicode Standard, for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems.

The standard is maintained by the Unicode Consortium.

The Unicode Standard assigns every Character a Code Point. Unicode defines a large (and steadily growing) number of characters.

Each character gets a name and a number, for example LATIN CAPITAL LETTER A is 65 and TIBETAN SYLLABLE OM is 3840. Unicode includes a table of useful character properties such as "this is lower case" or "this is a number" or "this is a punctuation mark".

Unicode labels each abstract character with a "Code Point". For compatibility with ASCII, code points U+0000 to U+007F (0-127) are the same as in ASCII.

For example, "A" maps to Code Point U+0041 (the code point written in hex; 65 in decimal).

Rather than mapping characters directly to bytes, Unicode separately defines what characters are available, their corresponding natural numbers (code points), how those numbers are encoded as a series of fixed-size natural numbers (code units), and finally how those units are encoded as a stream of octets (encoding schemes).

Unicode can be implemented by different Character Encodings. The most commonly used encodings are UTF-8 and UTF-16.

Each UTF-n represents a Code Point as a sequence of one or more Code Units, where each Code Unit occupies n bits.

Universal Coded Character Set (ISO-10646)
The Unicode Consortium (UC) and the International Organization for Standardization (ISO) collaborate on the Universal Character Set (UCS).

The Universal Coded Character Set (UCS) is a standard set of characters defined by the International Standard ISO/IEC 10646, which is the basis of many character encodings. The UCS contains over 128,000 abstract characters, each identified by an unambiguous name and an integer number called its code point.

Characters (letters, numbers, symbols, ideograms, logograms, etc.) from the many languages, scripts, and traditions of the world are represented in the UCS with unique code points.

Universal Coded Character Set vs Unicode

Unicode and ISO 10646 can largely be regarded as "the same thing" in that they are compatible in almost all respects.

ISO 10646 and Unicode have an identical repertoire and numbers—the same characters with the same numbers exist in both standards, although Unicode releases new versions and adds new characters more often. Unicode has rules and specifications outside the scope of ISO 10646. ISO 10646 is a simple character map, an extension of previous standards like ISO 8859. In contrast, Unicode adds rules for collation, normalization of forms, and the bidirectional algorithm for right-to-left scripts such as Arabic and Hebrew. For interoperability between platforms, especially if bidirectional scripts are used, it is not enough to support ISO 10646; Unicode must be implemented.

To support these rules and algorithms, Unicode adds many properties to each character in the set such as properties determining a character’s default bidirectional class and properties to determine how the character combines with other characters. If the character represents a numeric value such as the European number ‘8’, or the vulgar fraction ‘¼’, that numeric value is also added as a property of the character. Unicode intends these properties to support interoperable text handling with a mixture of languages.
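
Java's Character class exposes a few of these Unicode properties directly, so a quick sketch can illustrate the idea (the particular characters are just examples):

    Character.getNumericValue('8')        // 8  -> the digit carries its numeric value as a property
    Character.getNumericValue('\u00BC')   // -2 -> ¼ has a numeric value, but not a nonnegative integer one
    Character.getType('8') == Character.DECIMAL_DIGIT_NUMBER                          // true
    Character.getDirectionality('\u05D0') == Character.DIRECTIONALITY_RIGHT_TO_LEFT   // true (Hebrew alef)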

UTF (Unicode Transformation Format)

Each UTF maps a code point to a sequence of code units: one to four 8-bit units in UTF-8, one or two 16-bit units in UTF-16, and exactly one 32-bit unit in UTF-32.

UTF-8
The original UTF-8 specification covered code points of up to 31 bits (the original limit of the Universal Character Set). In November 2003, UTF-8 was restricted by RFC 3629 to end at U+10FFFF, in order to match the constraints of the UTF-16 character encoding. This removed all 5- and 6-byte sequences, and about half of the 4-byte sequences.

In UTF-8 a code point may be represented using 8, 16, 24, or 32 bits (one to four bytes); see the sketch after the list below.

* If the character is encoded using one byte, the first bit is "0"

* If the character is encoded using two bytes, the first 3 bits are "110" and the second byte starts with "10" bits

* If the character is encoded using three bytes, the first 4 bits are "1110" and the other bytes start with "10"

* If the character is encoded using four bytes, the first 5 bits are "11110" and the other bytes start with "10"
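
To make those bit patterns concrete, here is a hand-rolled sketch of a UTF-8 encoder that follows exactly the rules above. encodeUtf8 is a hypothetical helper for illustration only; real code should simply use String.getBytes(StandardCharsets.UTF_8). The method works as-is in jshell (add static inside a class):

    // Encode a single code point (up to U+10FFFF) as UTF-8 bytes, for illustration only.
    byte[] encodeUtf8(int cp) {
        if (cp < 0x80)        // 1 byte : 0xxxxxxx
            return new byte[] { (byte) cp };
        if (cp < 0x800)       // 2 bytes: 110xxxxx 10xxxxxx
            return new byte[] { (byte) (0xC0 | (cp >> 6)),
                                (byte) (0x80 | (cp & 0x3F)) };
        if (cp < 0x10000)     // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] { (byte) (0xE0 | (cp >> 12)),
                                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                                (byte) (0x80 | (cp & 0x3F)) };
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return new byte[] { (byte) (0xF0 | (cp >> 18)),
                            (byte) (0x80 | ((cp >> 12) & 0x3F)),
                            (byte) (0x80 | ((cp >> 6) & 0x3F)),
                            (byte) (0x80 | (cp & 0x3F)) };
    }

    java.util.Arrays.toString(encodeUtf8(0xE9))   // [-61, -87], i.e. 0xC3 0xA9, the same bytes "é".getBytes(UTF_8) produces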

UTF-16
UTF-16 code units are 16 bits, twice the size of UTF-8's 8-bit code units. Any code point with a scalar value less than U+10000 is encoded with a single code unit. Code points with a value of U+10000 or higher require two code units each. These pairs of code units have a unique term in UTF-16: "Unicode surrogate pairs".
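
The mapping from a supplementary code point to its surrogate pair is plain arithmetic; a small sketch (Character.toChars performs the same computation):

    int cp = 0x1F600;               // any code point >= U+10000
    int v  = cp - 0x10000;          // a 20-bit value
    int hi = 0xD800 + (v >> 10);    // high (leading) surrogate: 0xD83D
    int lo = 0xDC00 + (v & 0x3FF);  // low (trailing) surrogate:  0xDE00
    // Character.toChars(cp) returns exactly { (char) hi, (char) lo }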

UTF-32
The 32-bit code unit of UTF-32 is large enough that every code point is represented as a single code unit.


Internet Media Types (MIME)
Internet media types, often called MIME media types, can be used to specify a major media type ("top level media type", such as text), a subtype (such as html), and an encoding (such as iso-8859-1). They were originally developed to allow data other than plain ASCII to be sent by e-mail. They can be (and should be) used for specifying the encoding when data is sent over a network, e.g. by e-mail or using the HTTP protocol on the World Wide Web.


Java Related
The char data type (and therefore the value that a Character object encapsulates) is based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode Standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar value. (Refer to the definition of the U+n notation in the Unicode Standard.)

The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).

A char value, therefore, represents Basic Multilingual Plane (BMP) code points, including the surrogate code points, or code units of the UTF-16 encoding. An int value represents all Unicode code points, including supplementary code points. The lower (least significant) 21 bits of int are used to represent Unicode code points and the upper (most significant) 11 bits must be zero. Unless otherwise specified, the behavior with respect to supplementary characters and surrogate char values is as follows:

The methods that only accept a char value cannot support supplementary characters. They treat char values from the surrogate ranges as undefined characters. For example, Character.isLetter('\uD840') returns false, even though this specific value, if followed by any low-surrogate value in a string, would represent a letter.

The methods that accept an int value support all Unicode characters, including supplementary characters. For example, Character.isLetter(0x2F81A) returns true because the code point value represents a letter (a CJK ideograph).
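
A short sketch illustrating the difference between the char-based and int-based methods, using the same CJK ideograph U+2F81A mentioned above:

    String s = "A" + new String(Character.toChars(0x2F81A));  // LATIN CAPITAL LETTER A + a supplementary CJK ideograph

    Character.isLetter(s.charAt(1))        // false -> charAt(1) is only the high surrogate (U+D87E)
    Character.isLetter(s.codePointAt(1))   // true  -> the int-based method sees the whole code point

    // Iterating by code point rather than by char:
    s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));  // U+0041, U+2F81A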

In the Java SE API documentation, Unicode code point is used for character values in the range between U+0000 and U+10FFFF, and Unicode code unit is used for 16-bit char values that are code units of the UTF-16 encoding. For more information on Unicode terminology, refer to the Unicode Glossary.