Unicode guide

From JookWiki

This is a WIP page, take nothing here as final.

If you've ever tried to learn Unicode you've most likely looked at online tutorials and learning resources. These tend to focus on specific details about how Unicode works instead of the broader picture.

This guide is my attempt to help you build a mental model of Unicode that can be used to write functional software and navigate the official Unicode standards and resources.

As a disclaimer: I'm just a random person, some of this might be wrong. But hopefully by the end of reading this you should be able to correct me.

Important note: This page uses Unicode characters in various examples. These may display incorrectly if your browser or screen reader has trouble rendering Unicode text. For this reason I've tried my best to write this article so it doesn't rely on the examples.

Standards[edit | edit source]

The Unicode standard defines the following:

  • A large numeric codespace
  • A large multilingual database of characters
  • A database of character properties
  • How to encode and decode the codespace
  • How to normalize equivalent text
  • How to map text between different cases
  • How to segment text into words, sentences, lines, and paragraphs
  • How to determine text direction

Some portions of the standard may be overridden (also known as 'tailoring') to aid in localization.

The standard is freely available online in the following pieces:

The Unicode Consortium also defines these in separate standards:

  • How to order text for sorting
  • How to incorporate Unicode into regular expressions
  • How to handle emoji sequences
  • How to handle confusable characters and other security concerns
  • A repository of shared localization data

These are also freely available online at:

Policies for stability in these standards can be found at the Unicode Consortium Policies page.

Characters[edit | edit source]

Unicode provides two distinct definitions of the term 'character': Abstract characters and encoded characters. When discussing Unicode the term 'character' means an encoded character.

Abstract characters are units of writing that make up textual data. These are usually some portion of a written script that has a unique identity independent of Unicode, such as a letter, symbol, accent, logogram, or spacing, but they may be something else entirely. The best way to think of these is as the atoms used for text editing, display, organization, and storage.

Encoded characters are mappings of an abstract character to the Unicode codespace as one or more code points. This is almost always what people mean by 'character' in Unicode discussion. There's not a one-to-one mapping between abstract and encoded characters: Abstract characters might be mapped multiple times to aid in compatibility with other character sets, they might not be mapped at all and instead represented using a sequence of other encoded characters, or they might not be representable at all and require addition in future Unicode versions.

In addition to having a code point each character has a set of properties that provide information about the character to aid in writing Unicode algorithms. These include things like name, case, category, script, direction, numeric value, and rendering information.
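To make this concrete, here's a small sketch using Python's unicodedata module, which exposes a subset of these properties (names, categories, combining classes, numeric values):

    import unicodedata

    # Look a character up by name, then query some of its properties.
    ch = unicodedata.lookup("LATIN SMALL LETTER E WITH ACUTE")   # 'é'

    print(unicodedata.name(ch))        # LATIN SMALL LETTER E WITH ACUTE
    print(unicodedata.category(ch))    # Ll (Letter, lowercase)
    print(unicodedata.combining(ch))   # 0 (not a combining mark)
    print(unicodedata.numeric("5"))    # 5.0 (the numeric value property of DIGIT FIVE)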

The last point I want to make is a warning: Characters do not correspond to some human identifiable unit of text such as a glyph, letter, phoneme, syllable, vowel or consonant. They are only useful for building higher level abstractions. General text processing should be done with groups of characters and Unicode-aware algorithms.

Some examples of characters are:

I've linked to the Unicode Utilities page for each character so you can see the character properties.

Code points[edit | edit source]

The Unicode standard defines a range of integers from 0x0 to 0x10FFFF as the 'Unicode codespace', and defines a code point as a value within this codespace.

The primary purpose of code points is to address encoded characters, but the codespace encodes more than that. There are seven categories of code points (a rough classification sketch follows the list below):

  • Graphic: Assigned to visible characters
  • Format: Assigned to invisible formatting characters
  • Control: Assigned to characters used in Unicode and non-Unicode protocols and standards
  • Private-use: Assigned for interpretation used outside the Unicode standard
  • Surrogate: Reserved for UCS-2 compatibility, must not be encoded
  • Noncharacter: Reserved for application internal use, not used for open interchange
  • Reserved: Not assigned yet, used in future Unicode versions
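As a rough sketch of how these categories relate to the character database, here's how you might classify a code point from Python. The basic_type helper is my own approximation: it uses the general category from unicodedata plus the fixed surrogate and noncharacter ranges from the standard, and 'Reserved' simply means unassigned in the database version Python ships with.

    import unicodedata

    def basic_type(cp: int) -> str:
        """Roughly classify a code point into one of the seven categories (sketch only)."""
        if 0xD800 <= cp <= 0xDFFF:
            return "Surrogate"
        if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
            return "Noncharacter"              # U+FDD0..U+FDEF and any ..FFFE/..FFFF
        cat = unicodedata.category(chr(cp))
        if cat == "Co":
            return "Private-use"
        if cat == "Cc":
            return "Control"
        if cat in ("Cf", "Zl", "Zp"):
            return "Format"
        if cat == "Cn":
            return "Reserved"
        return "Graphic"                        # letters, marks, numbers, punctuation, symbols, spaces

    print(basic_type(0x41))     # Graphic (LATIN CAPITAL LETTER A)
    print(basic_type(0xE000))   # Private-use
    print(basic_type(0xFFFF))   # Noncharacter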

Each code point belongs to one of these categories. I bring this system up because there are two major implications that stem from it:

The first is that it's not always possible to interpret a code point as an encoded character: It may be from a future version of Unicode, it may be private use and not known to you, or it may not even be a character at all and instead used for application specific processing.

The second is that exchanging code points must be done mindfully: Surrogate code points can not be exchanged using official Unicode encodings, noncharacters are not intended to be interchanged openly, and private use characters require an external agreement outside the standard.

Unicode also defines a sequence of one or more code points as a 'Coded character sequence', or just 'character sequence' for short. Despite this name it may include any valid code point, including noncharacters or reserved code points. It is strictly a sequence of code points.

The best way to think about code points and sequences of them is as opaque building blocks used in Unicode-aware algorithms. Much like encoded characters don't map to the human concept of character, code points don't map to the machine concept of encoded characters or anything higher level.

Encodings[edit | edit source]

Storing an arbitrary code point requires an unsigned 21-bit number. This is a problem for a few reasons:

  • Modern computers would store this in a 32-bit number
  • Storing a load of 32-bit numbers is space inefficient

Modern development environments break encoded Unicode text into sequences of one or more code units:

  • Unix strings use 8-bit code units
  • Windows strings use 16-bit code units
  • Java and JavaScript strings use 16-bit code units

The Unicode standard defines encoding forms that transform between code points and code units:

  • UTF-8 which uses 8-bit code units
  • UTF-16 which uses 16-bit code units
  • UTF-32 which uses 32-bit code units

These encoding forms encode all valid code points except surrogate code points, even UTF-32 which is otherwise a straight representation of code points as 32-bit integers.
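For a feel of the difference, here's a small sketch encoding one code point with each encoding form using Python's built-in codecs:

    # U+1F44B WAVING HAND SIGN, a code point outside the Basic Multilingual Plane.
    ch = "\U0001F44B"

    utf8 = ch.encode("utf-8")       # 4 code units of 8 bits:  f0 9f 91 8b
    utf16 = ch.encode("utf-16-be")  # 2 code units of 16 bits: d83d dc4b (a surrogate pair)
    utf32 = ch.encode("utf-32-be")  # 1 code unit of 32 bits:  0001f44b

    print(len(utf8), len(utf16) // 2, len(utf32) // 4)   # 4 2 1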

The standard then defines encoding schemes that transform between code units and bytes:

  • UTF-8 which is the same as its encoding form
  • UTF-16LE and UTF-16BE which use different byte orders
  • UTF-32LE and UTF-32BE which use different byte orders
  • UTF-16 which is either UTF-16LE or UTF-16BE with a byte order mark for detection
  • UTF-32 which is either UTF-32LE or UTF-32BE with a byte order mark for detection

The byte order mark is actually the Unicode character U+FEFF ZERO WIDTH NO-BREAK SPACE, but it is interpreted as a byte order mark for UTF-16 and UTF-32 when present at the start of encoded text. The initial U+FEFF code point is added during encoding and removed during decoding, but any other U+FEFF code points are kept.

Some software treats the byte order mark as a signature to detect which Unicode encoding text is using, if it is using Unicode at all. Software that does this may require UTF-8 text to include a byte order mark despite the encoding not needing it.
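Here's a sketch of this behaviour using Python's codecs: the 'utf-16' codec writes and consumes a byte order mark, and the 'utf-8-sig' codec handles the UTF-8 signature that some software expects:

    data = "hi".encode("utf-16")        # a byte order mark, then the code units
    print(data.decode("utf-16"))        # 'hi' - the BOM is consumed while decoding

    signed = "hi".encode("utf-8-sig")   # ef bb bf 68 69 - UTF-8 with a BOM 'signature'
    print(signed.decode("utf-8"))       # '\ufeffhi' - plain UTF-8 keeps it as a character
    print(signed.decode("utf-8-sig"))   # 'hi'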

Unicode also offers the ability to gracefully handle decoding failures. This is done by having decoders substitute invalid data with the U+FFFD REPLACEMENT CHARACTER code point. This character may also be used as a fallback when unable to display a character, or when unable to convert non-Unicode text to Unicode.
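A minimal sketch of this in Python, using the 'replace' error handler:

    bad = b"caf\xc3"  # truncated UTF-8: the final byte starts a two-byte sequence that never ends
    print(bad.decode("utf-8", errors="replace"))  # 'caf' followed by U+FFFD REPLACEMENT CHARACTER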

All of these encodings may seem overwhelming, but in practice the only two encodings used are UTF-8 and UTF-16. The reason for this split is historical:

The first edition of Unicode had a 16-bit codespace and used a fixed-width 16-bit encoding named UCS-2. The first adopters of Unicode such as Java and Windows chose to represent Unicode with UCS-2 while software that required backwards compatibility such as Unix used UTF-8 and treated Unicode as just another character set.

The second edition of Unicode increased the codespace to 21 bits and introduced UTF-32 as its fixed-width encoding. UCS-2 was succeeded by the variable-width UTF-16 encoding we have today. A portion of the codespace was reserved as 'surrogate' code points to preserve compatibility between UCS-2 and UTF-16: UCS-2 systems see a pair of these as two valid code points, while UTF-16 decodes the pair into a single code point beyond U+FFFF.
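The split between UCS-2 and UTF-16 is easiest to see in the surrogate arithmetic itself. Here's a sketch of it in Python (the helper name is mine):

    def to_surrogate_pair(cp: int) -> tuple[int, int]:
        """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
        assert 0x10000 <= cp <= 0x10FFFF
        cp -= 0x10000                       # 20 bits remain
        high = 0xD800 + (cp >> 10)          # top 10 bits -> high (lead) surrogate
        low = 0xDC00 + (cp & 0x3FF)         # bottom 10 bits -> low (trail) surrogate
        return high, low

    print([hex(u) for u in to_surrogate_pair(0x1F44B)])  # ['0xd83d', '0xdc4b']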

Lots of time is spent discussing which encoding is the better variable-width encoding and which you should use in new projects. In practice the encoding you use is likely already decided by the tools you use and cultures or APIs you interact with.

Algorithms[edit | edit source]

The Unicode standard defines a set of algorithms that interpret text according to the standard.

In general the algorithms cover the following topics:

  • Looking up information about a code point
  • Breaking text into smaller pieces
  • Changing the case of text
  • Normalizing and comparing text
  • Sorting and searching text
  • Editing and displaying text

These tend to map cleanly to most text processing done with traditional character sets, with the largest change being that most algorithms operate on character sequences rather than individual characters.

The act of associating code points with some kind of behaviour is known as interpretation. This interpretation varies depending on Unicode version, tailoring, and whether the application supports all of Unicode or just specific code points.

This can be a worrying thought as there's a distinct lack of stability: the same code may produce different output on different systems depending on their Unicode versions and setup. This is a complicated problem without an easy solution short of never exchanging Unicode text with someone else.

While Unicode does have a stability policy, it is aimed at people writing Unicode algorithms, not people using them. My advice is to treat the output of these algorithms as inherently unstable, and read your tool documentation to see what stability it guarantees.

Normalization[edit | edit source]

An identical sequence of abstract characters may be represented using multiple different encoded character sequences. This can be due to an abstract character being encoded multiple times, or being encodable using multiple other encoded characters.

An easy example is that the ohm symbol may be represented as either of the following:

  • U+2126 OHM SIGN
  • U+03A9 GREEK CAPITAL LETTER OMEGA

For a harder example, the abstract character "é" may be represented as the single character U+00E9 LATIN SMALL LETTER E WITH ACUTE.

But it can also be represented as the sequence U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT.

These encode the same abstract character in different ways: one is a precomposed character, the other is two characters, a base character plus a combining character. This makes comparing them for equality very difficult.

To solve this Unicode has a normalization algorithm that can transform a coded character sequence in such a way that it ensures all sequences of the same abstract characters are represented by the same coded character sequence. This works in a series of steps:

The first step is decomposition: Each encoded character is recursively mapped to one or more encoded character sequences that are defined to be equivalent. For the most part this uses a mapping defined in the Unicode database, but special rules are required to decompose Hangul syllables. The simple example above of LATIN SMALL LETTER E WITH ACUTE is expanded to two characters: LATIN SMALL LETTER E and COMBINING ACUTE ACCENT.
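You can inspect the raw decomposition mappings from the character database with Python's unicodedata module (this shows only a single level of the recursive mapping):

    import unicodedata

    print(unicodedata.decomposition("\u00E9"))  # '0065 0301' - e plus combining acute accent
    print(unicodedata.decomposition("\u2126"))  # '03A9' - OHM SIGN maps to GREEK CAPITAL LETTER OMEGA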

The second step is re-ordering: Multiple combining characters can be attached to a base character, and often the order is based on how the character is typed. This step re-orders the combining characters to be in a specific order. Doing this step requires an unbounded buffer which can become a security hazard depending on the application. The standard defines the "Stream-Safe Text Process" which limits this step to processing 30 combining characters but creates output that isn't normalized when dealing with uncharacteristically long inputs.

The third step is composition: This step is optional and does the reverse of decomposition as a form of compression. It looks at the new sequence and recursively matches character sequences in it to decomposition mappings. This step excludes many opportunities to compose: Various scripts have specific exclusions, and single encoded characters will not compose to other single encoded characters. As an example of composition, LATIN SMALL LETTER E and COMBINING ACUTE ACCENT is composed back to LATIN SMALL LETTER E WITH ACUTE. As an example of an exclusion, OHM SIGN will decompose to GREEK CAPITAL LETTER OMEGA but not compose back to OHM SIGN.

When describing these steps I glossed over what it means for encoded characters to be equivalent. Unicode defines two forms of equivalence: Canonical and compatibility equivalence. Both of these equivalences require that the encoded characters represent the same abstract character. Compatibility equivalence goes a step further and defines equivalence between encoded characters that have different appearances or behaviours. This usually includes formatting and other ways to write a character, but does not include other variants of the character such as different cases.

These encoded characters are all compatibly equivalent to the digit two:

  • U+00B2 SUPERSCRIPT TWO
  • U+2082 SUBSCRIPT TWO
  • U+FF12 FULLWIDTH DIGIT TWO
  • U+2461 CIRCLED DIGIT TWO

Compatibility equivalence combines with canonical equivalence during the decomposition step in the normalization algorithm. This creates two types of decomposition:

  • Canonical decomposition which uses canonical equivalence
  • Compatibility decomposition which uses both canonical and compatibility equivalence

With all that in mind Unicode defines the following normalization forms (demonstrated in the sketch below):

  • Normalization Form D (NFD) uses canonical decomposition and skips recomposition
  • Normalization Form C (NFC) uses canonical decomposition followed by recomposition
  • Normalization Form KD (NFKD) uses compatibility decomposition and skips recomposition
  • Normalization Form KC (NFKC) uses compatibility decomposition followed by recomposition
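Here's a sketch of the four forms using Python's unicodedata.normalize, including the ohm composition exclusion mentioned earlier:

    import unicodedata

    decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
    print(unicodedata.normalize("NFC", decomposed) == "\u00E9")   # True - recomposed
    print(unicodedata.normalize("NFD", "\u00E9") == decomposed)   # True - decomposed

    # Compatibility forms also fold formatting variants such as SUPERSCRIPT TWO.
    print(unicodedata.normalize("NFKD", "\u00B2"))   # '2'
    print(unicodedata.normalize("NFD", "\u00B2"))    # '²' - canonical forms leave it alone

    # Composition exclusion: OHM SIGN decomposes to GREEK CAPITAL LETTER OMEGA
    # and is not composed back.
    print(unicodedata.normalize("NFC", "\u2126"))    # 'Ω' (U+03A9)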

Normalization is stable between Unicode versions after 4.1 (released in 2005):

  • Normalized text from an older version stays normalized in the new version
  • Normalized text in a new version stays normalized in the older version if it contains only characters assigned in the older version

As a developer you will normally find normalization in code that checks for equality between abstract character sequences read from elsewhere, such as usernames in databases and filenames on filesystems. This procedure is generally unnecessary for comparing text generated or manipulated within an application unless those operations are not deterministic.
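A minimal sketch of that kind of equality check, arbitrarily picking NFC as the normalization form:

    import unicodedata

    def same_text(a: str, b: str) -> bool:
        """Compare two strings as abstract character sequences (sketch; NFC chosen arbitrarily)."""
        return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

    print("Jos\u00E9" == "Jose\u0301")            # False - different code points
    print(same_text("Jos\u00E9", "Jose\u0301"))   # True  - same abstract characters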

I also want to note that compatibility decomposition is only useful in specific text processing tasks: It does not act as a filter for malicious text that intends to look visually identical to other text that uses different abstract characters. Various security tools exist to filter these 'confusables', but these should not be used indiscriminately as they are inherently lossy algorithms.

One example where compatibility equivalence is useful is screen readers: formatted text may be read using its compatibility equivalent values during normal reading, with the actual values read out verbosely later if needed.

For full details on the algorithm check out the standard: UAX #15: Unicode Normalization Forms

Segmentation[edit | edit source]

Code points and character sequences aren't useful in most text processing. Higher level constructs that map to things humans perceive and reason about are required.

Unicode provides a text segmentation algorithm for breaking character sequences into groups of sentences, words and user-perceived characters. This can be used to implement many common algorithms such as counting user-perceived characters in text, inserting and removing text, or parsing text into separate components.

As an example, take the following text: "Hi! 👋🏼". It consists of 6 code points:

  • U+0048 LATIN CAPITAL LETTER H
  • U+0069 LATIN SMALL LETTER I
  • U+0021 EXCLAMATION MARK
  • U+0020 SPACE
  • U+1F44B WAVING HAND SIGN
  • U+1F3FC EMOJI MODIFIER FITZPATRICK TYPE-3

It breaks into the following pieces (reproduced in the sketch after this list):

  • 5 user-perceived characters: "H", "i", the exclamation mark, the space, and the waving hand with its skin tone modifier
  • 4 words: "Hi", the exclamation mark, the space, and the waving hand with its skin tone modifier
  • 2 sentences: "Hi! " (including the trailing space), and the waving hand with its skin tone modifier
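Here's a sketch reproducing the user-perceived character count, assuming the third-party regex package (the Python standard library has no grapheme cluster segmentation, and exact results depend on the package's Unicode version):

    # pip install regex - in this package \X matches an extended grapheme cluster.
    import regex

    text = "Hi! \U0001F44B\U0001F3FC"

    clusters = regex.findall(r"\X", text)
    print(clusters)       # ['H', 'i', '!', ' ', '👋🏼']
    print(len(clusters))  # 5 user-perceived characters
    print(len(text))      # 6 code points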

The default breaking algorithms do not do any kind of linguistic or locale analysis. Instead they are simple sets of rules designed to get useful results given arbitrary text.

Some use cases considered for these rules include:

  • Searching and ordering text
  • Selecting text at different granularities
  • Moving cursors through text
  • Inserting and removing text when editing
  • Counting occurrences of text elements

These are desirable goals in most computer programs, and they are tolerant of edge cases: a boundary that is slightly wrong to a human usually doesn't matter as long as it is consistently wrong. For stronger segmentation guarantees these rules can be tailored for specific applications or discarded entirely in favour of tools like natural language processing.

One type of segmentation gets a lot more attention than the others: User-perceived characters. These are segmented as 'grapheme clusters' and come in two variants: Legacy and extended. Unless you need to deal with backwards compatibility, extended grapheme clusters are the ones to use. Words and sentences are by default made up of grapheme clusters.

Grapheme clusters are the closest representation you can get to the idea of a single abstract character. Some newer programming languages even use these as the default abstraction for their strings. This turns out to work fairly well and reduces the difficulty of writing Unicode-compliant programs.

The main downside to this approach is that string operations are no longer guaranteed to be reproducible between program environments and versions. Unicode text may be split one way on one system and another way on another, or change behaviour on a system upgrade. One real-world example of this would be if you're given a giant character sequence of one base character and thousands of combining characters. One system may treat this as one grapheme cluster, another may split it up during normalization into many grapheme clusters.

This lack of stability isn't necessarily a bad thing. After all, the world changes and so must our tools. But it needs to be kept in mind for applications that expect the stability traditional strings provide. A method to serialize sequences of grapheme clusters would help here, instead of having to recompute them based on code points.

All that said, many applications don't segment text using these algorithms. The most common approach is to not segment text at all and match code point sequences, or to search and map code point sequences to characters.

This tends to work well enough for most applications, but can create some confusing situations (the first is demonstrated in the sketch after this list):

  • "Jose" can match with "José" if the accent is a separate code point
  • The flag "🇩🇪" (regional indicators DE) matches against "🇧🇩🇪🇺" (indicators BD and EU)
  • The unused regional indicator combinations AB and BC may render as a sole A indicator, "🇧🇧" (regional indicators BB) and a sole C indicator
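A minimal sketch of the first situation in Python:

    decomposed = "Jose\u0301"   # 'José' with the accent as a separate combining character

    print("Jose" in decomposed)        # True  - a plain code point search matches inside it
    print("Jos\u00E9" in decomposed)   # False - the precomposed form doesn't match at all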

For full details on the algorithm check out the standard: UAX #29: Unicode Text Segmentation

A related but separate line breaking algorithm can be found at: UAX #14: Unicode Line Breaking Algorithm

You can experiment with breaks online using the Unicode Utilities: Breaks tool.

Non-Unicode data[edit | edit source]

Although many programming languages and development tools support Unicode, we still live in a world full of non-Unicode data. This includes data in other encodings and character sets, corrupted data, or even malicious data attempting to bypass security mechanisms. This data must be handled mindfully according to an application's requirements.

There are only a few ways to deal with non-Unicode data:

  • Don't treat the data as Unicode
  • Reject the data and request Unicode
  • Do a best effort conversion to Unicode

Which action to take is heavily dependent on how important it is to preserve the original data, or how important it is to perform Unicode processing on the text. For example:

  • A filesystem may treat paths as bytes and not perform Unicode processing
  • A website may ask the user to submit a post that isn't valid Unicode
  • A file manager may track filenames as bytes but display them as best effort Unicode
  • A photo labeller may prepend Unicode dates to a non-Unicode filename

The decision on how to handle non-Unicode data is highly contextual and can range from simple error messages to complex mappings between non-Unicode and Unicode data.

It also helps to structure an application to reduce conversions: only convert when necessary, prefer converting Unicode to non-Unicode data, and avoid mixing Unicode and non-Unicode data in the same pipeline. Converting Unicode to non-Unicode data is easy, converting non-Unicode data to Unicode is complicated, and conversions that stay within one world are a non-issue. Round trips increase the pain, and the complexity of an application goes up with the number of conversions it performs, so keep separate pipelines where you can.

There are two broad strategies for placing conversions (a sketch of both follows):

  • Greedy conversion: Convert early, fail hard, and do no best-effort conversion. All data inside the application is Unicode. This is easy to understand but produces fragile applications. Prefer this when you mostly operate on Unicode data.
  • Lazy conversion: Convert only when required and allow best-effort conversion. This is very robust but requires tracking and mixing non-Unicode data. Prefer this when you mostly operate on non-Unicode data. The cost here is not from the data types but the data contents: you can guarantee a conversion won't fail if you control the data, such as when appending to a file path.
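As a rough sketch of the two strategies in Python (the helper names are hypothetical):

    # Greedy: convert at the boundary and fail hard - all data inside the program is Unicode.
    def read_label_greedy(raw: bytes) -> str:
        return raw.decode("utf-8")   # raises UnicodeDecodeError on non-Unicode data

    # Lazy: keep the original bytes, convert (best effort) only when something needs to display them.
    def display_label_lazy(raw: bytes) -> str:
        return raw.decode("utf-8", errors="replace")

    label = b"photo-\xff.jpg"           # not valid UTF-8
    print(display_label_lazy(label))    # 'photo-<U+FFFD>.jpg' - displayable, original bytes kept elsewhere
    # read_label_greedy(label)          # would raise UnicodeDecodeError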

Mixed strings[edit | edit source]

Non-Unicode data is not always represented as bytes. You can represent it that way, but many languages instead represent it as Unicode strings with non-Unicode data embedded in them. This is done so that:

  • The OS-specific encoding is abstracted away
  • The original data can be round-tripped
  • Code can ignore Unicode and treat the strings as opaque

Cross-platform APIs for things like file paths, environment variables, and command-line arguments tend to take this approach. These strings are often called 'OS strings'. Converting between these strings and plain Unicode only works if the string lacks surrogates and contains only valid code points.
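Python's version of this (PEP 383, linked below) smuggles un-decodable bytes into strings as lone surrogate code points so the original bytes round-trip. A minimal sketch, assuming a UTF-8 locale:

    import os

    raw = b"photo-\xff.jpg"            # not valid UTF-8
    mixed = os.fsdecode(raw)           # 'photo-\udcff.jpg' - the bad byte becomes a lone surrogate
    print(os.fsencode(mixed) == raw)   # True - the original bytes round-trip

    # The mixed string can't be written out with a strict codec:
    # mixed.encode("utf-8") raises UnicodeEncodeError.

Some implementations and related designs: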

https://peps.python.org/pep-0383/

https://docs.raku.org/language/unicode#UTF8-C8

https://simonsapin.github.io/wtf-8/

https://doc.rust-lang.org/std/ffi/struct.OsString.html

https://hackage.haskell.org/package/os-string

Abstraction levels[edit | edit source]

Unicode text can be handled at several levels of abstraction:

  • Bytes
  • Code units
  • Code points
  • Segmented text

Depending on the environment, a 'Unicode string' may actually hold encoded data, code units, code points, non-surrogate code points (scalar values), or mixed data.

TODO: locale information/rich text

Level 1: Bytes[edit | edit source]

Your basic unit is the byte. You can compare, search, split, and sort.

Examples: filesystems, Unix APIs, C strings.

See also: UTF-8B.

Level 2: Code units[edit | edit source]

Your basic unit is the smallest unit of your Unicode encoding: a byte for UTF-8, a 16-bit integer for UTF-16, a 32-bit integer for UTF-32. You can compare, search, split, and sort. To get to this point you have to handle endianness.

Examples: Windows, Java, and JavaScript strings.

Level 3: Unicode scalars[edit | edit source]

Your basic unit is a number between 0x0 and 0x10FFFF inclusive, with the surrogate range not allowed. To get to this point you have to decode UTF-8, UTF-16, or UTF-32. You can compare, search, split, and so on, but it's important to note that these are just numbers: there's no meaning attached to them.

Examples: Python strings.

Note that a code point at this level may still be a noncharacter or a reserved code point.
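A small sketch of what this looks like in Python, where a string is a sequence of code points with no further meaning attached:

    s = "e\u0301\U0001F44B"           # 'é' (decomposed) followed by a waving hand

    print([hex(ord(c)) for c in s])   # ['0x65', '0x301', '0x1f44b'] - just numbers
    print(len(s))                     # 3 code points, not 2 user-perceived characters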

Level 4: Unicode characters[edit | edit source]

Your basic unit is a code point that your runtime recognizes and is willing to interpret using its copy of the Unicode character database. Results vary according to the supported Unicode version. You can normalize, compare, match, search, split, and case map strings, and locale-specific operations may be provided. To get here the runtime needs to check whether the characters are supported.

???

TODO: noncharacters

Level 5: Segmented text[edit | edit source]

Your basic unit is a string of Unicode characters of some size, such as a word, a paragraph, or a grapheme cluster. To get these you need to convert from a string of Unicode characters using breaking/segmentation rules.

Examples: Swift and Raku strings.

Further reading[edit | edit source]

I highly recommend reading the following resources:

You might also find the following tools helpful:

While writing this page I researched and documented Unicode support in various programming languages. You can see my notes here: Unicode guide/Implementations.