Unicode guide

From JookWiki

This is a WIP page, take nothing here as final.

If you've ever tried to learn Unicode you've most likely looked at online tutorials and learning resources. These tend to focus on specific details of how Unicode works instead of the broader picture.

This guide is my attempt to help you build a mental model of Unicode that can be used to write functional software and navigate the official Unicode standards and resources.

As a disclaimer: I'm just a random person, some of this might be wrong. But hopefully by the end of reading this you should be able to correct me.

Standards[edit | edit source]

The Unicode standard defines the following:

  • A large numeric codespace
  • A large multilingual database of characters
  • A database of character properties
  • How to encode and decode the codespace
  • How to normalize equivalent text
  • How to map text between different cases
  • How to segment text into words, sentences, lines, and paragraphs
  • How to determine text direction

Some portions of the standard may be overridden (also known as 'tailoring') to aid in localization.

The standard is freely available online in the following pieces:

The Unicode Consortium also defines these in separate standards:

  • How to order text for sorting
  • How to incorporate Unicode into regular expressions
  • How to handle emoji sequences
  • How to handle confusable characters and other security concerns
  • A repository of shared localization data

These are also freely available online at:

Policies for stability in these standards can be found at the Unicode Consortium Policies page.

Characters[edit | edit source]

Unicode provides two distinct definitions of the term 'character': abstract characters and encoded characters.

Abstract characters are the units that make up textual data on a computer. These are usually some portion of a written script that has a unique identity independent of Unicode, such as a letter, symbol, accent, logogram, or spacing, but they may be something else entirely. The best way to think of these is as the atoms used for text editing, display, organization, and storage.

Encoded characters are mappings of an abstract character to the Unicode codespace as a code point. This is almost always what people mean by 'character' in Unicode discussion. There's not a one-to-one mapping between abstract and encoded characters: Abstract characters might be mapped multiple times to aid in compatibility with other character sets, they might not be mapped at all and instead represented using a sequence of other encoded characters, or they might not be representable at all and require addition in future Unicode versions.
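For example, here's a small Python sketch (using the standard unicodedata module) of how one abstract character can have several representations: encoded directly, as a compatibility duplicate, or as a sequence of other encoded characters:

```python
import unicodedata

# Three ways to write the abstract character "Å":
precomposed = "\u00C5"   # LATIN CAPITAL LETTER A WITH RING ABOVE
compat_dup  = "\u212B"   # ANGSTROM SIGN, a duplicate kept for compatibility
sequence    = "A\u030A"  # "A" followed by COMBINING RING ABOVE

# As raw code point sequences they compare unequal:
print(precomposed == compat_dup)  # False
print(precomposed == sequence)    # False

# Normalization (NFC) maps all three to a single canonical form:
print(unicodedata.normalize("NFC", sequence) == precomposed)    # True
print(unicodedata.normalize("NFC", compat_dup) == precomposed)  # True
```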

In addition to having a code point each character has a set of properties that provide information about the character to aid in writing Unicode algorithms. These include things like name, case, category, script, direction, numeric value, and rendering information.
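As a taste of what's in the database, Python's standard unicodedata module exposes a few of these properties:

```python
import unicodedata

ch = "\u0663"  # ARABIC-INDIC DIGIT THREE
print(unicodedata.name(ch))           # ARABIC-INDIC DIGIT THREE
print(unicodedata.category(ch))       # Nd (Number, decimal digit)
print(unicodedata.numeric(ch))        # 3.0
print(unicodedata.bidirectional(ch))  # AN (Arabic Number)
```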

The last point I want to make is a warning: Characters do not correspond to any human-identifiable unit of text such as a glyph, letter, phoneme, syllable, vowel, or consonant. They are only useful for building higher-level abstractions. General text processing should be done with groups of characters and Unicode-aware algorithms.
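A quick Python illustration of why individual code points are the wrong unit for text processing:

```python
# "é" written as a combining sequence: one user-perceived letter,
# two encoded characters.
s = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
print(len(s))    # 2 -- len() counts code points, not letters
print(s[0])      # "e" -- indexing splits the accent off
print(s[::-1])   # reversing moves the accent onto the wrong character
```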

Some examples of characters are:

I've linked to the Unicode Utilities page for each character so you can see the character properties.

Character sequences[edit | edit source]

Coded Character Sequence. An ordered sequence of one or more code points. Normally, this consists of a sequence of encoded characters, but it may also include noncharacters or reserved code points. (See definition D12 in Section 3.4, Characters and Encoding.)

  • Groups of characters
  • Levels of abstraction
  • Indexing/length
  • Sorting
  • Matching
  • Searching
  • Normalization
  • Serialization
  • Case mapping
  • Breaking/segmentation
  • Reversing
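As a taste of a couple of these operations, here's a simplified caseless-matching sketch in Python that combines normalization and case folding. This is a reduced version of the Unicode caseless-matching recipe, not the full algorithm:

```python
import unicodedata

def caseless_equal(a: str, b: str) -> bool:
    """Simplified caseless comparison: normalize, case-fold, normalize again."""
    nfd = lambda s: unicodedata.normalize("NFD", s)
    return nfd(nfd(a).casefold()) == nfd(nfd(b).casefold())

print(caseless_equal("STRASSE", "straße"))  # True -- "ß" case-folds to "ss"
print(caseless_equal("É", "e\u0301"))       # True -- case + normalization
```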

TODO:

  • Languages/locales
  • Non-Unicode compatibility: preserving data

🇪🇳🇮🇸 -> 🇪🇳 🇮🇸 , fonts will cheaply display as 🇪 🇳🇮 🇸

For example, two individual letters are often two separate graphemes. When two letters form a ligature, however, they combine into a single glyph. They are then part of the same cluster and are treated as a unit by the shaping engine โ€” even though the two original, underlying letters remain separate graphemes.

Strings[edit | edit source]

  • Bytes
  • Code units
  • Code points
  • Unicode scalars
  • Private use characters
  • Reserved characters

Level 1: Bytes[edit | edit source]

Your basic unit is the byte. You can compare, search, split, and sort byte strings, but they carry no character-level meaning.

Examples: filesystems, Unix, C.
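A Python sketch of working at the byte level, showing how byte operations can cut characters apart:

```python
# The same text is just different bytes in different encodings, and
# byte operations know nothing about characters.
s = "naïve"
utf8 = s.encode("utf-8")
print(utf8)       # b'na\xc3\xafve' -- "ï" takes two bytes
print(len(utf8))  # 6 bytes for 5 characters
print(utf8[:3])   # b'na\xc3' -- slicing can cut a character in half
print(utf8[:3].decode("utf-8", errors="replace"))  # 'na\ufffd'
```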

Level 2: Code units[edit | edit source]

Your basic unit is the smallest unit of your Unicode encoding: a byte for UTF-8, a 16-bit integer for UTF-16, a 32-bit integer for UTF-32. You can compare, search, split, and sort. To get to this point you have to handle endianness.

Example: Windows.
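A Python sketch of code units and endianness, using a character outside the Basic Multilingual Plane:

```python
s = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, U+1D11E
# UTF-16 represents it with two 16-bit code units (a surrogate pair),
# and the byte layout depends on endianness:
print(s.encode("utf-16-be").hex(" "))  # d8 34 dd 1e
print(s.encode("utf-16-le").hex(" "))  # 34 d8 1e dd
# Plain "utf-16" prepends a byte order mark (BOM) so decoders can tell
# which layout they're looking at (native order, so output varies):
print(s.encode("utf-16").hex(" "))
```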

Level 3: Unicode scalars[edit | edit source]

Your basic unit is a number between 0x0 and 0x10FFFF inclusive, excluding the surrogate range (0xD800 to 0xDFFF). To get to this point you have to decode UTF-8, UTF-16, or UTF-32. You can compare, search, split, and so on, but it's important to note that these are just numbers: there's no meaning attached to them.

Example: Python.
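In Python, for example, ord() and chr() convert between scalars and one-character strings, and surrogate code points are rejected when encoding:

```python
# A Unicode scalar is just a number.
print(ord("€"))        # 8364 (U+20AC)
print(chr(0x1F600))    # 😀

# Surrogate code points (U+D800..U+DFFF) are not scalar values and
# cannot be encoded:
try:
    "\ud800".encode("utf-8")
except UnicodeEncodeError as e:
    print("not a scalar value:", e.reason)
```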

TODO: Code points can be noncharacters or reserved characters, uh-oh

Level 4: Unicode characters[edit | edit source]

Your basic unit is a code point that your runtime recognizes and is willing to interpret using its copy of the Unicode character database. Results vary according to the supported Unicode version. You can normalize, compare, match, search, split, and case map strings. Locale-specific operations may be provided. To get these the runtime needs to check that the characters are supported.

???
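Python's unicodedata module illustrates the version dependency. The exact version string, and whether a given character is assigned, depend on your Python build:

```python
import unicodedata

# The runtime carries its own copy of the Unicode character database:
print(unicodedata.unidata_version)  # e.g. "15.0.0" on Python 3.12

# Characters added in newer Unicode versions are unknown to older
# databases; name() raises ValueError for unassigned code points.
try:
    print(unicodedata.name("\U0001FAE8"))  # SHAKING FACE, added in 15.0
except ValueError:
    print("unassigned in this runtime's database")
```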

TODO: noncharacters

Level 5: Segmented text[edit | edit source]

Your basic unit is a string of Unicode characters of some kind, such as a word, paragraph, or grapheme cluster. To get these you need to apply breaking/segmentation rules to a string of Unicode characters.

Examples: Swift, Raku.
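To illustrate, here's a toy grapheme cluster segmenter in Python. This is NOT the full UAX #29 algorithm (real software should use ICU or another library that implements it); it sketches just two of the rules:

```python
import unicodedata

def toy_graphemes(text):
    """Toy grapheme segmentation: combining marks attach to the previous
    character, and regional indicators pair up into flags. The real
    UAX #29 algorithm has many more rules."""
    clusters = []
    i = 0
    while i < len(text):
        cluster = text[i]
        i += 1
        # Pair regional indicators (flag emoji) two at a time.
        if (0x1F1E6 <= ord(cluster) <= 0x1F1FF and i < len(text)
                and 0x1F1E6 <= ord(text[i]) <= 0x1F1FF):
            cluster += text[i]
            i += 1
        # Attach any following combining marks.
        while i < len(text) and unicodedata.category(text[i]) in ("Mn", "Mc", "Me"):
            cluster += text[i]
            i += 1
        clusters.append(cluster)
    return clusters

print(toy_graphemes("e\u0301a"))              # ['é', 'a'] (decomposed é)
print(toy_graphemes("\U0001F1EE\U0001F1F8"))  # one cluster: the 🇮🇸 flag
```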

Stability[edit | edit source]

TODO, explain

Things change between Unicode versions; there are detailed stability policies, but they don't cover everything.

Invalid UTF-8, etc.

Further reading[edit | edit source]

I highly recommend reading the following resources:

You might also find the following tools helpful:

While writing this page I researched and documented Unicode support in various programming languages. You can see my notes here: Unicode guide/Implementations.