Unicode guide: Difference between revisions

Revision as of 03:47, 1 October 2022

This is a WIP page; take nothing here as final.

If you've ever tried to learn Unicode, you've most likely looked at online tutorials and learning resources. These tend to focus on specific details of how Unicode works instead of the broader picture.

This guide is my attempt to help you build a mental model of Unicode that can be used to write functional software and navigate the official Unicode standards and resources.

As a disclaimer: I'm just a random person, some of this might be wrong. But hopefully by the end of reading this you should be able to correct me.

Standards

The Unicode standard defines the following:

A large multilingual set of characters
A database of properties for each character
How to encode and decode characters to bytes
How to normalize equivalent sequences of characters
How to map text between different cases
How to segment text into words, sentences, lines, and paragraphs
How to determine text direction

Some portions of the standard may be overridden (also known as 'tailoring') to aid in localization.

The standard is freely available online in the following pieces:

Unicode Core Specification chapters 3 (Conformance) and 4 (Character properties)
Unicode Updates and Errata
Unicode Character Code Charts
Unicode Character Database
Unicode Standard Annexes

EXTRAS outside the standard

How to order text for sorting

How to incorporate Unicode into regular expressions
Stability policies
Locale data

These are also freely available online at:

Characters

- scripts or bits of things, not glyphs

- explain how characters are selected

- code points/encoded characters

- abstract characters

- properties

- combining characters?

- control characters?

- case

- category

- data for algorithms

- script

- name

- block

- rendering

- breaking

- bidi
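As a rough illustration of the property lookups outlined above, here is a minimal sketch using Python's standard `unicodedata` module. The `describe` helper is a hypothetical name made up for this example. Note that the standard library module exposes name, category, combining class, and bidi class, but not script or block, so those are left out here.

```python
import unicodedata

def describe(ch):
    """Collect a few Unicode Character Database properties for one code point.

    `describe` is an illustrative helper, not part of any library.
    """
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),          # e.g. Lu, Ll, Mn, Nd
        "combining_class": unicodedata.combining(ch),  # 0 for base characters
        "bidi_class": unicodedata.bidirectional(ch),   # e.g. L, R, NSM
        "is_upper": ch.isupper(),
        "is_lower": ch.islower(),
    }

latin_a = describe("A")
acute = describe("\u0301")   # COMBINING ACUTE ACCENT
alef = describe("\u05d0")    # HEBREW LETTER ALEF
```

The results depend on the Unicode version bundled with the Python runtime, so newly added characters may show up as unnamed on older installations.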

Strings

- levels of abstraction

- indexing

- sort

- match

- search

- normalize

- serialize

- case map

- properties

- breaking/segmentation

- reversing

TODO:

languages/locales

Non-Unicode compatibility

- preserving data

Level 1: Bytes

Level 1: bytes. Your basic unit is the byte. You can compare, search, split, and sort.

filesystem/unix/C
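A small sketch of what working at the byte level looks like, written in Python for convenience. The file path and names are made up for the example. The point is that every operation works on raw byte values and knows nothing about characters.

```python
# Level 1 sketch: treat text as opaque bytes, with no knowledge of characters.
a = "\u00e9".encode("utf-8")     # precomposed U+00E9
b = "e\u0301".encode("utf-8")    # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# Byte-wise comparison sees two different sequences, although both display as "é".
same_bytes = a == b

# Byte-wise search and split work as long as the needle is also bytes.
path = b"/home/user/caf\xc3\xa9.txt"   # hypothetical path, UTF-8 encoded
parts = path.split(b"/")
found = path.find(b"caf")

# Byte-wise sort orders by raw byte value, not by any language's alphabet.
names = sorted([b"zebra", "\u00e9cole".encode("utf-8"), b"apple"])
```

Here the name starting with "é" sorts after "zebra" because its first byte (0xC3) is larger than any ASCII byte.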

Level 2: Code units

Level 2: code units. Your basic unit is the smallest unit of your Unicode encoding: a byte for UTF-8, a 16-bit integer for UTF-16, a 32-bit integer for UTF-32. You can compare, search, split, and sort. To get to this point you have to handle endianness (for UTF-16 and UTF-32).

windows
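To make the idea of code units concrete, the following sketch splits encoded bytes into code units of the right width and byte order. The `code_units` helper is a hypothetical name for this example, and it only handles the encodings listed in its table.

```python
def code_units(text, encoding):
    """Return the code units of `text` as integers.

    Illustrative helper: `encoding` must name an explicit byte order
    (or be UTF-8), so that no byte order mark is emitted.
    """
    widths = {"utf-8": 1, "utf-16-le": 2, "utf-16-be": 2,
              "utf-32-le": 4, "utf-32-be": 4}
    width = widths[encoding]
    byteorder = "big" if encoding.endswith("be") else "little"
    data = text.encode(encoding)
    return [int.from_bytes(data[i:i + width], byteorder)
            for i in range(0, len(data), width)]

text = "a\U0001F600"  # U+0061 followed by U+1F600 (an emoji)

utf8 = code_units(text, "utf-8")        # five 8-bit units
utf16 = code_units(text, "utf-16-le")   # three 16-bit units (one surrogate pair)
utf32 = code_units(text, "utf-32-le")   # two 32-bit units
```

Once endianness is handled, the little-endian and big-endian forms yield the same code units; only the byte layout differs.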

Level 3: Unicode scalars

Level 3: Unicode scalars. Your basic unit is a number between 0x0 and 0x10FFFF inclusive, excluding the surrogate range 0xD800 to 0xDFFF. To get to this point you have to decode UTF-8, UTF-16, or UTF-32. You can compare, search, split, etc., but it's important to note that these are just numbers. There's no meaning attached to them.

python
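A minimal sketch of the scalar level in Python, whose `str` type is indexed by code point. Unicode scalar values run from 0x0 to 0x10FFFF, excluding the surrogate range 0xD800 to 0xDFFF. The `is_scalar_value` helper is a hypothetical name for this example.

```python
def is_scalar_value(n):
    """True if n is a Unicode scalar value: a code point that is not a surrogate."""
    return 0 <= n <= 0x10FFFF and not (0xD800 <= n <= 0xDFFF)

text = "a\U0001F600"  # U+0061 followed by U+1F600

# Decoding is already done: each element of a Python str is one code point.
scalars = [ord(ch) for ch in text]
length = len(text)

# At this level these are just numbers, so comparison is numeric.
ordered = sorted(scalars)
```

Python strings can technically hold lone surrogates, which is why a check like `is_scalar_value` can still be useful on data of unknown origin.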

Level 4: Unicode characters

Level 4: Unicode characters. Your basic unit is a code point that your runtime recognizes and is willing to interpret using a copy of the Unicode database. Results vary according to the supported Unicode version. You can normalize, compare, match, search, split, and case map strings. Locale-specific operations may be provided. To get to this point the runtime needs to check whether the characters are supported.

???
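As one possible illustration of this level, Python's standard `unicodedata` module ships a copy of the Unicode database and can normalize strings, while `str` methods handle locale-independent case mapping. The `is_assigned` helper is a hypothetical name for this example and uses the general category "Cn" as a stand-in for "not supported by this runtime's Unicode version".

```python
import unicodedata

composed = "\u00e9"       # é as a single code point
decomposed = "e\u0301"    # e followed by a combining acute accent

# Raw code point comparison differs; comparison after normalization matches.
raw_equal = composed == decomposed
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))
nfd_length = len(unicodedata.normalize("NFD", composed))

# Case mapping can change string length: ß uppercases to SS.
upper = "stra\u00dfe".upper()
folded_equal = "Stra\u00dfe".casefold() == "STRASSE".casefold()

# The Unicode version this runtime's database was built from.
db_version = unicodedata.unidata_version

def is_assigned(ch):
    """Illustrative check: category Cn means no character is assigned here."""
    return unicodedata.category(ch) != "Cn"
```

Because `db_version` differs between Python releases, the same program can give different answers for recently added characters.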

Level 5: Segmented text

Level 5: segmented text. Your basic unit is a string of Unicode characters of some length, such as a grapheme cluster, word, or paragraph. To get to this point you need to split a string of Unicode characters using breaking/segmentation rules.

swift/raku
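Python's standard library has no grapheme cluster segmentation, so the sketch below uses a deliberately simplified rule: attach every combining mark (general category M*) to the preceding code point. This is an assumption made for illustration and is not the full segmentation algorithm from the Unicode annexes; it ignores emoji sequences, Hangul, and more. The `rough_clusters` helper is a hypothetical name.

```python
import unicodedata

def rough_clusters(text):
    """Group each code point with the combining marks that follow it.

    Simplified on purpose: real grapheme cluster rules cover many more cases.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch      # attach the mark to the previous cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

text = "cafe\u0301 noe\u0308l"      # both accents are combining marks

clusters = rough_clusters(text)
words = text.split()                # whitespace split, not real word breaking

# Reversing by code point detaches accents; reversing by cluster keeps them.
reversed_naive = text[::-1]
reversed_clusters = "".join(reversed(clusters))
```

The difference between the two reversals shows why the segmented level matters: `reversed_naive` moves each accent onto the wrong letter, while `reversed_clusters` keeps every accent with its base.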

@@ Line 37: / Line 37: @@
 *[https://www.unicode.org/policies/ Unicode Consortium Policies]
-== What are characters? ==
+== Characters ==
-TODO: rewrite
+- scripts or bits of things, not glyphs
-When writing code that deals with text often you need to know information about specific abstract characters within the text.
+- explain how characters are selected
-Most programming languages support querying the following character information:*The abstract character's category, such as: Letter, number, symbol, punctuation, whitespace
+- code points/encoded characters
-*The abstract character's case status, such as: Uppercase, lowercase
-The Unicode character database maps various properties to abstract characters. Some are:
+- abstract characters
-*The abstract character's name
-*The abstract character's code point
+- properties
-*The abstract character's script
-*The abstract character's Unicode block
+- combining characters?
-*The abstract character's category, such as: Letter, number, symbol, punctuation, whitespace
-* The abstract character's case status, such as: Uppercase, lowercase, no case
+- control characters?
-*The abstract character's combining status, such as: Modifier or base
-*The numeric value represented by the abstract character
+- case
-The database also maps more general information as properties, such as:
-*Information about rendering the abstract character
+- category
-*Information used for breaking and segmentation
-*Information used for normalization
+- data for algorithms
-*Information used for bidirectional control and display
-Languages that don't provide access to these properties make it impossible to write Unicode aware code. Even worse, the language might provide classical functions that only work for Latin scripts.
+- script
-== What are strings? ==
+- name
+- block
+- rendering
+- breaking
+- bidi
+== Strings ==
 - levels of abstraction