Unicode guide: Difference between revisions
(→Characters: More notes about code points)
(→Characters: Drop the idea of mentioning character details in this section)
Revision as of 09:22, 2 October 2022
This is a WIP page, take nothing here as final.
If you've ever tried to learn Unicode, you've most likely looked at online tutorials and learning resources. These tend to focus on specific details about how Unicode works instead of the broader picture.
This guide is my attempt to help you build a mental model of Unicode that can be used to write functional software and navigate the official Unicode standards and resources.
As a disclaimer: I'm just a random person, and some of this might be wrong. But hopefully by the end of reading this you should be able to correct me.
Standards
The Unicode standard defines the following:
- A large multilingual set of encoded characters
- How to encode and decode text
- How to normalize equivalent text sequences
- How to map text between different cases
- How to segment text into words, sentences, lines, and paragraphs
- How to determine text direction
Some portions of the standard may be overridden (also known as 'tailoring') to aid in localization.
The standard is freely available online in the following pieces:
- Unicode Core Specification chapters 3 (Conformance) and 4 (Character Properties)
- Unicode Updates and Errata
- Unicode Character Code Charts
- Unicode Character Database
- Unicode Standard Annexes
The Unicode Consortium also defines these in separate standards:
- How to order text for sorting
- How to incorporate Unicode into regular expressions
- How to handle emoji sequences
- How to handle confusable encoded characters and other security concerns
- A repository of shared localization data
These are also freely available online at:
Policies for stability in these standards can be found at the Unicode Consortium Policies page.
Encoded characters
The term "character" is tossed around throughout Unicode discourse to mean a bunch of different things:
- Abstract characters
- A code point that is assigned to an abstract character
- A code point
- A Unicode scalar value
- The smallest component of written language
- The basic unit of character encoding
- A glyph
- User-perceived character
encoded character
The Unicode standard explains that Unicode characters (or 'abstract characters') are the smallest meaningful components of a language script.
These are the atomic units of Unicode. They represent
They consist of:
- code point
- chart image
- properties used for Unicode algorithms like encoding:
- case
- category
- script
- name
- block
- rendering
- breaking
- bidi
- normalization
- low level primitive
- characters alone aren't very useful
- requires context
- almost all algorithms work on groups of characters, or 'text'
- need to be interpreted according to properties
- code points?
- properties
- combining characters?
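To make the property list above a bit more concrete, here is a minimal sketch in Python that looks up a few properties for a single code point. It assumes the standard library `unicodedata` module is an acceptable stand-in for the Unicode Character Database. The `describe` helper is a name I made up for illustration. Note that `unicodedata` does not expose every property listed above (script and block are missing, for example).

```python
import unicodedata

def describe(ch):
    """Collect a few Unicode Character Database properties for one code point."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),          # general category
        "bidi_class": unicodedata.bidirectional(ch),   # used by the bidi algorithm
        "combining_class": unicodedata.combining(ch),  # used by normalization
    }

print(describe("A"))
print(describe("\u0301"))  # COMBINING ACUTE ACCENT
```

The second call shows why a character alone isn't very useful: a combining mark only makes sense next to the character it attaches to.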
Strings
- levels of abstraction
- indexing
- sort
- match
- search
- normalize
- serialize
- case map
- properties
- breaking/segmentation
- reversing
TODO:
languages/locales
Non-Unicode compatibility
- preserving data
Level 1: Bytes
Your basic unit is the byte. You can compare, search, split, and sort.
filesystem/unix/C
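As a rough illustration of working at this level, here is a small Python sketch that treats text purely as bytes. The sample strings and variable names are my own. The UTF-8 encoding on the first line is only there to produce some bytes to play with; nothing after it assumes an encoding.

```python
# Treat text as opaque bytes: no decoding happens in the operations below.
cafe_accented = "caf\u00e9".encode("utf-8")      # b'caf\xc3\xa9'
cafe_plain = b"cafe"

byte_length = len(cafe_accented)                 # counts bytes, not letters
accent_offset = cafe_accented.find(b"\xc3\xa9")  # byte offset of a 2-byte sequence
ordered = sorted([cafe_accented, cafe_plain])    # ordered by raw byte values
parts = b"usr/local/bin".split(b"/")             # split on a byte separator

# Cutting at an arbitrary byte offset can land inside a multi-byte sequence.
truncated = cafe_accented[:4]
try:
    truncated.decode("utf-8")
    truncated_is_valid = True
except UnicodeDecodeError:
    truncated_is_valid = False

print(byte_length, accent_offset, ordered, parts, truncated_is_valid)
```

Everything works, but the results only describe bytes. The length is 5 for a 4-letter word, and slicing can produce data that is no longer valid in the original encoding.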
Level 2: Code units
Your basic unit is the smallest unit of your Unicode encoding: a byte for UTF-8, a 16-bit integer for UTF-16, a 32-bit integer for UTF-32. You can compare, search, split, and sort. To get to this point you have to handle endianness (for UTF-16 and UTF-32).
windows
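Here is a minimal Python sketch of what code units look like for the same text in UTF-8 and UTF-16. The helper names `utf8_code_units` and `utf16_code_units` are hypothetical, and I'm assuming the byte order is known up front rather than detected from a byte order mark.

```python
def utf8_code_units(text):
    """Return the UTF-8 code units (8-bit values) of text."""
    return list(text.encode("utf-8"))

def utf16_code_units(text, byteorder="little"):
    """Return the UTF-16 code units (16-bit values) of text.

    byteorder is "little" or "big". It changes how the bytes are laid out,
    but the resulting code unit values are the same.
    """
    codec = "utf-16-le" if byteorder == "little" else "utf-16-be"
    raw = text.encode(codec)
    return [int.from_bytes(raw[i:i + 2], byteorder) for i in range(0, len(raw), 2)]

sample = "a\U0001F600"  # 'a' followed by U+1F600 GRINNING FACE
print([hex(u) for u in utf8_code_units(sample)])
print([hex(u) for u in utf16_code_units(sample)])
```

The same two-code-point string is five code units in UTF-8 and three in UTF-16, so any index or length you compute at this level depends on the encoding.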
Level 3: Unicode scalars
Your basic unit is a number between 0x0 and 0x10FFFF inclusive, excluding the surrogate range 0xD800 to 0xDFFF. To get to this point you have to decode UTF-8, UTF-16, or UTF-32. You can compare, search, split, etc., but it's important to note that these are just numbers. There's no meaning attached to them.
python
TODO: Code points can be noncharacters or reserved characters, uh-oh
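A short Python sketch of this level: decode bytes into plain numbers and check whether a number is a Unicode scalar value (0x0 to 0x10FFFF, minus the surrogates 0xD800 to 0xDFFF). `is_scalar_value` and `decode_to_scalars` are names I picked for illustration, not standard functions.

```python
def is_scalar_value(n):
    """True if n is a Unicode scalar value: a code point that is not a surrogate."""
    return 0 <= n <= 0x10FFFF and not (0xD800 <= n <= 0xDFFF)

def decode_to_scalars(data, encoding="utf-8"):
    """Decode bytes and return the resulting code points as plain integers."""
    return [ord(ch) for ch in data.decode(encoding)]

scalars = decode_to_scalars(b"a\xf0\x9f\x98\x80")
print([f"U+{n:04X}" for n in scalars])
print(is_scalar_value(0xD800), is_scalar_value(0xFFFF))
```

One caveat I'm aware of: a Python `str` can hold a lone surrogate such as `"\ud800"`, so strictly speaking it is a sequence of code points, and the check only bites when you encode. Also note that 0xFFFF passes the check even though it is a noncharacter. At this level it really is just a number.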
Level 4: Unicode characters
Your basic unit is a code point that your runtime recognizes and is willing to interpret using a copy of the Unicode Character Database. Results vary according to the supported Unicode version. You can normalize, compare, match, search, split, and case map strings. Locale-specific operations may be provided. To get to this point, the runtime needs to check whether the characters are supported.
???
TODO: noncharacters
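Here is a sketch of what this level might look like in Python, using the standard library `unicodedata` module as the runtime's copy of the Unicode database. The three helper functions are hypothetical names of mine. The caseless comparison is a simplification and not a full implementation of the caseless matching defined by the standard, and treating general category `Cn` as "not supported" is my own assumption about how to do the support check.

```python
import unicodedata

def is_assigned(ch):
    """True unless the database reports the code point as unassigned (category Cn)."""
    return unicodedata.category(ch) != "Cn"

def equivalent(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

def caseless_equal(a, b):
    """Simplified caseless comparison: case fold, then normalize to NFD."""
    fold = lambda s: unicodedata.normalize("NFD", s.casefold())
    return fold(a) == fold(b)

print(unicodedata.unidata_version)             # Unicode version this runtime knows
print(equivalent("\u00e9", "e\u0301"))         # precomposed vs. base + combining mark
print(caseless_equal("Stra\u00dfe", "STRASSE"))
print(is_assigned("A"), is_assigned("\uffff"))
```

The first printed line is the important one: every other answer here depends on which version of the database the runtime ships with.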
Level 5: Segmented text
Your basic unit is a string of Unicode characters of some length, such as a word, paragraph, or grapheme cluster. To get to this point you need to split a string of Unicode characters using breaking/segmentation rules.
swift/raku
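Python's standard library has no segmentation support as far as I know, so the sketch below is a deliberately naive stand-in: it glues every mark (general category starting with `M`) onto the character before it. This is not the grapheme cluster algorithm from the Unicode Standard Annexes, and it will get emoji sequences, Hangul, and many other cases wrong. It's only meant to show how the basic unit changes. `naive_clusters` is a made-up name.

```python
import unicodedata

def naive_clusters(text):
    """Group each base character with the marks that follow it.

    A rough approximation only: real segmentation rules cover far more cases.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch       # attach the mark to the previous cluster
        else:
            clusters.append(ch)      # start a new cluster
    return clusters

word = "cafe\u0301s"                 # 'e' followed by COMBINING ACUTE ACCENT
clusters = naive_clusters(word)
print(len(word), len(clusters))
print("".join(reversed(clusters)) == "se\u0301fac")
```

Reversing the clusters keeps the accent on the "e", whereas reversing the code points with `word[::-1]` would move it onto the "s".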
Further reading
I highly recommend reading the following resources:
You might also find the following tools helpful:
While writing this page I researched and documented Unicode support in various programming languages. You can see my notes here: Unicode guide/Implementations.