Unicode guide
Revision as of 02:22, 1 October 2022
This is a WIP page, take nothing here as final.
If you've ever tried to learn Unicode you've most likely looked at online tutorials and learning resources. These tend to focus on specific details about how Unicode works instead of the broader picture.
This guide is my attempt to help you build a mental model of Unicode that can be used to write functional software and navigate the official Unicode standards and resources.
As a disclaimer: I'm just a random person, so some of this might be wrong. But hopefully by the end of reading this you should be able to correct me.
Standards
The Unicode standard defines the following:
- A large multilingual set of abstract characters (known just as 'characters')
- A database of properties for each character
- Various algorithms for working with these characters
- Stability policies for the character set, properties and algorithms
- How to encode characters for storage
- How to normalize characters into a canonical format
- How to segment text into words, sentences, lines, and paragraphs
- How to map text between different cases
- How to match text for searching
Many of these can be further tailored by locale-dependent rules or application code.
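To make a few of these algorithms concrete, here is a minimal sketch using Python's standard unicodedata module and built-in string methods. The sample strings are my own picks for illustration, and this only shows the default, untailored behaviour, not any locale-specific rules.

```python
import unicodedata

composed = "\u00e9"      # "é" as one precomposed code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# The two strings look the same on screen but are different sequences.
same_raw = composed == decomposed

# Normalizing both sides to the same form (NFC here) makes them equal.
same_nfc = (unicodedata.normalize("NFC", composed)
            == unicodedata.normalize("NFC", decomposed))

# Case mapping can change the length of a string.
upper = "stra\u00dfe".upper()   # "straße"

# Case folding is meant for caseless matching rather than display.
folded_match = "STRASSE".casefold() == "stra\u00dfe".casefold()

print(same_raw, same_nfc, upper, folded_match)
```

None of this is locale-aware. Tailored behaviour, such as Turkish dotted and dotless i, needs extra data or a library on top.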
TODO:
These standards are freely available online.
- The Core Specification https://www.unicode.org/versions/latest/
- Unicode Standard Annexes https://www.unicode.org/reports/index.html#annexes
- How to order text for sorting
- How to incorporate Unicode into regular expressions
Technical Standards https://www.unicode.org/reports/index.html#standards
CLDR https://cldr.unicode.org/index
https://www.unicode.org/policies/
What are characters?
TODO
What are strings?
- levels of abstraction
- indexing
- sort
- match
- search
- normalize
- serialize
- case map
- properties
- breaking/segmentation
- reversing
Level 1: Bytes
Your basic unit is the byte. You can compare, search, split, and sort.
filesystem/unix/C
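A rough sketch of what working at this level looks like, using Python's bytes type as a stand-in for what C or a filesystem would see. The sample text is an assumption of mine, not something the standard prescribes.

```python
# Encode once so we have raw bytes; after this nothing knows about Unicode.
data = "h\u00e9llo w\u00f6rld".encode("utf-8")   # "héllo wörld"

# Length counts bytes, not characters ("é" and "ö" take two bytes each).
byte_length = len(data)

# Searching and splitting operate on byte values only.
position = data.find(b"w")
parts = data.split(b" ")

# Sorting is by byte value, so non-ASCII text sorts after all ASCII text.
ordered = sorted([b"b", "\u00e9".encode("utf-8"), b"a"])

print(byte_length, position, parts, ordered)
```

All of these operations work, but they can cut through the middle of a multi-byte sequence if you pick an arbitrary byte offset.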
Level 2: Code units
Your basic unit is the smallest unit of your Unicode encoding: a byte for UTF-8, a 16-bit integer for UTF-16, or a 32-bit integer for UTF-32. You can compare, search, split, and sort. To get to this point you have to handle endianness (for UTF-16 and UTF-32).
windows
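Here is a small sketch of turning bytes into code units for each encoding. The helper name code_units is hypothetical (it is not a standard library function), and the sample string is my own choice.

```python
def code_units(data, width, byteorder):
    """Split raw bytes into integers of `width` bytes each."""
    return [int.from_bytes(data[i:i + width], byteorder)
            for i in range(0, len(data), width)]

text = "a\U0001F600"   # "a" followed by U+1F600, which is outside the BMP

# UTF-8: one byte per code unit, so byte order does not matter.
utf8_units = code_units(text.encode("utf-8"), 1, "big")

# UTF-16: two bytes per code unit; U+1F600 becomes a surrogate pair.
utf16_units = code_units(text.encode("utf-16-le"), 2, "little")

# Reading little-endian data as big-endian silently gives wrong units.
utf16_wrong = code_units(text.encode("utf-16-le"), 2, "big")

# UTF-32: four bytes per code unit, one unit per code point.
utf32_units = code_units(text.encode("utf-32-be"), 4, "big")

print([hex(u) for u in utf16_units])
```

The same two-character string is 5 code units in UTF-8, 3 in UTF-16, and 2 in UTF-32, which is why lengths and indexes at this level depend on the encoding.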
Level 3: Unicode scalars
Your basic unit is a number between 0x0 and 0x10FFFF inclusive, excluding the surrogate range 0xD800 to 0xDFFF. To get to this point you have to decode UTF-8, UTF-16, or UTF-32. You can compare, search, split, and so on, but it's important to note that these are just numbers. There's no meaning attached to them.
python
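A minimal sketch of this level, assuming the input is UTF-8. The function names is_scalar_value and decode_scalars are hypothetical helpers I made up for illustration.

```python
def is_scalar_value(n):
    """True if n is in 0x0..0x10FFFF and not a surrogate (0xD800..0xDFFF)."""
    return 0 <= n <= 0x10FFFF and not (0xD800 <= n <= 0xDFFF)

def decode_scalars(data):
    """Decode UTF-8 bytes into a list of scalar values (plain integers)."""
    # Strict decoding raises UnicodeDecodeError on malformed input,
    # including UTF-8-style encodings of surrogates.
    return [ord(ch) for ch in data.decode("utf-8")]

scalars = decode_scalars(b"caf\xc3\xa9")   # the bytes for "café"

# Comparing and searching still work, but only on numbers.
has_e_acute = 0xE9 in scalars

print([hex(n) for n in scalars], has_e_acute)
```

Note that is_scalar_value says nothing about whether a number has been assigned to a character. That question belongs to the next level.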
Level 4: Unicode characters
Your basic unit is a code point that your runtime recognizes and is willing to interpret using a copy of the Unicode database. Results vary according to the supported Unicode version. You can normalize, compare, match, search, split, and case map strings. Locale-specific operations may be provided. To get to this point the runtime needs to check whether the characters are supported.
???
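As a sketch of what "recognized by the runtime" can mean, Python's unicodedata module exposes the version of the Unicode database it ships with, plus per-character properties. The helpers is_assigned and describe are hypothetical names of mine, and treating general category "Cn" as "not supported" is a simplifying assumption.

```python
import unicodedata

# The Unicode version this runtime's database was built from (a string).
database_version = unicodedata.unidata_version

def is_assigned(ch):
    """Treat general category Cn (unassigned) as unknown to this runtime."""
    return unicodedata.category(ch) != "Cn"

def describe(ch):
    """Return (general category, character name or None) for one character."""
    return unicodedata.category(ch), unicodedata.name(ch, None)

print(database_version)
print(describe("A"), is_assigned("A"))
print(describe("\uffff"), is_assigned("\uffff"))   # U+FFFF is a noncharacter
```

A code point that was added in a newer Unicode version than database_version will also report "Cn" here, which is one way results vary between runtimes.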
Level 5: Segmented text
Your basic unit is a string of Unicode characters of some length, such as a word, paragraph, or grapheme cluster. To get these you need to apply breaking/segmentation rules to a string of Unicode characters.
swift/raku
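Python's standard library does not implement the Unicode segmentation rules, so the following is only a crude approximation: it attaches combining marks to the preceding character and nothing more. It is not the real grapheme cluster algorithm (it ignores emoji sequences, Hangul, and so on), and rough_clusters is a hypothetical name.

```python
import unicodedata

def rough_clusters(text):
    """Group each combining mark with the character before it (approximation)."""
    clusters = []
    for ch in text:
        # A non-zero canonical combining class marks a combining character.
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

text = "e\u0301a"   # "e" + COMBINING ACUTE ACCENT, then "a"

clusters = rough_clusters(text)

# Reversing code points moves the accent onto the wrong letter.
naive_reverse = text[::-1]

# Reversing clusters keeps the accent attached to the "e".
cluster_reverse = "".join(reversed(clusters))

print(len(text), len(clusters))
```

Languages that work at this level by default do this with the full segmentation rules, so a sketch like this should only be used to build intuition.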
Further reading
I highly recommend reading the following resources
- The Unicode Standard, Version 15.0 chapters 1, 2, 3, 4 and 23
- Unicode Glossary
- Unicode Technical Reports
- Unicode Frequently Asked Questions
You might also find the following tools helpful:
The Unicode Common Locale Data Repository provides locale-specific information that aids in tailoring Unicode algorithms and in other localization tasks.
TODO:
languages/locales
Non-Unicode compatibility
- preserving data
- While writing this page I researched and documented Unicode support in various programming languages. You can see my notes here: Unicode guide/Implementations.