Unicode guide
'''This is a WIP page, take nothing here as final.'''
If you've ever tried to learn Unicode you've most likely looked at online tutorials and learning resources. These tend to focus on specific details about how Unicode works instead of the broader picture.
This guide is my attempt to help you build a mental model of Unicode that can be used to write functional software and navigate the official Unicode standards and resources.
Unicode provides two distinct definitions of the term 'character': abstract characters and encoded characters. When discussing Unicode the term 'character' means an encoded character.
Abstract characters are the units that make up textual data on a computer. These are usually some portion of a written script that has a unique identity independent of Unicode, such as a letter, symbol, accent, logogram, or spacing, but they may be something else entirely. The best way to think of these is as the atoms used for text editing, display, organization, and storage.
Encoded characters are mappings of an abstract character to the Unicode codespace as a code point. This is almost always what people mean by 'character' in Unicode discussion. There's not a one-to-one mapping between abstract and encoded characters: abstract characters might be mapped multiple times to aid in compatibility with other character sets, they might not be mapped at all and instead be represented using a sequence of other encoded characters, or they might not be representable at all and require addition in future Unicode versions.
In addition to having a code point each character has a set of properties that provide information about the character to aid in writing Unicode algorithms. These include things like name, case, category, script, direction, numeric value, and rendering information.
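For instance, Python's standard-library <code>unicodedata</code> module exposes several of these properties; the names and values below reflect the Unicode database bundled with the interpreter:

```python
import unicodedata

ch = "\u00c9"  # É
assert unicodedata.name(ch) == "LATIN CAPITAL LETTER E WITH ACUTE"
assert unicodedata.category(ch) == "Lu"      # general category: Letter, uppercase
assert unicodedata.bidirectional(ch) == "L"  # direction: left-to-right
assert unicodedata.numeric("\u0663") == 3.0  # ARABIC-INDIC DIGIT THREE
assert ch.lower() == "\u00e9"                # simple case mapping to é
```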
* U+1F440 "👀": [https://util.unicode.org/UnicodeJsps/character.jsp?a=%F0%9F%91%80&B1=Show EYES]
* U+00942 " ू": [https://util.unicode.org/UnicodeJsps/character.jsp?a=0942 DEVANAGARI VOWEL SIGN UU]
* U+1F1F3 "🇳": [https://util.unicode.org/UnicodeJsps/character.jsp?a=%F0%9F%87%B3+&B1=Show REGIONAL INDICATOR SYMBOL LETTER N]
* U+02028: [https://util.unicode.org/UnicodeJsps/character.jsp?a=2028 LINE SEPARATOR]
* U+0200B: [https://util.unicode.org/UnicodeJsps/character.jsp?a=200B ZERO WIDTH SPACE]
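These names can also be looked up programmatically, for example with Python's <code>unicodedata</code> module:

```python
import unicodedata

# Character names as assigned by the Unicode character database:
assert unicodedata.name("\U0001F440") == "EYES"
assert unicodedata.name("\u0942") == "DEVANAGARI VOWEL SIGN UU"
assert unicodedata.name("\U0001F1F3") == "REGIONAL INDICATOR SYMBOL LETTER N"
assert unicodedata.name("\u2028") == "LINE SEPARATOR"
assert unicodedata.name("\u200B") == "ZERO WIDTH SPACE"
```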
== Encodings ==
Storing an arbitrary code point requires a 21-bit number. This is a problem for a few reasons:
* Modern computers would store this in a 32-bit number
* UTF-16 which uses 16-bit code units
* UTF-32 which uses 32-bit code units
These encoding forms encode all valid code points except surrogate code points.
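To make this concrete, here is one scalar value passed through all three encoding forms in Python (shown via their big-endian byte serializations), along with the rule that surrogate code points themselves cannot be encoded:

```python
eyes = "\U0001F440"  # U+1F440 EYES

# One scalar value, three encoding forms:
assert eyes.encode("utf-8") == b"\xf0\x9f\x91\x80"      # four 8-bit code units
assert eyes.encode("utf-16-be") == b"\xd8\x3d\xdc\x40"  # a surrogate pair
assert eyes.encode("utf-32-be") == b"\x00\x01\xf4\x40"  # one 32-bit code unit

# Surrogate code points are excluded from every encoding form:
try:
    "\ud800".encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
assert not encodable
```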
The standard then defines encoding schemes that transform between code units and bytes:
* UTF-16 which is either UTF-16LE or UTF-16BE with a byte order mark for detection
* UTF-32 which is either UTF-32LE or UTF-32BE with a byte order mark for detection
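A small Python sketch of the scheme distinction, using the standard <code>codecs</code> module's byte order mark constants:

```python
import codecs

# The byte order marks used by the UTF-16 encoding schemes:
assert codecs.BOM_UTF16_LE == b"\xff\xfe"
assert codecs.BOM_UTF16_BE == b"\xfe\xff"

# The explicit schemes fix the byte order with no BOM:
assert "A".encode("utf-16-le") == b"A\x00"
assert "A".encode("utf-16-be") == b"\x00A"

# The plain "utf-16" decoder consumes a leading BOM to detect byte order:
assert (codecs.BOM_UTF16_LE + b"A\x00").decode("utf-16") == "A"
assert (codecs.BOM_UTF16_BE + b"\x00A").decode("utf-16") == "A"
```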
Note that code unit sequences are often not valid Unicode, even if a development environment claims its strings are UTF-8 or UTF-16. An obvious example is Linux strings, where the 8-bit code units are arbitrary bytes without a specified encoding (though UTF-8 is the most common). A less obvious one is Windows and JavaScript strings: their 16-bit code units should encode UTF-16, but validity isn't enforced.
Be sure to investigate what guarantees your tools give or don't give regarding encoded data. If the guarantees aren't what you need you can always validate the data yourself.
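One way to validate untrusted byte data yourself is a strict decode; a minimal Python sketch:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Check whether a byte sequence is well-formed UTF-8."""
    try:
        data.decode("utf-8")  # strict errors by default
        return True
    except UnicodeDecodeError:
        return False

assert is_valid_utf8("héllo".encode("utf-8"))
assert not is_valid_utf8(b"\xc3\x28")  # lead byte without its continuation byte
```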
== Algorithms ==
Grapheme clusters are the closest representation you can get to the idea of a single abstract character. Some newer programming languages even use these as the default abstraction for their strings. This turns out to work fairly well and reduces the difficulty of writing Unicode-compliant programs.
The main downside to this approach is that string operations are no longer guaranteed to be reproducible between program environments and versions. Unicode text may be split one way on one system and another way on another, or change behaviour on a system upgrade. One real-world example: given a giant character sequence of one base character and thousands of combining characters, one system may treat it as a single grapheme cluster while another may split it during normalization into many grapheme clusters.
This lack of stability isn't necessarily a bad thing. After all, the world changes and so must our tools. But it needs to be kept in mind for applications that expect the stability traditional strings provide. A method to serialize sequences of grapheme clusters would help here, instead of having to recompute them from code points.
For full details on the algorithm check out the standard: [https://unicode.org/reports/tr29/ UAX #29: Unicode Text Segmentation]
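Python's standard library has no UAX #29 segmenter (third-party libraries, such as the <code>regex</code> module with its <code>\X</code> pattern, provide one). Purely as an illustration of the idea, here is a toy segmenter that attaches combining marks to the preceding base character; it implements only a tiny fraction of the real rules:

```python
import unicodedata

def toy_clusters(text):
    """Toy approximation of grapheme clusters: a base character plus any
    following combining marks. Real segmentation is defined by UAX #29."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # fold the combining mark into the previous cluster
        else:
            clusters.append(ch)
    return clusters

# "é" spelled as 'e' + COMBINING ACUTE ACCENT stays one cluster:
assert toy_clusters("e\u0301x") == ["e\u0301", "x"]
```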
You can experiment with breaks online using the [https://util.unicode.org/UnicodeJsps/breaks.jsp Unicode Utilities: Breaks] tool.
== Abstraction levels ==
Unicode text can be handled at several levels of abstraction:
* Bytes
* Code units
* Unicode scalars
* Unicode characters
* Segmented text
=== Level 1: Bytes ===
Your basic unit is the byte. You can compare, search, split, and sort, but only in terms of raw bytes. This is the level of filesystems, Unix, and C strings; schemes like UTF-8b exist to round-trip arbitrary bytes at this level.
=== Level 2: Code units ===
Your basic unit is the smallest unit of your Unicode encoding: a byte for UTF-8, a 16-bit integer for UTF-16, or a 32-bit integer for UTF-32. You can compare, search, split, and sort. To get to this point you have to handle endianness. Windows strings live at this level.
=== Level 3: Unicode scalars ===
Your basic unit is a number between 0x0 and 0x10FFFF inclusive, excluding the surrogate range. To get to this point you have to decode UTF-8, UTF-16, or UTF-32. You can compare, search, split, and so on, but it's important to note that these are just numbers; no meaning is attached to them. Python strings live at this level.
TODO: Code points can be noncharacters or reserved characters.
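Python strings behave close to this level: each element converts to and from a plain number with <code>ord</code>/<code>chr</code>, and Python even permits surrogate code points in strings, so it doesn't strictly enforce scalar values:

```python
# A scalar is just a number; no properties are implied at this level.
assert ord("\U0001F440") == 0x1F440
assert chr(0x1F440) == "\U0001F440"

# A lone surrogate is storable in a str even though it's not a scalar value:
assert len("\ud800") == 1
```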
=== Level 4: Unicode characters ===
Your basic unit is a code point that your runtime recognizes and is willing to interpret using its copy of the Unicode character database. Results vary according to the supported Unicode version. You can normalize, compare, match, search, split, and case-map strings. Locale-specific operations may be provided. To get these, the runtime needs to check whether the characters are supported.
???
TODO: noncharacters
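Python's <code>unicodedata</code> module illustrates this level: operations consult a bundled copy of the character database, so results depend on the interpreter's Unicode version:

```python
import unicodedata

print(unicodedata.unidata_version)  # database version; varies by interpreter

assert unicodedata.name("\u00e9") == "LATIN SMALL LETTER E WITH ACUTE"
# Normalization relies on the database's knowledge of decompositions:
assert unicodedata.normalize("NFD", "\u00e9") == "e\u0301"
# Case mapping beyond simple ASCII:
assert "\u00df".casefold() == "ss"  # ß case-folds to "ss"
```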
=== Level 5: Segmented text ===
Your basic unit is a sequence of Unicode characters, such as a grapheme cluster, word, or paragraph. To get these you apply breaking/segmentation rules to a string of Unicode characters. Swift and Raku strings live at this level.
== Non-standard encodings ==
Other standards:
* GB 18030
Previous standards:
* UCS-2
* UCS-4
* UTF-1
In-memory storage:
* 32-bit integers
* 'runes'
* WTF-8
* UTF-8b
* UTF8-C8
* NFG (Raku's Normal Form Grapheme)
* Python's flexible string representation (PEP 393)
== General mistakes ==
* Languages that don't let you store all code points
* Not tagging data with locale/encoding
* Relying on locale
* Not using markup
* utf8b
* with and encoding isn't that important
* APIs will give you invalid data
* APIs may not check code units
* APIs might not let you handle surrogates
* code units, etc
* uint32
* utf-32
* Not grapheme aware: 🇪🇳🇮🇸 -> 🇪🇳 🇮🇸, fonts will cheaply display as 🇪 🇳🇮 🇸, grep
* Not the same as ligatures
* fonts, cursive
* flags
* default/tailored
For example, two individual letters are often two separate graphemes. When two letters form a ligature, however, they combine into a single glyph. They are then part of the same cluster and are treated as a unit by the shaping engine, even though the two original, underlying letters remain separate graphemes.
* round trips, invalid unicode, non unicode, confusables
* length
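The 'length' pitfall in particular is easy to demonstrate: the answer depends entirely on which abstraction level you count at. In Python:

```python
s = "\U0001F440"  # 👀, one user-perceived character

assert len(s) == 1                           # code points (Python's len)
assert len(s.encode("utf-16-le")) // 2 == 2  # UTF-16 code units (what JavaScript's .length counts)
assert len(s.encode("utf-8")) == 4           # bytes
```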
== Further reading ==