Unicode guide
'''This is a WIP page, take nothing here as final.'''
If you've ever tried to learn Unicode you've most likely looked at online tutorials and learning resources. These tend to focus on specific details about how Unicode works instead of the broader picture.
This guide is my attempt to help you build a mental model of Unicode that can be used to write functional software and navigate the official Unicode standards and resources.
Unicode provides two distinct definitions of the term 'character': abstract characters and encoded characters. When discussing Unicode the term 'character' means an encoded character.
Abstract characters are the units that make up textual data on a computer. These are usually some portion of a written script that has a unique identity independent of Unicode, such as a letter, symbol, accent, logogram, or spacing, but they may be something else entirely. The best way to think of these is as atoms used to handle text editing, displaying, organization and storage.
Encoded characters are mappings of an abstract character to the Unicode codespace as a code point. This is almost always what people mean by 'character' in Unicode discussion. There's not a one-to-one mapping between abstract and encoded characters: abstract characters might be mapped multiple times to aid in compatibility with other character sets, they might not be mapped at all and instead be represented using a sequence of other encoded characters, or they might not be representable at all and require addition in future Unicode versions.
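As a rough illustration of the many-to-one case, the sketch below uses Python's standard <code>unicodedata</code> module. The choice of "Å" as the example is mine, not something the guide prescribes: it happens to be encoded once as a precomposed letter, once more as a compatibility symbol, and can also be written as a sequence of two encoded characters.

```python
import unicodedata

# Three different code point sequences for the same abstract character.
precomposed = "\u00C5"  # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom = "\u212B"     # ANGSTROM SIGN, encoded for compatibility
sequence = "A\u030A"    # LATIN CAPITAL LETTER A + COMBINING RING ABOVE

# Compared code point by code point, all three strings differ.
assert len({precomposed, angstrom, sequence}) == 3

# Canonical normalization (NFC) maps all three to a single form.
nfc = {unicodedata.normalize("NFC", s) for s in (precomposed, angstrom, sequence)}
print(nfc)            # a set containing one string
print(len(sequence))  # 2: one abstract character, two encoded characters
```

Comparing strings without normalizing first would treat these three spellings as unrelated text.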
In addition to having a code point, each character has a set of properties that provide information about the character to aid in writing Unicode algorithms. These include things like name, case, category, script, direction, numeric value, and rendering information.
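A few of these properties can be inspected from Python's standard <code>unicodedata</code> module. This is only a sketch: the <code>describe</code> helper is a name made up for this example, and the module exposes just a subset of the properties (the script property, for instance, is not available there).

```python
import unicodedata

def describe(ch: str) -> dict:
    """Collect a handful of Unicode properties for a single code point."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),          # general category, e.g. "Lu"
        "bidi_class": unicodedata.bidirectional(ch),   # direction, e.g. "L"
        "combining_class": unicodedata.combining(ch),  # canonical combining class
        "numeric_value": unicodedata.numeric(ch, None),
    }

# A letter, a combining vowel sign, a fraction and an invisible format character.
for ch in ("A", "\u0942", "\u00BD", "\u200B"):
    print(describe(ch))
```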
* U+1F440 "👀": [https://util.unicode.org/UnicodeJsps/character.jsp?a=%F0%9F%91%80&B1=Show EYES]
* U+00942 " ू": [https://util.unicode.org/UnicodeJsps/character.jsp?a=0942 DEVANAGARI VOWEL SIGN UU]
* U+1F1F3 "🇳": [https://util.unicode.org/UnicodeJsps/character.jsp?a=%F0%9F%87%B3+&B1=Show REGIONAL INDICATOR SYMBOL LETTER N]
* U+02028: [https://util.unicode.org/UnicodeJsps/character.jsp?a=2028 LINE SEPARATOR]
* U+0200B: [https://util.unicode.org/UnicodeJsps/character.jsp?a=200B ZERO WIDTH SPACE]
Grapheme clusters are the closest representation you can get to the idea of a single abstract character. Some newer programming languages even use these as the default abstraction for their strings. This turns out to work fairly well and reduces the difficulty in writing Unicode compliant programs.
The main downside to this approach is that string operations are no longer guaranteed to be reproducible between program environments and versions. Your Unicode text may be split one way on one system and another way on another, or change behaviour on system upgrade. One real world example of this would be if you're given a giant character sequence of one base character and thousands of combining characters. One system may treat this as one grapheme cluster, another may split it up during normalization into many grapheme clusters.
This lack of stability isn't necessarily a bad thing. After all, the world changes and so must our tools. But it needs to be kept in mind for applications that are expecting the stability that traditional strings provide. A method to serialize sequences of grapheme clusters would help here, instead of having to recompute them based on code points.
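To make the difference between counting code points and counting clusters concrete, here is a deliberately crude sketch in Python. It is not the segmentation algorithm from the Unicode standard: the hypothetical <code>combining_sequences</code> helper only attaches combining marks (general category M) to the preceding code point, and ignores the other segmentation rules (Hangul syllables, emoji sequences, regional indicator pairs, CRLF and so on). Python's standard library has no grapheme cluster segmentation, so real code would need a dedicated library.

```python
import unicodedata

def combining_sequences(text: str) -> list:
    """Group each code point with the combining marks that follow it.

    Crude approximation only; real grapheme cluster rules cover many more cases.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch   # attach the mark to the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

decomposed = "e\u0301a"  # "e" + COMBINING ACUTE ACCENT + "a"
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))         # 3 2: code point counts differ
print(len(combining_sequences(decomposed)),
      len(combining_sequences(composed)))     # 2 2: cluster counts agree
```

The code point count changes under normalization while the cluster count does not, which is why clusters sit closer to what a reader would call a character.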
You can experiment with breaks online using the [https://util.unicode.org/UnicodeJsps/breaks.jsp Unicode Utilities: Breaks] tool.
== Strings ==
The majority of programming languages and related development tools choose not to represent text using a sequence of Unicode code points. Instead, they provide data types that represent sequences of integers of some size, usually 8-bit, 16-bit or 32-bit. The developer is tasked with correctly storing Unicode sequences in these integers using some encoding defined by the language or tools. These integers serve an identical purpose to code units, but are used instead for a non-Unicode encoding.
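As a small illustration of how one code point ends up spread across integers of different sizes, the sketch below splits the standard UTF-8, UTF-16 and UTF-32 encodings of U+1F440 into their units. The <code>code_units</code> helper is a hypothetical name introduced for this example; it only shows the well-formed Unicode case, whereas a language's string type may hold any integers of that width.

```python
def code_units(text: str, encoding: str, unit_bytes: int) -> list:
    """Encode text and split the bytes into little-endian units of unit_bytes each."""
    data = text.encode(encoding)
    return [
        int.from_bytes(data[i:i + unit_bytes], "little")
        for i in range(0, len(data), unit_bytes)
    ]

text = "\U0001F440"  # EYES, a single code point

utf8 = code_units(text, "utf-8", 1)       # four 8-bit units
utf16 = code_units(text, "utf-16-le", 2)  # two 16-bit units (a surrogate pair)
utf32 = code_units(text, "utf-32-le", 4)  # one 32-bit unit

for label, units in (("8-bit", utf8), ("16-bit", utf16), ("32-bit", utf32)):
    print(label, [hex(u) for u in units])
```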
There are a few reasons languages use a non-Unicode encoding:
* Non-Unicode data doesn't need a separate data type
* Non-Unicode APIs can be merged with Unicode APIs
* Surrogate code points can be represented in strings
* No performance is spent on string validation
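Python's <code>str</code> type can serve as one example of the surrogate point above. The sketch below assumes nothing beyond the standard library: a lone surrogate is accepted inside a string, and the problem only surfaces when the string is encoded strictly.

```python
# A lone high surrogate: not valid on its own in any Unicode encoding form,
# but representable in a Python str without any error.
lone = "\ud83d"
print(len(lone))  # 1

# Strict encoding validates the string and refuses the surrogate.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)  # False

# The "surrogatepass" error handler skips that validation step.
passed = lone.encode("utf-8", "surrogatepass")
print(passed)  # b'\xed\xa0\xbd'
```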
Languages that have a strict separation between non-Unicode and Unicode usually hit these issues:
* Developer fatigue from decoding and encoding Unicode
* Code for handling non-Unicode data is neglected
* Automatic decoding can fail and crash an entire program
A common real world example here is handling filenames. Let's say you write a program that adds the date to filenames. What happens if it encounters a non-Unicode filename?
It has a few choices:
* Throw an error and ignore the non-Unicode file
* Replace the non-Unicode code units with question marks
* Mix a date into the filename's code units and hope for the best
Most people would expect a program to take the last option and at least try to add the date to a non-Unicode string. Languages that allow non-Unicode and Unicode data to mix can get the expected outcome more easily, without developers needing to write extra code.
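The third choice can be sketched in Python using the <code>surrogateescape</code> error handler, which smuggles undecodable bytes through a string as lone surrogates. The <code>add_date_prefix</code> helper, the date format and the sample filename are all assumptions made for this example.

```python
from datetime import date

def add_date_prefix(raw_name: bytes, today: date) -> bytes:
    """Prefix a filename with an ISO date, preserving any non-UTF-8 bytes."""
    # Bytes that are not valid UTF-8 become lone surrogates (U+DC80..U+DCFF)
    # instead of raising an error.
    text = raw_name.decode("utf-8", errors="surrogateescape")
    dated = f"{today.isoformat()}-{text}"
    # Encoding with the same handler turns those surrogates back into the
    # original bytes, so the non-Unicode part round-trips unchanged.
    return dated.encode("utf-8", errors="surrogateescape")

latin1_name = "caf\u00e9.txt".encode("latin-1")  # b"caf\xe9.txt", not valid UTF-8
print(add_date_prefix(latin1_name, date(2024, 1, 31)))
# b'2024-01-31-caf\xe9.txt'
```

The program never learns what the byte 0xE9 was meant to be; it only carries it through untouched, which is usually what the user wanted.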
TODO: This ends up working unreasonably well in practice, as most algorithms only operate on portions of strings, with most data being silently ignored regardless of whether it is valid Unicode or invalid data.
examples:
- C: byte strings, sometimes UTF-8
- JavaScript: 16-bit code units (UCS-2/UTF-16, lone surrogates allowed)
- Python: code points, undecodable bytes as surrogates (UTF-8b)
- Rust: UTF-8, WTF-8
- Haskell: code points
- Perl: treats strings as bytes or Unicode based on a flag
- Go: byte strings, UTF-8 by convention (runes are 32-bit)
- Swift: grapheme clusters
- Raku: normalized grapheme clusters
locale information/rich text
== Abstraction levels ==
- segmented text
=== Level 1: Bytes ===