'''This is a WIP page, take nothing here as final.'''


If you've ever tried to learn Unicode you've most likely looked at online tutorials and learning resources. These tend to focus on specific details about how Unicode works instead of the broader picture.


This guide is my attempt to help you build a mental model of Unicode that can be used to write functional software and navigate the official Unicode standards and resources.
Unicode provides two distinct definitions of the term 'character': abstract characters and encoded characters. When discussing Unicode, the term 'character' means an encoded character.


Abstract characters are the units of writing that make up textual data. These are usually some portion of a written script that has a unique identity independent of Unicode, such as a letter, symbol, accent, logogram, or spacing, but they may be something else entirely. The best way to think of these is as atoms used for text editing, display, organization and storage.


Encoded characters are mappings of an abstract character to the Unicode codespace as a code point. This is almost always what people mean by 'character' in Unicode discussions. There's no one-to-one mapping between abstract and encoded characters: abstract characters might be mapped multiple times to aid compatibility with other character sets, they might not be mapped at all and instead be represented using a sequence of other encoded characters, or they might not be representable at all and require addition in a future Unicode version.


In addition to having a code point each character has a set of properties that provide information about the character to aid in writing Unicode algorithms. These include things like name, case, category, script, direction, numeric value, and rendering information.
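Many environments expose these properties directly. As a small illustration, here's a sketch using Python's standard unicodedata module, which covers a subset of them:

<syntaxhighlight lang="python">
import unicodedata

# Name and general category of U+00E9 LATIN SMALL LETTER E WITH ACUTE
print(unicodedata.name("\u00e9"))      # LATIN SMALL LETTER E WITH ACUTE
print(unicodedata.category("\u00e9"))  # Ll (Letter, lowercase)

# Numeric value of U+00BD VULGAR FRACTION ONE HALF
print(unicodedata.numeric("\u00bd"))   # 0.5

# Bidirectional class of U+05D0 HEBREW LETTER ALEF
print(unicodedata.bidirectional("\u05d0"))  # R (right-to-left)
</syntaxhighlight>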
* U+1F440 "👀": [https://util.unicode.org/UnicodeJsps/character.jsp?a=%F0%9F%91%80&B1=Show EYES]
* U+0942 " ू": [https://util.unicode.org/UnicodeJsps/character.jsp?a=0942 DEVANAGARI VOWEL SIGN UU]
* U+1F1F3 "🇳": [https://util.unicode.org/UnicodeJsps/character.jsp?a=%F0%9F%87%B3+&B1=Show REGIONAL INDICATOR SYMBOL LETTER N]
* U+2028: [https://util.unicode.org/UnicodeJsps/character.jsp?a=2028 LINE SEPARATOR]
* U+200B: [https://util.unicode.org/UnicodeJsps/character.jsp?a=200B ZERO WIDTH SPACE]
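If you want to inspect code points yourself, most languages can convert between code points and characters. A quick sketch in Python:

<syntaxhighlight lang="python">
# A character is written U+XXXX after its code point in hexadecimal
print(hex(ord("👀")))     # 0x1f440
print(chr(0x1F440))       # 👀

# Escapes spell out code points directly: \uXXXX or \UXXXXXXXX
print("\u2028" == chr(0x2028))       # True
print("\U0001F440" == chr(0x1F440))  # True
</syntaxhighlight>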


== Encodings ==
Storing an arbitrary code point requires an unsigned 21-bit number. This is a problem for a few reasons:


* Modern computers would store this in a 32-bit number
* Storing a load of 32-bit numbers is space inefficient
Modern development environments break encoded Unicode text into sequences of one or more code units:


* Unix strings use 8-bit code units
* UTF-16 which uses 16-bit code units
* UTF-32 which uses 32-bit code units
These encoding forms encode all valid code points except surrogate code points, even UTF-32, which is otherwise a straight representation of code points as 32-bit integers.
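An easy way to see the three encoding forms in action is to encode the same text and count code units. A sketch in Python, using the -le codec variants so no byte order mark is added:

<syntaxhighlight lang="python">
text = "é👀"  # U+00E9 and U+1F440

utf8 = text.encode("utf-8")       # 8-bit code units
utf16 = text.encode("utf-16-le")  # 16-bit code units, no byte order mark
utf32 = text.encode("utf-32-le")  # 32-bit code units, no byte order mark

print(len(utf8))        # 6 code units: 2 for é, 4 for 👀
print(len(utf16) // 2)  # 3 code units: 1 for é, a surrogate pair for 👀
print(len(utf32) // 4)  # 2 code units: one per code point
</syntaxhighlight>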


The standard then defines encoding schemes that transform between code units and bytes:
* UTF-16 which is either UTF-16LE or UTF-16BE with a byte order mark for detection
* UTF-32 which is either UTF-32LE or UTF-32BE with a byte order mark for detection
The byte order mark is actually the Unicode character U+FEFF [https://util.unicode.org/UnicodeJsps/character.jsp?a=FEFF&B1=Show ZERO WIDTH NO-BREAK SPACE], interpreted as a byte order mark for UTF-16 and UTF-32 when present at the start of encoded text. The initial U+FEFF code point is added and removed during encoding and decoding, but any other U+FEFF code points are kept.

Some software treats the byte order mark as a signature to detect which Unicode encoding text is using, if it is using Unicode at all. Software that does this may require UTF-8 text to include a byte order mark despite the encoding not needing one.

Unicode also offers the ability to gracefully handle decoding failures. This is done by having decoders substitute invalid data with the U+FFFD [https://util.unicode.org/UnicodeJsps/character.jsp?a=FFFD&B1=Show REPLACEMENT CHARACTER] code point. This character may also be used as a fallback when unable to display a character, or when unable to convert non-Unicode text to Unicode.

It's worth pointing out that code unit sequences are often not valid Unicode, or not even Unicode code units at all, regardless of whether a development environment claims its strings are UTF-8 or UTF-16. An obvious example of this is Linux strings, where the 8-bit code units are arbitrary bytes without a specified encoding, UTF-8 merely being the most common one.

A less obvious example is Windows and JavaScript strings: Their 16-bit code units should encode UTF-16 but their validity isn't enforced. These systems are unlikely to change: Validation has a performance penalty, and being strict about it would break compatibility with data that is not Unicode compliant.

Be sure to investigate what guarantees your tools give or don't give regarding encoded data. If the guarantees aren't what you need you can always validate the data yourself.

All of these encodings may seem overwhelming, but in practice the only two encodings used are UTF-8 and UTF-16. The reason for this split is historical:

The first edition of Unicode had a 16-bit codespace and used a fixed-width 16-bit encoding named UCS-2. The first adopters of Unicode, such as Java and Windows, chose to represent Unicode with UCS-2, while software that required backwards compatibility, such as Unix, used UTF-8 and treated Unicode as just another character set.

The second edition of Unicode increased the codespace to 21 bits and introduced UTF-32 as its fixed-width encoding. UCS-2 was succeeded by the variable-width UTF-16 encoding we have today. A portion of the codespace was reserved as 'surrogate' code points to preserve compatibility between UCS-2 and UTF-16: These code points are seen as valid code points by UCS-2 systems but decoded in pairs as full 21-bit code points by UTF-16.

Lots of time is spent discussing which variable-width encoding is better and which you should use in new projects. In practice the encoding you use is likely already decided by the tools you use and the cultures or APIs you interact with.
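To make the byte order mark and replacement character behaviour concrete, here's a sketch in Python:

<syntaxhighlight lang="python">
# The "utf-16" codec prepends a byte order mark, "utf-16-le" does not
with_bom = "hi".encode("utf-16")
print(with_bom[:2])  # b'\xff\xfe' on little-endian machines (native order)

# The mark is stripped again when decoding
print(with_bom.decode("utf-16"))  # hi

# Decoders can substitute invalid data with U+FFFD instead of failing
bad = b"abc\xff"
print(bad.decode("utf-8", errors="replace"))  # abc�
</syntaxhighlight>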


== Algorithms ==
Grapheme clusters are the closest representation you can get to the idea of a single abstract character. Some newer programming languages even use these as the default abstraction for their strings. This turns out to work fairly well and reduces the difficulty of writing Unicode compliant programs.


The main downside to this approach is that string operations are no longer guaranteed to be reproducible between program environments and versions. Unicode text may be split one way on one system and another way on another, or change behaviour on a system upgrade. One real world example of this would be if you're given a giant character sequence of one base character and thousands of combining characters. One system may treat this as one grapheme cluster, another may split it up during normalization into many grapheme clusters.


This lack of stability isn't necessarily a bad thing. After all, the world changes and so must our tools. But it needs to be kept in mind for applications that expect the stability traditional strings provide. A method to serialize sequences of grapheme clusters would help here, instead of having to recompute them based on code points.
All that said, many applications don't segment text using these algorithms. The most common approach is to not segment text at all and match code point sequences, or to search and map code point sequences to characters.

This tends to work well enough for most applications, but can create some confusing situations:

* "Jose" can match with "José" if the accent is a separate code point
* The flag "🇩🇪" (regional indicators DE) matches against "🇧🇩🇪🇺" (indicators BD and EU)
* The unused regional indicator combinations AB and BC may render as a sole A indicator, "🇧🇧" (regional indicators BB) and a sole C indicator
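A sketch of the first pitfall in Python, using the standard unicodedata module; fixing the flag examples would additionally need grapheme-aware segmentation:

<syntaxhighlight lang="python">
import unicodedata

precomposed = "Jos\u00e9"  # é as the single code point U+00E9
decomposed = "Jose\u0301"  # e followed by U+0301 COMBINING ACUTE ACCENT

print(precomposed == decomposed)  # False: different code point sequences
print("Jose" in decomposed)       # True: matches the bare base letters

# Normalizing both sides (here to NFC) makes comparison behave as expected
print(precomposed == unicodedata.normalize("NFC", decomposed))  # True
</syntaxhighlight>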


For full details on the algorithm check out the standard: [https://unicode.org/reports/tr29/ UAX #29: Unicode Text Segmentation]
A related but separate line breaking algorithm is covered in the Line breaking section below.


You can experiment with breaks online using the [https://util.unicode.org/UnicodeJsps/breaks.jsp Unicode Utilities: Breaks] tool.


== Line breaking ==
''This section is a work in progress: it should cover new line characters, paragraphs, records, vertical tabs and so on, and more importantly how to handle them and whether they are actually recognized in practice.''

Unicode recognizes several new line characters, most inherited from older character sets and ambiguous in meaning:

* U+000D CARRIAGE RETURN (CR)
* U+000A LINE FEED (LF)
* CR followed by LF (CRLF)
* U+0085 NEXT LINE (NEL)
* U+000B LINE TABULATION (VT)
* U+000C FORM FEED (FF)
* U+2028 LINE SEPARATOR (LS)
* U+2029 PARAGRAPH SEPARATOR (PS)

Which character a platform uses as its new line function (NLF) is historical:

* Mac OS 9 and earlier used CR
* Unix and OS X use LF
* Windows uses CRLF
* EBCDIC systems use NEL

These are generally treated the same on input but distinguished on output. LS and PS are Unicode's unambiguous separators: in word processing an NLF means PS, in simple text an NLF means LS, and both PS and LS terminate lines. FF also acts as a page separator.

The newline guidelines in the Unicode standard boil down to:

* When reading, stop at an NLF, LS, FF or PS and don't include it in the line
* When writing, convert NLF, LS and PS to whatever new line the output requires

The line breaking algorithm is related but more complex, and is intended for displaying text: rather than strict breaks at new line characters it provides opportunities to line break, and its breaks are not atomic with graphemes.

For full details on the algorithm check out the standard: [https://www.unicode.org/reports/tr14/ UAX #14: Unicode Line Breaking Algorithm]

You can experiment with breaks online using the [https://util.unicode.org/UnicodeJsps/breaks.jsp Unicode Utilities: Breaks] tool.
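How much of this your environment already handles varies. As one example, Python's str.splitlines() recognizes the full set of new line characters listed above, while a naive split on LF does not:

<syntaxhighlight lang="python">
text = "one\r\ntwo\u0085three\u2028four\u2029five"

# splitlines() knows CR, LF, CRLF, VT, FF, NEL, LS, PS and more
print(text.splitlines())
# ['one', 'two', 'three', 'four', 'five']

# Splitting on LF alone misses the rest
print(text.split("\n"))
# ['one\r', 'two\x85three\u2028four\u2029five']
</syntaxhighlight>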


== Non-Unicode data ==
Although many programming languages and development tools support Unicode, we still live in a world full of non-Unicode data. This includes data in other encodings and character sets, corrupted data, or even malicious data attempting to bypass security mechanisms. This data must be handled mindfully according to an application's requirements.

There are only a few ways to deal with non-Unicode data:

* Don't treat the data as Unicode
* Reject the data and request Unicode
* Do a best effort conversion to Unicode

Which action to take is heavily dependent on how important it is to preserve the original data, or how important it is to perform Unicode processing on the text. For example:

* A filesystem may treat paths as bytes and not perform Unicode processing
* A website may ask the user to resubmit a post that isn't valid Unicode
* A file manager may track filenames as bytes but display them as best effort Unicode
* A photo labeller may prepend Unicode dates to a non-Unicode filename

The decision on how to handle non-Unicode data is highly contextual and can range from simple error messages to complex mappings between non-Unicode and Unicode data.

''TODO: notes to expand on:''

* Structure an application to reduce conversions and only convert when necessary; prefer converting Unicode to non-Unicode; do not mix Unicode and non-Unicode data; avoid unnecessary conversions
* Unicode to non-Unicode conversion is easy; non-Unicode to Unicode is complicated; Unicode to Unicode and non-Unicode to non-Unicode are non-issues
* Round trips increase pain; complexity goes up with the number of conversions in an application, so separate pipelines
* Greedy conversion: convert early, fail hard, no best effort conversion. All data is Unicode. Easy to understand but gives fragile applications. Prefers operating on Unicode data.
* Lazy conversion: convert only when required, allow best effort conversion. Very robust, but requires tracking and mixing non-Unicode data. Prefers operating on non-Unicode data. The cost here is not from the data types but the data contents: you can guarantee a conversion won't fail if you control the data, such as when appending a file path.

== Mixed strings ==
Non-Unicode data is not always represented as bytes. You can represent non-Unicode data as bytes, but many languages instead represent it as Unicode strings with non-Unicode data embedded in them. This is done so that:

* The OS-specific encoding is abstracted away
* The data can round trip back out of the application
* Code can ignore Unicode and treat strings as opaque

These are often called 'OS strings', but I would call them... ''(TODO)''. Conversion from Unicode only works if the string lacks surrogates and has valid code points. ''(TODO: cross-platform APIs may ...)''

For background see:

* [https://peps.python.org/pep-0383/ PEP 383]
* [https://peps.python.org/pep-0540/ PEP 540]
* [https://docs.raku.org/language/unicode#UTF8-C8 Raku's UTF8-C8]
* [https://simonsapin.github.io/wtf-8/ The WTF-8 encoding]
* [https://doc.rust-lang.org/std/ffi/struct.OsString.html Rust's OsString]
* [https://hackage.haskell.org/package/os-string Haskell's os-string]
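The PEP 383 approach linked above can be sketched in Python: undecodable bytes are smuggled through a str as lone surrogate code points and restored when converting back:

<syntaxhighlight lang="python">
# A file name that is not valid UTF-8
raw = b"report-\xff.txt"

# surrogateescape maps each undecodable byte to a lone surrogate
name = raw.decode("utf-8", errors="surrogateescape")
print(repr(name))  # 'report-\udcff.txt'

# Encoding with the same handler restores the original bytes
print(name.encode("utf-8", errors="surrogateescape") == raw)  # True
</syntaxhighlight>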
== Abstraction levels ==
'Unicode strings' in the wild may hold encoded data, code units, code points, non-surrogate code points, or mixed data. To make sense of this it helps to think of text at four levels of abstraction:

* Bytes
* Code units
* Code points
* Segmented text

''TODO: locale information/rich text''

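One way to make these levels concrete is to measure the same text at each one. A sketch in Python; the grapheme cluster count relies on the third-party regex module's \X pattern:

<syntaxhighlight lang="python">
import regex  # third-party module; its \X matches one grapheme cluster

flag = "\U0001F1F3\U0001F1F4"  # 🇳🇴: regional indicators N and O

print(len(flag.encode("utf-8")))           # 8 bytes
print(len(flag.encode("utf-16-le")) // 2)  # 4 UTF-16 code units
print(len(flag))                           # 2 code points
print(len(regex.findall(r"\X", flag)))     # 1 grapheme cluster
</syntaxhighlight>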
=== Level 1: Bytes ===




swift/raku
== Non-standard encodings ==
Other standards:

* GB 18030

Previous standards:

* UCS-2
* UCS-4
* UTF-1

In-memory storage:

* 32-bit integers
* 'runes'
* WTF-8
* UTF-8B
* UTF8-C8
* NFG
* Python's weirdness

== General mistakes ==
* Languages don't let you store all code points
* Not tagging data with locale/encoding
* Relying on locale
* Not using markup
* UTF-8B
* Which encoding you use isn't that important
* APIs will give you invalid data
* APIs may not check code units
* APIs might not let you handle surrogates
* Code units, etc
* uint32
* UTF-32
* Not grapheme aware: 🇪🇳🇮🇸 -> 🇪🇳 🇮🇸 , fonts will cheaply display as 🇪 🇳🇮 🇸 , grep
* Not the same as ligatures
* Fonts: cursive
* Flags
* Default/tailored
* For example, two individual letters are often two separate graphemes. When two letters form a ligature, however, they combine into a single glyph. They are then part of the same cluster and are treated as a unit by the shaping engine, even though the two original, underlying letters remain separate graphemes.
* Round trips, invalid unicode, non-Unicode, confusables
* Length


== Further reading ==