'''This is a WIP page, take nothing here as final.'''

If you've ever tried to learn Unicode, you've most likely looked at online tutorials and learning resources. These tend to focus on specific details about how Unicode works instead of the broader picture.

This guide is my attempt to help you build a mental model of Unicode that can be used to write functional software and navigate the official Unicode standards and resources.

Unicode provides two distinct definitions of the term 'character': Abstract characters and encoded characters. When discussing Unicode, the term 'character' means an encoded character.

Abstract characters are the units of writing that make up textual data on a computer. These are usually some portion of a written script that has a unique identity independent of Unicode, such as a letter, symbol, accent, logogram, or spacing, but they may be something else entirely. The best way to think of these is as the atoms used for text editing, display, organization and storage.

Encoded characters are mappings of an abstract character to the Unicode codespace as a code point. This is almost always what people mean by 'character' in Unicode discussions. There is not a one-to-one mapping between abstract and encoded characters: Abstract characters might be mapped multiple times to aid in compatibility with other character sets, they might not be mapped at all and instead be represented using a sequence of other encoded characters, or they might not be representable at all and require addition in a future Unicode version.
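For example, here is a small Python sketch (U+00E9 versus the two-character sequence U+0065 U+0301 are the forms compared) of an abstract character that can be written either as one encoded character or as a sequence of them:

<syntaxhighlight lang="python">
import unicodedata

single = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
sequence = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

print(single, sequence)                    # both display as 'é'
print(single == sequence)                  # False: different code point sequences
print(unicodedata.normalize("NFC", sequence) == single)  # True: same abstract character
</syntaxhighlight>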


In addition to having a code point, each character has a set of properties that provide information about the character to aid in writing Unicode algorithms. These include things like name, case, category, script, direction, numeric value, and rendering information.
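As a rough sketch of looking these up (Python's unicodedata module exposes a subset of the Unicode Character Database; other languages have equivalents):

<syntaxhighlight lang="python">
import unicodedata

for ch in "A", "é", "½", "👀":
    print(f"U+{ord(ch):04X}",
          unicodedata.name(ch),             # Name
          unicodedata.category(ch),         # General_Category (Lu, Ll, No, So, ...)
          unicodedata.bidirectional(ch),    # Bidi_Class, drives text direction
          unicodedata.numeric(ch, None))    # Numeric_Value, if any
</syntaxhighlight>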
* U+1F440 "👀": [https://util.unicode.org/UnicodeJsps/character.jsp?a=%F0%9F%91%80&B1=Show EYES]
* U+00942 " ू": [https://util.unicode.org/UnicodeJsps/character.jsp?a=0942 DEVANAGARI VOWEL SIGN UU]
* U+1F1F3 "🇳": [https://util.unicode.org/UnicodeJsps/character.jsp?a=%F0%9F%87%B3+&B1=Show REGIONAL INDICATOR SYMBOL LETTER N]
* U+02028: [https://util.unicode.org/UnicodeJsps/character.jsp?a=2028 LINE SEPARATOR]
* U+0200B: [https://util.unicode.org/UnicodeJsps/character.jsp?a=200B ZERO WIDTH SPACE]
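As a quick illustration of the REGIONAL INDICATOR SYMBOL LETTER N entry above, a Python sketch (the flag pairing behaviour comes from Unicode's emoji flag sequences):

<syntaxhighlight lang="python">
n = "\U0001F1F3"        # U+1F1F3 REGIONAL INDICATOR SYMBOL LETTER N
o = "\U0001F1F4"        # U+1F1F4 REGIONAL INDICATOR SYMBOL LETTER O

print(n)                # usually renders as a boxed 'N' on its own
print(n + o)            # the pair renders as one flag: 🇳🇴
print(len(n + o))       # 2 -- still two separate encoded characters
</syntaxhighlight>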
* A file manager may track filenames as bytes but display them as best effort Unicode
* A photo labeller may prepend Unicode dates to a non-Unicode filename
The decision on how to handle non-Unicode data is highly contextual and can range from simple error messages to complex mappings between non-Unicode and Unicode data.
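A minimal sketch of the file-manager case above, assuming a POSIX system where filenames come back as raw bytes when a bytes path is used:

<syntaxhighlight lang="python">
import os

# Track filenames as the bytes the OS gave us; nothing guarantees valid UTF-8.
for raw_name in os.listdir(b"."):
    # Best-effort Unicode for display only: undecodable bytes become U+FFFD.
    display_name = raw_name.decode("utf-8", errors="replace")
    print(display_name)
    # All real file operations keep using the original, untouched bytes.
    os.stat(raw_name)
</syntaxhighlight>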


TODO: these must be conscious decisions

TODO: APIs

TODO: talk about:
* The choice of what to do when converting to Unicode
* How to convert
* When to convert
* What to do with errors
* Operating systems

Structure an application to reduce conversions and only convert when necessary: prefer converting Unicode to non-Unicode, do not mix Unicode and non-Unicode data, and avoid unnecessary conversions.

* Unicode -> non-Unicode: easy
* Non-Unicode -> Unicode: complicated
* Unicode -> Unicode: non-issue
* Non-Unicode -> non-Unicode: non-issue

Round trips increase the pain, and complexity goes up with the number of conversions in an application, so keep separate pipelines in an application.

There are two broad strategies (sketched in code below):
* Greedy conversion: convert early, fail hard, no best-effort conversion. All data is Unicode. Easy to understand, but leads to fragile applications. Prefer this when operating on Unicode data.
* Lazy conversion: convert only when required, allow best-effort conversion. Very robust. Prefer this when operating on non-Unicode data; it requires tracking and mixing non-Unicode data. The cost here is not from the data types but from the data contents: you can guarantee a conversion won't fail if you control the data, such as when appending to a file path.

Languages that allow non-Unicode and Unicode data to mix are able to get the expected outcome more easily, without developers needing to write extra code. This ends up working unreasonably well in practice, as most algorithms only operate on portions of strings, with the rest silently passed through regardless of whether it is valid Unicode or invalid data.
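A minimal sketch of the two strategies, assuming Python's bytes/str types and its 'surrogateescape' error handler; the filename bytes are made up for the example:

<syntaxhighlight lang="python">
raw = b"caf\xe9.txt"   # bytes from the outside world; not valid UTF-8

# Greedy conversion: convert at the boundary and fail hard on invalid data.
try:
    greedy = raw.decode("utf-8")            # strict decoding by default
except UnicodeDecodeError as err:
    print("greedy strategy rejects this data:", err)

# Lazy / best-effort conversion: smuggle unmappable bytes through as lone
# surrogate code points so the original data stays recoverable.
lazy = raw.decode("utf-8", errors="surrogateescape")          # 'caf\udce9.txt'
assert lazy.encode("utf-8", errors="surrogateescape") == raw  # round trips
</syntaxhighlight>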


== Mixed strings ==
The majority of programming languages and related development tools choose not to represent text using a sequence of Unicode code points: Instead they provide data types that represent sequences of integers of some size, usually 8-bit, 16-bit or 32-bit. The developer is tasked with correctly storing Unicode sequences in these integers using some encoding defined by the language or tools. These integers serve an identical purpose to code units, but are used instead for a non-Unicode encoding.

There are a few reasons languages use a non-Unicode encoding:
* Non-Unicode data doesn't need a separate data type
* Non-Unicode APIs can be merged with Unicode APIs
* Surrogate code points can be represented in strings (see the sketch after this list)
* No performance is spent on string validation
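For instance, Python's str takes this approach: a lone surrogate can sit in a string with no validation cost, and the problem only surfaces when well-formed UTF-8 has to be produced. A small sketch:

<syntaxhighlight lang="python">
s = "abc" + chr(0xD800)        # a lone surrogate; building the string never validates
print(len(s), ascii(s[3]))     # 4 '\ud800'

try:
    s.encode("utf-8")          # producing well-formed UTF-8 rejects it
except UnicodeEncodeError as err:
    print("cannot encode:", err)
</syntaxhighlight>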
Languages that have a strict separation between non-Unicode and Unicode usually hit these issues:
* Developer fatigue from decoding and encoding Unicode
* Code for handling non-Unicode data is neglected
* Automatic decoding can fail and crash an entire program
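The last point is easy to hit. A sketch using a made-up log file: one stray byte turns an implicit everything-is-UTF-8 assumption into an unhandled exception:

<syntaxhighlight lang="python">
with open("app.log", "wb") as f:
    f.write(b"ok line\n\xff broken line\n")   # one byte of non-UTF-8 data

# The text-mode API decodes automatically; the stray byte raises
# UnicodeDecodeError, which crashes the program unless every caller catches it.
with open("app.log", encoding="utf-8") as f:
    for line in f:
        print(line.rstrip())
</syntaxhighlight>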


Non-Unicode data is not always represented as bytes: you can represent non-Unicode data as bytes, but many languages instead represent it as Unicode strings with non-Unicode data embedded in them. This is done so that:
* OS-specific encoding is abstracted away
* The data is round-trippable
* Code can ignore Unicode and treat strings as opaque

TODO:
* these are often called 'OS strings' but I would call them
* cross-platform APIs may

Conversion from Unicode only works if the string lacks surrogates and has valid code points.
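Python's os.fsdecode/os.fsencode pair (PEP 383, linked in the examples below) is one concrete version of this embedding. A rough sketch, assuming a POSIX system with a UTF-8 locale and made-up filename bytes:

<syntaxhighlight lang="python">
import os

raw = b"caf\xe9.txt"              # non-UTF-8 bytes handed to us by the OS

name = os.fsdecode(raw)           # a str; the bad byte is embedded as U+DCE9
assert os.fsencode(name) == raw   # round-trips back to the original bytes

# Code that treats the string as opaque (joining paths, using it as a dict key,
# passing it back to the OS) just works; producing well-formed Unicode does not:
try:
    name.encode("utf-8")
except UnicodeEncodeError:
    print("contains surrogates, not representable as UTF-8")
</syntaxhighlight>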
Examples of how languages represent strings:
* C: bytestrings, UTF-8 sometimes
* JavaScript: UCS-2
* Python: code points, bytes as surrogates ("UTF-8b": https://peps.python.org/pep-0383/, https://peps.python.org/pep-0540/)
* Rust: UTF-8, plus WTF-8 for OS strings (https://simonsapin.github.io/wtf-8/, https://doc.rust-lang.org/std/ffi/struct.OsString.html)
* Haskell: code points (https://hackage.haskell.org/package/os-string)
* Perl: treats strings as bytes or Unicode based on a flag
* Go: u32
* Swift: grapheme clusters
* Raku: normalized grapheme clusters (https://docs.raku.org/language/unicode#UTF8-C8)


== Abstraction levels ==


- segmented text
- unicode strings may be encoded data, code units, code points, non-surrogate code points, or mixed data (see the sketch below)
TODO: locale information/rich text
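A sketch of those different views of one piece of text, using Python (whose str is a sequence of code points; the UTF-16 length stands in for languages whose strings are 16-bit code units):

<syntaxhighlight lang="python">
text = "e\u0301\U0001F1F3"   # 'e' + combining acute accent + regional indicator N

print(len(text.encode("utf-8")))           # 7 -- bytes of encoded data
print(len(text.encode("utf-16-le")) // 2)  # 4 -- UTF-16 code units
print(len(text))                           # 3 -- code points
# Grapheme cluster segmentation needs the Unicode segmentation algorithm,
# e.g. the third-party 'regex' module's \X pattern; it is not built in.
</syntaxhighlight>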


=== Level 1: Bytes ===