Unicode guide: Difference between revisions
Revision as of 08:41, 29 March 2022
This is a WIP page, take nothing here as final.
Introduction
Over the past decade it has become increasingly common for programming languages to add Unicode support, specifically support for Unicode strings. This is a good step, but the support is rarely complete and is often buggy. In this page I hope to show what is wrong with this approach and offer some solutions.
While writing this page I researched and documented Unicode support in various programming languages. You can see my notes here: Unicode strings/Implementations.
Just to make it clear: Unicode is only a part of a complete localization framework. Languages do a bunch of other things wrong, but broken Unicode string handling is the topic I'm covering in this page.
Unicode refresher
If you don't understand what Unicode is, I highly recommend reading the following resources in this order:
- The Unicode Standard, Version 14.0 chapters 1, 2, 3, 4, 5 and 23
- Unicode Technical Reports
- Unicode Frequently Asked Questions
You might also find the following tools helpful:
But as a general overview, Unicode defines the following:
- A large multilingual set of abstract characters
- A database of properties for each character (this includes case mapping)
- How to encode characters for storage
- How to normalize text for comparison
- How to segment text into characters, words and sentences
- How to break text into lines
- How to order text for sorting
- How to incorporate Unicode into regular expressions
Some of these can be tailored by locale-dependent rules. The Unicode Common Locale Data Repository provides locale-specific information that aids in this tailoring.
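As a small illustration of two items from the overview (the property database and normalization for comparison), here is a minimal sketch using Python's standard `unicodedata` module. The variable names are only for illustration; this is one way to show the idea, not a complete treatment.

```python
import unicodedata

# Two spellings of the same text: one precomposed code point,
# or "e" followed by COMBINING ACUTE ACCENT.
precomposed = "\u00e9"
decomposed = "e\u0301"

# The code point sequences differ, so a plain comparison fails.
naive_equal = precomposed == decomposed

# Normalizing both sides to the same form (NFC here) makes them comparable.
normalized_equal = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)

# The character database exposes properties such as name and general category.
name = unicodedata.name(precomposed)
category = unicodedata.category(precomposed)

print(naive_equal, normalized_equal)  # False True
print(name, category)                 # LATIN SMALL LETTER E WITH ACUTE Ll
```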
Encodings
Most programming languages tend to define Unicode strings as one of:
- UTF-8 encoded bytes
- UTF-16 encoded 16-bit integers
- Unicode code points encoded as 32-bit integers
However, languages rarely enforce that these strings are well-formed:
- UTF-8 encoded strings might just be a bag of bytes
- UTF-16 encoded strings might contain lone surrogates
- Unicode code points might contain arbitrary 32-bit numbers
In practice, developers have to make sure their strings are valid themselves. This is a lot to put on developers with no clear benefit.
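To make "validating it yourself" concrete, the following sketch checks whether a byte string is well-formed UTF-8 by attempting a strict decode in Python. The helper name `is_well_formed_utf8` is hypothetical; other languages will need a different approach.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


print(is_well_formed_utf8(b"hello"))         # True: plain ASCII is valid UTF-8
print(is_well_formed_utf8(b"\xc3"))          # False: truncated two-byte sequence
print(is_well_formed_utf8(b"\xc0\xaf"))      # False: overlong encoding of "/"
print(is_well_formed_utf8(b"\xed\xa0\x80"))  # False: encoded surrogate U+D800
```

A check like this has to run at every boundary where bytes enter the program, which is exactly the burden described above.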
Languages that work with Unicode code points directly alleviate a lot of issues related to encoding, but still rarely enforce that the code points are valid.
Even worse, not all valid code points can be encoded or decoded: Surrogates are valid code points but prohibited in streams of UTF-8, UTF-16 or UTF-32. Sure, these are code points, but they don't have any business being in strings.
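Python is one example of a language that works with code points directly but does not reject surrogates. The sketch below shows a string holding a lone surrogate, the failure when it is encoded, and a check that a string contains only Unicode scalar values. The helper names `is_scalar_value` and `is_well_formed` are hypothetical.

```python
def is_scalar_value(code_point: int) -> bool:
    """True for U+0000..U+10FFFF, excluding surrogates U+D800..U+DFFF."""
    return 0 <= code_point <= 0x10FFFF and not (0xD800 <= code_point <= 0xDFFF)


def is_well_formed(text: str) -> bool:
    """True if every code point in text is a Unicode scalar value."""
    return all(is_scalar_value(ord(ch)) for ch in text)


# Python accepts a lone surrogate inside a str without complaint.
lone = "a\ud800b"

# The problem only surfaces later, when the string is encoded.
try:
    lone.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(is_well_formed("abc"))  # True
print(is_well_formed(lone))   # False
print(encodable)              # False
```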
Stability
Programmers tend to think of strings and things you can do with strings as 'stable':
- between program runs
- between computers
- etc
- explain here
- normalization
- unstable results and a changing world
- indexing
- locales
- living without clear definitions, living in denial
- no testing
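One possible source of instability between computers is the version of the Unicode Character Database a runtime ships with. The sketch below, which assumes Python and uses the hypothetical helper names `unicode_environment` and `is_assigned`, records that version so results can be compared across machines.

```python
import sys
import unicodedata


def unicode_environment() -> dict:
    """Describe the runtime whose Unicode data produced a result."""
    return {
        "python": sys.version.split()[0],
        "unicode_database": unicodedata.unidata_version,
    }


def is_assigned(ch: str) -> bool:
    """False when the bundled database gives the code point category Cn."""
    # The answer for a given code point can change when a newer Unicode
    # version assigns a character there.
    return unicodedata.category(ch) != "Cn"


print(unicode_environment())
print(is_assigned("A"))  # True
```

Storing the database version next to any persisted result (a sort order, a normalized key) at least makes a mismatch detectable later.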
Breaking
- iterations
- indexing
- grapheme
- code point
- text boundaries
- characters
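The notes above touch on how "length" and "index" depend on the unit being counted. A minimal Python sketch of the differing counts follows; it does not perform grapheme segmentation, it only shows that code points, UTF-8 bytes and UTF-16 code units disagree with what a reader would call one character.

```python
# "e" + COMBINING ACUTE ACCENT: one user-perceived character.
accented = "e\u0301"
# THUMBS UP SIGN + skin tone modifier: also one user-perceived character.
thumbs = "\U0001F44D\U0001F3FD"


def counts(text: str) -> tuple:
    """Return (code points, UTF-8 bytes, UTF-16 code units) for text."""
    return (
        len(text),
        len(text.encode("utf-8")),
        len(text.encode("utf-16-le")) // 2,
    )


print(counts(accented))  # (2, 3, 2)
print(counts(thumbs))    # (2, 8, 4)

# Indexing by code point can split a character from its combining mark.
first = accented[0]
print(first == "e")      # True: the accent has been cut off
```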
Typical operations
Programming languages usually provide the following text operations:
- Matching to see if two strings are the same
- Case conversion to turn strings uppercase or lowercase
- Collation to sort strings into some order
- Querying to see if a code point is uppercase, lowercase, alphanumeric, or some other property
- Encoding functions to serialize and de-serialize a string
- matching
- case conversion/folding
- collation
- classification
- querying properties
- boundaries: code point, cluster, boundaries
- encoding/decoding
- regex
- char vs string
- locales
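The operations listed above can be sketched with Python's built-in string methods and `unicodedata`. The helper names `canonical_caseless_key` and `caseless_match` are hypothetical; the key follows the shape of canonical caseless matching (normalize, case fold, normalize again), and none of this is locale-aware.

```python
import unicodedata


def canonical_caseless_key(text: str) -> str:
    """Key for caseless matching: NFD, case fold, then NFD again."""
    folded = unicodedata.normalize("NFD", text).casefold()
    return unicodedata.normalize("NFD", folded)


def caseless_match(a: str, b: str) -> bool:
    return canonical_caseless_key(a) == canonical_caseless_key(b)


# Matching: binary equality versus caseless, normalization-aware matching.
binary_equal = "Stra\u00dfe" == "STRASSE"                  # False
caseless_equal = caseless_match("Stra\u00dfe", "STRASSE")  # True

# Case conversion can change the length of a string.
upper_sharp_s = "\u00df".upper()  # "SS"

# sorted() orders by code point, which is not collation:
# "zebra" lands before the word starting with U+00E9.
by_code_point = sorted(["zebra", "Apple", "\u00e9clair"])

# Property queries on single code points.
digit_category = unicodedata.category("5")  # "Nd"
is_upper = "A".isupper()                    # True

# Encoding functions: serialize to bytes and back.
encoded = "\u00e9clair".encode("utf-8")
decoded = encoded.decode("utf-8")

print(binary_equal, caseless_equal, upper_sharp_s, digit_category, is_upper)
```

Proper collation needs locale data such as CLDR, which this sketch deliberately leaves out.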
Non-Unicode compatibility
- fs
- getenv
- paths
- etc
- reversibility
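On reversibility: file names and environment variables on many systems are arbitrary bytes, so decoding them to a Unicode string can lose information. One approach, used here as an illustration, is Python's `surrogateescape` error handler, which smuggles undecodable bytes through as lone surrogates so the original bytes can be recovered. The file name below is made up.

```python
# Latin-1 encoded name; the byte 0xE9 is not valid UTF-8 in this position.
raw = b"caf\xe9.txt"

# Undecodable bytes become lone surrogates (0xE9 -> U+DCE9).
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("utf-8", errors="surrogateescape")

# The price: the intermediate string is not well-formed Unicode,
# so a strict encode of it fails.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(round_tripped == raw)  # True
print(strict_ok)             # False
```

This trades well-formedness for reversibility, which is the same tension raised in the Encodings section.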
General recommendations
- well-formed graphemes of Unicode scalars
- more info = better results
- rich text
- locale apis