Unicode guide

This is a WIP page; take nothing here as final.

Introduction
Over the past decade it has become increasingly common for programming languages to add Unicode support, specifically support for Unicode strings. This is a good step, but the support is rarely complete and is often buggy. On this page I hope to show what's wrong with this approach and offer some solutions.

While writing this page I researched and documented Unicode support in various programming languages. You can see my notes here: Unicode strings/Implementations.

Just to make it clear: Unicode is only a part of a complete localization framework. Languages do a bunch of other things wrong, but broken Unicode string handling is the topic I'm covering in this page.

Unicode refresher
If you don't understand what Unicode is, I highly recommend reading the following resources in this order:


 * 1) The Unicode Standard, Version 14.0 chapters 1, 2, 3, 4, 5 and 23
 * 2) Unicode Technical Reports
 * 3) Unicode Frequently Asked Questions

You might also find the following tools helpful:


 * Unicode Utilities
 * Unicode Code Charts

But as a general overview, Unicode defines the following:


 * A large multilingual set of abstract characters
 * A database of properties for each character (this includes case mapping)
 * How to encode characters for storage
 * How to normalize text for comparison
 * How to segment text into characters, words and sentences
 * How to break text into lines
 * How to order text for sorting
 * How to incorporate Unicode into regular expressions

Some of these can be tailored by locale-dependent rules. The Unicode Common Locale Data Repository provides locale-specific information that aids in this tailoring.

Stability
Before getting into a detailed discussion of strings, it's important to point out that string operations are not entirely stable.

To be specific, Unicode string operations generally require the following:


 * The string or strings to work on
 * A locale to tailor the operation to (usually hidden from the programmer)
 * The Unicode database for the supported version (usually supplied by the operating system)
 * The Unicode locale database for the supported locale (usually supplied by the operating system)

Changing any of these can cause string operations to give different output.
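As a small illustration of the Unicode database dependency, some runtimes let you ask which database version they bundle. The sketch below uses Python's standard unicodedata module; the helper name unicode_database_version is made up for this example. Note that Python ships its own copy of the database rather than using the operating system's, so the value depends on the Python release.

```python
import unicodedata


def unicode_database_version() -> tuple:
    """Return the bundled Unicode database version as a tuple of ints.

    Hypothetical helper for illustration only.
    """
    # unidata_version is a dotted string; its exact value depends on the
    # Python release running this code.
    return tuple(int(part) for part in unicodedata.unidata_version.split("."))


if __name__ == "__main__":
    print("Unicode database version:", unicode_database_version())
```

Recording this version next to any stored results (sort keys, normalized text, case-folded keys) is one way to notice when they might need regenerating.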

The Unicode Policies website documents the technical stability guarantees that Unicode and the Unicode database make between versions. Locale data doesn't seem to have any stability policy between versions, but then human cultures aren't stable in general.

On one hand, this is annoying to think about. On the other hand, Unicode at least provides a stability policy. Can you say that about Unix or Windows locales?

Encodings
Most programming languages tend to define Unicode strings as one of:


 * UTF-8 encoded bytes
 * UTF-16 encoded 16-bit integers
 * Unicode code points encoded as 32-bit integers

However, languages rarely enforce that these strings are well-formed:


 * UTF-8 encoded strings might just be a bag of bytes
 * UTF-16 encoded strings might contain lone surrogates
 * Strings of Unicode code points might contain arbitrary 32-bit numbers

In practice, developers have to make sure their strings are valid themselves. This is a lot to put on developers, with no clear benefit.

Languages that work with Unicode code points directly alleviate a lot of issues related to encoding, but still rarely enforce that the code points are valid.

Even worse, not all valid code points can be encoded or decoded: surrogates are valid code points but are prohibited in streams of UTF-8, UTF-16 or UTF-32. Sure, these are code points, but they don't have any business being in strings. Despite this, languages that handle code points directly may allow them in strings.
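Here's a minimal sketch of the checks a developer ends up writing by hand, using Python as the example language. Python strings are sequences of code points and do allow lone surrogates, which makes it a handy demonstration. The function names are my own and not part of any library.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """True if the bytes are well-formed UTF-8 (hypothetical helper)."""
    try:
        # Python's strict UTF-8 decoder rejects invalid sequences,
        # including byte sequences that would decode to surrogates.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def is_scalar_value(code_point: int) -> bool:
    """True for code points U+0000..U+10FFFF excluding surrogates."""
    return 0 <= code_point <= 0x10FFFF and not (0xD800 <= code_point <= 0xDFFF)


def is_well_formed_text(text: str) -> bool:
    """True if every code point in the string is a Unicode scalar value."""
    return all(is_scalar_value(ord(ch)) for ch in text)


if __name__ == "__main__":
    print(is_well_formed_utf8(b"\xc3\xa9"))    # valid two-byte sequence
    print(is_well_formed_utf8(b"\xff"))        # never valid in UTF-8
    print(is_well_formed_text("a\ud800"))      # contains a lone surrogate
```

A language that enforced well-formedness when a string is constructed would make all three functions unnecessary.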

Character properties
When writing code that deals with text, you often need information about specific characters within it.

Most programming languages support querying the following character information:


 * The character's category, such as: Letter, number, symbol, punctuation, whitespace
 * The character's case status, such as: Uppercase, lowercase

The Unicode character database maps various properties to code points. Some are:


 * The code point's name
 * The code point's script
 * The code point's Unicode block
 * The code point's category, such as: Letter, number, symbol, punctuation, whitespace, separator, other
 * The code point's case status, such as: Uppercase, lowercase, titlecase, no case
 * The code point's emoji status, such as: Modifier or base
 * The numeric value represented by the code point

The database also maps more general information as properties, such as:


 * Information about rendering the code point
 * Information used for breaking and segmentation
 * Information used for normalization
 * Information used for bidirectional control and display

Languages that don't provide access to these properties make it impossible to write Unicode-aware code. Even worse, the language might only provide classical functions that work for Latin scripts.
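As an example of what property access can look like, here's a sketch using Python's standard unicodedata module. It exposes only part of the database: names, general categories, bidirectional classes, combining classes and numeric values are there, but script and block are not. The describe function and its dictionary keys are made up for this example.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect a few Unicode database properties for one code point.

    Hypothetical helper; the key names are my own choice.
    """
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),         # e.g. "Lu", "Nd", "Mn"
        "bidi_class": unicodedata.bidirectional(ch),  # e.g. "L", "AN", "NSM"
        "combining_class": unicodedata.combining(ch),
        "numeric_value": unicodedata.numeric(ch, None),
    }


if __name__ == "__main__":
    for ch in ["A", "\u0663", "\u00bd", "\u0301"]:
        print(describe(ch))
```

Note that the numeric value works for digits outside ASCII and for fractions, which a classical isdigit-style function can't express.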

Breaking and segmentation
Unicode defines various ways to group code points in a string:


 * Individual code points (no grouping)
 * Paragraphs
 * Lines
 * Grapheme clusters (User-perceived characters)
 * Words
 * Sentences

By default these are determined by looking at the properties of code points in a string. However, implementations are encouraged to tailor them according to a locale or by other means.

Most languages I've looked at don't support breaking or segmenting text using these Unicode algorithms. This results in people writing non-Unicode-compliant mechanisms that only work for Latin-based languages.
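To show why individual code points aren't enough, here's a rough Python sketch. It is NOT the Unicode grapheme cluster algorithm: it only glues combining marks (general category M*) onto the preceding code point, which is a simplification I'm making to keep the example short. It gets emoji sequences, Hangul jamo and many other cases wrong. The function name rough_clusters is made up.

```python
import unicodedata


def rough_clusters(text: str) -> list:
    """Group each combining mark with the code point before it.

    A deliberately simplified stand-in for grapheme cluster
    segmentation, for illustration only.
    """
    clusters = []
    for ch in text:
        # General categories Mn, Mc and Me all start with "M".
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


if __name__ == "__main__":
    text = "e\u0301a"  # "e" + COMBINING ACUTE ACCENT, then "a"
    print(len(text))                  # counts code points
    print(len(rough_clusters(text)))  # closer to what a user would count
```

Even this toy version disagrees with the code point count, which is exactly the kind of bug that slips through when a language only offers code point indexing.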

Normalization
compatibility code points
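Until this section is written up, here's a small Python sketch of the difference between canonical and compatibility equivalence, using the standard unicodedata.normalize function. The helper names are my own.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def compatibility_equal(a: str, b: str) -> bool:
    """Compare two strings after compatibility decomposition (NFKD)."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)


if __name__ == "__main__":
    composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE
    decomposed = "e\u0301"  # "e" + COMBINING ACUTE ACCENT
    ligature = "\ufb01"     # LATIN SMALL LIGATURE FI, a compatibility code point

    print(composed == decomposed)                 # raw comparison
    print(canonically_equal(composed, decomposed))
    print(canonically_equal(ligature, "fi"))
    print(compatibility_equal(ligature, "fi"))
```

Compatibility normalization loses information (the ligature becomes two plain letters), so it's better suited to matching than to storage.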

Collation
locale specific

depends on the locale you want to use at the moment

Case conversion
non-reversible

may expand
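A quick Python sketch of both notes. Python's str.upper and str.lower use the untailored full case mappings, so this doesn't show locale-specific behaviour such as Turkish dotted and dotless i. The helper name round_trips is made up.

```python
def round_trips(text: str) -> bool:
    """True if uppercasing then lowercasing gives back the original text."""
    return text.upper().lower() == text


if __name__ == "__main__":
    sharp_s = "\u00df"  # LATIN SMALL LETTER SHARP S
    upper = sharp_s.upper()

    print(len(sharp_s), len(upper))  # the string expands when uppercased
    print(round_trips(sharp_s))      # the original can't be recovered
    print(round_trips("abc"))
```

Code that assumes case conversion preserves length (for example, converting in place in a fixed-size buffer) breaks on input like this.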

Matching
normalization

graphemes

Characters
graphemes

normalization

matching

properties

Non-Unicode compatibility
- fs

- getenv

- paths

- etc

- reversibility
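One existing approach to reversibility, sketched here as an example rather than a recommendation, is Python's "surrogateescape" error handler: undecodable bytes are smuggled through as lone surrogates so the original bytes can be recovered later. The function names and the file name are made up.

```python
def bytes_to_text(raw: bytes) -> str:
    """Decode as UTF-8, mapping each invalid byte to a lone surrogate."""
    return raw.decode("utf-8", errors="surrogateescape")


def text_to_bytes(text: str) -> bytes:
    """Reverse bytes_to_text, turning escaped surrogates back into bytes."""
    return text.encode("utf-8", errors="surrogateescape")


if __name__ == "__main__":
    raw = b"caf\xe9.txt"  # hypothetical Latin-1 file name, not valid UTF-8
    text = bytes_to_text(raw)

    print(text_to_bytes(text) == raw)  # the round trip is lossless
    try:
        text.encode("utf-8")           # strict encoding refuses the surrogate
    except UnicodeEncodeError:
        print("not a well-formed Unicode string")
```

The trade-off is that the resulting string contains a lone surrogate, so it is reversible but not well-formed.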

General recommendations
- well-formed graphemes of unicode scalars

- more info = better results

- stability

- rich text

- locale apis

- test your code

- send segmented rich text