Unicode guide

This is a WIP page, take nothing here as final.

TODO: review all this, make cuts of stuff not needed to know about programming languages

TODO: some things *are* stable, some things are tailorable

TODO: interpretable characters limit

== Introduction ==

Over the past decade it's become increasingly common to see programming languages add Unicode support: specifically, support for Unicode strings. This is a good step, but it's rarely complete and often done in a buggy way. Hopefully in this page I can show what's wrong with this approach and provide some solutions.

While writing this page I researched and documented Unicode support in various programming languages. You can see my notes here: Unicode strings/Implementations.

Just to make it clear: Unicode is only a part of a complete localization framework. Languages do a bunch of other things wrong, but broken Unicode string handling is the topic I'm covering in this page.

== Unicode refresher ==

If you don't understand what Unicode is, I highly recommend reading the following resources in this order:

  1. The Unicode Standard, Version 14.0 chapters 1, 2, 3, 4, 5 and 23
  2. Unicode Technical Reports
  3. Unicode Frequently Asked Questions

As a general overview, Unicode defines the following:

  • A large multilingual set of abstract characters
  • A database of properties for each abstract character (this includes case mapping)
  • How to encode abstract characters for storage
  • How to normalize text for comparison
  • How to segment text into words, sentences, lines, and paragraphs
  • How to map between different cases
  • How to order text for sorting
  • How to match text for finding
  • How to incorporate Unicode into regular expressions

Most of these can be tailored by locale-dependent rules, but Unicode provides some sane defaults. The Unicode Common Locale Data Repository (CLDR) provides locale-specific information that aids in this tailoring.

== Stability ==

Before getting into a detailed discussion of strings, it's important to point out that string operations are not entirely stable.

To be specific, Unicode string operations generally require the following:

  • The string or strings to work on
  • A locale to tailor operations for (usually hidden from the programmer)
  • The Unicode database for the supported version (usually supplied by the operating system)
  • The Unicode locale database for the supported locale (usually supplied by the operating system)

Changing any of these can cause string operations to give different output.

The Unicode Policies website documents various policies on the technical stability of Unicode and the Unicode database, covering compatibility between versions. Locales don't seem to have any stability policy between versions, but then human cultures aren't stable in general.

On one hand, this is annoying to think about. On the other hand, Unicode at least provides a well-defined stability policy we can reason about. Can you say that about Unix locales?
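
As a concrete example, each Python runtime bakes a specific version of the Unicode character database into its unicodedata module, so the same operation can give different answers on different runtimes. A minimal sketch (the exact versions depend on your Python build):

<syntaxhighlight lang="python">
# Inspect which Unicode character database version this runtime ships.
import unicodedata

print(unicodedata.unidata_version)  # e.g. "14.0.0" on Python 3.11

# U+1FAE0 MELTING FACE was assigned in Unicode 14: its category is
# "So" (Symbol, other) on a 14.0 database, but "Cn" (unassigned) on older ones.
print(unicodedata.category("\U0001FAE0"))
</syntaxhighlight>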

== Encodings ==

TODO: rewrite

Most programming languages tend to define Unicode strings as one of:

  • UTF-8 encoded bytes
  • UTF-16 encoded 16-bit integers
  • Unicode code points encoded as 32-bit integers

However, languages rarely enforce that these strings are well-formed:

  • UTF-8 encoded strings might just be a bag of bytes
  • UTF-16 encoded strings might contain lone surrogates
  • Strings of code points might contain arbitrary 32-bit numbers

In practice, developers need to make sure their strings are valid themselves. This is a lot to put on them with no clear benefit.

Languages that work with Unicode code points directly alleviate a lot of issues related to encoding, but still rarely enforce that the code points are valid.

Even worse, not all valid code points can be encoded or decoded: surrogates are valid code points but prohibited in streams of UTF-8, UTF-16 or UTF-32. Sure, these are code points, but they have no business being in strings. Despite this, languages that handle code points directly may allow them in strings.
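
Python is a concrete example of such a language; a small demonstration using only built-in behaviour:

<syntaxhighlight lang="python">
# Python strings are sequences of code points, and nothing stops you
# from constructing one that contains a lone surrogate:
s = "\ud800"   # U+D800, a high surrogate
print(len(s))  # 1 -- a perfectly legal Python string

# But surrogates are prohibited in well-formed UTF-8/16/32 streams,
# so the string cannot be encoded:
try:
    s.encode("utf-8")
except UnicodeEncodeError as err:
    print(err)  # "'utf-8' codec can't encode character '\ud800' ..."
</syntaxhighlight>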

- NULL

== Character properties ==

When writing code that deals with text, you often need to know information about specific abstract characters within the text.

Most programming languages support querying the following character information:

  • The abstract character's category, such as: Letter, number, symbol, punctuation, whitespace
  • The abstract character's case status, such as: Uppercase, lowercase

The Unicode character database maps various properties to abstract characters. Some are:

  • The abstract character's name
  • The abstract character's code point
  • The abstract character's script
  • The abstract character's Unicode block
  • The abstract character's category, such as: Letter, number, symbol, punctuation, whitespace
  • The abstract character's case status, such as: Uppercase, lowercase, no case
  • The abstract character's combining status, such as: Modifier or base
  • The numeric value represented by the abstract character

The database also maps more general information as properties, such as:

  • Information about rendering the abstract character
  • Information used for breaking and segmentation
  • Information used for normalization
  • Information used for bidirectional control and display

Languages that don't provide access to these properties make it impossible to write Unicode-aware code. Even worse, a language might instead provide classic character functions that only work for Latin scripts.
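
As an example of what such access looks like, Python's standard unicodedata module exposes several of these properties (though notably not script or block):

<syntaxhighlight lang="python">
import unicodedata

ch = "\u0968"  # DEVANAGARI DIGIT TWO

print(unicodedata.name(ch))      # "DEVANAGARI DIGIT TWO"
print(unicodedata.category(ch))  # "Nd" -- a decimal number, despite not being ASCII
print(unicodedata.numeric(ch))   # 2.0 -- the numeric value it represents

# Combining class distinguishes modifier marks from base characters:
print(unicodedata.combining("\u0301"))  # 230 for COMBINING ACUTE ACCENT
</syntaxhighlight>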

== Breaking and segmentation ==

Unicode defines various ways to group abstract characters in a string:

  • Individual abstract character (no grouping)
  • Paragraphs
  • Lines
  • Grapheme clusters (User-perceived characters)
  • Words
  • Sentences

By default these groupings are determined by looking at the properties of the abstract characters in a string. However, Unicode encourages tailoring them according to a locale or other methods.

Most languages I've looked at don't support breaking or segmenting text using these Unicode algorithms. This results in people writing non-compliant mechanisms that only work for Latin-based languages.
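
In Python, for instance, grapheme cluster segmentation needs a third-party library; a small sketch using the regex package (not the standard library's re module), which supports the \X grapheme cluster pattern:

<syntaxhighlight lang="python">
# Requires the third-party "regex" package (pip install regex).
import regex

text = "e\u0301g\u0303"  # base letters plus combining marks

print(len(text))                   # 4 code points
print(regex.findall(r"\X", text))  # ['é', 'g̃'] -- 2 user-perceived characters
</syntaxhighlight>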

== Normalization ==

Unicode can represent the same user-perceived character in multiple ways:

  • Different code point sequences may map to the same abstract character
  • Combining marks may be applied in many different orders

TODO: is normalization solving the first issue? is it really the same abstract character or grapheme cluster?

Normalization takes a string, applies a mapping to each abstract character, and places combining marks in a canonical order.

Normalization specifies three types of mapping:

  • Canonical decomposition expands a code point to a sequence representing the same abstract character
  • Compatibility decomposition expands a code point to a sequence representing a similar abstract character
  • Canonical composition contracts code point sequences into shorter sequences representing identical abstract characters

Normalization places each combining mark within a code point sequence in a well-defined order after performing a decomposition mapping.

The following normalization forms are supported:

  • NFD canonically decomposes a code point sequence
  • NFKD compatibility decomposes a code point sequence
  • NFC canonically decomposes then canonically composes a code point sequence
  • NFKC compatibility decomposes then canonically composes a code point sequence
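
To make the effect concrete, here is one minimal example using Python's unicodedata.normalize:

<syntaxhighlight lang="python">
import unicodedata

precomposed = "\u00e9"  # é as a single code point
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT

print(precomposed == decomposed)  # False -- different code point sequences
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True

# Compatibility forms also fold "similar" characters:
print(unicodedata.normalize("NFKC", "\ufb01"))  # the fi ligature becomes "fi"
</syntaxhighlight>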

Re-ordering combining marks requires buffering each combining mark in memory. When dealing with streams of input, this can potentially allow unbounded memory usage with malicious input. Unicode specifies a stream-safe format that limits the number of combining marks in a code point sequence during normalization; however, this does not give output equivalent to the standard normalization forms.

Normalization of unassigned code points may give different results in future Unicode versions where those code points are assigned. Private use characters are not affected by normalization.

Some languages provide support for normalizing strings, but generally not stream-safe normalization. This is especially important for tasks that deal with individual grapheme clusters and require a fixed buffer size.

TODO: abstract characters, not code points

== Case mapping ==

Unicode handles case mapping by providing information about:

  • Whether an abstract character is lowercase, uppercase or caseless
  • How to convert an abstract character to uppercase, lowercase, title case
  • How to case fold an abstract character
  • How to apply these operations to strings
  • Some explicit rules for mappings that depend on context in a string

Title case mappings are only applied contextually to abstract characters, such as to the first cased letter of each word segment.

The rules for case mapping are a lot weaker than you'd expect. Notably:

  • Lowercase forms and uppercase forms may not map to each other
  • Case mapping may expand to a longer abstract character sequence
  • Case mapping may not map to the desired case
  • Combining characters are not treated differently from non-combining characters

Case folding is used to map abstract characters to a form suitable for case insensitive operations. This is designed to work even in situations where lowercase and uppercase forms don't map to each other.
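
Python's built-in string methods demonstrate several of these quirks; a few illustrative cases, not an exhaustive list:

<syntaxhighlight lang="python">
# Case mapping can expand a string:
print("ß".upper())       # "SS" -- one character becomes two
print(len("İ".lower()))  # 2 -- "i" plus U+0307 COMBINING DOT ABOVE

# Case folding enables caseless comparison even where uppercase and
# lowercase forms don't round-trip:
print("MASSE".casefold() == "maße".casefold())       # True: ß folds to "ss"
print("ΣΊΣΥΦΟΣ".casefold() == "σίσυφος".casefold())  # True: both sigmas fold to σ
</syntaxhighlight>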


- general algorithms, plus testing, context, lossy, etc


- CLDR/tailoring handles

- do not use language-specific mappings without tagging


- normalization discouraged? (see 2.2 equivalent sequence)

- NFKC_Casefold

- can unnormalize a string


- is stable?

- compatibility

== Sorting ==

- language/locale specific

- normalization

- customizable for things like case sensitivity, ignoring accents

- might be phonetic, based on appearance of character

- may use lookup dictionaries

- display locale may be different

- strings

- binary ordering depends on encoding

- case folding for legacy stuff

- han

unicode collation algorithm

do languages provide this?
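
As a sketch of what locale-tailored sorting can look like, here is Python's standard library approach (it assumes the de_DE.UTF-8 locale is installed on the system; a fuller solution would use a Unicode Collation Algorithm implementation such as PyICU's Collator):

<syntaxhighlight lang="python">
import locale

words = ["zebra", "Äpfel", "apple", "Zürich"]

# Naive code point order: capitals sort before lowercase, umlauts sort last.
print(sorted(words))  # ['Zürich', 'apple', 'zebra', 'Äpfel']

# Locale-tailored collation (assumes de_DE.UTF-8 is available):
locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")
print(sorted(words, key=locale.strxfrm))
# e.g. ['Äpfel', 'apple', 'zebra', 'Zürich'] -- umlauts interleave with
# base letters; the exact order depends on the system's locale data.
</syntaxhighlight>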

== Searching ==

- searching matches based on boundary such as word or grapheme cluster

- checking for whitespace for last matching character for things like accents

- overlapping searches like flags

- weak equivalence for case insensitivity using 5.18 case mappings

- equivalent characters that look the same

do languages provide this?
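
A rough sketch of such weak-equivalence matching in Python, combining normalization with case folding (boundary handling omitted; the function names here are illustrative, not a standard API):

<syntaxhighlight lang="python">
import unicodedata

def canonical_caseless(s: str) -> str:
    # Normalize, fold, then normalize again: case folding can
    # leave a string unnormalized.
    return unicodedata.normalize("NFC", unicodedata.normalize("NFC", s).casefold())

def caseless_contains(haystack: str, needle: str) -> bool:
    return canonical_caseless(needle) in canonical_caseless(haystack)

print(caseless_contains("Der Fluß war breit", "FLUSS"))  # True
</syntaxhighlight>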

== Non-Unicode compatibility ==

- fs

- getenv

- paths

- etc

- reversability

- ttf

- pua

== General recommendations ==

- well-formed graphemes of unicode scalars

- more info = better results

- stability

- rich text

- locale apis

- test your code

- send segmented rich text

- elephant in the room: storage space, microcontrollers

- what is a character?

- benefit over private character sets?