Unicode guide: Difference between revisions

Revision as of 08:41, 29 March 2022

This is a WIP page; take nothing here as final.

Introduction

Over the past decade it has become increasingly common for programming languages to add Unicode support, specifically support for Unicode strings. This is a good step, but the support is rarely complete and is often buggy. On this page I hope to show what is wrong with this approach and offer some solutions.

While writing this page I researched and documented Unicode support in various programming languages. You can see my notes here: Unicode strings/Implementations.

Just to make it clear: Unicode is only a part of a complete localization framework. Languages do a bunch of other things wrong, but broken Unicode string handling is the topic I'm covering in this page.

Unicode refresher

If you don't understand what Unicode is, I highly recommend reading the following resources in this order:

The Unicode Standard, Version 14.0 chapters 1, 2, 3, 4, 5 and 23
Unicode Technical Reports
Unicode Frequently Asked Questions

You might also find the following tools helpful:

But as a general overview, Unicode defines the following:

A large multilingual set of abstract characters
A database of properties for each character (this includes case mapping)
How to encode characters for storage
How to normalize text for comparison
How to segment text into characters, words and sentences
How to break text into lines
How to order text for sorting
How to incorporate Unicode into regular expressions

Some of these can be tailored by locale-dependent rules. The Unicode Common Locale Data Repository (https://cldr.unicode.org/) provides locale-specific information that aids in this tailoring.

Encodings

Most programming languages tend to define Unicode strings as one of:

UTF-8 encoded bytes
UTF-16 encoded 16-bit integers
Unicode code points encoded as 32-bit integers

However, languages rarely enforce that these strings are well-formed:

UTF-8 encoded strings might just be a bag of bytes
UTF-16 encoded strings might contain lone surrogates
Strings of Unicode code points might contain arbitrary 32-bit numbers

In practice, developers have to make sure their strings are valid themselves. This puts a heavy burden on developers for no clear benefit.
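As a rough sketch of what that manual validation looks like, the Python snippet below checks whether a byte string is well-formed UTF-8 by attempting a strict decode. The helper name `is_well_formed_utf8` is hypothetical and the sample inputs are chosen only for illustration; the check relies on Python's built-in UTF-8 codec, which rejects overlong forms and encoded surrogates.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    # Illustrative inputs: one valid string and three ill-formed byte sequences.
    samples = {
        "valid": "caf\u00e9".encode("utf-8"),
        "stray byte": b"\xff",
        "overlong slash": b"\xc0\xaf",
        "encoded surrogate": b"\xed\xa0\x80",
    }
    for label, raw in samples.items():
        print(label, is_well_formed_utf8(raw))
```

A language that enforced well-formedness in its string type would make a check like this unnecessary at every boundary where bytes become text.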

Languages that work with Unicode code points directly alleviate a lot of issues related to encoding, but still rarely enforce that the code points are valid.

Even worse, not all valid code points can be encoded or decoded: surrogates are valid code points but are prohibited in streams of UTF-8, UTF-16 or UTF-32. They are code points, but they have no business being in strings.
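The surrogate problem can be illustrated with Python, which stores strings as code points and will build one containing a lone surrogate without complaint. This is only a sketch: `is_scalar_value` and `can_encode` are hypothetical helper names, and the behaviour shown assumes a reasonably recent Python 3, where the UTF-8, UTF-16 and UTF-32 encoders all refuse surrogates.

```python
def is_scalar_value(code_point: int) -> bool:
    """True for code points that are in range and are not surrogates."""
    in_range = 0 <= code_point <= 0x10FFFF
    is_surrogate = 0xD800 <= code_point <= 0xDFFF
    return in_range and not is_surrogate


def can_encode(text: str) -> bool:
    """True if text can be encoded as UTF-8, UTF-16 and UTF-32."""
    for codec in ("utf-8", "utf-16", "utf-32"):
        try:
            text.encode(codec)
        except UnicodeEncodeError:
            return False
    return True


if __name__ == "__main__":
    lone = chr(0xD800)  # accepted by the string type despite being a lone surrogate
    print(is_scalar_value(0xD800), can_encode(lone))
    print(is_scalar_value(0x1F600), can_encode(chr(0x1F600)))
```

The string type accepts the value, but the error only surfaces later, at encoding time, which is usually far from where the bad data entered the program.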

Stability

Programmers tend to think of strings, and the things you can do with strings, as 'stable':

- between program runs

- between computers

- etc

- explain here

- normalization

- unstable results and a changing world

- indexing

- locales

- living without clear definitions, living in denial

- no testing

Breaking

- iterations

- indexing

- grapheme

- code point

- text boundaries

- characters
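To make the indexing and iteration concerns concrete, here is a small Python sketch that counts the same text in three different units. The helper name `unit_counts` is hypothetical, and the sketch deliberately does not attempt grapheme cluster segmentation, so none of the three numbers is a count of user-perceived characters.

```python
def unit_counts(text: str) -> dict:
    """Count one piece of text in code points, UTF-8 bytes and UTF-16 code units."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        # Each UTF-16 code unit is two bytes; "-le" avoids counting a BOM.
        "utf16_units": len(text.encode("utf-16-le")) // 2,
    }


if __name__ == "__main__":
    samples = [
        "\u00e9",                # precomposed e with acute
        "e\u0301",               # e followed by combining acute accent
        "\U0001F44D\U0001F3FD",  # thumbs up followed by a skin tone modifier
    ]
    for sample in samples:
        print(ascii(sample), unit_counts(sample))
```

Each sample would be seen by a reader as a single character, yet every unit gives a different answer, and an index in one unit can land in the middle of what another unit treats as indivisible.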

Typical operations

Programming languages usually provide the following text operations:

Matching to see if two strings are the same
Case conversion to turn strings uppercase or lowercase
Collation to sort strings into some order
Querying to see if a code point is uppercase, lowercase, alphanumeric, or has some other property
Encoding functions to serialize and deserialize a string
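A few of the operations listed above can be sketched with Python's standard `unicodedata` module. This is an illustration under stated assumptions rather than a recommended implementation: the helper names `canonical_caseless_key` and `caseless_equal` are hypothetical, the matching follows the "normalize, case fold, normalize again" shape of canonical caseless matching, and no locale tailoring is applied.

```python
import unicodedata


def canonical_caseless_key(text: str) -> str:
    """Build a comparison key as NFD(casefold(NFD(text)))."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", decomposed.casefold())


def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings ignoring case and canonical-equivalent spellings."""
    return canonical_caseless_key(a) == canonical_caseless_key(b)


if __name__ == "__main__":
    # Matching: the same letter spelled two ways is unequal code point by code point.
    print("\u00e9" == "e\u0301")
    print(caseless_equal("\u00c9", "e\u0301"))
    print(caseless_equal("Stra\u00dfe", "STRASSE"))

    # Case conversion can change the number of code points.
    print(len("\u00df"), len("\u00df".upper()))

    # Querying: general category of a code point.
    print(unicodedata.category("A"), unicodedata.category("\u01c5"))
```

Case conversion and matching operate on whole strings here, not on individual code points, because a single code point can map to several.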

- matching

- case conversion/folding

- collation

- classification

- querying properties

- boundaries: code point, cluster, boundaries

- encoding/decoding

- regex

- char vs string

- locales

Non-Unicode compatibility

- fs

- getenv

- paths

- etc

- reversibility
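One existing approach to the reversibility problem is Python's `surrogateescape` error handler, which Python also uses when decoding file names and environment variables on most POSIX systems. The sketch below assumes a file name that is not valid UTF-8; the helper names `to_text` and `to_bytes` and the sample name are hypothetical.

```python
def to_text(raw: bytes) -> str:
    """Decode as UTF-8, mapping each undecodable byte to a lone surrogate."""
    return raw.decode("utf-8", errors="surrogateescape")


def to_bytes(text: str) -> bytes:
    """Reverse to_text, turning escaped surrogates back into the original bytes."""
    return text.encode("utf-8", errors="surrogateescape")


if __name__ == "__main__":
    raw_name = b"caf\xe9.txt"  # Latin-1 encoded name, not valid UTF-8
    text = to_text(raw_name)

    # The round trip is lossless.
    print(to_bytes(text) == raw_name)

    # The price: the string holds a lone surrogate and is not strictly encodable.
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        print("not encodable as strict UTF-8")
```

The design trades well-formedness for reversibility: the original bytes survive, but the resulting string is exactly the kind of ill-formed value described in the Encodings section.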

General recommendations

- well-formed graphemes of Unicode scalar values

- more info = better results

- rich text

- locale APIs
