Unicode guide

From JookWiki
Revision as of 16:20, 28 March 2022 by Jookia (talk | contribs) (Plan section for normalization)

This is a WIP page, take nothing here as final.

Introduction

Over the past decade it's become increasingly common to see programming languages add Unicode support: specifically, support for Unicode strings. This is a good step, but it's rarely complete and often done in a buggy way. In this page I hope to show what's wrong with this approach and provide some solutions.

While writing this page I researched and documented Unicode support in various programming languages. You can see my notes here: Unicode strings/Implementations.

Just to make it clear: Unicode is only a part of a complete localization framework. Languages do a bunch of other things wrong, but broken Unicode string handling is the topic I'm covering in this page.

Unicode refresher

If you don't understand what Unicode is, I highly recommend reading the following resources in this order:

  1. The Unicode Standard, Version 14.0 chapters 1, 2, 3, 4, 5 and 23
  2. Unicode Technical Reports
  3. Unicode Frequently Asked Questions

You might also find the following tools helpful:

But as a general overview, Unicode defines the following:

  • A large multilingual set of abstract characters
  • A database of properties for each character (this includes case mapping)
  • How to encode characters for storage
  • How to normalize text for comparison
  • How to segment text into characters, words and sentences
  • How to break text into lines
  • How to order text for sorting
  • How to incorporate Unicode into regular expressions

Some of these can be tailored by locale-dependent rules. The Unicode Common Locale Data Repository provides locale-specific information that aids in this tailoring.
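As a taste of the character database and case mapping mentioned above, here's a short Python sketch using the standard unicodedata module, which ships a copy of the Unicode Character Database:

```python
import unicodedata

# Look up properties from the Unicode Character Database.
print(unicodedata.name("\u00e9"))      # LATIN SMALL LETTER E WITH ACUTE
print(unicodedata.category("\u00e9"))  # Ll: Letter, lowercase

# Case mapping is driven by the same database.
print("\u00e9".upper())                # É (U+00C9)
```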

String encodings

Most programming languages define Unicode strings as one of:

  • UTF-8 encoded bytes
  • UTF-16 encoded 16-bit integers
  • Unicode code points encoded as 32-bit integers
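To see the difference between these three views, here's a Python sketch (Python's str stores code points, and encode produces the byte forms) where the same two-character string has a different length under each one:

```python
s = "a\U0001D11E"  # 'a' plus U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP

print(len(s))                           # 2 code points
print(len(s.encode("utf-16-le")) // 2)  # 3 UTF-16 code units (surrogate pair)
print(len(s.encode("utf-8")))           # 5 UTF-8 bytes
```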

However, languages rarely enforce that these strings are well-formed:

  • UTF-8 encoded strings might just be a bag of bytes
  • UTF-16 encoded strings might contain lone surrogates
  • Unicode code points might contain arbitrary 32-bit numbers

In practice developers need to validate their strings themselves. This is a lot of work to put on developers, with no clear benefit.
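To illustrate in Python: the is_well_formed helper below is hypothetical, the kind of check each developer ends up writing for themselves, since str happily holds lone surrogates:

```python
def is_well_formed(s: str) -> bool:
    # Hypothetical helper: reject surrogate code points, which are
    # never valid in interchanged Unicode text.
    return not any(0xD800 <= ord(c) <= 0xDFFF for c in s)

print(is_well_formed("abc"))        # True
print(is_well_formed("ab\ud800c"))  # False: a lone surrogate smuggled in
```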

Languages that work with Unicode code points directly alleviate a lot of issues related to encoding, but still rarely enforce that the code points are valid.

Even worse, not all valid code points can be encoded or decoded: surrogates are valid code points, but they are prohibited in streams of UTF-8, UTF-16 or UTF-32. Sure they're code points, but they don't have any business being in strings.
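A quick Python demonstration: CPython refuses to encode a lone surrogate in any of the three encoding forms:

```python
lone = "\ud800"  # a valid code point, but a surrogate

for encoding in ("utf-8", "utf-16", "utf-32"):
    try:
        lone.encode(encoding)
    except UnicodeEncodeError:
        print(f"{encoding}: cannot encode U+D800")
```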

String normalization

- explain here
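A minimal Python sketch of why normalization matters for comparison: canonically equivalent text can be stored composed or decomposed, and comparing code points directly misses that:

```python
import unicodedata

composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # 'e' plus U+0301 COMBINING ACUTE ACCENT

print(composed == decomposed)  # False: code points differ
print(unicodedata.normalize("NFC", composed) ==
      unicodedata.normalize("NFC", decomposed))  # True after normalizing
```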

String operations

- matching

- case conversion/folding

- collation

- classification

- querying properties

- boundaries: code point, cluster, boundaries

- encoding/decoding - regex

- char vs string

- locales

Stability

- unstable results and a changing world

- living without clear definitions, living in denial

- no testing

Non-Unicode compatibility

- fs

- getenv

- paths

- etc

- reversability

General recommendations

- well-formed graphemes of unicode scalars

- more info = better results

- rich text

- locale apis