Unicode guide

This is a WIP page; take nothing here as final.

Introduction
Over the past decade it has become increasingly common for programming languages to add Unicode support, specifically support for Unicode strings. This is a good step, but the support is rarely complete and is often buggy. On this page I hope to show what goes wrong with this approach and to offer some solutions.

To be clear: Unicode is only one part of a complete localization framework. Languages get plenty of other things wrong, but broken Unicode string handling is the topic of this page.

Unicode refresher
If you don't understand what Unicode is, I highly recommend reading the following resources in this order:


 * 1) The Unicode Standard, Version 14.0 chapters 1, 2, 3, 4, 5 and 23
 * 2) Unicode Technical Reports
 * 3) Unicode Frequently Asked Questions

You might also find the following tools helpful:


 * Unicode Utilities
 * Unicode Code Charts

But as a general overview, Unicode defines the following:


 * A large multilingual set of abstract characters
 * A database of properties for each character (this includes case mapping)
 * How to encode characters for storage
 * How to normalize text for comparison
 * How to segment text into characters, words and sentences
 * How to break text into lines
 * How to order text for sorting
 * How to incorporate Unicode into regular expressions
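
A few of the items above can be seen in practice with a short sketch. This one uses Python's standard `unicodedata` module to look at character properties, encoding, normalization and case mapping. The variable names are only illustrative, and the properties reported depend on the Unicode version bundled with the Python build.

```python
import unicodedata

# Two spellings of the same text: a precomposed "é" (one code point)
# and "e" followed by a combining acute accent (two code points).
precomposed = "\u00e9"
decomposed = "e\u0301"

# Properties: every code point has a name and a general category.
name = unicodedata.name(precomposed)
mark_category = unicodedata.category("\u0301")  # "Mn" means nonspacing mark

# Encoding: the same character occupies a different number of bytes
# in each encoding form.
utf8_len = len(precomposed.encode("utf-8"))
utf16_len = len(precomposed.encode("utf-16-le"))
utf32_len = len(precomposed.encode("utf-32-le"))

# Normalization: the two spellings differ code point by code point,
# but compare equal once both are brought to the same normalization form.
raw_equal = precomposed == decomposed
nfc_equal = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", decomposed))

# Case mapping: a mapping can change the length of a string.
upper_sharp_s = "\u00df".upper()  # German sharp s
```

The normalization comparison is the important part: a plain `==` on code points reports these two strings as different even though a reader would see identical text.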

Some of these can be tailored by locale-dependent rules. The Unicode Common Locale Data Repository (CLDR) provides locale-specific information that aids in this tailoring.
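
Tailoring matters even for something as basic as case mapping. The sketch below compares Python's built-in `str.lower()`, which applies the default, locale-independent mapping, with a simplified Turkish tailoring. `lower_tr` is a hypothetical helper written for this example, not a library function, and it only handles the two Turkish-specific capitals (it ignores, for instance, "I" followed by a combining dot above). A real implementation would take its rules from locale data rather than hard-code them.

```python
def lower_tr(text: str) -> str:
    """Hypothetical helper: lowercase text using a simplified Turkish tailoring."""
    # Turkish pairs dotted "İ" (U+0130) with "i" and dotless "I" with "ı" (U+0131).
    # Handle those two capitals first, then use the default mapping for the rest.
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


word = "DI\u015e"  # "DIŞ", Turkish for "outside"

default_lower = word.lower()     # default mapping turns "I" into dotted "i"
tailored_lower = lower_tr(word)  # tailored mapping turns "I" into dotless "ı"

# The default mapping of "İ" yields "i" plus a combining dot above (two code points).
default_dotted = "\u0130".lower()
```

With the default mapping the word becomes "diş" ("tooth"), which is a different word from the intended "dış".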

Classifying implementations
Unicode strings/Implementations

TODO: classify

Non-destructive text processing
- clear, unicode definitions

- rich text

- multiple versions

- metadata

- non-reversible