Unicode guide

This is a WIP page, take nothing here as final.

Introduction

Over the past decade it's been increasingly common to see programming languages add Unicode support: specifically, support for Unicode strings. This is a good step, but it's not nearly complete and it's often done in a buggy way. Hopefully in this page I can show what's wrong with this approach and provide some solutions.

Just to make it clear: Unicode is only a part of a complete localization framework. Languages do a bunch of other things wrong, but broken Unicode string handling is the topic I'm covering in this page.

Unicode refresher

If you don't understand what Unicode is, I highly recommend reading the following resources in this order:

  1. The Unicode Standard, Version 14.0, chapters 1, 2, 3, 4, 5 and 23
  2. Unicode Technical Reports
  3. Unicode Frequently Asked Questions

But as a general overview, Unicode defines the following:

  • A large multilingual set of abstract characters
  • A database of properties for each character (this includes case mapping)
  • How to encode characters for storage
  • How to normalize text for comparison
  • How to segment text into characters, words and sentences
  • How to break text into lines
  • How to order text for sorting
  • How to incorporate Unicode into regular expressions

Some of these can be tailored by locale-dependent rules. The Unicode Common Locale Data Repository (https://cldr.unicode.org/) provides locale-specific information that aids in this tailoring.
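To make a few of these pieces concrete, here is a minimal sketch using Python 3's standard unicodedata module; the sample strings and encodings are arbitrary choices for illustration, not anything the rest of this page depends on:

    import unicodedata

    # Encoding characters for storage: one string, several byte encodings.
    s = "café"
    print(s.encode("utf-8"))      # b'caf\xc3\xa9'
    print(s.encode("utf-16-le"))  # b'c\x00a\x00f\x00\xe9\x00'

    # Normalization for comparison: "é" as one code point and "e" plus a
    # combining accent look identical but compare unequal until normalized.
    composed = "caf\u00e9"
    decomposed = "cafe\u0301"
    print(composed == decomposed)                    # False
    print(unicodedata.normalize("NFC", composed) ==
          unicodedata.normalize("NFC", decomposed))  # True

    # The character database drives properties, including case mapping.
    print(unicodedata.name("é"))  # LATIN SMALL LETTER E WITH ACUTE
    print("é".upper())            # É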


ASCII-compatible strings
Traditionally, programming languages have stuck with what I call ASCII-compatible strings.


Here's my definition of this string API (a rough sketch in code follows the list):
 
  • Characters are represented by an 8-bit byte
  • Values under 128 are ASCII values
  • Other values are defined by the current locale setting
  • Strings are arrays of these characters
  • Matching is done by comparing bytes; no normalization is needed
  • Classification, sorting and case conversion depend on the current locale setting
  • Length is the number of bytes
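
Below is a rough sketch of this model in practice. Python 3's bytes type is used purely as a stand-in for this kind of 8-bit string; the locale-dependent parts of the model (classification, sorting, case conversion) are not shown:

    # Length, indexing and matching all operate on raw bytes.
    s = b"hello"
    print(len(s))         # 5 -- the number of bytes
    print(s[0])           # 104 -- each "character" is just an 8-bit value
    print(s == b"hello")  # True -- matching is plain byte comparison

    # Bytes above 127 are opaque: these are the UTF-8 bytes for "café",
    # but nothing in the string type knows the last two form one character.
    t = b"caf\xc3\xa9"
    print(len(t))         # 5, not 4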
 
For example, this model is present in:
 
  • C and C++
  • Python 2
  • Lua
  • PHP (ignoring mbstring)
  • POSIX APIs
  • Windows narrow APIs
  • DOS APIs

This model has some appealing features:
 
  • No encoding or decoding takes place
  • English text is handled with no issues
  • You can stuff opaque Unicode bytes into strings using UTF-8
 
However, it also has the huge downside that you can't do anything meaningful with non-English text.
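To make that concrete, here is a small sketch, again using Python 3's bytes type as a stand-in for this kind of string; the German word "straße" is only an example:

    # "straße" stored as UTF-8 bytes in an ASCII-compatible string.
    s = "straße".encode("utf-8")

    print(len(s))      # 7 bytes, even though the word has 6 characters
    print(s.upper())   # b'STRA\xc3\x9fE' -- only the ASCII letters change
    print(s[4:5])      # b'\xc3' -- indexing lands inside a multi-byte character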
 
Unicode strings
There's a lot to unpack here:

- encodings

- Unicode scalar values
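
To ground those notes a little, here is a minimal sketch of the difference between Unicode scalar values and the bytes produced by different encodings, using only Python 3's standard library; the sample string is arbitrary:

    # A Python 3 str is a sequence of Unicode code points (scalar values).
    s = "a€"
    print([hex(ord(c)) for c in s])    # ['0x61', '0x20ac']
    print(len(s))                      # 2 -- two scalar values

    # The same two scalar values take different numbers of bytes to store,
    # depending on which encoding is chosen.
    print(len(s.encode("utf-8")))      # 4
    print(len(s.encode("utf-16-le")))  # 4
    print(len(s.encode("utf-32-le")))  # 8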




Classifying implementations

See Unicode strings/Implementations.

TODO: classify

Non-destructive text processing

- clear, Unicode definitions

- rich text

- multiple versions

- metadata

- non-reversible
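
These notes are still rough, but compatibility normalization and case folding are two everyday examples of processing that cannot be reversed; a minimal sketch using Python 3's standard library follows (the sample characters are arbitrary):

    import unicodedata

    # NFKC maps visually or semantically related characters to a common form,
    # so the original text cannot be recovered afterwards.
    print(unicodedata.normalize("NFKC", "①"))    # 1
    print(unicodedata.normalize("NFKC", "ﬁle"))  # file -- the "ﬁ" ligature is gone

    # Case folding is also one-way: distinct inputs fold to the same output.
    print("MASSE".casefold() == "maße".casefold())  # True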