Unicode guide: Difference between revisions
Revision as of 21:55, 21 March 2022
This is a WIP page, take nothing here as final.
Introduction
Over the past decade it has become increasingly common for programming languages to add Unicode support, specifically support for Unicode strings. This is a good step, but the support is rarely complete and is often buggy. In this page I hope to show what's wrong with this approach and provide some solutions.
Just to make it clear: Unicode is only a part of a complete localization framework. Languages do a bunch of other things wrong, but broken Unicode string handling is the topic I'm covering in this page.
Unicode refresher
If you don't understand what Unicode is, I highly recommend reading the following resources in this order:
- The Unicode Standard, Version 14.0 chapters 1, 2, 3, 4, 5 and 23
- Unicode Technical Reports
- Unicode Frequently Asked Questions
You might also find the following tools helpful:
But as a general overview, Unicode defines the following:
- A large multilingual set of abstract characters
- A database of properties for each character (this includes case mapping)
- How to encode characters for storage
- How to normalize text for comparison
- How to segment text into characters, words and sentences
- How to break text into lines
- How to order text for sorting
- How to incorporate Unicode into regular expressions
Some of these can be tailored by locale-dependent rules. The Unicode Common Locale Data Repository provides locale-specific information that aids in this tailoring.
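To make a few of the items above concrete, here is a minimal sketch, assuming Python 3 and its standard `unicodedata` module. It only touches character properties, case mapping, encoding and normalization; segmentation, line breaking and collation are left out because, as far as I know, they need a third-party library in Python. The variable names are purely illustrative.

```python
import unicodedata

# Character properties come from the Unicode Character Database.
e_acute = "\u00e9"  # é as a single precomposed code point
name = unicodedata.name(e_acute)          # 'LATIN SMALL LETTER E WITH ACUTE'
category = unicodedata.category(e_acute)  # 'Ll' (Letter, lowercase)

# Case mapping is a property too, and it can change the string length.
upper = "straße".upper()                  # 'STRASSE'

# Encoding: the same code point has different byte forms.
utf8_bytes = e_acute.encode("utf-8")       # b'\xc3\xa9'
utf16_bytes = e_acute.encode("utf-16-le")  # b'\xe9\x00'

# Normalization: two spellings of the same abstract text.
decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
raw_equal = (e_acute == decomposed)                                # False
nfc_equal = (unicodedata.normalize("NFC", decomposed) == e_acute)  # True

# len() counts code points, not what a reader would call characters.
code_points = len(decomposed)  # 2
```

The raw comparison fails even though both strings display as "é". This is the kind of surprise that normalization exists to remove, and a language that exposes only code point comparison leaves it to the programmer to notice.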
Classifying implementations
In an effort to better educate myself, I researched and documented Unicode support in various programming languages. You can see my notes here: Unicode strings/Implementations. After doing all this I can clearly see why people dislike Unicode: most languages provide poor or confusing Unicode support.
TODO: classify
General thoughts
- strings were never predictable outside software versions and locale
- living without clear definitions, living in denial
- no testing
- massaging broken code
- clear, Unicode definitions
- rich text
- multiple versions
- metadata
- non-reversible