Unicode guide
Revision as of 10:51, 3 September 2022
This is a WIP page, take nothing here as final.
Introduction
Over the past decade it has become increasingly common for programming languages to add Unicode support, specifically support for Unicode strings. This is a good step, but the support is rarely complete and is often buggy. On this page I hope to show what's wrong with this approach and offer some solutions.
While writing this page I researched and documented Unicode support in various programming languages. You can see my notes here: Unicode strings/Implementations.
Just to make it clear: Unicode is only one part of a complete localization framework. Languages do a bunch of other things wrong, but broken Unicode string handling is the topic I'm covering on this page.
Unicode refresher
If you don't understand what Unicode is, I highly recommend reading the following resources in this order:
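- The Unicode Standard, Version 14.0 (https://www.unicode.org/versions/Unicode14.0.0/UnicodeStandard-14.0.pdf), chapters 1, 2, 3, 4, 5 and 23
- Unicode Technical Reports (https://www.unicode.org/reports/index.html)
- Unicode Frequently Asked Questions (https://www.unicode.org/faq/)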
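You might also find the following tools helpful:
- Unicode Utilities (https://util.unicode.org/UnicodeJsps/)
- Unicode Code Charts (https://www.unicode.org/charts/)
- Unicode Character Database (https://unicode.org/ucd/)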
But as a general overview, Unicode defines the following:
- A large multilingual set of encoded characters (known just as 'characters')
- Properties for each character
- How to encode characters for storage
- How to normalize characters into a canonical format
- How to segment text into words, sentences, lines, and paragraphs
- How to map text between different cases
- How to order text for sorting
- How to match text for searching
- How to incorporate Unicode into regular expressions
Many of these can be further tailored by locale-dependent rules and custom algorithms. The Unicode Common Locale Data Repository (https://cldr.unicode.org/) provides locale-specific information that aids in this tailoring.
Introduction
- indexing
- sort
- match
- search
- normalize
- serialize
- case map
- properties
- breaking/segmentation
- reversing
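To make a couple of these operations concrete, here is a minimal Python sketch of why indexing and reversing are harder than they look. The sample string is my own choice for illustration; it is not taken from any particular language's documentation.

```python
# "e" followed by U+0301 COMBINING ACUTE ACCENT, then "x".
# A reader sees two characters, but the string holds three code points.
text = "e\u0301x"

# Indexing and length operate on code points in Python.
length = len(text)   # 3, not 2
first = text[0]      # "e" without its accent

# Reversing code point by code point moves the accent onto the wrong letter:
# the combining mark now follows "x" instead of "e".
naive_reversed = text[::-1]
print(length, [hex(ord(ch)) for ch in naive_reversed])
```

The reversed string is still valid Unicode, which is what makes this kind of bug easy to miss.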
Level 1: Bytes
Your basic unit is the byte. You can compare, search, split and sort.
Examples: file systems, Unix, C
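As a rough illustration of what working at the byte level means, the Python sketch below compares, sorts and searches raw bytes. The sample strings and the choice of UTF-8 are assumptions made for the example.

```python
# Two ways to write "é": one precomposed code point, or "e" plus a combining accent.
composed = "\u00e9".encode("utf-8")      # b'\xc3\xa9'
decomposed = "e\u0301".encode("utf-8")   # b'e\xcc\x81'

# Byte comparison sees two different strings, even though they display identically.
bytes_equal = composed == decomposed

# Byte sorting is purely numeric: 0x5A ("Z") sorts before 0x61 ("a").
byte_sorted = sorted([b"a", b"Z"])

# Byte searching can match in the middle of an encoded character.
matches_mid_character = b"\xa9" in composed

print(bytes_equal, byte_sorted, matches_mid_character)
```

Everything here works without knowing the encoding at all, which is both the strength and the weakness of this level.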
Level 2: Code units
Your basic unit is the smallest unit of your Unicode encoding: a byte for UTF-8, a 16-bit integer for UTF-16, a 32-bit integer for UTF-32. You can compare, search, split and sort. To get to this level you have to handle endianness (for UTF-16 and UTF-32).
Examples: Windows
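The sketch below shows how the same text turns into a different number of code units in each encoding, and how endianness changes the bytes on disk. The helper name code_unit_count is hypothetical, made up for this example.

```python
# "a", "é" (U+00E9), and U+1F600, which lies outside the Basic Multilingual Plane.
text = "a\u00e9\U0001F600"

def code_unit_count(s: str, encoding: str, unit_size: int) -> int:
    """Count code units by encoding without a BOM and dividing by the unit size in bytes."""
    return len(s.encode(encoding)) // unit_size

utf8_units = code_unit_count(text, "utf-8", 1)        # 1 + 2 + 4 = 7
utf16_units = code_unit_count(text, "utf-16-le", 2)   # 1 + 1 + 2 (surrogate pair) = 4
utf32_units = code_unit_count(text, "utf-32-le", 4)   # 1 + 1 + 1 = 3

# The same 16-bit code unit serializes differently depending on byte order.
little_endian = "a".encode("utf-16-le")   # b'a\x00'
big_endian = "a".encode("utf-16-be")      # b'\x00a'

print(utf8_units, utf16_units, utf32_units, little_endian, big_endian)
```

Note that at this level a single code unit may be only a fragment of a character, such as half of a surrogate pair.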
Level 3: Unicode scalars
Your basic unit is a number between 0x0 and 0x10FFFF inclusive, excluding the surrogate range 0xD800 to 0xDFFF. To get to this level you have to decode UTF-8, UTF-16 or UTF-32. You can compare, search, split and so on, but it's important to note that these are just numbers: there's no meaning attached to them.
Examples: Python
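Here is a small sketch of the scalar value range and of decoding bytes into a list of numbers. The function names is_scalar_value and decode_to_scalars are hypothetical helpers written for this example.

```python
from typing import List

def is_scalar_value(n: int) -> bool:
    """True for 0x0..0x10FFFF, excluding the surrogate range 0xD800..0xDFFF."""
    return 0 <= n <= 0x10FFFF and not (0xD800 <= n <= 0xDFFF)

def decode_to_scalars(data: bytes, encoding: str = "utf-8") -> List[int]:
    """Decode bytes and return the resulting code points as plain integers."""
    return [ord(ch) for ch in data.decode(encoding)]

scalars = decode_to_scalars("a\u00e9".encode("utf-8"))   # [0x61, 0xE9]

# At this level the values are only numbers: nothing here knows that
# 0xE9 is a lowercase letter or how it relates to "e" plus U+0301.
print(scalars, all(is_scalar_value(n) for n in scalars))
```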
Level 4: Unicode characters
Your basic unit is a code point that your runtime recognizes and is willing to interpret using a copy of the Unicode Character Database. Results vary according to the supported Unicode version. You can normalize, compare, match, search, split and case map strings. Locale-specific operations may be provided. To get to this level the runtime needs to check whether the characters are supported.
???
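As a sketch of what interpreting code points with a copy of the Unicode database looks like, the example below uses Python's standard unicodedata module. I'm using it only for illustration, not claiming that Python as a whole belongs at this level. The is_assigned helper is hypothetical.

```python
import unicodedata

composed = "\u00e9"        # precomposed "é"
decomposed = "e\u0301"     # "e" + U+0301 COMBINING ACUTE ACCENT

# Normalization makes the two spellings compare equal.
equal_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

# Properties come from the bundled database.
name = unicodedata.name(composed)
category = unicodedata.category(composed)

# Case folding for caseless matching (locale-independent).
folded = "Stra\u00dfe".casefold()

def is_assigned(ch: str) -> bool:
    """Rough check: category 'Cn' means unassigned (or a noncharacter) in this database version."""
    return unicodedata.category(ch) != "Cn"

# The answers depend on which Unicode version the runtime ships.
version = unicodedata.unidata_version
print(equal_after_nfc, name, category, folded, is_assigned("a"), version)
```

U+FFFF is a noncharacter and so is reported as 'Cn' in every version, which makes it a stable input for trying out is_assigned.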
Level 5: Segmented text
Your basic unit is a string of some number of Unicode characters, such as a grapheme cluster, word or paragraph. To get to this level you need to segment a string of Unicode characters using breaking/segmentation rules.
Examples: Swift, Raku
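To give a feel for segmentation, here is a deliberately simplified Python sketch that groups each base character with the combining marks that follow it. This is an assumption-laden approximation and not the grapheme cluster algorithm from UAX #29: it ignores Hangul jamo, emoji ZWJ sequences, regional indicators and more. The name rough_clusters is hypothetical.

```python
import unicodedata
from typing import List

def rough_clusters(s: str) -> List[str]:
    """Approximate segmentation: attach combining marks to the preceding character.

    NOT a UAX #29 implementation; it only looks at the canonical combining class.
    """
    clusters: List[str] = []
    for ch in s:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch      # nonzero combining class: extend the previous cluster
        else:
            clusters.append(ch)     # otherwise start a new cluster
    return clusters

text = "e\u0301x"                       # "é" written as two code points, then "x"
clusters = rough_clusters(text)         # ["e\u0301", "x"]

# Operating on segments keeps the accent attached to its base letter.
reversed_text = "".join(reversed(clusters))
print(len(clusters), [hex(ord(ch)) for ch in reversed_text])
```

A real implementation would use the segmentation rules and data files published with the Unicode Standard rather than this shortcut.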
TODO:
languages/locales
Non-Unicode compatibility
- preserving data