Unicode guide: Difference between revisions
Revision as of 00:56, 19 March 2022
This is a WIP page; take nothing here as final.
Introduction
Over the past decade it has become increasingly common for programming languages to add Unicode support, specifically support for Unicode strings. This is a good step, but the support is rarely complete and is often buggy. On this page I hope to show what is wrong with this approach and offer some solutions.
Just to make it clear: Unicode is only one part of a complete localization framework. Languages get a number of other things wrong, but broken Unicode string handling is the topic of this page.
Unicode refresher
If you don't understand what Unicode is, I highly recommend reading the following resources in this order:
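1. The Unicode Standard, Version 14.0 (https://www.unicode.org/versions/Unicode14.0.0/UnicodeStandard-14.0.pdf), chapters 1, 2, 3, 4, 5 and 23
2. Unicode Technical Reports (https://www.unicode.org/reports/index.html)
3. Unicode Frequently Asked Questions (https://www.unicode.org/faq/)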
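You might also find the following tools helpful:
- Unicode Utilities (https://util.unicode.org/UnicodeJsps/)
- Unicode Code Charts (https://www.unicode.org/charts/)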
As a general overview, Unicode defines the following:
- A large multilingual set of abstract characters
- A database of properties for each character (this includes case mapping)
- How to encode characters for storage
- How to normalize text for comparison
- How to segment text into characters, words and sentences
- How to break text into lines
- How to order text for sorting
Some of these can be tailored by locale-dependent rules. The Unicode Common Locale Data Repository (https://cldr.unicode.org/) provides locale-specific information that aids in this tailoring.
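To make a few of the items in the overview concrete, here is a minimal sketch using Python's standard `unicodedata` module. The choice of Python and of the sample character is mine, not something this page prescribes. It only touches character properties, case mapping, encoding and normalization; segmentation, line breaking and collation are not covered by this module and are left out.

```python
import unicodedata

# One abstract character, written as an escape to avoid source-encoding issues.
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# Properties from the character database, plus case mapping.
name = unicodedata.name(ch)
category = unicodedata.category(ch)  # "Ll" = Letter, lowercase
upper = ch.upper()

# Encoding for storage: the same character as bytes in two encoding forms.
utf8_bytes = ch.encode("utf-8")
utf16_bytes = ch.encode("utf-16-le")

# Normalization for comparison: a precomposed character and its
# decomposed spelling are different code point sequences until normalized.
composed = "\u00e9"
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT
equal_raw = composed == decomposed
equal_nfc = unicodedata.normalize("NFC", decomposed) == composed
equal_nfd = unicodedata.normalize("NFD", composed) == decomposed

print(name, category, upper, utf8_bytes, utf16_bytes)
print(equal_raw, equal_nfc, equal_nfd)
```

The raw comparison is false while both normalized comparisons are true, which is why normalization is listed as a prerequisite for comparison.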
ASCII strings
- encoding-neutral, but really it's ASCII
- character set
- strings
- ops
- OS APIs provide strings
- simple, English-based
- works with ASCII-compatible encodings
- you don't have to learn anything complicated
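As one possible reading of the "works with ASCII-compatible encodings" note above, the sketch below splits a byte string on an ASCII comma without decoding it. The helper name `split_fields` and the sample text are hypothetical, chosen only for illustration. The split is safe for UTF-8, where bytes below 0x80 only ever represent ASCII characters, and goes wrong for UTF-16, which is not ASCII-compatible.

```python
def split_fields(raw: bytes) -> list:
    """Split on the ASCII comma byte without knowing the encoding."""
    return raw.split(b",")

# "id", two CJK characters, and "cafe" with an accented final letter.
text = "id,\u6771\u4eac,caf\u00e9"

# UTF-8: every field is still valid UTF-8 after the byte-level split.
utf8_fields = split_fields(text.encode("utf-8"))
decoded = [field.decode("utf-8") for field in utf8_fields]

# UTF-16-LE: the comma is the two bytes 2C 00, so splitting on the single
# byte 2C leaves a stray 00 at the start of the following fields.
utf16_fields = split_fields(text.encode("utf-16-le"))

print(decoded)
print(utf16_fields)
```

The second UTF-16 field ends up with an odd number of bytes, so it can no longer be decoded as UTF-16 at all.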
Unicode strings
- UTF-8
- OS APIs
- string APIs make less sense
- locale tagging
- utf8b
- bytestrings
- poorly defined semantics
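To illustrate what "string APIs make less sense" might look like in practice, here is a small sketch in Python, whose `str` type is a sequence of code points. The examples are my own and not taken from this page. They show that length, reversal, indexing and case conversion, all well-defined for ASCII, give surprising results once a user-perceived character can span several code points.

```python
import unicodedata

# One user-perceived character written as two code points.
s = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT
length_decomposed = len(s)                        # counts code points
length_nfc = len(unicodedata.normalize("NFC", s))

# Reversing code points moves the combining mark in front of the letter,
# so it no longer attaches to the "e".
reversed_s = s[::-1]

# Case mapping can change the length of a string.
sharp_s = "\u00df"  # LATIN SMALL LETTER SHARP S
upper_sharp_s = sharp_s.upper()

# A flag emoji is a pair of regional indicator symbols; indexing
# returns half of it.
flag = "\U0001F1EF\U0001F1F5"
first_half = flag[0]
flag_utf8_length = len(flag.encode("utf-8"))

print(length_decomposed, length_nfc, upper_sharp_s, flag_utf8_length)
```

None of these results is a bug in Python as such: the API answers a question about code points, while the caller usually means user-perceived characters.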
ICU strings
ICU/Java?
Non-destructive text processing
- clear, Unicode definitions
- rich text
- multiple versions
- metadata
- non-reversible