Unicode guide
Revision as of 00:29, 19 March 2022
This is a WIP page; take nothing here as final.
Introduction
Over the past decade it has become increasingly common for programming languages to add Unicode support, specifically support for Unicode strings. This is a good step, but the support is rarely complete and is often buggy. On this page I hope to show what's wrong with this approach and suggest some solutions.
Just to make it clear: Unicode is only one part of a complete localization framework. Languages get a bunch of other things wrong too, but broken Unicode string handling is the topic I'm covering on this page.
Unicode refresher
https://www.unicode.org/versions/Unicode14.0.0/
https://www.unicode.org/reports/index.html
https://util.unicode.org/UnicodeJsps/
https://www.unicode.org/charts/
- character set, encodings, CLDR, grapheme, normalization, collation, locale
- Unicode strings?
- what should you be able to do with strings?
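A rough sketch of three of the terms above (character set, encodings, normalization), using only Python's standard unicodedata module. The sample text is my own choice for illustration and does not come from the Unicode documents linked above.

```python
import unicodedata

# "é" written two ways: one precomposed code point, or "e" followed by
# a combining acute accent. Both render the same.
composed = "\u00e9"
decomposed = "e\u0301"

# Character set: the two spellings are different code point sequences.
code_points = [f"U+{ord(c):04X}" for c in decomposed]

# Encodings: the same code point becomes different bytes per encoding.
utf8_bytes = composed.encode("utf-8")
utf16_bytes = composed.encode("utf-16-le")

# Normalization: equivalent sequences are mapped to one agreed form.
nfc = unicodedata.normalize("NFC", decomposed)   # composes
nfd = unicodedata.normalize("NFD", composed)     # decomposes

print(code_points, utf8_bytes, utf16_bytes)
print(composed == decomposed, nfc == composed, nfd == decomposed)
```

The plain `==` comparison is the part that tends to surprise people: the two spellings compare unequal until both sides are normalized to the same form.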
ASCII strings
- encoding-neutral but really it's ASCII
- character set
- strings
- ops
- OS APIs provide strings
- simple, English-based
- works with ASCII-compatible encodings
- you don't have to learn anything complicated
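To illustrate the "works with ASCII-compatible encodings" point, here is a small sketch. The helper `split_fields` and the sample text are hypothetical, made up for this example. The helper only knows about the ASCII comma, yet it still gives sensible results for UTF-8 and Latin-1 input. UTF-16 is not ASCII-compatible, so the same code cuts its code units in half.

```python
def split_fields(raw: bytes) -> list:
    """Split on an ASCII comma without knowing the text encoding."""
    return raw.split(b",")

# "name,café,naïve", written with escapes to keep this source ASCII-only.
text = "name,caf\u00e9,na\u00efve"

# ASCII-compatible encodings: the byte 0x2C only ever means ",".
utf8_fields = split_fields(text.encode("utf-8"))
latin1_fields = split_fields(text.encode("latin-1"))

# UTF-16 encodes "," as two bytes (0x2C 0x00), so splitting on the single
# byte 0x2C leaves stray 0x00 bytes and misaligned fields.
utf16_fields = split_fields(text.encode("utf-16-le"))

print([f.decode("utf-8") for f in utf8_fields])
print([len(f) for f in utf16_fields])
```

The odd byte length of the middle UTF-16 field is the giveaway: it can no longer be decoded as UTF-16 at all.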
Unicode strings
- UTF-8
- OS APIs
- string APIs make less sense
- locale tagging
- utf8b
- bytestrings
- poorly defined semantics
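Two of the bullets above can be sketched in Python 3, where `str` is a sequence of code points. The first half shows why ordinary string APIs "make less sense": length, reversal and slicing all work on code points, not on what a reader sees as one character. The second half uses Python's `surrogateescape` error handler, which I'm assuming is close in spirit to what the "utf8b" bullet refers to; the file name is made up.

```python
# --- string APIs operate on code points, not graphemes ---
flag = "\U0001F1EB\U0001F1F7"   # two regional indicators, shown as one flag
accented = "e\u0301"            # "e" + combining acute, shown as one letter

flag_len = len(flag)            # counts code points
accented_len = len(accented)

# Reversing by code point puts the accent before its base letter.
reversed_text = ("a" + accented)[::-1]

# Slicing can silently drop half of a grapheme.
truncated = accented[:1]

# --- bytestrings that are not valid UTF-8 ---
raw = b"report-\xff.txt"        # 0xFF can never appear in valid UTF-8

# Undecodable bytes are smuggled through as lone surrogates (0xFF -> U+DCFF),
# so the original bytes can be recovered exactly.
name = raw.decode("utf-8", errors="surrogateescape")
round_trip = name.encode("utf-8", errors="surrogateescape")

print(flag_len, accented_len, repr(reversed_text), repr(truncated))
print(repr(name), round_trip == raw)
```

The round trip works, but `name` is no longer a well-formed Unicode string: encoding it with the default strict handler raises `UnicodeEncodeError`. That is one concrete case of the poorly defined semantics mentioned in the list.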
ICU strings
ICU/Java?
Non-destructive text processing
- clear, Unicode definitions
- rich text
- multiple versions
- metadata
- non-reversible
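One possible reading of the "multiple versions", "metadata" and "non-reversible" bullets, sketched in Python. The `PreservedText` class is entirely hypothetical and is not an existing API. The idea is that some operations, such as NFKC normalization, throw information away, so the original text is kept untouched and processed versions are derived from it on demand.

```python
import unicodedata
from dataclasses import dataclass, field

@dataclass(frozen=True)
class PreservedText:
    """Hypothetical container: the original text is never modified."""
    original: str
    metadata: dict = field(default_factory=dict)

    def normalized(self, form: str = "NFKC") -> str:
        # Derived version, computed from the untouched original.
        return unicodedata.normalize(form, self.original)

# U+FB01 is the "fi" ligature; NFKC replaces it with plain "f" + "i".
t = PreservedText("\ufb01le", {"locale": "en"})

search_key = t.normalized()          # handy for searching and comparing
still_original = t.original          # the ligature is still available

# Nothing in the normalized string says a ligature was ever there,
# so the original cannot be rebuilt from it.
recoverable = unicodedata.normalize("NFC", search_key) == t.original

print(repr(search_key), repr(still_original), recoverable)
```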