Unicode guide: Difference between revisions
Revision as of 02:53, 19 March 2022
This is a WIP page, take nothing here as final.
Introduction
Over the past decade it's been increasingly common to see programming languages add Unicode support, specifically support for Unicode strings. This is a good step, but the support is far from complete and is often implemented in a buggy way. On this page I hope to show what's wrong with this approach and offer some solutions.
Just to make it clear: Unicode is only a part of a complete localization framework. Languages do a bunch of other things wrong, but broken Unicode string handling is the topic I'm covering in this page.
Unicode refresher
If you don't understand what Unicode is, I highly recommend reading the following resources in this order:
- The Unicode Standard, Version 14.0 chapters 1, 2, 3, 4, 5 and 23
- Unicode Technical Reports
- Unicode Frequently Asked Questions
You might also find the following tools helpful:
But as a general overview, Unicode defines the following:
- A large multilingual set of abstract characters
- A database of properties for each character (this includes case mapping)
- How to encode characters for storage
- How to normalize text for comparison
- How to segment text into characters, words and sentences
- How to break text into lines
- How to order text for sorting
Some of these can be tailored by locale-dependent rules. The Unicode Common Locale Data Repository provides locale-specific information that aids in this tailoring.
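To make a few of these items concrete, here is a small sketch using Python 3's standard `unicodedata` module. It only touches the property database, encoding and normalization; the variable names are my own, and segmentation, line breaking and collation are not covered because the standard library does not implement them.

```python
import unicodedata

# Two spellings of the same text: precomposed U+00E9 versus
# "e" followed by the combining acute accent U+0301.
composed = "\u00e9"
decomposed = "e\u0301"

# Property database: each code point has a name, a general category
# and case mappings.
name = unicodedata.name(composed)          # 'LATIN SMALL LETTER E WITH ACUTE'
category = unicodedata.category(composed)  # 'Ll' (lowercase letter)
uppercase = composed.upper()               # '\u00c9'

# Encoding for storage: one character, different byte sequences.
utf8_bytes = composed.encode("utf-8")       # b'\xc3\xa9'
utf16_bytes = composed.encode("utf-16-le")  # b'\xe9\x00'

# Normalization for comparison: the raw strings differ,
# but their NFC forms are identical.
raw_equal = composed == decomposed
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))
```

Here `raw_equal` is false while `nfc_equal` is true, which is why comparison needs a normalization step at all.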
ASCII-compatible strings
Traditionally programming languages have stuck with using what I call ASCII-compatible strings.
Here's my definition of this string API:
- Each character is represented by a single 8-bit byte
- Values under 128 are ASCII values
- Other values are defined by the current locale setting
- Strings are arrays of these characters
- Matching is done by comparing bytes, no normalization is needed
- Classification, sorting and case conversion depend on the current locale setting
- Length is the number of bytes
For example, this model is present in:
- C and C++
- Python 2
- Lua
- PHP (ignoring mbstring)
- POSIX APIs
- Windows narrow APIs
- DOS APIs
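As a rough sketch of how this model behaves, Python 3's `bytes` type can stand in for an ASCII-compatible string. This is my own illustration rather than any of the APIs listed above, and the sample word and variable names are assumptions made for the example.

```python
# "naïve" stored as UTF-8; the model sees one character per byte.
text = "naïve".encode("utf-8")   # b'na\xc3\xafve'

# Length is the number of bytes, not the number of characters a reader sees.
byte_length = len(text)          # 6, although the word has 5 letters

# Values under 128 are ASCII, so ASCII content can be inspected directly.
starts_with_n = text[0] == ord("n")

# Matching is a plain byte comparison with no normalization step.
same_spelling = text == b"na\xc3\xafve"

# "i" + combining diaeresis (U+0308) displays the same but has other bytes.
decomposed = "nai\u0308ve".encode("utf-8")
matches_decomposed = text == decomposed
```

The last comparison is false: the two byte sequences display identically, but the model has no notion of equivalence beyond byte equality.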
This model has some appealing features:
- No encoding or decoding takes place
- English text is handled with no issues
- You can stuff opaque Unicode bytes into strings using UTF-8
However, it also has the huge downside that you can't do anything meaningful with non-English text.
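A minimal sketch of that downside, again using Python 3 `bytes` as a stand-in for an ASCII-compatible string holding UTF-8. Note the assumption: Python's `bytes` methods are ASCII-only and ignore the locale, so this approximates the model rather than reproducing any particular C or POSIX implementation.

```python
word = "école".encode("utf-8")   # b'\xc3\xa9cole'

# Case conversion only touches ASCII letters, so the "é" stays lowercase.
upper_bytes = word.upper()                    # b'\xc3\xa9COLE'
upper_as_text = upper_bytes.decode("utf-8")   # 'éCOLE'

# A Unicode-aware string converts every letter.
upper_unicode = "école".upper()               # 'ÉCOLE'

# Slicing by byte index can cut a multi-byte character in half,
# leaving bytes that are no longer valid UTF-8.
first_byte = word[:1]                         # b'\xc3'
try:
    first_byte.decode("utf-8")
    slice_is_valid = True
except UnicodeDecodeError:
    slice_is_valid = False
```

The ASCII part of the word is handled correctly in both operations; only the non-English character is mishandled.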
Unicode strings
research and categorize the following:
- bytestrings?
- wchar
- windows
- rust
- java
- swift
- go
- kotlin
- python, utf8b
- tcl
- linux/unix
- javascript
- perl
- ruby
- zig
- raku
ICU strings
ICU/Java?
Non-destructive text processing
- clear, unicode definitions
- rich text
- multiple versions
- metadata
- non-reversible