Unicode guide: Difference between revisions

Revision as of 00:56, 19 March 2022

This is a WIP page; take nothing here as final.

Introduction

Over the past decade it has become increasingly common for programming languages to add Unicode support, specifically support for Unicode strings. This is a good step, but that support is rarely complete and is often buggy. On this page I hope to show what's wrong with this approach and provide some solutions.

To be clear: Unicode is only one part of a complete localization framework. Languages get plenty of other things wrong, but broken Unicode string handling is the topic of this page.
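
As a minimal sketch of the kind of breakage meant here, using Python 3 purely as an illustrative language: its built-in `str` is a sequence of code points, so length and reversal operate on code points rather than on what a reader would call characters. The variable names are only for this example.

```python
# One user-perceived character: "e" followed by COMBINING ACUTE ACCENT.
s = "e\u0301"

# len() counts code points, not user-perceived characters.
code_points = len(s)  # 2

# Reversing by code point moves the accent in front of the base letter,
# so it no longer combines with the "e".
reversed_s = s[::-1]  # "\u0301e"

# A flag emoji is a pair of regional indicator code points (here D + E).
flag = "\U0001F1E9\U0001F1EA"
flag_code_points = len(flag)  # 2

# Indexing returns a lone regional indicator, not the flag.
first = flag[0]
```

None of this is a bug in the strict sense, since the operations do what they are documented to do, but the results rarely match what a programmer working with text wants.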

Unicode refresher

If you don't understand what Unicode is, I highly recommend reading the following resources in this order:

The Unicode Standard, Version 14.0 chapters 1, 2, 3, 4, 5 and 23
Unicode Technical Reports
Unicode Frequently Asked Questions

You might also find the following tools helpful:

Unicode Utilities
Unicode Code Charts

But as a general overview, Unicode defines the following:

A large multilingual set of abstract characters
A database of properties for each character (this includes case mapping)
How to encode characters for storage
How to normalize text for comparison
How to segment text into characters, words and sentences
How to break text into lines
How to order text for sorting

Some of these can be tailored by locale-dependent rules. The Unicode Common Locale Data Repository provides locale-specific information that aids in this tailoring.
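
A few of the items in the overview above can be seen directly from a language's standard library. The following is a rough sketch, assuming Python 3 and its `unicodedata` module; it touches only properties, encoding, normalization and case mapping, since segmentation, line breaking and collation are not covered by that module.

```python
import unicodedata

# Character properties from the Unicode Character Database.
ch = "\u00e9"  # é as a single precomposed code point
name = unicodedata.name(ch)          # 'LATIN SMALL LETTER E WITH ACUTE'
category = unicodedata.category(ch)  # 'Ll' (Letter, lowercase)

# Encoding: the same code point stored in two encoding forms.
utf8 = ch.encode("utf-8")        # b'\xc3\xa9'
utf16 = ch.encode("utf-16-le")   # b'\xe9\x00'

# Normalization: precomposed and decomposed spellings compare unequal
# until both are normalized to the same form.
decomposed = "e\u0301"  # e + COMBINING ACUTE ACCENT
equal_raw = ch == decomposed                                 # False
equal_nfc = unicodedata.normalize("NFC", decomposed) == ch   # True

# Case mapping: full case mappings can change the string length.
upper = "stra\u00dfe".upper()  # 'STRASSE'
```

Note that `str.upper()` applies the untailored mappings: `"i".upper()` gives `"I"` regardless of locale, whereas Turkish tailoring would expect a dotted capital İ. That is the kind of locale-dependent rule the CLDR data is meant to supply.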

ASCII strings

- encoding-neutral but really it's ASCII

- character set

- strings

- ops

- OS APIs provide strings

- simple, English-based

- works with ASCII-compatible encodings

- you don't have to learn anything complicated

Unicode strings

- UTF-8

- OS APIs

- string APIs make less sense

- locale tagging

- utf8b

- bytestrings

- poorly defined semantics

ICU strings

ICU/Java?

Non-destructive text processing

- clear, Unicode definitions

- rich text

- multiple versions

- metadata

- non-reversible

@@ Line 7: / Line 7: @@
 == Unicode refresher ==
-https://unicode.org/main.html
+If you don't understand what Unicode is, I highly recommend reading the following resources in this order:
-https://www.unicode.org/versions/Unicode14.0.0/
+# [https://www.unicode.org/versions/Unicode14.0.0/UnicodeStandard-14.0.pdf The Unicode Standard, Version 14.0] chapters 1, 2, 3, 4, 5 and 23
+# [https://www.unicode.org/reports/index.html Unicode Technical Reports]
+# [https://www.unicode.org/faq/ Unicode Frequently Asked Questions]
-https://www.unicode.org/reports/index.html
+You might also find the following tools helpful:
-https://util.unicode.org/UnicodeJsps/
+* [https://util.unicode.org/UnicodeJsps/ Unicode Utilities]
+* [https://www.unicode.org/charts/ Unicode Code Charts]
-https://www.unicode.org/charts/
+But as a general overview, Unicode defines the following:
-https://www.unicode.org/faq/
+* A large multilingual set of abstract characters
+* A database of properties for each character (this includes case mapping)
+* How to encode characters for storage
+* How to normalize text for comparison
+* How to segment text into characters, words and sentences
+* How to break text into lines
+* How to order text for sorting
-- character set, encodings, CLDR, grapheme, normalization, collation, locale
+Some of these can be tailored by locale-dependent rules. The [https://cldr.unicode.org/ Unicode Common Locale Data Repository] provides locale-specific information that aids in this tailoring.
-- unicode strings?
-- what should you be able to do with strings?
 == ASCII strings ==