Unicode guide: Difference between revisions

From JookWiki
(Clarify scope)
(Add refresher)

Revision as of 00:56, 19 March 2022

This is a WIP page; take nothing here as final.

Introduction

Over the past decade it's become increasingly common for programming languages to add Unicode support, specifically support for Unicode strings. This is a good step, but the support is rarely complete and is often buggy. On this page I hope to show what's wrong with this approach and offer some solutions.

Just to make it clear: Unicode is only a part of a complete localization framework. Languages do a bunch of other things wrong, but broken Unicode string handling is the topic I'm covering in this page.

Unicode refresher

If you don't understand what Unicode is, I highly recommend reading the following resources in this order:

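  1. The Unicode Standard, Version 14.0 (https://www.unicode.org/versions/Unicode14.0.0/UnicodeStandard-14.0.pdf), chapters 1, 2, 3, 4, 5 and 23
  2. Unicode Technical Reports (https://www.unicode.org/reports/index.html)
  3. Unicode Frequently Asked Questions (https://www.unicode.org/faq/)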

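You might also find the following tools helpful:

  • Unicode Utilities (https://util.unicode.org/UnicodeJsps/)
  • Unicode Code Charts (https://www.unicode.org/charts/)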

But as a general overview, Unicode defines the following:

  • A large multilingual set of abstract characters
  • A database of properties for each character (this includes case mapping)
  • How to encode characters for storage
  • How to normalize text for comparison
  • How to segment text into characters, words and sentences
  • How to break text into lines
  • How to order text for sorting
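
A few of the items above can be poked at from Python's standard library. The following is only a minimal sketch: the sample character and the variable names are my own choices, and segmentation, line breaking and collation are left out because the modules used here don't cover them.

```python
import unicodedata

# A single abstract character: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
composed = "\u00e9"

# Character database: every character has a name, a general category
# and case mappings.
name = unicodedata.name(composed)
category = unicodedata.category(composed)  # "Ll" means lowercase letter
upper = composed.upper()

# Encoding for storage: the same character yields different byte
# sequences depending on the encoding form.
utf8_bytes = composed.encode("utf-8")
utf16_bytes = composed.encode("utf-16-le")

# Normalization for comparison: NFD splits the character into a base
# letter plus a combining accent, NFC puts it back together.
decomposed = unicodedata.normalize("NFD", composed)
recomposed = unicodedata.normalize("NFC", decomposed)

print(name, category, upper)
print(utf8_bytes, utf16_bytes)
print(len(composed), len(decomposed), composed == decomposed)
```

The composed and decomposed forms look identical when displayed but compare unequal as strings, which is why text has to be normalized before it is compared.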

Some of these can be tailored by locale-dependent rules. The Unicode Common Locale Data Repository (https://cldr.unicode.org/) provides locale-specific information that aids in this tailoring.
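
Case mapping is a common example of locale tailoring: Turkish pairs dotted İ with i and dotless I with ı, while the default mappings pair I with i. The sketch below shows Python's built-in `str.lower()` applying the default mapping, next to `lower_tr`, a hypothetical helper of my own that handles only this one pair. It is not a real tailoring implementation and does not use CLDR data.

```python
DOTTED_CAPITAL_I = "\u0130"  # İ
DOTLESS_SMALL_I = "\u0131"   # ı


def lower_tr(text: str) -> str:
    """Lowercase text using the Turkish dotted/dotless i pairs.

    Hypothetical helper for illustration only: it special-cases the
    two capital I characters and defers everything else to str.lower().
    """
    text = text.replace(DOTTED_CAPITAL_I, "i")
    text = text.replace("I", DOTLESS_SMALL_I)
    return text.lower()


# The default mapping ignores locale entirely.
default_result = "ISTANBUL".lower()

# The tailored mapping produces a dotless i for a plain capital I.
turkish_result = lower_tr("ISTANBUL")
turkish_dotted = lower_tr(DOTTED_CAPITAL_I + "STANBUL")

# Without tailoring, İ lowercases to "i" followed by a combining dot.
untailored_dotted = DOTTED_CAPITAL_I.lower()

print(default_result, turkish_result, turkish_dotted, len(untailored_dotted))
```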

ASCII strings

- encoding-neutral but really it's ASCII

- character set

- strings

- ops

- OS APIs provide strings

- simple, English based

- works with ASCII-compatible encodings

- you don't have to learn anything complicated

Unicode strings

- UTF-8

- OS APIs

- string APIs make less sense

- locale tagging

- utf8b

- bytestrings

- poorly defined semantics
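
As a rough illustration of the utf8b and bytestrings notes above, and assuming utf8b refers to the UTF-8b scheme that Python exposes as the `surrogateescape` error handler, the sketch below round-trips bytes that are not valid UTF-8 through a `str`. The file name is made up for the example.

```python
# Bytes as an OS API might hand them over: Latin-1 encoded, so the
# 0xE9 byte is not valid UTF-8. The file name is hypothetical.
raw = b"caf\xe9.txt"

# surrogateescape smuggles each undecodable byte into the string as a
# lone surrogate in the range U+DC80..U+DCFF instead of failing.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("utf-8", errors="surrogateescape")

# The resulting str is not well-formed Unicode text, so a strict
# encode refuses it.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(ascii(text), round_tripped == raw, strict_ok)
```

The string here holds data that no Unicode encoding form can represent, which is one way the semantics of a "Unicode string" type end up poorly defined.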

ICU strings

ICU/Java?

Non-destructive text processing

- clear, unicode definitions

- rich text

- multiple versions

- metadata

- non-reversible