Unicode guide

From JookWiki

Revision as of 23:42, 18 March 2022

This is a WIP page, take nothing here as final.

Introduction

Over the past decade it has become increasingly common for programming languages to add Unicode support, specifically support for Unicode strings. This is a good step, but the support is far from complete and is often implemented in a buggy way. On this page I hope to show what is wrong with this approach and offer some solutions.

Unicode refresher

https://unicode.org/main.html

https://www.unicode.org/versions/Unicode14.0.0/

https://www.unicode.org/reports/index.html

https://util.unicode.org/UnicodeJsps/

https://www.unicode.org/charts/

https://www.unicode.org/faq/

- character set, encodings, CLDR, grapheme, normalization, collation, locale

- Unicode strings?

- what should you be able to do with strings?
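A minimal sketch of a few of the terms above (character set, encodings, normalization, grapheme), using only Python's standard `unicodedata` module. The example string and variable names are illustrative choices, not something this page prescribes; CLDR, collation and locale are not covered here.

```python
import unicodedata

# "é" spelled two ways: one precomposed code point, or "e" plus a combining accent.
composed = "\u00e9"
decomposed = "e\u0301"

# Character set: every code point has a number and a name.
names = [unicodedata.name(ch) for ch in decomposed]

# Encodings: the same code point maps to different byte sequences.
utf8_bytes = composed.encode("utf-8")
utf16_bytes = composed.encode("utf-16-le")

# Normalization: NFC composes, NFD decomposes.
nfc_equal = unicodedata.normalize("NFC", decomposed) == composed
nfd_equal = unicodedata.normalize("NFD", composed) == decomposed

# Grapheme: both spellings are one user-perceived character,
# but the code point counts differ.
lengths = (len(composed), len(decomposed))

print(names, utf8_bytes, utf16_bytes, nfc_equal, nfd_equal, lengths)
```

The two spellings compare unequal as plain strings until they are normalized, which is one reason "what should you be able to do with strings?" is not a trivial question.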

ASCII strings

- encoding-neutral, but in practice it's ASCII

- character set

- strings

- ops

- OS APIs provide strings

- simple, English-based

- works with ASCII-compatible encodings

- you don't have to learn anything complicated
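A rough illustration of the "works with ASCII-compatible encodings" point, assuming UTF-8 and Latin-1 as the two encodings. The path and the `split_path` helper are hypothetical and chosen only for this sketch. The byte-oriented code never decodes anything, yet it behaves sensibly on both inputs because it only looks at ASCII bytes.

```python
# The same text in two ASCII-compatible encodings.
text = "home/jos\u00e9/notes.txt"
path_utf8 = text.encode("utf-8")
path_latin1 = text.encode("latin-1")

def split_path(raw):
    # In UTF-8 and Latin-1 the byte 0x2F only ever means "/",
    # so splitting on it cannot cut a non-ASCII character in half.
    return raw.split(b"/")

parts_utf8 = split_path(path_utf8)
parts_latin1 = split_path(path_latin1)

# bytes.upper() only touches ASCII letters; other bytes pass through untouched.
upper_utf8 = parts_utf8[1].upper()

print(parts_utf8, parts_latin1, upper_utf8)
```

The trade-off is visible in the last line: the ASCII letters are uppercased but the "é" is not, which fits the "simple, English-based" bullet above.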

Unicode strings

- UTF-8

- OS APIs

- string APIs make less sense

- locale tagging

- utf8b

- bytestrings

- poorly defined semantics
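A sketch of the bytestring problem behind the "OS APIs", "utf8b" and "bytestrings" bullets, on the assumption that "utf8b" refers to the UTF-8b idea of smuggling undecodable bytes through a string. Python's `surrogateescape` error handler (PEP 383) is used here as one similar technique; the file name is made up.

```python
# A name as an OS API might return it: mostly ASCII, plus one byte
# that is not valid UTF-8.
raw = b"report-\xff.txt"

# Strict decoding rejects the input outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# "replace" decodes, but the original byte is lost for good.
lossy = raw.decode("utf-8", errors="replace")

# "surrogateescape" maps the bad byte to a lone surrogate (U+DCFF),
# so encoding with the same handler restores the original bytes.
escaped = raw.decode("utf-8", errors="surrogateescape")
round_trip = escaped.encode("utf-8", errors="surrogateescape")

print(strict_ok, repr(lossy), repr(escaped), round_trip == raw)
```

The escaped value is a `str`, but it is not well-formed Unicode: encoding it with the default strict handler raises `UnicodeEncodeError`. That is one concrete case of the poorly defined semantics noted above.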

ICU strings

ICU/Java?

Non-destructive text processing

- clear, unicode definitions

- rich text

- multiple versions

- metadata

- non-reversible
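One possible reading of the bullets above, sketched under heavy assumptions: some transformations (here NFKC normalization) are non-reversible, so a non-destructive design keeps the original text and stores derived versions beside it. The `Text` class, its fields and the labels are hypothetical names invented for this illustration.

```python
import unicodedata
from dataclasses import dataclass, field

@dataclass
class Text:
    # The original is never modified; derived versions are stored by label.
    original: str
    derived: dict = field(default_factory=dict)

    def add_version(self, label, transform):
        # Apply the transform to the original and remember the result.
        self.derived[label] = transform(self.original)
        return self.derived[label]

# U+FB01 is the "fi" ligature and U+2116 is the numero sign.
doc = Text("\ufb01le \u21165")
nfkc = doc.add_version("nfkc", lambda s: unicodedata.normalize("NFKC", s))

# NFKC replaced the ligature and the numero sign with plain letters;
# no normalization form turns them back.
recovered = unicodedata.normalize("NFC", nfkc)

print(repr(doc.original), repr(nfkc), recovered == doc.original)
```

Because the original string is retained, the lossy version can be used for searching or comparison without destroying what the user actually wrote.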