Unicode guide: Difference between revisions


Revision as of 06:42, 30 September 2022

This is a WIP page, take nothing here as final.

Introduction

There's a lot of information out on the Internet about Unicode. Generally it falls into a few separate categories:

How Unicode has lots of characters
How Unicode is unintuitive
How new programming languages make ignoring Unicode easier
How to encode Unicode characters

- standards are the source

- explain this is a guide or on ramp

What is Unicode?

Before I get into the main article I'd like to provide a quick overview of what Unicode is.

Unicode defines the following:

A large multilingual set of abstract characters (known just as 'characters')
Properties for each character
How to encode characters for storage
How to normalize characters into a canonical format
How to segment text into words, sentences, lines, and paragraphs
How to map text between different cases
How to order text for sorting
How to match text for searching
How to incorporate Unicode into regular expressions

Many of these can be further tailored by locale-dependent rules and custom algorithms. The Unicode Common Locale Data Repository provides locale-specific information that aids in this tailoring.

I highly recommend reading the following resources:

The Unicode Standard, Version 15.0 chapters 1, 2, 3, 4 and 23
Unicode Glossary
Unicode Technical Reports
Unicode Frequently Asked Questions

You might also find the following tools helpful:

The layered model

- levels of abstraction

- indexing

- sort

- match

- search

- normalize

- serialize

- case map

- properties

- breaking/segmentation

- reversing

Level 1: Bytes

Your basic unit is the byte. You can compare, search, split, and sort.

filesystem/unix/C
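
As a rough illustration of working at the byte level, here is a minimal Python sketch. The sample string and the choice of UTF-8 to produce the bytes are my own assumptions; at this level the code never looks inside the bytes to find characters.

```python
# Level 1 sketch: treat text purely as bytes, with no knowledge of characters.
# UTF-8 is used here only to produce some sample bytes (an assumption).
data = "na\u00efve caf\u00e9".encode("utf-8")

# Compare, search, split, and sort all operate on raw byte values.
same = data == b"na\xc3\xafve caf\xc3\xa9"   # byte-for-byte comparison
position = data.find(b"caf")                 # byte offset, not character offset
parts = data.split(b" ")                     # split on the space byte 0x20
ordered = sorted([b"zebra", b"\xc3\xa9clair", b"apple"])

print(same)      # True
print(position)  # 7 (the two-byte sequence for U+00EF shifts the offset by one)
print(parts)     # [b'na\xc3\xafve', b'caf\xc3\xa9']
print(ordered)   # [b'apple', b'zebra', b'\xc3\xa9clair']
```

Note that the byte-wise sort places the word starting with 0xC3 after "zebra", because only byte values are compared.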

Level 2: Code units

Your basic unit is the smallest unit of your Unicode encoding: a byte for UTF-8, a 16-bit integer for UTF-16, a 32-bit integer for UTF-32. You can compare, search, split, and sort. To get to this point you have to handle endianness (for UTF-16 and UTF-32).

windows
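
To make the idea of code units concrete, here is a small Python sketch that splits encoded bytes into code units. The helper name `code_units` and the sample string are hypothetical, and little-endian byte order is assumed for UTF-16 and UTF-32.

```python
# Level 2 sketch: view the same text as code units in three encodings.
text = "a\u00e9\U0001F600"  # 'a', U+00E9, and U+1F600 (outside the BMP)

def code_units(s, encoding, width):
    """Encode s and group the bytes into little-endian units of `width` bytes."""
    raw = s.encode(encoding)
    return [int.from_bytes(raw[i:i + width], "little")
            for i in range(0, len(raw), width)]

utf8_units = code_units(text, "utf-8", 1)      # 8-bit code units
utf16_units = code_units(text, "utf-16-le", 2)  # 16-bit code units
utf32_units = code_units(text, "utf-32-le", 4)  # 32-bit code units

print(len(utf8_units), len(utf16_units), len(utf32_units))  # 7 4 3
print([hex(u) for u in utf16_units])  # ['0x61', '0xe9', '0xd83d', '0xde00']
```

The same three characters take 7, 4, and 3 code units respectively, so any index or length at this level depends on the encoding.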

Level 3: Unicode scalars

Your basic unit is a number between 0x0 and 0x10FFFF inclusive, excluding the surrogate range 0xD800 to 0xDFFF. To get to this point you have to decode UTF-8, UTF-16, or UTF-32. You can compare, search, split, and so on, but it's important to note that these are just numbers; there's no meaning attached to them.

python
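
A minimal Python sketch of this level, assuming UTF-8 input. The helper `is_scalar_value` is a hypothetical name of mine; it simply encodes the range rule for Unicode scalar values.

```python
# Level 3 sketch: decode bytes into scalar values and treat them as plain numbers.
def is_scalar_value(n):
    """True if n is a Unicode scalar value: 0..0x10FFFF minus the surrogates."""
    return 0 <= n <= 0x10FFFF and not (0xD800 <= n <= 0xDFFF)

decoded = b"a\xc3\xa9\xf0\x9f\x98\x80".decode("utf-8")
scalars = [ord(c) for c in decoded]

print([hex(n) for n in scalars])                # ['0x61', '0xe9', '0x1f600']
print(all(is_scalar_value(n) for n in scalars))  # True

# Python's str can hold lone surrogates, so the check is not redundant.
print(is_scalar_value(ord(chr(0xD800))))  # False
print(is_scalar_value(0x110000))          # False
```

Nothing here knows that 0xE9 is a letter or that 0x1F600 is an emoji; that knowledge only arrives with the character database.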

Level 4: Unicode characters

Your basic unit is a code point that your runtime recognizes and is willing to interpret using a copy of the Unicode Character Database. Results vary according to the supported Unicode version. You can normalize, compare, match, search, split, and case map strings. Locale-specific operations may be provided. To get to this point the runtime needs to check whether the characters are supported.

???
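
As one possible illustration, Python's standard `unicodedata` module exposes a copy of the Unicode Character Database. The sketch below uses it for properties, normalization, and case folding. The helper `is_assigned` is a hypothetical name, and treating general category "Cn" as "not supported" is a rough heuristic of mine rather than an official test.

```python
import unicodedata

composed = "\u00e9"     # one code point
decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

# Properties come from the runtime's copy of the Unicode database.
name = unicodedata.name(composed)
category = unicodedata.category(composed)

# The two spellings differ as code points but match after normalization.
equal_raw = composed == decomposed
equal_nfc = unicodedata.normalize("NFC", decomposed) == composed

# Case folding for caseless matching (not locale-tailored).
folded = "Stra\u00dfe".casefold()

def is_assigned(ch):
    """Rough heuristic: category Cn means unassigned or noncharacter."""
    return unicodedata.category(ch) != "Cn"

print(name)                      # LATIN SMALL LETTER E WITH ACUTE
print(category)                  # Ll
print(equal_raw, equal_nfc)      # False True
print(folded)                    # strasse
print(is_assigned("\uffff"))     # False (U+FFFF is a noncharacter)
print(unicodedata.unidata_version)  # depends on the Python build
```

The last line is why results vary between runtimes: each one ships its own version of the database.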

Level 5: Segmented text

Your basic unit is a string of Unicode characters of some length, such as a grapheme cluster, word, or paragraph. To get to this point you need to apply breaking/segmentation rules to a string of Unicode characters.

swift/raku
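
Python's standard library does not implement the Unicode segmentation rules, so the following is only a deliberately simplified sketch of the idea. The function `rough_clusters` is hypothetical and merely attaches combining marks (general category M*) to the preceding character; it is not the full grapheme cluster algorithm and will mis-segment emoji sequences, Hangul jamo, and more.

```python
import unicodedata

def rough_clusters(text):
    """Group each base character with the combining marks that follow it.

    Simplified approximation only: real grapheme segmentation has many more rules.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch   # attach the mark to the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

text = "e\u0301a\u0308z"  # e + acute, a + diaeresis, z
clusters = rough_clusters(text)

print(len(text))      # 5 code points
print(len(clusters))  # 3 rough clusters
```

Once text is segmented like this, operations such as indexing or reversing can work on whole clusters instead of splitting a base character from its marks.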

TODO:

languages/locales

Non-Unicode compatibility

- preserving data

- While writing this page I researched and documented Unicode support in various programming languages. You can see my notes here: Unicode guide/Implementations.

@@ Line 1: / Line 1: @@
 '''This is a WIP page, take nothing here as final.'''
-== Background ==
+== Introduction ==
+There's a lot of information out on the Internet about Unicode. Generally it falls in to a few separate categories:
-Many modern programming languages have added or are in the processing of adding support for Unicode. Generally they do this by making their textual strings use Unicode values of some sort and ensure string operations work with these values.
+* How Unicode has lots of characters
+* How Unicode is unintuitive
+* How new programming languages make ignoring Unicode easier
+* How to encode Unicode characters
-This swaps one problem for another: Instead of writing code that only works with one type of string, programmers will now write code that only works with another type of string.
+- standards are the source
-Having poor language abstractions is a problem, but it's not the problem. The main problem is that programmers just don't have a useful mental model for Unicode.
+- explain this is a guide or on ramp
-For example, if you search online for a Unicode tutorial you'll likely get the following information:
+== What is Unicode? ==
-* Before Unicode there was ASCII
-* ASCII stores a Latin character without complex encoding
-* Unicode has characters from almost every written script
-* It needs to be encoded and decoded using UTF-8, UTF-16 or UTF-32
-* UTF-8 is used commonly for serving web pages
-This isn't very useful knowledge in the grand scheme of what people do with strings in applications.
-Likewise, sitting down and reading the standard will teach you everything you need to know about Unicode, but not how to really think about it when programming.
-After thinking about it for a while I've built up what I thought is a useful model for Unicode strings: The layered model.
-== Unicode overview ==
 Before I get in to the main article I'd like to provide a quick overview of what Unicode is.