Unicode guide: Difference between revisions

Revision as of 00:48, 1 October 2022

This is a WIP page, take nothing here as final.

Introduction

If you've ever tried to learn Unicode, you've most likely looked at online tutorials and learning resources. These tend to focus on specific details of how Unicode works instead of the broader picture.

This guide is my attempt to help you build a mental model of Unicode that can be used to write functional software and navigate the official Unicode standards and resources.

As a disclaimer: I'm just a random person, and some of this might be wrong. Hopefully, by the end of this guide, you will be able to correct me.

What is Unicode?

Unicode defines the following:

A large multilingual set of abstract characters (known just as 'characters')
Properties for each character
How to encode characters for storage
How to normalize text into a canonical form
How to segment text into words, sentences, lines, and paragraphs
How to map text between different cases
How to order text for sorting
How to match text for searching
How to incorporate Unicode into regular expressions

Many of these can be further tailored by locale-dependent rules and custom algorithms.

The layered model

- levels of abstraction

- indexing

- sort

- match

- search

- normalize

- serialize

- case map

- properties

- breaking/segmentation

- reversing

Level 1: Bytes

Your basic unit is the byte. You can compare, search, split, and sort, but every operation works on raw byte values with no knowledge of the encoding.

filesystem/unix/C
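
As a rough illustration of the byte level, here is a minimal Python sketch. The sample strings and the choice of UTF-8 are assumptions made for the example; nothing at this level actually knows or cares which encoding produced the bytes.

```python
# Level 1: treat text as opaque bytes. Nothing is decoded anywhere below.
data = "na\u00efve caf\u00e9".encode("utf-8")  # sample input; UTF-8 is an assumption

# Searching and splitting work on byte sequences.
found_at = data.find(b"caf")   # a byte offset, not a character index
parts = data.split(b" ")

# Sorting compares raw byte values, so non-ASCII text sorts after all ASCII.
names = sorted([b"zebra", "\u00e9cole".encode("utf-8"), b"apple"])

# Slicing at an arbitrary offset can cut a multi-byte sequence in half.
broken = data[:3]              # ends with the first byte of the two-byte "\u00ef"
```

Here `found_at` is 7 even though "caf" starts at character index 6, because the "ï" before it takes two bytes in UTF-8.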

Level 2: Code units

Your basic unit is the smallest unit of your Unicode encoding: an 8-bit byte for UTF-8, a 16-bit integer for UTF-16, or a 32-bit integer for UTF-32. You can compare, search, split, and sort. To get to this point you have to handle endianness, which matters for UTF-16 and UTF-32 but not for UTF-8.

windows
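
To make code units concrete, the sketch below counts them for the same string in each encoding form and shows what endianness changes. The sample string is an assumption chosen so that each character needs a different number of units.

```python
import struct

text = "a\u00e9\U0001F600"  # "a", "é", and an emoji outside the BMP

# The "-le" codecs fix the byte order and do not emit a byte order mark.
utf8_units = len(text.encode("utf-8"))            # 8-bit units
utf16_units = len(text.encode("utf-16-le")) // 2  # 16-bit units
utf32_units = len(text.encode("utf-32-le")) // 4  # 32-bit units

# Endianness: the same code unit, stored in a different byte order.
little = "a".encode("utf-16-le")  # b"a\x00"
big = "a".encode("utf-16-be")     # b"\x00a"

# Turn UTF-16 bytes back into a list of 16-bit integers.
raw = text.encode("utf-16-le")
units = list(struct.unpack("<%dH" % (len(raw) // 2), raw))
```

The emoji takes two UTF-16 code units (a surrogate pair), which is why `utf16_units` is 4 for a string of three characters.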

Level 3: Unicode scalars

Your basic unit is a number between 0x0 and 0x10FFFF inclusive, excluding the surrogate range 0xD800 to 0xDFFF. To get to this point you have to decode UTF-8, UTF-16, or UTF-32. You can compare, search, split, and so on, but it's important to note that these are just numbers; there's no meaning attached to them.

python
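
A small sketch of the scalar level in Python follows. The helper name `is_scalar_value` is hypothetical, not a standard library function. One caveat: Python's `str` will happily hold a lone surrogate, so strictly speaking it stores code points rather than scalar values.

```python
def is_scalar_value(n: int) -> bool:
    """True if n is a code point that is not a surrogate (hypothetical helper)."""
    return 0 <= n <= 0x10FFFF and not (0xD800 <= n <= 0xDFFF)

# Decoding turns code units into plain integers.
scalars = [ord(ch) for ch in b"a\xc3\xa9\xf0\x9f\x98\x80".decode("utf-8")]

# At this level comparison is purely numeric: two spellings of "é" differ.
precomposed = "\u00e9"   # one scalar
combining = "e\u0301"    # "e" followed by a combining acute accent
equal_as_numbers = precomposed == combining

# Python allows this value in a str even though it is not a scalar value.
lone_surrogate = chr(0xD800)
```

`equal_as_numbers` is `False` here: nothing at this level knows that the two sequences are meant to be the same letter.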

Level 4: Unicode characters

Your basic unit is a code point that your runtime recognizes and is willing to interpret using a copy of the Unicode Character Database. Results vary according to the supported Unicode version. You can normalize, compare, match, search, split, and case map strings. Locale-specific operations may be provided. To get to this point the runtime needs to check whether each character is supported.

???
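
One way to see this level in practice is Python's `unicodedata` module, which bundles a copy of the Unicode Character Database. This is only a sketch of a few database-backed operations, and the sample strings are assumptions.

```python
import unicodedata

# Which Unicode version this runtime's database copy implements.
db_version = unicodedata.unidata_version

# Normalization makes the two spellings of "é" compare equal.
precomposed = "\u00e9"
combining = "e\u0301"
same_after_nfc = unicodedata.normalize("NFC", combining) == precomposed

# Properties come from the database, not from the number itself.
upper_a = unicodedata.category("A")         # "Lu": uppercase letter
unassigned = unicodedata.category("\uffff")  # "Cn": a noncharacter, reported as unassigned

# Case mapping for caseless matching (not locale-aware).
matches = "Stra\u00dfe".casefold() == "STRASSE".casefold()
```

Because the answers depend on `db_version`, the same code can give different results for newly assigned characters on a different runtime.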

Level 5: Segmented text

Your basic unit is a run of Unicode characters, such as a grapheme cluster, word, sentence, or paragraph. To get to this point you need to apply breaking/segmentation rules to a string of Unicode characters.

swift/raku
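
Python's standard library has no grapheme cluster segmentation, so the sketch below uses a deliberately simplified stand-in: it only attaches combining marks to the preceding character. It is not the Unicode segmentation algorithm and gets emoji sequences, Hangul, and many other cases wrong. The function name `rough_clusters` is hypothetical.

```python
import unicodedata

def rough_clusters(text: str) -> list[str]:
    """Group each combining mark with the character before it.

    A simplification for illustration only, not real grapheme segmentation.
    """
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch      # attach the mark to the previous cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

word = "e\u0301a"  # "é" spelled with a combining accent, then "a"
clusters = rough_clusters(word)

# Reversing by scalar moves the accent onto the wrong letter.
reversed_by_scalar = word[::-1]
# Reversing by cluster keeps the accent with its base letter.
reversed_by_cluster = "".join(reversed(clusters))
```

Reversing is a handy test of which level you are working at: `reversed_by_scalar` puts the accent on the "a", while `reversed_by_cluster` keeps it on the "e".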

@@ Line 2: / Line 2: @@
 == Introduction ==
-I've been frustrated with online Unicode tutorials and learning resources. They tend to miss important information, focus on the wrong things or just include out of date information.
+If you've ever tried to learn Unicode you've most likely looked at online tutorial and learning resources. These tend to focus on specific details about how Unicode works instead of the broader picture.
-- this is my guide to unicode
+This guide is my attempt to help you build a mental model of Unicode that can be used to write functional software and navigate the official Unicode standards and resources.
-- should give you a mental model
+As a disclaimer: I'm just a random person, some of this might be wrong. But hopefully by the end of reading this you should be able to correct me.
-- should let you navigate reading the standards and official resources
-- should help you understand what tools to use when programming
 == What is Unicode? ==