Unicode guide: Difference between revisions

From JookWiki

Revision as of 06:48, 30 August 2022

This is a WIP page, take nothing here as final.


what are strings

things you can do with strings:

- sort

- match

- search

- normalize

- serialize

- case map

- properties

- breaking/segmentation

- reversing


level 1: bytes. your basic unit is the byte. you can compare, search, split and sort

filesystem/unix/C
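
a minimal sketch of what working at level 1 looks like, using python's `bytes` type. the sample string is made up for illustration; the point is only that every operation sees byte values and byte offsets, never characters:

```python
# level 1 sketch: treat text as opaque bytes, with no knowledge of the encoding
data = "na\u00efve caf\u00e9".encode("utf-8")   # "naïve café" as UTF-8 bytes

# search returns a byte offset, which differs from the character offset
# as soon as a multi-byte sequence appears before the match
found_at = data.find(b"caf")

# splitting and sorting compare raw byte values
words = sorted(data.split(b" "))

# slicing at an arbitrary byte offset can cut a multi-byte sequence in half
broken = data[:3]   # ends with the first byte of the two-byte "ï"
```

here `found_at` is 7 even though "caf" starts at character index 6, and `broken` can no longer be decoded as utf-8.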

level 2: code units. your basic unit is the smallest unit of your unicode encoding: a byte for UTF-8, a 16-bit int for UTF-16, a 32-bit int for UTF-32. you can compare, search, split and sort. to get to this point you have to handle endianness (for UTF-16 and UTF-32; UTF-8 has no byte order to worry about)

windows
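
a rough sketch of going from bytes to code units. `code_units` is a hypothetical helper written for this page, not a standard library function; it just groups bytes by unit width and applies the byte order:

```python
def code_units(text, encoding, width, byteorder):
    """Encode text, then read the bytes back as fixed-width integers."""
    raw = text.encode(encoding)
    return [
        int.from_bytes(raw[i:i + width], byteorder)
        for i in range(0, len(raw), width)
    ]

text = "a\u20ac\U0001f600"   # 'a', euro sign, one emoji outside the BMP

utf8 = code_units(text, "utf-8", 1, "big")           # byte order is irrelevant at width 1
utf16_le = code_units(text, "utf-16-le", 2, "little")
utf16_be = code_units(text, "utf-16-be", 2, "big")
utf32 = code_units(text, "utf-32-le", 4, "little")

# the raw bytes differ between little and big endian,
# but the code units are identical once endianness is handled
same_units = utf16_le == utf16_be
different_bytes = text.encode("utf-16-le") != text.encode("utf-16-be")
```

the same three-character string is 8 code units in UTF-8, 4 in UTF-16 (the emoji becomes a surrogate pair) and 3 in UTF-32.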

level 3: unicode scalars. your basic unit is a number between 0x0 and 0x10ffff inclusive, excluding the surrogate range 0xd800 to 0xdfff. to get to this point you have to decode utf-8, utf-16 or utf-32. you can compare, search, split, etc. but it's important to note that these are just numbers: there's no meaning attached to them

python
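
a small sketch of the scalar level. `is_scalar_value` is a hypothetical helper; the ranges are the ones from the unicode standard. note that python's `str` will also hold lone surrogates, so strictly speaking it stores code points rather than scalars:

```python
def is_scalar_value(n):
    """True for 0x0..0x10FFFF, excluding the surrogates 0xD800..0xDFFF."""
    return 0 <= n <= 0x10FFFF and not (0xD800 <= n <= 0xDFFF)

# two ways of writing an accented e, seen purely as numbers
composed = [ord(c) for c in "\u00e9"]       # one scalar
decomposed = [ord(c) for c in "e\u0301"]    # 'e' followed by a combining acute

# at this level they are simply different number sequences
equal_as_numbers = composed == decomposed
```

`equal_as_numbers` is false here: nothing at this level knows that the two sequences are meant to be the same text.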

level 4: unicode characters. your basic unit is a code point that your runtime recognizes and is willing to interpret using a copy of the unicode database. results vary according to the supported unicode version. you can normalize, compare, match, search, split and case map strings. locale specific operations may be provided. to get to this point the runtime needs to check whether the characters are supported

???
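
as one possible illustration, python's `unicodedata` module exposes its bundled copy of the unicode database. `is_assigned` is a hypothetical helper that treats general category `Cn` as "not known to this runtime"; that is an assumption made for the sketch, not an official definition of support:

```python
import unicodedata

def is_assigned(ch):
    """Treat general category Cn (unassigned) as unknown to this runtime."""
    return unicodedata.category(ch) != "Cn"

# which unicode version this runtime's database implements
db_version = unicodedata.unidata_version

# normalization makes the two spellings of an accented e compare equal
nfc = unicodedata.normalize("NFC", "e\u0301")
same_after_nfc = nfc == "\u00e9"

# case mapping can change the length of a string
upper = "stra\u00dfe".upper()

# case folding is meant for caseless matching
caseless_match = "Stra\u00dfe".casefold() == "STRASSE".casefold()

# properties: the combining acute accent is a nonspacing mark
mark_category = unicodedata.category("\u0301")
```

results for newer characters depend on `db_version`, which is why the same program can behave differently across runtime versions.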

level 5: unicode texts. your basic unit is a string of unicode characters of some length, such as a word, paragraph or grapheme cluster. to get to this point you need to segment a string of unicode characters using breaking/segmentation rules

swift/raku
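
python's standard library has no grapheme segmentation, so the sketch below uses a deliberately simplified rule: a combining mark is glued to the character before it. `simple_clusters` is a hypothetical helper and is not the full unicode segmentation algorithm (it ignores emoji sequences, hangul and more). it is only meant to show why reversing wants this level:

```python
import unicodedata

def simple_clusters(text):
    """Group each character with the combining marks that follow it.

    Simplified stand-in for real grapheme cluster segmentation.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch      # attach the mark to the previous cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

word = "cafe\u0301"   # "café" with a combining acute accent

# reversing code point by code point moves the accent to the front,
# where it no longer follows the 'e'
naive = word[::-1]

# reversing cluster by cluster keeps the accent on the 'e'
by_cluster = "".join(reversed(simple_clusters(word)))
```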

TODO:

languages/locales

Non-Unicode compatibility

- preserving data