Unicode guide
This is a WIP page; take nothing here as final.
Introduction
- felt pretty smart
- had a cool solution proposed on how to fix everything
- sat down and read the unicode string
- wholly inadequate as a toolkit for unicode text processing
Strings
- how they're encoded
- ops you can perform on them
- unicode strings are c strings but unicode
Low-level APIs
- filesystems
- windows
- unix
- pattern of mutating data
- should be non-destructive
OLD SHIT TO REMOVE LOLOLOL:
ASCII strings
TODO: title
TODO: i need to make a list of things you would want to do with a string and what each one requires (locale, code points, etc)
- devs just assumed the encoding was ASCII compatible
stuff i just wrote to friends:
developers often didn't know the encoding of whatever text they had, so they just kinda ignored 'unknown' bytes and stuck to ASCII bytes, which worked because most character encodings were ASCII-compatible
unicode was invented to have a standard character set developers could reason about
so then languages started to add support for unicode strings which are a collection of unicode characters. this is pretty good, a+ for that, this solves working with characters
buuut you still have two unsolved issues:
- what locale is a unicode string in?
- how do you deal with non-unicode data?
you may ask 'when would you ever need to deal with non-unicode data'? and the answer is when using linux or windows of course! linux stores its strings as bytes and windows stores them as 16-bit units, and neither is guaranteed to be valid unicode! yay!
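python solves dealing with non-unicode data by smuggling undecodable bytes into unicode strings as lone surrogates (the surrogateescape error handler). rust does something similar inside OsString (WTF-8 on windows) but keeps that a separate type from String
in python this effectively lets you embed arbitrary bytes into unicode strings. fucking weird. bad for security too probably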
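a minimal sketch of the python side of this, using only the built-in surrogateescape error handler. the file name is made up for illustration:

```python
# bytes that are not valid UTF-8 (0xff can never appear in UTF-8)
raw = b"report-\xff.txt"

# strict decoding refuses them
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN
smuggled = raw.decode("utf-8", errors="surrogateescape")

# the result is a str, but it cannot be encoded as strict UTF-8 anymore
try:
    smuggled.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False

# encoding with the same handler gives the original bytes back
round_tripped = smuggled.encode("utf-8", errors="surrogateescape")
```

so the str round-trips back to the OS fine, but anything that expects real unicode text will choke on it.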
mojibake
so then you're back to square one, you have a string of unknown encoding and no information of the locale the string is in
it's also possible to decode bytes with the wrong encoding and get valid unicode that is garbage text, because the byte values happen to be valid in both encodings
i have to think more about it but the core problem here in both situations is there's no reliable out-of-band way to know what locale and encoding a string is in
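a small example of that accident, assuming the text was really UTF-8 and someone guessed latin-1:

```python
original = "café"
utf8_bytes = original.encode("utf-8")  # b'caf\xc3\xa9'

# latin-1 assigns a character to every byte value, so this never raises
wrong = utf8_bytes.decode("latin-1")

# the damage is reversible only if you know which wrong guess was made
repaired = wrong.encode("latin-1").decode("utf-8")
```

no error is raised anywhere here, which is the problem: `wrong` is perfectly valid unicode, it just isn't the text anyone wrote.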
i'm not sure what value unicode strings have
basically:
- strings are sequences of values
- values may be unicode code points, bytes or metadata
- metadata contains info about the language/locale
of course you might have some more object oriented way to do this, such as strings containing encoding and locale and being able to chain them together
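a rough sketch of what that could look like. every name here (`CodePoint`, `RawByte`, `Locale`, `decode_mixed`, `text_only`) is hypothetical and made up for illustration, and it leans on surrogateescape just to find the undecodable bytes:

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class CodePoint:
    value: int  # a unicode code point that decoded cleanly


@dataclass(frozen=True)
class RawByte:
    value: int  # 0..255, a byte that could not be decoded


@dataclass(frozen=True)
class Locale:
    tag: str  # metadata, e.g. a language tag such as "en"


Value = Union[CodePoint, RawByte, Locale]


def decode_mixed(data: bytes, locale: str) -> List[Value]:
    """Decode UTF-8, keeping undecodable bytes as explicit RawByte values."""
    out: List[Value] = [Locale(locale)]
    for ch in data.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # surrogateescape stored byte 0xNN as U+DCNN
            out.append(RawByte(cp - 0xDC00))
        else:
            out.append(CodePoint(cp))
    return out


def text_only(values: List[Value]) -> str:
    """Keep only the real code points, dropping raw bytes and metadata."""
    return "".join(chr(v.value) for v in values if isinstance(v, CodePoint))


values = decode_mixed(b"a\xffb", "en")
```

the point of the separate types is that a raw byte can never be mistaken for a character, unlike a smuggled surrogate sitting inside a normal str.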
if you can't decode a string properly
you don't want to just be sending random bytes to places that aren't prepared for that, that's all. so when you're sending data to an interface you need to strip the bytes or convert them back to an encoding the system can deal with
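the two options could be sketched like this in python, assuming the undecodable bytes were smuggled in with surrogateescape. `strip_smuggled` and `back_to_bytes` are hypothetical helper names:

```python
def strip_smuggled(text: str) -> str:
    """Drop surrogate code points so the result is valid unicode text."""
    return "".join(ch for ch in text if not 0xD800 <= ord(ch) <= 0xDFFF)


def back_to_bytes(text: str) -> bytes:
    """Turn smuggled surrogates back into the original bytes."""
    return text.encode("utf-8", errors="surrogateescape")


# a latin-1 encoded name that was (wrongly) read as UTF-8
name = b"caf\xe9.txt".decode("utf-8", errors="surrogateescape")

for_text_interface = strip_smuggled(name)  # lossy, but safe to send anywhere
for_the_os = back_to_bytes(name)           # lossless, only for byte interfaces
```

which one is right depends on the interface: stripping loses data, converting back keeps it but only makes sense for something that accepts arbitrary bytes.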
a problem is that when you don't know the input's encoding then you could try decoding it as utf-8 or something else. but if that fails you could just
on windows for example they say their directory paths are utf-16, but in reality they're just arrays of 16-bit values, which can contain unpaired surrogates that aren't valid utf-16. so what do you do when you get an error decoding?
- crash and not be useful
- save those errors and ignore them while doing stuff but send them back to windows later
in this case it's fine because the worst that can happen is you'll get mojibake
but if you, say, read a windows path and send it over the network, you might send invalid bytes
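a sketch of that situation in python, with a hand-built 16-bit string standing in for a real windows path (the bytes are made up for illustration):

```python
# three UTF-16LE code units: "a", an unpaired high surrogate 0xD800, "b"
wide = b"a\x00\x00\xd8b\x00"

# strict utf-16 decoding rejects the unpaired surrogate
try:
    wide.decode("utf-16-le")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogatepass keeps the lone surrogate in the str instead of failing
path = wide.decode("utf-16-le", errors="surrogatepass")

# sending it back to windows unchanged works
back_to_windows = path.encode("utf-16-le", errors="surrogatepass")

# but it cannot be sent as strict UTF-8 over the network
try:
    path.encode("utf-8")
    network_ok = True
except UnicodeEncodeError:
    network_ok = False

# one lossy way out: replace what can't be encoded
lossy = path.encode("utf-8", errors="replace")
```

so 'save the errors and send them back later' is fine as long as the data only ever goes back to windows; the moment it goes somewhere else you have to pick between failing and losing data.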
Unicode
TODO: is this unicode specific? or a rant on STRINGS?
- character set
- what is unicode
- bytes
- code points
- characters
- graphemes
- locales
- splitting things by space?
- nightmare windows APIs
- normalization
- CLDR
- languages, rich data, paragraphs, etc
- length
- languages
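a quick python illustration of why 'length' and normalization are on this list. it only covers bytes and code points, since the python standard library has no grapheme cluster segmentation:

```python
import unicodedata

composed = "\u00e9"     # é as one code point
decomposed = "e\u0301"  # e followed by a combining acute accent

# same user-perceived character, different lengths at every level
code_points = (len(composed), len(decomposed))
utf8_bytes = (len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))
utf16_bytes = (
    len(composed.encode("utf-16-le")),
    len(decomposed.encode("utf-16-le")),
)

# the two strings compare unequal until they are normalized to the same form
equal_raw = composed == decomposed
equal_nfc = unicodedata.normalize("NFC", decomposed) == composed
equal_nfd = unicodedata.normalize("NFD", composed) == decomposed
```

a reader would count one character in both strings, and none of these numbers say that.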
Language strings
- c strings
- bytestring
- higher level strings
- js strings
- etc?
Idea dump
unicode handling across languages
perl unicode
- OS bytes
- char/wchar
- bytes
- characters
- utf-8
- utf-8b
- wtf-8
- opaqueness
- locales
- non-unicode
- bytes as strings kinda works better
- round trips
- perl
- c
- scheme
- formatting bytes/etc
- native format as utf-8? what?
- bytes -> maybe unicode -> unicode -> graphemes/text/etc
user-perceived character / grapheme cluster
- scripts
- wchar
- https://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default
- runes
- rust char
https://stackoverflow.com/questions/12450750/how-can-i-work-with-raw-bytes-in-perl