Unicode guide

From JookWiki

Revision as of 21:17, 13 March 2022

'''This is a WIP page, take nothing here as final.'''

== ASCII strings ==
TODO: title

TODO: I need to make a list of things you would want to do with a string and what it requires (locale, code points, etc.)

Programming traditionally used the concept of opaque strings for handling text. The idea is simple:

* A string is an array of bytes

Usually you can do the following operations with strings:

* Split it into multiple strings
* Count how many characters are in the string
* Convert the string to uppercase
* Convert the string to lowercase
* Compare it to other strings
* Sort a list of strings

The rules used for these operations are specified by a locale, which defines a language, geographic region and any other small differences. This is because character sets can be shared between languages but still have different rules.
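
To make the locale point concrete, here is a small Python sketch; the de_DE.UTF-8 locale name is only an example and must be installed for the call to succeed:

<syntaxhighlight lang="python">
import locale

words = ["zebra", "Äpfel", "apple"]

# Plain sorted() compares code points, so "Äpfel" ends up after "zebra".
print(sorted(words))

# Collating under a locale applies that language's rules instead.
# "de_DE.UTF-8" is only an example and must be installed on the system.
locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")
print(sorted(words, key=locale.strxfrm))
</syntaxhighlight>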


- devs just assumed the encoding was ASCII compatible

stuff i just wrote to friends:

developers often didn't know the encoding of whatever text they had, so they just kinda ignored 'unknown' bytes and stuck to ASCII bytes. this worked because most character encodings were ASCII-compatible
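a quick python sketch of why that worked: bytes in the ASCII range decode the same under most legacy encodings, it's only the high bytes that disagree (the encodings picked here are just examples):

<syntaxhighlight lang="python">
data = b"hello, world"

# Bytes in the ASCII range mean the same thing under most legacy encodings,
# which is why ignoring the real encoding often appeared to work.
for enc in ("ascii", "latin-1", "cp1252", "utf-8"):
    print(enc, data.decode(enc))

# Outside ASCII the same byte means different things in different encodings:
print(b"\xa4".decode("latin-1"))       # '¤'
print(b"\xa4".decode("iso-8859-15"))   # '€'
</syntaxhighlight>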

unicode was invented to have a standard character set developers could reason about

so then languages started to add support for unicode strings, which are collections of unicode characters. this is pretty good, a+ for that, it solves working with characters
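for example, in python terms (just a sketch):

<syntaxhighlight lang="python">
# UTF-8 bytes in, a sequence of code points out.
s = b"na\xc3\xafve".decode("utf-8")
print(s)                          # 'naïve'
print(len(s))                     # 5 code points, not 6 bytes
print([hex(ord(c)) for c in s])   # the code points themselves
</syntaxhighlight>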

buuut you still have two unsolved issues:

- what locale is a unicode string in?

- how do you deal with non-unicode data?

you may ask 'when would you ever need to deal with non-unicode data?' and the answer is when using linux or windows of course! linux stores its strings as bytes and windows as 16-bit units, with no promise that either is valid unicode! yay!
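for instance, on linux you can ask for file names as bytes and that's exactly what you get, with no promise they decode as anything (python sketch):

<syntaxhighlight lang="python">
import os

# Asking for directory entries with a bytes path returns bytes names.
# Nothing guarantees they decode as UTF-8 (or anything else).
for name in os.listdir(b"."):
    print(name)
</syntaxhighlight>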

python/rust solve dealing with non-unicode data by smuggling encoding errors into unicode strings

this effectively lets you embed arbitrary bytes into unicode strings. fucking weird. probably bad for security too
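this is what python's errors='surrogateescape' (PEP 383) does; a rough sketch of the smuggling and where it bites:

<syntaxhighlight lang="python">
# Bytes that are not valid UTF-8 (say, a Latin-1 file name).
raw = b"caf\xe9"

# surrogateescape maps each bad byte to a lone surrogate code point
# instead of failing, so the bytes survive inside the str.
s = raw.decode("utf-8", errors="surrogateescape")
print(repr(s))                                              # 'caf\udce9'
print(s.encode("utf-8", errors="surrogateescape") == raw)   # True: round-trips

# But anything expecting real Unicode chokes on the smuggled bytes:
try:
    s.encode("utf-8")
except UnicodeEncodeError as e:
    print("strict encode fails:", e)
</syntaxhighlight>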

mojibake

so then you're back to square one, you have a string of unknown encoding and no information of the locale the string is in

it's also possible to decode unknown bytes as valid unicode but invalid text by accident, with encodings that accidentally line up
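the classic case is utf-8 bytes read as latin-1: it never fails, it just quietly produces garbage (python sketch):

<syntaxhighlight lang="python">
original = "déjà vu"
utf8_bytes = original.encode("utf-8")

# Decoding UTF-8 bytes as Latin-1 never raises -- every byte value is
# "valid" Latin-1 -- but the result is mojibake, not the intended text.
print(utf8_bytes.decode("latin-1"))   # 'dÃ©jÃ  vu'
</syntaxhighlight>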

i have to think more about it, but the core problem in both situations is there's no reliable out-of-band way to know what locale and encoding a string is in

i'm not sure what value unicode strings have

basically:

- strings are sequences of values

- values may be unicode code points, bytes or metadata

- metadata contains info about the language/locale

of course you might have some more object-oriented way to do this, such as strings carrying their encoding and locale and being able to chain them together
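a very rough sketch of that idea in python, all the names here are made up:

<syntaxhighlight lang="python">
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaggedText:
    """Decoded text plus whatever we know about where it came from."""
    text: str                       # the code points
    raw: Optional[bytes] = None     # original bytes, if we still have them
    encoding: Optional[str] = None  # how raw was (believed to be) encoded
    locale: Optional[str] = None    # e.g. "tr_TR", needed for casing/collation

name = TaggedText(text="naïve", raw=b"na\xefve", encoding="latin-1", locale="fr_FR")
print(name.text, name.encoding, name.locale)
</syntaxhighlight>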

if you can't decode a string properly, you don't want to just be sending random bytes to places that aren't prepared for that, that's all. so when you're sending data to an interface you need to strip the bytes or convert them back to an encoding the system can deal with
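in python terms that's the difference between encoding with errors='replace' (strip the junk) and errors='surrogateescape' (give the original bytes back), e.g.:

<syntaxhighlight lang="python">
# A string carrying smuggled bytes, as in the surrogateescape example above.
s = b"caf\xe9".decode("utf-8", errors="surrogateescape")

# Option 1: sanitise before passing it on -- the bad data is gone for good.
print(s.encode("utf-8", errors="replace"))           # b'caf?'

# Option 2: give the original bytes back to the system untouched.
print(s.encode("utf-8", errors="surrogateescape"))   # b'caf\xe9'
</syntaxhighlight>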

a problem is that when you don't know the input's encoding you could try decoding it as utf-8 or something else, but if that fails you have to decide what to do with the bytes that don't decode

on windows for example they say their directory paths are utf-16, but in reality they're just arrays of 16-bit values. so what do you do when you get an error decoding?

- crash and not be useful

- save those errors and ignore them while doing stuff but send them back to windows later

in this case it's fine because the worst that can happen is you'll get mojibake

but if you, say, read a windows path and send it over the network, you might send invalid bytes
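a rough python sketch of the windows situation, using errors='surrogatepass' to keep the lone surrogate around; the bytes are made up, but windows really does allow unpaired surrogates in paths:

<syntaxhighlight lang="python">
# Pretend this came from a Windows API: UTF-16-LE code units for "a",
# an unpaired high surrogate (0xD800), then "b". Windows allows this.
raw = b"a\x00\x00\xd8b\x00"

try:
    raw.decode("utf-16-le")   # the "crash and not be useful" option
except UnicodeDecodeError as e:
    print("strict decode fails:", e)

# surrogatepass keeps the lone surrogate so the value can be handed
# back to Windows later, byte for byte.
s = raw.decode("utf-16-le", errors="surrogatepass")
print(s.encode("utf-16-le", errors="surrogatepass") == raw)   # True

# But the string is not valid Unicode, so e.g. strict UTF-8 for the
# network refuses it:
try:
    s.encode("utf-8")
except UnicodeEncodeError as e:
    print("can't send this as-is:", e)
</syntaxhighlight>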

== Unicode ==

TODO: is this unicode specific? or a rant on STRINGS?

- character set

- what is unicode

- bytes

- code points

- characters

- graphemes

- locales

- splitting things by space?

- nightmare windows APIs

- normalization

- CLDR

- languages, rich data, paragraphs, etc

- length (see the sketch after this list)

- languages
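
a tiny python sketch for the normalization/length bullets above:

<syntaxhighlight lang="python">
import unicodedata

nfc = "é"                                  # U+00E9, precomposed
nfd = unicodedata.normalize("NFD", nfc)    # 'e' + U+0301 combining accent

print(nfc == nfd)                          # False until you normalize both sides
print(len(nfc), len(nfd))                  # 1 vs 2 code points, same "character"
print(len(nfc.encode("utf-8")), len(nfd.encode("utf-8")))   # 2 vs 3 bytes
</syntaxhighlight>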

== Language strings ==

- c strings

- bytestring

- higher level strings

- js strings

- etc?

== Idea dump ==

unicode handling across languages

perl unicode

- OS bytes

- char/wchar

- bytes

- characters

- utf-8

- utf-8b

- wtf-8

- opaqueness

- locales

- non-unicode

- bytes as strings kinda works better

- round trips

- perl

- c

- scheme

- formatting bytes/etc

- native format as utf-8? what?

- bytes -> maybe unicode -> unicode -> graphemes/text/etc

user-perceived character / grapheme cluster

- scripts

- wchar

- https://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default

- runes

- rust char

https://stackoverflow.com/questions/12450750/how-can-i-work-with-raw-bytes-in-perl