Unicode guide

This is a WIP page, take nothing here as final.
== Introduction ==
Over the past decade it's become increasingly common to see programming languages add Unicode support: specifically, support for Unicode strings. This is a good step, but it's not nearly complete and often done in a buggy way. Hopefully in this page I can show what's wrong with this approach and provide some solutions.

TODO: work these notes into the introduction:
- felt pretty smart
- had a cool solution proposed on how to fix everything
- sat down and read the Unicode standard
- unicode strings are wholly inadequate as a toolkit for unicode text processing

== Unicode refresher ==
- refresher on unicode: character set, encodings, CLDR, graphemes, normalization, collation, locales (see the sketch below)
- unicode strings?
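A minimal Python sketch of the first distinction above: the character set assigns each character a code point, while encodings map that code point to different byte sequences.

<syntaxhighlight lang="python">
# One character, one code point, several byte encodings.
ch = "é"                       # U+00E9 in the Unicode character set
print(f"U+{ord(ch):04X}")      # U+00E9

print(ch.encode("utf-8"))      # b'\xc3\xa9' (two bytes)
print(ch.encode("utf-16-le"))  # b'\xe9\x00' (two different bytes)
print(ch.encode("latin-1"))    # b'\xe9'     (one byte, legacy encoding)
</syntaxhighlight>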
 
== Strings ==
- how they're encoded
 
- ops you can perform on them
 
- unicode strings are c strings but unicode (see the sketch below)
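To make the last bullet concrete, a small Python sketch: string operations count code points the way C string operations count bytes, with no awareness of graphemes, locales or normalization.

<syntaxhighlight lang="python">
# Unicode strings behave like C strings with code points in place of
# bytes: len(), indexing and slicing see only code points.
s = "naïve"            # ï as one precomposed code point
print(len(s))          # 5
print(s[2])            # 'ï'

t = "nai\u0308ve"      # same text, 'i' + combining diaeresis
print(len(t))          # 6 -- a different "length" for the same text
print(repr(t[2]))      # 'i' -- indexing splits off a combining mark
</syntaxhighlight>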
 
== Low-level APIs ==
- filesystems
 
- windows
 
- unix
 
- pattern of mutating data
 
- should be non-destructive (see the sketch below)
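A Python sketch of the non-destructive pattern for filesystem APIs: on Unix, filenames are bytes, and Python's fsdecode/fsencode pair (built on the surrogateescape error handler) round-trips bytes that aren't valid UTF-8 instead of mangling them.

<syntaxhighlight lang="python">
import os

# Unix filenames are bytes; b'caf\xe9' (Latin-1 'café') is a legal
# name even though it isn't valid UTF-8.
raw = b"caf\xe9"

# fsdecode smuggles the bad byte through as a surrogate code point,
# and fsencode reverses it exactly: nothing is destroyed.
name = os.fsdecode(raw)        # 'caf\udce9'
assert os.fsencode(name) == raw

# Passing bytes to listdir gets bytes back, skipping decoding entirely.
print(os.listdir(b".")[:3])
</syntaxhighlight>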
 
 
 
 
 
 
 
OLD SHIT TO REMOVE LOLOLOL:


== ASCII strings ==
TODO: title
- encoding-neutral but really it's ascii
 
TODO: I need to make a list of things you would want to do with a string and what it requires (locale, code points, etc)
 
- devs just assumed the encoding was ASCII-compatible
- character set
- strings
- ops
- OS APIs provide strings
- simple, english based
- works with ascii-compatible encodings
- you don't have to learn anything complicated
 
stuff i just wrote to friends:

Developers often didn't know the encoding of whatever text they had, so they just kind of ignored 'unknown' bytes and stuck to ASCII bytes. This worked because most character encodings were ASCII-compatible.
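A small Python sketch of why that worked: in an ASCII-compatible encoding such as UTF-8 or Latin-1, the ASCII bytes always mean the same thing, so you can work with them while passing the unknown bytes through untouched.

<syntaxhighlight lang="python">
# Split a line of unknown (but ASCII-compatible) encoding on an ASCII
# comma. The non-ASCII bytes are opaque, but they pass through intact.
line = b"name,city,caf\xc3\xa9"   # happens to be UTF-8, but we don't know that
fields = line.split(b",")         # b',' is ASCII, so this is safe
print(fields)                     # [b'name', b'city', b'caf\xc3\xa9']
</syntaxhighlight>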
 
Unicode was invented to give developers a standard character set they could reason about.

So then languages started adding support for Unicode strings, which are collections of Unicode characters. This is pretty good, A+ for that: it solves working with characters.

But you still have two unsolved issues:

- what locale is a Unicode string in?

- how do you deal with non-Unicode data?

You may ask, 'when would you ever need to deal with non-Unicode data?' The answer is when using Linux or Windows, of course! Linux and Windows store their strings as bytes! yay!
 
Python and Rust solve dealing with non-Unicode data by smuggling encoding errors into Unicode strings.

This effectively lets you embed arbitrary bytes into Unicode strings. Fucking weird. Probably bad for security too.
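Here's what that smuggling looks like in Python, which uses the surrogateescape error handler for this (a minimal sketch):

<syntaxhighlight lang="python">
raw = b"caf\xe9"          # Latin-1 'café': not valid UTF-8

# A strict decode refuses the bad byte...
try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)

# ...but surrogateescape smuggles it in as a lone surrogate, U+DCE9,
# producing a "Unicode" string that no well-formed text can contain.
s = raw.decode("utf-8", errors="surrogateescape")
print(repr(s))            # 'caf\udce9'

# The same handler turns it back into the original bytes exactly.
assert s.encode("utf-8", errors="surrogateescape") == raw

# Encoding it normally blows up, which is where the trouble starts.
try:
    s.encode("utf-8")
except UnicodeEncodeError as e:
    print(e)
</syntaxhighlight>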
 
So then you're back to square one: you get mojibake, a string of unknown encoding, and no information about the locale the string is in.

It's also possible to accidentally decode unknown bytes as valid Unicode but invalid text, with encodings that happen to line up (see the sketch below).
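For example, Latin-1 can decode any byte sequence at all, so decoding UTF-8 bytes as Latin-1 'succeeds' and silently produces mojibake:

<syntaxhighlight lang="python">
raw = "café".encode("utf-8")   # b'caf\xc3\xa9'

# Latin-1 assigns a character to every byte value, so this never fails:
s = raw.decode("latin-1")
print(s)                       # 'cafÃ©' -- valid Unicode, wrong text
</syntaxhighlight>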
 
I have to think more about it, but the core problem in both situations is that there's no reliable, out-of-band way to know what locale and encoding a string is in.
 
I'm not sure what value Unicode strings have.

Basically:

- strings are sequences of values

- values may be Unicode code points, bytes or metadata

- metadata contains info about the language/locale

Of course you might have some more object-oriented way to do this, such as strings carrying their encoding and locale and being chainable together.
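A hedged sketch of that idea in Python; every name here is made up for illustration, not an existing API:

<syntaxhighlight lang="python">
# A string as a sequence of values: code points, raw bytes, metadata.
from dataclasses import dataclass
from typing import Union

@dataclass
class CodePoint:
    value: int          # a decoded Unicode code point

@dataclass
class RawByte:
    value: int          # a byte we couldn't (or chose not to) decode

@dataclass
class Metadata:
    language: str       # e.g. "fr"
    encoding: str       # e.g. "latin-1"

Value = Union[CodePoint, RawByte, Metadata]

# 'café' received as Latin-1, with the locale carried alongside the text:
tagged = [
    Metadata(language="fr", encoding="latin-1"),
    CodePoint(ord("c")), CodePoint(ord("a")), CodePoint(ord("f")),
    RawByte(0xE9),      # the é byte, preserved instead of smuggled
]
</syntaxhighlight>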
 
If you can't decode a string properly, you don't want to just send random bytes to places that aren't prepared for that. So when you're sending data to an interface, you need to strip the bad bytes or convert them back to an encoding the system can deal with.

A problem is that when you don't know the input's encoding, you could try decoding it as UTF-8 or something else. But if that fails you could just
 
On Windows, for example, they say their directory paths are UTF-16, but in reality they're just arrays of 16-bit units. So what do you do when you get an error decoding?

- crash and not be useful

- save those errors, ignore them while doing your work, and send them back to Windows later

In that case it's fine, because the worst that can happen is mojibake.

But if you, say, read a Windows path and send it over the network, you might send invalid bytes (see the sketch below).
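A Python sketch of that situation, using the surrogatepass error handler to model the second option; the path here is a made-up example:

<syntaxhighlight lang="python">
# 'A' followed by a lone surrogate (U+D800): legal in a Windows
# filename, but not well-formed UTF-16.
units = b"A\x00\x00\xd8"              # UTF-16-LE bytes

try:
    units.decode("utf-16-le")         # option one: crash
except UnicodeDecodeError as e:
    print(e)

# Option two: let the error through and hand it back later.
s = units.decode("utf-16-le", errors="surrogatepass")
assert s.encode("utf-16-le", errors="surrogatepass") == units

# Round-tripping back to Windows is fine; re-encoding as UTF-8 for the
# network is where the invalid data escapes.
try:
    s.encode("utf-8")
except UnicodeEncodeError as e:
    print(e)
</syntaxhighlight>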
 
== Unicode ==
TODO: is this unicode specific? or a rant on STRINGS?

- character set

- what is unicode
 
- bytes
 
- code points
 
- characters
 
- graphemes
 
- locales
 
- splitting things by space?
 
- nightmare windows APIs
 
- normalization
 
- CLDR
 
- languages, rich data, paragraphs, etc
 
- length (see the sketch below)
 
- languages
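On normalization and length, a quick Python illustration of why "length" needs a definition:

<syntaxhighlight lang="python">
import unicodedata

nfc = "\u00e9"       # é as one precomposed code point (NFC)
nfd = "e\u0301"      # e + combining acute accent (NFD)

print(nfc == nfd)    # False: same text, different code points
print(len(nfc), len(nfd))                          # 1 2
print(unicodedata.normalize("NFC", nfd) == nfc)    # True
</syntaxhighlight>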
 
== Language strings ==
- c strings
 
- bytestring
 
- higher level strings
 
- js strings
 
- etc?
 
== Idea dump ==
- unicode handling across languages

- perl unicode

- OS bytes

- char/wchar

- bytes

- characters


== Unicode strings ==
- utf-8

- utf-8b

- wtf-8

- OS APIs

- opaqueness

- locales

- non-unicode

- bytes as strings kinda works better

- round trips

- perl

- c

- scheme

- formatting bytes/etc

- string APIs make less sense

- native format as utf-8? what?

- locale tagging

- bytes -> maybe unicode -> unicode -> graphemes/text/etc (see the sketch below)

- user-perceived character / grapheme cluster

- bytestrings

- scripts

- poorly defined semantics

- wchar

- runes

- rust char

- <nowiki>https://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default</nowiki>

- <nowiki>https://stackoverflow.com/questions/12450750/how-can-i-work-with-raw-bytes-in-perl</nowiki>
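A hedged sketch of the layering idea in the list above (bytes -> maybe unicode -> unicode -> graphemes/text); the function name is made up:

<syntaxhighlight lang="python">
from typing import Optional

# bytes -> maybe unicode: decoding can fail, so model that in the type.
def maybe_decode(raw: bytes, encoding: str = "utf-8") -> Optional[str]:
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None

text = maybe_decode(b"caf\xc3\xa9")
if text is not None:
    # unicode -> graphemes/text needs a segmentation library (ICU or
    # the third-party regex module); the stdlib only sees code points.
    print([f"U+{ord(c):04X}" for c in text])
</syntaxhighlight>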
== Non-destructive text processing ==
- clear, unicode definitions

- rich text

- multiple versions

- metadata

- non-reversible (see the sketch below)
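On "non-reversible": operations like case folding destroy information, which is one argument for keeping the original text and deriving from it. A tiny Python illustration:

<syntaxhighlight lang="python">
original = "Straße"
key = original.casefold()   # 'strasse' -- there's no way back to 'Straße'
print(original, key)        # keep the original; derive, don't overwrite
</syntaxhighlight>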
