Unicode guide/Implementations

From JookWiki
Revision as of 04:44, 20 March 2022 by Jookia (More flavor text)

This page is my attempt to document my research on unique or popular Unicode implementations supported in various languages and software. This is mostly in note form to avoid things getting out of control.

I apologize for not attaching sources to all of this. As always, double-check this in case things have changed.

C and C++

C and C++ provide limited functionality related to text handling.

  • Character type: 8-bit, 16-bit or 32-bit, encoding not defined
  • Byte strings: No, just regular arrays
  • Internal encoding: None
  • String encoding: Depends on locale
  • Supports bytes in strings: Depends on locale encoding
  • Supports surrogates in strings: Depends on locale encoding
  • Supports invalid code points in strings: Depends on locale encoding
  • Supports normalizing strings: No
  • Supports querying character properties: No
  • Supports breaking by code point: No
  • Supports breaking by extended grapheme cluster: No
  • Supports breaking by text boundaries: No
  • Supports encoding and decoding to other encodings: Yes
  • Supports Unicode regex extensions: Not applicable, no regex included
  • Classifies by: Locale information, only supports single characters
  • Collates by: Locale information, supports arbitrary strings
  • Converts case by: Locale information, only supports single characters
  • Locale tailoring is done by: Current locale
  • Wraps operating system APIs with Unicode ones: No

This could be classified as 'Unicode agnostic', except that classification and case conversion are limited to single characters. As a result, it's broken even within the limited functionality it provides.

Different platforms usually provide a clearer definition:

  • On POSIX, characters are usually 8-bit ASCII-compatible values
  • On Windows, characters are 16-bit UTF-16-compatible values
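Classification and collation in C go through the current locale. Python's locale module wraps the same C library calls (setlocale, strcoll), so the model can be sketched without writing C; the "C" locale used here is the portable baseline, under which collation is plain byte order.

```python
# Sketch: C's locale-dependent text handling, driven through Python's
# locale module, which wraps the underlying C library functions
# setlocale(3) and strcoll(3).
import locale

# The "C" locale is the portable default every implementation provides.
locale.setlocale(locale.LC_ALL, "C")

# strcoll compares strings using the current locale's collation rules,
# just like C's strcoll(3); in the "C" locale this is plain byte order.
assert locale.strcoll("a", "b") < 0    # "a" sorts before "b"
assert locale.strcoll("b", "a") > 0
assert locale.strcoll("a", "a") == 0
```

Swapping in a different locale with setlocale changes the comparison results, which is the whole of C's "tailoring" story: one process-wide mutable setting.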

Lua

Lua describes itself as 'encoding-agnostic', whatever that means. It certainly handles ASCII well.

  • Character type: Byte, encoding not defined
  • Byte strings: No
  • Internal encoding: None
  • String encoding: Undefined
  • Supports bytes in strings: Depends on encoding
  • Supports surrogates in strings: Depends on encoding
  • Supports invalid code points in strings: Depends on encoding
  • Supports normalizing strings: No
  • Supports querying character properties: No
  • Supports breaking by code point: Yes if encoded in UTF-8
  • Supports breaking by extended grapheme cluster: No
  • Supports breaking by text boundaries: No
  • Supports encoding and decoding to other encodings: No
  • Supports Unicode regex extensions: Not applicable, no regex at all
  • Classifies by: C APIs, maybe by locale, only supports 8-bit characters
  • Collates by: Doesn't provide an API for this
  • Converts case by: C APIs, maybe by locale, only supports 8-bit characters
  • Locale tailoring is done by: Per-process C locale
  • Wraps operating system APIs with Unicode ones: No

Overall there's no clear path here from reading bytes to handling Unicode.

Bonus points for the string.reverse function, which will break Unicode strings.
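The breakage is easy to demonstrate. Lua's string.reverse reverses bytes, not code points, so reversing the UTF-8 bytes of any non-ASCII string tears multi-byte sequences apart; the same operation sketched in Python:

```python
# Reversing the raw UTF-8 bytes of a string (what Lua's string.reverse
# does) puts continuation bytes before their lead bytes, producing
# invalid UTF-8.
data = "héllo".encode("utf-8")     # b'h\xc3\xa9llo'
reversed_bytes = data[::-1]        # b'oll\xa9\xc3h'
try:
    reversed_bytes.decode("utf-8")
except UnicodeDecodeError:
    print("byte-reversal produced invalid UTF-8")
```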

Python

Python spent an enormous amount of time adding Unicode support in version 3.

  • Character type: Unicode code point
  • Byte strings: Yes
  • Internal encoding: 8-bit, 16-bit or 32-bit code units, depending on the string's contents
  • String encoding: Unicode code points
  • Supports bytes in strings: Yes, encoded as surrogates
  • Supports surrogates in strings: Yes
  • Supports invalid code points in strings: No
  • Supports normalizing strings: Yes
  • Supports querying character properties: Yes
  • Supports breaking by code point: Yes
  • Supports breaking by extended grapheme cluster: No
  • Supports breaking by text boundaries: No
  • Supports encoding and decoding to other encodings: Yes
  • Supports Unicode regex extensions: No
  • Classifies by: Unicode properties
  • Collates by: Doesn't provide an API for this
  • Converts case by: Unicode database
  • Locale tailoring is done by: Doesn't provide an API for this
  • Wraps operating system APIs with Unicode ones: Yes, with invalid bytes encoded as surrogates

This is better than most languages, but still not what most people would want.
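Two of the claims above are worth seeing concretely: arbitrary bytes survive a decode/encode round trip by being smuggled through as lone surrogates (PEP 383's surrogateescape handler, also how Python wraps OS APIs), and normalization and property queries come from the unicodedata module:

```python
import unicodedata

# Bytes that aren't valid UTF-8 are mapped to lone surrogates
# (U+DC80..U+DCFF) so they can round-trip through a str.
raw = b"caf\xff"
s = raw.decode("utf-8", "surrogateescape")
assert s == "caf\udcff"                      # 0xFF became U+DCFF
assert s.encode("utf-8", "surrogateescape") == raw

# Normalization and character property queries via unicodedata.
assert unicodedata.normalize("NFC", "e\u0301") == "\u00e9"
assert unicodedata.category("\u00e9") == "Ll"
```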

Rust

Rust seems to define its API around UTF-8.

  • Character type: Unicode scalar
  • Byte strings: Yes
  • Internal encoding: UTF-8
  • String encoding: UTF-8
  • Supports bytes in strings: No
  • Supports surrogates in strings: No
  • Supports invalid code units in strings: No
  • Supports normalizing strings: No
  • Supports querying character properties: No
  • Supports breaking by code point: Yes
  • Supports breaking by extended grapheme cluster: No
  • Supports breaking by text boundaries: Whitespace only
  • Supports encoding and decoding to other encodings: UTF-16 only
  • Supports Unicode regex extensions: Not applicable, doesn't include regex
  • Classifies by: Unicode properties
  • Collates by: Doesn't provide an API for this
  • Converts case by: Unicode database
  • Locale tailoring is done by: Doesn't provide an API for this
  • Wraps operating system APIs with Unicode ones: Optional

It seems more well-formed than Python's, but the lack of normalization support confuses me.
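The 'No' answers for bytes and surrogates follow directly from Rust guaranteeing its strings are well-formed UTF-8: well-formed UTF-8 cannot represent the surrogate range at all. Python's strict UTF-8 codec enforces the same rule, which makes it a convenient place to see it:

```python
# Well-formed UTF-8 excludes the surrogate range U+D800..U+DFFF, which
# is why a UTF-8-guaranteed string type can't hold surrogates.
encode_failed = False
try:
    "\ud800".encode("utf-8")        # lone surrogate: refuses to encode
except UnicodeEncodeError:
    encode_failed = True
assert encode_failed

# The surrogate's would-be byte sequence is rejected on decode too.
decode_failed = False
try:
    b"\xed\xa0\x80".decode("utf-8")
except UnicodeDecodeError:
    decode_failed = True
assert decode_failed
```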

Java

Java is one of the early adopters of Unicode.

  • Character type: 16-bit
  • Byte strings: Yes
  • Internal encoding: 16-bit integers
  • String encoding: UTF-16
  • Supports bytes in strings: No
  • Supports surrogates in strings: No
  • Supports invalid code units in strings: Yes
  • Supports normalizing strings: Yes
  • Supports querying character properties: Yes
  • Supports breaking by code point: Yes
  • Supports breaking by extended grapheme cluster: No
  • Supports breaking by text boundaries: No
  • Supports encoding and decoding to other encodings: Yes
  • Supports Unicode regex extensions: Yes
  • Classifies by: Unicode properties
  • Collates by: Unicode properties and locale
  • Converts case by: Unicode database and locale
  • Locale tailoring is done by: Providing a Locale object
  • Wraps operating system APIs with Unicode ones: Yes

Not great, especially since you can't use non-Unicode file paths.
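Java's 16-bit char is the legacy of adopting Unicode before it outgrew 16 bits: astral code points like U+1F600 take two chars (a surrogate pair). The pair arithmetic, sketched in Python and cross-checked against a real UTF-16 encoder:

```python
# Deriving the UTF-16 surrogate pair for an astral code point.
cp = 0x1F600                     # 😀, outside the Basic Multilingual Plane
v = cp - 0x10000                 # 20-bit value to split across the pair
high = 0xD800 + (v >> 10)        # high (lead) surrogate: top 10 bits
low = 0xDC00 + (v & 0x3FF)       # low (trail) surrogate: bottom 10 bits
assert (high, low) == (0xD83D, 0xDE00)

# Cross-check against Python's UTF-16 encoder (big-endian, no BOM).
assert "\U0001F600".encode("utf-16-be") == b"\xd8\x3d\xde\x00"
```

This is also why 'breaking by code point' takes actual work in Java: indexing into a String gives you code units, and naive index arithmetic can split a pair.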

JavaScript

JavaScript follows Java's model, but with a few fewer features.

  • Character type: 16-bit
  • Byte strings: No
  • Internal encoding: 16-bit integers
  • String encoding: UTF-16
  • Supports bytes in strings: No
  • Supports surrogates in strings: No
  • Supports invalid code units in strings: Yes
  • Supports normalizing strings: Yes
  • Supports querying character properties: No
  • Supports breaking by code point: Yes
  • Supports breaking by extended grapheme cluster: No
  • Supports breaking by text boundaries: No
  • Supports encoding and decoding to other encodings: No
  • Supports Unicode regex extensions: Yes
  • Classifies by: Unicode properties
  • Collates by: Unicode properties and locale
  • Converts case by: Unicode database and locale
  • Locale tailoring is done by: Providing a locale object
  • Wraps operating system APIs with Unicode ones: Yes

Swift

Swift has an interesting idea about defining characters as extended grapheme clusters.

  • Character type: Multiple Unicode scalars making an extended grapheme cluster
  • Byte strings: Yes
  • Internal encoding: 16-bit integers
  • String encoding: Extended grapheme cluster
  • Supports bytes in strings: No
  • Supports surrogates in strings: No
  • Supports invalid code units in strings: No
  • Supports normalizing strings: Automatic for equality, otherwise I don't know
  • Supports querying character properties: No
  • Supports breaking by code point: Yes
  • Supports breaking by extended grapheme cluster: Yes
  • Supports breaking by text boundaries: No
  • Supports encoding and decoding to other encodings: Yes
  • Supports Unicode regex extensions: I don't know
  • Classifies by: I don't know
  • Collates by: I don't know
  • Converts case by: I don't know
  • Locale tailoring is done by: I don't know
  • Wraps operating system APIs with Unicode ones: I don't know

Documentation on Swift is kind of hard to read and vague, so I can't answer a few of these questions.
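What makes Swift's model distinctive is that counting and iteration operate on user-perceived characters, where everyone else counts code points or code units. The gap is visible with any combining sequence; the sketch below uses a deliberately simplified joining rule (combining marks extend the preceding cluster), not the full UAX #29 segmentation Swift actually implements:

```python
import unicodedata

s = "e\u0301"                    # 'e' + COMBINING ACUTE ACCENT
assert len(s) == 2               # code point count: Python/Rust's view

# Crude grapheme counter: a combining mark extends the previous
# cluster. This is a simplification of UAX #29, not a real
# implementation of it.
def crude_grapheme_count(text):
    count = 0
    for ch in text:
        if count and unicodedata.combining(ch):
            continue             # attach to the preceding cluster
        count += 1
    return count

assert crude_grapheme_count(s) == 1   # Swift's Character count view
```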

Go

Go has a kind of unopinionated take here.

  • Character type: 32-bit integer
  • Byte strings: Yes
  • Internal encoding: None
  • String encoding: Unspecified, probably UTF-8
  • Supports bytes in strings: Yes
  • Supports surrogates in strings: No for UTF-8 strings
  • Supports invalid code units in strings: No for UTF-8 strings
  • Supports normalizing strings: No
  • Supports querying character properties: Yes
  • Supports breaking by code point: Yes for UTF-8 text
  • Supports breaking by extended grapheme cluster: No
  • Supports breaking by text boundaries: No
  • Supports encoding and decoding to other encodings: No
  • Supports Unicode regex extensions: Yes
  • Classifies by: Unicode properties
  • Collates by: Doesn't provide an API for this
  • Converts case by: Unicode properties
  • Locale tailoring is done by: Doesn't provide an API for this
  • Wraps operating system APIs with Unicode ones: No

I would probably put this between Lua and C for bad interfaces.
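One concrete consequence of Go's design: strings may hold arbitrary bytes, and `for _, r := range s` decodes UTF-8 on the fly, yielding U+FFFD for each invalid sequence. Python's "replace" error handler performs the same substitution, which makes the behavior easy to show:

```python
# Go's range-over-string substitutes U+FFFD (REPLACEMENT CHARACTER)
# for invalid UTF-8; Python's "replace" handler does the same.
raw = b"ab\xffcd"                        # 0xFF is never valid UTF-8
decoded = raw.decode("utf-8", "replace")
assert decoded == "ab\ufffdcd"
```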

Tcl

Squirrel

Perl

Ruby

Elixir

- erlang too?

Raku

Haskell

PHP

- narrow APIs

- mbstring