Unicode guide/Implementations: Difference between revisions

Revision as of 23:21, 19 March 2022

This page is my attempt to document my research on unique Unicode implementations supported in various languages and software. This is mostly in note form to avoid things getting out of control.

I apologize for not attaching sources to all of this; I've had to dig into the source code. My best advice here is to

C and C++

I'm not going to address wide characters because they have the same issues.

Character type: Byte, encoding not defined
Bytestrings: No, just regular arrays
Internal encoding: None
String encoding: Locale-dependent, ASCII-compatible
Supports bytes in strings: Depends on locale encoding
Supports surrogates in strings: Depends on locale encoding
Supports invalid code points in strings: Depends on locale encoding
Supports normalizing strings: No
Supports querying code point properties: No
Supports breaking by code point: No
Supports breaking by extended grapheme cluster: No
Supports breaking by text boundaries: No
Supports encoding and decoding to other encodings: Yes
Supports Unicode regex extensions: Not applicable, no regex included
Classifies by: Locale information, only supports 8-bit characters
Collates by: Locale information, supports arbitrary strings
Converts case by: Locale information, only supports 8-bit characters
Locale tailoring is done by: Current locale

This could be classified as 'Unicode agnostic'; however, classification and case conversion are limited to 8-bit characters. As a result, even the limited functionality it provides is broken for multi-byte encodings such as UTF-8.
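
To make the 8-bit limitation concrete, here is a rough sketch in Python rather than C. The helper `c_locale_toupper` is hypothetical: it only models what happens when a byte-at-a-time case conversion in the plain "C" locale is applied to UTF-8 text, and it is not a binding to the real C library.

```python
def c_locale_toupper(data: bytes) -> bytes:
    """Hypothetical model of per-byte case conversion in the "C" locale.

    Only the ASCII letters a-z are changed; every other byte is passed
    through, because a single byte of a multi-byte sequence carries no
    case information on its own.
    """
    return bytes(b - 32 if 0x61 <= b <= 0x7A else b for b in data)


text = "straße café"
encoded = text.encode("utf-8")

# Byte-wise conversion: ASCII letters change, 'ß' and 'é' are left alone.
print(c_locale_toupper(encoded).decode("utf-8"))  # STRAßE CAFé

# Unicode-aware conversion for comparison.
print(text.upper())  # STRASSE CAFÉ
```

The byte-wise result is still valid UTF-8, which is the 'agnostic' part, but it is not a correct uppercase form of the text.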

- POSIX, Windows

- wide APIs

Lua

Lua describes itself as 'encoding-agnostic', whatever that is. It certainly handles ASCII well.

Character type: Byte, encoding not defined
Bytestrings: No
Internal encoding: None
String encoding: Undefined
Supports bytes in strings: Depends on encoding
Supports surrogates in strings: Depends on encoding
Supports invalid code points in strings: Depends on encoding
Supports normalizing strings: No
Supports querying code point properties: No
Supports breaking by code point: Yes if encoded in UTF-8
Supports breaking by extended grapheme cluster: No
Supports breaking by text boundaries: No
Supports encoding and decoding to other encodings: No
Supports Unicode regex extensions: Not applicable, no regex at all
Classifies by: C APIs, maybe by locale, only supports 8-bit characters
Collates by: Doesn't provide an API for this
Converts case by: C APIs, maybe by locale, only supports 8-bit characters
Locale tailoring is done by: Per-process C locale
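
The "breaking by code point" entry in the list above depends on the text being UTF-8, where code point boundaries can be found from the byte values alone. A minimal sketch of that idea in Python, assuming well-formed input (the function name `utf8_code_point_offsets` is made up for illustration and is not a Lua or C API):

```python
def utf8_code_point_offsets(data: bytes) -> list:
    """Return the byte offset at which each code point starts.

    In UTF-8, continuation bytes have the bit pattern 10xxxxxx, so any
    byte that does not match that pattern begins a new code point.
    This sketch assumes the input is well-formed and does no validation.
    """
    return [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]


sample = "aé€😀".encode("utf-8")  # 1-, 2-, 3- and 4-byte sequences
print(utf8_code_point_offsets(sample))  # [0, 1, 3, 6]
```

This only splits by code point. It says nothing about grapheme clusters, normalization or character properties, which matches the "No" entries in the list.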

Overall there's no clear path here from reading bytes to handling Unicode.

Bonus points for the string.reverse function, which reverses bytes and so will break any Unicode string that contains multi-byte characters.
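
The effect of a byte-wise reverse can be sketched in Python. This is an illustration of the general problem, assuming the string is stored as UTF-8. It is not Lua code, and the two helper names are hypothetical.

```python
def reverse_bytes(text: str) -> bytes:
    """Reverse the UTF-8 bytes of text, as a byte-oriented reverse would."""
    return text.encode("utf-8")[::-1]


def reverse_code_points(text: str) -> str:
    """Reverse by code point instead, which keeps each sequence intact."""
    return text[::-1]


word = "héllo"  # 'é' is the two bytes C3 A9 in UTF-8

broken = reverse_bytes(word)
print(broken)  # b'oll\xa9\xc3h'

try:
    broken.decode("utf-8")
except UnicodeDecodeError as err:
    # The continuation byte A9 now comes before its lead byte C3.
    print("no longer valid UTF-8:", err.reason)

print(reverse_code_points(word))  # olléh
```

Reversing by code point keeps the bytes valid, but it would still misplace combining marks, so it is not a complete fix either.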

Python 2

- check if this is worth mentioning

Rust

Java

Swift

Go

Kotlin

Python 3

python 3

Tcl

Squirrel

Perl

Ruby

Zig

Elixir

- erlang too?

Raku

Haskell

PHP

- narrow APIs

- mbstring

JavaScript

@@ Line 4: / Line 4: @@
 == C and C++ ==
-- C/C++ unicode
-Support varies according to platform, see the POSIX and Windows sections below.
-== POSIX ==
 I'm not going to address wide characters because they have the same issues.
 * Character type: Byte, encoding not defined
-* Bytestrings: Yes, as arrays
+* Bytestrings: No, just regular arrays
 * Internal encoding: None
 * String encoding: Locale-dependant, ASCII-compatible
@@ Line 24: / Line 19: @@
 * Supports breaking by text boundaries: No
 * Supports encoding and decoding to other encodings: Yes
-* Supports Unicode regex extensions: No
+* Supports Unicode regex extensions: Not applicable, no regex included
 * Classifies by: Locale information, only supports 8-bit characters
 * Collates by: Locale information, supports arbitrary strings
 * Converts case by: Locale information, only supports 8-bit characters
-* Locale tailoring is done by: Per-thread POSIX locale
+* Locale tailoring is done by: Current locale
 This could be classified as 'Unicode agnostic' however classification and case conversion is limited to 8-bit characters. As a result this is just broken even with the limited functionality it provides.
+- POSIX, windows
+- wide APIs
 == Lua ==
@@ Line 59: / Line 58: @@
 == Python 2 ==
+- check if this is worth mentioning
-== Windows ==
-- wide APIs
 == Rust ==
@@ Line 73: / Line 70: @@
 == Kotlin ==
-== Python ==
+== Python 3 ==
 python 3