Unicode guide/Implementations: Difference between revisions

From JookWiki
(Add rust)
(→‎Java: Add section)

Revision as of 03:44, 20 March 2022

This page is my attempt to document my research on unique Unicode implementations supported in various languages and software. This is mostly in note form to avoid things getting out of control.

I apologize for not attaching sources to all of this; I've had to dig into the source code. My best advice here is to

C and C++

C and C++ provide limited functionality related to text handling.

  • Character type: 8-bit, 16-bit or 32-bit, encoding not defined
  • Byte strings: No, just regular arrays
  • Internal encoding: None
  • String encoding: Depends on locale
  • Supports bytes in strings: Depends on locale encoding
  • Supports surrogates in strings: Depends on locale encoding
  • Supports invalid code points in strings: Depends on locale encoding
  • Supports normalizing strings: No
  • Supports querying character properties: No
  • Supports breaking by code point: No
  • Supports breaking by extended grapheme cluster: No
  • Supports breaking by text boundaries: No
  • Supports encoding and decoding to other encodings: Yes
  • Supports Unicode regex extensions: Not applicable, no regex included
  • Classifies by: Locale information, only supports single characters
  • Collates by: Locale information, supports arbitrary strings
  • Converts case by: Locale information, only supports single characters
  • Locale tailoring is done by: Current locale
  • Wraps operating system APIs with Unicode ones: No

This could be classified as 'Unicode agnostic', but classification and case conversion are limited to single characters, so a mapping where one character becomes two cannot be expressed. As a result this is broken even within the limited functionality it provides.
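To show why a single-character case API falls short, here is a minimal sketch in Python (not C). The helper `upper_per_character` is hypothetical and only models the shape of a one-in, one-out function such as `toupper`; it is not how any C library is implemented.

```python
def upper_per_character(text: str) -> str:
    """Model a toupper()-style API: one character in, exactly one character out."""
    out = []
    for ch in text:
        mapped = ch.upper()
        # A one-to-one API has nowhere to put a multi-character result,
        # so in this sketch the character is left unchanged.
        out.append(mapped if len(mapped) == 1 else ch)
    return "".join(out)


word = "straße"
full = word.upper()                   # whole-string mapping, may change length
limited = upper_per_character(word)   # per-character mapping only

print(full)     # STRASSE
print(limited)  # STRAßE
```

The whole-string conversion grows by one character because 'ß' uppercases to 'SS', which is exactly the case a per-character interface cannot represent.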

Different platforms usually provide a clearer definition:

  • On POSIX, characters are usually 8-bit ASCII-compatible values
  • On Windows, characters are 16-bit UTF-16-compatible values

Lua

Lua describes itself as 'encoding-agnostic', whatever that is. It certainly handles ASCII well.

  • Character type: Byte, encoding not defined
  • Byte strings: No
  • Internal encoding: None
  • String encoding: Undefined
  • Supports bytes in strings: Depends on encoding
  • Supports surrogates in strings: Depends on encoding
  • Supports invalid code points in strings: Depends on encoding
  • Supports normalizing strings: No
  • Supports querying character properties: No
  • Supports breaking by code point: Yes if encoded in UTF-8
  • Supports breaking by extended grapheme cluster: No
  • Supports breaking by text boundaries: No
  • Supports encoding and decoding to other encodings: No
  • Supports Unicode regex extensions: Not applicable, no regex at all
  • Classifies by: C APIs, maybe by locale, only supports 8-bit characters
  • Collates by: Doesn't provide an API for this
  • Converts case by: C APIs, maybe by locale, only supports 8-bit characters
  • Locale tailoring is done by: Per-process C locale
  • Wraps operating system APIs with Unicode ones: No

Overall there's no clear path here from reading bytes to handling Unicode.

Bonus points for the string.reverse function, which reverses bytes and so will break any string containing multi-byte UTF-8 sequences.
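The breakage is easy to reproduce outside Lua. The following is a rough Python sketch of what a byte-wise reverse does to UTF-8 text, plus what a code point reverse does to a combining sequence. The variable names are mine, and it only imitates the behaviour rather than calling Lua.

```python
text = "héllo"
encoded = text.encode("utf-8")    # 'é' becomes the two bytes C3 A9
reversed_bytes = encoded[::-1]    # what a byte-wise reverse produces

try:
    reversed_bytes.decode("utf-8")
    survived = True
except UnicodeDecodeError:
    # The continuation byte A9 now comes before its lead byte C3.
    survived = False

# Reversing by code point is not safe either: the combining acute accent
# (U+0301) ends up attached to 'x' instead of 'e'.
decomposed = "e\u0301x"
flipped = decomposed[::-1]

print(survived)          # False
print(flipped == "x\u0301e")  # True
```

A reverse that keeps text intact would have to work on extended grapheme clusters, which none of the implementations covered so far expose.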

Python

Python spent an enormous amount of time adding Unicode support in version 3.

  • Character type: Unicode code point
  • Byte strings: Yes
  • Internal encoding: 8-bit, 16-bit or 32-bit depending on the string
  • String encoding: Unicode code points
  • Supports bytes in strings: Yes, encoded as surrogates
  • Supports surrogates in strings: Yes
  • Supports invalid code points in strings: No
  • Supports normalizing strings: Yes
  • Supports querying character properties: Yes
  • Supports breaking by code point: Yes
  • Supports breaking by extended grapheme cluster: No
  • Supports breaking by text boundaries: No
  • Supports encoding and decoding to other encodings: Yes
  • Supports Unicode regex extensions: No
  • Classifies by: Unicode properties
  • Collates by: Doesn't provide an API for this
  • Converts case by: Unicode database
  • Locale tailoring is done by: Doesn't provide an API for this
  • Wraps operating system APIs with Unicode ones: Yes, with invalid bytes encoded as surrogates

This is better than most languages but still not what most people would want.
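As a small illustration of "bytes encoded as surrogates", this sketch uses the `surrogateescape` error handler from PEP 383. The file name is made up for the example.

```python
raw = b"caf\xe9.txt"   # Latin-1 bytes; 0xE9 is not valid UTF-8 here

# Undecodable bytes are smuggled into the string as lone surrogates
# in the range U+DC80..U+DCFF.
name = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = name.encode("utf-8", errors="surrogateescape")

# A strict encode refuses the lone surrogate.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(ascii(name))      # 'caf\udce9.txt'
print(restored == raw)  # True
print(strict_ok)        # False
```

The round trip is lossless, but the resulting string is not valid Unicode text, so it fails as soon as it meets a strict encoder.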

Rust

Rust seems to define its API around UTF-8.

  • Character type: Unicode scalar
  • Byte strings: Yes
  • Internal encoding: UTF-8
  • String encoding: UTF-8
  • Supports bytes in strings: No
  • Supports surrogates in strings: No
  • Supports invalid code points in strings: No
  • Supports normalizing strings: No
  • Supports querying character properties: No
  • Supports breaking by code point: Yes
  • Supports breaking by extended grapheme cluster: No
  • Supports breaking by text boundaries: Whitespace only
  • Supports encoding and decoding to other encodings: UTF-16 only
  • Supports Unicode regex extensions: Not applicable, doesn't include regex
  • Classifies by: Unicode properties
  • Collates by: Doesn't provide an API for this
  • Converts case by: Unicode database
  • Locale tailoring is done by: Doesn't provide an API for this
  • Wraps operating system APIs with Unicode ones: Optional

It seems more well formed than Python's, but the lack of normalization confuses me.
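To make two of the entries above concrete, here is a sketch in Python (not Rust) of what "Unicode scalar" excludes and of what missing normalization means in practice. The helper `is_scalar_value` is my own name for illustration and is not a Rust or Python API.

```python
import unicodedata


def is_scalar_value(code_point: int) -> bool:
    """A scalar value is any code point except the surrogates U+D800..U+DFFF."""
    return 0 <= code_point <= 0x10FFFF and not (0xD800 <= code_point <= 0xDFFF)


print(is_scalar_value(0x1F600))  # True
print(is_scalar_value(0xD800))   # False, surrogates are excluded

# Without normalization, two spellings of the same text compare unequal.
composed = "\u00e9"      # 'é' as one code point
decomposed = "e\u0301"   # 'e' followed by a combining acute accent

print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

A string type built on scalar values rules out surrogates by construction, but comparing strings like the two above still needs a normalization step from somewhere.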

Java

Java is one of the early adopters of Unicode.

  • Character type: 16-bit
  • Byte strings: Yes
  • Internal encoding: 16-bit integers
  • String encoding: UTF-16
  • Supports bytes in strings: No
  • Supports surrogates in strings: Yes
  • Supports invalid code points in strings: Yes
  • Supports normalizing strings: Yes
  • Supports querying character properties: Yes
  • Supports breaking by code point: Yes
  • Supports breaking by extended grapheme cluster: Partially, through the \X regex matcher added in Java 9
  • Supports breaking by text boundaries: Yes, through java.text.BreakIterator
  • Supports encoding and decoding to other encodings: Yes
  • Supports Unicode regex extensions: Yes
  • Classifies by: Unicode properties and locale
  • Collates by: Unicode properties and locale
  • Converts case by: Unicode database and locale
  • Locale tailoring is done by: Providing a Locale object
  • Wraps operating system APIs with Unicode ones: Yes

Not great, especially since you can't use non-Unicode file paths.
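The 16-bit character type and the surrogate entries are easier to see with the code units written out. This is a sketch in Python rather than Java, with a hypothetical helper `utf16_units` that lists the 16-bit units a UTF-16 string would hold.

```python
def utf16_units(text: str) -> list:
    """Return the 16-bit code units of text when stored as UTF-16."""
    # 'surrogatepass' lets lone surrogates through, the way a 16-bit
    # string type stores them without complaint.
    data = text.encode("utf-16-le", errors="surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


astral = utf16_units("\U0001F600")   # one code point outside the BMP
lone = utf16_units("\ud83d")         # an unpaired high surrogate

print([hex(u) for u in astral])  # ['0xd83d', '0xde00']
print([hex(u) for u in lone])    # ['0xd83d']
```

One code point becomes two units, and half of a pair is still a storable string, which is why a 16-bit unit is not the same thing as a character.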

Swift

Go

Kotlin

Tcl

Squirrel

Perl

Ruby

Zig

Elixir

- erlang too?

Raku

Haskell

PHP

- narrow APIs

- mbstring

JavaScript