Editing Unicode guide/Implementations

This page is my attempt to document my research on unique or popular Unicode implementations supported in various languages and software, in order to get a good feel for the main page I'm writing. This is mostly in note form to avoid things getting out of control and only includes implementations I find interesting.


I apologize for not attaching sources to all of this; I've had to dig into the source code. As always, double-check this in case things have changed.


== C and C++ ==
* Supports breaking by text boundaries: No
* Supports encoding and decoding to other encodings: Yes
* Supports Unicode regex extensions: No (not applicable in C as it has no regex)
* Classifies by: Locale information, only supports single characters
* Collates by: Locale information, supports arbitrary strings
* On POSIX, characters are usually 8-bit ASCII-compatible values
* On Windows, characters are 16-bit UTF-16-compatible values
Actual support for locales depends on your libc implementation, and this affects most languages that run on your computer. For example, on Linux glibc seems to be the only libc that supports uppercasing Unicode text.
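
To get a feel for what the 8-bit versus 16-bit character sizes mean in practice, here is a small sketch in Python (not C). It assumes the POSIX side uses UTF-8, which is common but not guaranteed, and the helper name <code>code_unit_counts</code> is my own invention.

```python
def code_unit_counts(text: str) -> dict:
    """Count how many units the text needs in three representations."""
    return {
        "code_points": len(text),
        # 8-bit units, as a UTF-8 locale on POSIX would store it
        "utf8_bytes": len(text.encode("utf-8")),
        # 16-bit units, as UTF-16 on Windows would store it
        "utf16_units": len(text.encode("utf-16-le")) // 2,
    }


# U+0041, U+00E9, U+20AC and U+1F600 need 1, 2, 3 and 4 bytes in UTF-8
for sample in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(ascii(sample), code_unit_counts(sample))
```

Anything outside ASCII takes more than one 8-bit character, and anything outside the Basic Multilingual Plane takes two 16-bit characters, which is why APIs that only look at single characters struggle on both platforms.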


== Lua ==
Bonus points for the string.reverse function that will break Unicode strings.
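
A rough illustration of why a byte-wise reverse breaks Unicode text, written in Python rather than Lua. It assumes the string is stored as UTF-8 bytes; <code>reverse_bytes</code> and <code>is_valid_utf8</code> are hypothetical helpers, not Lua functions.

```python
def reverse_bytes(text: str) -> bytes:
    """Reverse the UTF-8 bytes of text, the way a byte-oriented reverse would."""
    return text.encode("utf-8")[::-1]


def is_valid_utf8(data: bytes) -> bool:
    """Report whether data still decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Pure ASCII survives because every character is a single byte.
print(is_valid_utf8(reverse_bytes("abc")))
# U+00E9 is two bytes in UTF-8; reversing puts the continuation byte first.
print(is_valid_utf8(reverse_bytes("caf\u00e9")))
```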


== Python ==
Python spent an enormous amount of time adding Unicode support in version 3.
* Character type: Unicode code point
* Byte strings: Yes
* Internal encoding: 8-bit, 16-bit or 32-bit code units, depending on the string
* String encoding: Unicode code points
* Supports bytes in strings: Yes, encoded as surrogates
* Supports surrogates in strings: Yes
* Supports invalid code points in strings: No
* Supports normalizing strings: Yes
* Supports querying character properties: Yes
* Supports breaking by code point: Yes
* Supports breaking by extended grapheme cluster: No
* Supports breaking by text boundaries: No
* Supports encoding and decoding to other encodings: Yes
* Supports Unicode regex extensions: No
* Classifies by: Unicode properties
* Collates by: Doesn't provide an API for this
* Converts case by: Unicode database
* Locale tailoring is done by: Doesn't provide an API for this
* Wraps operating system APIs with Unicode ones: Yes, with invalid bytes encoded as surrogates

This is better than most languages but still not what most people would want.
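
The "invalid bytes encoded as surrogates" behaviour noted above can be sketched with the <code>surrogateescape</code> error handler. The file name here is made up for illustration. As far as I can tell this is the handler Python uses for file names on most POSIX systems, but treat that as an assumption.

```python
# Hypothetical file name; the byte 0xFF is never valid in UTF-8.
raw = b"report-\xff.txt"

# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))

# The same handler turns the surrogate back into the original byte.
round_trip = text.encode("utf-8", errors="surrogateescape")
print(round_trip == raw)

# A strict encode refuses the lone surrogate instead.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)
```

The catch is that such a string cannot be printed or written as strict UTF-8, so the escaped bytes tend to surface as errors far away from where they came in.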


== Rust ==
Rust seems to define its API around UTF-8.
* Character type: Unicode scalar
* Byte strings: Yes
* Internal encoding: UTF-8
* String encoding: UTF-8
* Supports bytes in strings: No
* Supports surrogates in strings: No
* Supports invalid code units in strings: No
* Supports normalizing strings: No
* Supports querying character properties: No
* Supports breaking by code point: Yes
* Supports breaking by extended grapheme cluster: No
* Supports breaking by text boundaries: Whitespace only
* Supports encoding and decoding to other encodings: UTF-16 only
* Supports Unicode regex extensions: Not applicable, doesn't include regex
* Classifies by: Unicode properties
* Collates by: Doesn't provide an API for this
* Converts case by: Unicode database
* Locale tailoring is done by: Doesn't provide an API for this
* Wraps operating system APIs with Unicode ones: Optional
It seems more well formed than Python's, but the lack of normalization confuses me.
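
To show why missing normalization matters, here is a sketch in Python (not Rust) using the standard <code>unicodedata</code> module. The helper name <code>nfc_equal</code> is my own.

```python
import unicodedata

composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE as one code point
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Both render the same, but a plain comparison sees different code points.
print(composed == decomposed)
print(nfc_equal(composed, decomposed))
```

Without normalization in the standard library, every program that compares user-visible text has to pull in a third-party package or accept mismatches like this one.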


== Java ==
Java is one of the early adopters of Unicode.


* Character type: 16-bit
* Byte strings: Yes
* Internal encoding: 16-bit integers
* String encoding: UTF-16
* Supports bytes in strings: No
* Supports surrogates in strings: No
* Supports invalid code units in strings: Yes
* Supports normalizing strings: Yes
* Supports querying character properties: Yes
* Supports breaking by code point: Yes
* Supports breaking by extended grapheme cluster: No
* Supports breaking by text boundaries: No
* Supports encoding and decoding to other encodings: Yes
* Supports Unicode regex extensions: Yes
* Classifies by: Unicode properties
* Collates by: Unicode properties and locale
* Converts case by: Unicode database and locale
* Locale tailoring is done by: Providing a Locale object
* Wraps operating system APIs with Unicode ones: Yes


Not great, especially since you can't use non-Unicode file paths.


== JavaScript ==
JavaScript follows Java's model but with a few fewer features.
* Character type: 16-bit
* Byte strings: No
* Internal encoding: 16-bit integers
* String encoding: UTF-16
* Supports bytes in strings: No
* Supports surrogates in strings: No
* Supports invalid code units in strings: Yes
* Supports normalizing strings: Yes
* Supports querying character properties: No
* Supports breaking by code point: Yes
* Supports breaking by extended grapheme cluster: No
* Supports breaking by text boundaries: No
* Supports encoding and decoding to other encodings: No
* Supports Unicode regex extensions: Yes
* Classifies by: Unicode properties
* Collates by: Unicode properties and locale
* Converts case by: Unicode database and locale
* Locale tailoring is done by: Providing a locale object
* Wraps operating system APIs with Unicode ones: Yes
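
The 16-bit model above can be sketched in Python (not JavaScript) by looking at UTF-16 code units directly. I'm assuming a lone surrogate is the typical "invalid code unit" meant here, and <code>utf16_units</code> is a hypothetical helper.

```python
def utf16_units(text: str) -> list:
    """Return the 16-bit code units a UTF-16 string type would hold."""
    # surrogatepass lets lone surrogates through instead of raising
    data = text.encode("utf-16-le", errors="surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


# U+1F600 is outside the BMP, so it is stored as a surrogate pair.
print([hex(u) for u in utf16_units("\U0001F600")])

# Half of that pair on its own is a legal 16-bit value but not valid Unicode text.
lone = "\ud83d"
print([hex(u) for u in utf16_units(lone)])

try:
    lone.encode("utf-8")
    lone_is_encodable = True
except UnicodeEncodeError:
    lone_is_encodable = False
print(lone_is_encodable)
```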
== Swift ==
Swift has an interesting idea about defining characters as extended grapheme clusters.
* Character type: Multiple Unicode scalars making an extended grapheme cluster
* Byte strings: Yes
* Internal encoding: 16-bit integers
* String encoding: Extended grapheme cluster
* Supports bytes in strings: No
* Supports surrogates in strings: No
* Supports invalid code units in strings: No
* Supports normalizing strings: Automatic for equality, otherwise I don't know
* Supports querying character properties: No
* Supports breaking by code point: Yes
* Supports breaking by extended grapheme cluster: Yes
* Supports breaking by text boundaries: No
* Supports encoding and decoding to other encodings: Yes
* Supports Unicode regex extensions: I don't know
* Classifies by: I don't know
* Collates by: I don't know
* Converts case by: I don't know
* Locale tailoring is done by: I don't know
* Wraps operating system APIs with Unicode ones: I don't know
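
To make the "character is an extended grapheme cluster" idea concrete, here is a very rough sketch in Python (not Swift). It only attaches combining marks to the preceding character, which is a big simplification of the real segmentation rules (no emoji sequences, Hangul, and so on), and <code>rough_clusters</code> is a made-up name.

```python
import unicodedata


def rough_clusters(text: str) -> list:
    """Group each character with the combining marks that follow it.

    This is only an approximation of extended grapheme clusters.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch   # attach the mark to the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


word = "cafe\u0301"  # "e" plus COMBINING ACUTE ACCENT
print(len(word))                  # counts code points
print(len(rough_clusters(word)))  # counts what a reader sees as characters
```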


Documentation on Swift is kind of hard to read and vague, so I can't find answers to a few of these questions.


== Go ==
Go has a kind of unopinionated take here.
* Character type: 32-bit integer, Unicode code point
* Byte strings: Yes
* Internal encoding: None
* String encoding: Unspecified, probably UTF-8
* Supports bytes in strings: Yes
* Supports surrogates in strings: No for UTF-8 strings
* Supports invalid code units in strings: No for UTF-8 strings
* Supports normalizing strings: No
* Supports querying character properties: Yes
* Supports breaking by code point: Yes for UTF-8 text
* Supports breaking by extended grapheme cluster: No
* Supports breaking by text boundaries: No
* Supports encoding and decoding to other encodings: No
* Supports Unicode regex extensions: Yes
* Classifies by: Unicode properties
* Collates by: Doesn't provide an API for this
* Converts case by: Unicode properties
* Locale tailoring is done by: Doesn't provide an API for this
* Wraps operating system APIs with Unicode ones: No


I would probably put this between Lua and C for bad interfaces.


Bonus points for deciding that shortening 'code point' to 'rune' is worth making people look up what a rune is.


== Ruby ==
Ruby hasn't really had a major refactoring for Unicode like contemporary languages.


* Character type: Arbitrarily large integers, unspecified character set
* Byte strings: Yes
* Internal encoding: None
* String encoding: Any
* Supports bytes in strings: No for UTF-8/UTF-16/UTF-32 strings
* Supports surrogates in strings: No for UTF-8 strings
* Supports invalid code units in strings: No for UTF-8 strings
* Supports normalizing strings: Yes
* Supports querying character properties: Yes
* Supports breaking by code point: Yes for UTF-8 text
* Supports breaking by extended grapheme cluster: Yes
* Supports breaking by text boundaries: No
* Supports encoding and decoding to other encodings: Yes
* Supports Unicode regex extensions: Yes
* Classifies by: Unicode properties
* Collates by: Doesn't provide an API for this
* Converts case by: Unicode properties with Turkic language support
* Locale tailoring is done by: Doesn't provide an API for this
* Wraps operating system APIs with Unicode ones: No


If you use UTF-8 strings only in Ruby then you might be able to have a good experience.


== Erlang ==
Erlang is kind of an alien language compared to all the above.


* Character type: Integer, Unicode code point
* Byte strings: Yes
* Internal encoding: I don't know
* String encoding: A mix of integers or UTF-8 strings
* Supports bytes in strings: Yes
* Supports surrogates in strings: Yes
* Supports invalid code units in strings: Yes
* Supports normalizing strings: Yes
* Supports querying character properties: No
* Supports breaking by code point: Yes
* Supports breaking by extended grapheme cluster: Yes
* Supports breaking by text boundaries: No
* Supports encoding and decoding to other encodings: Yes
* Supports Unicode regex extensions: Yes
* Classifies by: Doesn't provide an API for this
* Collates by: Doesn't provide an API for this
* Converts case by: I don't know
* Locale tailoring is done by: Doesn't provide an API for this
* Wraps operating system APIs with Unicode ones: No


Unicode support seems fairly limited and confusing here.


== Raku ==
Raku seems to have had the most thought put into its Unicode support.


* Character type: 32-bit integer
* Byte strings: Yes
* Internal encoding: A mix of signed integers of various sizes
* String encoding: Normalized grapheme clusters
* Supports bytes in strings: Yes
* Supports surrogates in strings: I don't know
* Supports invalid code units in strings: I don't know
* Supports normalizing strings: Yes
* Supports querying character properties: Yes
* Supports breaking by code point: Yes
* Supports breaking by extended grapheme cluster: Yes
* Supports breaking by text boundaries: No
* Supports encoding and decoding to other encodings: Yes
* Supports Unicode regex extensions: Yes
* Classifies by: Unicode properties
* Collates by: Unicode properties
* Converts case by: I don't know
* Locale tailoring is done by: Doesn't provide an API for this
* Wraps operating system APIs with Unicode ones: Yes, with UTF-C8 to escape bytes
 
Interestingly, it doesn't contain wrappers like isalpha, isupper, etc.
[[Category:Research]]