Unicode guide/Implementations: Difference between revisions
Revision as of 22:46, 19 March 2022
This page is my attempt to document my research on the distinct Unicode implementations found in various languages and software. It is mostly in note form to keep things from getting out of control.
C and C++
Support varies by platform; see the POSIX and Windows sections below.
POSIX (C)
I'm not going to address wide characters because they have the same issues.
- Character type: Byte, encoding not defined
- Bytestrings: Yes, regular strings
- Internal encoding: None
- String encoding: Locale-dependent, ASCII-compatible
- Supports bytes in strings: Depends on locale encoding
- Supports surrogates in strings: Depends on locale encoding
- Supports invalid code points in strings: Depends on locale encoding
- Supports normalizing strings: No
- Supports querying code point properties: No
- Supports breaking by code point: No
- Supports breaking by extended grapheme cluster: No
- Supports breaking by text boundaries: No
- Supports encoding and decoding to other encodings: Yes
- Supports Unicode regex extensions: No
- Classifies by: Locale information, only supports 8-bit characters
- Collates by: Locale information, supports arbitrary strings
- Converts case by: Locale information, only supports 8-bit characters
- Locale tailoring is done by: Per-thread POSIX locale
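The "encoding and decoding to other encodings" entry above presumably refers to `iconv`, which is my assumption rather than something these notes state. A minimal sketch of that API, converting ISO-8859-1 to UTF-8; the helper name `latin1_to_utf8` is hypothetical, and the set of supported encoding names depends on the platform's iconv implementation:

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Convert a NUL-terminated ISO-8859-1 string to UTF-8.
 * Returns the number of bytes written (excluding the NUL terminator),
 * or (size_t)-1 on any error, including an output buffer that is too small. */
size_t latin1_to_utf8(const char *in, char *out, size_t out_size)
{
    if (out_size == 0)
        return (size_t)-1;

    /* Note the argument order: target encoding first, source second. */
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    /* iconv() takes char ** for the input but does not modify the bytes. */
    char *in_ptr = (char *)in;
    size_t in_left = strlen(in);
    char *out_ptr = out;
    size_t out_left = out_size - 1; /* reserve room for the NUL */

    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;

    *out_ptr = '\0';
    return (size_t)(out_ptr - out);
}
```

Calling `latin1_to_utf8("caf\xe9", buf, sizeof buf)` with a 16-byte `buf` should write the five bytes `63 61 66 C3 A9`. The conversion works on plain byte buffers, which fits the "Internal encoding: None" entry: the strings themselves carry no encoding information, so the caller has to name both encodings.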
The POSIX C API could be classified as 'Unicode agnostic'; however, classification and case conversion are limited to 8-bit characters. As a result, even the limited functionality it provides is broken for any multi-byte encoding.
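A minimal sketch of the 8-bit limitation, assuming UTF-8 input and the default "C" locale. The helper names `upper_bytes` and `count_alpha_bytes` are hypothetical; only `toupper` and `isalpha` come from the standard library:

```c
#include <ctype.h>
#include <stddef.h>

/* Upper-case a NUL-terminated string in place, one byte at a time.
 * This is all toupper() can do: it sees a single unsigned char value,
 * never a whole multi-byte character. */
void upper_bytes(char *s)
{
    for (; *s != '\0'; s++) {
        /* The cast matters: passing a negative char is undefined behaviour. */
        *s = (char)toupper((unsigned char)*s);
    }
}

/* Count the bytes that isalpha() classifies as alphabetic. */
size_t count_alpha_bytes(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (isalpha((unsigned char)*s))
            n++;
    }
    return n;
}
```

With the UTF-8 string `"caf\xc3\xa9"` ("café"), `upper_bytes` produces `"CAF\xc3\xa9"`, leaving the two bytes of "é" untouched, and `count_alpha_bytes` returns 3 rather than 4. Each byte of the multi-byte character is classified on its own, so no single-byte locale table can get it right.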
Windows
- wide APIs
Rust
Java
Swift
Go
Kotlin
Python
Python 2
Python 3
Tcl
Lua
Squirrel
Perl
Ruby
Zig
Elixir
- Erlang too?
Raku
Haskell
PHP
- narrow APIs
- mbstring