Unicode guide/Implementations: Difference between revisions
Revision as of 23:27, 19 March 2022
This page is my attempt to document my research on unique Unicode implementations supported in various languages and software. This is mostly in note form to avoid things getting out of control.
I apologize for not attaching sources to all of this; I've had to dig into the source code. My best advice here is to
C and C++
C and C++ provide limited functionality related to text handling.
- Character type: 8-bit, 16-bit or 32-bit, encoding not defined
- Bytestrings: No, just regular arrays
- Internal encoding: None
- String encoding: Depends on locale
- Supports bytes in strings: Depends on locale encoding
- Supports surrogates in strings: Depends on locale encoding
- Supports invalid code points in strings: Depends on locale encoding
- Supports normalizing strings: No
- Supports querying code point properties: No
- Supports breaking by code point: No
- Supports breaking by extended grapheme cluster: No
- Supports breaking by text boundaries: No
- Supports encoding and decoding to other encodings: Yes
- Supports Unicode regex extensions: Not applicable, no regex included
- Classifies by: Locale information, only supports single characters
- Collates by: Locale information, supports arbitrary strings
- Converts case by: Locale information, only supports single characters
- Locale tailoring is done by: Current locale
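To make "String encoding: Depends on locale" concrete, here is a minimal sketch that counts the characters in a byte string by decoding it under whatever locale is currently active. The helper name `locale_char_count` is made up for illustration; `mbsrtowcs` is the standard `<wchar.h>` function, called here with a null destination so that it only counts.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: number of characters in a NUL-terminated multibyte
 * string as decoded under the current locale, or (size_t)-1 if the bytes
 * are not valid in that locale's encoding. */
static size_t locale_char_count(const char *s)
{
    mbstate_t state;

    assert(s != NULL);
    memset(&state, 0, sizeof state);  /* start in the initial conversion state */

    /* With a null destination, mbsrtowcs() stores nothing and only reports
     * how many wide characters the conversion would produce. */
    return mbsrtowcs(NULL, &s, 0, &state);
}
```

A program starts in the "C" locale, so `locale_char_count("abc")` returns 3 there. The same bytes containing non-ASCII UTF-8 sequences would only be expected to decode as intended after something like `setlocale(LC_ALL, "")` has selected a UTF-8 locale, which is the dependency the list describes.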
This could be classified as 'Unicode agnostic'; however, classification and case conversion are limited to 8-bit characters. As a result, even the limited functionality it provides is broken.
- POSIX 8-bit limit, regex
- Windows 16-bit limit
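A minimal sketch of the 8-bit limit, assuming the input is UTF-8 and the program is still in its startup "C" locale. The helper name `upcase_bytes` is made up for illustration; `toupper` is the standard `<ctype.h>` function.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: uppercase a NUL-terminated string in place, one
 * char at a time, which is the only granularity <ctype.h> offers. */
static void upcase_bytes(char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        /* toupper() expects an unsigned char value (or EOF) passed as int,
         * so the cast avoids undefined behaviour for bytes >= 0x80. */
        *s = (char)toupper((unsigned char)*s);
    }
}
```

Applied to the UTF-8 bytes of "héllo" (`"h\xC3\xA9llo"`), this leaves the two bytes of "é" untouched in the "C" locale and uppercases only the ASCII letters. Because each call sees a single byte, no locale setting can let it treat a multi-byte character as one unit.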
Lua
Lua describes itself as 'encoding-agnostic', whatever that is. It certainly handles ASCII well.
- Character type: Byte, encoding not defined
- Bytestrings: No
- Internal encoding: None
- String encoding: Undefined
- Supports bytes in strings: Depends on encoding
- Supports surrogates in strings: Depends on encoding
- Supports invalid code points in strings: Depends on encoding
- Supports normalizing strings: No
- Supports querying code point properties: No
- Supports breaking by code point: Yes if encoded in UTF-8
- Supports breaking by extended grapheme cluster: No
- Supports breaking by text boundaries: No
- Supports encoding and decoding to other encodings: No
- Supports Unicode regex extensions: Not applicable, no regex at all
- Classifies by: C APIs, maybe by locale, only supports 8-bit characters
- Collates by: Doesn't provide an API for this
- Converts case by: C APIs, maybe by locale, only supports 8-bit characters
- Locale tailoring is done by: Per-process C locale
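Breaking by code point is only possible here because UTF-8 marks its own boundaries. As a rough illustration, sketched in C (the language the reference Lua implementation is written in) and not taken from Lua's source, code points can be counted by skipping continuation bytes. The helper name `utf8_count` is hypothetical, and the input is assumed to be valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count code points in a NUL-terminated string that
 * is assumed to be valid UTF-8. No validation is performed. */
static size_t utf8_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        /* Continuation bytes have the form 10xxxxxx; every other byte
         * starts a new code point. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For the UTF-8 bytes of "héllo" this counts 5 code points across 6 bytes. With any other encoding in the same byte string the result is meaningless, which matches the "Yes if encoded in UTF-8" caveat.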
Overall there's no clear path here from reading bytes to handling Unicode.
Bonus points for the string.reverse function, which reverses bytes rather than code points and so will break Unicode strings.
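A byte-wise reversal is easy to sketch in C. This is an illustration of the effect, not Lua's actual implementation, and the helper name `reverse_bytes` is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: reverse a NUL-terminated string in place, byte by
 * byte, with no knowledge of any encoding. */
static void reverse_bytes(char *s)
{
    size_t i, j;

    assert(s != NULL);
    if (s[0] == '\0')
        return;  /* nothing to reverse */

    for (i = 0, j = strlen(s) - 1; i < j; ++i, --j) {
        char tmp = s[i];
        s[i] = s[j];
        s[j] = tmp;
    }
}
```

Reversing the UTF-8 bytes of "aé" (`61 C3 A9`) gives `A9 C3 61`. The result now starts with a continuation byte followed by a lead byte that has no continuation, so it is no longer valid UTF-8.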
Python 2
- check if this is worth mentioning
Rust
Java
Swift
Go
Kotlin
Python 3
Tcl
Squirrel
Perl
Ruby
Zig
Elixir
- erlang too?
Raku
Haskell
PHP
- narrow APIs
- mbstring