Unicode guide/Implementations
This page documents my research into how various languages and software implement Unicode strings.
Classifications
Here's a quick list of things I'll be classifying:
- Bytestring support
- Internal encoding
- String encoding
- Character type
- OS API encoding/type
- Supports bytes in strings
- Can encode/decode to other encodings
- How breaking by code points, graphemes, words, paragraphs, etc. is done
- How ordering works
- How upper/lower/folding case works
- How finding works
- How regex works
- How locale tailoring is done
- bare encoding/runes (java, windows, wchar, rust, go, javascript, ruby, kotlin, zig, elixir)
- codepoint based (python, haskell, perl, tcl)
- grapheme-based (swift, raku); do these let you convert a string to codepoints?
- normalized (raku)
- bytestrings
- wchar
- windows
- rust
- java
- swift
- go
- kotlin
- python, utf8b
- tcl
- linux/unix
- javascript
- perl
- ruby
- zig
- raku
- haskell
- elixir
- ICU
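To make the difference between the string models above concrete, here is a rough Python sketch that measures one string in several ways. The sample string and the helper names (`utf8_bytes`, `utf16_units`, `code_points`, `nfc_code_points`) are my own, chosen only for illustration. Python's standard library has no grapheme segmentation, so the grapheme count is not computed here; for this sample it would be 2.

```python
import unicodedata

# Hypothetical sample: "e" + COMBINING ACUTE ACCENT (U+0301) + GRINNING FACE (U+1F600).
sample = "e\u0301\U0001F600"


def utf8_bytes(s: str) -> int:
    """Length in bytes when the string is stored as UTF-8."""
    return len(s.encode("utf-8"))


def utf16_units(s: str) -> int:
    """Length in UTF-16 code units (each unit is 2 bytes)."""
    return len(s.encode("utf-16-le")) // 2


def code_points(s: str) -> int:
    """Length in code points, which is what Python 3's len() reports."""
    return len(s)


def nfc_code_points(s: str) -> int:
    """Length in code points after NFC normalization."""
    return len(unicodedata.normalize("NFC", s))


if __name__ == "__main__":
    print("UTF-8 bytes:        ", utf8_bytes(sample))       # 1 + 2 + 4 = 7
    print("UTF-16 code units:  ", utf16_units(sample))      # 1 + 1 + 2 = 4
    print("code points:        ", code_points(sample))      # 3
    print("code points (NFC):  ", nfc_code_points(sample))  # e + U+0301 compose to U+00E9, so 2
```

The same text has four different "lengths" depending on the model, which is why the classification matters when comparing how breaking, finding, and ordering behave.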
C and C++
Python 2
Lua
PHP (ignoring mbstring)
POSIX APIs
Windows narrow APIs
DOS APIs
Squirrel
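The entries above all appear to treat a string as an arbitrary byte sequence, so invalid UTF-8 is never a problem for them. As a contrast, here is a small sketch of how Python 3 can carry such bytes inside a Unicode string using the `surrogateescape` error handler (the approach I believe "utf8b" refers to). The file name is a made-up example.

```python
# Hypothetical file name encoded as Latin-1; the byte 0xE9 is not valid UTF-8 here.
raw = b"caf\xe9.txt"

# Each undecodable byte 0x80..0xFF is mapped to a lone surrogate U+DC80..U+DCFF.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler turns the surrogates back into the original bytes.
round_trip = text.encode("utf-8", errors="surrogateescape")

if __name__ == "__main__":
    print(ascii(text))        # the escaped byte shows up as \udce9
    print(round_trip == raw)  # the original bytes survive the round trip
```

The resulting string is not valid Unicode text (it contains a lone surrogate), so encoding it with the default strict handler raises `UnicodeEncodeError`. That trade-off is worth noting under "Supports bytes in strings".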