Unicode guide/Implementations

From JookWiki

Revision as of 18:33, 19 March 2022

This page is my attempt to document my research on Unicode string implementations supported in various languages and software.

Classifications

Here's a quick list of things I'll be classifying:

  • Bytestring support
  • Internal encoding
  • String encoding
  • Character type
  • OS API encoding/type
  • Supports bytes in strings
  • Can encode/decode to other encodings
  • How breaking by code points, graphemes, words, paragraphs, etc. is done
  • How ordering works
  • How upper/lower/folding case works
  • How finding works
  • How regex works
  • How locale tailoring is done
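
Several of these classifications depend on which unit a string is measured in. As a rough sketch only, using Python for illustration (the function name `unit_counts` is hypothetical and not taken from any implementation listed here), the same text gives different lengths in UTF-8 bytes, UTF-16 code units, and code points:

```python
def unit_counts(text: str) -> dict:
    """Count the same text in three different units."""
    return {
        # Number of bytes when the text is encoded as UTF-8
        "utf8_bytes": len(text.encode("utf-8")),
        # Number of 16-bit code units in UTF-16 (no BOM, 2 bytes per unit)
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        # Python's str type is indexed by code point
        "code_points": len(text),
    }


# 'e' + U+0301 COMBINING ACUTE ACCENT + U+1F600 (an emoji outside the BMP)
sample = "e\u0301\U0001F600"
counts = unit_counts(sample)
print(counts)
```

A reader would likely see two characters in `sample`, which matches none of the three counts. Counting those user-visible units needs grapheme segmentation, which this sketch does not attempt.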

- bare encoding/runes (java, windows, wchar, rust, go, javascript, ruby, kotlin, zig, elixir)

- codepoint based (python, haskell, perl, tcl)

- grapheme-based (swift, raku) which lets you convert a string to codepoints?

- normalized (raku)

- bytestrings
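
To illustrate why "codepoint based" and "normalized" are listed as separate categories, here is a minimal Python sketch using the standard `unicodedata` module. It shows two canonically equivalent strings that differ as code point sequences but match after NFC normalization. Whether a normalized string type performs this step automatically is an assumption here and has not been checked against Raku.

```python
import unicodedata

composed = "\u00e9"     # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# A plain code point comparison sees two different sequences.
raw_equal = composed == decomposed

# After normalizing both sides to NFC, the strings compare equal.
nfc_equal = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

print(raw_equal, nfc_equal)
```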

- wchar

- windows

- rust

- java

- swift

- go

- kotlin

- python, utf8b

- tcl

- linux/unix

- javascript

- perl

- ruby

- zig

- raku

- haskell

- elixir

- ICU
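
The "python, utf8b" entry above most likely refers to Python's `surrogateescape` error handler, which is one way of supporting bytes in strings. The sketch below is an illustration under that assumption: it decodes bytes that are not valid UTF-8, then shows that the original bytes can be recovered and that the resulting string is not strictly encodable.

```python
raw = b"caf\xe9"  # Latin-1 encoded bytes, not valid UTF-8

# The undecodable byte 0xE9 is mapped to the lone surrogate U+DCE9.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes.
round_tripped = text.encode("utf-8", errors="surrogateescape")

# A strict encode refuses the lone surrogate.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(repr(text), round_tripped, strict_ok)
```

The string in the middle step contains a surrogate code point, so it falls under the "surrogates in valid strings" question as well.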

C and C++

Python 2

Lua

PHP (ignoring mbstring)

POSIX APIs

Windows narrow APIs

DOS APIs

Squirrel