Unicode guide: Difference between revisions

From JookWiki

Revision as of 19:35, 7 March 2022

This is a WIP page, take nothing here as final.

SUMMARY

Strings

- character sets

- strings

- utf-8

- EBCDIC/ASCII

- upper

- lower

- length

- locales

- OS APIs

Unicode

- what is unicode

- bytes

- code points

- characters

- graphemes

- locales

- splitting things by space?

- nightmare windows APIs

- normalization

- CLDR

- languages, rich data, paragraphs, etc

- length

- languages
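
The outline above lists bytes, code points, normalization and length as separate topics. As a rough illustration of why "length" depends on which of those layers is being counted, here is a minimal Python sketch. The helper name `lengths` is made up for this example, and counting graphemes is left out because the Python standard library has no grapheme segmentation.

```python
import unicodedata

# "é" written as "e" plus U+0301 COMBINING ACUTE ACCENT (decomposed form).
decomposed = "e\u0301"
# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)


def lengths(text: str) -> tuple[int, int]:
    """Return (code point count, UTF-8 byte count) for text.

    Hypothetical helper for this sketch only.
    """
    return len(text), len(text.encode("utf-8"))


print(lengths(decomposed))  # (2, 3)
print(lengths(composed))    # (1, 2)
```

Both strings display as the same user-perceived character, yet neither count matches across the two forms, which is one reason normalization and length end up as separate items in the outline.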

Idea dump

unicode handling across languages

perl unicode

- char/wchar

- bytes

- characters

- utf-8

- utf-8b

- wtf-8

- opaqueness

- locales

- non-unicode

- bytes as strings kinda works better

- round trips

- perl

- c

- scheme

- formatting bytes/etc

- native format as utf-8? what?

- bytes -> maybe unicode -> unicode -> graphemes/text/etc
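
One possible reading of the "bytes -> maybe unicode -> unicode" and "round trips" notes above, sketched in Python. This assumes Python's `surrogateescape` error handler (PEP 383) as the "maybe unicode" stage; the function names are invented for the example and are not taken from any of the languages listed.

```python
# Valid UTF-8 for "café" followed by a stray 0xFF byte that is not valid UTF-8.
raw = b"caf\xc3\xa9 \xff"


def decode_lossless(data: bytes) -> str:
    """bytes -> "maybe unicode": undecodable bytes become lone surrogates."""
    return data.decode("utf-8", errors="surrogateescape")


def is_clean_unicode(text: str) -> bool:
    """"maybe unicode" -> unicode check: strict encoding rejects lone surrogates."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


maybe = decode_lossless(raw)

# The round trip back to the original bytes is exact.
assert maybe.encode("utf-8", errors="surrogateescape") == raw

print(is_clean_unicode(maybe))                            # False
print(is_clean_unicode(decode_lossless(b"caf\xc3\xa9")))  # True
```

The point of the sketch is only that the intermediate string can carry arbitrary bytes without loss, while a strict encode acts as the gate before treating it as real Unicode text.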

user-perceived character / grapheme cluster

- scripts

- wchar

- https://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default

- runes

- rust char

https://stackoverflow.com/questions/12450750/how-can-i-work-with-raw-bytes-in-perl