Unicode guide
Revision as of 04:17, 9 March 2022
This is a WIP page, take nothing here as final.
Character sets
Programming generally uses the concept of character sets for handling text. The idea is simple:
- A character set is a collection of written symbols such as digits, letters, punctuation marks or spaces
- A character is a reference to a specific symbol in that set
- A string is an array of characters
Usually you can do the following things with a character:
- Convert it to uppercase
- Convert it to lowercase
Usually you can do the following things with a string:
- Split it into multiple strings
- Count how many characters are in the string
- how to represent
- locales
- concrete examples using ASCII of doing the tasks
- concrete examples using ISO-8859-1 or something
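As a minimal sketch of the character and string operations listed above, the following Python snippet treats a `bytes` value as a string of ASCII characters. The helper names `ascii_upper` and `ascii_lower` are illustrative only; Python's built-in `bytes.upper()` and `bytes.lower()` do the same job for ASCII letters.

```python
# ASCII assigns each character a number from 0 to 127, stored in one byte.
text = b"Hello, World"   # a string: an array of ASCII characters

# A character is a reference (here, an integer) to a symbol in the set.
first = text[0]          # 72, the ASCII code for "H"


def ascii_upper(code: int) -> int:
    """Uppercase one ASCII character; letter cases differ by exactly 32."""
    return code - 32 if ord("a") <= code <= ord("z") else code


def ascii_lower(code: int) -> int:
    """Lowercase one ASCII character; non-letters are returned unchanged."""
    return code + 32 if ord("A") <= code <= ord("Z") else code


# Apply the per-character operations across the whole string.
upper = bytes(ascii_upper(c) for c in text)   # b"HELLO, WORLD"
lower = bytes(ascii_lower(c) for c in text)   # b"hello, world"

# String operations: split on a space, and count the characters.
parts = text.split(b" ")                      # [b"Hello,", b"World"]
count = len(text)                             # 12
```

Because every ASCII character occupies exactly one byte, counting bytes and counting characters give the same answer here. That assumption stops holding once text leaves ASCII.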
Unicode
- what is Unicode
- bytes
- code points
- characters
- graphemes
- locales
- splitting things by space?
- nightmare Windows APIs
- normalization
- CLDR
- languages, rich data, paragraphs, etc
- length
- languages
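To make the bytes, code points, normalization and length items above concrete, here is a small Python sketch using only the standard library. It is one possible illustration, not a complete treatment; the variable names are arbitrary.

```python
import unicodedata

# The same user-perceived character "é" written two ways.
composed = "\u00e9"      # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # two code points: "e" + COMBINING ACUTE ACCENT

# Length in code points (what Python's len() counts for str).
composed_points = len(composed)        # 1
decomposed_points = len(decomposed)    # 2

# Length in bytes depends on the encoding; here UTF-8.
composed_bytes = len(composed.encode("utf-8"))      # 2
decomposed_bytes = len(decomposed.encode("utf-8"))  # 3

# The two strings look identical but do not compare equal
# until they are normalized to the same form.
equal_raw = composed == decomposed                                  # False
equal_nfc = unicodedata.normalize("NFC", decomposed) == composed    # True
equal_nfd = unicodedata.normalize("NFD", composed) == decomposed    # True

# Case conversion can change the length of a string.
sharp_s_upper = "\u00df".upper()   # "ß" uppercases to "SS"
```

Neither count above matches the single grapheme a reader sees in the decomposed form. Python's standard library does not provide grapheme cluster segmentation, so counting user-perceived characters needs a third-party library.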
Idea dump
unicode handling across languages
perl unicode
- OS bytes
- char/wchar
- bytes
- characters
- utf-8
- utf-8b
- wtf-8
- opaqueness
- locales
- non-unicode
- bytes as strings kinda works better
- round trips
- perl
- c
- scheme
- formatting bytes/etc
- native format as utf-8? what?
- bytes -> maybe unicode -> unicode -> graphemes/text/etc
user-perceived character / grapheme cluster
- scripts
- wchar
- https://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default
- runes
- rust char
https://stackoverflow.com/questions/12450750/how-can-i-work-with-raw-bytes-in-perl
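The round-trip and "OS bytes" items in the list above can be sketched in Python, whose `surrogateescape` error handler (PEP 383) is similar in spirit to the utf-8b idea: bytes that are not valid UTF-8 are smuggled through a string as lone surrogates and restored on encoding. The file name below is a made-up example.

```python
# A file name as the OS might hand it over: Latin-1 bytes, not valid UTF-8.
raw = b"caf\xe9.txt"

# Strict decoding refuses the invalid byte.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps the undecodable byte 0xE9 to the lone surrogate U+DCE9.
name = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
back = name.encode("utf-8", errors="surrogateescape")

# The resulting string is not valid Unicode text, so a strict encode fails.
try:
    name.encode("utf-8")
    reencode_ok = True
except UnicodeEncodeError:
    reencode_ok = False
```

The trade-off is the one the notes hint at: the string round-trips losslessly, but it is opaque in the sense that it may contain values that are not real characters, so it should be treated as an identifier rather than as text to display or compare.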