Unicode guide: Difference between revisions

From JookWiki
(More writing)

Revision as of 20:30, 9 March 2022

This is a WIP page; take nothing here as final.

Character sets

Programming generally uses the concept of character sets for handling text. The idea is simple:

  • A character set is a collection of written symbols such as numbers, letters, punctuation marks or spaces
  • A character is a reference to a specific symbol in that set
  • A string is an array of characters
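The three ideas above can be sketched in Python, assuming ASCII as the character set. The variable names are only for illustration, and this is a rough sketch rather than a statement of how any particular language stores text.

```python
# A tiny illustration using ASCII as the character set (an assumption).
text = "Hi!"

# Each character refers to one symbol in the set; ord() gives its number.
numbers = [ord(ch) for ch in text]

# Stored as ASCII, the string is simply an array of those numbers.
encoded = list(text.encode("ascii"))

# chr() maps a number back to its symbol.
decoded = "".join(chr(n) for n in encoded)

print(numbers)
print(encoded)
print(decoded)
```

With ASCII every character fits in one byte, so the character numbers and the stored bytes are the same list.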

Usually you can do the following operations with strings:

  • Split it into multiple strings
  • Count how many characters are in the string
  • Convert it to uppercase
  • Convert it to lowercase
  • Compare it to other strings
  • Sort a list of strings
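As a rough sketch, here is what the operations above look like with Python's built-in string methods on plain ASCII text. Note the assumption: these built-ins do not consult a locale, so comparing and sorting here follow raw character numbers, not any language's rules.

```python
sentence = "pear Apple fig"

# Split it into multiple strings.
parts = sentence.split(" ")

# Count how many characters are in the string (spaces included).
length = len(sentence)

# Convert to uppercase and lowercase.
upper = sentence.upper()
lower = sentence.lower()

# Compare it to another string.
same = parts[2] == "fig"

# Sort a list of strings; uppercase letters have lower numbers in ASCII,
# so "Apple" comes before "fig" and "pear".
ordered = sorted(parts)

print(parts, length, upper, lower, same, ordered)
```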

The rules used for these operations are specified by a locale, which defines a language, a geographic region and any other small differences. A locale is needed because languages can share a character set but still have different rules.
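To make the point about locales concrete, here is a simplified sketch of sorting the same words under two sets of rules. German dictionary order treats "ä" like "a", while the Swedish alphabet places "å", "ä" and "ö" after "z". The tables and function names below are hypothetical and hand-written for illustration; real locale data is far more detailed.

```python
# Hypothetical, heavily simplified rule tables (not real locale data).
GERMAN_FOLD = {"ä": "a", "ö": "o", "ü": "u", "ß": "ss"}
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"


def german_key(word):
    # German dictionary order: treat umlauts as their base letters.
    return "".join(GERMAN_FOLD.get(ch, ch) for ch in word.lower())


def swedish_key(word):
    # Swedish order: rank letters by alphabet position (å, ä, ö after z).
    # Assumes every letter appears in SWEDISH_ALPHABET.
    return [SWEDISH_ALPHABET.index(ch) for ch in word.lower()]


words = ["zebra", "ärger", "apfel"]

german_order = sorted(words, key=german_key)
swedish_order = sorted(words, key=swedish_key)

print(german_order)
print(swedish_order)
```

The same three words, drawn from the same character set, end up in a different order depending on which language's rules are applied.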

Unicode

- character set

- what is unicode

- bytes

- code points

- characters

- graphemes

- locales

- splitting things by space?

- nightmare windows APIs

- normalization

- CLDR

- languages, rich data, paragraphs, etc

- length

- languages

Idea dump

unicode handling across languages

perl unicode

- OS bytes

- char/wchar

- bytes

- characters

- utf-8

- utf-8b

- wtf-8

- opaqueness

- locales

- non-unicode

- bytes as strings kinda works better

- round trips

- perl

- c

- scheme

- formatting bytes/etc

- native format as utf-8? what?

- bytes -> maybe unicode -> unicode -> graphemes/text/etc

user-perceived character / grapheme cluster

- scripts

- wchar

- https://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default

- runes

- rust char

https://stackoverflow.com/questions/12450750/how-can-i-work-with-raw-bytes-in-perl