Unicode guide: Difference between revisions

From JookWiki
(→‎Character sets: Clarify uppercase and lowercase)
(More)

Revision as of 06:21, 10 March 2022

This is a WIP page, take nothing here as final.

Character sets

Programming generally uses the concept of character sets for handling text. The idea is simple:

  • A character set is a collection of written symbols such as digits, letters, punctuation marks or spaces
  • A character is a reference to a specific symbol in that set
  • A string is an array of characters
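
As a minimal sketch of these three ideas, assume Unicode is the character set and Python's built-in str is the string type (both are choices made for this illustration, not something the page has settled on). Here the "reference" to a symbol is simply its number in the set:

```python
# Assumption: Unicode as the character set, Python str as the string type.

symbol = "A"
reference = ord(symbol)           # the number that refers to "A" in the set
assert chr(reference) == symbol   # the reference maps back to the same symbol

# A string behaves like an array of characters.
text = "h\u00e9llo"               # "héllo", written with an escape for clarity
characters = list(text)           # one entry per character
references = [ord(c) for c in text]

print(reference)    # 65
print(characters)   # ['h', 'é', 'l', 'l', 'o']
print(references)   # [104, 233, 108, 108, 111]
```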

Usually you can do the following operations with strings:

  • Split it into multiple strings
  • Count how many characters are in the string
  • Convert the string to uppercase
  • Convert the string to lowercase
  • Compare it to other strings
  • Sort a list of strings
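
The operations above might look like this in Python. This is only a sketch using Python's built-in str methods, which apply default rules that ignore any locale; the sample strings are made up for the example.

```python
text = "Hello World"

parts = text.split(" ")    # split into multiple strings
count = len(text)          # count the characters
upper = text.upper()       # convert to uppercase
lower = text.lower()       # convert to lowercase

# Comparison is exact by default, so case matters.
same = text == "hello world"
same_ignoring_case = text.casefold() == "hello world".casefold()

# Sorting compares character numbers, so "B" sorts before "a".
ordered = sorted(["pear", "apple", "Banana"])

print(parts)                      # ['Hello', 'World']
print(count)                      # 11
print(upper, lower)               # HELLO WORLD hello world
print(same, same_ignoring_case)   # False True
print(ordered)                    # ['Banana', 'apple', 'pear']
```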

The rules used for these operations are specified by a locale, which identifies a language, a geographic region and any other variations that affect the rules. A locale is needed because several languages can share one character set yet apply different rules to it.
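
To illustrate why the locale matters, here is a small Python sketch. Python's str.upper() takes no locale, so the Turkish rule is approximated with upper_tr, a hypothetical helper written only for this example. It covers just the dotted and dotless i and is not a complete implementation of Turkish casing.

```python
def upper_tr(text: str) -> str:
    """Hypothetical helper: uppercase using the Turkish rule for 'i'.

    In Turkish, 'i' uppercases to 'İ' (U+0130) and 'ı' (U+0131)
    uppercases to 'I'. Python's default upper() already handles 'ı'.
    """
    return text.replace("i", "\u0130").upper()


word = "istanbul"
default_upper = word.upper()     # locale-independent default rule
turkish_upper = upper_tr(word)   # Turkish rule

# Sorting has the same problem: ordering by character number puts "ä"
# after "z". Swedish also sorts "ä" after "z", while German sorts it
# alongside "a", so no single order suits every locale.
by_code_point = sorted(["zebra", "\u00e4pple", "apple"])

print(default_upper)   # ISTANBUL
print(turkish_upper)   # İSTANBUL
print(by_code_point)   # ['apple', 'zebra', 'äpple']
```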

Unicode

- character set

- what is unicode

- bytes

- code points

- characters

- grapheme

- locales

- splitting things by space?

- nightmare Windows APIs

- normalization

- CLDR

- languages, rich data, paragraphs, etc

- length

- languages

Language strings

- c strings

- bytestring

- higher level strings

- js strings

- etc?

Idea dump

unicode handling across languages

perl unicode

- OS bytes

- char/wchar

- bytes

- characters

- utf-8

- utf-8b

- wtf-8

- opaqueness

- locales

- non-unicode

- bytes as strings kinda works better

- round trips

- perl

- c

- scheme

- formatting bytes/etc

- native format as utf-8? what?

- bytes -> maybe unicode -> unicode -> graphemes/text/etc

user-perceived character / grapheme cluster

- scripts

- wchar

- https://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default

- runes

- rust char

https://stackoverflow.com/questions/12450750/how-can-i-work-with-raw-bytes-in-perl