Unicode guide: Difference between revisions

Revision as of 04:17, 9 March 2022

This is a WIP page, take nothing here as final.

Character sets

Programming generally uses the concept of character sets for handling text. The idea is simple:

A character set is a collection of written symbols such as numbers, letters, punctuation marks or spaces
A character is a reference to a specific symbol in that set
A string is an array of characters
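
The three definitions above can be modelled in a few lines. This is only a sketch, assuming printable ASCII as the character set; the names `CHARSET`, `char_ref` and `text` are made up for the illustration.

```python
# The character set: a collection of symbols, each at a fixed position.
# Here it is the printable ASCII range, space (32) through '~' (126).
CHARSET = [chr(n) for n in range(32, 127)]


def char_ref(symbol):
    """Return the reference (here, the ASCII number) for a symbol."""
    return 32 + CHARSET.index(symbol)


# A string is then just an array of such references.
text = [char_ref(s) for s in "Hi there"]
print(text)  # [72, 105, 32, 116, 104, 101, 114, 101]

# Looking each reference up in the set gives the symbols back.
print("".join(CHARSET[n - 32] for n in text))  # Hi there
```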

Usually you can do the following things with characters:

Convert it to uppercase
Convert it to lowercase
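
For ASCII, both conversions come down to arithmetic on the character's number, because each lowercase letter sits exactly 32 positions after its uppercase form. A minimal sketch, assuming ASCII only (the function names are hypothetical, and the trick does not carry over to character sets in general):

```python
def ascii_upper(code):
    """Uppercase one ASCII character, given as its number."""
    if ord("a") <= code <= ord("z"):
        return code - 32
    return code  # digits, punctuation, etc. are left alone


def ascii_lower(code):
    """Lowercase one ASCII character, given as its number."""
    if ord("A") <= code <= ord("Z"):
        return code + 32
    return code


print(chr(ascii_upper(ord("q"))))  # Q
print(chr(ascii_lower(ord("Q"))))  # q
print(chr(ascii_upper(ord("7"))))  # 7
```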

Usually you can do the following things with strings:

Split it into multiple strings
Count how many characters are in the string
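
Under the "string is an array of characters" model, both string operations are simple loops over that array. The sketch below assumes ASCII text, where one byte is one character; `split_on` is a hypothetical helper, not a library function.

```python
def split_on(codes, separator):
    """Split a list of character numbers at every occurrence of separator."""
    parts, current = [], []
    for code in codes:
        if code == separator:
            parts.append(current)
            current = []
        else:
            current.append(code)
    parts.append(current)
    return parts


text = list(b"one two three")  # ASCII: one byte per character
words = split_on(text, ord(" "))

print(len(text))  # 13
print([bytes(w).decode("ascii") for w in words])  # ['one', 'two', 'three']
```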

- how to represent

- locales

- concrete examples using ASCII of doing the tasks

- concrete examples using ISO-8859-1 or something

Unicode

- what is unicode

- bytes

- code points

- characters

- graphemes

- locales

- splitting things by space?

- nightmare windows APIs

- normalization

- CLDR

- languages, rich data, paragraphs, etc

- length

- languages
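
A few of the items above (bytes, code points, normalization, length) can be seen with a single accented letter. This is a sketch using Python's standard `unicodedata` module; it does not cover grapheme clusters, which the standard library does not segment.

```python
import unicodedata

composed = "\u00e9"     # é as one code point
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT

# Length in code points: same visible text, different counts.
print(len(composed), len(decomposed))  # 1 2

# Length in UTF-8 bytes differs again.
print(len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))  # 2 3

# The two strings only compare equal after normalization.
print(composed == decomposed)  # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```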

Idea dump

unicode handling across languages

perl unicode

- OS bytes

- char/wchar

- bytes

- characters

- utf-8

- utf-8b

- wtf-8

- opaqueness

- locales

- non-unicode

- bytes as strings kinda works better

- round trips

- perl

- c

- scheme

- formatting bytes/etc

- native format as utf-8? what?

- bytes -> maybe unicode -> unicode -> graphemes/text/etc

user-perceived character / grapheme cluster

- scripts

- wchar

- https://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default

- runes

- rust char

https://stackoverflow.com/questions/12450750/how-can-i-work-with-raw-bytes-in-perl

@@ Line 1: / Line 1: @@
 '''This is a WIP page, take nothing here as final.'''
-SUMMARY
+== Character sets ==
+Programming generally uses the concept of character sets for handling text. The idea is simple:
-== Strings ==
+* A character set is a collection of written symbols such as numbers, letters, punctuation marks or spaces
-- character sets
+* A character is a reference to a specific symbol in that set
+* A string is an array of characters
-- strings
+Usually you can do the following things with characters:
-- utf-8
+* Convert it to uppercase
+* Convert it to lowercase
-- ebdic/ascii
+Usually you can do the following things with strings:
-- upper
+* Split it into multiple strings
+* Count how many characters are in the string
-- lower
+- how to represent
-- length
+- locales
-- locales
+- concrete examples using ASCII of doing the tasks
-- OS APIs
+- concrete examples using ISO-8859-1 or something
 == Unicode ==
@@ Line 53: / Line 57: @@
 perl unicode
+- OS bytes
 - char/wchar