Jump to content
Toggle sidebar
JookWiki
Search
Create account
Log in
Personal tools
Create account
Log in
Pages for logged out editors
learn more
Contributions
Talk
Navigation
Main page
Recent changes
Random page
All pages
Help about MediaWiki
Tools
What links here
Related changes
Special pages
Page information
Editing
Unicode guide
(section)
Page
Discussion
English
Read
Edit
Edit source
View history
More
Read
Edit
Edit source
View history
Warning:
You are not logged in. Your IP address will be publicly visible if you make any edits. If you
log in
or
create an account
, your edits will be attributed to your username, along with other benefits.
Anti-spam check. Do
not
fill this in!
== Abstraction levels == - bytes - code units - code points - segmented text - unicode strings may be encoded data, code units, code points, non-surrogate code points, or mixed data TODO: locale information/rich text === Level 1: Bytes === level 1: bytes. you can compare, search, splitting, sorting. your basic unit is the byte filesystem/unix/C utf-8b === Level 2: Code units === level 2: code units. your basic unit is the smallest unit of your unicode encoding: a byte for utf-8, a 16-bit int for UTF-16, a 32-bit int for UTF-32. you can compare, search, splitting, sort. to get to this point you have to handle endianness windows === Level 3: Unicode scalars === level 3: unicode scalars. your basic unit is a number between 0x0 and 0x1fffff inclusive, with some ranges for surrogates not allowed. to get tho this point you have to decode utf-8, utf-16 or utf-32. you can compare, search, split, etc but it's important to note that these are just numbers. there's no meaning attached to them python TODO: Code points can be noncharacters or reserved characters, uh-oh === Level 4: Unicode characters === level 4: unicode characters: your basic unit is a code point that your runtime recognizes and is willing to interpret using a copy of the unicode database. results vary according to the supported unicode version. you can normalize, compare, match, search, and splitting, case map strings. locale specific operations may be provided. to get these the runtime needs to check if the characters are supported. ??? TODO: noncharacters === Level 5: Segmented text === level 5: unicode texts: your basic unit is a string of unicode characters of some amount, such as a word, paragraph, grapheme cluster. to get these you need to convert from a string of unicode characters with breaking/segmentation rules swift/raku
Summary:
Please note that all contributions to JookWiki are considered to be released under the Creative Commons Zero (Public Domain) (see
JookWiki:Copyrights
for details). If you do not want your writing to be edited mercilessly and redistributed at will, then do not submit it here.
You are also promising us that you wrote this yourself, or copied it from a public domain or similar free resource.
Do not submit copyrighted work without permission!
To edit this page, please answer the question that appears below (
more info
):
Who owns this wiki?
Cancel
Editing help
(opens in new window)