Unicode guide
== Encodings ==

Storing an arbitrary code point requires an unsigned 21-bit number. This is a problem for a few reasons:

* Modern computers have no 21-bit integer type, so each code point would be stored in a 32-bit number
* Storing text as a sequence of 32-bit numbers is space inefficient

Modern development environments break encoded Unicode text into sequences of one or more code units:

* Unix strings use 8-bit code units
* Windows strings use 16-bit code units
* Java and JavaScript strings use 16-bit code units

The Unicode standard defines encoding forms that transform between code points and code units:

* UTF-8 which uses 8-bit code units
* UTF-16 which uses 16-bit code units
* UTF-32 which uses 32-bit code units

These encoding forms encode all valid code points except surrogate code points. This holds even for UTF-32, which is otherwise a straight representation of code points as 32-bit integers.

The standard then defines encoding schemes that transform between code units and bytes:

* UTF-8 which is the same as its encoding form
* UTF-16LE and UTF-16BE which use different byte orders
* UTF-32LE and UTF-32BE which use different byte orders
* UTF-16 which is either UTF-16LE or UTF-16BE with a byte order mark for detection
* UTF-32 which is either UTF-32LE or UTF-32BE with a byte order mark for detection

The byte order mark is actually the Unicode character U+FEFF [https://util.unicode.org/UnicodeJsps/character.jsp?a=FEFF&B1=Show ZERO WIDTH NO-BREAK SPACE], but it is interpreted as a byte order mark for UTF-16 and UTF-32 when present at the start of encoded text. The initial U+FEFF code point is added during encoding and removed during decoding, but any other U+FEFF code points are kept.

Some software treats the byte order mark as a signature to detect which Unicode encoding the text is using, if it is using Unicode at all. Software that does this may require UTF-8 text to include a byte order mark even though the encoding does not need one.

Unicode also offers the ability to gracefully handle decoding failures. This is done by having decoders substitute invalid data with the U+FFFD [https://util.unicode.org/UnicodeJsps/character.jsp?a=FFFD&B1=Show REPLACEMENT CHARACTER] code point. This character may also be used as a fallback when unable to display a character, or when unable to convert non-Unicode text to Unicode.

All of these encodings may seem overwhelming, but in practice only two of them are in common use: UTF-8 and UTF-16. The reason for this split is historical.

The first edition of Unicode had a 16-bit codespace and used a fixed-width 16-bit encoding named UCS-2. The first adopters of Unicode such as Java and Windows chose to represent Unicode with UCS-2, while software that required backwards compatibility such as Unix used UTF-8 and treated Unicode as just another character set.

The second edition of Unicode increased the codespace to 21 bits. UCS-2 was succeeded by the variable-width UTF-16 encoding we have today, and UTF-32 took over the role of fixed-width encoding. A portion of the codespace was reserved as 'surrogate' code points to preserve compatibility between UCS-2 and UTF-16: a UCS-2 system sees each surrogate as an ordinary 16-bit code point, while a UTF-16 system decodes a pair of surrogates as a single code point above U+FFFF.

Lots of time is spent discussing which variable-width encoding is better and which you should use in new projects. In practice the encoding you use is likely already decided by the tools you use and the communities or APIs you interact with.
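
The encoding forms, encoding schemes, byte order mark and replacement character described in this section can be observed with a short Python sketch. Python is used here only for illustration. The codec names (<code>utf-16-be</code>, <code>utf-8-sig</code> and so on) are those of Python's standard library, while the helper names <code>code_units</code> and <code>can_encode</code> are hypothetical and exist only for this example.

```python
import codecs


def code_units(text: str, encoding: str, width: int) -> list:
    """Encode text with a big-endian codec and split the bytes into
    integer code units of `width` bytes each (hypothetical helper)."""
    data = text.encode(encoding)
    return [int.from_bytes(data[i:i + width], "big")
            for i in range(0, len(data), width)]


def can_encode(text: str, encoding: str) -> bool:
    """Report whether Python's codec accepts the text (hypothetical helper)."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Three code points: U+0061, U+00E9 and U+1F600 (above the 16-bit range).
sample = "a\u00e9\U0001F600"

# Encoding forms: the same code points as 8-, 16- and 32-bit code units.
utf8_units = code_units(sample, "utf-8", 1)
utf16_units = code_units(sample, "utf-16-be", 2)
utf32_units = code_units(sample, "utf-32-be", 4)

# Encoding schemes: the same 16-bit code unit in two byte orders.
le_bytes = "\u00e9".encode("utf-16-le")
be_bytes = "\u00e9".encode("utf-16-be")

# Byte order mark: the initial U+FEFF selects the byte order and is removed
# by the decoder, while the later U+FEFF is kept as an ordinary character.
decoded_bom = b"\xfe\xff\x00a\xfe\xff".decode("utf-16")

# UTF-8 signature: Python's "utf-8-sig" codec writes the byte order mark
# that some software expects at the start of UTF-8 text.
utf8_sig = "a".encode("utf-8-sig")

# Replacement character: the invalid byte 0xFF is substituted with U+FFFD.
repaired = b"ab\xffcd".decode("utf-8", errors="replace")

print([hex(u) for u in utf8_units])
print([hex(u) for u in utf16_units])
print([hex(u) for u in utf32_units])
print(le_bytes, be_bytes)
print(repr(decoded_bom))
print(utf8_sig.startswith(codecs.BOM_UTF8))
print(can_encode("\ud800", "utf-8"))  # a lone surrogate is rejected
print(repr(repaired))
```

U+1F600 takes four code units in UTF-8, two in UTF-16 (a surrogate pair) and one in UTF-32, which is the trade-off between the three encoding forms. The big-endian codecs are used in <code>code_units</code> only so that the code units can be read back without depending on the byte order of the machine running the sketch.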