
== Encodings ==
Storing an arbitrary code point requires an unsigned 21-bit number. This is a problem for a few reasons:

* Modern computers would store this in a 32-bit number
* Storing a load of 32-bit numbers is space inefficient
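
The size argument above can be checked with a short Python sketch. The sample string and the variable names are illustrative assumptions, not part of any standard API.

```python
MAX_CODE_POINT = 0x10FFFF  # highest code point in the Unicode codespace

# Number of bits needed to represent the highest code point.
bits_needed = MAX_CODE_POINT.bit_length()

text = "hello"  # sample text, chosen only for illustration

# UTF-32 stores every code point in one 32-bit (4-byte) code unit.
utf32_size = len(text.encode("utf-32-le"))

# UTF-8 stores each of these ASCII characters in one 8-bit code unit.
utf8_size = len(text.encode("utf-8"))

print(bits_needed, utf32_size, utf8_size)
```

For this ASCII-only sample, the 32-bit representation takes four times the space of the 8-bit one.
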
Modern development environments break encoded Unicode text into sequences of one or more code units:

* Unix strings use 8-bit code units
* Windows strings use 16-bit code units
* Java and JavaScript strings use 16-bit code units

The Unicode standard defines encoding forms that transform between code points and code units:

* UTF-8 which uses 8-bit code units
* UTF-16 which uses 16-bit code units
* UTF-32 which uses 32-bit code units
These encoding forms can encode every valid code point except the surrogate code points. This holds even for UTF-32, which is otherwise a direct representation of code points as 32-bit integers.
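
As a rough illustration, the Python sketch below splits one code point into code units under each encoding form, then checks that a lone surrogate is rejected. The helpers <code>code_units</code> and <code>can_encode</code> are hypothetical names written for this example, and the rejection of surrogates reflects the behaviour of Python 3's built-in codecs.

```python
def code_units(text, encoding, unit_size):
    """Split the encoded bytes into integers, one per code unit."""
    data = text.encode(encoding)
    return [int.from_bytes(data[i:i + unit_size], "big")
            for i in range(0, len(data), unit_size)]


def can_encode(text, encoding):
    """Return True if the codec accepts the text, False if it rejects it."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


ch = "\U0001F600"  # U+1F600, a code point that does not fit in 16 bits

utf8_units = code_units(ch, "utf-8", 1)       # four 8-bit code units
utf16_units = code_units(ch, "utf-16-be", 2)  # two 16-bit code units
utf32_units = code_units(ch, "utf-32-be", 4)  # one 32-bit code unit

lone_surrogate = "\ud800"  # a surrogate code point on its own
surrogate_ok = [can_encode(lone_surrogate, enc)
                for enc in ("utf-8", "utf-16-be", "utf-32-be")]

print([hex(u) for u in utf8_units])
print([hex(u) for u in utf16_units])
print([hex(u) for u in utf32_units])
print(surrogate_ok)
```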

The standard then defines encoding schemes that transform between code units and bytes:

* UTF-8 which is the same as its encoding form
* UTF-16LE and UTF-16BE which use different byte orders
* UTF-32LE and UTF-32BE which use different byte orders
* UTF-16 which is either UTF-16LE or UTF-16BE with a byte order mark for detection
* UTF-32 which is either UTF-32LE or UTF-32BE with a byte order mark for detection
The byte order mark is actually the Unicode character U+FEFF [https://util.unicode.org/UnicodeJsps/character.jsp?a=FEFF&B1=Show ZERO WIDTH NO-BREAK SPACE], but it is interpreted as a byte order mark for UTF-16 and UTF-32 when present at the start of encoded text. The initial U+FEFF code point is added during encoding and removed during decoding, but any other U+FEFF code points are kept.
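
The behaviour of the byte order mark can be observed with Python's built-in codecs, used here as one example implementation of the encoding schemes. The variable names are illustrative.

```python
import codecs

text = "A"

# Encoding schemes with an explicit byte order emit no byte order mark.
little = text.encode("utf-16-le")  # bytes 41 00
big = text.encode("utf-16-be")     # bytes 00 41

# The plain "utf-16" scheme prepends a byte order mark when encoding.
# Python uses the machine's native byte order for the rest of the data.
marked = text.encode("utf-16")
has_bom = marked[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# Decoding with "utf-16" reads the mark to pick the byte order, then drops it.
decoded = (codecs.BOM_UTF16_BE + big).decode("utf-16")

# A U+FEFF that is not at the start is kept as an ordinary character.
interior = "A\ufeffB".encode("utf-16").decode("utf-16")

# Decoding with an explicit byte order leaves a leading U+FEFF in the text.
kept = (codecs.BOM_UTF16_BE + big).decode("utf-16-be")

print(little, big, has_bom, repr(decoded), repr(interior), repr(kept))
```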

Some software treats the byte order mark as a signature to detect which Unicode encoding a text uses, or whether it uses Unicode at all. Software that does this may require UTF-8 text to include a byte order mark even though the encoding does not need one.
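
Python exposes this signature convention through its <code>utf-8-sig</code> codec, which makes for a small demonstration. This is a sketch of one implementation's behaviour, and other tools may handle the signature differently.

```python
import codecs

plain = "hi".encode("utf-8")       # no signature
signed = "hi".encode("utf-8-sig")  # U+FEFF encoded as UTF-8, then the text

starts_with_signature = signed.startswith(codecs.BOM_UTF8)

# The "utf-8-sig" decoder strips the signature if present and
# also accepts text that has none.
from_signed = signed.decode("utf-8-sig")
from_plain = plain.decode("utf-8-sig")

# The plain "utf-8" decoder keeps the signature as a U+FEFF character.
not_stripped = signed.decode("utf-8")

print(signed, starts_with_signature, repr(from_signed), repr(not_stripped))
```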

Unicode also offers a way to handle decoding failures gracefully. This is done by having decoders substitute invalid data with the U+FFFD [https://util.unicode.org/UnicodeJsps/character.jsp?a=FFFD&B1=Show REPLACEMENT CHARACTER] code point. This character may also be used as a fallback when unable to display a character, or when unable to convert non-Unicode text to Unicode.
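
A minimal sketch of this substitution, using the <code>errors="replace"</code> option of Python's decoder. The sample bytes are an assumption chosen to contain one byte that is not valid UTF-8.

```python
# 0xE9 is "é" in Latin-1, but here it is an incomplete UTF-8 sequence.
data = b"caf\xe9 ok"

# A strict decoder refuses the input outright.
try:
    data.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# A lenient decoder substitutes U+FFFD for the invalid byte and carries on.
lenient = data.decode("utf-8", errors="replace")

print(strict_failed, repr(lenient))
```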

All of these encodings may seem overwhelming, but in practice the two encodings you are most likely to work with are UTF-8 and UTF-16. The reason for this split is historical:

The first edition of Unicode had a 16-bit codespace and used a fixed-width 16-bit encoding named UCS-2. The first adopters of Unicode, such as Java and Windows, chose to represent Unicode with UCS-2, while software that required backwards compatibility, such as Unix, used UTF-8 and treated Unicode as just another character set.

The second edition of Unicode increased the codespace to 21 bits and introduced UTF-32 as its fixed-width encoding. UCS-2 was succeeded by the variable-width UTF-16 encoding we have today. A portion of the codespace was reserved as 'surrogate' code points to preserve compatibility between UCS-2 and UTF-16: UCS-2 systems see each surrogate as an ordinary 16-bit code point, while UTF-16 systems decode a pair of surrogates as a single code point beyond the 16-bit range.
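
The arithmetic behind surrogate pairs can be sketched as follows. The function names <code>to_surrogate_pair</code> and <code>from_surrogate_pair</code> are hypothetical helpers written for this example; the constants are the start of the high surrogate range (0xD800), the start of the low surrogate range (0xDC00), and the first code point beyond the 16-bit range (0x10000).

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F600)
round_trip = from_surrogate_pair(high, low)

# Compare against Python's own UTF-16 encoder.
expected = "\U0001F600".encode("utf-16-be")
matches_codec = high.to_bytes(2, "big") + low.to_bytes(2, "big") == expected

print(hex(high), hex(low), hex(round_trip), matches_codec)
```

A UCS-2 system reading these two code units would see two separate 16-bit values, while a UTF-16 system combines them into U+1F600.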

A lot of time is spent debating which of the two variable-width encodings is better and which you should use in new projects. In practice, the encoding you use is likely already decided by the tools you use and the communities or APIs you interact with.