== Normalization ==

An identical sequence of abstract characters may be represented by multiple different encoded character sequences. This can happen because an abstract character was encoded more than once, or because it can also be built from other encoded characters. An easy example is the ohm symbol, which may be represented as either of the following:

* U+03A9 "Ω": [https://util.unicode.org/UnicodeJsps/character.jsp?a=%CE%A9&B1=Show GREEK CAPITAL LETTER OMEGA]
* U+2126 "Ω": [https://util.unicode.org/UnicodeJsps/character.jsp?a=%E2%84%A6&B1=Show OHM SIGN]

For a harder example, the abstract character "é" may be represented as:

* U+00E9 "é": [https://util.unicode.org/UnicodeJsps/character.jsp?a=%C3%A9&B1=Show LATIN SMALL LETTER E WITH ACUTE]

But it can also be represented as the two-character sequence:

* U+0065 "e": [https://util.unicode.org/UnicodeJsps/character.jsp?a=e&B1=Show LATIN SMALL LETTER E]
* U+0301 " ́": [https://util.unicode.org/UnicodeJsps/character.jsp?a=301&B1=Show COMBINING ACUTE ACCENT]

These all encode the same abstract character in different ways: one is a precomposed character, the other is a base character followed by a combining character. This makes comparing such sequences for equality very difficult.

To solve this, Unicode defines a normalization algorithm that transforms a coded character sequence in such a way that it ensures all sequences of the same abstract characters are represented by the same coded character sequence. It works in a series of steps.

The first step is decomposition: each encoded character is recursively mapped to one or more encoded character sequences that are defined to be equivalent. For the most part this uses a mapping defined in the Unicode database, but special rules are required to decompose Hangul syllables. In the simple example above, LATIN SMALL LETTER E WITH ACUTE expands to two characters: LATIN SMALL LETTER E and COMBINING ACUTE ACCENT.
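The decomposition step can be tried out directly in Python, whose standard-library unicodedata module implements the Unicode normalization algorithm. A minimal sketch using the "é" example above:

```python
import unicodedata

precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

# The code point sequences differ, so naive comparison fails
print(precomposed == decomposed)  # False

# After canonical decomposition (NFD) both are the same sequence
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```

Comparing the NFD (or NFC) forms of two strings, rather than the raw strings, is how equality of abstract character sequences is usually checked in practice.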
The second step is re-ordering: multiple combining characters can be attached to a base character, and their order often reflects how the character was typed. This step re-orders the combining characters into a canonical order. Doing this requires an unbounded buffer, which can become a security hazard depending on the application. The standard defines the "Stream-Safe Text Process", which limits this step to processing runs of 30 combining characters, but it produces output that isn't normalized when dealing with uncharacteristically long inputs.

The third step is composition. This step is optional and reverses decomposition as a form of compression: it scans the new sequence and recursively matches character sequences in it against the decomposition mappings. Many opportunities to compose are excluded: various scripts have specific exclusions, and a single encoded character will not compose to another single encoded character. As an example of composition, LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT composes back to LATIN SMALL LETTER E WITH ACUTE. As an example of an exclusion, OHM SIGN decomposes to GREEK CAPITAL LETTER OMEGA but does not compose back to OHM SIGN.

When describing these steps I glossed over what it means for encoded characters to be equivalent. Unicode defines two forms of equivalence: canonical and compatibility equivalence. Both require that the encoded characters represent the same abstract character. Compatibility equivalence goes a step further and defines equivalence between encoded characters that have different appearances or behaviours. This usually covers formatting and other ways of writing a character, but not other variants of the character such as different cases.
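The re-ordering and composition steps, including the OHM SIGN exclusion, can also be observed with Python's unicodedata module (the dot-above/dot-below pair is a standard illustration of canonical ordering, not something from this article):

```python
import unicodedata

# Re-ordering: COMBINING DOT BELOW (combining class 220) sorts before
# COMBINING DOT ABOVE (combining class 230), whatever order was typed
typed = "q\u0307\u0323"  # q + dot above + dot below
print(unicodedata.normalize("NFD", typed) == "q\u0323\u0307")  # True

# Composition: NFC recomposes e + combining acute into U+00E9
print(unicodedata.normalize("NFC", "e\u0301") == "\u00e9")  # True

# Exclusion: OHM SIGN decomposes to GREEK CAPITAL LETTER OMEGA,
# but composition never produces OHM SIGN again
print(unicodedata.normalize("NFC", "\u2126") == "\u03a9")  # True
```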
These encoded characters are all compatibility equivalent to the digit two:

* U+0032 "2": [https://util.unicode.org/UnicodeJsps/character.jsp?a=2&B1=Show DIGIT TWO]
* U+00B2 "²": [https://util.unicode.org/UnicodeJsps/character.jsp?a=%C2%B2&B1=Show SUPERSCRIPT TWO]
* U+2082 "₂": [https://util.unicode.org/UnicodeJsps/character.jsp?a=%E2%82%82&B1=Show SUBSCRIPT TWO]
* U+2461 "②": [https://util.unicode.org/UnicodeJsps/character.jsp?a=%E2%91%A1&B1=Show CIRCLED DIGIT TWO]
* U+1D7D0 "𝟐": [https://util.unicode.org/UnicodeJsps/character.jsp?a=%F0%9D%9F%90&B1=Show MATHEMATICAL BOLD DIGIT TWO]

Compatibility equivalence combines with canonical equivalence during the decomposition step of the normalization algorithm. This creates two types of decomposition:

* Canonical decomposition, which uses only canonical equivalence
* Compatibility decomposition, which uses both canonical and compatibility equivalence

With all that in mind, Unicode defines the following normalization forms:

* Normalization Form D (NFD) uses canonical decomposition and skips composition
* Normalization Form C (NFC) uses canonical decomposition followed by composition
* Normalization Form KD (NFKD) uses compatibility decomposition and skips composition
* Normalization Form KC (NFKC) uses compatibility decomposition followed by composition

Normalization is stable between Unicode versions after 4.1 (released in 2005):

* Text normalized under an older version stays normalized under a newer version
* Text normalized under a newer version stays normalized under an older version if it contains only characters assigned in that older version

As a developer you will normally find normalization in code that checks for equality between abstract character sequences read from elsewhere, such as usernames in databases and filenames on filesystems. It is generally unnecessary for comparing text generated or manipulated within an application, unless those operations are not deterministic.
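The difference between the canonical and compatibility forms shows up clearly on the digit-two variants listed above; a small sketch using Python's unicodedata module:

```python
import unicodedata

# 2, SUPERSCRIPT TWO, SUBSCRIPT TWO, CIRCLED DIGIT TWO,
# MATHEMATICAL BOLD DIGIT TWO
twos = "2\u00b2\u2082\u2461\U0001d7d0"

# Canonical decomposition leaves the variants distinct:
# none of them are canonically equivalent to plain DIGIT TWO
print(unicodedata.normalize("NFD", twos) == twos)  # True

# Compatibility decomposition folds them all to DIGIT TWO
print(unicodedata.normalize("NFKD", twos))  # '22222'
```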
I also want to note that compatibility decomposition is only useful in specific text processing tasks: it does not act as a filter for malicious text that is designed to look visually identical to other text while using different abstract characters. Various security tools exist to filter these 'confusables', but they should not be used indiscriminately, as they are inherently lossy algorithms. One example where compatibility equivalence is useful is screen readers: formatted text may be read as its compatibility equivalent during normal reading, with the actual values read out verbosely later if needed.

For full details on the algorithm, check out the standard: [https://unicode.org/reports/tr15/ UAX #15: Unicode Normalization Forms]