== Segmentation ==

Code points and character sequences aren't useful in most text processing. Higher-level constructs that map to things humans perceive and reason about are required. Unicode provides a text segmentation algorithm for breaking character sequences into groups of sentences, words and user-perceived characters. This can be used to implement many common algorithms such as counting user-perceived characters in text, inserting and removing text, or parsing text into separate components.

As an example, take the following text: "Hi! 👋🏼". It consists of 6 code points:

* U+0048 "H": [https://util.unicode.org/UnicodeJsps/character.jsp?a=H&B1=Show LATIN CAPITAL LETTER H]
* U+0069 "i": [https://util.unicode.org/UnicodeJsps/character.jsp?a=i&B1=Show LATIN SMALL LETTER I]
* U+0021 "!": [https://util.unicode.org/UnicodeJsps/character.jsp?a=!&B1=Show EXCLAMATION MARK]
* U+0020: [https://util.unicode.org/UnicodeJsps/character.jsp?a=%20&B1=Show SPACE]
* U+1F44B "👋": [https://util.unicode.org/UnicodeJsps/character.jsp?a=%F0%9F%91%8B&B1=Show WAVING HAND SIGN]
* U+1F3FC "🏼": [https://util.unicode.org/UnicodeJsps/character.jsp?a=%F0%9F%8F%BC&B1=Show EMOJI MODIFIER FITZPATRICK TYPE-3]

It breaks into:

* 5 user-perceived characters: "H", "i", exclamation mark, space, and the waving hand with its skin-tone modifier
* 4 words: "Hi", exclamation mark, space, and the waving hand
* 2 sentences: "Hi! " (including a trailing space), and the waving hand

The default breaking algorithms do not do any kind of linguistic or locale analysis. Instead they are simple sets of rules designed to give useful results for arbitrary text.
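A quick way to inspect those code points yourself is Python's standard unicodedata module; a minimal sketch (any language that exposes code points works the same way):

```python
import unicodedata

text = "Hi! 👋🏼"

# Iterating a Python string yields one code point at a time, so this
# prints all 6 code points, not the 5 user-perceived characters.
for ch in text:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The last two lines of output show the waving hand and its skin-tone modifier as two separate code points, even though they display as one character.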
Some use cases considered for these rules include:

* Searching and ordering text
* Selecting text at different granularities
* Moving cursors through text
* Inserting and removing text when editing
* Counting occurrences of text elements

These are desirable goals in most computer programs and tolerant of edge cases: a boundary that is slightly wrong to a human usually doesn't matter in these cases as long as it is consistently wrong. For stronger segmentation guarantees these rules can be tailored for a specific application or discarded entirely in favour of tools like natural language processing.

One type of segmentation gets a lot more attention than the others: user-perceived characters. These are segmented as 'grapheme clusters' and come in two variants: legacy and extended. Unless you need to deal with backwards compatibility, extended grapheme clusters are the ones to use. Words and sentences are by default made up of grapheme clusters.

Grapheme clusters are the closest representation you can get to the idea of a single abstract character. Some newer programming languages even use them as the default abstraction for their strings. This turns out to work fairly well and reduces the difficulty of writing Unicode-compliant programs.

The main downside to this approach is that string operations are no longer guaranteed to be reproducible between program environments and versions. Unicode text may be split one way on one system and another way on another, or change behaviour on a system upgrade. One real-world example: if you're given a giant character sequence of one base character and thousands of combining characters, one system may treat it as one grapheme cluster, while another may split it during normalization into many grapheme clusters.

This lack of stability isn't necessarily a bad thing. After all, the world changes and so must our tools.
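To make the code point/grapheme cluster mismatch concrete, here is a sketch using Python, whose built-in strings are sequences of code points rather than grapheme clusters:

```python
text = "Hi! 👋🏼"

# len() counts code points: 6, even though a reader sees 5 characters.
print(len(text))

# Indexing and reversal operate on code points too, so they can tear a
# grapheme cluster apart: the last element is the bare skin-tone
# modifier, and reversing the string puts it before the waving hand.
print(text[-1] == "\U0001F3FC")
print(text[::-1])
```

Languages whose strings are grapheme-cluster based would report a length of 5 here and keep the emoji intact when reversing, at the cost of the reproducibility caveats described above.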
But it needs to be kept in mind for applications that expect the stability traditional strings provide. A method to serialize sequences of grapheme clusters would help here, instead of having to recompute them from code points.

All that said, many applications don't segment text using these algorithms. The most common approach is to not segment text at all and match code point sequences, or to search and map code point sequences to characters. This tends to work well enough for most applications, but can create some confusing situations:

* "Jose" can match with "José" if the accent is a separate code point
* The flag "🇩🇪" (regional indicators DE) matches against "🇧🇩🇪🇺" (indicators BD and EU)
* The unused regional indicator combinations AB and BC may render as a sole A indicator, "🇧🇧" (regional indicators BB) and a sole C indicator

For full details on the algorithm check out the standard: [https://unicode.org/reports/tr29/ UAX #29: Unicode Text Segmentation]

A related but separate line breaking algorithm can be found at: [https://www.unicode.org/reports/tr14/ UAX #14: Unicode Line Breaking Algorithm]

You can experiment with breaks online using the [https://util.unicode.org/UnicodeJsps/breaks.jsp Unicode Utilities: Breaks] tool.
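The confusing matches listed above can be reproduced with ordinary substring search; a minimal sketch in Python, using NFD normalization from the standard unicodedata module to split the accent into a separate combining code point:

```python
import unicodedata

# With the accent precomposed as U+00E9, "Jose" does not match...
nfc = "Jos\u00E9"
print("Jose" in nfc)   # False

# ...but after NFD normalization "é" becomes "e" + U+0301 COMBINING
# ACUTE ACCENT, and a naive substring search now finds "Jose".
nfd = unicodedata.normalize("NFD", nfc)
print("Jose" in nfd)   # True

# Flags are pairs of regional indicator code points, so the DE flag
# is found inside the BD + EU flag sequence by code point search.
flags = "\U0001F1E7\U0001F1E9\U0001F1EA\U0001F1FA"  # BD flag + EU flag
de = "\U0001F1E9\U0001F1EA"                          # DE flag
print(de in flags)     # True
```

A search that respected grapheme cluster (or flag sequence) boundaries would reject both of the surprising matches.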