== Segmentation ==

Code points and character sequences aren't useful in most text processing. Higher-level constructs that map to things humans perceive and reason about are required. Unicode provides a text segmentation algorithm for breaking character sequences into groups of sentences, words and user-perceived characters. This can be used to implement many common algorithms such as counting user-perceived characters in text, inserting and removing text, or parsing text into separate components.

As an example, take the following text: "Hi! 👋🏼". It consists of 6 code points:

* U+0048 "H": [https://util.unicode.org/UnicodeJsps/character.jsp?a=H&B1=Show LATIN CAPITAL LETTER H]
* U+0069 "i": [https://util.unicode.org/UnicodeJsps/character.jsp?a=i&B1=Show LATIN SMALL LETTER I]
* U+0021 "!": [https://util.unicode.org/UnicodeJsps/character.jsp?a=!&B1=Show EXCLAMATION MARK]
* U+0020: [https://util.unicode.org/UnicodeJsps/character.jsp?a=%20&B1=Show SPACE]
* U+1F44B "👋": [https://util.unicode.org/UnicodeJsps/character.jsp?a=%F0%9F%91%8B&B1=Show WAVING HAND SIGN]
* U+1F3FC "🏼": [https://util.unicode.org/UnicodeJsps/character.jsp?a=%F0%9F%8F%BC&B1=Show EMOJI MODIFIER FITZPATRICK TYPE-3]

It breaks into:

* 5 user-perceived characters: "H", "i", exclamation mark, space, and the waving hand with its skin-tone modifier
* 4 words: "Hi", exclamation mark, space, and the waving hand
* 2 sentences: "Hi! " (including a trailing space), and the waving hand

The default breaking algorithms do not do any kind of linguistic or locale analysis. Instead they are simple sets of rules designed to give useful results for arbitrary text.
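A quick way to inspect those code points yourself is Python's standard unicodedata module; a minimal sketch (any language that exposes code points works the same way):

```python
import unicodedata

text = "Hi! 👋🏼"

# Iterating a Python string yields one code point at a time, so this
# prints all 6 code points, not the 5 user-perceived characters.
for ch in text:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The last two lines of output show the waving hand and its skin-tone modifier as two separate code points, even though they display as one character.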
Some use cases considered for these rules include:

* Searching and ordering text
* Selecting text at different granularities
* Moving cursors through text
* Inserting and removing text when editing
* Counting occurrences of text elements

These are desirable goals in most computer programs and tolerant of edge cases: a boundary that is slightly wrong to a human usually doesn't matter in these cases as long as it is consistently wrong. For stronger segmentation guarantees these rules can be tailored for a specific application or discarded entirely in favour of tools like natural language processing.

One type of segmentation gets a lot more attention than the others: user-perceived characters. These are segmented as 'grapheme clusters' and come in two variants: legacy and extended. Unless you need to deal with backwards compatibility, extended grapheme clusters are the ones to use. Words and sentences are by default made up of grapheme clusters.

Grapheme clusters are the closest representation you can get to the idea of a single abstract character. Some newer programming languages even use them as the default abstraction for their strings. This turns out to work fairly well and reduces the difficulty of writing Unicode-compliant programs.

The main downside to this approach is that string operations are no longer guaranteed to be reproducible between program environments and versions. Unicode text may be split one way on one system and another way on another, or change behaviour on a system upgrade. One real-world example: if you're given a giant character sequence of one base character and thousands of combining characters, one system may treat it as one grapheme cluster, while another may split it during normalization into many grapheme clusters.

This lack of stability isn't necessarily a bad thing. After all, the world changes and so must our tools.
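To make the code point/grapheme cluster mismatch concrete, here is a sketch using Python, whose built-in strings are sequences of code points rather than grapheme clusters:

```python
text = "Hi! 👋🏼"

# len() counts code points: 6, even though a reader sees 5 characters.
print(len(text))

# Indexing and reversal operate on code points too, so they can tear a
# grapheme cluster apart: the last element is the bare skin-tone
# modifier, and reversing the string puts it before the waving hand.
print(text[-1] == "\U0001F3FC")
print(text[::-1])
```

Languages whose strings are grapheme-cluster based would report a length of 5 here and keep the emoji intact when reversing, at the cost of the reproducibility caveats described above.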
But it needs to be kept in mind for applications that expect the stability traditional strings provide. A method to serialize sequences of grapheme clusters would help here, instead of having to recompute them from code points.

All that said, many applications don't segment text using these algorithms. The most common approach is to not segment text at all and match code point sequences, or to search and map code point sequences to characters. This tends to work well enough for most applications, but can create some confusing situations:

* "Jose" can match with "José" if the accent is a separate code point
* The flag "🇩🇪" (regional indicators DE) matches against "🇧🇩🇪🇺" (indicators BD and EU)
* The unused regional indicator combinations AB and BC may render as a sole A indicator, "🇧🇧" (regional indicators BB) and a sole C indicator

For full details on the algorithm check out the standard: [https://unicode.org/reports/tr29/ UAX #29: Unicode Text Segmentation]

A related but separate line breaking algorithm can be found at: [https://www.unicode.org/reports/tr14/ UAX #14: Unicode Line Breaking Algorithm]

You can experiment with breaks online using the [https://util.unicode.org/UnicodeJsps/breaks.jsp Unicode Utilities: Breaks] tool.
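The confusing matches listed above can be reproduced with ordinary substring search; a minimal sketch in Python, using NFD normalization from the standard unicodedata module to split the accent into a separate combining code point:

```python
import unicodedata

# With the accent precomposed as U+00E9, "Jose" does not match...
nfc = "Jos\u00E9"
print("Jose" in nfc)   # False

# ...but after NFD normalization "é" becomes "e" + U+0301 COMBINING
# ACUTE ACCENT, and a naive substring search now finds "Jose".
nfd = unicodedata.normalize("NFD", nfc)
print("Jose" in nfd)   # True

# Flags are pairs of regional indicator code points, so the DE flag
# is found inside the BD + EU flag sequence by code point search.
flags = "\U0001F1E7\U0001F1E9\U0001F1EA\U0001F1FA"  # BD flag + EU flag
de = "\U0001F1E9\U0001F1EA"                          # DE flag
print(de in flags)     # True
```

A search that respected grapheme cluster (or flag sequence) boundaries would reject both of the surprising matches.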