Unicode guide

The main downside to this approach is that string operations are no longer guaranteed to be reproducible between program environments and versions. Unicode text may be split one way on one system and another way on another, or change behaviour on system upgrade. One real-world example would be a giant character sequence of one base character followed by thousands of combining characters. One system may treat this as one grapheme cluster; another may split it into many grapheme clusters during normalization.
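
As a rough illustration of how two systems could disagree, the following Python sketch groups a base character with the combining marks that follow it. It is not the real grapheme cluster segmentation algorithm: `naive_clusters` and its `max_marks` cap are hypothetical names invented here, and the cap of 30 is an assumption chosen for the example, standing in for whatever limit a real system might apply.

```python
import unicodedata


def naive_clusters(text, max_marks=None):
    """Group each base code point with the combining marks after it.

    Illustration only: this approximates clusters using the canonical
    combining class and ignores the full segmentation rules.
    max_marks is a hypothetical per-cluster limit; None means no limit.
    """
    clusters = []
    marks = 0
    for ch in text:
        is_mark = unicodedata.combining(ch) != 0
        if clusters and is_mark and (max_marks is None or marks < max_marks):
            clusters[-1] += ch  # extend the current cluster
            marks += 1
        else:
            clusters.append(ch)  # start a new cluster
            marks = 0
    return clusters


# One base character followed by thousands of combining acute accents.
text = "a" + "\u0301" * 5000

system_a = naive_clusters(text)                # no limit on marks
system_b = naive_clusters(text, max_marks=30)  # assumed limit of 30 marks

# Same code points, different cluster counts.
print(len(text), len(system_a), len(system_b))
```

Both results contain exactly the same code points; only the boundaries differ, which is the kind of disagreement described above.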


This lack of stability isn't necessarily a bad thing. After all, the world changes and so must our tools. But it needs to be kept in mind for applications that expect the stability traditional strings provide. A method to serialize sequences of grapheme clusters would help here, instead of having to recompute them from code points.
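
One possible reading of serializing grapheme clusters is to store the boundaries explicitly, so a reader never has to recompute them. The sketch below does this with a JSON array of strings. The function names `dump_clusters` and `load_clusters` are hypothetical, and JSON is only an assumed container, not a format this guide prescribes.

```python
import json


def dump_clusters(clusters):
    """Serialize a list of clusters, one JSON string per cluster."""
    return json.dumps(clusters, ensure_ascii=True)


def load_clusters(data):
    """Restore the clusters exactly as they were segmented when written."""
    return json.loads(data)


# Clusters as segmented by the writing system (segmentation not shown here).
clusters = [
    "e\u0301",                                         # e + combining acute
    "a",
    "\U0001F468\u200D\U0001F469\u200D\U0001F467",      # emoji joined with ZWJ
]

stored = dump_clusters(clusters)
restored = load_clusters(stored)
```

Because the boundaries travel with the data, a reader on a different system or Unicode version gets the same sequence of clusters back, at the cost of a larger encoding.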


All that said, many applications don't segment text using these algorithms. The most common approach is to not segment text at all and match code point sequences, or to search and map code point sequences to characters.       
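
A small Python sketch of matching code point sequences without any segmentation follows. The sample strings and the helper name `nfc_equal` are assumptions made for illustration. It shows two consequences: equal-looking text can fail to match, and a match can land inside what a user would see as one character.

```python
import unicodedata

composed = "caf\u00e9"     # é as the single code point U+00E9
decomposed = "cafe\u0301"  # e followed by U+0301 COMBINING ACUTE ACCENT

# Plain code point comparison: the two spellings are different sequences.
same_raw = composed == decomposed

# Plain code point search: "cafe" matches, although the final e is
# visually part of é in the decomposed string.
hit_inside = "cafe" in decomposed


def nfc_equal(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


same_normalized = nfc_equal(composed, decomposed)

print(same_raw, hit_inside, same_normalized)
```

Normalizing before comparison removes the first mismatch, but it does not stop a search from matching part of a cluster; that still requires segmentation.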