Editing Unicode guide


Latest revision Your text
Line 276: Line 276:
The decision on how to handle non-Unicode data is highly contextual and can range from simple error messages to complex mappings between non-Unicode and Unicode data.


TODO
- early conversion


structure an application to reduce conversions, only convert when necessary. prefer converting unicode to non-unicode. do not mix unicode and non-unicode data. avoid unnecessary conversions
- late conversion


unicode -> non-unicode: easy
- conversion TO unicode has a cost


non-unicode -> unicode: complicated
- conversion FROM unicode is free
 
unicode -> unicode: non-issue
 
non-unicode -> non-unicode: non-issue
 
round trips increase pain
 
separate pipelines in an application
 
complexity goes up by number of conversions in an application
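
A minimal sketch of why round trips hurt, assuming Python and a hypothetical Latin-1 file name that is not valid UTF-8. Best-effort decoding succeeds, but converting back does not recover the original bytes:

```python
# Hypothetical input: Latin-1 encoded bytes that are not valid UTF-8.
raw = b"caf\xe9.txt"

# Best-effort conversion to Unicode replaces the undecodable byte with U+FFFD.
text = raw.decode("utf-8", errors="replace")

# Converting back encodes U+FFFD itself, so the original byte is gone.
round_tripped = text.encode("utf-8")

print(repr(text))
print(round_tripped == raw)
```

Each extra conversion in a pipeline is another place where this kind of loss, or a hard failure, can occur.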
 
- greedy conversion: convert early, fail hard, no best-effort conversion. all data is unicode. easy to understand, but applications are fragile. prefer operating on unicode data.
 
- lazy conversion: convert only when required, allow best-effort conversion, very robust. prefer operating on non-unicode data. requires tracking and mixing non-unicode data. the cost here comes from the data contents, not the data types: you can guarantee a conversion won't fail if you control the data, such as when appending a file path
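
The two strategies above can be sketched roughly as follows. This is an illustration only, assuming Python, POSIX-style paths, and UTF-8 as the expected encoding; `greedy_join`, `lazy_join` and `display` are hypothetical helper names, not part of any library:

```python
import posixpath

def greedy_join(directory: bytes, name: bytes) -> str:
    """Greedy: convert to Unicode immediately and fail hard on bad input."""
    # The default errors="strict" raises UnicodeDecodeError on invalid bytes.
    return posixpath.join(directory.decode("utf-8"), name.decode("utf-8"))

def lazy_join(directory: bytes, name: bytes) -> bytes:
    """Lazy: operate on the non-Unicode data without converting it."""
    # Joining bytes cannot fail because of what the bytes contain.
    return posixpath.join(directory, name)

def display(path: bytes) -> str:
    """Best-effort conversion, done only when text is actually required."""
    return path.decode("utf-8", errors="replace")

directory = b"/data"
name = b"caf\xe9.txt"  # hypothetical file name that is not valid UTF-8

lazy_path = lazy_join(directory, name)

try:
    greedy_join(directory, name)
    greedy_failed = False
except UnicodeDecodeError:
    greedy_failed = True

print(greedy_failed, display(lazy_path))
```

The greedy version rejects the file name outright, while the lazy version can still build and use the path and only loses information at display time.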


== Mixed strings ==
Line 313: Line 299:
- these are often called 'OS strings' but i would call them  


- conversion from unicode only works if the string lacks surrogates and has valid code points
 


https://peps.python.org/pep-0383/
https://peps.python.org/pep-0383/
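
As a sketch of the mechanism PEP 383 describes, Python's `surrogateescape` error handler stores undecodable bytes as lone surrogates inside a `str`. The byte string below is a hypothetical example; the rest is standard `bytes.decode` / `str.encode` behaviour:

```python
raw = b"caf\xe9.txt"  # hypothetical non-UTF-8 file name

# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", errors="surrogateescape")

# A strict conversion from this string fails, because it contains a surrogate.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(repr(text), restored == raw, strict_ok)
```

The resulting string is mixed data: mostly Unicode text, plus escaped bytes that only survive if every later conversion also uses `surrogateescape`.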
Line 336: Line 322:


- segmented text
- unicode strings may be encoded data, code units, code points, non-surrogate code points, or mixed data
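
To illustrate how the same text looks at different levels, here is a small sketch assuming Python, where `str` is a sequence of code points. `has_surrogates` is a hypothetical helper written for this example:

```python
def has_surrogates(s: str) -> bool:
    """Return True if any code point in s lies in the surrogate range."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)

s = "\U0001F600"  # a single code point outside the Basic Multilingual Plane

code_points = len(s)                           # code points
utf8_bytes = len(s.encode("utf-8"))            # UTF-8 code units (bytes)
utf16_units = len(s.encode("utf-16-le")) // 2  # UTF-16 code units

# Hypothetical mixed string: text plus one byte escaped as a lone surrogate.
mixed = "caf\udce9"

print(code_points, utf8_bytes, utf16_units)
print(has_surrogates(s), has_surrogates(mixed))
```

Which of these views a "unicode string" exposes depends on the language and API, which is why the notes above treat them as distinct kinds of data.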


TODO: locale information/rich text