== Non-Unicode data ==
Although many programming languages and development tools support Unicode, we still live in a world full of non-Unicode data. This includes data in other encodings and character sets, corrupted data, or even malicious data attempting to bypass security mechanisms. This data must be handled mindfully according to an application's requirements.

There are only a few ways to deal with non-Unicode data:
* Don't treat the data as Unicode
* Reject the data and request Unicode
* Do a best effort conversion to Unicode

Which action to take depends heavily on how important it is to preserve the original data, and on how important it is to perform Unicode processing on the text.

For example:
* A filesystem may treat paths as bytes and not perform Unicode processing
* A website may reject a post that isn't valid Unicode and ask the user to resubmit it
* A file manager may track filenames as bytes but display them as best effort Unicode
* A photo labeller may prepend Unicode dates to a non-Unicode filename

The decision on how to handle non-Unicode data is highly contextual and can range from simple error messages to complex mappings between non-Unicode and Unicode data.

TODO:
* Structure an application to reduce conversions and only convert when necessary. Prefer converting Unicode to non-Unicode. Do not mix Unicode and non-Unicode data. Avoid unnecessary conversions.
* Conversion difficulty by direction:
** Unicode -> non-Unicode: easy
** non-Unicode -> Unicode: complicated
** Unicode -> Unicode: non-issue
** non-Unicode -> non-Unicode: non-issue
* Round trips increase pain.
* Separate pipelines in an application.
* Complexity goes up with the number of conversions in an application.
* Greedy conversion: convert early, fail hard, no best effort conversion. All data is Unicode. Easy to understand, but makes for fragile applications. Prefer operating on Unicode data.
* Lazy conversion: convert only when required, allow best effort conversion. Very robust. Prefer operating on non-Unicode data. Requires tracking and mixing non-Unicode data.
* The cost here comes not from the data types but from the data contents: you can guarantee a conversion won't fail if you control the data, such as appending a file path.
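The three ways of handling non-Unicode data listed earlier in this section, plus the photo labeller case, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: UTF-8 stands in for "Unicode", the function names are hypothetical, and the sample filename is made up. A real application would pick one approach based on its own requirements.

```python
def keep_as_bytes(raw: bytes) -> bytes:
    """Don't treat the data as Unicode: pass the bytes through untouched."""
    return raw


def reject_non_unicode(raw: bytes) -> str:
    """Reject the data: a strict decode raises UnicodeDecodeError on
    invalid input, so the caller can ask for valid Unicode instead."""
    return raw.decode("utf-8", errors="strict")


def best_effort(raw: bytes) -> str:
    """Best effort conversion: invalid bytes become U+FFFD.
    This is lossy, so the result is for display, not for storage."""
    return raw.decode("utf-8", errors="replace")


def prepend_date(filename: bytes, date: str) -> bytes:
    """Photo labeller case: convert the Unicode date to bytes and prepend
    it. The filename itself is never decoded, so this cannot fail on it."""
    return date.encode("utf-8") + b"_" + filename


if __name__ == "__main__":
    # Hypothetical filename in Latin-1; 0xE9 is not valid UTF-8 here.
    raw = b"caf\xe9.jpg"

    print(keep_as_bytes(raw))
    print(best_effort(raw))
    print(prepend_date(raw, "2024-01-31"))

    try:
        reject_non_unicode(raw)
    except UnicodeDecodeError as err:
        print("rejected:", err.reason)
```

Note that `prepend_date` converts in the Unicode to non-Unicode direction, on data the application controls, which is why it needs no error handling.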