
== Non-Unicode data ==
Although many programming languages and development tools support Unicode, we still live in a world full of non-Unicode data. This includes data in other encodings and character sets, corrupted data, or even malicious data attempting to bypass security mechanisms. This data must be handled mindfully according to an application's requirements.

There are only a few ways to deal with non-Unicode data:

* Don't treat the data as Unicode
* Reject the data and request Unicode
* Do a best effort conversion to Unicode
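The three options above can be sketched in Python roughly as follows. This is only an illustration, assuming the application expects UTF-8; the function names (`keep_as_bytes`, `decode_or_reject`, `decode_best_effort`) are hypothetical and not part of any library.

```python
raw = b"caf\xe9"  # "café" encoded as Latin-1; not valid UTF-8


def keep_as_bytes(data: bytes) -> bytes:
    """Option 1: don't treat the data as Unicode; store and compare raw bytes."""
    return data


def decode_or_reject(data: bytes) -> str:
    """Option 2: reject invalid input so the caller can request Unicode."""
    try:
        return data.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError as exc:
        raise ValueError("input is not valid UTF-8; please resubmit") from exc


def decode_best_effort(data: bytes) -> str:
    """Option 3: best effort; invalid bytes become U+FFFD."""
    return data.decode("utf-8", errors="replace")


print(keep_as_bytes(raw))             # b'caf\xe9'
print(repr(decode_best_effort(raw)))  # 'caf\ufffd'
```

Note that the best-effort result no longer records which byte was invalid, so it suits display better than storage.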

Which action to take depends heavily on how important it is to preserve the original data versus how important it is to perform Unicode processing on the text. For example:

* A filesystem may treat paths as bytes and not perform Unicode processing
* A website may reject a post that isn't valid Unicode and ask the user to resubmit it
* A file manager may track filenames as bytes but display them as best effort Unicode
* A photo labeller may prepend Unicode dates to a non-Unicode filename

The decision on how to handle non-Unicode data is highly contextual and can range from simple error messages to complex mappings between non-Unicode and Unicode data.
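As a rough illustration of the file manager and photo labeller cases, the sketch below keeps the filename as bytes throughout and produces text only for display. The helper names (`display_name`, `label_photo`) and the sample filename are made up for this example.

```python
from datetime import date


def display_name(filename: bytes) -> str:
    """Best-effort text for display only; the bytes remain the real name."""
    return filename.decode("utf-8", errors="replace")


def label_photo(filename: bytes, taken: date) -> bytes:
    """Prepend an ISO date to a filename without decoding the filename."""
    # The ISO date is ASCII digits and hyphens, so encoding it cannot fail.
    prefix = taken.isoformat().encode("utf-8")
    return prefix + b"_" + filename


name = b"vacances_\xe9t\xe9.jpg"  # Latin-1 bytes, invalid as UTF-8

print(display_name(name))                    # shown to the user, lossy
print(label_photo(name, date(2024, 1, 31)))  # original bytes preserved
```

The labeller converts the Unicode date down to bytes rather than converting the filename up to Unicode, so the original name is never altered.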

TODO

Structure an application to minimize conversions: convert only when necessary. When a conversion is needed, prefer converting Unicode to non-Unicode. Do not mix Unicode and non-Unicode data.

The difficulty of a conversion depends on its direction:

* Unicode -> non-Unicode: easy
* Non-Unicode -> Unicode: complicated
* Unicode -> Unicode: non-issue
* Non-Unicode -> non-Unicode: non-issue

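Round trips (non-Unicode to Unicode and back) increase the pain, because a best-effort conversion may not restore the original data.

Keep Unicode and non-Unicode data in separate pipelines within an application.

Complexity goes up with the number of conversions in an application.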
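The cost of a round trip can be shown with a short Python sketch, assuming UTF-8 as the target encoding. The `surrogateescape` error handler is a Python-specific mechanism and is shown only as one possible way to make a round trip lossless; other languages may offer nothing equivalent.

```python
original = b"report_\xff.txt"  # the byte 0xFF never appears in valid UTF-8

# Best-effort decoding followed by re-encoding does NOT restore the bytes:
# the invalid byte was replaced by U+FFFD, which encodes as EF BF BD.
lossy = original.decode("utf-8", errors="replace").encode("utf-8")

# "surrogateescape" maps each undecodable byte to a lone surrogate code
# point, and encoding with the same handler maps it back to that byte.
text = original.decode("utf-8", errors="surrogateescape")
restored = text.encode("utf-8", errors="surrogateescape")

print(lossy == original)     # False
print(restored == original)  # True
```

The lossless variant has its own price: `text` contains a lone surrogate, so it is not well-formed Unicode and a strict `text.encode("utf-8")` raises `UnicodeEncodeError`.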

* Greedy conversion: convert early, fail hard, and never do best-effort conversion, so all data inside the application is Unicode. Easy to understand, but makes for fragile applications. Prefers operating on Unicode data.

* Lazy conversion: convert only when required and allow best-effort conversion. Very robust. Prefers operating on non-Unicode data, which requires tracking and mixing non-Unicode data. The cost here comes not from the data types but from the data contents: you can guarantee a conversion won't fail if you control the data, such as when appending to a file path.
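The two strategies might look like this in Python. This is a minimal sketch under the assumption that input arrives as bytes and UTF-8 is the expected encoding; `read_greedy`, `add_backup_suffix`, and `describe` are hypothetical names chosen for the example.

```python
def read_greedy(raw: bytes) -> str:
    """Greedy: convert at the boundary and fail hard on invalid input."""
    return raw.decode("utf-8")  # strict; raises UnicodeDecodeError


def add_backup_suffix(path: bytes) -> bytes:
    """Lazy: stay in bytes; only the suffix we control is converted."""
    suffix = ".bak"  # controlled data, so this encode cannot fail
    return path + suffix.encode("utf-8")


def describe(path: bytes) -> str:
    """Lazy: convert only when text is required, here for a message."""
    return "backing up " + path.decode("utf-8", errors="replace")


path = b"data_\xff"  # not valid UTF-8

print(add_backup_suffix(path))  # b'data_\xff.bak'
print(describe(path))           # best-effort text, original bytes untouched
```

With the greedy approach, `read_greedy(path)` raises on this input and the application never sees it. The lazy approach keeps working, at the price of carrying bytes through code that might otherwise expect text.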