Fixing problems and getting explanations

Ode to a Shipping Label

A poem about mojibake, whose original author might be Carlos Bueno on Facebook, shows a shipping label that serves as an excellent example for this section. The label is addressed to the surname López, but the name is printed as L&AMP;AMP;ATILDE;&AMP;AMP;SUP3;PEZ.

A package addressed to a name including "LóPEZ"

We can use ftfy not only to fix the text on the label, but also to show us what happened to it, as the poem does:

>>> from ftfy import fix_and_explain, apply_plan
>>> shipping_label = "L&AMP;AMP;ATILDE;&AMP;AMP;SUP3;PEZ"
>>> fixed, explanation = fix_and_explain(shipping_label)
>>> fixed
'LóPEZ'

>>> explanation
[('apply', 'unescape_html'),
 ('apply', 'unescape_html'),
 ('apply', 'unescape_html'),
 ('encode', 'latin-1'),
 ('decode', 'utf-8')]

The capitalization is inconsistent because the encoding of a lowercase “ó” is in there, but everything was printed in capital letters.
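
To see what the steps in the explanation amount to, here is a rough sketch that replays them using only the Python standard library. It is an approximation, not ftfy's implementation: html.unescape stands in for ftfy's unescape_html fixer, and because html.unescape is case-sensitive for most entity names, the sample string is a hypothetical stand-in that uses conventionally cased entities rather than the all-caps text on the label.

```python
import html

# Hypothetical stand-in for the label, with conventionally cased entities
# (the stdlib html.unescape would not decode all-caps names like &ATILDE;).
label = "L&amp;amp;Atilde;&amp;amp;sup3;PEZ"

# Three ('apply', 'unescape_html') steps: each pass peels off one layer
# of HTML escaping.
text = label
for _ in range(3):
    text = html.unescape(text)

# At this point the text is mojibake: the UTF-8 bytes of a lowercase o
# with acute accent, displayed as two Latin-1 characters.
print(text)

# ('encode', 'latin-1') followed by ('decode', 'utf-8'): turn the
# characters back into the original bytes, then read them as UTF-8.
repaired = text.encode("latin-1").decode("utf-8")
print(repaired)
```

The last two steps work because Latin-1 maps each of its 256 characters to exactly one byte, so encoding with it recovers the bytes that were misread in the first place.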

The same explanation can often be applied to different text that has the same problem:

>>> label2 = "CARR&AMP;AMP;ATILDE;&AMP;AMP;COPY;"
>>> apply_plan(label2, explanation)
'CARRé'

Functions that fix text

The function that you’ll probably use most often is ftfy.fix_text(), which applies all the fixes it can to every line of text, and returns the fixed text.

ftfy.fix_and_explain() takes the same arguments as ftfy.fix_text(), but also provides an explanation, as we saw in the first section.

Unlike ftfy.fix_text(), ftfy.fix_and_explain() doesn’t separate the text into lines that it fixes separately – because it’s looking for a unified explanation of what happened to the text, not a different one for each line.

A more targeted function is ftfy.fix_encoding_and_explain(), which only fixes problems that can be solved by encoding and decoding the text, not other problems such as HTML entities.

This function has a counterpart, ftfy.fix_encoding(), that returns just the fixed string, without the explanation. It still fixes the string as a whole, not line by line.

The return type of the ..._and_explain functions is a kind of NamedTuple called ExplainedText, with the fields text and explanation.

These explanations can be re-applied to text using apply_plan().

Showing the characters in a string

A different kind of explanation you might need is simply a breakdown of what Unicode characters a string contains. For this, ftfy provides a utility function, ftfy.explain_unicode().

A command-line utility that provides similar information, and even more detail, is lunasorcery’s utf8info.