Working with Arabic in Python

Part of working for a language services company means you get very familiar with the Unicode standard. Intimately so, to the point of awkward night-after calls. Case in point: Arabic shaping. The basic gist is that, because Arabic script is cursive, the appearance of any specific character depends on how it joins to its neighbors. For the eye-dilating details, refer to Section 8.2 of the Unicode Standard 5.1.0.

One aspect of our system uses ReportLab to create PDFs in any of the languages we support. Given Middle Eastern languages are all the rage right now (I’ll let you figure out why that might be), it didn’t take long for us to hit right-to-left (RTL) languages and, in particular, Arabic. Python’s support for RTL languages, beyond storing and manipulating the Unicode text itself, is essentially nonexistent. Since we know each passage’s language, a very crude approach is simply reversing the string with text[::-1] (Python strings have no reverse() method). That, however, doesn’t get you anywhere with Arabic shaping. It also makes for interesting words whenever one of these RTL passages includes an English proper noun.
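To make the proper-noun problem concrete, here is a minimal sketch of the crude approach. The helper name naive_rtl and the sample passage are mine, purely for illustration, and the Arabic is written as escapes so an editor's own bidi handling doesn't muddy what's stored.

```python
def naive_rtl(text: str) -> str:
    """Crude 'visual order': reverse every code point in the passage."""
    return text[::-1]


# Hypothetical sample: the Arabic word "marhaba" (meem, reh, hah, beh, alef)
# followed by an English proper noun, stored in logical (typing) order.
arabic_word = "\u0645\u0631\u062d\u0628\u0627"
passage = arabic_word + " Python"

visual = naive_rtl(passage)

# The Arabic letters now sit in visual order, but so does the Latin run,
# which comes out as "nohtyP". No shaping has happened either: every
# letter is still its base code point.
print(visual)
```

A real bidi algorithm would reverse only the RTL runs and leave the Latin run readable, which is exactly what the naive slice can't do.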

In comes FriBidi and its Python offspring, PyFriBidi. Along with providing UAX #9-compliant RTL handling, the latest version of FriBidi also includes legacy Arabic shaping. “Legacy” is key here, though. If you’re going to be dealing with any Arabic script-based languages, it’s worth understanding why that is.

When displaying Arabic text, a rendering engine essentially goes through two steps. The first is to correctly order the letters based on the RTL properties of each Unicode character. This can be done without knowledge of the target font since it’s purely an interpretation of Unicode data. The next step is “shaping”, which, as previously mentioned, involves selecting the appropriate visual representation of a character based on its joining properties, neighboring characters, and ligatures. This is where the font matters since there is no separate Unicode character for each combination. The only information a Unicode character provides is what types of joining it supports (right, left, dual, and none). Beyond that, glyph selection is based on OpenType information provided by the font. Per the Unicode Standard, a font must provide a minimum number of glyph combinations if it supports an Arabic code point.
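The "no font required" part of the first step is easy to see from Python itself. As a rough sketch (the helper name bidi_classes and the sample string are my own), the standard unicodedata module exposes each character's bidirectional class, which is the raw input a UAX #9 implementation works from. It doesn't expose joining types, so this only illustrates the ordering half.

```python
import unicodedata


def bidi_classes(text):
    """Return (character name, bidirectional class) for each code point."""
    return [
        (unicodedata.name(ch, "<unnamed>"), unicodedata.bidirectional(ch))
        for ch in text
    ]


# Hypothetical sample: Arabic beh, Latin A, digit 1, space,
# Hebrew alef, Arabic-Indic digit three.
sample = "\u0628A1 \u05d0\u0663"

for name, cls in bidi_classes(sample):
    # AL = Arabic letter, L = left-to-right, EN = European number,
    # WS = whitespace, R = right-to-left, AN = Arabic number.
    print(f"{cls:>3}  {name}")
```

Everything printed here comes from the Unicode Character Database bundled with Python; no font is consulted at any point.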

FriBidi, and PyFriBidi by extension, performs Arabic shaping with no knowledge of the target font. What it’s actually doing is replacing the original Arabic characters with code points from the legacy Arabic Presentation Forms A & B Unicode blocks, which contain a set of these glyph combinations to support older systems and applications that can’t select them during rendering. The Unicode data files provide the information necessary to map from a base Arabic character to one of these presentation forms based on the desired joins and ligatures.
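You can reconstruct a rough version of that mapping from Python's own copy of the Unicode data. The sketch below is not FriBidi's implementation, just my illustration of the idea: it walks the two presentation form blocks and reads each character's compatibility decomposition, which names the positional form and the base letter. The names build_form_table and FORMS are hypothetical, and ligatures are deliberately skipped.

```python
import unicodedata

# Code point ranges of Arabic Presentation Forms-A and Forms-B.
PRESENTATION_RANGES = [(0xFB50, 0xFDFF), (0xFE70, 0xFEFF)]
POSITIONAL_TAGS = {"<isolated>", "<initial>", "<medial>", "<final>"}


def build_form_table():
    """Map base letter -> {position: presentation form character}."""
    table = {}
    for start, end in PRESENTATION_RANGES:
        for cp in range(start, end + 1):
            # e.g. U+FE91 decomposes to "<initial> 0628".
            parts = unicodedata.decomposition(chr(cp)).split()
            # Keep single-letter forms only; ligatures decompose to
            # several base letters, unassigned points to nothing.
            if len(parts) == 2 and parts[0] in POSITIONAL_TAGS:
                base = chr(int(parts[1], 16))
                position = parts[0].strip("<>")
                table.setdefault(base, {})[position] = chr(cp)
    return table


FORMS = build_form_table()

beh = FORMS["\u0628"]   # dual-joining: four positional forms
alef = FORMS["\u0627"]  # right-joining: isolated and final only

for position, form in sorted(beh.items()):
    print(f"beh {position:<8} U+{ord(form):04X}")
```

Picking the right entry for each letter still requires knowing how its neighbors join, which is the part a shaper like FriBidi's actually does.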

This works pretty well for Arabic, the language. Problems arise with the numerous other languages that use the Arabic script but have joins that aren’t covered by these presentation form blocks. As a result, FriBidi leaves the original characters as is with these, and the final text doesn’t have the appropriate joins.
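The gap is easy to probe with the same Unicode data. This is a sketch under my own assumptions: the helper has_presentation_forms is hypothetical, and the three letters are ones I picked to illustrate the point, not cases taken from FriBidi. It simply asks whether any single-letter presentation form decomposes back to a given base letter.

```python
import unicodedata


def has_presentation_forms(letter):
    """True if any single-letter presentation form maps back to `letter`."""
    target = f"{ord(letter):04X}"
    # Arabic Presentation Forms-A and Forms-B.
    for start, end in ((0xFB50, 0xFDFF), (0xFE70, 0xFEFF)):
        for cp in range(start, end + 1):
            parts = unicodedata.decomposition(chr(cp)).split()
            if len(parts) == 2 and parts[1] == target:
                return True
    return False


# Illustrative picks: Arabic beh, Persian peh, and a letter from the
# Arabic Supplement block (U+0750).
for letter in ("\u0628", "\u067e", "\u0750"):
    name = unicodedata.name(letter, "<unnamed>")
    print(f"U+{ord(letter):04X} {name}: {has_presentation_forms(letter)}")
```

A letter that comes back False has nothing to be substituted with, so a presentation-form shaper can only pass it through unjoined.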

There are a few libraries with Python bindings that can handle this properly when it comes to rendering to the screen. But for integrating into a ReportLab workflow, I’m coming up empty-handed on the Python end. The most promising lead so far is IBM’s ICU Project, partly available via PyICU. We’re already using this to provide Unicode-compliant line wrapping (ever tried line-wrapping space-deficient Thai?), but its glyph selection-related portions aren’t yet available via the Python bindings.

I’d be interested in hearing how others have dealt with this. Given Python’s age, I can’t imagine we’re the first to try using it with these trickier scripts.
