An interactive introduction to the terrific experience of rendering Arabic typography and its technical debt

A frontend developer investigates a CSS rendering bug in Arabic text and discovers the underlying technical debt in Arabic typography on the web, tracing the problem back to historical manuscript traditions and modern font engineering challenges.

This post was discussed on Lobsters: https://lobste.rs/s/ptkd7x/interactive_introduction_terrific

Once upon a time, a frontend ticket landed in my queue which was not properly mine, but the only other Arabic reader on the team was on leave. It went roughly as follows: a block of mixed-content Arabic prose on the customer-facing dashboard was rendering with a ragged left edge (the rag falls on the left in Arabic, since the lines set out from the right margin; the ticket said "ragged right") when the design team had explicitly specified justified text. Attached were three screenshots from three browsers and a polite note from the product manager observing that the Latin-script version of the same block looked, I quote, "fine."

In the same six months I had closed three other tickets against the same product, each of which had presented to its filer as the only bug. A customer's name had appeared with its letters unjoined on a printed agreement, the way a sign-painter would have laid them out in 1962, because the PDF library on the receipt server pre-dated the existence of a shaping engine in its language runtime.
A search index had been returning empty for accounts the customer service team could see in the database, because a 2017 import had encoded twelve thousand names using fossil Unicode codepoints from 1991 instead of regular ones from 1995, and the index, very reasonably, treated the two encodings as different strings. So that ragged-left ticket was the smallest of the four; however, it sat on top of the same iceberg and pointed at the same thing.

Here is the disagreement, reproduced live. I used random text; the original had more spacing, and I'm too lazy to pick words to maximize the ragging and spacing.

inside the words, never the spaces between them. It renders in your browser only because I placed every elongation by hand, a confession I will expand on below.

On the left, what production ships. Tick the box to apply the one tool CSS offers, text-align: justify.

For these demonstrations this site ships its first webfont ever: Amiri, self-hosted, a hundred and fifty kilobytes of one man's unpaid evenings, redistributed under the OFL. That this is what it takes to show you something your operating system cannot do on its own is, I want to be clear, part of the argument. I think it is a delightful hundred and fifty kilobytes.

It did look fine. I spent about half an hour with it: I walked the rendered DOM, I tried text-align: justify in many different combinations of font-family and direction declarations, and at the end of the exercise I wrote a reply explaining, more or less honestly, that the problem was not a bug in our stylesheet but the state of Arabic typography on the web.

The reply and the closure of the ticket took half an hour or so. The reasons behind it took five hundred years to pile up, and they involve a twice-mutilated vizier, a Qurʾān that vanished for four centuries, a Beirut newspaperman with a deadline, and an Egyptian physician who taught himself font engineering for fun, or that is what I imagine about him.
Walking through these ended up being the most enjoyable couple of weeks in that job, and I want to go through it here too.

What the scribes solved

The history deserves recording because most people outside the small world of Arabic font engineering don't know it, and it is wonderful.

Classical Arabic typography, by which I mean the manuscript tradition that the early printers of Istanbul and Bulaq spent their careers chasing, justifies a line of text without stretching the spaces between words at all. Stretched spaces are the Latin convention, and in Arabic they produce an effect the scribes would have found simply ugly. Instead the scribe extends the letterforms themselves along the baseline, using what is called taṭwīl or, in the modern technical vocabulary, kashida: the connecting strokes between certain pairs of letters can be lengthened, sometimes lavishly, to carry a line out to the margin. A well-set page of Naskh from the seventeenth century has every line flush at both margins, and the result is the dense, regular weave that anyone who has spent time with a good manuscript Qurʾān will recognise on sight.

And this was not improvisation but a system, with a paper trail. The system was written down by Ibn Muqla, Abbasid vizier and chief calligrapher, who served three caliphs in succession and was imprisoned by two of them; the third had his right hand amputated on a charge of treasonous correspondence. Ibn Muqla kept writing for the next several months by lashing a reed pen to the stump of his wrist, was rewarded for what he wrote by having his tongue cut out, and died in prison around the year 940. His body was buried three times in three different places, his daughter moving it after each interment to keep the grave out of police hands. The system he wrote down outlasted everybody who hurt him by a thousand years.
It is called al-khaṭṭ al-mansūb, the proportional script: every letterform measured in rhombic dots of the reed nib, every curve a defined arc of a defined circle, the alif a fixed number of dots high and everything else derived from the alif. Within that system the elongation is a drawn stroke with its own rules: which letter pairs accept it, how the curve swells and tapers, how many elongations a line may carry, where they may sit. The scribes also justified by choosing different shapes, because most letters have alternate forms of different widths, and a skilled hand selects among them as the margin approaches. Justification, in this tradition, is not a spacing problem but a shaping problem.

The tradition Ibn Muqla started did not stay with him; it was refined, in writing, by named human beings over the following six hundred years. Ibn al-Bawwāb in Baghdad, around the year 1022, smoothed out the proportions and produced the manuscript that defined Naskh for the rest of the millennium; a single Qurʾān in his hand survives in the Chester Beatty Library in Dublin, and you can date the Persian, Ottoman, and Mamluk traditions by how closely they follow it. Yāqūt al-Mustaʿṣimī, who survived the Mongol sack of Baghdad in 1258 by climbing a minaret and continuing to write, codified what later scholars called the Six Pens, the canonical hands of Naskh, Thuluth, Muḥaqqaq, Rayḥān, Tawqīʿ, and Riqāʿ, each with its own metrics, each with its own justification grammar.

Then the Persian scribes invented Nastaʿlīq in the fourteenth century, a hanging script that justifies by sloping the baseline downward at the end of each phrase, which is to ordinary justification roughly what a vertical garden is to a lawn. The Ottomans developed Dīwānī for the chancery and a tightly knotted Dīwānī Jalī for the sultanic seal, both of which fill space by interleaving letters at heights ordinary baselines never visit.
All of these are the same alphabet of twenty-eight letters; all of them have their own rules about which letters accept the kashida, which never do, and how the line breathes. Latin typesetting never needed any of this, because Latin letters do not hold hands. Arabic letters do, and the web, in 2026, looks at them holding hands and stretches the air between the words anyway.

So now you know what the mockup card at the top of the page was doing: it was faking a page of this manuscript tradition in HTML, every line carried to the measure by the strokes and not the spaces. The fakery, since I promised a confession, is U+0640 TATWEEL characters that I placed and sized by hand.

Four shapes for every letter

To understand why every machine since Gutenberg has wrestled this script and mostly lost, you need one structural fact: Arabic is cursive, always. There is no print-versus-handwriting distinction, no block letters. The letters connect in stone inscriptions, in manuscripts, in metal, on screens. Each letter therefore changes shape depending on its neighbours (an isolated form, an initial, a medial, a final), and six letters refuse to connect forward at all, which breaks words into joined clusters and gives the script its rhythm. The shapes are not costumes over some underlying "real" letter. The positional variation is the letter.

And the alphabet is bigger than Arabic the language. Persian extends it with four letters Arabic does not have (پ pe, چ che, ژ zhe, گ gaf) and uses two of the existing letters in subtly different forms (ی for the final yāʾ, ک for kaf). Urdu adds an aspirated do-chashmī he (ھ), a retroflex set (ٹ ڈ ڑ), and a hanging ye barree (ے), and writes most of its everyday text in Nastaʿlīq, which a Naskh-shaped font will produce as a phonetically correct but visually unrecognisable approximation. Sindhi has more again. Pashto, Kurdish, Uyghur, Kashmiri, and Punjabi each take the alphabet, add what their phonology requires, and ship.
Any font that calls itself "Arabic" without consulting the Persian and Urdu communities will produce, for hundreds of millions of readers in Iran and South Asia, text that is technically rendered but functionally wrong: the kaf has the wrong terminal, the heh fuses where it shouldn't, the digits are from the wrong belt. The Noto family ships separate fonts to cover these (NotoNaskhArabic, NotoNastaliqUrdu, NotoSansArabicUI), and OS font fallback chains usually get it right. Usually.

| stored codepoint | isolated | initial | medial | final |
|---|---|---|---|---|
| U+0639 ʿAYN | ع | عـ | ـعـ | ـع |
| U+0647 HEH | ه | هـ | ـهـ | ـه |

The arrangement we eventually settled on, after decades of wrong answers, is this: the encoding stores the abstract letter, and the font supplies the shapes. Unicode gives you one codepoint for ʿayn; the font carries the four positional glyphs; a shaping engine applies the OpenType features isol, init, medi, and fina, plus rlig for the ligatures the script requires, plus mark and mkmk for stacking the vowel signs, all at render time. An Arabic font is a small program. The text you store is its input, not its output. The word is performed fresh every time you look at it, like music from a score. The cleanest way to feel this is to assemble a word one letter at a time and watch every prior letter renegotiate its shape as the next one arrives.

The wrong answers are still in the standard, fossilised, and they make excellent souvenirs. Before shaping engines existed, the 8-bit code pages of the DOS and early Windows era encoded the shapes themselves: a separate character for initial ʿayn, medial ʿayn, and so on.
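
Unicode still carries per-shape codepoints of exactly this kind, so the difference between a stored letter and a stored shape can be poked at from Python's standard library. The sketch below is illustrative only: AYN_FORMS and fold_presentation_forms are names made up for this example, and using NFKC normalization as the fold is an assumption about what a search index could tolerate, not a description of how the bug in this story was fixed.

```python
import unicodedata

# The abstract letter that modern text is meant to store.
AYN = "\u0639"  # ARABIC LETTER AIN

# Legacy per-shape codepoints for the same letter (Arabic Presentation Forms-B).
AYN_FORMS = {
    "isolated": "\uFEC9",
    "final": "\uFECA",
    "initial": "\uFECB",
    "medial": "\uFECC",
}


def fold_presentation_forms(text: str) -> str:
    """Fold shape codepoints back to abstract letters.

    Assumption: NFKC is acceptable for the use case. It is a blunt tool and
    also rewrites other compatibility characters (ligatures, width variants).
    """
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    # Each legacy codepoint names its position and decomposes to U+0639.
    for position, glyph in AYN_FORMS.items():
        print(
            position,
            f"U+{ord(glyph):04X}",
            unicodedata.name(glyph),
            unicodedata.decomposition(glyph),
            fold_presentation_forms(glyph) == AYN,
        )

    # The same four-letter name stored two ways: shapes versus letters.
    legacy = "\uFEE3\uFEA4\uFEE4\uFEAA"  # meem initial, hah medial, meem medial, dal final
    modern = "\u0645\u062D\u0645\u062F"  # meem, hah, meem, dal
    print(legacy == modern)                           # False
    print(fold_presentation_forms(legacy) == modern)  # True
```

Folding both the indexed text and the query through the same function is the design choice that matters here; which normalization form to use is a separate decision that depends on what else lives in the data.
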
Unicode, which promised round-trip compatibility with everything else, had to swallow those sets whole, and they live on between U+FB50 and U+FEFF in the two Arabic Presentation Forms blocks (Forms-A at U+FB50–U+FDFF, Forms-B at U+FE70–U+FEFF): several hundred codepoints that no new document should ever contain and that PDF text extractors merrily emit to this day, which is one of the reasons searching an Arabic PDF so often fails in silence. The haystack is encoded as shapes and your needle is encoded as letters.

My favourite resident of those blocks, and one of my favourite characters in all of Unicode, is U+FDFD, ﷽: the four-word invocation, bismillāh ar-raḥmān ar-raḥīm, as a single codepoint. A monument from the era when rendering was baked into the encoding because nobody trusted the renderer to do anything, preserved forever, like a fly in amber that recites.

This bites because the two encodings render identically and compare differently. The customer search bug I mentioned at the top of this article was, specifically, this:

| Name as rendered | Encoding in storage | Account |
|---|---|---|
| محمد علي | modern Unicode | EGP-9341-0021 |
| ﻣﺤﻤﺪ ﻋﻠﻲ | presentation forms | EGP-2014-7732 |
| سارة أحمد | modern Unicode | EGP-9341-0044 |
| ﺳﺎﺭﺓ ﺃﺣﻤﺪ | presentation forms | EGP-2014-8810 |

And if you want to know what the world looks like when software skips all of this (the shaping engine, the bidi algorithm, the whole apparatus), you do not have to imagine it, because an enormous amount of software still skips all of it. A common workaround, arabic_reshaper plus python-bidi, fixes it by pre-baking the shaped forms into the string using that fossil block described above.

Three sets of digits, one continuous belt

The numerals deserve their own room. Every Arabic-rendering project I have worked on has tripped on them, and most of those projects invented a private vocabulary for what went wrong instead of asking why. Most readers of this article have only ever met one set of digits and are about to meet three.
The glyphs the world calls "Arabic numerals", 0 through 9, are not in fact what most Arabic readers use day to day. Egypt, Sudan, the Levant, Iraq, and the Gulf use what Unicode files under ARABIC-INDIC DIGITS (٠١٢٣٤٥٦٧٨٩, U+0660–U+0669), which look nothing like the Latin glyphs and ship in any serious Arabic font as a separate set. The Maghreb (Morocco, Algeria, Tunisia, often Libya) uses the Latin glyphs and has done so since the colonial period; an Arabic newspaper in Casablanca and an Arabic newspaper in Cairo will print today's date in two visually different scripts and consider it unremarkable. Iran, Afghanistan, and Pakistan use a third set, the EXTENDED ARABIC-INDIC DIGITS (۰۱۲۳۴۵۶۷۸۹, U+06F0–U+06F9), four of whose glyphs (4, 5, 6, 7) differ visibly from the Arabic-Indic set despite encoding the same numbers. Any banking platform that operates from Rabat to Karachi will, at some point, render the same balance three ways.

The rendering choice is the easy half. The bidirectional behaviour is where the platform starts to creak, because digits are not strong characters in the Unicode bidirectional algorithm. They are weak, neither strongly left-to-right like a Latin letter nor strongly right-to-left like an Arabic one, and what they do depends on whoever stood next to them most recently. The relevant rule, W2 of UAX 9, reclassifies a European digit as an ARABIC NUMBER if the nearest preceding strong character is an Arabic letter, and leaves it a EUROPEAN NUMBER otherwise. Both render their internal digits left-to-right, which is correct: numbers everywhere on Earth are read most-significant-first. But the punctuation between digits behaves differently across the two classes. A hyphen between European numbers stays glued. A hyphen between Arabic numbers floats neutral and gets reclassified again by the rules for neutrals, which look at the strong context, which is right-to-left, and the two number runs swap places around the hyphen.
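
Both halves of that can be checked against the Unicode character database that ships with Python. What follows is a rough sketch under stated assumptions, not a bidi implementation: to_digit_set and resolve_w2 are hypothetical helper names, and resolve_w2 applies only rule W2 to a single paragraph, ignoring the explicit embedding controls, W1, and every later rule of UAX 9.

```python
import unicodedata

# The three decimal digit sets discussed above, indexed by numeric value.
DIGIT_SETS = {
    "european": "0123456789",
    "arabic_indic": "".join(chr(0x0660 + i) for i in range(10)),
    "extended_arabic_indic": "".join(chr(0x06F0 + i) for i in range(10)),
}


def to_digit_set(text: str, target: str) -> str:
    """Re-spell every decimal digit in text using the target digit set."""
    digits = DIGIT_SETS[target]
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None for non-digits
        out.append(ch if value is None else digits[value])
    return "".join(out)


def resolve_w2(text: str) -> list[str]:
    """Bidi class of each character after applying only rule W2.

    Simplification (an assumption of this sketch): the paragraph start is
    treated as having no strong type, and no other rule is applied.
    """
    classes = []
    last_strong = None  # nearest preceding strong class: L, R or AL
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls in ("L", "R", "AL"):
            last_strong = cls
        elif cls == "EN" and last_strong == "AL":
            cls = "AN"  # W2: European number after an Arabic letter
        classes.append(cls)
    return classes


if __name__ == "__main__":
    phone = "010-1234-5678"
    for name in DIGIT_SETS:
        print(name, to_digit_set(phone, name))

    # Raw classes: ASCII digit, Arabic-Indic digit, extended digit, hyphen.
    print([unicodedata.bidirectional(c) for c in "0\u0660\u06F0-"])

    # The same digits after a Latin letter and after an Arabic letter.
    print(resolve_w2("a 12-34"))
    print(resolve_w2("\u0639 12-34"))
```

The digit conversion is lossless in both directions because all three sets carry the same decimal values; only the bidi class, and therefore the layout, changes with context.
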
That is how a phone number stored as "010-1234-5678" arrives on screen as "5678-1234-010", per spec, in every browser, identically wrong. ‎ or