When the text is not really text: legacy fonts in non-English documents


A student who is blind opens her Hindi textbook with a screen reader. The page is full of Devanagari, neatly typeset, exactly as her classmates see it. Her screen reader reads out a stream of unrelated English letters, or says nothing at all. The teacher who prepared that file looks at the same page and sees perfect Hindi. Both of them are right about what they see. The distance between those two truths is the subject of this entry, and it is the reason a large share of the Global South's documents never become accessible, no matter how much care goes into the rest of the work.

What a legacy font actually is

A PDF does not store letters. It stores instructions to paint shapes, called glyphs, and a separate table that says which character each glyph stands for. For English this table is almost always correct, so the text you see and the text a computer can read are the same.

A legacy font breaks that link on purpose. Before Unicode was widely supported, and to let people type Indian scripts on ordinary English keyboards, font makers redrew the letters. In a legacy Devanagari font such as Kruti Dev, the key that types the Latin letter d paints the shape क; the key for e paints म; other keys, letters and punctuation alike, paint vowel signs and other marks. The bytes saved in the file are Roman letters. Only the picture on the screen is Hindi.¹,²

So when software tries to read the text back (a screen reader, a search box, a copy-paste, a translation tool), it reads the bytes, which are Roman. The Hindi word भूमिका ("introduction") is stored as Hkwfedk and read aloud as those letters. The page looks like a language; the text underneath is not that language at all.²,³
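To make the gap concrete, here is a minimal Python sketch that lists the code points text-reading software receives for the stored string, next to those of the word the page appears to show. The helper name `describe` is hypothetical; only the standard library is used.

```python
import unicodedata
from typing import List


def describe(text: str) -> List[str]:
    """Return 'U+XXXX NAME' for every character, the way software sees the text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}" for ch in text]


stored = "Hkwfedk"   # what the legacy-font file actually contains
intended = "भूमिका"   # what the page appears to say

print("Stored in the file:")
for line in describe(stored):
    print("  ", line)

print("Shown on the page:")
for line in describe(intended):
    print("  ", line)
```

Every entry in the first list is a LATIN letter and every entry in the second is a DEVANAGARI letter or vowel sign. A screen reader only ever receives the first list.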

Why no one notices

The picture is always correct, because the viewer paints each glyph by its code in the font and never consults the character table. So the document looks finished to everyone who can see it: the author, the proofreader, the official who signs it off. Nothing on screen signals that anything is wrong. The failure only appears the moment someone tries to read the text rather than look at it, and the readers that do that (screen reader users and search engines) are exactly the ones who were not in the room when the file was made.²,⁴

This is what makes it so quietly damaging. It is not a visible defect that gets caught and fixed. It is an invisible one that ships, again and again, in textbooks and exam papers and government circulars, and is discovered only by the reader it shuts out.

Why the accessibility standards do not catch it

It would be reasonable to assume the accessibility checkers would flag this. They do not. The PDF/UA standard requires only that a font carry a character map (a ToUnicode table); it does not require that the map point to the right characters. The reference validator, veraPDF, implements that rule literally as "a character map is present," and the only values it rejects are a few technically illegal ones.⁵ A legacy font carries a perfectly well-formed map that happens to say its glyphs are Roman letters. It passes. So a document can earn a clean accessibility report and still be unreadable in its own language.⁵,⁶
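The difference between "a map is present" and "the map is right" can be sketched in a few lines of Python. This is not veraPDF's code: both functions are hypothetical stand-ins, the ToUnicode table is modelled as a plain dictionary from glyph code to string, and the glyph codes are made-up illustrative values. The three rejected values are the ones named in endnote 5.

```python
from typing import Dict, Optional

# Values the presence-style rule rejects, per the validator notes cited above.
REJECTED = {0x0000, 0xFEFF, 0xFFFE}


def passes_presence_check(to_unicode: Optional[Dict[int, str]]) -> bool:
    """Hypothetical stand-in for a 'map exists and holds no illegal values' rule."""
    if not to_unicode:
        return False
    return all(
        ord(ch) not in REJECTED
        for value in to_unicode.values()
        for ch in value
    )


def maps_to_devanagari(to_unicode: Dict[int, str]) -> bool:
    """The question the rule never asks: do the targets lie in the expected script?"""
    targets = [ch for value in to_unicode.values() for ch in value]
    return bool(targets) and all(0x0900 <= ord(ch) <= 0x097F for ch in targets)


# Glyph code -> character the font claims the glyph stands for (illustrative only).
legacy_font = {0x64: "d", 0x65: "e", 0x6B: "k"}                 # drawn as क, म, ा
unicode_font = {0x01: "\u0915", 0x02: "\u092E", 0x03: "\u093E"}  # mapped to क, म, ा

for name, table in (("legacy", legacy_font), ("unicode", unicode_font)):
    print(name, passes_presence_check(table), maps_to_devanagari(table))
```

Both fonts pass the presence check; only the Unicode font passes the script check. Requiring every target to be Devanagari is a deliberate simplification, since a real font also maps digits, punctuation and spaces.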

This is the heart of why non-English content is missing from the accessibility conversation. The tools were built where the foundation was already sound, so they check the floors above it and never the floor itself.

How widespread this is

This is not a rare edge case. Legacy fonts such as Kruti Dev, DevLys, Shree, Chanakya and their equivalents were the standard way to publish Indian-language material through the 1990s and 2000s, and they remain in heavy use in government and educational publishing today.¹,² The example that prompted this work was a current National Council of Educational Research and Training (NCERT) Hindi textbook: 124,418 characters of text, not one of them in the Devanagari range. A 10th-standard Kannada mathematics textbook was the same. Two Tamil Nadu state textbooks, by contrast, were clean Unicode. The problem is large, it is uneven, and it sits squarely on the documents students are handed.

How to tell whether a document has this problem

You do not need special software to check a single file. Open the PDF, select a line of the regional-language text, copy it, and paste it into a plain text box.

If the pasted text is the same script you see on the page, the encoding is sound.
If the pasted text is a string of unrelated Roman letters and symbols (something like Hkwfedk ge lHkh), the document uses a legacy font and its text is not real.
If nothing pastes at all, or you cannot select text, the page is probably a scan, which is a different problem with a different fix (it needs optical character recognition, not re-encoding).

At scale, the same judgment can be made automatically: extract the text and measure how much of it falls in the script the document is supposed to be in. Real Hindi is mostly Devanagari characters; legacy Hindi is almost entirely Latin. The gap between the two is wide and unambiguous.⁷
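A minimal sketch of that measurement in Python follows. It assumes the text has already been extracted from the PDF by some other tool; the function names, the three labels and the 0.5 threshold are illustrative choices, not values from the cited study. The code point ranges are the standard Unicode blocks for each script.

```python
from typing import Optional

# Unicode block ranges for the major Indic scripts.
SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali": (0x0980, 0x09FF),
    "Gurmukhi": (0x0A00, 0x0A7F),
    "Gujarati": (0x0A80, 0x0AFF),
    "Odia": (0x0B00, 0x0B7F),
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
    "Kannada": (0x0C80, 0x0CFF),
    "Malayalam": (0x0D00, 0x0D7F),
}


def script_share(text: str, script: str) -> Optional[float]:
    """Fraction of counted characters that fall in the expected script's block.

    Letters of any script are counted, plus every character in the target block
    (so vowel signs, which are marks rather than letters, are included).
    Returns None when there is nothing to count.
    """
    lo, hi = SCRIPT_RANGES[script]
    counted = [ch for ch in text if ch.isalpha() or lo <= ord(ch) <= hi]
    if not counted:
        return None
    return sum(lo <= ord(ch) <= hi for ch in counted) / len(counted)


def classify(text: str, script: str, threshold: float = 0.5) -> str:
    """Label extracted text as 'unicode', 'legacy', or 'no-text' (possibly a scan)."""
    share = script_share(text, script)
    if share is None:
        return "no-text"
    return "unicode" if share >= threshold else "legacy"


print(classify("Hkwfedk ge lHkh", "Devanagari"))   # legacy
print(classify("भूमिका", "Devanagari"))             # unicode
```

Because the two cases sit near 0 and near 1, the exact threshold matters little for monolingual pages. Documents with a lot of embedded English would need a lower threshold or a per-page check.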

The legacy font families, by script

If you maintain or audit documents, these are the font names that signal legacy, non-Unicode encoding. The list is not exhaustive, and several vendor families (Akruti, Shree-Lipi, ISM, GIST) span many scripts.¹,⁸

Devanagari (Hindi, Marathi, Sanskrit): Kruti Dev, DevLys, Shree-Dev, Chanakya, Walkman-Chanakya, Shusha, Shivaji, APS, Yogesh.
Tamil: TSCII fonts, Bamini, TAB, TAM, ELCOT, Shree-Tam, Amudham, Vanavil.
Telugu: Anu, Hemalatha, Eenadu, Shree-Tel.
Kannada: Nudi (legacy), Baraha (ASCII mode), Shree-Kan, BRH.
Malayalam: ML-TT family (ML-TTKarthika, ML-TTRevathi), Manorama (MM), Shree-Mal.
Bengali and Assamese: Bijoy with SutonnyMJ, Shree-Ban, Boishakhi.
Gujarati: LMG, Shree-Guj, Saral, Terafont.
Gurmukhi (Punjabi): AnmolLipi, Asees, Joy, Satluj, GurbaniLipi.
Odia: Akruti Sarala, Sarala, Shree-Lipi, Kalinga.
Urdu (and Sindhi, Pashto): InPage encoding with Noori Nastaliq and Jameel Noori Nastaleeq.
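For auditing many files, the font names embedded in a PDF can be screened against a list like the one above. The sketch below assumes the font names have already been read out of the file by some PDF tool. It deliberately uses only a handful of the more distinctive names, because short names such as Joy, TAB or Anu would match unrelated fonts as substrings. The function names are hypothetical.

```python
import re

# A small, distinctive subset of the families listed above, in normalized form.
LEGACY_MARKERS = (
    "krutidev", "devlys", "chanakya", "shusha", "bamini",
    "sutonnymj", "anmollipi", "mlttkarthika", "mlttrevathi",
)


def normalize_font_name(name: str) -> str:
    """Reduce a PDF font name to lowercase letters and digits.

    Strips a leading '/' and the six-capital-letter subset tag that PDF
    producers prepend to subsetted fonts (for example 'ABCDEF+KrutiDev010').
    """
    name = name.lstrip("/")
    name = re.sub(r"^[A-Z]{6}\+", "", name)
    return re.sub(r"[^a-z0-9]", "", name.lower())


def looks_legacy(name: str) -> bool:
    """True when the normalized name contains a known legacy family marker."""
    normalized = normalize_font_name(name)
    return any(marker in normalized for marker in LEGACY_MARKERS)


for font in ("ABCDEF+KrutiDev010", "XYZABC+ML-TTKarthika", "NotoSansDevanagari-Regular"):
    print(font, looks_legacy(font))
```

A name match is only a hint. The copy-and-paste test, or a measurement of the extracted text, remains the evidence that the encoding is actually legacy.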

How to fix it

A legacy-font document cannot be repaired by tagging it, adding alternative text, or running an accessibility checker over it. None of that touches the underlying bytes. The text itself has to be re-made in Unicode. There are three honest paths, in order of preference.

Convert the legacy text to Unicode. If the document has real, editable text (not a scan), the right fix is to run it through a legacy-to-Unicode converter for its exact font, then set it in a Unicode font. Mature open-source converters exist for most scripts: GUCA for Gurmukhi, ascii2unicode for Kannada, the OdiaWikimedia converter for Odia, Android-TamilUtil for Tamil, and many tools for Devanagari.⁸,⁹ The caveats matter and must be checked: the converter has to know the exact font variant; it has to handle conjuncts and the reordering of vowel signs correctly; embedded English can be distorted by careless tools; and the output must always be proofread, because conversion is not guaranteed to be perfect.⁹
Re-typeset in Unicode. If the original source file is available, re-keying or re-flowing the content in a Unicode font from the start is the only path that guarantees correct encoding, reading order, and language tagging together. It costs the most effort and is worth it for documents that recur, such as templates and series textbooks.
Optical character recognition, as a fallback. If the document is a scan, or the legacy encoding cannot be reliably converted, OCR rebuilds a genuine Unicode text layer from the picture. Tesseract with the relevant Indic models handles many scripts. Quoted OCR accuracy is measured per character, so word-level accuracy is lower and the output must be proofread. The result should be produced as an image-over-text PDF so the original page is preserved.¹⁰
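The first path, conversion, comes down to two operations: replacing each legacy code with its Unicode character, and moving the short-i vowel sign, which legacy fonts type before the consonant, to its logical place after it. The Python sketch below shows both on a toy scale. Its eight-entry table is an assumption inferred from the example strings in this entry (Kruti Dev style); a real converter needs the full table for the exact font variant, plus conjunct handling that this sketch omits.

```python
# Illustrative subset only, inferred from the examples "Hkwfedk ge lHkh".
MAPPING = {
    "Hk": "\u092D",  # भ (two legacy codes form one letter)
    "d": "\u0915",   # क
    "e": "\u092E",   # म
    "g": "\u0939",   # ह
    "l": "\u0938",   # स
    "k": "\u093E",   # ा
    "h": "\u0940",   # ी
    "w": "\u0942",   # ू
}
PRE_BASE_I = "f"           # legacy code for ि, stored BEFORE its consonant
VOWEL_SIGN_I = "\u093F"    # Unicode stores ि AFTER its consonant

# Try longer codes first so "Hk" wins over a bare "k".
KEYS = sorted(MAPPING, key=len, reverse=True)


def convert(legacy: str) -> str:
    """Convert a legacy-encoded string to Unicode using the toy table above."""
    out = []
    pending_i = False
    i = 0
    while i < len(legacy):
        if legacy[i] == PRE_BASE_I:
            pending_i = True          # hold the vowel sign until the next letter
            i += 1
            continue
        for key in KEYS:
            if legacy.startswith(key, i):
                out.append(MAPPING[key])
                i += len(key)
                break
        else:
            out.append(legacy[i])     # unknown code (space, digit): pass through
            i += 1
            continue
        if pending_i:
            out.append(VOWEL_SIGN_I)  # reorder: emit ि after the consonant
            pending_i = False
    return "".join(out)


print(convert("Hkwfedk ge lHkh"))
```

Note what the sketch would do to embedded English: every d, e or k would be turned into Devanagari. That is the distortion the caveats warn about, and one reason converted output always needs proofreading.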

Whichever path is used, set the document's language correctly (the /Lang value, for the document and for any passage in a different language) so a screen reader applies the right pronunciation. Then choose a Unicode-compliant font for the script. The Noto Sans family is the safe cross-platform default (Noto Sans Devanagari, Tamil, Bengali, Kannada, Telugu, Malayalam, Gujarati, Gurmukhi, and Oriya, the family's name for Odia); Lohit is another open-source option; Mangal and Nirmala UI are the common Windows fonts.¹¹,¹²

"Language ready" means four things, not one

Choosing a Unicode font does not by itself fix a legacy document; a Unicode font wrapped around legacy Roman bytes is still gibberish underneath. A document in a non-English language is only truly readable to assistive technology when all four of these hold together:

The text is correct Unicode (no legacy mapping; the character table points to the real characters).
The reading order is logical, so the text is stored in the order it is spoken, not the order it happens to be drawn.
The language is declared, for the document and for each passage that switches language.
The font is Unicode-compliant for the script.

Miss any one and the document fails for its reader, even if the other three are perfect.
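For an audit script, the four conditions can be recorded as a simple all-or-nothing checklist. The sketch below is only a bookkeeping structure with hypothetical names; how each flag is determined is outside its scope.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LanguageReadiness:
    """The four conditions above, one flag each."""
    text_is_unicode: bool
    reading_order_is_logical: bool
    language_is_declared: bool
    font_is_unicode_compliant: bool

    def failures(self) -> List[str]:
        """Names of the conditions that do not hold."""
        return [name for name, ok in vars(self).items() if not ok]

    @property
    def ready(self) -> bool:
        """True only when all four conditions hold together."""
        return not self.failures()


# A legacy document set in a Unicode font with /Lang declared still fails.
report = LanguageReadiness(
    text_is_unicode=False,
    reading_order_is_logical=True,
    language_is_declared=True,
    font_is_unicode_compliant=True,
)
print(report.ready, report.failures())
```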

Why this is the foundation

Everything else in document accessibility (the tags, the headings, the alternative text, the reading order a screen reader follows) assumes that the text underneath is real, machine-readable text. For English that assumption holds by default, so the field treats it as settled and never mentions it. For most of the Global South's documents it does not hold. The text is not real, so there is nothing for the rest of accessibility to stand on.

That is why this is not a small technical matter about fonts. It is the base of all content in languages other than English. Until it is fixed, no amount of careful remediation above it can reach the reader. Fixing it is the first thing, and for a very large number of students it is the difference between a textbook that exists for them and one that does not.

Endnotes

1. Kruti Dev. Wikipedia, 2025. Legacy non-Unicode Devanagari font mapping glyphs onto ASCII positions; copying or sharing text without the font produces unreadable output; search and indexing are limited. https://en.wikipedia.org/wiki/Kruti_Dev
2. "Why Extracting Hindi Text from PDFs Is So Much Harder Than English." The Digital Orientalist, 2025. Legacy fonts (Kruti Dev, Shivaji, Chanakya) store text as Roman bytes; copying yields gibberish; encodings are ad hoc and per-publisher. https://digitalorientalist.com/2025/12/02/why-extracting-hindi-text-from-pdfs-is-so-much-harder-than-english-and-how-you-can-do-it/
3. PubCom, "Fonts, Unicode, OpenType, and Accessibility," 2013. A glyph a human reads as one character is read by software as a different character when the encoding does not match. https://www.pubcom.com/blog/2013_12-03/unicode-accessibility.html
4. PDF Association, "Glossary of accessibility terminology in PDF," 2023. Correct Unicode mappings are what allow assistive technology to interpret content; without them the content is not machine-readable. https://pdfa.org/glossary-of-accessibility-terminology-in-pdf/
5. veraPDF, PDF/UA Part 1 validation rules, 2024. The ToUnicode requirement (clause 7.21.7-1) is implemented as a presence check; only the values U+0000, U+FEFF and U+FFFE are rejected. https://github.com/veraPDF/veraPDF-validation-profiles/wiki/PDFUA-Part-1-rules
6. PDFix, "PDF accessibility validators compared," 2025. Font and encoding checks are implemented inconsistently across veraPDF, PAC, Adobe and CommonLook, with over half of test files producing conflicting results. https://pdfix.net/pdf-accessibility-validators/
7. Saini and Lehal, "Automatic Bilingual Legacy-Fonts Identification and Conversion System," Research in Computing Science vol. 86, 2014. Legacy Indic encodings are detected statistically from the distribution of characters in extracted text, reaching high accuracy across many encodings. https://www.rcs.cic.ipn.mx/2014_86/Automatic%20Bilingual%20Legacy-Fonts%20Identification%20and%20Conversion%20System.pdf
8. Centre for Internet and Society (India), "Converting from non-Unicode (Nudi, Baraha) font encoding to Unicode Kannada," 2014. Legacy text stored as glyph codes makes search, sort and text-to-speech impossible; conversion requires handling dependent-vowel reordering. https://cis-india.org/openness/blog-old/converting-from-non-unicode-nudi-baraha-font-encoding-to-unicode-kannada
9. Hindi font converter guidance on conversion accuracy, 2024 to 2025. Reliable conversion requires more than 180 character mappings (conjuncts, vowel signs, anusvara, visarga, halant, numerals), confirmation of the exact font variant, and preservation of embedded English; output requires proofreading. https://hindifontconverter.com/
10. University of Illinois Library, "OCR Best Practices," 2024. Quoted accuracy figures are character-level, not word-level; small or low-contrast fonts reduce accuracy; image-over-text PDFs preserve the page image with a searchable text layer. https://guides.library.illinois.edu/OCR/bestpractices
11. Pramukh IME, "List of Unicode fonts for Indian languages," 2024. Per-script Unicode font recommendations (Noto, Lohit, Microsoft defaults). https://www.pramukhime.com/blog/indic-unicode-fonts
12. Nirmala UI. Wikipedia, 2024. A single Windows font family covering most Indic scripts, designed for user-interface readability. https://en.wikipedia.org/wiki/Nirmala_UI