The gap: why documents in most of the world's languages are left out of accessibility

Most of what has been written about making documents accessible (the standards, the checking tools, the training, the certifications) was built in English, for English and other languages written in the Latin alphabet. That work is good, and much of it carries over. But it rests on an assumption so basic that it is almost never stated: that the text in a document is real, machine-readable text. For English that assumption holds by default. For a very large share of the documents produced in the Global South, it does not, and because the assumption is never stated, its failure is never checked.

What "the gap" actually is

There are two halves to it.

The first is a knowledge gap. Search for guidance on accessible PDFs and you will find detailed material on tagging, headings, alternative text, reading order, and colour contrast, almost all of it assuming Latin-script English. Search for guidance on what to do when a Hindi, Tamil, or Bengali document's text comes out as gibberish, and you will find scattered font-conversion forums, a few academic papers, and almost nothing that joins the problem to accessibility and addresses a practitioner or a policymaker.¹² The expertise exists in fragments; it has not been gathered.

The second is a foundation gap in the documents themselves. A large part of regional-language publishing was produced, and continues to be produced, with legacy fonts that store text as Roman bytes and only paint it as the local script, or as scans with no real text layer at all.³² Either way the text is not machine-readable. So the work of accessibility, all the careful tagging and structuring, has nothing underneath it to stand on. You can do everything right above the text layer and the document still reads as nothing, or as nonsense, to a screen reader.
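One rough way to see the foundation gap in practice is to look at the text a document gives up when it is extracted and ask how much of it is actually in the expected script. The sketch below is a minimal illustration, not a complete detector: the function name `script_share` is our own, the input is assumed to be text already extracted from the file by some other tool, and the Latin string is only an illustrative stand-in for the kind of Roman bytes a legacy font stores.

```python
import unicodedata


def script_share(text: str, script_prefix: str) -> float:
    """Return the fraction of letters and marks that belong to one script.

    A character counts toward the script when its Unicode name starts with
    script_prefix (for example "DEVANAGARI"). Spaces, digits, and
    punctuation are ignored.
    """
    # Keep letters (category L*) and combining marks (category M*),
    # because vowel signs in Indic scripts are marks, not letters.
    relevant = [ch for ch in text if unicodedata.category(ch)[0] in "LM"]
    if not relevant:
        # Nothing to measure, e.g. a scan with no text layer at all.
        return 0.0
    hits = sum(
        1 for ch in relevant
        if unicodedata.name(ch, "").startswith(script_prefix)
    )
    return hits / len(relevant)


# Real Unicode Hindi text: every letter and mark is Devanagari.
unicode_text = "हिंदी"
# Illustrative stand-in for what a legacy font might store instead.
legacy_text = "fganh"

print(script_share(unicode_text, "DEVANAGARI"))  # 1.0
print(script_share(legacy_text, "DEVANAGARI"))   # 0.0
print(script_share("", "DEVANAGARI"))            # 0.0
```

A low share is a signal to investigate, not proof of a defect: a document that legitimately mixes English with the regional language will score somewhere in between, so any threshold would have to be chosen for the material at hand.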

Why it stays invisible

A document with this problem looks completely finished. The page renders perfectly to anyone who can see it, because the viewer paints the shapes and never consults the underlying text.³ The accessibility checkers do not catch it either: the PDF/UA standard and its reference validator require only that a font carry a character map, not that the map be correct, so a document whose text decodes to gibberish can still earn a clean report.⁴⁵ The result is that the failure is discovered only by the person it excludes, the screen reader user, who was not present when the file was made or signed off. Nothing in the ordinary workflow surfaces it.
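The difference between checking that a character map is present and checking that it is correct can be shown with a toy model. The sketch below is not how any real validator is implemented: the font dictionaries, the `to_unicode` key, the glyph codes, and the 0.8 threshold are all assumptions made for illustration. It only shows that two fonts can both pass a presence check while one of them decodes to text in the wrong script.

```python
# Code points of the Devanagari block, U+0900 to U+097F.
DEVANAGARI = range(0x0900, 0x0980)


def has_character_map(font: dict) -> bool:
    """Presence check: passes as long as a non-empty map is attached."""
    return bool(font.get("to_unicode"))


def decode(font: dict, glyph_codes: list) -> str:
    """Turn glyph codes into text using whatever the map says."""
    cmap = font["to_unicode"]
    # Unmapped codes become the Unicode replacement character.
    return "".join(cmap.get(code, "\ufffd") for code in glyph_codes)


def looks_like_script(text: str, block: range, threshold: float = 0.8) -> bool:
    """Correctness heuristic: most non-space characters fall in the block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return False
    inside = sum(1 for ch in chars if ord(ch) in block)
    return inside / len(chars) >= threshold


# Two hypothetical fonts that paint the same two shapes on the page.
unicode_font = {"to_unicode": {1: "\u0915", 2: "\u093e"}}  # maps to Devanagari
legacy_font = {"to_unicode": {1: "d", 2: "k"}}             # maps to Roman letters
page = [1, 2]

for name, font in [("unicode", unicode_font), ("legacy", legacy_font)]:
    text = decode(font, page)
    print(name, has_character_map(font), looks_like_script(text, DEVANAGARI))
# unicode True True
# legacy True False
```

Both fonts satisfy the presence check, which is why a report built on that check alone can come back clean. Only the second test, which looks at what the map actually produces, separates them.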

So the exclusion is not the result of anyone deciding these readers do not matter. It is the result of a blind spot that is structural: the tools and habits were built where the foundation was already sound, and they look at every floor except the one they are standing on.

Who this leaves out, and at what cost

The people on the other side of this gap are students who are blind or have low vision and use screen readers; people with print disabilities who rely on text-to-speech; and anyone who needs to search, copy, translate, or reflow a document. In the Global South they are reading, or trying to read, in languages spoken by hundreds of millions: Hindi, Bengali, Tamil, Telugu, Marathi, Urdu, Kannada, Gujarati, Malayalam, Punjabi, Odia, and many more. The documents in question are not marginal: they are school and university textbooks, examination papers, government circulars, legal and health information. A textbook that is inaccessible in this way is not a degraded experience for these readers; it is simply absent.

Hard, current numbers on how much regional-language content carries this exact defect are thin, and we will not invent them. What is well established is that legacy non-Unicode fonts were the dominant publishing method for Indian languages for two decades and remain in wide use, and that documents produced this way are not recoverable by assistive technology without re-encoding.²³ The honest summary is: the problem is large, it is concentrated in exactly the documents people most need, and it has gone largely unmeasured because no one was looking for it.

Why we treat this as the foundation, not a footnote

It would be easy to file this under "fonts," a narrow technical curiosity. That framing is precisely the mistake. Accessibility is a stack: tagging, headings, alternative text, reading order, and what a screen reader speaks all sit on top of the assumption that the text underneath is real. Pull that assumption away and nothing above it can function. For most of the Global South's documents the assumption is absent, so the readiness of the text, in whatever language it is written, is the base of everything else.

That is what this knowledge base is about. It gathers, in one place and in plain language, what goes wrong with documents in languages other than English, how to tell, how to fix it, and what governments and institutions can require so that it stops happening. It is written for the people who remediate these documents and for the people who set the policies that produce them. The goal is simple and large at the same time: that a document in any language can be read by the readers who depend on hearing it.

Endnotes

1. Saini and Lehal, "A Survey of Language-Detection, Font-Detection and Font-Conversion Systems for Indian Languages," 2015. The detection and conversion work exists chiefly as academic and tooling fragments rather than consolidated practitioner guidance. https://www.researchgate.net/publication/281381833
2. "Why Extracting Hindi Text from PDFs Is So Much Harder Than English," The Digital Orientalist, 2025. Legacy fonts were the standard for Indian-language publishing and remain in heavy use; their text is not recoverable without re-encoding. https://digitalorientalist.com/2025/12/02/why-extracting-hindi-text-from-pdfs-is-so-much-harder-than-english-and-how-you-can-do-it/
3. "Kruti Dev," Wikipedia, 2025. Legacy non-Unicode fonts paint a script's glyphs from Roman byte positions; the rendered page is correct while the stored text is not. https://en.wikipedia.org/wiki/Kruti_Dev
4. veraPDF, "PDF/UA Part 1 validation rules," 2024. The ToUnicode requirement is implemented as a presence check, not a correctness check. https://github.com/veraPDF/veraPDF-validation-profiles/wiki/PDFUA-Part-1-rules
5. PDF Association, "Glossary of accessibility terminology in PDF," 2023. Correct Unicode mappings are what allow assistive technology to interpret content. https://pdfa.org/glossary-of-accessibility-terminology-in-pdf/