The four conditions: what makes a non-English document readable
It is tempting to think of accessibility for a regional-language document as one thing to get right. It is four, and they are independent. A document is genuinely readable by a screen reader only when all four hold together. Getting three of them perfect and missing the fourth still leaves the reader with nothing, or with the wrong thing. This entry sets out the four plainly, because they are the checklist behind everything else in this knowledge base.
1. The text is real Unicode
The first condition is that the characters stored in the file are the actual characters of the language, in the universal encoding called Unicode, and not a legacy substitute.
A legacy font stores the script as Roman bytes and only paints the local shapes, so the text underneath is gibberish to any software that reads it.[1] If this condition fails, nothing else matters: there is no real text for structure tags, a language declaration, or a screen reader to work with. This is the foundation, covered in detail in the entry on legacy fonts.
How it fails: legacy non-Unicode fonts and encodings (Kruti Dev, DevLys, TSCII, Bijoy, InPage, and their kind); a missing or incorrect character map (ToUnicode) in an otherwise modern font; or a scanned page with no text layer at all.
How to check: copy a line of the text and paste it into a plain text box. If it comes back in the right script, this condition is met. If it comes back as Roman gibberish, the font is legacy. If nothing pastes, the page is a scan.[1]
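The paste test above can be roughly automated. The following is a minimal sketch, not a definitive detector: the function names, the use of the first word of each character's Unicode name as a script label, and the 50% threshold are all assumptions made for illustration.

```python
import unicodedata
from collections import Counter


def script_profile(text: str) -> Counter:
    """Count letters and combining marks by script.

    Assumption: the first word of a character's Unicode name
    (e.g. "DEVANAGARI", "LATIN") is a good enough script label.
    """
    counts = Counter()
    for ch in text:
        # Only letters (L*) and marks (M*) say anything about the script.
        if unicodedata.category(ch)[0] in ("L", "M"):
            name = unicodedata.name(ch, "")
            counts[name.split(" ")[0] if name else "UNKNOWN"] += 1
    return counts


def classify_paste(text: str, expected_script: str) -> str:
    """Classify a pasted line as 'no text', 'legacy', or 'unicode'."""
    profile = script_profile(text)
    total = sum(profile.values())
    if total == 0:
        return "no text"  # nothing pasted: likely a scan
    share = profile[expected_script] / total
    # Hypothetical threshold: at least half the letters are in the expected script.
    return "unicode" if share >= 0.5 else "legacy"
```

For example, `classify_paste("Hkkjr ljdkj", "DEVANAGARI")` returns `"legacy"`, because every letter that came back is Latin. A document that legitimately mixes scripts would need a per-passage check rather than one threshold for the whole page.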
2. The reading order is logical
The second condition is that the text is stored in the order it is meant to be read and spoken, not the order it happens to be drawn on the page.
This matters more for complex scripts than for English. In Devanagari, Tamil, Bengali, and others, the visual position of a glyph is often not its logical position: the short-i vowel sign is drawn before the consonant it actually follows, and consonant clusters are reshaped and reordered for display.[2][3] If a document stores text in the order the glyphs were painted, every codepoint can be correct Unicode and the reading order can still be wrong, so a screen reader speaks the syllables out of sequence and search cannot find ordinary words.[4]
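A small illustration of the difference, using the Devanagari syllable "ki": on the page the short-i sign sits to the left of the consonant, but in correctly stored text the consonant comes first. The helper name `describe` is hypothetical; the sketch only lists the stored codepoints.

```python
import unicodedata


def describe(text: str) -> list[str]:
    """List each stored character as 'U+XXXX NAME', in stored order."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


# Logical (correct) storage of the syllable "ki": consonant, then vowel sign.
for line in describe("\u0915\u093f"):
    print(line)
# U+0915 DEVANAGARI LETTER KA
# U+093F DEVANAGARI VOWEL SIGN I
```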
How it fails: a generator that builds the text layer from the visual glyph order; reordered vowel signs and conjuncts left in display order; right-to-left text (Arabic, Urdu) stored reversed.
How to check: this is harder to confirm by eye. A practical signal is search: if you have to type a word in an unnatural order to find it, the stored order is visual, not logical. At scale it is detected by comparing the stored character order against the script's syllable rules.[4]
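One of those syllable rules can be sketched in a few lines: a dependent vowel sign must follow a consonant. This is a minimal sketch for Devanagari only, under the assumption that a single rule is a useful first signal; it does not model the full syllable grammar, conjunct reordering, or any other script. The function and constant names are hypothetical.

```python
# Devanagari codepoint ranges (from the Unicode Devanagari block).
CONSONANTS = set(range(0x0915, 0x093A)) | set(range(0x0958, 0x0960))
NUKTA = 0x093C
VOWEL_SIGNS = set(range(0x093E, 0x094D))  # U+093E through U+094C


def misplaced_vowel_signs(text: str) -> list[int]:
    """Return indexes of vowel signs that do not follow a consonant.

    A nukta between the consonant and the vowel sign is allowed.
    """
    bad = []
    for i, ch in enumerate(text):
        if ord(ch) not in VOWEL_SIGNS:
            continue
        j = i - 1
        if j >= 0 and ord(text[j]) == NUKTA:
            j -= 1  # skip the nukta and look at the character before it
        if j < 0 or ord(text[j]) not in CONSONANTS:
            bad.append(i)
    return bad
```

A word stored in visual order, with the short-i sign ahead of its consonant, is flagged at index 0, while the same word in logical order returns an empty list.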
3. The language is declared
The third condition is that the document says what language it is in, both overall and for any passage that switches language.
A screen reader chooses its pronunciation rules and its voice from the document's declared language (the /Lang value). If no language is declared, the screen reader falls back to its default, usually English, and reads the text with English pronunciation, which turns correct Hindi or Tamil into noise.[5][6] If the wrong language is declared, it reads with the wrong rules, which is equally unintelligible. And a document that mixes an English administrative wrapper with a regional-language body, very common in the Global South, needs each part tagged with its own language, or the screen reader mispronounces one of them throughout.[6]
How it fails: no document-level /Lang; no per-passage language on mixed-language content; a /Lang value that is wrong or not a valid language tag.
How to check: inspect the document's language setting and confirm it matches the actual language; for mixed documents, confirm each passage carries its own. One residual limit is worth knowing: even a correct language tag only helps if the user's screen reader has a voice for that language installed.[7]
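The three failure modes above (missing, invalid, wrong) can be told apart with a small check once the /Lang value has been read out of the document by whatever PDF tool is in use. This is a sketch under stated assumptions: the regular expression is a simplified shape test rather than a full language-tag validator, and the `EXPECTED_SCRIPT` table is a hypothetical, deliberately tiny mapping that would need extending.

```python
import re
from typing import Optional

# Simplified shape: language subtag, optional script subtag, optional region.
# Assumption: this is NOT a complete BCP 47 validator.
TAG_SHAPE = re.compile(r"^[A-Za-z]{2,3}(-[A-Za-z]{4})?(-([A-Za-z]{2}|[0-9]{3}))?$")

# Hypothetical mapping from primary language subtag to the script usually expected.
EXPECTED_SCRIPT = {
    "hi": "DEVANAGARI",
    "mr": "DEVANAGARI",
    "ta": "TAMIL",
    "bn": "BENGALI",
    "ur": "ARABIC",
    "en": "LATIN",
}


def check_lang(declared: Optional[str], dominant_script: str) -> str:
    """Return 'missing', 'invalid', 'unknown', 'mismatch', or 'ok'."""
    if not declared:
        return "missing"
    if not TAG_SHAPE.match(declared):
        return "invalid"
    primary = declared.split("-")[0].lower()
    expected = EXPECTED_SCRIPT.get(primary)
    if expected is None:
        return "unknown"  # valid shape, but not in the table above
    return "ok" if expected == dominant_script else "mismatch"
```

Here `dominant_script` is a script label such as `"DEVANAGARI"` for the text actually present. A Hindi body declared as `en-US` comes back as `"mismatch"`, which is the case where a screen reader would apply English pronunciation rules.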
4. The font is Unicode-compliant for the script
The fourth condition is that the font actually used can render the script's characters correctly, including its conjuncts and combining marks, and is itself a proper Unicode font.
This is the one most people reach for first, and on its own it fixes nothing. A Unicode-compliant font wrapped around legacy Roman bytes still produces gibberish underneath; the font is necessary but not sufficient.[1] Where it does matter independently is coverage: a font that claims a script but lacks specific conjuncts or rarer characters renders empty boxes (called tofu) or wrong shapes, and the reader, sighted or not, loses the content.[8]
How it fails: a legacy font (fails condition 1 as well); a Unicode font with incomplete coverage of the script's conjuncts; the wrong font for the script.
How to fix: use a Unicode-compliant font designed for the script. The Noto Sans family is the safe cross-platform default (Noto Sans Devanagari, Tamil, Bengali, and so on), with Lohit and the Windows fonts Mangal and Nirmala UI as alternatives.[9][10]
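Basic coverage can be checked by comparing the characters a document uses against the codepoints a font maps. The sketch below assumes that set of codepoints has already been extracted from the font's character map by a font-inspection tool; the function name and the way the set is supplied are hypothetical. It covers base characters only: conjuncts are formed by the font's layout tables, so a clean result here does not prove that every conjunct renders correctly.

```python
def missing_codepoints(text: str, font_codepoints: set[int]) -> list[str]:
    """Return the codepoints used in `text` that the font does not map.

    Whitespace is ignored. Results are sorted and formatted as 'U+XXXX'.
    """
    used = {ord(ch) for ch in text if not ch.isspace()}
    return [f"U+{cp:04X}" for cp in sorted(used - font_codepoints)]
```

Any codepoint in the returned list is a candidate for tofu unless another font in the fallback chain covers it.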
Why all four, and in this order
Although each condition can fail on its own, the four form a dependency chain in spirit. Real Unicode text (1) is the ground; without it the rest is moot. Logical order (2) and a declared language (3) determine whether the real text is spoken correctly. A compliant font (4) makes it visible and complete. The common and costly mistake is to treat condition 4 as the whole problem, swap in a Unicode font, and ship a document whose bytes are still legacy and still unreadable. The discipline this knowledge base asks for is to check all four, every time, starting from the foundation.
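The ordering can be expressed as a small runner. This is one possible sketch, not a prescribed workflow: it assumes each condition is supplied as a zero-argument function returning True or False, with condition 1 first, and it marks the remaining conditions as "blocked" when the foundation fails, since there is no real text for them to examine. The name `run_checklist` is hypothetical.

```python
from typing import Callable


def run_checklist(checks: list[tuple[str, Callable[[], bool]]]) -> dict[str, str]:
    """Run checks in order; the first entry is treated as the foundation.

    Returns a mapping from check name to 'pass', 'fail', or 'blocked'.
    """
    results: dict[str, str] = {}
    foundation_ok = True
    for index, (name, check) in enumerate(checks):
        if not foundation_ok:
            results[name] = "blocked"  # no real text to examine
            continue
        passed = check()
        results[name] = "pass" if passed else "fail"
        if index == 0 and not passed:
            foundation_ok = False
    return results
```

When the foundation holds, every remaining condition is still evaluated, so a document with three passes and one failure is reported in full rather than stopping at the first problem.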
Endnotes
[1] Kruti Dev (Wikipedia, 2025) and PubCom, "Fonts, Unicode, OpenType, and Accessibility" (2013). Legacy fonts store Roman bytes; a Unicode font alone does not recover legacy text. https://en.wikipedia.org/wiki/Kruti_Dev ; https://www.pubcom.com/blog/2013_12-03/unicode-accessibility.html
[2] Richard Ishida / W3C, "An Introduction to Indic Scripts," 2003. Glyph reordering and conjunct formation; visual order differs from logical order. https://www.w3.org/2002/Talks/09-ri-indic/indic-paper.html
[3] Unicode 16.0 Core Specification, Chapter 12 (South and Central Asia). The virama model and the reordering of dependent vowel signs. 2024. https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-12/
[4] "Searchable PDF with Devanagari texts." Icebearsoft, 2017. The short-i vowel sign precedes its consonant in display order, so search requires the visual order; conjunct composition creates spurious word boundaries. http://icebearsoft.euweb.cz/dvngpdf/
[5] W3C WAI, Technique PDF16 "Setting the default language using the /Lang entry," and PDF19 for passages. 2016 to 2023. https://www.w3.org/WAI/WCAG21/Techniques/pdf/PDF16
[6] WebAIM, "Document and Content Language," 2024. Text read with the wrong language's pronunciation rules becomes unintelligible; mixed-language content needs per-passage tagging. https://webaim.org/techniques/language/
[7] Adrian Roselli, "Don't Override Screen Reader Pronunciation," 2023. A correct language tag helps only if a matching voice is installed. https://adrianroselli.com/2023/04/dont-override-screen-reader-pronunciation.html
[8] SymbolFYI, "Tofu: Why Characters Show as Empty Rectangles," 2024. Missing glyph coverage produces empty boxes when no font in the fallback chain covers the character. https://symbolfyi.com/guides/tofu-missing-glyphs/
[9] Pramukh IME, "List of Unicode fonts for Indian languages," 2024. Per-script Unicode font recommendations. https://www.pramukhime.com/blog/indic-unicode-fonts
[10] Nirmala UI (Wikipedia, 2024). A single Windows family covering most Indic scripts. https://en.wikipedia.org/wiki/Nirmala_UI