Remediating a regional-language book: the workflow

This entry is the practical sequence for taking a document in a non-English language and making it readable by assistive technology. It assumes you have read the four conditions; the work here is to satisfy all four, in an order that does not waste effort. The single most important idea is the order itself: fix the foundation (real text) first, because every later step depends on it, and doing the later steps first means doing them twice.

Step 0: find out what kind of problem you have

Before any fixing, triage. The remedy is completely different depending on which of three states the document is in, and they are easy to tell apart.

Copy a line of the regional-language text and paste it into a plain text box.
It pastes in the correct script: the text is real Unicode. Skip to Step 3 (this is an ordinary tagging job).
It pastes as Roman gibberish: the document uses a legacy font. Go to Step 1.
Nothing pastes, or you cannot select text: the page is a scan with no text layer. Go to Step 2.

Do this triage per document, and on mixed documents per section, because a single file can combine real text, legacy text, and scanned pages.¹
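
The paste test can also be expressed as a small check, which helps when many sections need triage. What follows is a minimal sketch, not a definitive tool: the function name, the 50 percent threshold, and the use of Unicode character names to detect the script are all assumptions made for illustration, and the line you test must be one that is visibly in the regional script on the page (an English line will always paste as Latin).

```python
import unicodedata


def classify_paste(pasted: str, script_keyword: str = "DEVANAGARI",
                   threshold: float = 0.5) -> str:
    """Classify the result of pasting one regional-language line.

    Returns "unicode", "legacy", or "scan". The threshold is an
    illustrative assumption, not a value taken from the guide.
    """
    # Keep letters and combining marks (vowel signs are marks, not letters).
    relevant = [
        ch for ch in pasted
        if ch.isalpha() or unicodedata.category(ch).startswith("M")
    ]
    if not relevant:
        return "scan"  # nothing pasted or nothing selectable: go to Step 2

    in_script = sum(
        1 for ch in relevant if script_keyword in unicodedata.name(ch, "")
    )
    if in_script / len(relevant) >= threshold:
        return "unicode"  # real text: go to Step 3
    return "legacy"  # Roman gibberish: go to Step 1


print(classify_paste("किताब"))  # a line that pasted in the correct script
print(classify_paste("fdrkc"))  # a line that pasted as Roman letters
print(classify_paste(""))       # nothing could be selected
```

For another script, pass the matching keyword from the Unicode character names, for example "TAMIL" or "GURMUKHI".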

Step 1: legacy text, convert it to Unicode

If the text is editable but legacy-encoded, the fix is to convert it to Unicode, then re-make the document.

Identify the exact legacy font from the file (by its name, or by seeing which converter produces correct output). The conversion depends on the precise variant: Kruti Dev 010 does not use the same mapping as another Kruti Dev variant or as Shree-Dev.²
Run the text through a converter for that font. Mature, mostly open-source converters exist per script: GUCA for Gurmukhi, ascii2unicode for Kannada, the OdiaWikimedia converter for Odia, Android-TamilUtil for Tamil, and several tools for Devanagari.³ ⁴ Devanagari has the strongest ecosystem; coverage for other scripts is thinner and needs more care.
Check the hard parts. Good conversion has to handle conjuncts and the reordering of vowel signs, and must not mangle English text embedded in the regional-language content. Weaker tools drop or misplace conjuncts, matras, and the nukta.² ⁵
Proofread. Legacy conversion is not guaranteed perfect; treat any "100 percent accurate" claim with caution and have a reader of the language check the output before it is called done.²
Re-make the document with the converted Unicode text in a Unicode-compliant font (Step 4 below), then tag it (Step 3).
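
To make the "hard parts" concrete, here is a toy sketch of why conversion is more than a character-for-character substitution. The five-entry mapping below is hypothetical and exists only for illustration; it is not the table of any real font variant, and the sketch deliberately ignores conjuncts, the nukta, and embedded English. It shows one thing: a legacy font stores the short-i vowel sign before its consonant (the order in which it is drawn), while Unicode stores it after the consonant (the order in which it is spoken), so the converter must reorder.

```python
# Hypothetical mapping for illustration only. A real converter needs the
# complete table for the exact font variant identified in the first item.
TOY_MAP = {"d": "\u0915", "r": "\u0924", "c": "\u092C", "k": "\u093E"}
TOY_CONSONANTS = {"\u0915", "\u0924", "\u092C"}  # KA, TA, BA
PRE_BASE_I = "f"         # assumed legacy code for the short-i sign
VOWEL_SIGN_I = "\u093F"  # DEVANAGARI VOWEL SIGN I


def convert(legacy: str) -> str:
    """Map legacy codes to Unicode, moving short-i after its consonant."""
    out = []
    pending_i = False
    for ch in legacy:
        if ch == PRE_BASE_I:
            pending_i = True  # hold the sign until its consonant arrives
            continue
        mapped = TOY_MAP.get(ch, ch)  # unmapped characters pass through
        out.append(mapped)
        if pending_i and mapped in TOY_CONSONANTS:
            out.append(VOWEL_SIGN_I)
            pending_i = False
    if pending_i:
        out.append(VOWEL_SIGN_I)  # no consonant followed; keep the sign
    return "".join(out)


print(convert("fdrkc"))
```

A real tool has to place the sign after a whole conjunct cluster, not just after a single consonant, which is one reason weaker converters misplace matras.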

If the original source file (in a word processor or layout tool) still exists, re-typesetting it in Unicode from that source is more reliable than converting the PDF, and it is the right choice for documents that recur, such as templates and textbook series.

Step 2: scanned pages, rebuild the text with OCR

If the page is a scanned image with no text layer, conversion has nothing to work on. Use optical character recognition to create a genuine Unicode text layer from the picture.

Use an OCR engine with a model for the script: Tesseract with the relevant Indic language data handles many scripts, and PaddleOCR handles complex layouts.⁶
Produce an "image-over-text" PDF (the original page image kept on top, a searchable Unicode text layer underneath) so the document looks unchanged while becoming readable.⁷
Proofread. OCR accuracy figures are character-level, not word-level, and they fall on small type, low contrast, and dense conjunct scripts, so the output needs checking, especially for the scripts that are hardest to recognise.⁷
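
As a starting point for the first two items above, the Tesseract command-line tool can write a searchable PDF directly when given the `pdf` output option and a language code with `-l`. The sketch below only builds and runs that command; it assumes Tesseract and the Hindi language data (`hin`) are installed, and the function and file names are hypothetical. Check the resulting PDF with the paste test before relying on it.

```python
import subprocess
from pathlib import Path


def tesseract_pdf_command(image: Path, out_base: Path,
                          lang: str = "hin") -> list[str]:
    """Build the command: tesseract <image> <output base> -l <lang> pdf."""
    return ["tesseract", str(image), str(out_base), "-l", lang, "pdf"]


def ocr_page(image: Path, out_base: Path, lang: str = "hin") -> Path:
    """Run Tesseract on one page image and return the path of the PDF.

    Raises CalledProcessError if Tesseract exits with an error, and
    FileNotFoundError if the tesseract executable is not installed.
    """
    subprocess.run(tesseract_pdf_command(image, out_base, lang), check=True)
    return Path(str(out_base) + ".pdf")  # Tesseract appends the extension


# Show the command without running it (file names are placeholders).
print(" ".join(tesseract_pdf_command(Path("page1.png"), Path("page1"))))
```

Pages that mix scripts can be given more than one language, joined with a plus sign, for example `lang="hin+eng"`.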

OCR is a recovery method, not a first choice. If the text is editable legacy text, Step 1 is faster and more accurate than re-recognising a picture of it.
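
The proofreading caveat in Step 2 rests on simple arithmetic: a word is only right if every character in it is right, so a character-level accuracy figure overstates how many words come out correct. The sketch below is a rough estimate, not a measurement. It assumes character errors are independent of one another and that a "character" is whatever unit the quoted accuracy figure counts; both are simplifications, and the function name is hypothetical.

```python
def estimated_word_accuracy(char_accuracy: float,
                            avg_word_length: float) -> float:
    """Estimate the share of fully correct words from character accuracy.

    Assumes independent character errors, so a word of n characters is
    correct with probability char_accuracy ** n.
    """
    if not 0.0 <= char_accuracy <= 1.0:
        raise ValueError("char_accuracy must be between 0 and 1")
    if avg_word_length <= 0:
        raise ValueError("avg_word_length must be positive")
    return char_accuracy ** avg_word_length


# 98 percent character accuracy on words averaging six characters.
print(round(estimated_word_accuracy(0.98, 6), 3))
```

Under these assumptions, 98 percent character accuracy on six-character words leaves about 89 percent of words fully correct, so roughly one word in nine still needs a human correction.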

Step 3: structure the document (tagging and reading order)

Once the text is real Unicode (whether it always was, or you produced it in Step 1 or 2), the document needs the ordinary accessibility structure that the general Document Accessibility Guide covers: a tag tree, headings, lists, tables, alternative text for images, and a correct logical reading order.

For regional-language documents pay particular attention to reading order. Complex scripts can leave text in visual rather than logical order, so confirm that the tagged reading order matches how the text is spoken, not how the glyphs were drawn.⁸ This is the second of the four conditions, and it is the one most easily missed after the encoding is fixed.
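
One symptom of visual-order text can be spotted mechanically: a vowel sign or other combining mark that sits at the start of a word instead of after the letter it belongs to. The sketch below is a heuristic under stated assumptions, not a reading-order validator. It relies only on Unicode general categories, the function name is hypothetical, and it says nothing about whether the tag tree itself is in the right order, which still has to be checked in the tagging tool.

```python
import unicodedata


def orphan_marks(text: str) -> list[int]:
    """Return indexes of combining marks that do not follow a letter or mark.

    In logically ordered text, a dependent vowel sign follows its
    consonant, so a mark at the start of a word suggests visual order.
    """
    flagged = []
    prev_category = ""
    for index, ch in enumerate(text):
        category = unicodedata.category(ch)
        if category.startswith("M") and not prev_category.startswith(("L", "M")):
            flagged.append(index)
        prev_category = category
    return flagged


logical = "\u0915\u093F"  # KA then VOWEL SIGN I: stored as spoken
visual = "\u093F\u0915"   # VOWEL SIGN I then KA: stored as drawn
print(orphan_marks(logical))
print(orphan_marks(visual))
```

An empty result does not prove the order is right; it only means this particular symptom is absent.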

Step 4: declare the language and choose the right font

Two finishing steps that are cheap and mandatory.

Declare the language. Set the document's default language, and set a per-passage language wherever the document switches (for example an English heading over a Hindi body). Without this a screen reader reads everything with its default pronunciation, usually English, and correct text becomes noise.⁹
Use a Unicode-compliant font for the script. The Noto Sans family is the safe cross-platform default (Noto Sans Devanagari, Tamil, Bengali, Kannada, Telugu, Malayalam, Gujarati, Gurmukhi, Oriya); Lohit is another open-source option; Mangal and Nirmala UI are common on Windows.¹⁰ ¹¹ Remember that the font alone never fixes legacy text; it only matters once Step 1 has made the text real.
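
Finding where a document switches language can be partly automated by splitting the text into runs of one script each. The sketch below is an aid for locating candidate passages, offered under clear assumptions: it detects script, not language (Devanagari alone is shared by Hindi, Marathi, Nepali, and others), the script-to-language table is a hypothetical example, and the function names are illustrative. The language codes still have to be applied in the tagging tool.

```python
import unicodedata

# Hypothetical mapping for one Hindi and English book; adjust per document.
LANG_FOR_SCRIPT = {"LATIN": "en", "DEVANAGARI": "hi"}


def script_of(ch: str) -> str:
    """Return the script word from the Unicode name, or "neutral"."""
    if not (ch.isalpha() or unicodedata.category(ch).startswith("M")):
        return "neutral"  # spaces, digits, punctuation
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "neutral"


def script_runs(text: str) -> list[tuple[str, str]]:
    """Split text into (script, passage) runs; neutral characters stay
    with the run they follow."""
    runs: list[list[str]] = []
    for ch in text:
        script = script_of(ch)
        if runs and (script == "neutral" or script == runs[-1][0]):
            runs[-1][1] += ch
        elif runs and runs[-1][0] == "neutral":
            runs[-1][0] = script  # leading neutral text joins the first run
            runs[-1][1] += ch
        else:
            runs.append([script, ch])
    return [(script, passage) for script, passage in runs]


for script, passage in script_runs("Chapter 1 पहला पाठ"):
    print(LANG_FOR_SCRIPT.get(script, "unknown"), repr(passage))
```

Each run boundary is a place where a per-passage language declaration is probably needed.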

Step 5: verify

Re-run the four conditions on the finished document, and where possible do a short screen-reader spot check in the document's language. The verification is not "does it look right", which it always did, but "does it read right": paste-test the text again, confirm the language is declared, and listen to a passage if a voice for the language is available. A document that passes all four conditions and reads correctly aloud is done; one that passes a visual check is not.

The order matters most

If you take one thing from this entry, take the order. Real text first, then structure, then language and font, then verify by reading rather than by looking. Reverse it (tag and style a legacy document before fixing its encoding) and you will have built a careful, compliant-looking structure on top of text that no reader can hear, and you will have to do it all again. The foundation comes first because everything else stands on it.

Endnotes

1. "Why Extracting Hindi Text from PDFs Is So Much Harder Than English." The Digital Orientalist, 2025. Documents can mix real, legacy, and scanned content; per-publisher legacy encodings. https://digitalorientalist.com/2025/12/02/why-extracting-hindi-text-from-pdfs-is-so-much-harder-than-english-and-how-you-can-do-it/
2. Hindi font converter guidance, 2024 to 2025. Conversion depends on the exact font variant; requires handling conjuncts, vowel signs, anusvara, visarga, halant, and numerals; embedded English must be preserved; output must be proofread. https://hindifontconverter.com/
3. GUCA (Gurmukhi Unicode Conversion Application), ascii2unicode (Kannada), and the OdiaWikimedia Converter. 2014 to 2024. https://guca.sourceforge.net/applications/guca/ ; https://github.com/aravindavk/ascii2unicode ; https://github.com/OdiaWikimedia/Converter
4. Android-TamilUtil (Unicode and Bamini, Anjal, TAB, TAM, TSCII). 2024. https://github.com/mayooresan/Android-TamilUtil
5. Centre for Internet and Society (India), "Converting from non-Unicode (Nudi, Baraha) font encoding to Unicode Kannada," 2014. Conversion must handle dependent-vowel reordering. https://cis-india.org/openness/blog-old/converting-from-non-unicode-nudi-baraha-font-encoding-to-unicode-kannada
6. Tesseract OCR project, with Indic script models; PaddleOCR for complex layouts. 2025. https://tesseractocr.org/
7. University of Illinois Library, "OCR Best Practices," 2024. Accuracy figures are character-level; small or low-contrast type reduces accuracy; image-over-text preserves the page image with a searchable layer. https://guides.library.illinois.edu/OCR/bestpractices
8. "Searchable PDF with Devanagari texts." Icebearsoft, 2017. Complex scripts can leave text in visual rather than logical order. http://icebearsoft.euweb.cz/dvngpdf/
9. WebAIM, "Document and Content Language," 2024. Undeclared or wrong language causes the wrong pronunciation rules; mixed content needs per-passage tagging. https://webaim.org/techniques/language/
10. Pramukh IME, "List of Unicode fonts for Indian languages," 2024. https://www.pramukhime.com/blog/indic-unicode-fonts
11. Nirmala UI (Wikipedia, 2024). A single Windows family covering most Indic scripts. https://en.wikipedia.org/wiki/Nirmala_UI