Can a Vision LLM Faithfully Transcribe Two Million Words?

A close look at conversion accuracy across nine tools and the hidden problem of “silent normalization”.

Introduction

Libraries, publishers and accessible-media producers convert a great deal of their material from PDF into the formats their readers depend on, for example braille, DAISY, or EPUB. It is often an unrewarding starting point: structure missing or broken, reading order unreliable, headings and lists unmarked, images without alternative text — all of which must be repaired before a usable accessible format can be produced, and all of which, left uncorrected, degrade the experience for the person who finally reads with a screen reader, refreshable braille display, or reflowed text. A new generation of AI-powered conversion tools is beginning to change that calculus. These tools can recognise text on poor and low-resolution scans, reconstruct a logical reading order and strip away artefacts such as running headers and watermarks, encode mathematics as LaTeX or MathML, and mark up the headings, lists and tables that make a document navigable. Done well, this makes the production of accessible formats far more efficient — and the end result better for the reader — as our companion benchmark, PDF Conversions Put to the Test, sets out in detail.

But a conversion is only useful if it is faithful. The research question we set out to answer was narrow and practical:

Can a vision LLM accurately reproduce the text of a long PDF — without hallucinating, editorialising, or silently “improving” what it reads?

A model that quietly corrects a typo, swaps a word for a synonym, or fixes a misspelled name is not transcribing; it is editing. For accessible document-conversion purposes, that is a defect, not a feature, even when the model believes the “correction” is an improvement.

And this concern is not confined to a single standalone model. A growing number of PDF-conversion pipelines now place a large language model at the heart of their workflow — Mistral’s OCR service, Marker’s neural pipeline, and accessibility-focused tools such as Equalify Reflow all lean on an LLM to interpret layout and produce clean output. Wherever an LLM sits in the pipeline, the same editorialising tendency can follow. So while our study began with one multimodal LLM (Google’s Gemini), the question it raises applies to the whole emerging class of LLM-assisted conversion tools — and, as we will see, the tendency showed up in every one of them.

The test document and the ground truth

We used a single, demanding source: a 317-page report, Turning the Tide Together, the final report of the Joint Federal/Provincial Mass Casualty Commission. It is a long, formally structured document with footnotes, tables, pull quotes, and proper nouns — exactly the kind of material that exposes conversion weaknesses.

Because the PDF was a tagged PDF, it carried an embedded text layer. We assumed that exporting that text with Acrobat Pro would give us a reliable “ground truth” reference to measure conversions against. The model under primary test was Gemini Flash 3.5, prompted to convert the PDF to structured Markdown.

An important caveat emerged immediately, and it shaped the whole analysis: the assumed ground truth was not actually ground truth. As we will see, the reference — drawn from the PDF’s embedded text layer — itself contained systematic errors. So the role of arbiter passed from the Acrobat export to the only thing that could settle a disagreement: the text visually printed on the page, checked by eye. (We later confirmed, by copying directly from the PDF and using a second extraction tool, that the faults are in the PDF’s text layer rather than introduced by Acrobat’s export.) This is worth stating plainly, because it is a key lesson of the study — the presence of a text layer is no guarantee of accuracy: OCR or other image-based reading can yield more faithful results.

Method

Comparing two conversions of the same document is harder than it sounds, because the two outputs inevitably differ in ways that are not errors: the Gemini output included page-boundary markers and Markdown structure, paragraphs were split at different points, and so on. A naïve character-by-character diff would drown in this noise.

So we used an alignment-based approach designed to compare only genuinely corresponding text:

We extracted paragraphs from both documents and kept only those of at least 100 characters, which screens out page numbers, running headers, and short structural fragments.
We excluded tables, which the tools may build differently.
We matched each reference paragraph to its counterpart in the Gemini output, so that corresponding blocks were compared.
Within each matched pair we computed a word-level diff, and then categorised every difference.

This yielded 1,404 aligned text blocks covering roughly 70,000 words of comparable prose — a substantial sample of the document’s body text.
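The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study's actual tooling: the blank-line paragraph splitting, the greedy best-match alignment, the 0.6 similarity threshold, and the way differing words are counted are all choices made for this sketch, and the sample sentence is invented around the “theme” typo discussed later. Only the 100-character minimum comes from the method as described.

```python
import difflib
import re

MIN_CHARS = 100  # minimum paragraph length stated in the method


def extract_paragraphs(text, min_chars=MIN_CHARS):
    """Split on blank lines, collapse whitespace, drop short fragments."""
    blocks = re.split(r"\n\s*\n", text)
    paragraphs = [" ".join(block.split()) for block in blocks]
    return [p for p in paragraphs if len(p) >= min_chars]


def align(reference, conversion, threshold=0.6):
    """Greedily pair each reference paragraph with its most similar,
    not yet used, conversion paragraph (threshold is an assumption)."""
    pairs, used = [], set()
    for ref in reference:
        best_index, best_score = None, threshold
        for i, conv in enumerate(conversion):
            if i in used:
                continue
            score = difflib.SequenceMatcher(None, ref, conv, autojunk=False).ratio()
            if score > best_score:
                best_index, best_score = i, score
        if best_index is not None:
            used.add(best_index)
            pairs.append((ref, conversion[best_index]))
    return pairs


def word_diff(ref, conv):
    """Return (reference_words, conversion_words) for every differing span."""
    a, b = ref.split(), conv.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def word_difference_rate(pairs):
    """Differing words divided by reference words.
    Each differing span counts as the longer of its two sides (an assumption)."""
    differing = sum(
        max(len(ref_words), len(conv_words))
        for ref, conv in pairs
        for ref_words, conv_words in word_diff(ref, conv)
    )
    total = sum(len(ref.split()) for ref, _ in pairs)
    return differing / total if total else 0.0


# Invented sample text; only the "theme" typo is taken from the study.
reference_text = (
    "Page 12\n\n"
    "The Commission recommends that community organizations receive funding "
    "to enable theme to provide support services across the province."
)
conversion_text = (
    "## Heading\n\n"
    "The Commission recommends that community organizations receive funding "
    "to enable them to provide support services across the province."
)

pairs = align(extract_paragraphs(reference_text), extract_paragraphs(conversion_text))
for ref, conv in pairs:
    print(word_diff(ref, conv))
print(word_difference_rate(pairs))
```

The short fragments (“Page 12”, the heading) fall below the length threshold and never reach the diff, which is the point of the filter: only genuinely corresponding prose is compared.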

What we found: most “errors” were not the model’s

The raw word-difference rate between the Acrobat reference and the Gemini conversion was 3.34%. Taken at face value, that would suggest the model erred on roughly one word in thirty.

It did not. When we categorised the differences, the overwhelming majority traced back to faults in the PDF’s embedded text layer (reproduced faithfully by the Acrobat export), not to Gemini:

Hyphenation artifacts (~478 cases): words split across line breaks, such as estab-lished or oppor-tunities, that the export preserved as literal hyphens. Gemini correctly rejoined them.
Systematic glyph mis-mapping (~175 cases): the embedded text layer systematically lowercased capital V, I, and X at the character level, a ToUnicode mapping fault. This produced errors like COvID-19, SUv, NExUS, volume for Volume, and inquiry for Inquiry. We confirmed this fault lives in the PDF’s own text layer, not in Acrobat’s export, in two ways: by copying directly from the original PDF, and, more conclusively, by extracting the text with an entirely different tool (MuPDF), which reproduced the same COvID and SUv glyph faults. The page images, by contrast, are correct, which is why every image-based method (PaddleOCR and the vision LLMs alike) read COVID and SUV correctly.
Word-joining, truncation, and stray-asterisk faults (about six distinct cases): beyond the systematic glyph and hyphenation problems, the text layer carried a small number of discrete word-level errors where the stored text simply disagrees with the printed page. Spaces were lost (“Scotiashouldmakein-personconflictresolutiontraining” for “Scotia should make in-person conflict resolution training”; “MassCasualtyCommission”; the French “commissiondespertesmassives”), a word was truncated (“hal” for “hall”), and footnote-reference asterisks were glued onto the words they followed (“efficient.*”, “realistic.*”). We confirmed these are text-layer faults, not export quirks, because MuPDF reproduced them and every image-based tool read the page correctly.

In short, the bulk of the 3.34% was the reference being wrong and Gemini being right. The headline number was an artifact of an imperfect ground truth.
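To make the categories above concrete, here is a rough sketch of how a single difference between the text layer and the printed page might be labelled. The function name, the labels, and the heuristics are assumptions made for illustration; the study describes its categorisation only in prose, and anything these rules cannot explain would still need checking by eye.

```python
def classify_difference(text_layer, page):
    """Label one difference between the stored text and the printed page.

    Heuristics only: anything not matched is left for manual review.
    """
    if text_layer == page:
        return "identical"
    # Hyphenation artifact: dropping hyphens from the stored text gives the page word.
    if "-" in text_layer and text_layer.replace("-", "") == page:
        return "hyphenation artifact"
    # Glyph mis-mapping: same characters, except capital V, I or X stored in lowercase.
    if len(text_layer) == len(page) and all(
        stored == printed or (printed in "VIX" and stored.upper() == printed)
        for stored, printed in zip(text_layer, page)
    ):
        return "glyph mis-mapping"
    # Footnote asterisk glued onto the preceding word.
    if text_layer.replace("*", "") == page:
        return "stray asterisk"
    # Words run together: the page text without spaces equals the stored text.
    if page.replace(" ", "") == text_layer:
        return "lost spaces"
    return "needs manual review"


examples = [
    ("estab-lished", "established"),
    ("COvID-19", "COVID-19"),
    ("NExUS", "NEXUS"),
    ("efficient.*", "efficient."),
    ("MassCasualtyCommission", "Mass Casualty Commission"),
    ("hal", "hall"),
]
for stored, printed in examples:
    print(stored, "->", classify_difference(stored, printed))
```

Note that the truncated “hal” falls through to manual review: a rule loose enough to accept it would also accept genuine word changes, so it is safer to leave such cases to a human.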

It is worth pausing on what that broken text layer means in human terms. The embedded text layer is not an abstraction — it is precisely what a screen reader user hears when they open this PDF. Every fault we catalogued above is, for that reader, the actual document: they encounter “COvID,” “SUv,” and “MassCasualtyCommission” run together, the truncated “hal,” and the footnote asterisks read aloud mid-sentence. The sighted reader sees a clean, correct page; the screen reader user receives the corrupted layer beneath it. This is the quiet accessibility crisis that makes faithful conversion matter in the first place — and it is also why a tool that reads the page image rather than the broken text layer is not merely more accurate in the abstract but can deliver to a screen reader user what the page actually says.

The real concern: silent normalization

Filtering out the Acrobat-side faults left a very small set of differences that were genuinely the model changing the source text. These were rare — a handful across 70,000 words — but they are the findings that matter, because they reveal the model’s behaviour.

A methodological note is important here. We verified each candidate deviation by hand against both the original PDF and the conversion’s own text and discarded any alignment artifacts. The cases below are the ones that survived that check.

Gemini’s deviations clustered into one behaviour: silent normalization. The model was not transcribing what it saw; it was tidying it up.

Where the source read “enable theme to provide,” Gemini output “them” — correcting a typo.
Where the source read “would came back to visit,” Gemini output “come” — correcting grammar.
Where the source read “Mi’kmag’ki,” Gemini output “Mi’kmaq’ki” — altering the spelling of a proper noun (for what it is worth, both are correct).
And in one case the model introduced an outright error: a surname printed as “Mae” was rendered as “Mac.”

Most of these “corrections” make the text more readable. That is precisely the problem. For faithful document conversion we do not want the model deciding what the author meant; we want what the author wrote. A system that silently fixes a typo, or alters a name, cannot be trusted to reproduce a record verbatim.

Strengthening the prompt

The original conversion prompt was detailed about structure, footnotes, tables, and figures — but it never actually instructed the model to transcribe verbatim. We added an explicit, high-priority fidelity rule near the top of the prompt and reinforced it in the output contract. In essence:

Transcribe the source verbatim, character for character. Do not correct spelling, grammar, punctuation, or word choice, even when the text is clearly wrong. Do not substitute synonyms or normalize the spelling of names and places. The only permitted changes are those the prompt explicitly directs (removing running headers and page numbers, rejoining end-of-line hyphenation, and applying Markdown structure). Fidelity to the source always outranks readability.

We included the carve-out for end-of-line hyphenation: rejoining estab-lished into established is a layout correction, not a content change, and we did want that.
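As a rough illustration of the placement described above, the sketch below assembles a prompt with the fidelity rule first and a reminder in the output contract. The section headings, the structural instruction, and the contract wording are hypothetical placeholders; only the fidelity rule and the three permitted changes are taken from the summary above, and the study's real prompt is not reproduced here.

```python
# Fidelity rule, as summarised in the text above.
FIDELITY_RULE = (
    "Transcribe the source verbatim, character for character. "
    "Do not correct spelling, grammar, punctuation, or word choice, "
    "even when the text is clearly wrong. "
    "Do not substitute synonyms or normalize the spelling of names and places. "
    "Fidelity to the source always outranks readability."
)

# The layout changes the text says the prompt explicitly allows.
PERMITTED_CHANGES = [
    "remove running headers and page numbers",
    "rejoin end-of-line hyphenation",
    "apply Markdown structure",
]

# Hypothetical stand-in for the detailed structural instructions.
STRUCTURE_INSTRUCTIONS = (
    "Convert the PDF to structured Markdown, preserving footnotes, tables, and figures."
)


def build_prompt():
    """Put the fidelity rule at the top and repeat it in the output contract."""
    sections = [
        "PRIORITY 1: FIDELITY\n" + FIDELITY_RULE,
        "The only permitted changes are: " + "; ".join(PERMITTED_CHANGES) + ".",
        "STRUCTURE\n" + STRUCTURE_INSTRUCTIONS,
        "OUTPUT CONTRACT\nReturn Markdown only. Apart from the permitted changes, "
        "the text must match the source verbatim.",
    ]
    return "\n\n".join(sections)


print(build_prompt())
```

Listing the permitted changes separately keeps the carve-outs explicit, so the model is not left to infer which departures from the page are acceptable.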

Results after the change

We re-ran the conversion with the strengthened prompt and repeated the same alignment-and-diff analysis. The fidelity instruction resolved three of the four genuine deviations:

Mae → Mac — fixed; the surname is now reproduced as printed.
theme → them — fixed; the source typo is now reproduced verbatim.
Mi’kmag’ki — fixed; the proper noun is left as printed.

One deviation persisted in this run: the grammar correction came → come. No new deviations were introduced, and the most serious problem — the introduced error in the surname — was eliminated. (As the next section shows, the residual tidying did not vanish entirely; it shifted from run to run.)

To test that this was reproducible rather than a lucky single run, we ran the strengthened-prompt conversion five times and analysed each against the source. The important result, the absence of the introduced surname error, held across all five runs, but the strengthened prompt reduced rather than eliminated the model’s tidying instinct: most runs still made one or two small normalisations, and they were not the same ones each time. Two runs corrected the grammatical error came → come; one of those also normalised the proper noun Mi’kmag’ki → Mi’kmaw’ki; another run preserved “came” but instead changed the typo provided → provide (“organizations to provided support services”); a fourth altered the spelling of another Mi’kmaq term (We’kopekwitk → We’rekopekwitk); and one run, encouragingly, was perfectly clean. In other words, the prompt made the model more reliably faithful, while a residue of zero to two small tidy-ups per run proved stubborn.
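The five-run picture can be tabulated with a short sketch. The deviations are the ones reported above; the order of the runs and the arrow notation are arbitrary choices for this illustration.

```python
from collections import Counter

# Deviations per run as reported in the text; run order here is arbitrary.
runs = [
    {"came→come", "Mi’kmag’ki→Mi’kmaw’ki"},
    {"came→come"},
    {"provided→provide"},
    {"We’kopekwitk→We’rekopekwitk"},
    set(),  # the perfectly clean run
]

per_run_counts = [len(run) for run in runs]
frequency = Counter(deviation for run in runs for deviation in run)
in_every_run = {d for d, n in frequency.items() if n == len(runs)}
clean_runs = sum(1 for run in runs if not run)

print("deviations per run:", per_run_counts)
print("most frequent:", frequency.most_common(1))
print("present in every run:", in_every_run)
print("clean runs:", clean_runs)
```

No deviation appears in every run, which is what distinguishes this shifting residue from a fixed, reproducible rewrite.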

Comparison with other conversion tools

A vision LLM is only one way to extract text from a PDF. To put its performance in context, we ran the same source through several other tools. It is worth being precise about what these are, because they are not all the same kind of system:

Gemini Flash is a general-purpose vision LLM prompted to transcribe.
Marker is a neural, image-based document-conversion pipeline built on the Surya family of deep-learning models. It renders each page as an image and applies specialised text-detection, recognition, layout, and reading-order models; it is not classical OCR.
Mistral OCR is an AI/LLM-powered multimodal document-understanding service that outputs Markdown and reasons about layout — again, explicitly distinct from classical OCR.
Equalify Reflow is a hybrid pipeline: IBM Docling performs a first-pass extraction using lightweight models and the PDF’s existing structure, after which a multimodal LLM (Claude) edits each page against its rendered image, with a final pass to stitch pages together. It is purpose-built for accessibility, targeting clean, reflowable Markdown rather than print fidelity.
PaddleOCR is, despite the “OCR” name, a vision-language model (PaddleOCR-VL 1.5): deep-learning text detection and recognition feeding a language-model decoder. It reads the rendered page like the other image-based tools, but the language-model decoder gives it the same tendency to “improve” the text that the LLM pipelines have.
MuPDF is not OCR at all: it extracts the embedded text layer directly from the PDF’s internal structure, the same source Acrobat’s export draws on. It performs no recognition and no interpretation.
ABBYY FineReader is the one traditional OCR baseline: a classical feature- and pattern-recognition engine with no language model, and therefore no tendency to “improve” the text it reads.

These tools fall into three families, and the distinction turns out to be the spine of the whole comparison. Text-layer extractors (MuPDF, and Acrobat’s export) read the PDF’s stored text directly: they are perfectly faithful to that text, but they inherit whatever faults it contains. Image OCR without a language model (classical ABBYY FineReader 14) reads the rendered page, so it escapes a broken text layer, but it can misrecognise individual characters and has no capacity (or temptation) to reword. Vision LLMs (Gemini, and the LLM-driven pipelines Marker, Mistral, and Reflow) also read the page and make very few character-level errors, but they bring an interpretive faculty that can quietly normalise or rewrite the source.

This spread is useful precisely because the tools sit at different points on a spectrum from “pure transcription” to “language-model interpretation.” A traditional OCR engine has no notion of what a sentence should say; a vision LLM does, and may act on it. The interesting question is whether that interpretive capacity helps (determining reading order, discarding artifacts, rejoining hyphenation, reading through a broken text layer) or hurts (silently rewriting the source).

What each tool did

Repeating the alignment-and-diff analysis against each conversion, and verifying every candidate difference by hand against the original PDF, the pattern was consistent with the fidelity story above — the more a tool “understands” text, the more it is tempted to normalise it.

In total we examined nine tools (including Acrobat itself, the tool we had assumed would supply the reference). Because the LLM-based pipelines can vary from run to run, we ran each of them at least five times: Gemini five times with the fidelity prompt, Marker seven times (balanced five times, plus one each in accurate and fast modes), Mistral five times, Reflow five times, and PaddleOCR-VL five times. The deterministic tools (the text-layer extractors and the non-language-model OCR engines) produce identical output every run and so were run once. The table below summarises the verified, genuine errors each produced, that is, the errors remaining after removing differences traced to the broken text layer, to hyphenation, and to paragraph-alignment artifacts. To make the counts exhaustive rather than spot-checked, every word-level candidate from every conversion was put through a consensus test: the word as read by the deterministic tools was compared against the word the conversion produced at the same anchored position. Where the faithful tools agreed and the conversion differed, the difference was counted as a genuine deviation; where the conversion in fact matched them, it was an alignment artifact and was discarded. Finally, every error was visually confirmed against the original PDF page. The type of error matters as much as the count.
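The consensus test described above reduces to a small decision rule. The sketch below is one possible reading of it: the function name and the returned labels are invented for illustration, and the third outcome (the deterministic tools disagreeing with each other) is an assumption about how such cases would be handled, since the text does not say.

```python
def consensus_verdict(faithful_readings, conversion_word):
    """Compare a conversion's word with the deterministic tools' readings
    at the same anchored position."""
    distinct = set(faithful_readings)
    if len(distinct) != 1:
        # The deterministic tools disagree (or none is available),
        # so only the printed page can settle it.
        return "no consensus: check the page"
    agreed = next(iter(distinct))
    if conversion_word == agreed:
        return "alignment artifact"
    return "genuine deviation"


print(consensus_verdict(["Portapique", "Portapique"], "Portuguese"))
print(consensus_verdict(["theme", "theme"], "theme"))
print(consensus_verdict(["COvID-19", "COVID-19"], "COVID-19"))
```

The last call shows why the visual check remains necessary: where a text-layer extractor and an image-based engine read the same word differently, agreement among tools cannot decide the question.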

Conversion	Family	Reads	Verified genuine errors	Character of the errors
Acrobat Pro (reference)	Text-layer extraction	Text layer	— (the reference)	Inherits text-layer faults: V/I/X glyph mis-mapping, hyphenation, ~6 word-level join/truncation faults
MuPDF	Text-layer extraction	Text layer	Same faults as Acrobat	Reproduces the text layer verbatim, including the COvID/SUv glyph errors
ABBYY FineReader 14	Traditional OCR	Page image	7 (English only) → 1 (with French added)	Accented/non-English characters dropped (Résumé→Resume); with French enabled, only a single non-French diacritic remained (Utøya→Utoya)
PaddleOCR-VL 1.5 (×5 runs)	Vision-language OCR model	Page image	8–12 per run	Character misreads and dropped digits (11:30→:30); silent smoothing (5–8)
Gemini Flash 3.5 — original prompt	Vision LLM	Page image	4	Silent normalisation (3) + one introduced error (Mae→Mac)
Gemini Flash 3.5 — fidelity prompt (5 runs)	Vision LLM	Page image	0–2 per run	Minor grammar/typo/proper-noun-spelling tidy-ups only (came→come, provided→provide, Mi’kmag’ki→Mi’kmaw’ki, We’kopekwitk→We’rekopekwitk); one run perfectly clean; no names, places, or facts altered in any run
Marker (balanced ×5, accurate, fast)	Neural CV pipeline	Page image	7 normalisations (every run) + intermittent character misreads	Seven normalisations reproduced in all seven runs (denturist→dentist, Michaella→Michaela, Grammies→Grammie’s, theme→them, came→come, that→than, provided→provide); plus character misreads on Mi’kmaq terms in 3 of 5 balanced runs, worst in fast mode
Mistral OCR (×5 runs)	LLM OCR service	Page image	hallucination in all 5 runs	Portapique→Portuguese in every run (5, 3, 3, 5, 3 times), Commr.→Comm. (×3) and theme→them identical across all runs
Equalify Reflow (×5 runs)	Hybrid (Docling + Claude)	Page image	theme→them + ~260 quote conversions every run; 0–3 different paraphrases per run	Consistent: theme→them and double→single quotes. Variable: synonym paraphrases differing each run (reading→hearing, wide-ranging→far-ranging, firehall→firehouse, comradery→camaraderie, –→frames, counters→counter) and a factual error (Highway 4→2) in run 1 only

A note on counting. When first run with English only, ABBYY appeared to produce more discrete errors than the fidelity-prompted Gemini runs — but those were almost all dropped accents caused by a language misconfiguration, and re-running with French enabled cut its genuine errors to a single character (see below).

Gemini Flash 3.5 (strengthened prompt, five runs). The most faithful on substance. Across five independent runs no names, places, or content words were altered. What remained was a small, shifting residue of one or two grammatical tidy-ups per run — came → come, provided → provide, and the spelling of Mi’kmaq proper nouns — with one of the five runs perfectly clean. The prompt made the model safe on meaning, but not perfectly literal.

Marker (seven runs: balanced ×5, accurate, fast). Faithful on structure but consistently willing to “correct.” The most striking result is the reproducibility of its normalisations: seven silent rewrites appeared in every one of the seven runs, across all three cost tiers — the real occupation denturist to dentist, the name Michaella to Michaela, Grammies to Grammie’s, and the source typos theme → them, came → come, that → than, and provided → provide. This determinism confirms the normalising behaviour originates in the pipeline’s language model, not its recognition stage, and that no cost tier removes it. The cost tiers differed only in a layer of character-level noise on top. That noise was run-variable rather than cleanly tied to mode: a cluster of misreads on the report’s Mi’kmaq-language terms (for example qame’kewaq → qame’kewaak) appeared in three of the five balanced runs and in accurate mode, but not in the others — a reminder that a single run can mislead either way. Fast mode was the noisiest of all, adding its own Mi’kmaq misreads, a British-to-American spelling change (decentring → decentering), and even a corrupted URL (commissiondespertesmassives → commissiondespertemassives). Paying more did not buy fidelity, because the rewrites seem to be a fixed property of the model; and paying less added character damage.

Mistral OCR (five runs). Rendered the place name Portapique — the community where the events of April 2020 began — as “Portuguese.” Crucially, five runs showed this is not a fluke: the hallucination appeared in every run, though the number of times varied (five, three, three, five, three). The abbreviation changes (Commr. → Comm., the report’s term for “Commissioner,” three times) and the theme → them normalisation were identical across all five runs.

Equalify Reflow (five runs). Showed a systematic stylistic infidelity rather than a scattering of word errors: it converted the source’s double quotation marks to single quotation marks throughout the document (roughly 260 conversions in every run), altering the punctuation of every quoted passage. It is worth being even-handed about the quotation-mark change, because how much it matters depends on the use case. For some, swapping the document’s double quotation marks for single ones is cosmetic and harmless. Where the output must stand as a verbatim record, it is still a departure from what was printed.

Running Reflow five times revealed a more serious pattern beneath the quotation marks: of all the tools, it was the most willing not merely to normalise but to paraphrase — and it did so differently on each run. Only two behaviours were consistent across all five runs: the theme → them fix and the quotation-mark conversion. Everything else changed. The first run introduced the factual error Highway 4 → Highway 2; the other runs read “Highway 4” correctly but introduced their own, entirely different substitutions. One run turned “Ms. MacLeod had been reading about the shootings” into “hearing about the shootings” and “wide-ranging consequences” into “far-ranging.” Another changed “the competition and the comradery” to “camaraderie,” “approached the firehall” to “firehouse,” and rewrote a sentence built around an em-dash — “This context – the harms caused…” — as “This context frames the harms…,” inventing a verb the source never used. Yet another run quietly singularised “counters” to “counter.” One of the five runs, by contrast, made no paraphrase at all beyond the standard theme → them. These are not OCR errors or typo fixes; they are the edits of something behaving like a writer rather than a transcriber.

PaddleOCR-VL 1.5 (five runs). Pairs deep-learning text recognition with a language-model decoder, and that decoder gives it the same tendency to “improve” text that the LLM pipelines have. Because it reads the rendered page image, it produced COVID and SUV correctly where the text-layer extractors did not. But it made two kinds of error. The first is visible glyph garble of the classic-OCR sort: Mi’kmaq-language terms such as ankuo’mkeweletuk and tlitpi’aq were misspelled, Cobequid became Cobeguid, and digits dropped out of figures (11:30 → :30, 2020 → 0). The second is the same silent smoothing the LLM tools make: it nudged words toward the more grammatical form (that to than, include to includes, provided to provide) and, changing the meaning rather than just the grammar, turned disserved into dissolved, even though the page reads the awkward original. These “corrections” look like helpful fixes but are errors against the source; only by checking the page do they show as wrong. The behaviour most likely comes not from reasoning but from the decoder simply favouring the statistically more probable word. Across five passes PaddleOCR-VL produced roughly eight to twelve genuine errors per run and was not deterministic: the digit-drops and some Mi’kmaq misreads appeared in some passes but not others.

MuPDF (text-layer extraction). Not OCR at all; it reads the PDF’s stored text directly. It was, predictably, the most literally faithful tool to the source text — reproducing theme, came, Mae, Highway 4, and Portapique verbatim — but for exactly that reason it also inherited the text layer’s faults, reproducing the same COvID and SUv glyph errors that the image-based tools corrected. Its output also carried more page furniture (running headers, page numbers) that a clean conversion would strip. MuPDF is the clearest illustration of the trade-off: a tool that changes nothing is perfectly faithful to a flawed source, errors included.

ABBYY FineReader 14 (traditional OCR). The control behaved exactly as a non-language-model engine should. It was the most faithful tool to the source’s wording — preserving theme, came, Mae, Highway 4, and Portapique verbatim, with no normalisation, no substitutions, and no hallucinations. Like PaddleOCR, it read through the broken text layer and recovered COVID and SUV correctly every time. Its only genuine errors were character-level misreads of a single, narrow kind: accented and non-English characters. It stripped or mis-read diacritics — Résumé → Resume, également → egalement, cliché → cliche, Métis → Metis — misrecognised the ç in français as a p (franpais), and flattened the Norwegian ø in Utøya to Utoya. Seven such errors across roughly 68,000 words.

These accent losses, it turned out, reflected the OCR engine’s language configuration rather than an inherent limitation. The document contains French and Mi’kmaq passages, and the first pass had been run with English only, so characters outside the expected alphabet were discarded. We re-ran FineReader with French added as a recognition language, and all six French/accented errors disappeared (Résumé, français, également, cliché, and both Métis now came through correctly), leaving a single residual error — Utøya → Utoya, the Norwegian ø, which French does not cover. With the right language settings, then, ABBYY’s genuine error count on this 68,000-word document fell from seven to a single mechanical misread of one non-French diacritic.
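Errors of this narrow kind can be separated mechanically from other misreads. The sketch below is a hedged illustration rather than the study's method: it flags an output word as a pure diacritic loss when removing accents from the source word reproduces it. The helper names and the small mapping for ø are assumptions; the mapping is needed because ø has no Unicode decomposition and so survives the usual normalise-and-strip approach.

```python
import unicodedata

# Letters such as ø do not decompose into base letter + combining mark,
# so they need an explicit mapping (extend as required).
NON_DECOMPOSING = str.maketrans({"ø": "o", "Ø": "O"})


def strip_diacritics(word):
    """Remove combining marks after canonical decomposition, then map ø to o."""
    decomposed = unicodedata.normalize("NFD", word)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.translate(NON_DECOMPOSING)


def is_diacritic_loss(source, output):
    """True when the output is the source with its diacritics removed."""
    return source != output and strip_diacritics(source) == output


cases = [
    ("Résumé", "Resume"),
    ("également", "egalement"),
    ("Métis", "Metis"),
    ("Utøya", "Utoya"),
    ("français", "franpais"),
]
for source, output in cases:
    print(source, "->", output, is_diacritic_loss(source, output))
```

The français → franpais case is deliberately not counted as a diacritic loss: there the engine substituted a different letter, which is a character misrecognition rather than a dropped accent.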

Synthesis

Taken together, the nine tools sort cleanly along the three families, and the errors they make differ not just in number but in kind.

Text-layer extractors (MuPDF, Acrobat) are perfectly faithful to the stored text but inherit its faults: they alone reproduced the glyph errors and mistakes in the text layer.

OCR tools (ABBYY FineReader 14) ignore the broken text layer and read the image. They don’t reword, preserving source typos verbatim. Their failures are character-level misrecognitions.

Vision LLMs (Gemini, and the LLM-driven pipelines Marker, Mistral, and Reflow) read the page accurately and make very few character errors — but they bring an interpretive faculty that the other tools lack, and with it the temptation to “improve” the text. Every LLM-based tool made the same theme → them normalisation. Several went further and introduced confident content errors. These errors are dangerous because they are invisible: they read as perfectly correct, and only a comparison against the source reveals them.

This is the heart of the matter because faithfulness is what counts. The encouraging finding is that the interpretive behaviour of Gemini is controllable: an explicit verbatim instruction eliminated most of the deviations, while keeping the model’s advantages of clean structure, correct reading of the page, and immunity to the broken text layer. The instruction did not make the model perfectly literal (the impulse to fix an obvious typo proved stubborn), but it made it safe on meaning. The vision LLM did not need to become less intelligent to become more faithful; it needed to be told that fidelity was the job.

Conclusion

This study initially stumbled on the easy assumption that a tagged PDF provides a trustworthy textual ground truth. In the 317-page report, the apparent 3.34% error rate largely measured defects in the PDF’s own text layer, not failures by the vision model. Once those faults were checked against the page itself, the central result was clear: page-image reading can be more faithful than the embedded text, and a well-prompted vision LLM can transcribe long-form prose with remarkably few genuine departures from the source.

The most important conclusion, then, is not that vision LLMs simply outperform every other tool. Traditional OCR and image-based OCR were also highly faithful, and in some respects more literal. On the original research question, the real lesson is clear: any LLM-based pipeline must be explicitly constrained not to edit what it reads. Used that way, a vision LLM offers something genuinely powerful: clean, structured, accessible output without inheriting the hidden corruption of a broken PDF, but only if fidelity is treated as a requirement, not an assumption.