PDF Conversions Put to the Test

Benchmarking how well today’s conversion services turn PDFs into accessible documents

DAISY Consortium – AI Special Interest Group | Presented at the DAISY Technical Meeting 2026, National Library of Norway, Oslo, 11–12 June 2026

The challenge we set out to address

Despite decades of progress in digital publishing, an enormous amount of the world’s information still reaches readers as PDF – and much of it is inaccessible. Structure is missing or broken, the reading order is unreliable, headings and lists are not marked up, and images carry no alternative text. For people who read with a screen reader or who rely on reflowed, restructured text, a PDF is often a locked door.

A new generation of conversion tools promises to open that door. AI-powered services that understand a page – not merely extract its text – now compete alongside established OCR engines and PDF libraries. The DAISY AI Special Interest Group set out to answer a practical question on behalf of the accessibility community:

How well do today’s PDF conversion services actually perform against the needs of accessible publishing, and where do they fall short?

To answer it fairly we benchmarked seven conversion services – spanning AI APIs, neural pipelines, OCR engines and a traditional PDF library – against a community-developed quality framework, using a shared set of test documents and a consensus scoring process.

The working group

This was a collaborative effort across DAISY member organisations, bringing together lived experience and expertise in production, accessibility and conversion technology:

Ashoka Bandula, DAISY Lanka Foundation
Bart Donders, Dedicon
Basile Mignonneau, Association Valentin Haüy (AVH)
Cherag Mobedji, Bookshare
Hyun-Young Kim, Boin IT
Laura Brady, NNELS
Nicolas Pavie, Association Valentin Haüy (AVH)
Prashant Verma, DAISY Consortium
Rafael Martins, Fundação Dorina
Richard Orme, DAISY Consortium
Terhi Manninen, Finnish Federation of the Visually Impaired

How we worked

The group followed a simple four-step method, designed so that the results would be transparent and reproducible.

Step 1 – Choose the criteria

The DAISY Consortium PDF Conversion Quality Framework, developed by the AI Special Interest Group, defines 30 quality criteria. For this comparison the group selected 12 that matter most for accessible reading, grouped into five themes:

Text fidelity: accurate extraction of Latin text from both searchable (digital) and scanned PDFs, including accented and special characters.
Structure and layout: logical reading order, removal of artefacts such as running headers and footers, and correct recognition of character styles (bold, italic, underline).
Headings and lists: heading levels properly identified and maintained; ordered, unordered and nested lists correctly marked up.
Complex content: tables converted with their structure intact, mathematical expressions converted to LaTeX or MathML, and footnotes and endnotes handled appropriately.
Images: images extracted cleanly without distortion or inappropriate cropping, and any existing alternative text preserved.
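
To illustrate what the mathematics criterion asks for, compare a flattened extraction with a structured encoding. The expression is a standard calculus definition chosen for illustration, not an excerpt from the test documents. A flattened extraction might read "f′(x) = lim h→0 f(x+h) − f(x) h", which loses the fraction and the position of the limit, whereas a LaTeX encoding keeps that structure:

```latex
f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}
```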

Step 2 – Choose the test documents

Four PDF documents were selected so that each criterion is exercised by material that genuinely stresses it:

Dog information – A PDF generated from a Word document. Used to evaluate conversion of Latin text from a searchable PDF, heading hierarchy, list structures, and alt-text retention.
Biodiversity register – A low-resolution scanned PDF of an Indian textbook, with watermarks on each page. Used to evaluate conversion of Latin text, artefact removal, character styles, and image integrity.
Calculus excerpt – An excerpt from an OpenStax publication. Used to evaluate equation conversion.
Uncovering the New Accessibility Crisis in Scholarly PDFs – A journal article. Used to evaluate the reading order, note formatting, and table conversion criteria.

Step 3 – Choose the conversion services

Seven services were put under the microscope, chosen to represent the spread of approaches now available, from a traditional PDF library to the latest multimodal AI models: MuPDF 1.26.12, Adobe Acrobat 26.001.21529, Mistral AI (mistral-ocr-2512), Marker, PaddleOCR-VL-1.5, Anthropic Claude Opus 4.7/4.8, and Google Gemini 3.1 Pro (with some testing with Gemini Flash 3.5).

Step 4 – Review and agree the results

Each criterion was assessed on a three-point scale – not met, partially met, or met – and the results were compiled into a shared spreadsheet. Most results were straightforward to determine, but on judgement calls where assessors disagreed, the difference triggered a discussion to reach consensus. Where a service failed a criterion, the group investigated whether the fault lay with the service itself or with a downstream step in the conversion, so that the recorded score reflected the tool’s true potential.
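
The review step can be sketched in a few lines of Python. This is a minimal illustration, not the group's actual tooling: the rating labels come from the three-point scale described above, while the function name `criteria_needing_discussion`, the data layout and the sample assessor names are assumptions made for the example.

```python
from typing import Dict, List

# The three ratings used in the benchmark's scale.
RATINGS = ("not met", "partially met", "met")


def criteria_needing_discussion(
    ratings: Dict[str, Dict[str, str]]
) -> List[str]:
    """Return the criteria on which assessors gave differing ratings.

    `ratings` maps criterion -> {assessor -> rating}. This layout is
    hypothetical; the group recorded its results in a shared spreadsheet.
    """
    flagged = []
    for criterion, by_assessor in ratings.items():
        for rating in by_assessor.values():
            if rating not in RATINGS:
                raise ValueError(f"unknown rating: {rating!r}")
        # More than one distinct rating means the assessors disagreed.
        if len(set(by_assessor.values())) > 1:
            flagged.append(criterion)
    return flagged


# Illustrative data only, not results from the benchmark.
sample = {
    "Reading order": {"assessor_a": "met", "assessor_b": "met"},
    "Nested lists": {"assessor_a": "met", "assessor_b": "partially met"},
}
print(criteria_needing_discussion(sample))
```

In this sketch only the criteria returned by the function would go forward for discussion; unanimous ratings are recorded as they stand.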

Results

Scoring each service as a weighted percentage across the 12 criteria produced a clear ranking. Three services stood out at the top, a free option performed creditably in the middle, and a traditional text-extraction library trailed the field.

Conversion service	Score	Indicative cost	In a nutshell
Marker	92%	$4 / 1,000 pages (fast & balanced); $6 (accurate)	Best of the cohort; passed every criterion except for retaining source alt text.
Anthropic Claude Opus 4.7/4.8	92%	Usage-based	Joint highest; full marks except source alt text; output can be steered by prompt.
Google Gemini 3.1 Pro / Flash 3.5	92%	$6.82 / 1,000 pages	Joint highest; flexible but more fragile and slower than dedicated services.
PaddleOCR-VL-1.5	67%	$0 (hosted, 20,000 pages/day limit)	Strong free option; loses character styles and nested lists; images were sometimes missing in our trial runs.
Adobe Acrobat 26.001	58%	$19.99 / month subscription	Only tool to keep source alt text; follows visual layout, weak on scanned PDFs.
Mistral OCR (mistral-ocr-2512)	58%	$1 / 1,000 pages	Very fast; drops footnotes and character styles; poor at removing artefacts such as running footers; headings were sometimes flattened.
MuPDF 1.26.12	25%	$0	Text extraction only; no OCR for scanned PDFs; not for accessible conversion.

Scores are weighted averages across the 12 criteria (maximum 100%). Costs are indicative and were correct at the time of testing.
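
The arithmetic behind a weighted percentage can be sketched as follows. The numeric values assigned to the three ratings (0, 0.5 and 1) and the default of equal weights are assumptions made for this example; the report does not state the values or weights actually used. Under these assumptions, a service that meets 11 of the 12 criteria scores about 92%, which is consistent with the top rows of the table.

```python
from typing import Dict, Optional

# Assumed numeric values for the three-point scale (not stated in the report).
RATING_VALUES = {"not met": 0.0, "partially met": 0.5, "met": 1.0}


def weighted_score(
    ratings: Dict[str, str],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Return a weighted percentage (0-100) for one service.

    `ratings` maps criterion -> rating; `weights` maps criterion -> weight.
    Criteria without an explicit weight default to 1.0 (equal weighting).
    """
    weights = weights or {}
    total_weight = 0.0
    earned = 0.0
    for criterion, rating in ratings.items():
        weight = weights.get(criterion, 1.0)
        total_weight += weight
        earned += weight * RATING_VALUES[rating]
    if total_weight == 0:
        raise ValueError("at least one criterion must have a non-zero weight")
    return 100.0 * earned / total_weight


# Hypothetical service: meets 11 of 12 criteria, fails alt-text retention.
example = {f"criterion_{i}": "met" for i in range(1, 12)}
example["alt_text_retention"] = "not met"
print(round(weighted_score(example)))
```

Passing a `weights` dictionary changes the relative importance of individual criteria, in the same spirit as adjusting the weighting of each criterion in the interactive results page.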

What the numbers tell us

AI has raised the bar. Marker, Claude and Gemini each scored 92%, handling text, reading order, headings, lists, character styles, footnotes, equations, tables and images well on both digital and scanned PDFs.
Scanned PDFs separate the field. Tools that read the rendered page (the AI services and OCR engines) coped with image-only PDFs; MuPDF, which only extracts an existing text layer, could not, and scored lowest at 25%.
Alternative text is the universal Achilles heel. Only Adobe Acrobat preserved the source PDF’s alt text. The AI services regenerate their own descriptions instead of carrying the author’s across – useful, but not the same thing.
Free can be good enough. PaddleOCR reached 67% at no cost, making it a credible option where budget is the deciding factor. Note that Marker can also be self-hosted.
Flexibility comes with fragility. The general-purpose LLMs can be instructed to produce bespoke output, but proved slower and less predictable than dedicated services, occasionally returning recitation or prohibited-content errors.

The full criterion-by-criterion results are available as an interactive web page where readers can adjust the weighting of each criterion, show or hide individual services, sort the results and read the assessors’ comment behind every score.

A deeper question: do AI tools editorialise?

The benchmark measured how well each service captures structure and content. It raised a subtler concern that we pursued in a separate study, described in full in the accompanying article “Can a Vision LLM Faithfully Transcribe a 317-Page PDF?”. A conversion is only useful if it is faithful: a tool that quietly fixes a typo, swaps a word for a synonym or “corrects” a name is no longer transcribing – it is editing. For converting PDFs for accessibility purposes that is a defect, not a feature.

We converted a demanding 317-page report – Turning the Tide Together, with footnotes, tables, proper nouns and passages in English, French and Anishinaabe – and compared the output of nine tools against the page itself, running each non-deterministic tool five times and verifying every variation by hand. From measuring differences across 2 million words, two findings stand out.

First, the reference can be more broken than the tools. The PDF’s own embedded text layer carried systematic faults – hyphenation splits, a glyph fault that lowercased capital V, I and X (so “COVID” became “COvID”), words run together, and typos that were not present in the visually rendered layer. Tools that read the text layer reproduced these errors, whereas vision-based tools that read the page reproduced the text as displayed in the PDF.

Second, every LLM-based tool editorialised. The tendency to “improve” the text showed up in all of them, and verifying against the source was the only reliable way to catch it:

Gemini (original prompt) silently corrected typos and grammar and, in one case, changed a surname from “Mae” to “Mac”. Before we adjusted the prompt there were four edits across 70,000 words.
Marker made seven identical silent rewrites in every run – for example “denturist” to “dentist” and “Michaella” to “Michaela” – across all cost tiers.
Mistral OCR hallucinated the place name “Portapique” as “Portuguese” in all five runs.
Equalify Reflow (Docling + Claude) changed roughly 260 double quotation marks to single ones in every run, and paraphrased the text differently each time (for instance “reading” to “hearing”).
PaddleOCR, ABBYY FineReader, MuPDF and Acrobat – the tools without a language model – never reworded the text. Their errors were character misreads.
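
One way to surface silent rewrites of this kind is a word-level comparison of each run against a verified reference. The sketch below uses Python's standard `difflib` module; the function names `word_changes` and `changes_in_every_run` are illustrative, and the sample sentences are invented around substitutions reported above. It is not the method used in the study, where every variation was verified by hand.

```python
import difflib
from typing import List, Set, Tuple

Change = Tuple[str, str]  # (reference wording, tool wording)


def word_changes(reference: str, output: str) -> List[Change]:
    """List word-level differences between a reference and a tool's output."""
    ref_words = reference.split()
    out_words = output.split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=out_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Deletions and insertions appear with an empty string on one side.
            changes.append(
                (" ".join(ref_words[i1:i2]), " ".join(out_words[j1:j2]))
            )
    return changes


def changes_in_every_run(reference: str, runs: List[str]) -> Set[Change]:
    """Return the changes that recur in all runs (systematic, not random)."""
    per_run = [set(word_changes(reference, run)) for run in runs]
    return set.intersection(*per_run) if per_run else set()


# Invented sentences built around substitutions reported in the study.
reference = "The denturist met Michaella before reading the report"
runs = [
    "The dentist met Michaela before reading the report",
    "The dentist met Michaela before hearing the report",
]
print(word_changes(reference, runs[0]))
print(sorted(changes_in_every_run(reference, runs)))
```

Separating changes that recur in every run from those that vary between runs mirrors the distinction drawn above between identical rewrites and paraphrases that differ each time. Because the sketch compares wording only, a flagged difference still needs checking against the rendered page before it is counted as an edit.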

Encouragingly, the behaviour is rare and controllable. Adding an explicit instruction to transcribe the source verbatim removed Gemini’s serious deviations: across five runs there were zero to one grammatical tidy-ups. Fidelity, it turns out, is mostly a prompt-level decision rather than an inherent limitation – the model does not need to be less capable to be more faithful; it needs to be told more firmly that the author’s words are what matter.

Conclusions

The benchmark answers our question with a clear verdict: today’s AI-powered conversion services have decisively raised the bar for turning PDFs into accessible documents. Four advantages stand out, each borne out by the results:

Recognition of difficult source material. The leading tools used vision language models to accurately recover text from both searchable PDFs and low-resolution, watermarked textbook scans.
Layout analysis and artefact removal. That same page-level understanding lets these tools reconstruct a logical reading order and discard the furniture that often clutters a PDF – running headers, footers and watermarks – delivering the content in the order it should be read.
Encoding of mathematical content. On the calculus excerpt, the strongest services converted equations into LaTeX or MathML rather than flattening them into unreadable text or converting them into images.
Semantic document understanding. Beyond the words, the best tools identified and marked up the structure that makes a document navigable and ready for format conversion: heading levels, ordered, unordered and nested lists, tables with their cell relationships intact, character styles, and footnotes.

Taken together, these capabilities make the production of accessible formats markedly more efficient and the end result higher quality: work that once demanded extensive manual restructuring can increasingly be carried by the conversion step itself.

Not all solutions are created equal – but used with an awareness of both their strengths and weaknesses, AI-powered conversion is now a genuinely powerful route to accessible documents.

Next steps

We will extend the testing to cover non-Latin scripts.
We will scale up the mathematics test to cover a wider range of equations.
We will bring more quality criteria into the evaluation from the full 30-criterion framework.
We will add results for further conversion solutions, such as Reflow, Tensorlake, Doctly, and MAI-Image paired with MAI-Thinking.
We will explore automating parts of the testing so that new services and model versions can be evaluated quickly and consistently.

Further resources

Interactive results explorer – the full, adjustable results table with assessor comments
Companion study – “Can a Vision LLM Faithfully Transcribe Two Million Words? A close look at conversion accuracy across nine tools and the hidden problem of silent normalization.”
DAISY Technical Meeting 2026, National Library of Norway, Oslo, 11–12 June 2026