Improving Image Descriptions Through Classification

When an accessibility tool generates alt text, the temptation is to send the image to a vision model with a single, generic prompt and hope for the best. That works, up to a point. A photograph of a beach and a heatmap of gene expression are very different artefacts, and asking for “concise alt text” gives you two answers of roughly the same shape, when what you actually want is two very different kinds of description.

Fido ships with decent prompts for generating alt text and extended descriptions. For even better image descriptions, the user can enable the image classification feature: before writing any description, Fido classifies the image against a structured taxonomy and then chooses a description prompt tuned to that class. This article explains the classification strategy we landed on after some experimentation, the taxonomy we use, and where we think there is still room to improve.

Why classification, and why not just one big prompt?

You could put the whole collection of image types into a system prompt and ask the model to describe the image and pick a class in one go. That is what Fido’s first‑generation image classification strategy did, but it has problems:

The single prompt grows long and expensive, and the model often optimises for the description and treats the class as an afterthought.
Long prompts with many category names invite hallucinated, “close enough” labels.
There is no way to recover when the model is uncertain — you get one answer and have to live with it.

We chose to separate concerns: classify first, then describe. The classification step is small and cheap; the description step is then a focused, class‑specific prompt that does one job well.

The classification approach

Walk the tree, one level at a time

The classifier descends the taxonomy level by level. At the root, it considers the top‑level classes. If the model picks, say, “Chart”, the next call considers only Chart’s children (for example: Area chart, Bar chart, Donut chart, Heatmap, Histogram, Line chart, Pie chart, etc.). We keep descending until either there are no further sub‑classes or the model declines to pick a more specific one. The final result is a path like `Chart / Bar chart`.

Walking the tree keeps each prompt short, keeps the choices distinct, and lets each level focus on the discriminations that matter at that level. It also gives us a graceful way to stop early: a chart that doesn’t clearly fit any of Chart’s sub‑classes can still be classified as “Chart” without forcing the model to invent a sub‑type.
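
The descent can be sketched in a few lines of Python. This is a minimal illustration, not Fido's actual code: the nested-dict `TAXONOMY`, the `walk_taxonomy` function, and the `pick` callable (which stands in for one model call per level) are all hypothetical names.

```python
from typing import Callable, Optional

NONE_OF_THE_ABOVE = "NONE_OF_THE_ABOVE"

# Hypothetical taxonomy fragment: each key is a class, each value its sub-classes.
TAXONOMY = {
    "Chart": {"Bar chart": {}, "Pie chart": {}, "Heatmap": {}},
    "Photograph": {},
}


def walk_taxonomy(
    taxonomy: dict,
    pick: Callable[[list[str]], Optional[str]],
) -> list[str]:
    """Descend one level at a time; `pick` stands in for one model call."""
    path: list[str] = []
    level = taxonomy
    while level:  # an empty dict means there are no further sub-classes
        choice = pick(sorted(level))
        if choice is None or choice == NONE_OF_THE_ABOVE or choice not in level:
            break  # the model declined to be more specific: keep the path so far
        path.append(choice)
        level = level[choice]
    return path
```

Joining the returned list with `" / ".join(path)` gives the path notation used above. An unknown or declined choice ends the walk but keeps the levels already resolved, which is the graceful early stop described in the text.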

Lowering the Temperature

Every classification call is made with `temperature=0`. Image description benefits from a little creative variation, but classification does not — we want the same image to land in the same bucket on every run. Temperature 0 also makes the system easier to evaluate, because outputs are stable enough to diff between prompt revisions.

For models that don’t honour a temperature parameter (some newer reasoning models drop or ignore it), the call falls back gracefully rather than failing.
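
One way to get that fallback is to retry the call without the parameter when the model rejects it. The sketch below assumes a hypothetical `send` callable and a hypothetical `UnsupportedParameterError`; real client libraries signal a rejected parameter in their own ways, so treat both names as placeholders.

```python
from typing import Any, Callable


class UnsupportedParameterError(Exception):
    """Hypothetical error raised when a model rejects a request parameter."""


def classify_call(send: Callable[..., Any], **request: Any) -> Any:
    """Try a deterministic call first; retry without temperature if rejected."""
    try:
        return send(temperature=0, **request)
    except UnsupportedParameterError:
        # Assumption: the model only objects to the parameter itself,
        # so the same request without it is expected to succeed.
        return send(**request)
```

Only the parameter-specific error is caught here, so unrelated failures still surface to the caller.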

A one‑sentence description, generated for free

One approach we adopted in our prompt is asking the model to do two things in a single response:

Write one sentence describing what it sees in the image.
On the next line, output a single JSON object with its classification.

Example:

A circular graphic divided into coloured segments with percentage labels.

{"category": "Chart", "alternative": "Diagram", "confidence": "medium"}

That sentence pays off in two ways:

It anchors the model in what it is actually looking at. Asking for the JSON alone invites snap judgements; asking for a description first nudges the model into observing before classifying. This is a small applied dose of chain‑of‑thought, without the cost or noise of a full reasoning trace.
We can reuse the sentence if we need a second pass.

Both parts are read from the same model output: the description is the text before the JSON object, and the JSON supplies `category`, `alternative`, and `confidence`.
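
A minimal sketch of that parsing step follows. The `Reply` dataclass and `parse_reply` function are illustrative names, and treating a missing `confidence` as `"low"` is an assumption of this sketch rather than documented Fido behaviour.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reply:
    description: str
    category: str
    alternative: Optional[str]
    confidence: str


def parse_reply(text: str) -> Reply:
    """Split model output into the leading description and the JSON object."""
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object in model output")
    # raw_decode parses one JSON value and ignores any trailing text.
    obj, _ = json.JSONDecoder().raw_decode(text[start:])
    return Reply(
        description=text[:start].strip(),
        category=obj["category"],
        alternative=obj.get("alternative"),
        # Assumption: a missing confidence is treated as "low".
        confidence=str(obj.get("confidence", "low")).lower(),
    )
```

This simple version assumes the description sentence itself contains no `{` character; a more defensive parser would need to handle that case.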

Confidence ratings, and a better confidence signal

The JSON we ask for has three fields:

category: the name of the chosen class, or the literal string `NONE_OF_THE_ABOVE`.
alternative: the second‑closest class name, or `null` if the model had no hesitation.
confidence: `"high"`, `"medium"`, or `"low"`.

Self‑reported confidence is a useful but imperfect signal. Models are often overconfident, and a flat `"high"` doesn’t tell you much when it is the default answer. The `alternative` field, on the other hand, is a more honest indicator: if the model voluntarily names a runner‑up, you know there was meaningful ambiguity, even if it claimed high confidence. We treat the presence of an alternative as a second‑pass trigger in its own right, independent of the confidence value.

One pass when the model is sure, two when it is not

Putting that together, the decision logic at each level is:

First pass. Send the image and the list of candidate classes at the current level. Ask for a description sentence followed by the JSON object.
Accept immediately if all three conditions hold: confidence is `"high"`, `alternative` is null, and `category` is a real listed name (not `NONE_OF_THE_ABOVE`).
Otherwise, run a second pass. This is where the description sentence earns its keep. We feed the model’s own first‑line description back to it as context:

A prior analysis described this image as:

"A circular graphic divided into coloured segments with percentage labels."

Use this description alongside the image to choose the most appropriate category.

The rules in the second pass are the same as in the first, but the model now has a verbalised summary of what it saw to argue with. It can confirm its own description and then make a more confident choice, or correct the description (“This is actually a ring chart with five segments, not a generic circular graphic.”) and then pick a different category.
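
The per-level decision can be sketched as follows. The `ask` callable is a hypothetical stand-in for one model call: it takes optional extra context and returns a dict with `description`, `category`, `alternative`, and `confidence`. The function names are illustrative, and the sketch leaves out any parent fallback or reuse of the first-pass answer.

```python
from typing import Callable, Optional, Sequence


def is_clear(category: str, alternative: Optional[str],
             confidence: str, candidates: Sequence[str]) -> bool:
    """Accept the first pass only if all three conditions hold."""
    # A category outside the candidate list (including NONE_OF_THE_ABOVE)
    # never counts as clear.
    return confidence == "high" and alternative is None and category in candidates


def second_pass_context(description: str) -> str:
    """Build the extra paragraph that feeds the first-pass sentence back."""
    return (
        "A prior analysis described this image as:\n\n"
        f'"{description}"\n\n'
        "Use this description alongside the image to choose "
        "the most appropriate category."
    )


def classify_level(candidates: Sequence[str],
                   ask: Callable[[Optional[str]], dict]) -> dict:
    """One pass when the model is sure, two when it is not."""
    first = ask(None)
    if is_clear(first["category"], first["alternative"],
                first["confidence"], candidates):
        return first
    return ask(second_pass_context(first["description"]))
```

Note that a named `alternative` triggers the second pass even when the model reports high confidence, matching the rule that the runner-up is treated as an uncertainty signal in its own right.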

Why this combination works

Each piece pulls its weight:

Setting the temperature to 0 keeps the answers stable.
Tree‑structured prompts keep each call focused on a small, well‑defined set of choices.
A one‑sentence pre‑description gives the model a moment to look before it leaps, and gives us a free piece of context for the second pass.
Confidence plus alternative gives us two complementary uncertainty signals; we trust the more honest one (`alternative`) at least as much as the self‑rated one.
The second pass with prior description is the cheapest possible re‑think — same image, same candidates, just one extra paragraph of context — and it converts low‑confidence guesses into well‑grounded answers.
Parent fallbacks and guarded reuse of a clear first pass keep the classifier from throwing away a good partial path when the model stumbles on a deeper level or when confirmation adds noise instead of clarity.

Fido’s in-box taxonomy

Whilst Fido enables users to create an image taxonomy to suit their own publication types or preferences, it also comes with a built-in set of classifications and prompts. These can be used as-is or adapted by the user.

The image class hierarchy that ships with Fido began with the list of 28 image types described in the DIAGRAM Image Description Guidelines. These were then extended by adding the Scholarly Image Taxonomy developed by the STM Association Alt-Text Accessibility Task & Finish Group. The STM taxonomy is a SKOS vocabulary that organises image types into branches like Chart, Plot, Map, Microscopy, Medical imagery, Musical notation, Relational diagram, Technical illustration, and so on. We pulled that vocabulary directly from STM’s public API into a working JSON file to add it to Fido’s taxonomy.

STM’s scope is scholarly publishing, but the range of images we need to describe in publications is broader. So, beyond the STM base, we have extended the taxonomy with new branches and sub‑classes covering this broader publishing scope. To do this, we identified a range of digital publication repositories covering classic literature, open education resources, and reference materials and, using Claude’s Opus 4.7 model, set about identifying other image types that would benefit from tailored prompts.

The current taxonomy contains 25 top‑level branches and 468 image types in total. Each image type can have its own description prompt — for example, the Bar chart prompt asks the model to identify the chart type, state the title and axis labels, and summarise the trend. That is very different from the Photograph prompt, which asks for subject, setting, action, and notable details. The whole point of classifying first is to unlock these tuned prompts.
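
As an illustration of how a classification path might select a prompt, the sketch below keys prompts by path and falls back to the nearest ancestor that has one. The fallback rule, the `PROMPTS` table, the `DEFAULT_PROMPT` text, and the `Portrait` sub-class in the usage note are assumptions made for this example; the two prompt texts paraphrase the ones described above.

```python
from typing import Sequence

# Hypothetical prompt table keyed by classification path.
PROMPTS = {
    ("Chart", "Bar chart"): (
        "Identify the chart type, state the title and axis labels, "
        "and summarise the trend."
    ),
    ("Photograph",): "Describe the subject, setting, action, and notable details.",
}

# Hypothetical generic prompt used when no class-specific prompt applies.
DEFAULT_PROMPT = "Write concise alt text for this image."


def prompt_for(path: Sequence[str]) -> str:
    """Return the most specific prompt available for a classification path."""
    key = tuple(path)
    while key:
        if key in PROMPTS:
            return PROMPTS[key]
        key = key[:-1]  # assumption: fall back to the parent class
    return DEFAULT_PROMPT
```

With this table, `prompt_for(["Chart", "Bar chart"])` returns the bar chart prompt, while a path such as `["Photograph", "Portrait"]` with no prompt of its own would use the Photograph prompt.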

Next Steps: Moving from Heuristics to Validation

Our initial testing suggests that the current classification implementation is robust on the individual cases we have checked, but we are now moving into a more rigorous validation phase. Our goal is to confirm that Fido’s logic holds up under high-volume, real-world variety.

1. Large-Scale Benchmarking

We have compiled a diverse evaluation set of 500 images and are currently performing a manual classification of this set to create a “ground truth” baseline. We will then run these images through Fido to measure how often the automated path matches the human expert path.
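
One simple way to score that comparison is to report both exact path matches and top-level matches, so that a near miss deep in the tree is distinguishable from a wrong branch. The function below is a sketch under that assumption; the choice of these two metrics and the sample paths in the usage note are ours for illustration, not results from the evaluation.

```python
from typing import Sequence


def path_agreement(
    predicted: Sequence[Sequence[str]],
    truth: Sequence[Sequence[str]],
) -> tuple[float, float]:
    """Return (exact path match rate, top-level match rate)."""
    if len(predicted) != len(truth) or not truth:
        raise ValueError("need two non-empty lists of equal length")
    pairs = [(list(p), list(t)) for p, t in zip(predicted, truth)]
    exact = sum(p == t for p, t in pairs)
    # Comparing the first element only checks that the branch is right.
    top = sum(p[:1] == t[:1] for p, t in pairs)
    return exact / len(pairs), top / len(pairs)
```

For example, if Fido returns `["Chart"]` where the human label is `["Chart", "Pie chart"]`, that image counts towards the top-level rate but not the exact rate.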

2. Taxonomy Optimization

The current taxonomy is comprehensive, boasting 468 types. However, we are investigating the “law of diminishing returns.” By analysing the results of hundreds of descriptions, we want to answer two critical questions:

Coverage: Are there frequent “None of the Above” results that point toward missing categories?
Efficiency: Could a taxonomy one-tenth the size produce descriptions of comparable quality?
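
The coverage question can be approached by counting how often each level of the tree ends in `NONE_OF_THE_ABOVE`. The sketch below assumes a hypothetical log of `(parent, category)` pairs, one per classification call; Fido's actual logging format may differ.

```python
from collections import Counter
from typing import Iterable

NONE_OF_THE_ABOVE = "NONE_OF_THE_ABOVE"


def none_rates(log: Iterable[tuple[str, str]]) -> dict[str, float]:
    """Share of calls under each parent class that ended in NONE_OF_THE_ABOVE."""
    totals: Counter = Counter()
    misses: Counter = Counter()
    for parent, category in log:
        totals[parent] += 1
        if category == NONE_OF_THE_ABOVE:
            misses[parent] += 1
    return {parent: misses[parent] / totals[parent] for parent in totals}
```

A parent class with a persistently high rate is a candidate for new sub-classes, while one that never produces a miss may be a place where the taxonomy could be trimmed.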

3. Tuning for Excellence

Ultimately, classification is a means to an end: excellent image descriptions. We will be reviewing the final alt-text outputs of our 500-image test to see if specific sub-classes actually yield better accessibility outcomes, or if broader categories provide enough context for the model to succeed.

In summary

Fido’s image classifier has been informative to build. We hope this write-up is useful to anyone else trying to classify images with current LLMs.