Exploring Artificial Intelligence: Image Descriptions
Artificial Intelligence is currently a hot topic, appearing regularly in the news and often with the promise of changing the world… one way or the other.
We are going to examine a few aspects of AI across several articles, but first a disclaimer: this is a fast-moving topic under active and competitive development, so the results we see in May 2024 are likely to look very different within a short period of time.
In this article we’re going to explore how AI can support image descriptions, one of the more challenging areas of accessibility for publishers working to update their backlists of previously released titles.
Many AI tools offer the ability to analyse an image and provide an automatically generated description, with this functionality increasingly appearing in mainstream editing and design environments. But is the descriptive content they generate of any use?
We have performed a series of tests with a variety of different images on a wide range of tools, but for this article we used the following image, which has a human-generated description of:
“Image shows a large brick wall spray painted blue, with a large green and pink fish approaching a bright yellow and black fishing lure with its mouth open. Three other fish are shown swimming together at the bottom of the image.”
This image was selected because of its complexity, with the potential for a variety of different interpretations. Testing with simpler images can certainly deliver more reliable results, but this doesn’t necessarily provide an accurate representation of the capabilities and limitations of the AI tools.
When we presented this image to a number of AI tools the responses varied significantly, and one of the results was simply a work of fiction:
“A beautiful sunset over a calm ocean with vibrant orange and pink hues reflecting on the water.”
This is an example of what is known as AI hallucination, where content appears to simply be made up. In current AI tools, and particularly the free services, it is relatively common for inaccurate content to be generated rather than the tool admitting that it isn’t sure how to describe something. Because all the generated descriptions are presented with the same level of confidence, there is no automated way to flag the images that require additional checks. This means that there is a significant risk of hallucinations slipping through in batch-processed images.
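One practical consequence is that a batch workflow cannot rely on the tool to say which descriptions are trustworthy, so every machine-generated description needs a human check. The following is a minimal sketch of that idea in Python, not a reference to any particular product. The `ImageDescription` record, the `needs_review` function, and the `fig-01` style identifiers are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ImageDescription:
    """One description in a batch (hypothetical record layout)."""
    image_id: str
    text: str
    machine_generated: bool
    human_reviewed: bool = False


def needs_review(descriptions):
    """Return every machine-generated description not yet checked by a person.

    No confidence score is consulted, because the tools present every
    description with the same level of confidence.
    """
    return [
        d for d in descriptions
        if d.machine_generated and not d.human_reviewed
    ]


# Illustrative batch; the identifiers are made up for this sketch.
batch = [
    ImageDescription("fig-01", "A mural of a fish and fish on a hook.", True),
    ImageDescription("fig-02", "Image shows a large brick wall spray painted blue.", False),
    ImageDescription("fig-03", "A vibrant street art mural on a brick wall.", True, human_reviewed=True),
]

for d in needs_review(batch):
    print(d.image_id, d.text)
```

The point of the design is that the review queue is driven by where a description came from and whether a person has checked it, never by how confident the generated text sounds.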
Many of the AI tools were able to identify elements of the image but introduced inaccuracies in the detail:
“A mural of a fish and fish on a hook.”
But some were able to provide a fairly comprehensive level of information:
“A vibrant street art mural on a brick wall depicting a green fish with red eyes, leaping towards a bee on a hook, with blue background and small fish below.”
Once again there are a few inaccuracies (the fish isn’t leaping and there isn’t a bee), but the description is going in the right direction.
However, accurately describing the content of an image is only part of the challenge. In many instances it’s important to understand the context of the image in order to provide a meaningful description. If our sample image had come from a book chapter about street art it might prove more relevant to describe the painting method, the gradients of background color representing deep water by darker blues, or the energetic depiction of the fish.
In an effort to provide better descriptions, AI tools are being developed that ingest both the image and the associated page content for analysis, to understand the relevance of the image within the wider publication and provide a contextually relevant description.
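To make the idea of combining an image with its page content more concrete, here is a minimal sketch in Python of how the surrounding text might be folded into the request sent alongside an image. It does not reflect how any specific tool works. The function name `build_description_prompt`, its parameters, the prompt wording, and the sample chapter title are all assumptions made for illustration, and the step that actually sends the image and prompt to a service is deliberately left out.

```python
def build_description_prompt(chapter_title: str,
                             surrounding_text: str,
                             max_context_chars: int = 500) -> str:
    """Combine page context with a request for an image description.

    Hypothetical helper: the wording and the length limit are
    illustrative choices, not taken from any real service.
    """
    # Collapse runs of whitespace so extracted page text reads cleanly.
    context = " ".join(surrounding_text.split())
    # Keep the context short so the request stays focused on the image.
    if len(context) > max_context_chars:
        context = context[:max_context_chars].rstrip() + "..."
    return (
        "Describe the attached image for a reader who cannot see it.\n"
        f"The image appears in a chapter titled: {chapter_title}\n"
        f"Text surrounding the image: {context}\n"
        "Focus on the details that matter in this context, and say so "
        "plainly if any part of the image is unclear."
    )


# Illustrative usage with made-up chapter details.
prompt = build_description_prompt(
    "Street Art Techniques",
    "Spray paint gradients can suggest depth, with darker blues "
    "representing deeper water.",
)
print(prompt)
```

With context like this, a description of our sample image would be steered towards the painting method and use of color rather than simply listing the fish.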
We’ll end this quick review with the description offered by one of the most well-known commercial services:
“The image shows a vibrant street art mural painted on a blue brick wall. It depicts a large, cartoonish green fish with a pink underbelly and exaggerated features, including a prominent red eye and black markings around its face. The fish is open-mouthed, appearing to chase a yellow and black insect on a hook that dangles from above. Behind the fish, there is a school of smaller, pale blue fish swimming in formation. The art style is bold and colorful, emphasizing dynamic motion and a playful interpretation of marine life.”
This detailed and surprisingly accurate description of the image was created by ChatGPT-4 without any additional prompts and probably represents one of the more capable services currently available.
AI is still very much a developing technology, one that offers lots of potential, but also one that needs to be used with care. It appears that we are currently some distance from a robust system for generating accurate and relevant image descriptions, but the promise of AI to support and enhance accessibility remains very exciting, and one which we continue to monitor with interest.