Using Artificial Intelligence for Image Descriptions

Gregorio Pellegrino

Among the requirements in the specifications for accessible digital content (WCAG and the EPUB Accessibility Guidelines), alternative descriptions of images are probably the most difficult for publishers to meet, because writing them accurately requires specific knowledge and adequate time. Many content producers do not yet have that knowledge (or the time), which limits the accessibility of their content, especially when they want to create accessible versions of backlist titles that were not originally designed to be accessible.

At Fondazione LIA we work side by side with publishers to support them in the production of born-accessible publications.

photograph of Gregorio Pellegrino presenting at DPUB Summit


Fondazione LIA is an Italian non-profit organization that promotes and supports the accessibility of digital publishing content. It is unusual in that its members include, on the one hand, content producers (publishers and AIE, the Italian Publishers Association) and, on the other, an organization representing visually impaired readers (UICI, the Italian Blind and Visually Impaired Union). We are thus able to act as a bridge between the two worlds, trying to reconcile mainstream publishing production processes with the needs of accessibility.

While collaborating with an educational publisher on a pilot textbook project, we had the opportunity to tackle this challenging issue in order to create a fully accessible edition of a book with a complex layout. This is how we began to ask how we might simplify, and possibly automate, the process of describing images.

The pilot project on the automatic generation of alternative descriptions of images using Artificial Intelligence technologies, presented at the Digital Publishing Summit 2019 in Paris, falls within the research and development activities that the Foundation carries out, often in collaboration with Italian universities or research centers.

As Chief Accessibility Officer of Fondazione LIA, a software engineer and, of course, a technology enthusiast, I am fascinated by the machine learning and artificial intelligence approaches that increasingly characterize scientific research, an area in which I have trained and kept myself informed in recent years.

So I asked myself how Artificial Intelligence could be used to automate the alternative description of images in the publishing world, also taking into account that large technology companies (Microsoft, Google, Amazon, Facebook, etc.) have begun to offer services based on artificial neural networks and machine learning that automatically describe the photographs published on their platforms. We realized, however, that images in the publishing world are more complex than those in other industries and that, therefore, the standard solutions available on the market are not enough on their own.

Starting from these considerations, we developed a research project in collaboration with Tommaso Dringoli, a graduate student at the University of Siena, to test some of the artificial intelligence algorithms available on the market for automatically generating alternative descriptions of images in the digital publishing field.

The starting point was the definition of a template for alternative descriptions of images, structured in two complementary elements:

  1. image category: a taxonomy of categories of images (for example: art, comic, drawing, logo, photograph, etc.) to be used to classify the different kinds of images;
  2. description of the image: representing the description of the figure’s content.
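The two-element template can be pictured as a small data structure. This is only a sketch: the class name, field names and the way the two elements are joined into one string are assumptions for illustration, not the format used in the pilot.

```python
from dataclasses import dataclass


@dataclass
class AlternativeDescription:
    """Hypothetical container for the two elements of the template."""
    category: str     # image category, e.g. "photograph", "logo", "drawing"
    description: str  # description of the figure's content; may be empty

    def to_alt_text(self) -> str:
        # Joining the elements as "Category: description" is an assumption.
        label = self.category.capitalize()
        if self.description:
            return f"{label}: {self.description}"
        return label


example = AlternativeDescription("photograph", "a speaker presenting on a stage")
print(example.to_alt_text())
```

Keeping the category separate from the description means an image can still carry a useful label when no reliable description is available.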

For the first element of the description – the image category – we tried different approaches, which led us to use the Cloud AutoML Vision tool by Google. This service – available online and accessible through a web interface – makes it possible to train a machine learning algorithm from an initial dataset of manually catalogued images.

We trained the algorithm by uploading 1,000 images for each of the 12 categories (a dataset of 12,000 images): 80% of them were used for training, 10% to optimize the model’s hyperparameters (validation set), and the remaining 10% to evaluate the model (test set).
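As an illustration of the 80/10/10 split, the sketch below divides the 1,000 images of one category into training, validation and test sets. The article does not say how the split was actually made, so the random shuffle, the function name and the file names are all assumptions.

```python
import random


def split_dataset(items, train=0.8, validation=0.1, seed=0):
    """Shuffle items and split them into train, validation and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a repeatable split
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # whatever remains is the test set
    )


# Hypothetical file names standing in for the 1,000 images of one category.
images = [f"photograph_{i:04d}.jpg" for i in range(1000)]
train_set, validation_set, test_set = split_dataset(images)
print(len(train_set), len(validation_set), len(test_set))
```

Splitting each category separately keeps the 12 categories equally represented in all three sets.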

Once trained, the service can be given new images, and for each one it returns the identified category.

For the second element, the description of the image, we evaluated different services available on the market, analyzing their strengths, costs and effectiveness. We realized that at the moment no single service is capable of creating appropriate descriptions for all the categories of images we identified; consequently, we selected two services:

  • Microsoft Computer Vision for the description of the photographs;
  • Google Cloud Vision API to identify known entities such as logos, flags, works of art, etc., or to apply Optical Character Recognition to text-based images.

For some categories of images, such as comic strips, complex maps and signatures, we decided not to generate an image description, because the outputs obtained while testing the services were too imprecise or random.
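The choice of service per category amounts to a simple routing table. The sketch below is an assumption about how such routing might look: the category labels and service identifiers are illustrative (the article names only some of the 12 categories), and no real API is called.

```python
from typing import Optional

# Illustrative labels only; the real taxonomy has 12 categories.
SERVICE_BY_CATEGORY = {
    "photograph": "microsoft_computer_vision",
    "logo": "google_cloud_vision",
    "flag": "google_cloud_vision",
    "art": "google_cloud_vision",
    "text": "google_cloud_vision",  # text-based images handled via OCR
}

# Categories for which no description is generated at all.
SKIPPED_CATEGORIES = {"comic", "complex map", "signature"}


def choose_service(category: str) -> Optional[str]:
    """Return the service to use, or None when no description is generated."""
    if category in SKIPPED_CATEGORIES:
        return None
    return SERVICE_BY_CATEGORY.get(category)


for label in ("photograph", "logo", "comic"):
    print(label, "->", choose_service(label))
```

Returning `None` for skipped or unknown categories lets the caller fall back to the image category alone.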

Having chosen the services, we developed a command-line tool that takes an EPUB file as input, extracts all the images in it, and automatically creates the full alternative description, including both elements: image category and image description.
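The extraction step can be sketched with Python's standard library, since an EPUB file is a ZIP container. This is not the tool described above: the function name is hypothetical, and selecting images by file extension is a simplifying assumption (a more careful version would read the media types declared in the package manifest).

```python
import zipfile
from pathlib import PurePosixPath

# Assumption: images are recognized by extension, not by the manifest.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".svg"}


def extract_epub_images(epub):
    """Return {archive_path: raw_bytes} for every image inside an EPUB.

    `epub` may be a file path or a binary file-like object.
    """
    images = {}
    with zipfile.ZipFile(epub) as archive:
        for name in archive.namelist():
            if PurePosixPath(name).suffix.lower() in IMAGE_EXTENSIONS:
                images[name] = archive.read(name)
    return images
```

Each extracted image could then be passed to the classifier and, depending on the returned category, to one of the description services.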

Finally, we tested the prototype on some EPUB files provided by publishers, obtaining the following results:

  • automatically generated image category: 42% accuracy;
  • automatically generated image description: 50% accuracy.

We think that the accuracy of the image category could be improved by refining the initial training dataset of Cloud AutoML Vision, while the image description requires the evolution of the algorithms currently available on the market.

However, taking into account the speed of technological development in this area, we plan to run new tests within six months to a year to see whether the accuracy has improved.

One of the most interesting results of this pilot is the realization that the image recognition algorithms available on the market are optimized for photographs and are not able to describe other kinds of images (drawings, works of art, logos, etc.).

This is very important to consider, because most of the graphic content in complex-layout books (schoolbooks and academic, scientific and professional publications) consists not of photographs but of drawings or illustrations such as graphics, infographics, complex images, diagrams and scientific schemas. For these types of images, a new generation of algorithms will be required.

A crucial point for us is to raise awareness of the importance of producing correct image descriptions. That is why, as part of the activities Fondazione LIA will carry out in the autumn, we will organize a meet-up inviting illustrators, graphic designers, publishers and experts from specialist organizations producing accessible publications to discuss this topic and share their experience. The meet-up is part of a project called MICA, Milano per la cultura dell’accessibilità (Milan for the culture of accessibility), realized thanks to the collaboration of Fondazione Cariplo.

The use of AI technology to improve the accessibility of images is not currently a viable solution for the publishing industry, so building description authoring into contracts and workflows is currently the only practical approach. Yet the potential of this technology is clear: through improved algorithms, enlarged data sets and perhaps analysis of the image within the context of any surrounding text, the accuracy and quality of automatically generated image descriptions could improve significantly, which offers promise for the future.


Article by Gregorio Pellegrino, Chief Accessibility Officer, LIA

Adapted from a presentation given at the 2019 DPUB Summit which is available to watch on YouTube: