AI Text To Speech Cost Comparison
Following on from our article An Introduction to AI Speech, we return to our series exploring the benefits and impact of AI on accessible content production. We know from many of our members and the wider DAISY community that there is much interest in exploring what AI can offer accessible publishing, so in this series we aim to research and answer some of the common questions.
When it comes to synthetic audio, AI technologies promise significantly improved quality, and some of the available samples are certainly impressive. But with a wide variety of platforms, service levels, and features, what is the actual cost of converting a title, and how do some of the industry leaders compare?
We investigated the cost of using AI voice services to convert a single typical book to synthetic audio, selecting five leading services: Amazon Polly, ElevenLabs, Google Gemini, Microsoft Azure, and OpenAI.
Each of the services offers a variety of voices, with pricing determined by factors such as quality, naturalness, customization, and additional features like bilingual or multilingual support. Key differences also stem from the AI models used to generate them.
Before we get into the details, a quick disclaimer: AI is a rapidly evolving technology and the services we explored regularly change their options and pricing. The details in this article were accurate in May 2025 when published.
What We Tested
We set out to identify the baseline cost for one-off conversions of a typical title. This lets us compare the different services directly and indicates the maximum cost likely to be incurred while experimenting with them. The ultimate cost of converting many titles will vary with the volume produced, and in many cases the services are open to discussing pricing for high volumes.
All the services we tested have some form of web interface, which often came with limitations, as well as API (Application Programming Interface) connectivity, which lets applications communicate with the service directly and generate audio programmatically with minimal human intervention. AI API support is already available in some DAISY tools: DAISY Pipeline currently supports Google Gemini and Microsoft Azure, and support for other platforms is planned.
Audio Generation
For this comparison, we focused on mid-range voices from each service, with the goal of getting the benefits of AI text to speech at an optimal cost. Higher quality voice options are available from each service for people wanting to test the latest generation of voices on the market, but these typically come at significantly higher cost.
We selected the Neural voices for three services (Microsoft Azure, Google Gemini, and Amazon Polly) and the standard voices for ElevenLabs and OpenAI.
As for the pricing model, Amazon Polly, Microsoft Azure, OpenAI, and Google Gemini offer pay-as-you-go pricing based on the number of characters processed, with rates that vary by factors such as usage and voice selection. ElevenLabs takes a different approach, offering only a subscription-based model.
Most of the services also offer a trial period, which for our purposes was excluded from the calculations, but in some cases is enough to convert one or two books.
We chose to test a “typical novel” of 90,000 words or 423,000 characters in length. We generated clips using the first chapter of Pride and Prejudice to provide a sample of the voice quality.
Listen to samples from each service:
- Amazon Polly Sample:
- ElevenLabs Sample:
- Google Sample:
- Microsoft Azure Sample:
- OpenAI Sample:
Costs Breakdown
When comparing the five services, the cost-per-book figures for the base-level AI voices are quite close, with one exception.
Pricing (USD) | Amazon Polly | ElevenLabs | Google Gemini | Microsoft Azure | OpenAI |
Neural titles in free trial | 2 | – | 2 | 1 | – |
Cost per title standard TTS | $1.69 | – | $1.69 | – | – |
Cost per title neural TTS | $6.77 | $29.08 | $6.77 | $6.35 | $6.35 |
Cost per title HD / Generative TTS | $12.69 | – | $12.69 | – | $12.69 |
Not all of the services have comparable voices, and even where voices are listed, the cost per voice is not always public. ElevenLabs also has a different pricing structure, currently offering only a subscription service. The average cost per title ranges from $21 to $35, depending on the subscription level and whether you pay monthly or annually. The Pro level subscription, at $82.50 USD per month, is the first to provide enough credits for 2 book conversions per month. Because the first two months are free, it gives an average conversion cost of $29.08 per title, increasing to $35.36 per title from the second year.
We found that Amazon Polly, Google Gemini, Microsoft Azure and OpenAI all offer comparable services and have almost identical pricing. They all operate on a pay-as-you-go system that allows the purchase of one million character credits at a time, which is enough to convert 2 average novels. The neural TTS cost per title for Microsoft Azure and OpenAI is $6.35 USD, while Amazon Polly and Google Gemini come in at $6.77 USD. The HD or Generative TTS voices offered by Amazon, Google and OpenAI all come in at the same cost per title of $12.69, and while Microsoft don’t currently publish the cost of their HD voice, it might be reasonable to expect it to be priced similarly.
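The per-title figures for the pay-as-you-go services can be reproduced with simple arithmetic. The sketch below is illustrative only: the per-million-character rates are not quoted in this article, they are back-calculated from the per-title costs above and should be treated as assumptions, and the function names are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

# Length of the "typical novel" used in this comparison.
CHARS_PER_TITLE = 423_000

# ASSUMPTION: USD rates per one million characters, inferred from the
# per-title costs in the table rather than taken from any price list.
ASSUMED_RATE_PER_MILLION_USD = {
    "standard TTS": "4.00",
    "neural TTS (Amazon Polly, Google Gemini)": "16.00",
    "neural TTS (Microsoft Azure, OpenAI)": "15.00",
    "HD / Generative TTS": "30.00",
}


def cost_per_title(rate_per_million_usd, chars=CHARS_PER_TITLE):
    """Pay-as-you-go cost in USD for one title, rounded to the nearest cent."""
    cost = Decimal(chars) * Decimal(rate_per_million_usd) / Decimal(1_000_000)
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def titles_per_block(block_chars=1_000_000, chars=CHARS_PER_TITLE):
    """Whole titles that fit in one purchased block of character credits."""
    return block_chars // chars


for tier, rate in ASSUMED_RATE_PER_MILLION_USD.items():
    print(f"{tier}: ${cost_per_title(rate)} per title")
print(f"Titles per one million characters: {titles_per_block()}")
```

Decimal arithmetic is used here so that amounts ending in half a cent round the same way every time, which binary floating point does not guarantee.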
Other Considerations
AI voices are improving all the time, but as with all text to speech services, some words will be mispronounced and require correction. This should be a simple process of regenerating the affected sentence or paragraph, but the costs for a single book conversion indicated above are unlikely to be the final cost for a quality audiobook. Titles requiring multiple corrections throughout the text could end up costing two or three times the price of a single book conversion.
The prices we listed are for single language titles. Some services charge more for multilingual voices; in the case of ElevenLabs, this is twice the cost of a single language voice.
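To get a feel for how corrections add to the bill, here is a minimal sketch. It assumes that every regenerated passage is billed per character at the same rate as the original conversion, which may not hold for every service. The function name, the $16 per million character rate, and the example correction counts are all hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def cost_with_corrections(base_chars, regenerated_chars, rate_per_million_usd):
    """Return (total cost in USD, multiple of the single-pass cost).

    ASSUMPTION: each regenerated passage is billed again, per character,
    at the same rate as the first pass.
    """
    billed_chars = base_chars + sum(regenerated_chars)
    total = Decimal(billed_chars) * Decimal(rate_per_million_usd) / Decimal(1_000_000)
    multiple = Decimal(billed_chars) / Decimal(base_chars)
    return (
        total.quantize(CENT, rounding=ROUND_HALF_UP),
        multiple.quantize(CENT, rounding=ROUND_HALF_UP),
    )


# Hypothetical example: a 423,000-character title where 150 paragraphs of
# about 600 characters each are regenerated once.
total, multiple = cost_with_corrections(423_000, [600] * 150, "16.00")
print(f"Total: ${total} ({multiple}x the single-pass cost)")
```

Under this model the final cost depends on how much text is resubmitted and how many attempts each passage needs, so regenerating whole paragraphs repeatedly adds up far faster than fixing single sentences.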
Control over the audio generation varies considerably, with Microsoft, Amazon and Google supporting several features of Speech Synthesis Markup Language (SSML), ElevenLabs supporting a few SSML features, and OpenAI offering more limited bespoke controls to adjust the way audio is produced. Because support for and implementation of SSML is inconsistent, text markup currently has to be service specific. If the same title is sent to a different service, it will likely require significant rework.
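As an illustration of what SSML markup looks like, the sketch below builds a small fragment that tells the voice to read "Mr." as "Mister", using the `sub` element from the W3C SSML specification. Whether a given service honours this element, and which attributes it requires on the `speak` root, is not covered by this article and should be checked against that service's documentation. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def ssml_with_substitution(before, written, spoken, after):
    """Build an SSML fragment that reads `written` aloud as `spoken`.

    ASSUMPTION: the target service accepts a bare <speak> root and the
    standard <sub alias="..."> element; support varies by service.
    """
    speak = ET.Element("speak")
    speak.text = before
    sub = ET.SubElement(speak, "sub", {"alias": spoken})
    sub.text = written
    sub.tail = after
    # ElementTree escapes characters such as & and < in the text for us.
    return ET.tostring(speak, encoding="unicode")


ssml = ssml_with_substitution("My dear ", "Mr.", "Mister", " Bennet")
print(ssml)
```

Building the markup with an XML library rather than string concatenation keeps the output well-formed, but the resulting tags may still need to be changed or removed when the same text is sent to a different service.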
Conclusion
From this comparison we can see that most of the leading services offer very similar pricing, while ElevenLabs adopts a different payment model and is considerably more expensive, at more than four times the per-title cost of the other services' neural voices.
When used at scale, all the services encourage potential users to get in contact, so for higher volumes it may be possible to negotiate a better rate than those listed on their sites.
But it’s important to remember that AI is still a developing technology. The cutting edge and highest cost voices of today are likely to be the entry level voices in less than a year. The cost of services is also likely to change over time. We can already see competitive pricing between several of the services, and this level of competition will hopefully see costs decreasing over time as the market and true costs of production are better understood.
Looking ahead, this is not a technology that has to remain in cloud services. There are already local versions of AI TTS offering more limited services than their cloud counterparts. As the technology evolves, we are likely to see licensing of software for local production, with the potential to significantly reduce the conversion cost per title.