An Introduction to AI Speech
Artificial Intelligence is advancing so rapidly that it appears to be constantly in the news. Reports often raise questions and concerns about what it is, how it works, whether it is ethical, and whether we are missing out on a big opportunity if we don't adopt it. To help address some of these questions, we are starting a series of articles that dig deep into AI to better understand its implications and potential for the future of accessible publishing. We will raise questions, invite experts to address concerns, and open a conversation with the wider accessibility community about how we can keep up and make best use of this emerging technology.
Join us on this journey of discovery as we learn together about the technology behind AI. In this article, we will explore AI voices and in later pieces will dig deeper into the practicalities of how the technology works and might best be deployed.
What is an AI voice?
Text to speech technologies have been developing rapidly since their emergence on early computers. To highlight the evolution of these voices, the following poem, titled Technology Evolution, is voiced by three Microsoft voices: Sam, the default voice on Windows XP; George, a Windows 11 voice; and Seraphina, an AI voice available through Microsoft Azure AI Speech Studio.
The voices in the clip were created using default settings; with additional work it is possible to improve the output.
Seraphina, the final voice in the clip, is just one example of voice technology based on machine learning, which is a field within Artificial Intelligence.
How are voices made using AI technology?
AI voices are based on language-specific voice models, created from huge amounts of recorded audio, that mimic the tone, pitch, and intonation of a human voice so the synthesized speech sounds as natural and intelligible as possible.
The foundation of AI speech is a voice model. Models are created by processing a very large quantity of sample speech in the target language. Through analysis of the audio and its transcript, this training process helps the model learn not just how to pronounce different words, but how to do so in a natural-sounding way. The training method differs between companies and in most cases is a closely guarded secret. The way text is processed to generate audio improves through rounds of iterative training and feedback, many of which are fully automatic.
As with other AI models, this isn't a conventional software application that a developer can adjust. These systems loosely mimic elements of the human brain, building and strengthening connections between concepts. The result is something that works in an almost mystical way that nobody fully understands enough to adjust directly, but further rounds of training can improve and fine-tune the performance and output.
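To give a flavour of what "iterative training and feedback" means, here is a deliberately simplified toy in Python. It is not a real speech system: it just shows how a model's parameters can be nudged towards a human reference automatically, round after round, rather than being edited by hand.

```python
# Toy illustration (not a real speech system): training repeatedly
# compares the model's output to a human reference and nudges each
# parameter a little closer, instead of a programmer editing rules.

def train(params, reference, rounds=200, rate=0.1):
    """Highly simplified stand-in for the automatic feedback loops
    used to refine voice models: each round moves every parameter
    a fraction of the way towards the reference value."""
    for _ in range(rounds):
        params = [p + rate * (r - p) for p, r in zip(params, reference)]
    return params

# Imagine a "pitch contour" the model produces vs. a human recording.
model = [0.0, 0.0, 0.0]
human = [1.0, 0.5, 0.8]

tuned = train(model, human)
print([round(p, 2) for p in tuned])  # converges close to the reference
```

No individual round is a dramatic change, but after many automatic iterations the output closely matches the reference, which is why nobody needs to (or can) adjust the model's internals directly.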
Models can be voice specific, but they can also provide a framework for other voices. This enables the rapid development of new voices without the need to develop a new model every time. This is how voice cloning services can take a relatively short sample of audio and apply it to an existing model to create a new synthetic voice.
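The idea of applying a short sample to an existing model can be sketched in a few lines of Python. Everything here is invented for illustration (real cloning services extract far richer speaker characteristics than an average pitch): the point is only that the heavy lifting lives in the shared base model, and the new voice is a small "profile" layered on top.

```python
# Toy sketch of voice cloning (illustrative only): a short sample is
# reduced to a small speaker profile, which is then applied to a
# shared base model to colour its output.

def speaker_profile(sample_pitches):
    """Summarise a short audio sample as a single average pitch.
    Real systems derive a much richer speaker embedding."""
    return sum(sample_pitches) / len(sample_pitches)

def synthesize(base_contour, profile):
    """Shift the base model's neutral pitch contour towards the
    cloned speaker's average pitch."""
    neutral = sum(base_contour) / len(base_contour)
    return [p + (profile - neutral) for p in base_contour]

base = [100.0, 120.0, 110.0]    # shared model, neutral voice
sample = [180.0, 190.0, 200.0]  # short recording of a new speaker

cloned = synthesize(base, speaker_profile(sample))
print(cloned)  # the base contour, shifted to the new speaker's range
```

Because only the small profile is new, a convincing voice can be produced from minutes (or less) of audio instead of the hundreds of hours needed to train a model from scratch.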
What is the difference between AI voice and a good quality synthetic voice?
The audio clip linked above clearly demonstrates that synthetic speech has seen remarkable advancement over the years, from a very robotic voice to clear and fluent speech. So, what is the difference between an AI voice and a good traditional synthetic voice?
Traditional synthetic speech applied linguistic pronunciation models to text, breaking down components of a word into its phonetic elements and applying linguistic and statistical models to generate audio. Early voices like Microsoft Sam are examples of this parametric text to speech, where the audio is entirely computer generated, which is what gives it a more robotic quality.
Microsoft George is an example of concatenative text to speech, which uses the same initial process as Sam but draws on pre-recorded human speech, stitching together the audio fragment for each phoneme to construct a word. This process has improved substantially over time to make synthetic speech sound more natural, but in essence is still built using a similar, albeit much more complex, system.
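The two steps above, breaking a word into phonemes and then assembling pre-recorded fragments, can be sketched as a toy in Python. The lexicon and fragment names are made up for illustration; real systems use large pronunciation dictionaries and smooth the joins between fragments, which is where much of the engineering effort goes.

```python
# Toy sketch of concatenative text to speech (illustrative only):
# look up a word's phonemes, then concatenate a pre-recorded audio
# fragment for each one.

# Tiny pronunciation lexicon (in the style of a phonetic dictionary).
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

# Pretend each phoneme has a recorded audio fragment on disk.
FRAGMENTS = {p: f"<{p}.wav>" for phones in LEXICON.values() for p in phones}

def speak(word):
    """Join the pre-recorded fragment for each phoneme of the word."""
    return "".join(FRAGMENTS[p] for p in LEXICON[word])

print(speak("hello"))  # <HH.wav><AH.wav><L.wav><OW.wav>
```

Because the output is always assembled from the same fixed inventory of fragments, concatenative voices can sound seamed or flat at the joins, which is exactly the limitation AI models avoid by generating new audio rather than reusing recorded pieces.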
Because AI voice models are developed directly from human speech samples, the audio they create is more natural sounding from the start, as it mimics the qualities of the original. Unlike the concatenative method, AI speech isn't formed of human audio fragments from the training materials; instead, the model generates new audio based on the patterns it learned during training.
AI voices are often cloud based services, which means they can use vast processing power and complex systems to analyze text and rapidly generate audio. But this requires a good internet connection and typically a subscription or payment for the service being used. So, while conventional synthetic speech may sound less natural, it is usually significantly cheaper to produce, as it can be generated locally on any computer.
How good are AI voices?
Voice quality has improved tremendously by utilizing AI. As a result, some audiobook publishers are now publishing AI audiobooks. Is it human-like? Can it convey the same emotions, performance, and quality as a human voice? This remains debatable.
Although the quality of AI voices is advancing rapidly and can now match a human voice for clarity and fluency, they may not deliver the same performance: the nuances of human emotion, or the imitation of different dialects, accents, and speech patterns. This can leave them lacking in authenticity and listener engagement, making it difficult for listeners to connect with them as they would with a human narrator. But this is an area that is likely to improve in the coming years.
Looking ahead!
Although AI voices have raised the quality of text to speech output well beyond earlier technologies, many would agree that the technology still has room to improve and to sound more human-like. Could it ever fully replace human voices? Could it be a solution for some types of content? Could it make multilingual support easier? As indicated earlier, this is a topic we will continue to explore, and we will try to address some of these questions in subsequent articles.
We’re also integrating support for AI voice technology in some DAISY tools, with Pipeline now able to connect directly to Azure and Google AI voice services, allowing the wider DAISY community to test these new voice developments for themselves.
Share your AI experiences
Do you have suggestions for future topics? Are you experimenting with AI? We would like to hear from you. Please get in touch and share your thoughts or experiences.
Related resources:
- What is an AI voice generator and how does it work? | ElevenLabs
- How Are AI Voices Made? – YouTube
- How to make an AI voice – WellSaid
- Notes on synthetic speech – Tink
- History of text to speech – Speechify
- Microsoft Azure AI Speech Studio
- The AI Revolution in Book Publishing | Leanpub
- Technology Evolution poem by Nikhil Varma
- Audiobooks and Authors: Ready for the AI revolution? – The Publishing Authority