DeepMind says it's developed much more realistic computer speech

Sep 13, 2016

This article is published in collaboration with Business Insider.

A technician makes adjustments to the "Inmoov" robot from Russia during the "Robot Ball" scientific exhibition in Moscow May 17, 2014. Picture taken May 17, 2014.

Image: REUTERS/Sergei Karpukhin

Sam Shead

Technology Reporter, Business Insider

Google DeepMind claims to have significantly improved computer-generated speech with its AI technology, paving the way for sophisticated talking machines such as those seen in the sci-fi films "Her" and "Ex Machina."

The London-based research lab, acquired by Google in 2014 for a reported £400 million, announced on Thursday that it has developed a talking computer programme called "WaveNet" that halves the quality gap that currently exists between human speech and computer speech.
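
To make "halves the quality gap" concrete, here is a minimal sketch of the arithmetic, assuming speech quality is rated on a single numeric scale. The function name `gap_closed` and the scores are invented for illustration and are not DeepMind's reported results.

```python
def gap_closed(baseline, new_system, human):
    """Return the fraction of the baseline-to-human quality gap
    that new_system closes (0.5 means the gap has been halved)."""
    gap_before = human - baseline
    if gap_before <= 0:
        raise ValueError("baseline must score below human speech")
    return (new_system - baseline) / gap_before


# Hypothetical scores on an arbitrary 1-to-5 scale (not from the article).
print(gap_closed(baseline=3.0, new_system=4.0, human=5.0))  # prints 0.5
```

A result of 0.5 or more would correspond to the "over 50%" figure quoted by DeepMind: the new system sits at least halfway between the previous best system and a human speaker.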

Although WaveNet sounds more like a human voice than existing artificial voice generators — known as "text-to-speech" (TTS) systems — it requires too much computing power to make it practical, meaning Google won't be integrating it into its products any time soon, according to The Financial Times.

AI Landscape: Global Quarterly Financing History

Image: CB Insights

Aäron van den Oord, a research scientist at DeepMind, said: "Mimicking realistic speech has always been a major challenge, with state-of-the-art systems, composed of a complicated and long pipeline of modules, still lagging behind real human speech. Our research shows that not only can neural networks learn how to generate speech, but they can already close the gap with human performance by over 50%.

"This is a major breakthrough for text-to-speech systems, with potential uses in everything from smartphones to movies, and we're excited to publish the details for the wider research community to explore."

Unlike existing artificial voice generators, WaveNet focuses on the sound waves being produced as opposed to the language itself. It uses a neural network, a technology loosely modelled on the human brain, to analyse raw waveforms of an audio signal and model speech and other types of audio, including music.
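
For readers curious what working with a "raw waveform" looks like, the toy sketch below may help. A waveform is just a long list of amplitude values, and DeepMind's paper describes WaveNet as predicting each new value from the ones before it. Everything here is an assumption made for illustration: the 16,000 samples-per-second rate, the function names, and above all the `predict_next` rule, which is a fixed formula that only works for a pure tone and stands in for the trained neural network.

```python
import math

SAMPLE_RATE = 16_000  # assumed samples per second; not stated in the article


def sine_wave(freq_hz, seconds, rate=SAMPLE_RATE):
    """Build a raw waveform: one amplitude value per sample."""
    n = round(seconds * rate)
    return [math.sin(2 * math.pi * freq_hz * t / rate) for t in range(n)]


def predict_next(prev2, prev1, freq_hz, rate=SAMPLE_RATE):
    """Toy stand-in for a learned model: predict the next sample from
    the previous two. For a pure sine wave the identity
    x[t] = 2*cos(w)*x[t-1] - x[t-2] holds exactly."""
    w = 2 * math.pi * freq_hz / rate
    return 2 * math.cos(w) * prev1 - prev2


def generate(seed, n_total, freq_hz):
    """Extend the seed one sample at a time until n_total samples exist."""
    out = list(seed)
    while len(out) < n_total:
        out.append(predict_next(out[-2], out[-1], freq_hz))
    return out


real = sine_wave(440, 0.01)            # 10 ms of a 440 Hz tone
toy = generate(real[:2], len(real), 440)
print(len(real))                        # number of samples in 10 ms
print(max(abs(a - b) for a, b in zip(real, toy)))  # largest mismatch
```

Even ten milliseconds of audio needs 160 numbers at this rate, which hints at why generating speech sample by sample demands so much computing power.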

DeepMind published sample audio recordings of WaveNet talking in English and Mandarin, and it's easy to hear that the recordings are an improvement on Google Now, Amazon's Alexa, and Apple's Siri. The company also showed off some of the music that WaveNet has been able to produce after studying solo piano music on YouTube.

Like other AI systems, WaveNet requires vast quantities of existing data to train itself. DeepMind used Google's existing TTS datasets to do this.

DeepMind, which sits under Alphabet, Google's parent company, is best known for developing artificial intelligence systems that can master games like Space Invaders and Go. However, Google has been slow to integrate the company's technology into its products, with just one data centre efficiency project announced so far, albeit on a global scale.

For more details on WaveNet, take a look at Google DeepMind's academic paper.

