AI is finding its voice and that's bad for democracy
- Deepfake audio clips of public figures have the potential to disrupt democracy.
- Like previous generations of sound technology, we are using AI voice synthesis both for creative expression and to spread disinformation and doubt.
- What can be done to curtail the malign use of AI-generated content?
How would you react to hearing a politician you were poised to vote for privately discussing how to steal the election? Or an even-tempered candidate in a foul-mouthed tantrum? Would it change your mind? Change your vote? When audio of a plotting Slovakian politician and an angry British opposition leader circulated recently, we should have paused to listen more carefully. Synthetic voices — deepfakes — put words in their mouths.
New synthetic voices are a product of generative artificial intelligence (AI). They are trained with voice samples and given a script to deliver. Of course, computer-generated voices are not all bad. Stephen Hawking found a voice with which to reveal the universe, and Val Kilmer delivered lines in Top Gun: Maverick using AI trained on recordings made before he lost his voice to cancer. But Hawking and Kilmer sanctioned the voices and controlled what they said.
Deepfake audio has the power to deceive
We have a track record of abusing technology designed to manipulate and share sound. Impersonators masquerading as world leaders routinely embarrass politicians with prank calls, but voices can also be commandeered to lend credibility to ideas, even persuade us to do something we wouldn’t otherwise do. Towards the end of World War II, when Winston Churchill wanted to slow German troop movements, he turned to his dark arts propaganda team. Two convincing imitators took over Radio Cologne pretending to be the station’s continuity announcers and persuaded civilians to flood the streets. But aping a familiar voice takes skill and time, so convincing mimics have been a rare resource — until now. Synthetic voices can be cloned and work tirelessly, parroting anything that we give them, even tailoring their message to an audience of one. AI can impersonate at a scale and speed that we have never known.
There are already off-the-shelf voices of Joe Biden and Donald Trump on community websites like FakeYou and anyone can make them say anything. If the voice you want is not available, someone will make it for you in a day for $25. A realistic voice that has to survive close scrutiny takes more time and expertise, and needs high-grade AI to remove tell-tale imperfections, but that may be overkill when we are not paying proper attention.
Audio has unique power to lower our defences. In 2020, an unnamed Hong Kong bank manager took a phone call from one of his clients, a man he knew and whose voice he knew well. The client wanted $35 million transferred to close an acquisition. The bank manager made the transfer. He had been duped by a synthetic voice. How far would a call have to go before you accused someone you know well of being a machine?
Voices are becoming more convincing, and the training effort is shrinking. Microsoft claims that its VALL-E service can generate a voice from a three-second sample. It finds the closest match in its library of thousands of existing voices — many based on out-of-copyright audiobooks — and tweaks it to mimic the sample. More voices will mean closer matches, and more users will provide the fuel that AI needs — feedback on how it did.
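To illustrate the matching step (and only that step), here is a toy sketch in Python: each voice is reduced to a fixed-length "speaker embedding" vector, and the library voice whose embedding is most similar to the sample's wins. The names, vectors, and dimensions below are invented for illustration; this is a simplified picture of nearest-neighbour matching, not VALL-E's actual architecture.

```python
# Toy nearest-neighbour voice matching over speaker embeddings.
# Embeddings and names are hypothetical; real systems use vectors
# with hundreds of dimensions learned by neural networks.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity of two embeddings; 1.0 means they point the same way."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def closest_voice(sample: np.ndarray, library: dict[str, np.ndarray]) -> str:
    """Return the name of the library voice most similar to the sample."""
    return max(library, key=lambda name: cosine_similarity(sample, library[name]))

# Hypothetical 4-dimensional embeddings for two library voices.
library = {
    "narrator_a": np.array([0.9, 0.1, 0.0, 0.2]),
    "narrator_b": np.array([0.1, 0.8, 0.3, 0.0]),
}
sample = np.array([0.85, 0.15, 0.05, 0.25])  # embedding of the 3-second clip
print(closest_voice(sample, library))         # -> "narrator_a"
```

Real systems learn these embeddings with networks trained to place similar voices close together, but the principle is the same: a bigger library means a nearer neighbour, which is why more voices mean closer matches.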
How can we trust what we hear?
Prevention will be better than cure. Microsoft has not released VALL-E openly and Resemble AI, one of a growing number of companies that produces synthetic voices for commercial use, requires the original speaker to record their consent before cloning their voice. But who will police the broader industry, and can it be done without discrediting and silencing free speech, or offering plausible deniability to anyone legitimately recorded saying something they shouldn’t?
In January 2023, China introduced rules that service providers must follow to label and police “deep synthesis” content, and the G7 nations have issued a non-binding code of conduct. Important progress, even if both are light on specifics.
Inaudible watermarks could be embedded into synthesized audio files and training data, and malign fakes detected on upload — Resemble AI’s detection service took just 200 milliseconds to decide the British clip was phony — but social media companies are always loath to introduce friction when users post. Zohaib Ahmed, CEO of Resemble AI, sees the solution as an “antivirus for AI”: “this hidden thing that’s going to be embedded into not just these distribution channels but onto your operating system itself”. He draws parallels with SSL and the padlock that clearly shows users when a connection can be trusted.
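To make the mechanism concrete, here is a minimal sketch of one classic approach, spread-spectrum watermarking, in Python. Everything in it is an assumption for illustration: the key, the amplitude, and the correlation threshold are toy choices, not Resemble AI’s or any other vendor’s actual scheme, and a production detector would also have to survive compression, editing, and re-recording.

```python
# Minimal spread-spectrum watermark sketch: mix a key-derived, low-amplitude
# pseudo-random pattern into synthetic audio, then test for it by correlation.
# All constants here are illustrative assumptions, not a real vendor's scheme.
import numpy as np

KEY = 1234          # shared secret; only key holders can embed or verify
AMPLITUDE = 0.002   # far below audible levels for audio normalized to [-1, 1]

def pattern(n: int) -> np.ndarray:
    """Deterministic ±1 sequence derived from the secret key."""
    return np.random.default_rng(KEY).choice([-1.0, 1.0], size=n)

def embed(audio: np.ndarray) -> np.ndarray:
    """Mix the inaudible pattern into the audio at synthesis time."""
    return np.clip(audio + AMPLITUDE * pattern(len(audio)), -1.0, 1.0)

def is_watermarked(audio: np.ndarray) -> bool:
    """Correlate with the expected pattern; a strong match implies the mark."""
    correlation = float(np.dot(audio, pattern(len(audio)))) / len(audio)
    return correlation > AMPLITUDE / 2

clean = 0.1 * np.random.default_rng(0).standard_normal(48_000)  # 1 s at 48 kHz
print(is_watermarked(clean), is_watermarked(embed(clean)))      # expect: False True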
TikTok recently updated its community guidelines to outlaw AI-generated content in which a public figure is “used for political or commercial endorsements or any other violations of our policies”, but if a fake gets through it will be a race to take it down or flag it before it runs wild. We know how that goes.
We are a year away from elections in 40 countries that together represent 40% of the world’s population, including the US, UK, and India, and dangerous scenarios are easy to imagine. What if a leader confesses to stealing an election? Media outlets might not be taken in, but social media will fan the flames. The Slovakian leak came during a two-day media blackout immediately preceding polling; the lie was halfway around the world before the truth was even allowed to get its boots on. The Progressive Slovakia party, whose leader was featured in the fake clip, came second in the elections but, as with all disinformation campaigns, the effect of the audio on the outcome is hard to gauge.
While we wait for services and policies that tame synthetic voices to take hold, producers and distributors need to recognize how influenced we are by what we hear and help us listen more critically. “Believe nothing of what you see, only half of what you hear.” But which half?