
AI voice cloning: How programs are learning to pick up on pitch and tone

James Betker, who developed TortoiseTTS, an open-source voice-cloning model, says voices are not as unique as people may think, making them easy to clone.
A visualization of an AI-generated voice

Voice cloning is an emerging technology powered by artificial intelligence, and it's raising alarms about its potential for misuse.

Earlier this year, New Hampshire voters experienced this firsthand when a deepfake mimicking President Joe Biden’s voice urged them to skip the polls ahead of the primary.

The deepfake likely needed only a few seconds of the president's voice to create the clone. According to multiple AI voice cloning models, about 10 seconds of real audio is all that is needed to recreate a voice, and that can easily come from a phone call or a video posted to social media.

"A person's voice is really probably not that information-dense. It's not as unique as you may think," James Betker, a technical staff member at OpenAI, told Scripps News.

Betker developed TortoiseTTS, an open-source voice cloning model.

"It's actually very easy to model, very easy to learn, the distribution of all human voices from a fairly small amount of data," Betker added.

How AI voice cloning works

AI voice models are trained on vast amounts of recorded speech. The programs analyze that data over and over, learning characteristics such as rhythm, stress, pitch and tone.
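As a rough illustration of what those characteristics look like in practice, the short Python sketch below uses the open-source librosa audio library to pull a pitch contour and a loudness contour out of a 10-second clip. The file name and the specific measurements are placeholders, not part of any real cloning system.

```python
# Sketch: measuring the kinds of characteristics a voice model learns,
# such as pitch (fundamental frequency) and loudness (energy) over time.
# Assumes the librosa and numpy packages; "sample_voice.wav" is a placeholder file.
import librosa
import numpy as np

# Load roughly ten seconds of audio at librosa's default sample rate
y, sr = librosa.load("sample_voice.wav", duration=10.0)

# Pitch contour: estimate the fundamental frequency frame by frame
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)

# Loudness contour: root-mean-square energy per frame
rms = librosa.feature.rms(y=y)[0]

print(f"average pitch: {np.nanmean(f0):.1f} Hz")   # nanmean skips unvoiced frames
print(f"pitch range:   {np.nanmin(f0):.1f}-{np.nanmax(f0):.1f} Hz")
print(f"average loudness (RMS): {rms.mean():.4f}")
```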

"It can look at 10 seconds of someone speaking and it has stored enough information about how humans speak with that kind of prosody and pitch. Enough information about how people speak with their processing pitch and its weights that it can just continue on," Betker said.

Imagine the trained AI model as a teacher and the cloning program as a student. When the student is asked to create a cloned voice, its first attempt starts off as white noise. The teacher scores how close that attempt is to sounding correct, and the student tries again and again based on those scores until it produces something close to what the teacher wants.
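Here is a toy numeric version of that loop, sketched in Python. The "teacher" is just a squared-error score against a stand-in target waveform, and the "student" is an array that starts as white noise. Every name and number is invented for illustration; this is not how TortoiseTTS or any production system is implemented.

```python
# Toy version of the teacher-and-student loop described above.
# The student starts from white noise; the teacher scores each attempt against
# a stand-in "target voice," and the student adjusts its guess to improve the score.
# Purely illustrative; real models never see the target waveform this directly.
import numpy as np

rng = np.random.default_rng(0)
target = np.sin(np.linspace(0, 8 * np.pi, 400))   # stand-in for the voice being cloned
guess = rng.normal(size=target.shape)             # the student's first attempt: white noise

def teacher_score(attempt):
    """Lower is better: mean squared difference from the target."""
    return float(np.mean((attempt - target) ** 2))

learning_rate = 0.1
for step in range(100):
    # The slope of the score tells the student how to nudge each sample
    gradient = guess - target          # derivative of the squared error, up to a constant
    guess = guess - learning_rate * gradient
    if step % 25 == 0:
        print(f"step {step:3d}  score {teacher_score(guess):.4f}")

print(f"final score: {teacher_score(guess):.6f}")  # approaches 0 as the guess nears the target
```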

While that analogy is extremely simplified, the underlying idea holds: a cloned voice is generated bit by bit, based on probability distributions.
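An equally simplified picture of that bit-by-bit generation, again in Python: real systems predict probability distributions over thousands of learned audio tokens, while the four made-up "tokens" and their probabilities below exist only to show the sample-then-continue loop.

```python
# Toy picture of generating speech "bit by bit" from probability distributions.
# The tokens and probabilities are invented; real models use learned audio tokens.
import numpy as np

rng = np.random.default_rng(1)
tokens = ["quiet", "rising_pitch", "stressed", "pause"]

def next_token_distribution(history):
    """Stand-in for a trained model: assigns a probability to each possible next token.
    This fake version simply favors repeating whatever came last, echoing the
    'just continue on' idea Betker describes."""
    probs = np.full(len(tokens), 0.1)
    if history:
        probs[tokens.index(history[-1])] = 0.7
    return probs / probs.sum()

history = ["rising_pitch"]                 # stand-in for the short sample the model is given
for _ in range(8):
    probs = next_token_distribution(history)
    choice = rng.choice(tokens, p=probs)   # sample the next piece of "speech"
    history.append(str(choice))

print(" -> ".join(history))
```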

"I think, at its core, it's pretty simple," Betker said. "I think the analogy of just continuing with what you're given will take you pretty far here."

Some AI models currently claim to need only two seconds of sample audio. While those results are not convincing yet, Betker says future models will need even less audio to create a convincing clone.