Microsoft's new AI can clone anyone's voice with just a 3 second audio sample

Microsoft's newly developed artificial intelligence system is capable of cloning anyone's voice by listening to just a 3-second audio example.

Jak Connor

Tech and Science Editor

Published Jan 16, 2023 8:02 AM CST
Updated Feb 14, 2023 2:08 PM CST

A new artificial intelligence system developed by Microsoft can reportedly clone anyone's voice after listening to just a three-second audio sample.

The new AI is called VALL-E. According to a newly released paper, it is a text-to-speech synthesizer built on a neural codec language model. The paper says VALL-E can learn a specific voice and then synthesize it saying whatever text is desired. It also claims that VALL-E can produce a voice identical to the sample it was given while retaining the same or a similar emotional tone as the original speech, something other AI synthesizers struggle to do.

The creators of the AI system believe it could power text-to-speech applications, speech editing, and audio content creation when combined with other generative language models, such as OpenAI's immensely popular ChatGPT. Notably, the speech editing they describe could include taking a three-second audio sample of an individual's voice and making them say something they never said. Listen to examples of VALL-E here.

"We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work.
During the pre-training stage, we scale up the TTS training data to 60K hours of English speech, which is hundreds of times larger than existing systems. VALL-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt.
Experiment results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find VALL-E could preserve the speaker's emotion and acoustic environment of the acoustic prompt in synthesis," the report states.
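
For readers curious what "TTS as a conditional language modeling task" might look like in practice, here is a toy Python sketch. It is not Microsoft's code: the function names, the tiny vocabulary, and the stand-in "model" are all hypothetical, and a real system would use a trained neural network and a real audio codec. The sketch only shows the general shape of the idea, which is to generate discrete audio codes one at a time, each conditioned on the text and on codes from a short voice prompt.

```python
import random


def toy_next_code_probs(context, vocab_size):
    """Hypothetical stand-in for a trained model.

    Returns a probability distribution over the next discrete audio code,
    computed from the context by simple arithmetic instead of a network.
    """
    shift = sum(context)
    weights = [((shift + 7 * k) % vocab_size) + 1 for k in range(vocab_size)]
    total = sum(weights)
    return [w / total for w in weights]


def synthesize_codes(text_tokens, prompt_codes, n_new, vocab_size=8, seed=0):
    """Sample n_new audio codes, conditioned on text and a voice prompt.

    text_tokens: integers standing in for the text to be spoken (assumed).
    prompt_codes: integers standing in for codec codes of the short
        voice sample (assumed).
    """
    rng = random.Random(seed)
    # The conditioning context: what to say, plus how the speaker sounds.
    context = list(text_tokens) + list(prompt_codes)
    generated = []
    for _ in range(n_new):
        probs = toy_next_code_probs(context, vocab_size)
        code = rng.choices(range(vocab_size), weights=probs, k=1)[0]
        generated.append(code)
        context.append(code)  # each new code conditions the next one
    return generated


if __name__ == "__main__":
    print(synthesize_codes([1, 2, 3], [4, 5], n_new=10))
```

In a real pipeline, the generated codes would then be passed to the codec's decoder to produce an audio waveform; that step is omitted here.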
