Microsoft's new AI can clone anyone's voice with just a 3-second audio sample

Microsoft's newly developed artificial intelligence system is capable of cloning anyone's voice by listening to just a 3-second audio example.

Tech and Science Editor
Voice: Jak Connor

A new artificial intelligence system developed by Microsoft can reportedly clone anyone's voice after listening to just a three-second audio sample.

The new AI is called VALL-E. According to a newly released paper, the system is a neural codec language model that works as a text-to-speech synthesizer: it learns a specific voice and then synthesizes that voice saying whatever text is supplied. The paper also claims that VALL-E can produce a voice that matches the sample it was given while retaining the same or a similar emotional tone, something other AI synthesizers struggle to do.

The creators of the AI system believe it could power text-to-speech applications, speech editing, and audio content creation when combined with other generative AI models, such as OpenAI's immensely popular ChatGPT. Notably, the speech-editing use case would involve taking a three-second audio sample of an individual's voice and making them say something they never said. Audio examples of VALL-E are available online.

"We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work.

During the pre-training stage, we scale up the TTS training data to 60K hours of English speech, which is hundreds of times larger than existing systems. VALL-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt.

Experiment results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find VALL-E could preserve the speaker's emotion and acoustic environment of the acoustic prompt in synthesis," the report states.



