A new artificial intelligence system developed by Microsoft can reportedly clone a person's voice after listening to just a three-second audio sample.
The new AI is called VALL-E. According to a newly released paper, the system is a neural codec language model that works as a text-to-speech synthesizer: it learns a specific voice from a short sample and then synthesizes that voice saying whatever text it is given. The paper also claims that VALL-E can produce a voice that closely matches the sample while retaining the same or a similar emotional tone, something other AI synthesizers struggle to do.
The creators of the AI system believe it could power text-to-speech applications, speech editing, and audio content creation when combined with other generative language models, such as OpenAI's immensely popular ChatGPT. Notably, the speech-editing use case would include taking a three-second sample of an individual's voice and making them say something they never said.
"We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work.
During the pre-training stage, we scale up the TTS training data to 60K hours of English speech, which is hundreds of times larger than existing systems. VALL-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt.
Experiment results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find VALL-E could preserve the speaker's emotion and acoustic environment of the acoustic prompt in synthesis," the report states.
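To make the "conditional language modeling" framing in the abstract more concrete, here is a minimal Python sketch. It is not Microsoft's implementation: the codebook size, the `EOS` id, `toy_model`, and `synthesize_codes` are all hypothetical names invented for illustration, and the "model" is a hand-written stand-in rather than a trained network. It only shows the general shape of the idea, in which audio is represented as discrete codec codes and new codes are sampled one at a time, conditioned on the text and on the codes from a short acoustic prompt.

```python
import random
from typing import Callable, List, Sequence

CODEBOOK_SIZE = 8    # tiny, hypothetical codebook; real codecs use far more entries
EOS = CODEBOOK_SIZE  # hypothetical end-of-sequence id, placed just past the codebook

Distribution = List[float]
NextCodeModel = Callable[[Sequence[int], Sequence[int]], Distribution]


def toy_model(text_tokens: Sequence[int], codes: Sequence[int]) -> Distribution:
    """Stand-in for a trained network: P(next code | text, previous codes)."""
    weights = [1.0] * (CODEBOOK_SIZE + 1)
    # Favor one code that depends on both the text and the most recent code,
    # so the output is visibly conditioned on both inputs.
    last = codes[-1] if codes else 0
    favored = (sum(text_tokens) + last) % CODEBOOK_SIZE
    weights[favored] += 5.0
    total = sum(weights)
    return [w / total for w in weights]


def synthesize_codes(
    model: NextCodeModel,
    text_tokens: Sequence[int],
    prompt_codes: Sequence[int],
    max_new_codes: int = 20,
    seed: int = 0,
) -> List[int]:
    """Autoregressively sample codec codes that continue the acoustic prompt."""
    rng = random.Random(seed)
    codes = list(prompt_codes)  # the short enrolled recording, already encoded
    generated: List[int] = []
    for _ in range(max_new_codes):
        probs = model(text_tokens, codes)
        next_code = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        if next_code == EOS:
            break
        codes.append(next_code)
        generated.append(next_code)
    return generated


# Hypothetical inputs: token ids for the text, and codes for the voice prompt.
new_codes = synthesize_codes(toy_model, text_tokens=[3, 1, 4], prompt_codes=[2, 7, 5])
print(new_codes)
```

In a real system, a codec decoder would then turn the generated codes back into an audio waveform; that step is omitted here.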