Microsoft's new open-source text-to-voice generative AI tool, VibeVoice, stands out because it can generate up to 90 minutes of audio with up to four distinct speakers. Given a script, that makes VibeVoice a viable tool for creating an audio podcast or other "expressive, long-form, multi-speaker conversational audio."

There are already quite a few AI-powered Text-to-Speech (TTS) systems and tools. What separates VibeVoice from the pack is its ability to preserve audio fidelity, speaker consistency, and "natural turn-taking" over an extended period.
"VibeVoice employs a next-token diffusion framework, leveraging a Large Language Model (LLM) to understand textual context and dialogue flow, and a diffusion head to generate high-fidelity acoustic details," the official description reads. There is a live demo to try, and the model is also available to download.
As a pure Text-to-Speech (TTS) tool, VibeVoice requires a script, which you'll need to write yourself or generate with another AI tool such as ChatGPT. VibeVoice is available in multiple versions: a compact 1.5 billion-parameter model and a larger 7 billion-parameter model. A 0.5 billion-parameter model designed for real-time audio generation is also on the way. On a modern GPU, the 1.5 billion-parameter version requires approximately 7GB of VRAM, while the 7 billion-parameter model requires around 18GB.
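For a rough sense of which version fits a given graphics card, the VRAM figures above can be turned into a quick check. This is only an illustrative sketch: the function name `pick_variant` and the variant labels are made up for this example and are not part of VibeVoice, and the thresholds are the approximate numbers quoted here, not official minimums.

```python
from typing import Optional

# Approximate VRAM needs in GB, taken from the figures quoted in this article.
# These are rough estimates, not official requirements.
APPROX_VRAM_GB = {"1.5B": 7.0, "7B": 18.0}


def pick_variant(available_vram_gb: float) -> Optional[str]:
    """Return the largest variant whose approximate VRAM need fits, or None.

    `pick_variant` is a hypothetical helper for illustration only.
    """
    fitting = [
        (need, name)
        for name, need in APPROX_VRAM_GB.items()
        if need <= available_vram_gb
    ]
    if not fitting:
        return None
    # max() compares on the VRAM need first, so the largest fitting model wins.
    return max(fitting)[1]
```

By these numbers, a card with 8GB of VRAM would land on the 1.5 billion-parameter model, while a 24GB card has room for the 7 billion-parameter one.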
As for the quality of VibeVoice, the voices and conversation flow, although impressive, still sound very much like those of an AI. For more on VibeVoice, check out its GitHub repository and Hugging Face page.



