Microsoft's VibeVoice uses AI to create 90-minute podcasts with multiple speakers

VibeVoice is a new open-source AI tool that can generate a full 90-minute, multi-speaker audio podcast from a simple script.

Senior Editor
TL;DR: Microsoft's open-source VibeVoice AI generates up to 90 minutes of multi-speaker, high-fidelity conversational audio using advanced text-to-speech technology. It leverages a Large Language Model and diffusion framework to maintain speaker consistency and natural dialogue flow, making it ideal for podcasts and long-form audio content.

Microsoft's new open-source text-to-speech generative AI tool, VibeVoice, can generate up to 90 minutes of audio with up to four distinct speakers. Given a script, that makes VibeVoice a viable tool for creating an audio podcast or other "expressive, long-form, multi-speaker conversational audio."
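To make the 90-minute and four-speaker limits concrete, here is a minimal Python sketch that checks a draft script against them before you hand it to a TTS tool. It is only an illustration: the "Speaker N: text" line format and the 150 words-per-minute speaking rate are assumptions for this example, not details confirmed by Microsoft's description, and the function names are hypothetical.

```python
import re
from typing import Dict, List, Tuple

MAX_SPEAKERS = 4          # speaker limit quoted in the article
MAX_MINUTES = 90.0        # length limit quoted in the article
WORDS_PER_MINUTE = 150.0  # assumption: a typical conversational pace

# Hypothetical line format for this sketch: "Speaker 1: Hello there"
LINE_PATTERN = re.compile(r"^\s*(Speaker \d+)\s*:\s*(.+)$")


def parse_script(text: str) -> List[Tuple[str, str]]:
    """Split a script into (speaker, utterance) turns, skipping blank lines."""
    turns = []
    for line in text.splitlines():
        if not line.strip():
            continue
        match = LINE_PATTERN.match(line)
        if match is None:
            raise ValueError(f"Unrecognized line: {line!r}")
        turns.append((match.group(1), match.group(2).strip()))
    return turns


def check_script(text: str) -> Dict[str, object]:
    """Report the speakers, a rough duration estimate, and whether both fit."""
    turns = parse_script(text)
    speakers = sorted({speaker for speaker, _ in turns})
    words = sum(len(utterance.split()) for _, utterance in turns)
    minutes = words / WORDS_PER_MINUTE
    return {
        "speakers": speakers,
        "estimated_minutes": minutes,
        "within_limits": len(speakers) <= MAX_SPEAKERS and minutes <= MAX_MINUTES,
    }
```

The duration figure is only a word-count estimate, so treat it as a sanity check rather than a prediction of the generated audio's actual length.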


With plenty of AI-powered Text-to-Speech (TTS) systems and tools already available, what separates VibeVoice from the pack is its ability to preserve audio fidelity, speaker consistency, and "natural turn-taking" over an extended period.

"VibeVoice employs a next-token diffusion framework, leveraging a Large Language Model (LLM) to understand textual context and dialogue flow, and a diffusion head to generate high-fidelity acoustic details," the official description reads. VibeVoice offers a live demo for you to check out, along with the option to download it.

As a pure Text-to-Speech (TTS) tool, VibeVoice requires a script to work, which you'll need to write yourself or generate with another AI tool like ChatGPT. VibeVoice is available in multiple versions: a compact 1.5-billion-parameter model and a larger 7-billion-parameter model. There's also a 0.5-billion-parameter model on the way, designed for real-time audio generation. On a modern GPU, the 1.5-billion-parameter version requires approximately 7GB of VRAM, while the 7-billion-parameter model requires around 18GB.
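As a rough guide to which version your hardware can handle, the VRAM figures above can be turned into a small lookup. The Python sketch below is an illustration only: the variant labels and the `pick_variant` helper are hypothetical names, and the numbers are the approximate figures quoted in this article rather than official minimums.

```python
from typing import Optional

# Approximate VRAM needs in GB, as quoted in the article; rough guides only.
VRAM_REQUIREMENTS_GB = {
    "1.5B": 7.0,
    "7B": 18.0,
}


def pick_variant(available_vram_gb: float) -> Optional[str]:
    """Return the largest variant whose quoted VRAM need fits, or None."""
    fitting = [
        (need, name)
        for name, need in VRAM_REQUIREMENTS_GB.items()
        if need <= available_vram_gb
    ]
    if not fitting:
        return None
    # max() compares the VRAM need first, so the most demanding fit wins.
    return max(fitting)[1]
```

For example, `pick_variant(8)` selects the 1.5B model, while a 24GB card would clear the quoted requirement for the 7B model. Actual memory use will vary with script length and settings, so leave some headroom.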

As for quality, VibeVoice's voices and conversation flow are impressive but still sound very much AI-generated. For more on VibeVoice, check out its GitHub repository and Hugging Face page.



Kosta is a veteran gaming journalist who cut his teeth on well-respected Aussie publications like PC PowerPlay and HYPER back when articles were printed on paper. A lifelong gamer since the 8-bit Nintendo era, he cemented his love for all things games and technology in the CD-ROM-powered 90s, from point-and-click adventure games to RTS games with full-motion video cut-scenes and FPS titles referred to as Doom clones, genres he still loves to this day. Kosta is also a musician, releasing dreamy electronic jams under the name Kbit.
