Microsoft's VibeVoice uses AI to create 90-minute podcasts with multiple speakers

VibeVoice is a new open-source AI tool that can generate a full 90-minute audio podcast recording with multiple speakers from a simple script.


Kosta Andreadis

Senior Editor

Published Sep 2, 2025 9:32 PM CDT


TL;DR: Microsoft's open-source VibeVoice AI generates up to 90 minutes of multi-speaker, high-fidelity conversational audio using advanced text-to-speech technology. It leverages a Large Language Model and diffusion framework to maintain speaker consistency and natural dialogue flow, making it ideal for podcasts and long-form audio content.


Microsoft's new open-source text-to-speech generative AI tool, VibeVoice, can generate up to 90 minutes of audio with up to four distinct speakers. Given a script, that makes it a viable tool for creating an audio podcast or other "expressive, long-form, multi-speaker conversational audio."


Plenty of AI-powered Text-to-Speech (TTS) systems and tools already exist; what separates VibeVoice from the pack is its ability to preserve audio fidelity, speaker consistency, and "natural turn-taking" over an extended period.

"VibeVoice employs a next-token diffusion framework, leveraging a Large Language Model (LLM) to understand textual context and dialogue flow, and a diffusion head to generate high-fidelity acoustic details," the official description reads. VibeVoice offers a live demo for you to check out, along with the option to download it.


As a pure Text-to-Speech (TTS) tool, VibeVoice requires a script to work, which you'll need to write yourself or generate with another AI tool like ChatGPT. VibeVoice is available in multiple versions: a compact 1.5-billion-parameter model and a larger 7-billion-parameter model. There's also a 0.5-billion-parameter model on the way, designed for real-time audio generation. On a modern GPU, the 1.5-billion-parameter version requires approximately 7GB of VRAM, while the larger 7-billion-parameter model requires around 18GB.
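Those VRAM figures lend themselves to a quick check before downloading anything. As a minimal sketch in Python, assuming only the approximate numbers quoted above, here is how you might pick the largest variant that fits a given GPU. The dictionary keys and the function name pick_variant are hypothetical labels for illustration, not part of VibeVoice itself.

```python
from typing import Optional

# Approximate VRAM needs in GB, taken from the figures quoted in this article.
# The keys are illustrative labels, not official model identifiers.
VRAM_REQUIREMENTS_GB = {
    "1.5B": 7.0,
    "7B": 18.0,
}


def pick_variant(available_vram_gb: float) -> Optional[str]:
    """Return the largest variant whose quoted VRAM need fits, or None."""
    fitting = [
        (need, name)
        for name, need in VRAM_REQUIREMENTS_GB.items()
        if need <= available_vram_gb
    ]
    if not fitting:
        return None
    # The variant with the highest VRAM need that still fits is the largest one.
    return max(fitting)[1]


if __name__ == "__main__":
    for vram in (6, 8, 16, 24):
        print(f"{vram} GB -> {pick_variant(vram)}")
```

Since the quoted requirements are approximate, treat the result as a rough guide and leave some headroom for the rest of your system.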

As for quality, VibeVoice's voices and conversation flow, although impressive, still sound very much AI-generated. For more on VibeVoice, check out its GitHub repository and Hugging Face page.