Microsoft has released VibeVoice, a new open-source artificial intelligence (AI) model that lets users create podcasts and other audio — a counter to Google’s popular NotebookLM.
But there are notable differences. Microsoft’s text-to-speech model can generate four voices and up to 90 minutes of podcast-quality speech. NotebookLM can do two voices.
Additionally, VibeVoice reads and organizes text, while NotebookLM ingests documents and turns them into two-person podcasts. NotebookLM users can also query their documents and get summaries, according to tech firm Hugging Face.
That means VibeVoice doesn’t try to understand the text but rather performs it audibly, ostensibly to replace a recording studio.
VibeVoice is the latest offering in voice AI technology, which has been attracting venture capital funding.
In 2024, voice AI startups raised $2.1 billion, up eightfold from the prior year, according to market research firm CB Insights. There’s rising interest in voice shopping: A PYMNTS Intelligence report shows that 30.4% of Gen Z consumers already shop by voice every week, followed by millennials. For all ages, the average is 17.9% of consumers using voice to shop.
VibeVoice has 1.5 billion parameters, relatively small for a model capable of sustaining dialogue across multiple speakers.
It is built on Alibaba’s open-source Qwen2.5, a large language model that helps orchestrate natural turn-taking and contextually aware speech patterns during dialogues.
Microsoft claims this means VibeVoice can produce fluid conversations among four voices and yet maintain each voice’s distinct characteristics, even in longer conversations.
How to use VibeVoice
Potential research applications of VibeVoice include the following:
Prototyping podcasts and training content
- Creators could generate mock podcasts, panel discussions or training modules with multiple AI voices. Instead of hiring four voice actors to test dialogue flow, users can create a synthetic version in minutes using text.
Accessibility and education
- Educational material, textbooks or research papers could be turned into long-form audio with distinct narrators. This could help people who learn better by listening, or make dense material more engaging.
Game and media development
- Game developers or storytellers could use VibeVoice to prototype dialogue between characters. Because it handles four speakers, they can stage a full in-game conversation without recording sessions.
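For readers who want to try the text-first workflow described above, the following Python sketch shows one way to prepare a multi-speaker script before handing it to a speech model. It does not call VibeVoice itself. The `Turn` class, the `format_script` helper and the "Speaker N: text" line layout are assumptions made for illustration, and only the four-speaker limit comes from the reporting above.

```python
from dataclasses import dataclass

# The article reports that VibeVoice can generate up to four voices.
MAX_SPEAKERS = 4


@dataclass
class Turn:
    """One line of dialogue (hypothetical helper type, not part of VibeVoice)."""
    speaker: int  # 1-based speaker number
    text: str


def format_script(turns):
    """Render turns as 'Speaker N: text' lines.

    The line layout is an assumed convention for illustration only.
    """
    if not turns:
        raise ValueError("script has no turns")
    lines = []
    for turn in turns:
        if not 1 <= turn.speaker <= MAX_SPEAKERS:
            raise ValueError(
                f"speaker {turn.speaker} is outside 1..{MAX_SPEAKERS}"
            )
        # Collapse whitespace so each turn stays on a single line.
        text = " ".join(turn.text.split())
        if not text:
            raise ValueError("a turn has empty text")
        lines.append(f"Speaker {turn.speaker}: {text}")
    return "\n".join(lines)


if __name__ == "__main__":
    demo = [
        Turn(1, "Welcome back to the show."),
        Turn(2, "Thanks for having me."),
        Turn(3, "Let's talk about the new quest line."),
    ]
    print(format_script(demo))
```

Checking the speaker count up front catches a script that asks for a fifth voice before any audio is generated.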
Recognizing the risks of deepfakes, Microsoft said VibeVoice’s safeguards include ensuring every audio file includes both a disclaimer—such as “This segment was generated by AI”—and a hidden digital watermark.
Microsoft bars uses such as impersonation, disinformation and live deepfakes, including real-time voice conversion in calls. The model supports only English and Chinese speech for now. It is available for research, not commercial deployment.