microsoft/VibeVoice

Open-Source Frontier Voice AI

34,899 Stars · 3,969 Forks · 202 Watchers · 146 Open Issues

Python · MIT License · Last commit Apr 2, 2026 · by @microsoft · Published April 2, 2026 · Analyzed 6d ago

Safety Rating A

No hardcoded secrets, malicious code patterns, suspicious dependencies, or prompt injection attempts were detected. The repository is a well-documented Microsoft Research open-source project under MIT license. The README itself includes a responsible-use disclaimer and notes that the TTS code was removed after discovering misuse cases (deepfakes), demonstrating active responsible-AI governance. No red flags were identified in the repository content provided.

AI-assisted review, not a professional security audit.

AI Analysis

VibeVoice is a family of open-source frontier voice AI models from Microsoft, covering Text-to-Speech (TTS), Automatic Speech Recognition (ASR), and real-time streaming TTS. The ASR model (7B) transcribes up to 60 minutes of long-form audio in a single pass, with speaker diarization, timestamps, and custom hotword support across 50+ languages. The TTS model (1.5B) generates up to 90 minutes of expressive multi-speaker conversational audio. The streaming TTS model (0.5B) provides real-time synthesis with ~300 ms first-audio latency. All models use continuous speech tokenizers running at 7.5 Hz and a next-token diffusion framework built on a large language model backbone.
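To give a sense of why a 7.5 Hz tokenizer makes hour-long audio tractable, the arithmetic can be sketched as below. This is a back-of-envelope calculation only: the helper name `frames_for` is hypothetical and not part of the VibeVoice codebase, and it counts tokenizer frames, not the full sequence the language model sees (text tokens, prompts, and any per-frame expansion are ignored).

```python
# Back-of-envelope frame counts for a 7.5 Hz speech tokenizer.
# `frames_for` is a hypothetical helper for illustration, not a VibeVoice API.

FRAME_RATE_HZ = 7.5  # tokenizer rate quoted in the analysis above


def frames_for(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Return the number of tokenizer frames covering `minutes` of audio."""
    seconds = minutes * 60
    return round(seconds * frame_rate_hz)


if __name__ == "__main__":
    # Durations taken from the stated ASR and TTS limits.
    for label, minutes in [("ASR, 60 min", 60), ("TTS, 90 min", 90)]:
        print(f"{label}: {frames_for(minutes):,} frames")
```

At this rate, 60 minutes of audio comes to 27,000 frames and 90 minutes to 40,500, which is on the order of what a long-context language model can take in a single pass.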

Use Cases

  • Long-form audio transcription with speaker diarization and timestamps
  • Podcast and multi-speaker dialogue synthesis
  • Real-time text-to-speech for voice interfaces and input methods
  • Multilingual speech recognition and synthesis (50+ languages)
  • Fine-tuning speech recognition models on custom domain data
  • Voice-powered input methods and accessibility tools

Tags

#voice #llm #streaming #fine-tuning #library #server #dataset #real-time
