SoPro
A lightweight text-to-speech model with zero-shot voice cloning. 169M parameters, streams audio in real time.
Overview
SoPro is a compact text-to-speech model that can clone a voice from a short audio sample, with no fine-tuning needed. At 169M parameters, it's small enough to run efficiently while still producing natural-sounding speech, and it streams audio in real time.
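To put the 169M figure in perspective, here is a rough back-of-the-envelope estimate of how much memory the weights alone would take. The storage formats listed are assumptions for illustration (the write-up doesn't say which precision SoPro uses), and the estimate ignores activations, caches, and runtime overhead.

```python
PARAMS = 169_000_000  # parameter count stated in the overview

# Bytes per parameter for common storage formats.
# Assumed for illustration; SoPro's actual precision is not stated.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_megabytes(params: int, bytes_per_param: int) -> float:
    """Size of the raw weights in megabytes (1 MB = 10**6 bytes)."""
    return params * bytes_per_param / 1e6


if __name__ == "__main__":
    for dtype, size in BYTES_PER_PARAM.items():
        print(f"{dtype}: {weight_megabytes(PARAMS, size):.0f} MB")
```

Even at full 32-bit precision the weights stay well under a gigabyte, which is what makes "reasonable hardware" a realistic target.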
Challenge
The hard part was getting decent voice quality from a small model. Most good TTS models are huge. I wanted something that could run on reasonable hardware and still sound natural, especially for zero-shot cloning, where you only get a few seconds of reference audio.
Approach
I focused on efficient architecture design to pack as much quality into 169M parameters as possible. I built streaming audio synthesis so speech starts playing before generation is complete. The zero-shot cloning pipeline extracts a speaker embedding from a short reference clip and uses it to condition the generated voice.
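The shape of that pipeline can be sketched in a few lines. This is a toy illustration, not SoPro's actual code: `speaker_embedding` and `synthesize_stream` are hypothetical names, the "encoder" and "acoustic model" are placeholders built from simple math, and the sample rate and chunk size are assumed values. What it shows is the control flow: compute one embedding from the reference clip, then yield audio chunk by chunk so playback can begin before the whole utterance exists.

```python
import math
from typing import Iterator, List

SAMPLE_RATE = 16_000   # assumed sample rate, not SoPro's documented value
CHUNK_SAMPLES = 1_600  # 100 ms per chunk at the assumed rate


def speaker_embedding(reference: List[float], dim: int = 4) -> List[float]:
    """Toy stand-in for a speaker encoder.

    Returns the L2-normalized mean absolute amplitude over `dim` equal
    time segments of the reference clip. A real encoder is a neural network.
    """
    if not reference:
        raise ValueError("reference clip is empty")
    seg_len = max(1, len(reference) // dim)
    feats = []
    for i in range(dim):
        seg = reference[i * seg_len:(i + 1) * seg_len] or [0.0]
        feats.append(sum(abs(x) for x in seg) / len(seg))
    norm = math.sqrt(sum(f * f for f in feats)) or 1.0
    return [f / norm for f in feats]


def synthesize_stream(text: str, speaker: List[float]) -> Iterator[List[float]]:
    """Yield one fixed-size audio chunk per character as soon as it is ready."""
    gain = sum(speaker) / len(speaker)  # placeholder for speaker conditioning
    for ch in text:
        freq = 200.0 + (ord(ch) % 32) * 10.0  # placeholder "acoustic model"
        yield [
            gain * math.sin(2 * math.pi * freq * n / SAMPLE_RATE)
            for n in range(CHUNK_SAMPLES)
        ]


if __name__ == "__main__":
    # Three seconds of fake reference audio standing in for a real clip.
    reference = [math.sin(0.01 * n) for n in range(SAMPLE_RATE * 3)]
    embedding = speaker_embedding(reference)

    played = 0
    for chunk in synthesize_stream("hello", embedding):
        # A real player would write the chunk to the audio device here.
        played += len(chunk)
    print(f"streamed {played / SAMPLE_RATE:.1f} s of audio")
```

Because `synthesize_stream` is a generator, the consumer gets the first 100 ms chunk after a single step of work instead of waiting for the full text to be processed.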
Outcome
Streaming turned out to be crucial for usability: waiting for full generation kills the experience. Voice quality is surprisingly good for the model size, especially on English speech. It's been a good exploration of the tradeoffs between model size and output quality.