SoPro

A lightweight text-to-speech model with zero-shot voice cloning. 169M parameters, streams audio in real time.

2024

Python, PyTorch, Audio Synthesis, Streaming

Overview

SoPro is a compact text-to-speech model that clones a voice from a short audio sample, with no fine-tuning needed. At 169M parameters, it's small enough to run efficiently while still producing natural-sounding speech and streaming it in real time.
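To put the 169M figure in perspective, here is a back-of-the-envelope estimate of how much memory the weights alone would take at common numeric precisions. The precision SoPro actually ships in is not stated here, so the dtypes below are illustrative assumptions, and the numbers exclude activations, caches, and framework overhead.

```python
# Rough weight-memory estimate for a 169M-parameter model.
# Weights only: activations, caches, and runtime overhead come on top.
PARAMS = 169_000_000

# Standard storage sizes per parameter; which one SoPro uses is an assumption.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_megabytes(num_params: int, dtype: str) -> float:
    """Return the size of the weights in megabytes (1 MB = 10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_megabytes(PARAMS, dtype):.0f} MB")
# float32: 676 MB
# float16: 338 MB
# int8: 169 MB
```

Even at full 32-bit precision the weights stay well under 1 GB, which is what makes running on modest hardware plausible.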

Challenge

Getting decent voice quality from a small model. Most good TTS models are huge. I wanted something that could run on reasonable hardware and still sound natural, especially for zero-shot cloning, where the model only gets a few seconds of reference audio.

Approach

Focused on efficient architecture design to pack as much quality into 169M parameters as possible. Built streaming audio synthesis so speech starts playing before the full generation is complete. The zero-shot cloning pipeline extracts speaker embeddings from short reference clips.
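The two ideas above, streaming and speaker embeddings, can be sketched in a few lines of plain Python. This is a toy illustration, not SoPro's code: `fake_decoder_step` is a hypothetical stand-in for a model decoding step, the sample rate and chunk size are assumed values, and mean-pooling with L2 normalization is one common way to turn frame-level features into a fixed-size speaker vector, which may not be what SoPro does.

```python
import math
from typing import Iterator, List

SAMPLE_RATE = 16_000   # assumed value, not taken from SoPro
CHUNK_SAMPLES = 4_000  # 250 ms of audio per chunk at the assumed rate


def fake_decoder_step(chunk_index: int) -> List[float]:
    """Hypothetical stand-in for one decoding step: returns a sine-wave chunk."""
    start = chunk_index * CHUNK_SAMPLES
    return [
        math.sin(2 * math.pi * 220 * (start + n) / SAMPLE_RATE)
        for n in range(CHUNK_SAMPLES)
    ]


def synthesize_stream(text: str) -> Iterator[List[float]]:
    """Yield audio chunk by chunk instead of returning one full waveform."""
    num_chunks = max(1, len(text.split()))  # toy rule: one chunk per word
    for i in range(num_chunks):
        # The caller can start playback as soon as this chunk is yielded.
        yield fake_decoder_step(i)


def speaker_embedding(frames: List[List[float]]) -> List[float]:
    """Mean-pool frame-level features over time, then L2-normalize."""
    dim = len(frames[0])
    mean = [sum(frame[d] for frame in frames) / len(frames) for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0  # guard against zero vector
    return [x / norm for x in mean]


stream = synthesize_stream("hello streaming world")
first_chunk = next(stream)     # ready after a single decoding step
remaining = list(stream)       # the rest is generated while audio plays
print(len(first_chunk), len(remaining))
```

The generator is the key design choice: time to first audio depends on one decoding step rather than on the length of the whole utterance.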

Outcome

Streaming turned out to be crucial for usability: waiting for full generation kills the experience. Voice quality is surprisingly good for the model size, especially on English speech. It's been a good exploration of the tradeoffs between model size and output quality.