Most text-to-speech models that sound good are massive: hundreds of millions or billions of parameters. I wanted to see how far I could get with something much smaller. SoPro is 169M parameters and can do zero-shot voice cloning from a few seconds of reference audio.
Zero-shot cloning means you give the model a short audio clip of someone's voice, and it can generate new speech in that voice without any fine-tuning. The model extracts a speaker embedding from the reference clip and uses it to condition the synthesis. It's not perfect — it captures the general character of a voice more than exact quirks — but it's surprisingly usable.
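For readers who haven't seen this pattern, the conditioning step looks roughly like the sketch below: a small encoder pools a mel-spectrogram of the reference clip into a fixed-size vector, and that vector is reused to condition every synthesis step. The class, layer choices, and dimensions here are hypothetical stand-ins, not SoPro's actual modules.

```python
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Pools a reference mel-spectrogram into a fixed-size speaker embedding.
    Hypothetical sketch: the layers and sizes are illustrative, not SoPro's."""

    def __init__(self, n_mels: int = 80, embed_dim: int = 256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, embed_dim, batch_first=True)
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, ref_mel: torch.Tensor) -> torch.Tensor:
        # ref_mel: (batch, frames, n_mels) from a few seconds of reference audio
        _, hidden = self.rnn(ref_mel)           # (1, batch, embed_dim)
        embed = self.proj(hidden.squeeze(0))    # (batch, embed_dim)
        # L2-normalize so the vector encodes voice identity, not clip loudness or length
        return embed / embed.norm(dim=-1, keepdim=True)

# Usage: extract once from the reference clip, then condition the decoder with it
ref_mel = torch.randn(1, 300, 80)           # ~3 s of reference audio as mel frames
speaker_embed = SpeakerEncoder()(ref_mel)   # (1, 256), reused for every chunk of text
```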
The streaming architecture was non-negotiable. Nobody wants to wait 10 seconds for a sentence to generate before hearing anything. SoPro starts outputting audio as soon as it has enough to play, so there's minimal latency between submitting text and hearing speech. This was tricky to implement because you're essentially running the model in chunks while maintaining coherence across chunk boundaries.
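To make that concrete, here is a rough sketch of what the streaming loop looks like from the caller's side: the text is split into pieces, each piece is synthesized as soon as it's ready, and a short crossfade smooths the seam between consecutive chunks. `synthesize_chunk` is a stand-in for the actual model call, and the waveform crossfade is a simplification of the real coherence handling, not SoPro's literal code.

```python
import re
import numpy as np

def stream_tts(text: str, synthesize_chunk, crossfade_samples: int = 256):
    """Yield audio incrementally instead of waiting for the full utterance.

    `synthesize_chunk` maps a text piece to a float32 waveform (numpy array);
    chunks are assumed to be longer than the crossfade window.
    """
    # Naive split at sentence boundaries; a real system would chunk more carefully.
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    tail = None  # overlap carried across chunk boundaries

    for piece in pieces:
        audio = synthesize_chunk(piece)
        if tail is not None:
            # Linear crossfade between the previous chunk's tail and this chunk's
            # head to avoid clicks at the seam.
            fade = np.linspace(0.0, 1.0, crossfade_samples, dtype=np.float32)
            audio[:crossfade_samples] = (1.0 - fade) * tail + fade * audio[:crossfade_samples]
        # Hold back the last few samples for the next seam, emit the rest now.
        tail = audio[-crossfade_samples:].copy()
        yield audio[:-crossfade_samples]

    if tail is not None:
        yield tail  # flush the final overlap once no more chunks follow
```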
The biggest tradeoff with a small model is that it struggles with longer sentences. For short to medium text, the quality is solid. But once you get past a paragraph, you start hearing artifacts — odd pauses, slight pitch drift, occasional garbled syllables. Larger models handle this better because they have more capacity for long-range dependencies.
I used PyTorch for the model and training pipeline. The architecture borrows ideas from several recent papers but isn't a direct reproduction of any one approach. I was mostly trying to find the best quality-to-size ratio, which meant a lot of experimentation with layer sizes, attention heads, and training schedules.
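To give a sense of what that experimentation looked like in practice, here is a hypothetical config of the kind of knobs involved; the numbers below are illustrative placeholders, not SoPro's real hyperparameters.

```python
from dataclasses import dataclass
import torch.nn as nn

@dataclass
class ModelConfig:
    """Knobs of the sort tuned when searching for a quality-to-size sweet spot.
    Values are illustrative guesses, not SoPro's actual settings."""
    d_model: int = 512      # hidden width of the decoder
    n_layers: int = 12      # number of transformer layers
    n_heads: int = 8        # attention heads per layer
    d_ff: int = 2048        # feed-forward width
    dropout: float = 0.1

def build_decoder(cfg: ModelConfig) -> nn.TransformerEncoder:
    # A plain transformer stack sized from the config; the size/quality search
    # amounts to sweeping these numbers and retraining.
    layer = nn.TransformerEncoderLayer(
        d_model=cfg.d_model, nhead=cfg.n_heads,
        dim_feedforward=cfg.d_ff, dropout=cfg.dropout, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=cfg.n_layers)

cfg = ModelConfig()
decoder = build_decoder(cfg)
n_params = sum(p.numel() for p in decoder.parameters())
print(f"{n_params / 1e6:.1f}M parameters in the decoder stack alone")
```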
English works best. I did some testing with other languages and the quality drops noticeably — it still generates speech, but the prosody and accent are less natural. This makes sense given the training data distribution.
What would I do differently? I'd invest more time in training data quality upfront. I spent a lot of compute training on noisy data before realizing that a smaller, cleaner dataset produces better results than a bigger, messier one. Classic machine learning lesson that I apparently needed to learn firsthand.