No off-the-shelf model speaks colloquial Bangla well, least of all the English code-switching students actually use (“আমি calculus পড়তে চাই”), so I built the whole speech pipeline myself. Nine Python scripts distill a single YouTube video into a 187-clip, 17.6-minute LJSpeech-format dataset through five quality gates, including a 3–12 s duration window, an RMS floor, a clipping ratio under 0.001, and a NaN check.
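A minimal sketch of how per-clip gates like these could be expressed, not the project's actual scripts. The duration window and the clipping limit come from the description above; `RMS_FLOOR`, `CLIP_LEVEL`, and the function name `failed_gates` are assumptions chosen for illustration, and samples are assumed to be floats in [-1, 1].

```python
import math

# Duration window and clipping limit are taken from the text above.
MIN_SEC, MAX_SEC = 3.0, 12.0
MAX_CLIP_RATIO = 0.001
# Assumed values, not from the source:
RMS_FLOOR = 0.01    # clips quieter than this are treated as near-silent
CLIP_LEVEL = 0.999  # |sample| at or above this counts as clipped


def failed_gates(samples, sample_rate):
    """Return the names of the gates a clip fails (empty list means keep it)."""
    if not samples:
        return ["duration"]  # a zero-length clip is outside the window
    if any(math.isnan(s) for s in samples):
        return ["nan"]  # the other statistics are meaningless with NaNs present

    failures = []
    duration = len(samples) / sample_rate
    if not MIN_SEC <= duration <= MAX_SEC:
        failures.append("duration")

    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms < RMS_FLOOR:
        failures.append("rms")

    clipped = sum(1 for s in samples if abs(s) >= CLIP_LEVEL)
    if clipped / len(samples) >= MAX_CLIP_RATIO:
        failures.append("clipping")

    return failures
```

Returning the list of failed gates rather than a single boolean makes it easy to count how many clips each gate rejects when tuning thresholds.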
XTTS-v2 has no Bengali support, so I transliterated transcripts to Devanagari and trained under the 'hi' language code — the closest supported Indic tokenizer. I served the result on a rented RTX 3090 at about $0.22/hour and closed the loop with objective eval: synthesize held-out sentences, round-trip through Whisper, and compute per-sentence WER. This is the experiment that taught me what production needs — and why Mr. Topper ultimately runs on Gemini TTS rather than a self-hosted model.
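The scoring half of that eval loop can be sketched as follows. The project computes WER with jiwer. This version is a dependency-free stand-in using word-level edit distance, and it assumes the Whisper transcripts have already been produced. The function names `word_error_rate` and `rank_worst` are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def rank_worst(pairs):
    """Score (reference, transcript) pairs and sort them worst first."""
    scored = [(word_error_rate(ref, hyp), ref) for ref, hyp in pairs]
    return sorted(scored, key=lambda item: item[0], reverse=True)
```

Sorting by per-sentence WER, rather than reporting only a corpus average, surfaces the specific sentences the voice handles worst.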
The hard part
Fine-tuning speech for a language the tools don't support
XTTS-v2 simply has no Bengali tokenizer. Rather than abandon it, I transliterated the training transcripts into Devanagari and trained under the Hindi ('hi') language code — the closest supported Indic script — which let the model learn Bengali phonetics through a tokenizer that already existed. Paired with a 5-gate data-cleaning pipeline and a Whisper round-trip WER eval, one YouTube video became a usable voice.
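The source does not say how the transliteration was implemented, so the following is only a naive sketch of one possible approach. It relies on the fact that the Unicode Bengali and Devanagari blocks share an ISCII-derived layout, so most letters sit at the same offset within their block. The function name and the single override entry are assumptions, and a real transliterator would need more special cases.

```python
# Bengali block starts at U+0980, Devanagari at U+0900; letters, vowel
# signs, and digits occupy matching positions in both blocks.
BENGALI_START, BENGALI_END = 0x0981, 0x09EF
BLOCK_OFFSET = 0x0980 - 0x0900

# Characters where the plain offset would give the wrong Devanagari
# character (illustrative, not exhaustive).
OVERRIDES = {
    "\u09CE": "\u0924\u094D",  # khanda ta -> ta + virama
}


def bengali_to_devanagari(text):
    """Map Bengali-script characters to Devanagari, leaving others untouched."""
    out = []
    for ch in text:
        if ch in OVERRIDES:
            out.append(OVERRIDES[ch])
        elif BENGALI_START <= ord(ch) <= BENGALI_END:
            out.append(chr(ord(ch) - BLOCK_OFFSET))
        else:
            out.append(ch)  # Latin text, digits, punctuation pass through
    return "".join(out)
```

Because non-Bengali characters pass through unchanged, English words inside a code-switched transcript survive the conversion as written.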
Highlights
- Built a 9-script pipeline (~1,660 LOC) distilling one YouTube video into a 187-clip, 17.6-minute LJSpeech set through 5 quality gates.
- Bridged XTTS-v2's missing Bengali support by transliterating transcripts to Devanagari and training under the 'hi' tokenizer.
- Ran a code-switch QA loop: an ASCII-ratio heuristic flags mixed clips, then Gemini post-corrects spelling while preserving dialectal forms.
- Patched SGLang-Omni's H200 defaults (mem fraction 0.85→0.55, chunked prefill 8192→2048) to serve Fish Speech S2 at ~16GB on a $0.22/hr RTX 3090.
- Closed the loop with objective eval: synthesize, round-trip through Whisper, compute per-sentence jiwer WER, rank worst performers.
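The ASCII-ratio heuristic from the code-switch QA bullet could look roughly like this. It is a sketch under assumptions: the thresholds `low` and `high` and both function names are hypothetical, and the ratio is computed over alphabetic characters only, so spaces, digits, and combining vowel signs are ignored.

```python
def ascii_letter_ratio(text):
    """Fraction of alphabetic characters in the text that are ASCII letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for ch in letters if ch.isascii()) / len(letters)


def is_code_switched(text, low=0.05, high=0.95):
    """Flag transcripts that mix scripts; thresholds are assumed values."""
    return low <= ascii_letter_ratio(text) <= high
```

A transcript that is entirely Bangla or entirely English falls outside the band and is not flagged, so only the mixed clips go on to the correction pass.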
Stack