ML / TRAINING · 2026

Bengali TTS: a custom voice model from one video

I trained a custom Bengali speech model from a single YouTube video, end to end.

187 clips / 17.6 min
1 YouTube video
~16 GB of 24 GB VRAM
$0.22/hr serving

No off-the-shelf model speaks colloquial Bangla well, especially the English code-switching students actually use (“আমি calculus পড়তে চাই”), so I built the whole speech pipeline myself. Nine Python scripts distill a single YouTube video into a 187-clip, 17.6-minute LJSpeech-format dataset through five quality gates, including a 3–12 s duration window, an RMS floor, a clipping ratio under 0.001, and a NaN check.
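
As a rough illustration of the gate criteria named above (not the project's actual script), a per-clip check could look like the sketch below. The duration window and the 0.001 clipping ratio come from the write-up; the RMS floor value, the level at which a sample counts as clipped, and the name `failed_gates` are assumptions made for the example.

```python
import math
from typing import List, Sequence

MIN_SECONDS = 3.0       # from the write-up
MAX_SECONDS = 12.0      # from the write-up
MAX_CLIP_RATIO = 0.001  # from the write-up
RMS_FLOOR = 0.01        # assumed value; the write-up does not state it
CLIP_LEVEL = 0.999      # assumed: |sample| at or above this counts as clipped


def failed_gates(samples: Sequence[float], sample_rate: int) -> List[str]:
    """Return the names of the gates a clip fails (empty list = clip passes).

    `samples` is mono audio scaled to [-1.0, 1.0].
    """
    if len(samples) == 0:
        return ["empty"]
    # NaN check first: a single NaN would poison the RMS computation.
    if any(math.isnan(s) for s in samples):
        return ["nan"]

    failures = []
    duration = len(samples) / sample_rate
    if not (MIN_SECONDS <= duration <= MAX_SECONDS):
        failures.append("duration")

    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms < RMS_FLOOR:
        failures.append("rms")

    clipped = sum(1 for s in samples if abs(s) >= CLIP_LEVEL)
    if clipped / len(samples) >= MAX_CLIP_RATIO:
        failures.append("clipping")
    return failures
```

Returning the list of failed gates, rather than a single boolean, makes it easy to count how many clips each gate rejects when tuning thresholds.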

XTTS-v2 has no Bengali support, so I transliterated transcripts to Devanagari and trained under the 'hi' language code — the closest supported Indic tokenizer. I served the result on a rented RTX 3090 at about $0.22/hour and closed the loop with objective eval: synthesize held-out sentences, round-trip through Whisper, and compute per-sentence WER. This is the experiment that taught me what production needs — and why Mr. Topper ultimately runs on Gemini TTS rather than a self-hosted model.
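
The scoring half of that eval loop can be sketched without any dependencies. The project used jiwer for WER; the version below is a plain word-level edit distance that computes the same metric, and it assumes the Whisper transcripts already exist as strings. The names `word_error_rate` and `rank_worst` are hypothetical.

```python
from typing import List, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def rank_worst(pairs: List[Tuple[str, str]]) -> List[Tuple[float, str, str]]:
    """Score (reference, transcript) pairs and sort worst-first."""
    scored = [(word_error_rate(ref, hyp), ref, hyp) for ref, hyp in pairs]
    return sorted(scored, key=lambda item: item[0], reverse=True)
```

Sorting worst-first puts the sentences the model handles badly at the top, which is where listening time is best spent.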

The hard part

Fine-tuning speech for a language the tools don't support

XTTS-v2 has no Bengali tokenizer. Rather than abandon it, I transliterated the training transcripts into Devanagari and trained under the Hindi ('hi') language code, the closest supported Indic language, which let the model learn Bengali phonetics through a tokenizer that already existed. Paired with a five-gate data-cleaning pipeline and a Whisper round-trip WER eval, this turned one YouTube video into a usable voice.
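
To make the transliteration idea concrete, here is a naive sketch. It is not necessarily how the project did it; a dedicated transliteration library may well have been used. It relies on the fact that the Unicode Bengali block (U+0980 onward) and the Devanagari block (U+0900 onward) share a parallel layout, so most letters, vowel signs, and digits map across by subtracting 0x80. The function name and the special-case table are assumptions, and a few rarely used code points in the range do not line up and would need their own entries.

```python
BENGALI_FIRST = 0x0981   # candrabindu
BENGALI_LAST = 0x09EF    # digit nine
BLOCK_OFFSET = 0x0980 - 0x0900  # distance between the two Unicode blocks

# Characters whose parallel slot in Devanagari is not the right letter.
# Assumed, incomplete list for illustration.
SPECIAL_CASES = {
    "\u09ce": "\u0924\u094d",  # khanda ta -> Devanagari ta + virama
}


def bengali_to_devanagari(text: str) -> str:
    """Map Bengali-script characters to Devanagari; leave everything else."""
    out = []
    for ch in text:
        if ch in SPECIAL_CASES:
            out.append(SPECIAL_CASES[ch])
        elif BENGALI_FIRST <= ord(ch) <= BENGALI_LAST:
            out.append(chr(ord(ch) - BLOCK_OFFSET))
        else:
            # Latin letters, spaces, and punctuation pass through untouched,
            # so code-switched English words survive as-is.
            out.append(ch)
    return "".join(out)


print(bengali_to_devanagari("আমি calculus"))  # prints: आमि calculus
```

A character-level mapping like this changes the script, not the pronunciation rules, so the model still has to learn Bengali phonetics from the audio.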

Highlights

  • Built a 9-script pipeline (~1,660 LOC) distilling one YouTube video into a 187-clip, 17.6-minute LJSpeech set through 5 quality gates.
  • Bridged XTTS-v2's missing Bengali support by transliterating transcripts to Devanagari and training under the 'hi' tokenizer.
  • Ran a code-switch QA loop: an ASCII-ratio heuristic flags mixed clips, then Gemini post-corrects spelling while preserving dialectal forms.
  • Patched SGLang-Omni's H200 defaults (mem fraction 0.85→0.55, chunked prefill 8192→2048) to serve Fish Speech S2 at ~16 GB on a $0.22/hr RTX 3090.
  • Closed the loop with objective eval: synthesize, round-trip through Whisper, compute per-sentence jiwer WER, rank worst performers.
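
The ASCII-ratio heuristic from the code-switch QA bullet can be sketched in a few lines. The write-up does not give the exact definition or thresholds, so the ratio used here (ASCII letters over non-whitespace characters), the `low` and `high` cutoffs, and the function names are all assumptions.

```python
import string

ASCII_LETTERS = set(string.ascii_letters)


def ascii_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are ASCII letters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c in ASCII_LETTERS for c in chars) / len(chars)


def is_code_switched(text: str, low: float = 0.05, high: float = 0.95) -> bool:
    """Flag transcripts that mix scripts: neither almost all ASCII nor almost none.

    The thresholds are illustrative defaults, not values from the project.
    """
    return low <= ascii_ratio(text) <= high


print(is_code_switched("আমি calculus পড়তে চাই"))  # prints: True
print(is_code_switched("আমি পড়তে চাই"))           # prints: False
```

Flagged transcripts would then go on to the Gemini post-correction step, which this sketch does not cover.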

Stack

Python, PyTorch, Coqui XTTS-v2, Whisper, Silero VAD, Demucs, Fish Speech S2, RunPod RTX 3090