Bengali TTS — Naimuzzaman Shuvo

My current Bengali TTS approach skips training entirely. I deploy Fish-Speech S2 on a rented RunPod RTX 3090, serve it through SGLang-Omni behind an OpenAI-style /v1/audio/speech API, and adapt the voice zero-shot through reference-clip cloning: a single self-recorded Bengali clip and its exact transcript are passed with every request.
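
A request to such an endpoint could be assembled roughly as follows. This is a sketch under assumptions, not the server's confirmed schema: `model`, `input`, and `response_format` mirror the OpenAI speech API, while `reference_audio`, `reference_text`, and the helper names `buildSpeechBody` and `synthesize` are hypothetical.

```typescript
// Sketch of a request to an OpenAI-style speech endpoint with a cloning reference.
// ASSUMPTION: `reference_audio` and `reference_text` are hypothetical field names,
// and the model id is a caller-supplied placeholder; check the server's real schema.
interface VoiceReference {
  audioBase64: string; // the self-recorded clip, base64-encoded
  transcript: string;  // exact transcript of that clip
}

interface SpeechRequestBody {
  model: string;
  input: string;
  response_format: string;
  reference_audio: string;
  reference_text: string;
}

function buildSpeechBody(text: string, ref: VoiceReference, model: string): SpeechRequestBody {
  if (text.trim().length === 0) throw new Error("input text is empty");
  return {
    model,
    input: text,
    response_format: "wav",
    reference_audio: ref.audioBase64,
    reference_text: ref.transcript,
  };
}

// Sends the body and returns the raw audio bytes. The reference travels with
// every call because zero-shot cloning keeps no per-voice state on the server.
async function synthesize(baseUrl: string, body: SpeechRequestBody): Promise<ArrayBuffer> {
  const res = await fetch(`${baseUrl}/v1/audio/speech`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  if (!res.ok) throw new Error(`TTS request failed: ${res.status}`);
  return res.arrayBuffer();
}
```

Keeping the body builder separate from the network call makes the request shape easy to check without a running GPU pod.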

The core engineering win was diagnosing why SGLang-Omni, whose defaults are tuned for an H200 (141 GB), runs the vocoder out of memory (OOM) on a 24 GB card, then patching three values in stages.py (mem_fraction_static 0.85→0.55, chunked_prefill_size 8192→2048, max_running_requests 64→4) to reach a stable ~16 GB footprint. I also wrote a dependency-free Bun client with a tiered Bengali text-normalization front end: a full 0–99 number-word table with lakh/crore logic, an English→Bengali phonetic dictionary, abbreviation and dari (।) handling, sentence chunking, and hand-rolled RIFF/WAV concatenation.
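
The lakh/crore logic can be sketched as a decomposition in the Indian numbering system. Bengali words for 1–99 are largely irregular, which is why a full lookup table is needed below 100. This sketch leaves that table out and takes it as a parameter instead. The names `toBengaliWords` and `under100` are hypothetical, and the unit spellings are common forms that may differ from the client's own table.

```typescript
// Indian-system units: crore (10^7), lakh (10^5), thousand (10^3), hundred (10^2).
// ASSUMPTION: these spellings are common forms, not necessarily the client's.
const UNITS: Array<[number, string]> = [
  [10000000, "কোটি"], // crore
  [100000, "লক্ষ"],   // lakh
  [1000, "হাজার"],    // thousand
  [100, "শত"],        // hundred
];

// Stands in for the full 0-99 number-word table, which is not reproduced here.
type Under100 = (n: number) => string;

function toBengaliWords(n: number, under100: Under100): string {
  if (!Number.isInteger(n) || n < 0) {
    throw new RangeError("expected a non-negative integer");
  }
  if (n < 100) return under100(n);

  const parts: string[] = [];
  let rest = n;
  for (const [value, word] of UNITS) {
    const count = Math.floor(rest / value);
    if (count > 0) {
      // A crore count can itself exceed 99, so the count is spelled recursively.
      parts.push(`${toBengaliWords(count, under100)} ${word}`);
      rest %= value;
    }
  }
  if (rest > 0) parts.push(under100(rest));
  return parts.join(" ");
}
```

With a stub table that just echoes digits, `toBengaliWords(1250000, String)` groups the value as 12 lakh 50 thousand, not as 1 million 250 thousand. That grouping is what a Bengali reader expects to hear.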

This replaced an earlier experiment: a self-contained Coqui XTTS-v2 fine-tuning pipeline (YouTube scrape → Demucs → Silero VAD → Whisper → LJSpeech → GPT-trainer). I abandoned it because XTTS-v2 has no Bengali tokenizer, which forced a lossy Bengali→Devanagari transliteration trained as Hindi. That version only ever reached a 250-step smoke test.

The hard part

Fitting an H200-tuned model onto a 24 GB GPU

SGLang-Omni ships tuned for an H200 (141 GB) and OOMs the Fish-Speech vocoder on a 24 GB RTX 3090. I traced the failure to three runtime defaults and patched them in stages.py (mem_fraction_static 0.85→0.55, chunked_prefill_size 8192→2048, max_running_requests 64→4). That brought the served footprint down to a stable ~16 GB, enough to run zero-shot Bengali voice cloning on a ~$0.22/hr consumer card.
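
The effect of the mem_fraction_static change can be checked with back-of-envelope arithmetic. In SGLang this setting is the fraction of GPU memory reserved for static allocation (model weights plus the KV-cache pool); whatever remains has to cover everything else. The sketch assumes a nominal 24 GB card and assumes the vocoder allocates outside the static pool. `staticBudget` is a hypothetical helper, not part of SGLang.

```typescript
// Rough memory budget for a given mem_fraction_static value.
// ASSUMPTION: totalGb is the nominal card size; usable memory is a bit lower.
interface Budget {
  staticPoolGb: number; // reserved up front for weights + KV cache
  headroomGb: number;   // left for activations, other stages, CUDA overhead
}

function staticBudget(totalGb: number, memFractionStatic: number): Budget {
  if (memFractionStatic <= 0 || memFractionStatic >= 1) {
    throw new RangeError("mem_fraction_static must be between 0 and 1");
  }
  const staticPoolGb = totalGb * memFractionStatic;
  return { staticPoolGb, headroomGb: totalGb - staticPoolGb };
}

const h200Default = staticBudget(24, 0.85); // about 20.4 GB static, 3.6 GB headroom
const patched = staticBudget(24, 0.55);     // about 13.2 GB static, 10.8 GB headroom

console.log(h200Default.headroomGb.toFixed(1), patched.headroomGb.toFixed(1));
```

On a 141 GB card the 0.85 default still leaves roughly 21 GB of headroom, so the same fraction is harmless there and tight on 24 GB. The other two values plausibly trim peak usage further: smaller prefill chunks reduce transient activation memory, and fewer concurrent requests reduce per-request state.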

Highlights

Deployed Fish-Speech S2 on a rented RTX 3090 via SGLang-Omni behind an OpenAI-style /v1/audio/speech API, with zero-shot synthesis and reference-clip voice cloning (one self-recorded Bengali clip per request).
Diagnosed and fixed an SGLang-Omni OOM on a 24 GB card by patching stages.py (mem_fraction_static 0.85→0.55, chunked_prefill_size 8192→2048, max_running_requests 64→4), reaching a stable ~16 GB footprint.
Wrote a dependency-free Bun TTS client with tiered Bengali text normalization: a 0–99 number-word table with lakh/crore logic, an English→Bengali phonetic dictionary, abbreviation/dari handling, and hand-rolled RIFF/WAV concatenation.
Earlier explored Coqui XTTS-v2 fine-tuning: a 9-script pipeline distilling one YouTube video into a 187-clip, 17.6-minute LJSpeech set through 5 quality gates, with a Whisper round-trip WER eval.
Abandoned XTTS-v2 once its missing Bengali tokenizer forced a lossy Devanagari-as-Hindi transliteration; that experiment settled the deploy-not-train approach.
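
Hand-rolled RIFF/WAV concatenation, as mentioned in the highlights, comes down to walking each file's chunks, checking that the formats match, and writing one new header over the joined sample data. The following is a minimal sketch rather than the client's actual code. `parseWav` and `concatWav` are hypothetical names, and it assumes well-formed PCM clips whose "fmt " chunk precedes "data".

```typescript
// Minimal RIFF/WAV concatenation for clips that share one audio format.
// ASSUMPTION: inputs are well-formed WAV files with "fmt " before "data".
interface WavParts {
  fmt: Uint8Array;  // raw "fmt " chunk body (sample rate, channels, bit depth)
  data: Uint8Array; // raw sample bytes
}

function tag(bytes: Uint8Array, at: number): string {
  return String.fromCharCode(bytes[at], bytes[at + 1], bytes[at + 2], bytes[at + 3]);
}

function parseWav(bytes: Uint8Array): WavParts {
  if (bytes.length < 12 || tag(bytes, 0) !== "RIFF" || tag(bytes, 8) !== "WAVE") {
    throw new Error("not a RIFF/WAVE file");
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  let fmt: Uint8Array | null = null;
  let pos = 12;
  while (pos + 8 <= bytes.length) {
    const id = tag(bytes, pos);
    const size = view.getUint32(pos + 4, true); // chunk sizes are little-endian
    const start = pos + 8;
    // Clamp to the buffer: streamed files may carry a placeholder size.
    const end = Math.min(start + size, bytes.length);
    if (id === "fmt ") fmt = bytes.subarray(start, end);
    if (id === "data") {
      if (fmt === null) throw new Error("data chunk before fmt chunk");
      return { fmt, data: bytes.subarray(start, end) };
    }
    pos = start + size + (size % 2); // chunks are padded to an even length
  }
  throw new Error("no data chunk found");
}

function concatWav(clips: Uint8Array[]): Uint8Array {
  if (clips.length === 0) throw new Error("no clips to join");
  const parts = clips.map(parseWav);
  const fmt = parts[0].fmt;
  for (const p of parts) {
    // Joining raw samples is only valid when every clip has the same format.
    if (p.fmt.length !== fmt.length || p.fmt.some((b, i) => b !== fmt[i])) {
      throw new Error("clips have different audio formats");
    }
  }
  const dataLen = parts.reduce((sum, p) => sum + p.data.length, 0);
  const out = new Uint8Array(28 + fmt.length + dataLen + (dataLen % 2));
  const view = new DataView(out.buffer);
  const put = (s: string, at: number): void => {
    for (let i = 0; i < 4; i++) out[at + i] = s.charCodeAt(i);
  };
  put("RIFF", 0);
  view.setUint32(4, out.length - 8, true); // RIFF size excludes the first 8 bytes
  put("WAVE", 8);
  put("fmt ", 12);
  view.setUint32(16, fmt.length, true);
  out.set(fmt, 20);
  const dataAt = 20 + fmt.length;
  put("data", dataAt);
  view.setUint32(dataAt + 4, dataLen, true);
  let pos = dataAt + 8;
  for (const p of parts) {
    out.set(p.data, pos);
    pos += p.data.length;
  }
  return out;
}
```

Rewriting the header instead of appending whole files matters because each clip carries its own size fields. Naively concatenated files play only the first chunk in most decoders.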

Stack

Fish-Speech S2 (latest)
SGLang-Omni
RunPod RTX 3090 / CUDA 12
Bun + TypeScript
Coqui XTTS-v2 (earlier)
PyTorch / torchaudio
Whisper
Silero VAD
Demucs
yt-dlp + FFmpeg