← all work
VOICE AI2026

Mr. Toppera real-time Bangla voice tutor

A real-time Bangla voice tutor for HSC students — Pipecat voice, live whiteboard, ~32 BDT a lesson.

~32 BDT
per 30-min lesson
8
whiteboard tools
~1–3s
faster first audio
74%
HSC RAG eval

Mr. Topper is the Bangla AI voice tutor my education work was building toward. It started as a WhatsApp bot, then a React Native app where I trained my own Bengali text-to-speech, before I settled on the architecture it runs on now.

The real-time layer is a Pipecat pipeline that handles voice, video, and live whiteboard tool-calling; everything else — database schema, APIs, auth, and the lesson lifecycle — lives in a NestJS backend. The tutor speaks with Gemini TTS, listens through Deepgram, and grounds every answer in a retrieval index built from the national NCTB textbooks.

Off-the-shelf voice stacks assume English. I got first audio one to three seconds sooner by making the sentence tokenizer treat the Bangla danda (।) as a terminator instead of waiting for a period, tuned barge-in to a 150ms interrupt, and raised speech endpointing to 300ms so the tutor stops cutting students off on natural Bangla pauses. A tested pricing module keeps a 30-minute lesson around 32 BDT — inside the 25–35 BDT a student can actually pay.

The hard part

Teaching a voice agent to hear Bangla

Off-the-shelf voice stacks assume English punctuation and pacing. The sentence tokenizer only split on . ? !, so a Bangla reply ending in a danda (।) buffered dozens of tokens before the tutor said a word — I made it terminate on the danda and got first audio one to three seconds sooner. And the speech detector's 25ms endpointing kept cutting students off on natural Bangla pauses, so I raised it to 300ms and tuned barge-in to a 150ms interrupt.

Highlights

  • Architected it on a Pipecat real-time pipeline (voice, video, live whiteboard tool-calling) with a NestJS backend owning the schema, APIs, and lesson lifecycle.
  • Built LLM-driven whiteboard tool-calling — 8 tools (text, lists, code, equations, tables, SVG) rendered live on a React Native board over the real-time data channel.
  • Cut time-to-first-audio ~1–3s by treating the Bangla danda (।) as a sentence terminator, with barge-in tuned to a 150ms interrupt and endpointing raised 25→300ms for Bangla speech.
  • Kept per-lesson economics honest with a tested pricing module: ~32 BDT per 30-minute lesson, inside the 25–35 BDT target.
  • Grounded answers in a Bangla RAG index over NCTB textbooks, scoring 74% on a graded HSC ICT evaluation.

Stack

TypeScriptPythonPipecatNestJSGemini TTS + LLMDeepgram nova-3Supabase pgvectorReact Native