Mr. Topper is the Bangla AI voice tutor my education work was building toward. It started as a WhatsApp bot, then a React Native app where I trained my own Bengali text-to-speech, before I settled on the architecture it runs on now.
The real-time layer is a Pipecat pipeline that handles voice, video, and live whiteboard tool-calling; everything else — database schema, APIs, auth, and the lesson lifecycle — lives in a NestJS backend. The tutor speaks with Gemini TTS, listens through Deepgram, and grounds every answer in a retrieval index built from the national NCTB textbooks.
Off-the-shelf voice stacks assume English. I got first audio one to three seconds sooner by making the sentence tokenizer treat the Bangla danda (।) as a terminator instead of waiting for a period. I also tuned barge-in to a 150ms interrupt and raised speech endpointing to 300ms so the tutor stops cutting students off on natural Bangla pauses. A tested pricing module keeps a 30-minute lesson around 32 BDT, inside the 25–35 BDT a student can actually pay.
The hard part
Teaching a voice agent to hear Bangla
Off-the-shelf voice stacks assume English punctuation and pacing. The sentence tokenizer only split on . ? !, so a Bangla reply ending in a danda (।) buffered dozens of tokens before the tutor said a word — I made it terminate on the danda and got first audio one to three seconds sooner. And the speech detector's 25ms endpointing kept cutting students off on natural Bangla pauses, so I raised it to 300ms and tuned barge-in to a 150ms interrupt.
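The danda fix can be illustrated with a minimal sketch of a streaming sentence buffer. This is not the tokenizer from the project or from Pipecat; `SentenceBuffer`, `feed`, and `flush` are hypothetical names, and the only idea taken from the text is that । (U+0964) counts as a terminator alongside . ? !.

```python
import re

# Terminators: the Latin . ? ! plus the Bangla danda (U+0964).
# Dropping the danda from this set reproduces the buffering problem:
# a Bangla sentence would never be released until the stream ends.
_TERMINATORS = ".?!।"
_SENTENCE_END = re.compile(rf"[{re.escape(_TERMINATORS)}]+")


class SentenceBuffer:
    """Hypothetical helper: collects streamed tokens, releases full sentences."""

    def __init__(self) -> None:
        self._pending = ""

    def feed(self, token: str) -> list[str]:
        """Add one streamed token and return any sentences it completed."""
        self._pending += token
        sentences = []
        while True:
            match = _SENTENCE_END.search(self._pending)
            if match is None:
                break
            sentence = self._pending[: match.end()].strip()
            self._pending = self._pending[match.end():]
            if sentence:
                sentences.append(sentence)
        return sentences

    def flush(self) -> str:
        """Return whatever is left when the stream ends."""
        rest, self._pending = self._pending.strip(), ""
        return rest


buf = SentenceBuffer()
ready = []
for tok in ["আমি ", "ভালো ", "আছি। ", "তুমি ", "কেমন", "?"]:
    ready += buf.feed(tok)  # each completed sentence could go to TTS at once
```

Each sentence is handed on as soon as its terminator arrives, which is where the earlier first audio comes from. The sketch is deliberately naive: it would also split on the period in a decimal such as 3.14, so a production tokenizer needs extra rules.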
Highlights
- Architected it on a Pipecat real-time pipeline (voice, video, live whiteboard tool-calling) with a NestJS backend owning the schema, APIs, and lesson lifecycle.
- Built LLM-driven whiteboard tool-calling — 8 tools (text, lists, code, equations, tables, SVG) rendered live on a React Native board over the real-time data channel.
- Cut time-to-first-audio by ~1–3s by treating the Bangla danda (।) as a sentence terminator, with barge-in tuned to a 150ms interrupt and endpointing raised 25→300ms for Bangla speech.
- Kept per-lesson economics honest with a tested pricing module: ~32 BDT per 30-minute lesson, inside the 25–35 BDT target.
- Grounded answers in a Bangla RAG index over NCTB textbooks, scoring 74% on a graded HSC ICT evaluation.
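The whiteboard tool-calling highlight can be pictured as a small validation-and-serialization step between the LLM and the data channel. The sketch below is an assumption throughout: the tool names are taken from the six content kinds listed above (the text mentions 8 tools but does not name the rest), and `ToolCall`, `to_board_message`, and the message fields are hypothetical, not the project's actual schema.

```python
import json
from dataclasses import dataclass

# Hypothetical tool identifiers, one per content kind named in the text.
ALLOWED_TOOLS = {"text", "list", "code", "equation", "table", "svg"}


@dataclass(frozen=True)
class ToolCall:
    """A tool call as the LLM might emit it (assumed shape)."""
    name: str
    arguments: dict


def to_board_message(call: ToolCall) -> str:
    """Validate a tool call and serialize it for the real-time data channel."""
    if call.name not in ALLOWED_TOOLS:
        raise ValueError(f"unknown whiteboard tool: {call.name!r}")
    # ensure_ascii=False keeps Bangla text readable in the JSON payload.
    return json.dumps(
        {"type": "whiteboard", "tool": call.name, "payload": call.arguments},
        ensure_ascii=False,
    )


message = to_board_message(ToolCall("equation", {"latex": "E = mc^2"}))
```

Rejecting unknown tool names before anything reaches the client keeps a hallucinated tool call from turning into a rendering error on the board.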
Stack