Voice AI for Indian languages: what actually works in production
Deploying voice AI in India is not a technology demo - it is an engineering problem of accents, code-switching, noisy lines, and language mix. Here's what works, and what doesn't.
Most voice AI demos are built for a studio environment. Clean microphone, American English, one speaker, no background noise. They show well and do very little for the Indian market.
Deploying voice AI in India is a different engineering problem. It is the problem of working across 22 constitutionally recognised languages, dozens of dialects, aggressive code-switching within a single sentence, patchy mobile lines, and a population with less tolerance for poor voice experiences than the industry assumes.
This is a field note from production deployments. What works, what doesn't, and what most proposals miss.
The real inputs to model for
Before picking technology, model the real distribution of calls you will receive.
- Language mix. Most Indian voice deployments see at least 3–4 languages in production, not one. A typical north-Indian deployment hears Hindi, English, Haryanvi and Punjabi inside the same customer base.
- Code-switching. A single sentence can contain Hindi and English words, often alternating mid-phrase. Monolingual ASR fails hard on these, and typical benchmarks don't include them.
- Accent variation. A Bengaluru caller and a Bhopal caller speaking English have genuinely different phoneme distributions. Accent adaptation matters.
- Line quality. A significant portion of calls are on 2G or patchy 4G. Models trained on clean audio collapse.
- Background noise. Indian calls happen in streets, markets, homes with TVs on, and auto-rickshaws. Noise robustness is not optional.
- Expectations. Citizens calling a government service hotline have less patience for robotic voice interactions than a tech-industry user base does.
What actually works
Code-switched ASR
Mono-language models don't work in production for this market. The baseline is an ASR model trained on code-switched audio - Hindi-English, Tamil-English, Bengali-English. Public models have improved; the gap on production audio is narrowing but still real. Fine-tuning on customer-domain audio is the single highest-leverage intervention.
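As a rough illustration of the data-selection step, mixed scripts in a transcript are a cheap proxy for Hindi-English code-switching. The helpers below are ours, not a standard API; a production pipeline would use proper language identification rather than a script check:

```python
import re

# Devanagari block vs. basic Latin letters.
DEVANAGARI = re.compile(r"[\u0900-\u097F]")
LATIN = re.compile(r"[A-Za-z]")

def is_code_switched(transcript: str) -> bool:
    """True if the transcript mixes Devanagari and Latin script -
    a cheap proxy for Hindi-English code-switching."""
    return bool(DEVANAGARI.search(transcript)) and bool(LATIN.search(transcript))

def build_finetune_manifest(utterances):
    """Keep only code-switched utterances for targeted fine-tuning.
    `utterances` is a list of (audio_path, transcript) pairs."""
    return [(path, text) for path, text in utterances if is_code_switched(text)]
```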
A phoneme-aware routing layer
Don't try to detect "Hindi" or "English" at the utterance level. Route at the phoneme level. Many Indian speakers switch mid-word, and a router that only looks at the first second of audio misclassifies the language - after which the whole response pipeline derails.
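One way to sketch the segmentation side of this. Per-frame language labels are assumed to come from an acoustic language-ID model upstream; the `min_run` smoothing threshold is an illustrative choice, and each resulting segment would be routed to the matching specialist recogniser:

```python
def route_segments(frame_langs, min_run=3):
    """Collapse per-frame language labels into contiguous segments,
    absorbing runs shorter than `min_run` frames (likely label noise)
    into the preceding segment. Each segment can then be routed to a
    specialist recogniser instead of committing the whole utterance
    to one language up front."""
    segments = []
    start = 0
    for i in range(1, len(frame_langs) + 1):
        if i == len(frame_langs) or frame_langs[i] != frame_langs[start]:
            if i - start >= min_run:
                segments.append((frame_langs[start], start, i))
            elif segments:
                # Too short to trust: extend the previous segment over it.
                lang, s, _ = segments[-1]
                segments[-1] = (lang, s, i)
            start = i
    return segments
```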
Explicit noise and compression handling
Train your ASR on audio that has been through your actual telephony codec - usually G.711 or Opus at low bitrates - not on clean 48 kHz audio. The WER gap between the two is large enough that it dominates other choices.
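A minimal augmentation sketch of the idea, approximating a G.711 mu-law channel with plain NumPy. This is a crude stand-in - production pipelines should low-pass filter before decimating and use the codec's exact segment-based encoding:

```python
import numpy as np

MU = 255.0  # G.711 mu-law companding parameter

def mu_law_roundtrip(x: np.ndarray) -> np.ndarray:
    """Compand to 8-bit mu-law and back, as a G.711 channel would.
    Input is expected in [-1.0, 1.0]."""
    compressed = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    quantised = np.round(compressed * 127) / 127  # 8-bit quantisation
    return np.sign(quantised) * np.expm1(np.abs(quantised) * np.log1p(MU)) / MU

def telephony_degrade(audio_48k: np.ndarray) -> np.ndarray:
    """Crude 48 kHz -> 8 kHz decimation plus a mu-law round trip, to
    make clean training audio resemble the real telephony channel."""
    audio_8k = audio_48k[::6]  # 48 kHz / 6 = 8 kHz
    return mu_law_roundtrip(np.clip(audio_8k, -1.0, 1.0))
```

Running every training clip through a transform like this (plus recorded street and household noise) is cheap relative to the WER it recovers on real calls.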
A quick-escalate design
Even with the best model, 5–15% of calls will not complete cleanly in voice AI. The design question is how fast the system notices and escalates. Deployments that escalate within 30 seconds have acceptable citizen satisfaction. Deployments that loop for two minutes before escalating are rated worse than the human-only baseline.
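The quick-escalate rule can be expressed as a small policy object. The thresholds below - 30 seconds, two low-confidence turns, a 0.6 confidence floor - are assumptions to tune per deployment, not recommendations:

```python
from dataclasses import dataclass, field

@dataclass
class EscalationPolicy:
    """Escalate to a human as soon as the call shows signs of not
    completing: too much elapsed time, repeated low-confidence ASR,
    or the caller re-asking the same thing."""
    max_seconds: float = 30.0
    max_low_confidence_turns: int = 2
    low_conf_turns: int = field(default=0, init=False)

    def observe_turn(self, elapsed_s: float, asr_confidence: float,
                     repeated_intent: bool) -> bool:
        """Return True when the call should be handed to a human."""
        if asr_confidence < 0.6:
            self.low_conf_turns += 1
        if elapsed_s >= self.max_seconds:
            return True
        if self.low_conf_turns >= self.max_low_confidence_turns:
            return True
        return repeated_intent
```

The point is that escalation is evaluated every turn, not after a fixed script has run its course.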
A single intent catalogue across voice and chat
Voice and chat need to share the intent model. When someone calls today and chats tomorrow, the experience should carry across. Building voice and chat on separate intent catalogues is a common, expensive mistake.
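A sketch of what "one catalogue, two channels" means in practice. The intent entry and the keyword resolver below are illustrative; a real system resolves with a trained classifier, but over the same catalogue for both channels:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Intent:
    name: str
    # Channel-specific surface forms map to the same intent id, so a
    # voice call today and a chat tomorrow resolve identically.
    voice_prompts: tuple
    chat_quick_reply: str

CATALOGUE = {
    "pass_extension": Intent(
        name="pass_extension",
        voice_prompts=("extend my pass", "pass renew karna hai"),
        chat_quick_reply="Extend pass",
    ),
}

def resolve(utterance, catalogue=CATALOGUE):
    """Naive keyword resolver shared by voice and chat."""
    text = utterance.lower()
    for intent in catalogue.values():
        if any(p in text for p in intent.voice_prompts):
            return intent.name
    return None
```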
What doesn't work
Cloning a US voice stack. The acoustic distribution is different, the language distribution is different, the line quality is different. Voice stacks built for US English require extensive adaptation before they work in India, and the adaptation cost often rivals the original build cost.
Trying to build a single mega-model. One model for all languages and all dialects sounds elegant. In production, it underperforms a routed system of specialist models. Route first, specialise the model under each route.
Under-investing in evaluation. Voice systems are harder to evaluate than chat systems because the ground truth is audio, not text. Many deployments measure containment rate only and miss the fact that the "contained" calls produced wrong answers. Containment is not quality. Measure quality explicitly, by sampling and human review.
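One way to make "containment is not quality" operational is to discount the containment rate by a human-reviewed correctness rate over a sample of contained calls. A sketch - the sampling scheme and field names are ours:

```python
import random

def quality_adjusted_containment(calls, sample_size=100, seed=7):
    """Containment alone counts calls the AI finished. This also
    samples contained calls for human review and scales containment
    by the fraction judged correct. Each call is a dict with a
    'contained' flag and, for reviewed calls, a 'correct' flag."""
    contained = [c for c in calls if c["contained"]]
    containment = len(contained) / len(calls)
    rng = random.Random(seed)
    sample = rng.sample(contained, min(sample_size, len(contained)))
    correct_rate = sum(c["correct"] for c in sample) / len(sample)
    return containment, containment * correct_rate
```

A deployment reporting 80% containment with a 75% reviewed-correct rate is really resolving 60% of calls - a very different number to put in front of a programme owner.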
Pure IVR fallback. When the AI fails, some systems drop into an IVR menu. Citizens hate this. The fallback should be a human, reached quickly.
Costs, honestly
Voice is more expensive to run than chat for obvious reasons - continuous audio streaming, ASR, TTS, telephony bills. A useful planning range for the ongoing cost per call is INR 2 to INR 8, depending on call length, language mix and model tier. Scaled to hundreds of thousands of calls a month this is meaningful; it needs to be designed around from day one. Containment rates below about 60% rarely justify the full AI build; above 80%, the economics look very different.
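A back-of-envelope planning model for the numbers above. The INR 50 human-handling cost is an assumed placeholder; replace every constant with measured values before putting this in a proposal:

```python
def monthly_voice_cost_inr(calls_per_month, cost_per_call_inr,
                           containment, human_cost_per_call_inr=50.0):
    """Planning model: every call pays the per-call AI rate, and the
    non-contained fraction additionally incurs a human-handling cost.
    All constants are assumptions, not benchmarks."""
    ai_cost = calls_per_month * cost_per_call_inr
    human_cost = calls_per_month * (1 - containment) * human_cost_per_call_inr
    return ai_cost + human_cost

# e.g. 200k calls/month at INR 5 per call with 74% containment:
cost = monthly_voice_cost_inr(200_000, 5.0, 0.74)
```

Even this crude model shows why containment dominates the economics: the escalated tail, not the AI bill, is usually the larger line item.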
Governance for voice
Voice carries governance questions that chat does not. PII appears naturally in audio (names, numbers, addresses), and redaction in audio is a harder problem than in text. Production-grade voice deployments need:
- Real-time PII redaction in transcripts before logging.
- Recorded-call consent flows at the start of the call, audit-logged.
- Language-specific quality sampling - don't sample 90% Hindi audio and extrapolate English quality.
- Human review queue where low-confidence outputs are flagged for listen-back.
These are not optional in any regulated deployment.
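A starting-point sketch for the transcript side of redaction. The regexes below cover common Indian PII shapes - mobile numbers, 12-digit ID numbers, the PAN format - and are illustrative only; a production system needs NER-based redaction on top, plus handling for the audio itself:

```python
import re

# Illustrative patterns for common Indian PII in transcripts.
PII_PATTERNS = [
    (re.compile(r"\b[6-9]\d{9}\b"), "<PHONE>"),               # mobile numbers
    (re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b"), "<AADHAAR>"),  # 12-digit IDs
    (re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b"), "<PAN>"),         # PAN format
]

def redact_transcript(text: str) -> str:
    """Apply redaction before the transcript touches any log store."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text
```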
An example: citizen services
One of our deployments handles pass-extension calls for a Union Territory across seven languages. Before the deployment, average resolution time was nine days. After, four hours for straightforward cases. The AI does not handle everything - it handles the well-shaped 80%. The remaining 20% is routed to a human within 30 seconds. Containment is 74%, satisfaction is higher than the pre-deployment baseline. The design took three months; the language additions have taken longer.
The lesson: Indian voice AI works in production, but it requires the kind of disciplined scope, quality measurement and operating model that most demos don't show.
Frequently asked
Does voice AI work for Indian languages in production? Yes, but it requires code-switched ASR, accent adaptation, telephony-codec training, and a quick-escalate design. Out-of-the-box vendor stacks built for US English typically do not reach acceptable quality without significant fine-tuning.
How many Indian languages can a single voice AI system support? Practically, 5–10 languages with acceptable quality in a single deployment, if the system is built as a routed set of specialist models rather than one mega-model.
How much does voice AI cost per call? Typically INR 2 to INR 8 per call depending on length, language and model tier. This needs to be designed for from the start; voice economics are volume-sensitive.
What is the right fallback when voice AI can't handle a call? A human, reached quickly. Falling back to an IVR menu produces the worst measured satisfaction of any pattern we have tested.
Our voice & chat service is built for this market, not for a studio demo. If you are planning a deployment in an Indian context, let us know what you're trying to do.