Voice AI: Beyond the Chatbot
Building voice interfaces that feel natural - real-time processing, turn-taking, emotional awareness, and the technical challenges of conversational AI.
Text chatbots are everywhere.
Voice interfaces that don't suck are rare.
The jump from text to voice isn't incremental. It's a different problem.
Why Voice Is Hard
Latency matters more. A 500ms text response feels fine. A 500ms voice delay feels awkward.
Turn-taking is complex. When do I speak? When are you done? Text has "send." Voice has... vibes.
Context is richer. Tone. Pace. Hesitation. Text strips all of this out.
Errors are worse. Misheard text can be re-read. Misheard voice breaks flow.
The Pipeline
Every voice AI has the same basic flow:
Audio In → ASR → Text → LLM → Text → TTS → Audio Out
ASR (Automatic Speech Recognition): Audio to text. LLM: Text understanding and generation. TTS (Text-to-Speech): Text to audio.
Simple in theory. Each step introduces latency and error.
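The flow above can be sketched as three composed stages. The stage functions here are placeholders standing in for real ASR/LLM/TTS services, not actual APIs:

```python
# Minimal sketch of the sequential voice pipeline.
# Each stage is a stub; a real system calls out to ASR/LLM/TTS services.
def asr(audio: bytes) -> str:
    return "what's the weather"                 # placeholder transcription

def llm(text: str) -> str:
    return f"Here's the forecast for: {text}"   # placeholder response

def tts(text: str) -> bytes:
    return text.encode("utf-8")                 # placeholder audio

def voice_turn(audio_in: bytes) -> bytes:
    transcript = asr(audio_in)                  # Audio In -> Text
    reply = llm(transcript)                     # Text -> Text
    return tts(reply)                           # Text -> Audio Out
```

Written this way, the sequential structure also makes the latency problem obvious: every stage blocks the next.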
Latency Budget
Target: under 1 second end-to-end.
| Component | Target | Common Reality |
|---|---|---|
| ASR | 200ms | 300-500ms |
| LLM | 300ms | 500-2000ms |
| TTS | 200ms | 200-400ms |
| Network | 100ms | Variable |
| Total | 800ms | 1500-3000ms |
Most voice bots feel slow because they are slow.
Making It Fast
Streaming Everything
Don't wait for complete transcription. Process as you hear.
Don't wait for complete LLM response. Start TTS on the first sentence.
Chain streams:
Audio chunk → partial ASR → early LLM start → TTS streaming
The user hears response starting before they've fully finished speaking.
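One way to sketch the LLM-to-TTS handoff: group a streaming token sequence into sentences so synthesis can begin on the first one. The token stream here is hard-coded for illustration; a real system would consume the model's stream.

```python
import re

def llm_tokens():
    # Stand-in for a streaming LLM response, one token at a time.
    yield from "Sure. I can book that for Tuesday at 3pm. Anything else?".split(" ")

def stream_sentences(tokens):
    """Group a token stream into sentences so TTS can start early."""
    buf = []
    for tok in tokens:
        buf.append(tok)
        if re.search(r"[.!?]$", tok):   # sentence boundary reached
            yield " ".join(buf)
            buf = []
    if buf:                              # flush any trailing fragment
        yield " ".join(buf)

sentences = list(stream_sentences(llm_tokens()))
# TTS can synthesize sentences[0] ("Sure.") while the rest is still generating.
```

The same pattern applies one level up: feed partial ASR hypotheses into the LLM before the user finishes speaking.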
Speculative Responses
For common patterns, pre-generate responses.
User: "Hey, what's—"
System: (already loading likely next steps)
Saves hundreds of milliseconds on predictable queries.
Smart Interruption
Users interrupt. They should be able to.
Detect interruption → stop TTS → flush buffers → switch to listening.
Budget about 200ms to detect and respond to an interruption. Faster feels responsive. Slower feels robotic.
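The stop-flush-listen sequence can be sketched with simulated playback state. The class and its fields are hypothetical, not a real audio API:

```python
import time

# Barge-in sketch: on detected user speech, stop TTS and flush queued audio
# within the ~200ms budget. Playback state here is simulated locally.
class Playback:
    def __init__(self):
        self.playing = True
        self.listening = False
        self.buffer = [b"chunk1", b"chunk2"]   # queued TTS audio

    def handle_interruption(self) -> float:
        start = time.monotonic()
        self.playing = False        # stop TTS output
        self.buffer.clear()         # flush buffered audio
        self.listening = True       # switch back to listening
        return (time.monotonic() - start) * 1000.0   # reaction time in ms

pb = Playback()
elapsed_ms = pb.handle_interruption()
```

In practice the 200ms budget is dominated by interruption *detection* (VAD over the mic stream), not by the state flip shown here.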
Turn-Taking
The hardest unsolved problem in voice AI.
Pauses aren't endings. "I want to... um... schedule a meeting" isn't three sentences.
Backchannel isn't input. "Uh-huh" doesn't mean talk over me.
Culture matters. Some cultures overlap. Some don't. Universal models fail.
Current Approaches
Fixed timeout: Wait 800ms of silence. Simple. Often wrong.
Acoustic features: Listen for falling intonation, slowing pace. Better.
Semantic completion: Does the sentence feel done? Best, but most expensive.
Hybrid: All of the above, weighted by context.
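The hybrid approach can be sketched as a weighted score over the three signals. The weights and thresholds below are invented for illustration; real systems tune them on dialogue data:

```python
# Hybrid end-of-turn sketch: combine silence duration, an acoustic cue, and
# a semantic-completeness score. Weights/thresholds are illustrative only.
def end_of_turn(silence_ms: float, pitch_falling: bool, semantic_done: float) -> bool:
    score = 0.0
    score += min(silence_ms / 800.0, 1.0) * 0.4   # fixed-timeout signal
    score += 0.3 if pitch_falling else 0.0        # acoustic signal
    score += semantic_done * 0.3                  # "does the sentence feel done?"
    return score >= 0.6

# "I want to... um..." -- long pause, but the sentence is incomplete: keep listening.
assert not end_of_turn(silence_ms=700, pitch_falling=False, semantic_done=0.1)
# Shorter pause, falling pitch, complete sentence: hand over the turn.
assert end_of_turn(silence_ms=500, pitch_falling=True, semantic_done=0.9)
```

The point of the weighting: no single signal is trusted alone, so a long hesitation mid-sentence no longer triggers a false hand-over the way a fixed timeout does.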
Emotional Awareness
Text: "I'm fine." Voice: *exasperated sigh* "I'm fine."
Same words. Opposite meanings.
What We Can Detect
- Frustration (speaking faster, higher pitch)
- Confusion (hesitation, rising intonation)
- Satisfaction (relaxed pace, falling tones)
- Urgency (clipped speech, emphasis)
What We Can't
- Sarcasm (sometimes)
- Cultural nuance (usually)
- Context from prior calls (without memory)
Don't overclaim. Emotional AI is partial at best.
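The detectable signals above roughly correspond to simple prosodic features. As a toy sketch only: the feature names and thresholds here are assumptions, and real detectors are trained models, not rules.

```python
# Toy prosody heuristic mapping acoustic features to coarse signals.
# Thresholds are illustrative; production systems learn these from data.
def classify_prosody(rate_wpm: float, pitch_hz: float, baseline_pitch_hz: float) -> str:
    fast = rate_wpm > 180                         # speaking faster than baseline
    high = pitch_hz > baseline_pitch_hz * 1.2     # pitch well above baseline
    if fast and high:
        return "frustration"      # faster speech, higher pitch
    if not fast and pitch_hz < baseline_pitch_hz:
        return "satisfaction"     # relaxed pace, falling tones
    if high:
        return "confusion"        # e.g. rising intonation
    return "neutral"
```

Even this caricature shows the limits: nothing here can see sarcasm, culture, or history, which is exactly the "partial at best" caveat above.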
The Voice Persona
Your voice AI has a character whether you design one or not.
Decisions you're making:
- How fast to speak
- When to use filler words
- How formal to be
- Whether to mirror user's pace
- How to handle errors
Bad default: robotic correctness. Good default: warm competence.
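Making those decisions explicit is as simple as writing them down as configuration instead of inheriting accidental defaults. The field names here are illustrative, not a real API:

```python
from dataclasses import dataclass

# Persona decisions made explicit as configuration. Fields mirror the list
# above; names and values are illustrative assumptions.
@dataclass
class VoicePersona:
    speaking_rate: float = 1.0       # 1.0 = normal pace
    use_fillers: bool = True         # occasional "hmm", "let me check"
    formality: str = "casual"        # "casual" | "formal"
    mirror_user_pace: bool = True    # adapt rate to the caller
    error_style: str = "own-it"      # apologize briefly, then recover

warm_competence = VoicePersona()     # the "good default" described above
robotic = VoicePersona(use_fillers=False, formality="formal",
                       mirror_user_pace=False, error_style="restate-menu")
```

The value of the dataclass isn't the code; it's that every field forces a deliberate choice someone would otherwise make by omission.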
Production Considerations
Fallback to Text
Voice fails often. Network issues, noisy environments, accents.
Always have human escalation or text fallback.
Privacy
Voice is biometric data. Store carefully. Anonymize when possible. Delete when required.
Accessibility
Voice-only excludes deaf users. Voice-and-text includes everyone.
What We're Building
At Kingly, voice is core to Kingly Bot:
- Sub-second response latency
- Natural turn-taking
- Multi-channel memory
- Graceful degradation to text
Voice done right feels like talking to a helpful person. Voice done wrong feels like screaming at a phone menu.
We're aiming for helpful person.
Further Reading
Technical Resources
- Whisper (OpenAI) - state-of-the-art open-source ASR
- Real-Time Voice AI Architecture
- Turn-Taking in Spoken Dialogue Systems
Related Posts
- Context Engineering - Memory for voice conversations
- Multi-Agent Systems - Orchestrating voice with other AI
Text AI was step one. Voice AI is step two. The companies that crack natural voice interaction will own the next interface paradigm.