Real-time AI voice agents are moving quickly from demo territory into practical products, but the engineering required to make them feel natural remains difficult. In a recent discussion, LiveKit’s Ben Cherry focused on the technical challenges behind building systems that can listen, think and respond with minimal delay.
The core problem, according to Cherry, is that voice interactions are much less forgiving than text. Users expect a conversational rhythm that feels immediate, while AI systems must capture speech, process it, generate a response and speak back without awkward pauses. Even small delays can make an assistant feel clumsy or robotic.
LiveKit, which provides infrastructure for audio and video applications, has been paying close attention to this shift toward voice-first AI interfaces. Cherry’s comments reflect a broader industry effort to create agents that can handle back-and-forth conversation in real time rather than only responding after a user finishes speaking.
A key theme in the discussion is latency. In a written chat interface, a short wait is usually acceptable. In voice, however, timing shapes the entire experience. A response that arrives too late can interrupt the flow of a conversation, cause users to repeat themselves or make the agent seem less intelligent than it is.
That timing issue is not limited to a single step in the pipeline. Voice agents depend on multiple stages working together, from speech recognition and language generation to text-to-speech output. If any step slows down, the entire interaction suffers. Cherry highlighted the importance of engineering these systems so they can keep pace with human speech in natural settings.
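To make the point concrete, the following is a minimal sketch of a per-turn latency budget. It assumes the three stages run one after another, so their delays add up. The stage names follow the paragraph above; the `Stage` type, the helper functions and every millisecond figure are hypothetical placeholders for illustration, not measurements from LiveKit or from Cherry's discussion.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: float  # illustrative placeholder, not a measured value


def total_latency_ms(stages: List[Stage]) -> float:
    """Sum the stage latencies, assuming the stages run strictly in sequence."""
    return sum(stage.latency_ms for stage in stages)


def slowest_stage(stages: List[Stage]) -> Stage:
    """Return the stage that contributes the most delay."""
    return max(stages, key=lambda stage: stage.latency_ms)


def over_budget(stages: List[Stage], budget_ms: float) -> bool:
    """Report whether the summed latency exceeds the chosen response budget."""
    return total_latency_ms(stages) > budget_ms


# Hypothetical numbers chosen only to show the arithmetic.
PIPELINE = [
    Stage("speech_recognition", 200.0),
    Stage("language_generation", 400.0),
    Stage("text_to_speech", 150.0),
]
```

Under these assumed figures, a single slow stage dominates the total, which is why attention usually goes to the slowest step first. Real systems often stream partial results between stages, so the true delay can be lower than a simple sum suggests.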
Another challenge is handling interruptions. People often talk over assistants, change their minds mid-sentence or correct themselves on the fly. Building agents that can detect and react to those moments is essential for making conversations feel genuine. Without that, the system may continue speaking when it should stop or miss cues that the user is trying to redirect the exchange.
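One common way to handle this kind of interruption, often called barge-in, is to run playback as a cancellable task and stop it as soon as the user starts talking. The sketch below shows that idea using Python's standard `asyncio` library. It is not LiveKit's API: `speak`, `speak_with_barge_in` and the `user_started_speaking` event are hypothetical names, and a short sleep stands in for real audio playback.

```python
import asyncio
from typing import List, Tuple


async def speak(chunks: List[str], spoken: List[str], chunk_seconds: float) -> None:
    """Pretend to play audio chunks one at a time, recording each chunk played."""
    for chunk in chunks:
        await asyncio.sleep(chunk_seconds)  # stands in for playback time
        spoken.append(chunk)


async def speak_with_barge_in(
    chunks: List[str],
    user_started_speaking: asyncio.Event,
    chunk_seconds: float = 0.01,
) -> Tuple[List[str], bool]:
    """Play chunks until done, or stop early if the user starts speaking.

    Returns the chunks actually played and whether playback was interrupted.
    """
    spoken: List[str] = []
    playback = asyncio.create_task(speak(chunks, spoken, chunk_seconds))
    interrupt = asyncio.create_task(user_started_speaking.wait())

    # Wake up as soon as either playback finishes or the user speaks.
    await asyncio.wait({playback, interrupt}, return_when=asyncio.FIRST_COMPLETED)

    if not playback.done():
        # The user spoke first: stop talking right away.
        playback.cancel()
        try:
            await playback
        except asyncio.CancelledError:
            pass
        return spoken, True

    # Playback finished normally; the interrupt watcher is no longer needed.
    interrupt.cancel()
    return spoken, False
```

In a real agent, the event would be set by a voice activity detector on the incoming audio, and the list of chunks already played would tell the system how much of its reply the user actually heard before cutting in.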
The discussion also points to a broader design question. A useful voice agent is not just fast; it also has to be good at conversation. That means understanding when to respond, when to wait and how to maintain context across a spoken exchange.
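The simplest form of deciding when to respond is a silence threshold: the agent waits until the user has spoken and then stayed quiet for a set number of audio frames. The sketch below illustrates that baseline only. The function name, the per-frame speech flags and the threshold are assumptions made for the example, and nothing in the discussion indicates that this is the method LiveKit uses.

```python
from typing import Iterable, Optional


def end_of_turn_index(
    frames_is_speech: Iterable[bool],
    silence_frames_needed: int,
) -> Optional[int]:
    """Return the index of the frame where the user's turn is judged complete.

    A turn ends once speech has been heard and is followed by
    `silence_frames_needed` consecutive non-speech frames.
    Returns None if that never happens.
    """
    heard_speech = False
    silent_run = 0
    for index, is_speech in enumerate(frames_is_speech):
        if is_speech:
            heard_speech = True
            silent_run = 0  # any speech resets the silence counter
        elif heard_speech:
            silent_run += 1
            if silent_run >= silence_frames_needed:
                return index
    return None
```

The trade-off sits in the threshold. A small value makes the agent answer quickly but risks cutting in during a mid-sentence pause, while a large value avoids interruptions at the cost of a slower reply.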
Cherry’s perspective suggests that the best real-time voice systems will rely on infrastructure that can support low-latency audio streaming and responsive session management. Those pieces matter because voice agents are expected to operate more like a live participant than a query box.
For developers, that creates a different product mindset. Instead of optimizing only for accuracy, teams also need to think about conversational timing, turn-taking and the subtle cues that make speech feel human. In practice, that can affect everything from user satisfaction to whether a product feels reliable enough for everyday use.
Interest in AI voice tools has grown as more companies look for ways to move beyond text-based assistants. LiveKit’s work sits at the intersection of communications infrastructure and AI application development, making it part of a larger push to support these new use cases.
Cherry’s remarks underline a central point for the category. Real-time voice agents are technically possible, but making them pleasant to use requires careful attention to how people actually speak. The companies that solve for speed, interruptions and conversational flow may help define the next generation of AI interfaces.