Siri launched in 2011. That’s almost fifteen years ago. And most people who use it regularly do so mainly to set a timer or dictate something into a search bar.
That’s not for lack of ambition. Apple, Google and Amazon have invested billions in voice technology. The problem is architectural.
The classic voice assistant operates as a chain of three models: one that transcribes speech to text, one that generates a response, and one that converts the response back to audio. Three models. Three handoffs. Cumulative latency. And with each step, information is lost – prosody, emphasis, emotional nuance – because these qualities can’t be preserved in flat text.
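To make the structure concrete, here is a minimal sketch of such a cascade – the stage names, return values and latency figures are invented placeholders, not measurements of Siri or any other assistant:

```python
import time

# A minimal sketch of the classic three-stage cascade, with made-up stand-in
# functions and latencies. Nothing here models a real assistant; the point is
# the shape: three handoffs, additive delays, and a flat-text bottleneck.

def transcribe(audio: bytes) -> str:
    time.sleep(0.30)                      # speech-to-text
    return "what's the weather like"      # prosody, emphasis and tone are already gone

def generate(prompt: str) -> str:
    time.sleep(0.40)                      # language model works on flat text only
    return "It looks like rain this afternoon."

def synthesize(text: str) -> bytes:
    time.sleep(0.25)                      # text-to-speech invents a delivery the input never carried
    return b"\x00" * 16000                # placeholder waveform

start = time.perf_counter()
reply_audio = synthesize(generate(transcribe(b"\x00" * 16000)))
elapsed = time.perf_counter() - start
print(f"end-to-end latency: {elapsed:.2f} s")   # the three delays simply add up
```

The numbers don’t matter; the shape does. Every stage adds its own delay, and everything the middle stage sees is flat text, so how something was said can’t survive the trip.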
That’s why Siri sounds robotic. Not because the voice synthesis is poor in isolation, but because it’s built on a foundation of information loss.
A different architecture
NVIDIA’s PersonaPlex 7B hit Hacker News on March 5, 2026 with 374 points and 125 comments. Not because it’s yet another voice system, but because it takes a fundamentally different approach.
PersonaPlex replaces the entire pipeline with a single model. Audio in, audio out. No transcription. No text intermediate. Built on Kyutai’s Moshi architecture, it processes 17 parallel audio token streams directly – one frame every 80 milliseconds at 12.5 Hz. It listens and speaks simultaneously, full-duplex, and it runs faster than real time with an RTF of 0.87.
That’s a shorter delay than a natural conversational pause.
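The timing budget follows directly from those figures. A quick restatement of the arithmetic – no model code, just the numbers from the paragraph above:

```python
# Timing budget implied by the published figures: 12.5 Hz frame rate, RTF 0.87.
frame_rate_hz = 12.5
frame_ms = 1000 / frame_rate_hz            # 80.0 ms of audio per frame
rtf = 0.87                                 # real-time factor: processing time / audio duration
compute_ms = rtf * frame_ms                # ~69.6 ms of compute per frame
headroom_ms = frame_ms - compute_ms        # ~10.4 ms of slack per frame

print(f"frame: {frame_ms:.1f} ms  compute: {compute_ms:.1f} ms  headroom: {headroom_ms:.1f} ms")
```

An RTF below 1 means each 80-millisecond frame is processed in less time than it takes to play back, which is what leaves room to keep listening while speaking.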
Google’s Gemini 2.0 Live API, announced in December 2024, takes the same approach at scale: real-time conversations directly with a language model, without an intermediate pipeline.
This isn’t an improvement to Siri. It’s a different type of technology.
What actually changes
The technical difference has a concrete consequence for user experience. Latency under 100 milliseconds isn’t perceived as a delay – conversation flows. And a model that preserves audio information throughout can respond to emphasis and tempo, not because it’s “smarter,” but because it never discarded that information in the first place.
That changes who can use the technology at all.
For years, digital systems have imposed an implicit requirement: you need to be able to type. A keyboard, a touchscreen, a search bar. In practice, the technology works for people who are comfortable with those interfaces – and less well for everyone else.
Siri-quality voice interaction never solved that problem. When it fails three times in a row, you learn quickly that typing is faster. The result is that voice features are primarily used by those already comfortable with alternatives – and typically only for simple commands.
Technology that keeps conversation flowing and responds naturally is something different. A 74-year-old who never learned to type on a smartphone. A warehouse worker wearing safety gloves. A nurse in the middle of an assessment. All of them can use technology that works like a conversation – if the conversation actually works.
What it means for your business
The practical implications are concrete.
AI agents that handle customer inquiries by voice in real time are available now, with quality good enough that customers don’t hang up. That’s a significant jump from before: your customers already talk to AI, but the question is no longer whether the technology works – it’s whether you’re using it.
Internal support is an underrated use case. Most IT helpdesks and HR systems are primarily used by people sitting at computers. Employees who work without screens – in manufacturing, logistics, healthcare – rarely use them, because it’s cumbersome. A voice interface to those same systems is a different proposition.
Accessibility isn’t just an ethical argument. It’s a commercial argument. If your digital channels in practice work best for people with IT experience and fingers free for a keyboard, that’s a customer base you’re not fully reaching.
PersonaPlex and Gemini Live sharpen the question that has hung over voice interfaces all along: is this finally good enough to rely on? For the first time, the answer is yes for most use cases. That’s what’s changed.
And it’s not Siri.