
September 2023 Releases
Every time a new model ships, we test it for voice AI agents. Not because we're chasing benchmarks, but because the three things we care most about (latency, cost, and quality) don't move together. A model that's faster is usually more expensive. A model that's cheaper usually makes more mistakes. And a model that scores well on all three in a lab setting often falls apart when it's actually running a conversation on a live call.
For a long time, optimizing for one of these meant accepting a trade-off on the others. What we're always looking for is a model that shifts the entire curve: an LLM that makes the trade-offs less sharp, or eliminates one of them entirely. That's a high bar, and most models don't clear it. But some do, or clear it in ways that matter for a specific slice of what our customers actually run.
So that’s why we keep evaluating. Not as a formal process that happens on a schedule, but because models are shipping fast and ignoring them would mean leaving real improvements on the table.
When we put a new model through its paces, we're not running synthetic benchmarks. We run it against single-state AI agent configurations built around the use cases our customers actually deploy, across industries such as healthcare, insurance, home services, education, and more. After testing the model on real call scenarios, we score it across three qualitative areas, which come down to three questions: Is the agent using its tools correctly? Is it following its prompts correctly? Is it hallucinating?
Action invocation: Does the agent call the right action at the right time? And when something fails, does it recover cleanly or does it just keep going like nothing happened?
Instruction adherence and safety: Is the agent actually paying attention to what it's been told to do, and is it staying in bounds? These two are closely related: a model that ignores its instructions tends to drift in the same ways a model that ignores its safety constraints does.
Conversational quality: How well is it structuring its responses? This matters more in voice than people expect, because a model that buries the key information or rambles before getting to the point creates real friction on a call.
We score each model across all three factors, then average them so we can report on individual areas and also get a single holistic number to compare models directly.
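To make the aggregation concrete, here is a minimal sketch of averaging three per-dimension scores into one number. The 0-100 scale, the equal weighting, the class name, and the sample values are all assumptions made for illustration; they are not Regal's actual rubric or results.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class QualitativeScores:
    """Per-dimension scores for one model (hypothetical 0-100 scale)."""
    action_invocation: float
    instruction_adherence: float
    conversational_quality: float

    def overall(self) -> float:
        # Unweighted mean of the three dimensions; equal weighting is an assumption.
        return mean([
            self.action_invocation,
            self.instruction_adherence,
            self.conversational_quality,
        ])


# Placeholder numbers for illustration only, not benchmark results.
example = QualitativeScores(
    action_invocation=80.0,
    instruction_adherence=70.0,
    conversational_quality=90.0,
)
print(example.overall())  # 80.0
```

Keeping the per-dimension values next to the average means two models with the same overall number can still be told apart by where they are strong or weak.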
The two quantitative things we measure on top of that:
Cost: For most organizations evaluating AI, defaulting to the latest model feels like the obvious choice. But in production environments running at scale, cost becomes a critical lever. For enterprises with hundreds of thousands of calls per month, ROI depends as much on cost optimization as it does on capability. Cost shapes everything: whether a model makes sense as a broad default, only in certain moments of a conversation, or only for customers with a use case that justifies the price.
Latency: Time to first token and end-to-end speech-to-speech latency, meaning how long it takes the agent to start responding after a caller speaks. Callers feel delays before they can articulate why something feels off.
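As a rough illustration of the latency side, the sketch below times how long a streaming response takes to yield its first token. It is a simplified stand-in, not our measurement harness: `measure_stream` and `fake_model_stream` are hypothetical names, the delays are arbitrary, and it covers only model time to first token, not the full speech-to-speech path (speech recognition and speech synthesis add their own delay).

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def measure_stream(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float, str]:
    """Consume a token stream; return (time to first token, total time, full text)."""
    start = clock()
    first_token_at: Optional[float] = None
    tokens = []
    for token in stream:
        if first_token_at is None:
            first_token_at = clock()  # the first token has just arrived
        tokens.append(token)
    end = clock()
    ttft = None if first_token_at is None else first_token_at - start
    return ttft, end - start, "".join(tokens)


def fake_model_stream() -> Iterator[str]:
    """Stand-in for a streaming LLM response; the delays are arbitrary."""
    time.sleep(0.05)  # simulated wait before the first token
    yield "Sure,"
    time.sleep(0.01)
    yield " I can help with that."


ttft, total, text = measure_stream(fake_model_stream())
print(f"time to first token: {ttft:.3f}s, total: {total:.3f}s, text: {text!r}")
```

Passing the clock in as a parameter is a small design choice that makes the timing logic testable with fixed timestamps instead of real sleeps.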
These are early findings. We've run Gemini 3.5 Flash through our standard benchmark, but we haven't had the chance to do the deeper optimization work, such as tuning how the model is integrated into our platform or testing prompt structures that are more native to how Gemini processes instructions. That work typically moves the numbers, and we'd expect it to here too. So take these as a first read, not a final verdict.
With that said: Gemini 3.5 Flash beat every GPT model in our benchmark across all three qualitative dimensions: action invocation, instruction adherence, and conversational quality. That's not what we expected going in. Instruction adherence in particular stood out, since that's consistently the hardest dimension for models to maintain across a full conversation.
Latency and cost are where you feel the trade-off. Running it on our platform today, you're looking at roughly a second more latency than the equivalent GPT models, and it's more expensive. Some of that gap will likely close as we optimize, but we don't want to promise numbers we haven't hit yet. What we can say is that even at current performance, the quality case is strong enough to make it worth deploying in the right context.

[Audio sample: a call with an AI agent powered by Gemini 3.5 Flash]
[Audio sample: a call with an AI agent powered by OpenAI's GPT-4o Mini]
The short answer is: we make it available for our AI agents.
Even when a model doesn't hit our thresholds on latency or cost, we're not going to be the ones deciding it's not worth trying. There are use cases we haven't thought of. There are also specific nodes in a conversation where a model that wouldn't make sense end-to-end is actually a good fit for that one moment.
In a lead qualification call, for instance, a lighter model may be sufficient for structured, predictable questions, while a more capable model is warranted for open-ended responses that require conditional logic and nuanced reasoning. That's a real pattern we see across use cases, and we don't want to foreclose it.
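The per-node pattern above could be sketched as a simple lookup from the kind of conversation node to a model tier. Everything named here is hypothetical: `NodeKind`, `MODEL_FOR_NODE`, `pick_model`, and the model identifiers are placeholders for illustration, not platform configuration or an actual API.

```python
from enum import Enum


class NodeKind(Enum):
    STRUCTURED = "structured"  # structured, predictable questions
    OPEN_ENDED = "open_ended"  # open-ended answers needing conditional logic


# Hypothetical model tiers; substitute the identifiers your platform exposes.
MODEL_FOR_NODE = {
    NodeKind.STRUCTURED: "lighter-model",
    NodeKind.OPEN_ENDED: "more-capable-model",
}


def pick_model(kind: NodeKind, default: str = "lighter-model") -> str:
    """Return the model tier for a node, falling back to the default tier."""
    return MODEL_FOR_NODE.get(kind, default)


print(pick_model(NodeKind.STRUCTURED))  # lighter-model
print(pick_model(NodeKind.OPEN_ENDED))  # more-capable-model
```

The point of the mapping is that the more expensive, slower model is only paid for on the turns that need its reasoning, while the rest of the call stays on the lighter tier.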
What we do bring is a point of view. When you are standing up a new agent or thinking about migrating an AI agent that's already running, we're not going to just add a dropdown and say good luck. Our Forward Deployed Engineers will consider which model makes sense, where the trade-offs are, and what to watch for. That opinion is based on what we’ve seen in the evaluation and, over time, what we see in production.
Where we'd point customers toward Gemini 3.5 Flash: use cases where decision-making quality is the priority and contact experience isn't bottlenecked by response speed. If you're running conversations where the person on the other end knows they're talking to an AI, or where they're conditioned to expect a beat before the agent responds, the latency trade-off is much easier to absorb. What you get in return is a model that's more accurate, follows its instructions more reliably, and handles complex tool calls better than anything else we've tested.
Each model has its relative strengths and weaknesses. We recommend using Test Cases & Simulations to stress test new models before switching, as you may need to tweak your prompt.
If your use case depends on a fast, naturalistic back-and-forth, there are cheaper models that will serve you better. But if you have an agent running on something like 5.2 and quality isn't where it needs to be, Gemini 3.5 Flash is the next thing to try. It's available on the platform now.
This is part of an ongoing series on how Regal evaluates LLMs for production voice AI. More to come as we keep running new models through the same process.
Regal evaluates new models as they ship, not on a fixed schedule. The pace of model releases in the industry is fast enough that waiting would mean leaving real improvements on the table. The goal isn't to chase every benchmark — it's to identify models that meaningfully shift the trade-offs between latency, cost, and quality for production voice AI.
In a lab setting, the newest model often looks like the obvious choice, but production environments are a different story. At scale, token-level cost differences compound quickly, and a model that performs well on benchmarks can behave differently on a live call. The right model depends on your specific use case, call volume, and where you're willing to accept trade-offs.
Gemini 3.5 Flash outperformed every GPT model across all three qualitative dimensions: action invocation, instruction adherence, and conversational quality. Instruction adherence in particular stood out, as it's consistently the hardest dimension for models to maintain across a full conversation. The trade-off is latency and cost, which means it's best suited for use cases where decision-making quality is the priority over response speed.
Yes, and in many cases that's the smarter approach. A lighter, faster model may be sufficient for structured, predictable questions, while a more capable model is better suited for open-ended responses that require conditional logic and nuanced reasoning. Regal's Forward Deployed Engineers can help identify where each model makes sense within a given conversation flow.
Regal recommends using Test Cases and Simulations to stress test any new model before deploying it in production. Switching models often requires prompt adjustments, since different models process instructions differently. Treat any evaluation as a first read rather than a final verdict — optimization work after the initial test typically moves the numbers meaningfully.



