
September 2023 Releases
You need confidence that every AI agent will behave exactly as intended before it interacts with live customers.
But enterprise-level testing is complex. Multi-turn conversations, branching logic, and hundreds of scenarios make it hard to identify end-to-end failures quickly. Doing so manually is very time-consuming.
The typical tradeoff is to run regression tests in small batches and go live, hoping the agent handles edge cases correctly.
That tradeoff is what our Simulations feature solves for.
As an extension of our Simulations feature, Evaluations further speed up pre-launch and regression testing while giving greater visibility into edge cases: mishandled questions, repetition, miscommunication, and incorrect function calls.
Evaluations let you validate agent behavior more quickly, so you can make immediate, impactful updates to your AI agent's prompt, custom actions, and knowledge bases, and launch with confidence.
Evaluations are scenario-specific automated pass/fail assessments that use an LLM to review simulated conversations against the success criteria you define.
They provide a granular QA layer on top of running simulations in bulk. You get a quick view of which cases passed and failed, so you know exactly where to prioritize prompt, knowledge base, or custom action updates.
1. Unlock granular automated QA with scenario-specific assessments
Each test case is evaluated against specific moments in the scenario (did the AI handle this objection correctly, did it get clarification on a vague answer?), not just the outcome of the entire conversation.
For example, you might test an appointment scheduling flow for “Unrelated Inquiry Handling,” where the success criteria are that the AI acknowledges the contact’s unrelated question, gracefully lets them know it can’t help, and then continues on the scheduling path.
It’s possible that the AI disregards the question completely, or mistakenly tries to answer it, yet still successfully schedules an appointment.
Because the success criteria were about handling the inquiry, that case would register as failed.
Scenario-specific assessments like this allow for fast, granular evaluation of every point in a conversation so you can tailor improvements to scoped, tangible issues.
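To make that concrete, here is a minimal sketch of how such a scenario-specific test case could be represented. The class and field names are hypothetical illustrations, not Regal's actual configuration format.

```python
from dataclasses import dataclass

@dataclass
class EvaluationCase:
    """One scenario-specific test case (illustrative structure, not Regal's config)."""
    scenario: str           # short label for the scenario being tested
    simulation_prompt: str  # instructions for the LLM that plays the customer
    success_criteria: str   # what the evaluator LLM checks, pass/fail

unrelated_inquiry_case = EvaluationCase(
    scenario="Unrelated Inquiry Handling",
    simulation_prompt=(
        "You are booking an appointment, but partway through you ask an "
        "unrelated question about home insurance."
    ),
    success_criteria=(
        "The AI acknowledges the unrelated question, explains it can't help "
        "with it, and then continues the scheduling flow."
    ),
)
```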
2. Ensure consistency with unified scoring logic across the platform
Simulations are scored by the same engine that powers Regal scorecards. This keeps pre-launch testing aligned with post-launch QA, giving you one unified way to measure improvements, identify regressions, and act on QA insights, regardless of when or where you’re doing so.
3. Built for complexity
Because LLMs can interpret variations in phrasing, intent, and context that strict rule-based testing might miss, your AI agent is tested and evaluated against realistic, nuanced conversations.
How it works: one LLM plays the customer-side contact in every simulation, and a second LLM carries out the evaluation itself, determining pass or fail using the Success Criteria you define in the Simulation prompt.
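As a rough sketch of that loop, assuming a generic `call_llm(prompt)` placeholder for any chat-completion call and the hypothetical `EvaluationCase` structure sketched earlier (this is an illustration, not Regal's implementation):

```python
def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion call; swap in your provider's SDK."""
    raise NotImplementedError

def run_simulation(case: EvaluationCase, agent_reply, max_turns: int = 6) -> list[str]:
    """One LLM plays the customer; the AI agent under test replies each turn."""
    transcript: list[str] = []
    for _ in range(max_turns):
        customer_msg = call_llm(
            f"Play this customer: {case.simulation_prompt}\n"
            "Conversation so far:\n" + "\n".join(transcript) + "\nReply as the customer."
        )
        transcript.append(f"Customer: {customer_msg}")
        transcript.append(f"Agent: {agent_reply(customer_msg)}")
    return transcript

def evaluate(case: EvaluationCase, transcript: list[str]) -> tuple[bool, str]:
    """A second LLM judges the transcript against the Success Criteria: pass or fail."""
    verdict = call_llm(
        "Answer PASS or FAIL with a one-sentence reason.\n"
        f"Success criteria: {case.success_criteria}\n"
        "Conversation:\n" + "\n".join(transcript)
    )
    return verdict.strip().upper().startswith("PASS"), verdict
```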
This automation is what makes it so fast to evaluate test cases and act on AI agent improvements.
By running tests in bulk and getting instant pass/fail scores, you know exactly where to prioritize AI Agent updates.
That reduces QA back-and-forth and speeds up the iteration and deployment cycle: you can quickly identify where the AI is underperforming and close those gaps pre-launch, ensuring better performance at deployment and reducing the need to manually QA the agent post-launch (and giving you that time back).
For example:
You have 10 test cases for an auto insurance lead qualification flow. Considering the use case, the tests would likely cover:
Conversation Flow
Actions Taken
Branching and Conditional Logic
You run all 10, and two test cases fail. For each failed case, you’ll see a summary of why it failed, alongside the original success criteria.
In a matter of seconds, you know you need to adjust your objection handling prompt and guardrails to better handle unsupported service questions and speak clearly about pricing.
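Continuing the earlier sketches, a bulk run over a suite like this could summarize results along these lines (helper names are hypothetical, not the product's API):

```python
def run_suite(cases: list[EvaluationCase], agent_reply) -> list[tuple[str, bool, str]]:
    """Run every case, collect pass/fail verdicts, and print a short summary."""
    results = []
    for case in cases:
        transcript = run_simulation(case, agent_reply)
        passed, reason = evaluate(case, transcript)
        results.append((case.scenario, passed, reason))

    failures = [r for r in results if not r[1]]
    print(f"{len(results) - len(failures)}/{len(results)} cases passed")
    for scenario, _, reason in failures:
        print(f"FAILED: {scenario} -> {reason}")
    return results
```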
Evaluations don’t just flag failures.
They provide a human-readable explanation of why a test failed, along with a sample conversation that shows the failure. And because you know the precise scenario that failed, you can immediately iterate on prompts, branching logic, and function calls.
For example:
The home insurance scenario above failed because the agent followed its prompt correctly and stayed helpful, but the prompt lacked explicit instructions for exiting out-of-scope inquiries.
Adding a single instruction to the Objection Handling prompt closes that gap, along the lines of: “If the contact asks about a product we don’t offer, such as home insurance, acknowledge the question, let them know you can’t help with it, and return to the qualification questions.”
From here, you re-run the simulation, and the test passes.
Even after validation, AI agents aren’t perfectly predictable.
LLM-simulated contacts can’t match human unpredictability one-to-one, so live calls will always introduce new edge cases and unexpected variations.
That makes Simulations and Evaluations critical for regression testing: when you create new agent versions or make improvements over time, you can continue to rerun your existing test suite and evaluations in one click, instead of starting from scratch every time.
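In terms of the earlier sketches, regression testing amounts to re-running the same stored suite against each new agent version, for example (names are again hypothetical):

```python
# The suite is defined once and reused across agent versions.
suite = [unrelated_inquiry_case]  # plus the rest of your saved scenarios

def agent_v2_reply(customer_msg: str) -> str:
    """Stand-in for the updated AI agent under test."""
    return call_llm(f"You are the updated scheduling agent. The customer said: {customer_msg}")

run_suite(suite, agent_v2_reply)  # same cases, new version: a one-step regression check
```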
This approach lets your enterprise turn insights into actionable improvements immediately, aligning simulation outcomes with live performance metrics, shortening iteration cycles, and ensuring agents perform reliably across every scenario.
Every failed test scenario is an opportunity to improve.
Evaluations turn each test case into precise insight, letting you understand exactly what went wrong and where to intervene. And because this happens automatically and at scale, it massively reduces the need for manual review.
This closes the gap between simulation and live performance, enabling you to iterate quickly, reduce QA cycles, and deploy AI agents that are reliable across complex workflows.
By incorporating scenario-specific feedback into every pre-launch evaluation, you ensure that your AI agents are not only tested, but truly optimized for your customers.
Explore Evaluations today to aggregate, analyze, and act on test results with confidence.
Ready to see Regal in action?
Book a personalized demo.