Eval mode
A/B testing
Pick a winner between two variants — different prompts, models, tools, or whole agent versions — with statistical confidence rather than a vibe check.
What you'll learn
- How to define A and B variants
- How to read the comparison charts
- How to detect significant deltas
- How to promote the winner safely
Set up the test
1. Pick the dataset
Use an existing simulation or annotation dataset. Both variants run the same inputs.
2. Define variant A
Usually the current production version. Pin to a specific agent version ID.
3. Define variant B
The candidate. Could be a new prompt, a different model, an added tool, or a forked agent.
4. Pick metrics
Any metric you care about: success rate, average cost, P95 latency, F1, custom rubric scores. The report shows each metric per variant.
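The platform's own API is not shown in this guide, so here is a minimal sketch of what the steps above amount to: both variants run the same dataset, and each metric is aggregated per variant. Treating variants as plain callables and metrics as scoring functions is an illustrative assumption, not the real SDK.

```python
# Sketch only: run_ab_test and the callable variants/metrics below are
# illustrative stand-ins, not the platform's actual API.
from statistics import mean

def run_ab_test(dataset, variant_a, variant_b, metrics):
    """Run both variants over the same inputs and aggregate each metric."""
    results = {}
    for name, variant in (("A", variant_a), ("B", variant_b)):
        outputs = [variant(case) for case in dataset]
        results[name] = {m: mean(fn(o) for o in outputs)
                         for m, fn in metrics.items()}
    return results

# Toy usage: variant A answers correctly, variant B is cheaper but wrong.
dataset = [{"input": "q1", "expected": "a1"},
           {"input": "q2", "expected": "a2"}]
variant_a = lambda case: {"answer": case["expected"], "cost": 0.02}
variant_b = lambda case: {"answer": "wrong", "cost": 0.01}
metrics = {
    "success_rate": lambda o: 1.0 if o["answer"] != "wrong" else 0.0,
    "avg_cost": lambda o: o["cost"],
}
results = run_ab_test(dataset, variant_a, variant_b, metrics)
print(results)
```

The key property is that both variants see exactly the same inputs, so every metric delta is attributable to the variant, not the data.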
Reading the report
The report puts variants side by side on every metric, with deltas and a significance marker. Statistically significant improvements are flagged in green; regressions in red. A per-case grid lets you spot which inputs flipped — that is often where the real insight is.
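The significance marker on a success-rate delta typically comes from a test like the two-proportion z-test. The platform's exact method is not documented here; this is a standard-library sketch of that check, assuming independent cases.

```python
# Two-proportion z-test on success rate: the kind of check behind the
# green/red significance flags. Assumes cases are independent.
from math import sqrt
from statistics import NormalDist

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z, p = two_proportion_z(150, 300, 180, 300)  # 50% vs 60% success
print(f"z={z:.2f}, p={p:.4f}")
```

At these sample sizes a 10-point lift clears the usual alpha=0.05 bar; the same lift on 30 cases would not, which is why the per-case grid matters more than the headline number on small datasets.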
Promote safely
1. Read the regressions
Even a winning variant usually loses on some cases. Read those before promoting; sometimes the cases that regressed are the ones that matter most.
2. Promote the variant
Promote turns the winning variant into the new production version. The old version is kept for rollback.
3. Schedule a re-test
Set drift detection on the new version. If quality moves after release, you will hear about it.
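Finding the flipped cases from step 1 is a simple diff over per-case pass/fail results. The shape of the data here (a dict of case ID to pass/fail) is an assumption for illustration; the per-case grid in the report gives you the same view.

```python
# Sketch: diff per-case pass/fail results to find flips before promoting.
# The dict-of-booleans shape is illustrative, not the platform's export format.
def find_flips(results_a, results_b):
    regressions = [cid for cid in results_a
                   if results_a[cid] and not results_b[cid]]
    improvements = [cid for cid in results_a
                    if not results_a[cid] and results_b[cid]]
    return regressions, improvements

results_a = {"case-1": True, "case-2": True, "case-3": False}
results_b = {"case-1": True, "case-2": False, "case-3": True}
regressions, improvements = find_flips(results_a, results_b)
print(regressions)   # cases B newly fails: review these before promoting
print(improvements)
```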
Frequently asked questions
- How many cases do I need for significance?
- It depends on effect size, but 200 to 500 cases is a common sweet spot. The report shows confidence intervals so you can see when the gap is real.
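A back-of-envelope way to turn "it depends on effect size" into a number is the standard sample-size formula for a two-proportion test. This is textbook statistics, not a platform feature; alpha=0.05 two-sided and 80% power are conventional choices.

```python
# Required cases per variant to detect a given success-rate lift with a
# two-proportion test. Standard formula; no platform API involved.
from statistics import NormalDist

def n_per_variant(p_a, p_b, alpha=0.05, power=0.80):
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p_a + p_b) / 2
    num = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
           + z_b * (p_a * (1 - p_a) + p_b * (1 - p_b)) ** 0.5) ** 2
    return num / (p_b - p_a) ** 2

# Detecting a 70% -> 80% lift needs roughly 300 cases per variant,
# which is why 200-500 is the usual sweet spot for deltas of this size.
n = n_per_variant(0.70, 0.80)
print(round(n))
```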
- Can I run more than two variants?
- Yes. A/B/n is supported; the report adds columns per variant. Watch out for multiple-comparison effects on significance.
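The multiple-comparison caveat can be made concrete with the simplest correction, Bonferroni: with k variants compared against the baseline, each comparison must clear alpha/k to keep the family-wise error rate at alpha. The p-values below are made up for illustration.

```python
# Why A/B/n needs correction: 3 comparisons at alpha=0.05 each would inflate
# the false-positive rate, so each must clear alpha/k instead.
# (More powerful corrections like Holm exist; Bonferroni is the simplest.)
def bonferroni_threshold(alpha, k):
    return alpha / k

p_values = {"B": 0.010, "C": 0.030, "D": 0.400}  # vs variant A, illustrative
threshold = bonferroni_threshold(0.05, len(p_values))
winners = [v for v, p in p_values.items() if p < threshold]
print(threshold)  # 0.05 / 3
print(winners)    # C would pass at 0.05 alone, but not after correction
```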
- Can I A/B in production traffic instead of on a dataset?
- Yes — that is online A/B. Configure a traffic split on the agent and the platform routes runs to each variant. Use this once you trust offline results.
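The traffic split itself is configured on the agent, so there is nothing to code; but the routing idea is worth seeing. A common implementation hashes a stable key (such as a user ID) so each user consistently lands on the same variant. This sketch is an assumption about how such routing typically works, not a description of the platform's mechanism.

```python
# Sketch of deterministic traffic routing for an online A/B: hashing a
# stable key means the same user always sees the same variant.
import hashlib

def route(user_id: str, split_b: float = 0.1) -> str:
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "B" if bucket < split_b else "A"

# The same user is always routed to the same variant.
print(route("user-42"), route("user-42"))
```

Deterministic routing avoids a user flip-flopping between variants mid-session, which would contaminate per-user metrics.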