Eval mode

Simulation

Define a fixed set of inputs and expected outcomes. Simulation runs the agent over each one and reports a success rate plus per-case breakdown.

What you'll learn

How to assemble a simulation dataset
How to define success per scenario
How to read the run report
How to gate releases on simulation pass rate

Build the dataset

1
Start with real runs
In Monitor, multi-select interesting runs and choose Add to dataset. Each becomes a scenario with the original input pre-filled.
2
Add expected behavior
For each scenario, write the expected output, the expected tool calls, or a rubric. You can mix formats — one scenario can check for a tool call, another can check semantic similarity.
3
Add edge cases by hand
Cover known failure modes — empty input, hostile prompts, off-topic asks, long context. Ten well-chosen scenarios beat a hundred generic ones.

Run the simulation

1
Pick the agent version
Select which agent and which version to evaluate. You can pin to a specific version or always test latest.
2
Pick the metric
Default is overall success — pass if every scenario expectation is met. You can also break down by metric (exact match, semantic, tool match).
3
Run and watch
Each scenario streams in as it finishes. The report shows pass/fail per case with a one-click link to the run trace.

Reading the report

The top of the report is the headline success rate. Below it, a grid lists every scenario with its verdict, latency and cost. Failing rows expand to show what was expected versus what the agent produced — and the full trace, in case you need to debug.

Gating releases

In your agent's settings, set a minimum simulation pass rate. Publishing a new version that scores below the threshold is blocked. Use this as a CI-style guardrail on quality.

Frequently asked questions

How big should a simulation dataset be?: Start with 20 to 50 scenarios. Add a new case every time you find a real failure. Quality beats quantity — every case should test something specific.
Can I run simulation from CI?: Yes. The Eval API exposes a run-and-wait endpoint that returns the pass rate. Wire it into your deploy pipeline.
Does simulation use real tools?: By default yes — the agent runs end to end. You can mark a dataset as sandboxed, which uses mock tool responses you record on each scenario.

Source real runs into your dataset.