Eval mode
Simulation
Define a fixed set of inputs and expected outcomes. Simulation runs the agent over each one and reports a success rate plus per-case breakdown.
What you'll learn
- How to assemble a simulation dataset
- How to define success per scenario
- How to read the run report
- How to gate releases on simulation pass rate
Build the dataset
- 1
Start with real runs
In Monitor, multi-select interesting runs and choose Add to dataset. Each becomes a scenario with the original input pre-filled. - 2
Add expected behavior
For each scenario, write the expected output, the expected tool calls, or a rubric. You can mix formats — one scenario can check for a tool call, another can check semantic similarity. - 3
Add edge cases by hand
Cover known failure modes — empty input, hostile prompts, off-topic asks, long context. Ten well-chosen scenarios beat a hundred generic ones.
Run the simulation
- 1
Pick the agent version
Select which agent and which version to evaluate. You can pin to a specific version or always test latest. - 2
Pick the metric
Default is overall success — pass if every scenario expectation is met. You can also break down by metric (exact match, semantic, tool match). - 3
Run and watch
Each scenario streams in as it finishes. The report shows pass/fail per case with a one-click link to the run trace.
Reading the report
The top of the report is the headline success rate. Below it, a grid lists every scenario with its verdict, latency and cost. Failing rows expand to show what was expected versus what the agent produced — and the full trace, in case you need to debug.
Gating releases
In your agent's settings, set a minimum simulation pass rate. Publishing a new version that scores below the threshold is blocked. Use this as a CI-style guardrail on quality.
Frequently asked questions
- How big should a simulation dataset be?
- Start with 20 to 50 scenarios. Add a new case every time you find a real failure. Quality beats quantity — every case should test something specific.
- Can I run simulation from CI?
- Yes. The Eval API exposes a run-and-wait endpoint that returns the pass rate. Wire it into your deploy pipeline.
- Does simulation use real tools?
- By default yes — the agent runs end to end. You can mark a dataset as sandboxed, which uses mock tool responses you record on each scenario.