Eval mode

Simulation

Define a fixed set of inputs and expected outcomes. Simulation runs the agent over each one and reports a success rate plus per-case breakdown.

What you'll learn
  • How to assemble a simulation dataset
  • How to define success per scenario
  • How to read the run report
  • How to gate releases on simulation pass rate

Build the dataset

  1. 1

    Start with real runs

    In Monitor, multi-select interesting runs and choose Add to dataset. Each becomes a scenario with the original input pre-filled.
  2. 2

    Add expected behavior

    For each scenario, write the expected output, the expected tool calls, or a rubric. You can mix formats — one scenario can check for a tool call, another can check semantic similarity.
  3. 3

    Add edge cases by hand

    Cover known failure modes — empty input, hostile prompts, off-topic asks, long context. Ten well-chosen scenarios beat a hundred generic ones.

Run the simulation

  1. 1

    Pick the agent version

    Select which agent and which version to evaluate. You can pin to a specific version or always test latest.
  2. 2

    Pick the metric

    Default is overall success — pass if every scenario expectation is met. You can also break down by metric (exact match, semantic, tool match).
  3. 3

    Run and watch

    Each scenario streams in as it finishes. The report shows pass/fail per case with a one-click link to the run trace.

Reading the report

The top of the report is the headline success rate. Below it, a grid lists every scenario with its verdict, latency and cost. Failing rows expand to show what was expected versus what the agent produced — and the full trace, in case you need to debug.

Gating releases

In your agent's settings, set a minimum simulation pass rate. Publishing a new version that scores below the threshold is blocked. Use this as a CI-style guardrail on quality.

Frequently asked questions

How big should a simulation dataset be?
Start with 20 to 50 scenarios. Add a new case every time you find a real failure. Quality beats quantity — every case should test something specific.
Can I run simulation from CI?
Yes. The Eval API exposes a run-and-wait endpoint that returns the pass rate. Wire it into your deploy pipeline.
Does simulation use real tools?
By default yes — the agent runs end to end. You can mark a dataset as sandboxed, which uses mock tool responses you record on each scenario.