Eval

Evaluate agents with confidence

Stop shipping LLM changes on gut feel. Eval gives you four measurement modes — synthetic scenarios, human-labeled outputs, two-variant tests, and continuous drift detection — over a single datasets layer.

What you'll learn
  • What problem each eval mode solves
  • When to reach for simulation vs annotation vs A/B vs drift
  • How datasets, metrics and judges fit together
  • Where eval slots into your release workflow

The four modes

Each mode answers a different question. Use them together to cover the full quality lifecycle.
  1. Simulation

    Run the agent against a set of synthetic scenarios and report a success rate. Best for catching regressions before you ship (see the sketch after this list).
  2. Annotation

    Have humans label real outputs as correct or incorrect, then compute precision, recall and F1. Best for measuring true quality on production data.
  3. A/B testing

    Run two agent variants on the same dataset and compare metrics side by side. Best for picking between two prompt or model candidates.
  4. Drift detection

    Continuously score a sample of live traffic and alert when quality moves. Best for catching silent regressions after release.
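To make the simulation loop concrete, here is a minimal sketch in Python. Everything in it (`Scenario`, `run_agent`, `simulate`, the substring pass check) is a hypothetical stand-in for illustration, not the product's API; it only shows the shape of "run synthetic scenarios, report a success rate."

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    prompt: str
    must_contain: str  # what a passing answer has to include

def run_agent(prompt: str) -> str:
    # Hypothetical stand-in: call your actual agent here.
    return "The capital of France is Paris."

def simulate(scenarios: list[Scenario]) -> float:
    # Run every synthetic scenario and report the fraction that pass.
    passed = sum(s.must_contain in run_agent(s.prompt) for s in scenarios)
    return passed / len(scenarios)

golden = [Scenario("What is the capital of France?", "Paris")]
print(f"success rate: {simulate(golden):.0%}")  # 100%
```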

Datasets, metrics and judges

All four modes share three primitives. A dataset is the set of inputs (and optional expected outputs) you evaluate against. A metric is what you measure — exact match, semantic similarity, faithfulness, custom rubric. A judge is what produces the metric — a model, a rule, or a human. Calibrate the judge on a small labeled set before trusting it at scale.
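As a rough sketch of how the three primitives compose (every name below is an illustrative assumption, not the product's API): a dataset is rows of inputs and expected outputs, a metric is a score, and a judge is whatever callable produces that score.

```python
from typing import Callable

# Dataset: the inputs (and optional expected outputs) you evaluate against.
dataset = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "Capital of Japan?", "expected": "Tokyo"},
]

# Metric: what you measure. Exact match is the simplest case.
def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip() == expected.strip() else 0.0

# Judge: whatever produces the metric. A rule like exact_match above,
# a human label, or a model call all fit the same callable shape.
Judge = Callable[[str, str], float]

def score(outputs: list[str], judge: Judge) -> float:
    # Average the judge's score across every row of the dataset.
    pairs = zip(outputs, dataset)
    return sum(judge(o, row["expected"]) for o, row in pairs) / len(dataset)

print(score(["4", "Kyoto"], exact_match))  # 0.5
```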

Where eval slots in

  1. During build

    Run simulation against a small golden dataset every time you change the agent. Treat it as a CI-style gate before publish (see the gate sketch after this list).
  2. Pre-release

    Run A/B against the live version on a larger dataset. Promote the winner.
  3. In production

    Drift detection runs continuously. Annotation queues sample failing runs for human review.
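A minimal sketch of the step 1 gate, reusing the hypothetical `simulate` helper and `golden` dataset from the simulation sketch above; the 0.95 threshold is an assumed bar for illustration, not a product default.

```python
import sys

THRESHOLD = 0.95  # assumed bar; tune it to your golden set

rate = simulate(golden)  # simulation sketch from earlier in this page
if rate < THRESHOLD:
    print(f"eval gate failed: {rate:.0%} < {THRESHOLD:.0%}", file=sys.stderr)
    sys.exit(1)  # non-zero exit fails the CI job and blocks the publish
print(f"eval gate passed: {rate:.0%}")
```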

Frequently asked questions

Do I need a dataset to start?
For simulation and A/B, yes — but you can bootstrap one from Monitor by sending failing runs into a new dataset. Drift detection needs no dataset; it samples live traffic.
Which model should I use as a judge?
A frontier model from a different family than the agent. Calibrate on a small human-labeled set first — judge agreement with humans should clear 80 percent before you trust it.
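Calibration here reduces to measuring agreement on items that both the judge and a human have labeled. A tiny sketch of that check (the function and variable names are illustrative, not from the product):

```python
def agreement(judge: list[bool], human: list[bool]) -> float:
    # Fraction of items where the judge's verdict matches the human label.
    return sum(j == h for j, h in zip(judge, human)) / len(human)

judge_says = [True, True, False, True, False]   # model-judge verdicts
human_says = [True, True, False, False, False]  # human ground-truth labels

print(f"agreement: {agreement(judge_says, human_says):.0%}")  # 80%
```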
Can I write a custom metric?
Yes. Metrics are pluggable. Use the built-in library for common cases (exact match, semantic match, faithfulness, toxicity) or register a custom function or LLM rubric.
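The actual registration API isn't documented here, so purely as a hypothetical illustration of the pluggable-metric pattern: a decorator that maps a metric name to a scoring function.

```python
from typing import Callable

METRICS: dict[str, Callable[[str, str], float]] = {}

def register_metric(name: str):
    # Illustrative registry: map a metric name to a scoring function.
    def wrap(fn: Callable[[str, str], float]):
        METRICS[name] = fn
        return fn
    return wrap

@register_metric("cites_source")
def cites_source(output: str, expected: str) -> float:
    # Custom rubric: reward answers that carry a citation marker.
    return 1.0 if "[source]" in output else 0.0

print(METRICS["cites_source"]("Paris [source]", ""))  # 1.0
```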
How is this different from Monitor?
Monitor records what happened. Eval scores whether what happened was correct. You need both — observability tells you the run finished; eval tells you it was right.