Eval

Evaluate agents with confidence

Stop shipping LLM changes on gut feel. Eval gives you four measurement modes — synthetic scenarios, human-labeled outputs, two-variant tests, and continuous drift detection — over a single datasets layer.

What you'll learn
  • What problem each eval mode solves
  • When to reach for simulation vs annotation vs A/B vs drift
  • How datasets, metrics and judges fit together
  • Where eval slots into your release workflow

The four modes

Each mode answers a different question. Use them together to cover the full quality lifecycle.
  1. Simulation

    Run the agent against a set of synthetic scenarios and report a success rate. Best for catching regressions before you ship (see the sketch after this list).
  2. Annotation

    Have humans label real outputs as correct or incorrect, then compute precision, recall and F1. Best for measuring true quality on production data.
  3. A/B testing

    Run two agent variants on the same dataset and compare metrics side by side. Best for picking between two prompt or model candidates.
  4. Drift detection

    Continuously score a sample of live traffic and alert when quality moves. Best for catching silent regressions after release.
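To make the simulation loop concrete, here is a minimal sketch in Python. Everything in it (`Scenario`, `run_agent`, `simulate`, the substring pass check) is a hypothetical stand-in for illustration, not the product's API; it only shows the shape of "run synthetic scenarios, report a success rate."

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    prompt: str
    must_contain: str  # what a passing answer has to include

def run_agent(prompt: str) -> str:
    # Hypothetical stand-in: call your actual agent here.
    return "The capital of France is Paris."

def simulate(scenarios: list[Scenario]) -> float:
    # Run every synthetic scenario and report the fraction that pass.
    passed = sum(s.must_contain in run_agent(s.prompt) for s in scenarios)
    return passed / len(scenarios)

golden = [Scenario("What is the capital of France?", "Paris")]
print(f"success rate: {simulate(golden):.0%}")  # 100%
```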

Datasets, metrics and judges

All four modes share three primitives. A dataset is the set of inputs (and optional expected outputs) you evaluate against. A metric is what you measure — exact match, semantic similarity, faithfulness, custom rubric. A judge is what produces the metric — a model, a rule, or a human. Calibrate the judge on a small labeled set before trusting it at scale.
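As a rough sketch of how the three primitives compose (every name below is an illustrative assumption, not the product's API): a dataset is rows of inputs and expected outputs, a metric is a score, and a judge is whatever callable produces that score.

```python
from typing import Callable

# Dataset: the inputs (and optional expected outputs) you evaluate against.
dataset = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "Capital of Japan?", "expected": "Tokyo"},
]

# Metric: what you measure. Exact match is the simplest case.
def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip() == expected.strip() else 0.0

# Judge: whatever produces the metric. A rule like exact_match above,
# a human label, or a model call all fit the same callable shape.
Judge = Callable[[str, str], float]

def score(outputs: list[str], judge: Judge) -> float:
    # Average the judge's score across every row of the dataset.
    pairs = zip(outputs, dataset)
    return sum(judge(o, row["expected"]) for o, row in pairs) / len(dataset)

print(score(["4", "Kyoto"], exact_match))  # 0.5
```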

Where eval slots in

  1. During build

    Run simulation against a small golden dataset every time you change the agent. Treat it as a CI-style gate before publish (see the gate sketch after this list).
  2. Pre-release

    Run A/B against the live version on a larger dataset. Promote the winner.
  3. In production

    Drift detection runs continuously. Annotation queues sample failing runs for human review.
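A minimal sketch of the step 1 gate, reusing the hypothetical `simulate` helper and `golden` dataset from the simulation sketch above; the 0.95 threshold is an assumed bar for illustration, not a product default.

```python
import sys

THRESHOLD = 0.95  # assumed bar; tune it to your golden set

rate = simulate(golden)  # simulation sketch from earlier in this page
if rate < THRESHOLD:
    print(f"eval gate failed: {rate:.0%} < {THRESHOLD:.0%}", file=sys.stderr)
    sys.exit(1)  # non-zero exit fails the CI job and blocks the publish
print(f"eval gate passed: {rate:.0%}")
```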

Frequently asked questions

Do I need a dataset to start?
For simulation and A/B, yes — but you can bootstrap one from Monitor by sending failing runs into a new dataset. Drift detection needs no dataset; it samples live traffic.
Which model should I use as a judge?
A frontier model from a different family than the agent. Calibrate on a small human-labeled set first — judge agreement with humans should clear 80 percent before you trust it.
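Calibration here reduces to measuring agreement on items that both the judge and a human have labeled. A tiny sketch of that check (the function and variable names are illustrative, not from the product):

```python
def agreement(judge: list[bool], human: list[bool]) -> float:
    # Fraction of items where the judge's verdict matches the human label.
    return sum(j == h for j, h in zip(judge, human)) / len(human)

judge_says = [True, True, False, True, False]   # model-judge verdicts
human_says = [True, True, False, False, False]  # human ground-truth labels

print(f"agreement: {agreement(judge_says, human_says):.0%}")  # 80%
```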
Can I write a custom metric?
Yes. Metrics are pluggable. Use the built-in library for common cases (exact match, semantic match, faithfulness, toxicity) or register a custom function or LLM rubric.
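The actual registration API isn't documented here, so purely as a hypothetical illustration of the pluggable-metric pattern: a decorator that maps a metric name to a scoring function.

```python
from typing import Callable

METRICS: dict[str, Callable[[str, str], float]] = {}

def register_metric(name: str):
    # Illustrative registry: map a metric name to a scoring function.
    def wrap(fn: Callable[[str, str], float]):
        METRICS[name] = fn
        return fn
    return wrap

@register_metric("cites_source")
def cites_source(output: str, expected: str) -> float:
    # Custom rubric: reward answers that carry a citation marker.
    return 1.0 if "[source]" in output else 0.0

print(METRICS["cites_source"]("Paris [source]", ""))  # 1.0
```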
How is this different from Monitor?
Monitor records what happened. Eval scores whether what happened was correct. You need both — observability tells you the run finished; eval tells you it was right.