Eval
Evaluate agents with confidence
Stop shipping LLM changes on gut feel. Eval gives you four measurement modes — synthetic scenarios, human-labeled outputs, two-variant tests, and continuous drift detection — over a single datasets layer.
What you'll learn
- What problem each eval mode solves
- When to reach for simulation vs annotation vs A/B vs drift
- How datasets, metrics and judges fit together
- Where eval slots into your release workflow
The four modes
Each mode answers a different question. Use them together to cover the full quality lifecycle; a short sketch of the scoring arithmetic behind the modes follows the list.
1. Simulation: run the agent against a set of synthetic scenarios and report a success rate. Best for catching regressions before you ship.
2. Annotation: have humans label real outputs as correct or incorrect, then compute precision, recall, and F1. Best for measuring true quality on production data.
3. A/B testing: run two agent variants on the same dataset and compare metrics side by side. Best for picking between two prompt or model candidates.
4. Drift detection: continuously score a sample of live traffic and alert when quality moves. Best for catching silent regressions after release.
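The arithmetic behind these modes is small enough to sketch. Below is a minimal, dependency-free Python sketch of the three scoring functions involved; A/B testing reuses the simulation arithmetic, computed once per variant. The function names, window size, and tolerance are illustrative assumptions, not this product's API.

```python
# Dependency-free sketches of the scoring arithmetic behind the modes.
# The inputs are illustrative; wire these up to your own run results.

def success_rate(results: list[bool]) -> float:
    """Simulation: fraction of synthetic scenarios the agent passed.
    For A/B testing, compute this once per variant and compare."""
    return sum(results) / len(results) if results else 0.0

def precision_recall_f1(labels: list[bool], predictions: list[bool]) -> tuple[float, float, float]:
    """Annotation: score binary predictions against human ground-truth labels."""
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def drifted(scores: list[float], baseline: float, window: int = 100, tolerance: float = 0.05) -> bool:
    """Drift detection (higher-is-better metric): alert when the rolling
    mean of recent live-traffic scores falls below baseline - tolerance."""
    recent = scores[-window:]
    return bool(recent) and sum(recent) / len(recent) < baseline - tolerance
```

For example, `success_rate([True, True, False])` returns roughly 0.67; run it on each variant's results to compare an A/B pair.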
Datasets, metrics and judges
All four modes share three primitives. A dataset is the set of inputs (and optional expected outputs) you evaluate against. A metric is what you measure — exact match, semantic similarity, faithfulness, custom rubric. A judge is what produces the metric — a model, a rule, or a human. Calibrate the judge on a small labeled set before trusting it at scale.
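To make the three primitives concrete, here is a minimal runnable sketch, assuming a simple input/expected dataset schema and a rule-based judge. The schema, names, and sample data are illustrative, not a fixed API.

```python
# Illustrative primitives: dataset, metric, judge. Schema and names are
# assumptions for the sketch, not this product's API.

dataset = [
    {"input": "What is 2 + 2?", "expected": "4"},
    {"input": "Capital of France?", "expected": "Paris"},
]

def exact_match(output: str, expected: str) -> float:
    """Metric: 1.0 on a normalized exact match, else 0.0."""
    return float(output.strip().lower() == expected.strip().lower())

def judge(output: str, expected: str) -> bool:
    """Judge: a rule here; in practice often a model or a human."""
    return exact_match(output, expected) == 1.0

# Calibration: before trusting a judge at scale, check its agreement
# with a small human-labeled set. The bar suggested here is 80 percent.
human_verdicts = [True, False, True, True]   # human labels on four outputs
judge_verdicts = [True, False, True, True]   # judge's calls on the same outputs
agreement = sum(h == j for h, j in zip(human_verdicts, judge_verdicts)) / len(human_verdicts)
print(f"judge-human agreement: {agreement:.0%}")  # trust the judge only above 80%
```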
Where eval slots in
1. During build: run simulation against a small golden dataset every time you change the agent, as a CI-style gate before publish (see the sketch after this list).
2. Pre-release: run A/B against the live version on a larger dataset and promote the winner.
3. In production: drift detection runs continuously, and annotation queues sample failing runs for human review.
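As a concrete picture of the build-time gate, here is a sketch of a script a CI job could run before publish. `run_agent`, `judge`, the golden.json path, and the 90 percent threshold are all placeholder assumptions to swap for your own agent, scorer, and bar.

```python
# CI-style simulation gate: fail the build when the golden-set success
# rate drops below a threshold. All names and paths are placeholders.
import json
import sys

THRESHOLD = 0.9  # illustrative bar for the golden set

def run_agent(prompt: str) -> str:
    """Stand-in for the real agent call."""
    return ""

def judge(output: str, expected: str) -> bool:
    """Stand-in scorer; swap in your real metric or judge."""
    return output.strip() == expected.strip()

def main(path: str = "golden.json") -> None:
    with open(path) as f:
        scenarios = json.load(f)  # [{"input": ..., "expected": ...}, ...]
    passed = sum(judge(run_agent(s["input"]), s["expected"]) for s in scenarios)
    rate = passed / len(scenarios)
    print(f"simulation success rate: {rate:.1%} ({passed}/{len(scenarios)})")
    if rate < THRESHOLD:
        sys.exit(1)  # non-zero exit blocks the publish step

if __name__ == "__main__":
    main()
```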
Frequently asked questions
- Do I need a dataset to start?
- For simulation and A/B, yes — but you can bootstrap one from Monitor by sending failing runs into a new dataset. Drift detection needs no dataset; it samples live traffic.
- Which model should I use as a judge?
- A frontier model from a different family than the agent. Calibrate on a small human-labeled set first — judge agreement with humans should clear 80 percent before you trust it.
- Can I write a custom metric?
- Yes. Metrics are pluggable: use the built-in library for common cases (exact match, semantic match, faithfulness, toxicity), or register a custom function or LLM rubric. A sketch of a custom metric follows this FAQ.
- How is this different from Monitor?
- Monitor records what happened. Eval scores whether what happened was correct. You need both — observability tells you the run finished; eval tells you it was right.
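For the custom-metric question above, the shape is simple: a metric is a function from an output (and optional expected value) to a score. The sketch below assumes that shape; how you register the function depends on your eval SDK, which this page does not specify.

```python
# A custom metric as a plain function: output (plus optional expected
# value) in, score in [0, 1] out. Registration depends on your eval SDK.
from typing import Optional

def cites_sources(output: str, expected: Optional[str] = None) -> float:
    """Reward outputs that include at least one bracketed citation marker."""
    return 1.0 if "[" in output and "]" in output else 0.0
```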