Annotation
Synthetic tests miss real-world failures. Annotation sends sampled production runs to human reviewers, captures verdicts, and rolls them into precision, recall and F1.
What you'll learn
- How to build an annotation queue
- How to define labels and rubrics
- How precision, recall and F1 are computed
- How to feed annotated examples back into datasets
Build the queue
1. Pick a source
Sample from live runs of one agent, or pull from a Monitor filter (e.g. failing runs, runs over a cost threshold).
2. Set sample size
Annotation is human-time-bound. Aim for a sample big enough to be statistically meaningful; a few hundred runs per week is a reasonable starting point.
3. Assign reviewers
Add workspace members as reviewers. Each run can require one or multiple reviewers; multi-review yields agreement scores. The sketch after this list ties the three steps together.
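A minimal sketch of the three steps in Python. The `Run` and `QueueItem` shapes, the cost threshold, and the round-robin assignment are illustrative assumptions; a real queue would be built through the platform's API.

```python
import random
from dataclasses import dataclass, field

# Hypothetical record shapes; real fields would come from the platform's API.
@dataclass
class Run:
    run_id: str
    failed: bool
    cost_usd: float

@dataclass
class QueueItem:
    run: Run
    reviewers: list[str]
    verdicts: dict[str, str] = field(default_factory=dict)  # reviewer -> label

def build_queue(runs: list[Run], sample_size: int, reviewers: list[str],
                reviewers_per_run: int = 2, seed: int = 0) -> list[QueueItem]:
    # Step 1: pick a source -- here, a Monitor-style filter on failures and cost.
    candidates = [r for r in runs if r.failed or r.cost_usd > 0.50]
    # Step 2: set the sample size; a fixed seed keeps the sample reproducible.
    sample = random.Random(seed).sample(candidates, min(sample_size, len(candidates)))
    # Step 3: assign reviewers round-robin so review load spreads evenly.
    return [
        QueueItem(run=run,
                  reviewers=[reviewers[(i + j) % len(reviewers)]
                             for j in range(reviewers_per_run)])
        for i, run in enumerate(sample)
    ]
```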
Define the rubric
A rubric is the structured judgement the reviewer makes. The simplest form is a binary correct / incorrect verdict. Richer rubrics break the judgement into dimensions (helpfulness, factuality, tone, tool choice), each scored on a fixed scale. Keep rubrics short; long rubrics drift across reviewers.
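To make the shape concrete, here is one way a rubric could be written down as plain data. The `Dimension` type and the example dimensions are hypothetical; they mirror the paragraph above rather than any required schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Dimension:
    name: str
    description: str        # the one line the reviewer reads before scoring
    scale: tuple[int, int]  # inclusive (min, max)

# Simplest form: a single binary verdict.
BINARY = [Dimension("correct", "Did the agent resolve the request?", (0, 1))]

# Richer form: a few dimensions on a fixed 1-5 scale. Every extra
# dimension is another place reviewers can drift apart, so keep it short.
RICH = [
    Dimension("helpfulness", "Did the answer address the user's actual need?", (1, 5)),
    Dimension("factuality", "Are the claims supported or verifiable?", (1, 5)),
    Dimension("tone", "Is the register appropriate for the channel?", (1, 5)),
    Dimension("tool_choice", "Did the agent pick the right tools?", (1, 5)),
]
```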
Reading the metrics
1. Precision
Of the runs the agent treated as positive (e.g. resolved, classified as urgent), the fraction that were actually positive.
2. Recall
Of the runs that should have been positive, the fraction the agent caught.
3. F1
The harmonic mean of precision and recall; a single number to track over time.
4. Inter-rater agreement
When two reviewers label the same run, how often they agree. Below 80 percent usually means the rubric is too vague. The sketch after this list computes all four from annotated verdicts.
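A self-contained sketch of the arithmetic, assuming each annotated run has been reduced to a pair of booleans (the agent's verdict, the human's verdict) for one positive class, and each double-reviewed run to a pair of labels:

```python
def precision_recall_f1(verdicts: list[tuple[bool, bool]]) -> tuple[float, float, float]:
    """verdicts: one (agent_positive, human_positive) pair per annotated run."""
    tp = sum(1 for agent, human in verdicts if agent and human)
    fp = sum(1 for agent, human in verdicts if agent and not human)
    fn = sum(1 for agent, human in verdicts if not agent and human)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def agreement(pairs: list[tuple[str, str]]) -> float:
    """Fraction of double-reviewed runs where both reviewer labels match."""
    return sum(1 for a, b in pairs if a == b) / len(pairs)

# Worked example: the agent flagged 3 runs (2 truly positive) and missed 1 positive:
# tp=2, fp=1, fn=1 -> precision 2/3, recall 2/3, F1 2/3.
print(precision_recall_f1([(True, True), (True, True), (True, False), (False, True)]))
```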
Close the loop
Annotated runs feed back into datasets. Send failing examples to a simulation dataset for regression testing. Send borderline examples to a calibration set for tuning the judge LLM.
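As a sketch, the routing rule could key off the reviewer verdicts alone; the dataset names and the "correct" / "incorrect" labels here are placeholders, not a fixed vocabulary.

```python
def route(verdicts: dict[str, str]) -> str | None:
    """Map one run's reviewer verdicts (reviewer -> "correct"/"incorrect")
    to a downstream dataset. Returns None when there is nothing to feed back."""
    labels = list(verdicts.values())
    if not labels:
        return None                  # not reviewed yet
    if all(v == "incorrect" for v in labels):
        return "simulation_dataset"  # clear failure -> regression testing
    if len(set(labels)) > 1:
        return "calibration_set"     # reviewers split -> borderline, tune the judge
    return None                      # clean pass
```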
Frequently asked questions
- Can the judge LLM annotate instead of a human?
- Yes, but calibrate first. Have humans label a few hundred runs, then check that the judge agrees with them on at least 80 percent before letting it annotate solo (a calibration-gate sketch follows this FAQ).
- How do I avoid reviewer bias?
- Blind reviewers to the agent version, randomize the order, and require multiple reviewers per run. Agreement metrics will surface drift across the team.
- Are annotated runs locked?
- Yes. Once a run is labeled, the verdict and labeler are immutable for audit. You can append new annotations but not overwrite the old ones.
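As referenced in the calibration answer above, the gate can be a few lines: collect human and judge labels keyed by run id, and only let the judge annotate solo once agreement clears the bar. The function name and the 200-run floor are illustrative.

```python
def judge_ready(human: dict[str, str], judge: dict[str, str],
                threshold: float = 0.80, min_runs: int = 200) -> bool:
    """Let the judge LLM annotate solo only after it matches human labels
    on enough calibration runs. Both dicts map run id -> label."""
    shared = human.keys() & judge.keys()
    if len(shared) < min_runs:  # "a few hundred runs" per the FAQ
        return False
    agree = sum(1 for rid in shared if human[rid] == judge[rid]) / len(shared)
    return agree >= threshold
```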