Eval mode

Drift detection

Models, prompts, and the world all change. Drift detection scores a continuous sample of production traffic and alerts when the quality curve bends.

What you'll learn
  • How drift detection samples production traffic
  • How to calibrate the judge LLM
  • How drift alerts fire and route
  • How to root-cause a drift event

How sampling works

  1. Pick the agent

    Drift detection runs per agent. Enable it from the agent settings; the platform will start sampling immediately.
  2. Configure the rate

    Pick a percentage of runs to sample, or a fixed budget per day. Sampling is uniform and does not affect production latency.
  3. Pick the metrics

    Each sampled run is scored on one or more metrics — success, faithfulness, tone, custom rubric. Scores roll up into time-series charts.
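The sampling described above can be sketched in a few lines. This is illustrative, not the platform's implementation: `rate_pct` and `DailyBudgetSampler` are hypothetical names standing in for the rate and per-day budget settings. The per-run decision is independent, which is why it adds no production latency.

```python
import random

def should_sample(rate_pct: float) -> bool:
    """Uniformly sample a percentage of production runs.

    Each run decides independently, so there is no coordination
    on the hot path and no added latency.
    """
    return random.random() < rate_pct / 100.0

class DailyBudgetSampler:
    """Alternative: sample a fixed number of runs per day.

    A real implementation would reset `used` at day boundaries;
    that bookkeeping is omitted here.
    """
    def __init__(self, budget: int):
        self.budget = budget
        self.used = 0

    def should_sample(self) -> bool:
        if self.used < self.budget:
            self.used += 1
            return True
        return False
```

A fixed budget trades statistical resolution for predictable judge-LLM spend; a percentage rate tracks traffic volume instead.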

Calibrate the judge

Drift detection scores runs with a judge LLM. Calibration is not optional: label a few hundred runs by hand, then run the judge against the same set and confirm agreement above your threshold (80 percent is a common floor). Re-calibrate whenever you change models, prompts, or judge versions.
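The agreement check is a straight label-for-label comparison. A minimal sketch, assuming parallel lists of human and judge labels (the function names are illustrative):

```python
def agreement_rate(human_labels: list, judge_labels: list) -> float:
    """Fraction of runs where the judge LLM matches the human label."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label sets must cover the same runs")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)

def judge_is_calibrated(human_labels, judge_labels, floor: float = 0.80) -> bool:
    """True when agreement clears the floor (80 percent is a common choice)."""
    return agreement_rate(human_labels, judge_labels) >= floor
```

Raw agreement is the simplest acceptable check; if your labels are heavily skewed toward one class, a chance-corrected statistic gives a less flattering but more honest picture.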

Alerts and routing

  1. Threshold

    Set a floor (absolute, e.g. success rate below 90 percent) or a delta (e.g. a 5-point drop within 24 hours). Both can fire independently.
  2. Channel

    Route alerts to Slack, email, PagerDuty, or webhook. Each alert links straight to the failing sample and the relevant trace.
  3. Suppression

    Mute during planned changes (a known prompt rollout, a model swap) so the team only sees real drift.

Investigating a drift

When an alert fires, open the drift report. It shows the time-series curve, the breakpoint, and a sample of failing runs from the affected window. From there: open a run, send it into an annotation queue, or fork an A/B test with a candidate fix.

Frequently asked questions

Does drift detection cost LLM tokens?
Yes — the judge LLM consumes tokens per sampled run. Set a daily sample budget if you want a predictable spend.
How fast will I know about a drift?
Within minutes for sharp drops, depending on sample size. Slow drifts take longer to confirm — that is a statistical floor, not a platform limit.
Can I run drift detection without a judge LLM?
For some metrics yes — anything rule-based (status code, tool match, output schema) works without a judge. Quality rubrics need a judge.
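Rule-based metrics like the ones named above are ordinary deterministic checks. A minimal sketch, assuming a hypothetical run record shaped as a dict (field names are illustrative, not the platform's schema):

```python
import json

def rule_based_checks(run: dict) -> dict:
    """Judge-free checks: status code, tool match, output schema."""
    results = {}
    # Status code: a plain equality check.
    results["status_ok"] = run.get("status_code") == 200
    # Tool match: did the agent call the tool the test expected?
    results["tool_match"] = run.get("tool_called") == run.get("expected_tool")
    # Output schema: output must parse as JSON with an "answer" key.
    try:
        out = json.loads(run.get("output", ""))
        results["schema_ok"] = isinstance(out, dict) and "answer" in out
    except (json.JSONDecodeError, TypeError):
        results["schema_ok"] = False
    return results
```

Because these checks cost no tokens, they can run on every production run rather than a sample; only the quality rubrics need the judge's sampled budget.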