Eval mode
Drift detection
Models, prompts, and the world all change. Drift detection scores a continuous sample of production traffic and alerts when the quality curve bends.
What you'll learn
- How drift detection samples production traffic
- How to calibrate the judge LLM
- How drift alerts fire and route
- How to root-cause a drift event
How sampling works
- 1. Pick the agent
  Drift detection runs per agent. Enable it from the agent settings; the platform will start sampling immediately.
- 2. Configure the rate
  Pick a percentage of runs to sample, or a fixed budget per day. Sampling is uniform and does not affect production latency.
- 3. Pick the metrics
  Each sampled run is scored on one or more metrics: success, faithfulness, tone, custom rubric. Scores roll up into time-series charts.
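The rate and budget settings described above can be pictured with a small sketch. This is an illustration only, not the platform's implementation: `SamplerConfig`, `UniformSampler`, and their fields are hypothetical names, and the budget handling is deliberately simplified.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class SamplerConfig:
    """Hypothetical sampling settings; the field names are illustrative."""
    rate: Optional[float] = None        # fraction of runs to sample, e.g. 0.05
    daily_budget: Optional[int] = None  # maximum sampled runs per day


class UniformSampler:
    """Decides, independently per run, whether the run is sampled."""

    def __init__(self, config: SamplerConfig, seed: Optional[int] = None):
        if config.rate is None and config.daily_budget is None:
            raise ValueError("set rate, daily_budget, or both")
        if config.rate is not None and not 0.0 <= config.rate <= 1.0:
            raise ValueError("rate must be between 0 and 1")
        self.config = config
        self._rng = random.Random(seed)
        self._sampled_today = 0

    def reset_day(self) -> None:
        """Call at the day boundary to restore the budget."""
        self._sampled_today = 0

    def should_sample(self) -> bool:
        budget = self.config.daily_budget
        if budget is not None and self._sampled_today >= budget:
            return False  # budget exhausted for today
        # With no rate configured, every run is eligible until the budget
        # runs out. A real system would more likely spread the budget over
        # the day (for example with reservoir sampling) to stay uniform.
        rate = self.config.rate if self.config.rate is not None else 1.0
        if self._rng.random() >= rate:
            return False
        self._sampled_today += 1
        return True
```

The decision is a counter check and a single random draw, which is why sampling of this kind adds no meaningful work to the request path; scoring happens later, on the sampled copy.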
Calibrate the judge
Drift detection scores runs with a judge LLM, so the scores are only as trustworthy as the judge. Calibration is required: label a few hundred runs by hand, run the judge against the same set, and confirm that the judge agrees with the human labels above your threshold (80 percent is a common floor). Re-calibrate whenever you change models, prompts, or judge versions.
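As a rough sketch of the calibration check, the snippet below compares hand labels with judge labels for a binary metric such as success. The function names are hypothetical. Cohen's kappa is an addition not mentioned above; it is included because raw agreement can look high by chance when almost every run has the same label.

```python
from typing import Sequence


def agreement_rate(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Fraction of runs where the judge label matches the human label."""
    if len(human) != len(judge):
        raise ValueError("label lists must have the same length")
    if not human:
        raise ValueError("need at least one labeled run")
    matches = sum(h == j for h, j in zip(human, judge))
    return matches / len(human)


def cohens_kappa(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    observed = agreement_rate(human, judge)
    p_human = sum(human) / n
    p_judge = sum(judge) / n
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1 - expected)


def is_calibrated(human: Sequence[bool], judge: Sequence[bool],
                  threshold: float = 0.80) -> bool:
    """True when raw agreement meets the chosen floor (80 percent here)."""
    return agreement_rate(human, judge) >= threshold
```

In practice the two lists would hold a few hundred labels each, and the check would be rerun after any change to the model, the prompt, or the judge version.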
Alerts and routing
- 1. Threshold
  Set a floor (absolute, e.g. success rate below 90 percent) or a delta (e.g. a 5-point drop within 24 hours). Both can fire independently.
- 2. Channel
  Route alerts to Slack, email, PagerDuty, or webhook. Each alert links straight to the failing sample and the relevant trace.
- 3. Suppression
  Mute during planned changes (a known prompt rollout, a model swap) so the team only sees real drift.
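The floor, delta, and suppression rules above could be evaluated along these lines. This is a sketch under assumptions: `AlertRule`, `evaluate`, and `muted_until` are hypothetical names, scores are fractions between 0 and 1, and the delta is measured from the highest score inside the window to the latest score.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass
class AlertRule:
    """Hypothetical alert settings; either check can be left unset."""
    floor: Optional[float] = None   # absolute floor, e.g. 0.90
    delta: Optional[float] = None   # drop size, e.g. 0.05 for 5 points
    window: timedelta = timedelta(hours=24)


def evaluate(rule: AlertRule,
             series: List[Tuple[datetime, float]],
             muted_until: Optional[datetime] = None) -> List[str]:
    """Return the checks that fire for the latest point.

    `series` holds (timestamp, score) pairs sorted by time.
    """
    if not series:
        return []
    now, current = series[-1]
    # Suppression: stay silent while a planned change is in progress.
    if muted_until is not None and now < muted_until:
        return []
    fired = []
    if rule.floor is not None and current < rule.floor:
        fired.append("floor")
    if rule.delta is not None:
        recent = [score for ts, score in series if now - ts <= rule.window]
        if max(recent) - current >= rule.delta:
            fired.append("delta")
    return fired
```

The two checks are evaluated separately, so a single bad point can fire the floor alert, the delta alert, or both.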
Investigating a drift
When an alert fires, open the drift report. It shows the time-series curve, the breakpoint, and a sample of failing runs from the affected window. From there: open a run, send it into an annotation queue, or fork an A/B test with a candidate fix.
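To make the idea of a breakpoint concrete, here is a minimal single change-point sketch. It is not a description of how the drift report computes its breakpoint, which is not documented here. It assumes a series of per-interval scores and picks the split that best separates the series into two flat segments, by least squares. `find_breakpoint` and `min_segment` are hypothetical names.

```python
from typing import Sequence


def _sse(values: Sequence[float]) -> float:
    """Sum of squared deviations from the segment mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def find_breakpoint(scores: Sequence[float], min_segment: int = 2) -> int:
    """Index where the second segment starts.

    Tries every split and keeps the one that minimizes the total squared
    error of a two-level, piecewise-constant fit.
    """
    n = len(scores)
    if n < 2 * min_segment:
        raise ValueError("series too short for the requested segment size")
    best_index = min_segment
    best_cost = float("inf")
    for k in range(min_segment, n - min_segment + 1):
        cost = _sse(scores[:k]) + _sse(scores[k:])
        if cost < best_cost:
            best_index, best_cost = k, cost
    return best_index


# Usage: daily success rates with a drop starting on the fifth day.
daily = [0.95, 0.94, 0.96, 0.95, 0.85, 0.84, 0.86, 0.85]
print(find_breakpoint(daily))  # 4
```

Runs sampled at or after the returned index belong to the affected window, which is where failing examples are most useful to inspect.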
Frequently asked questions
- Does drift detection cost LLM tokens?
  Yes. The judge LLM consumes tokens per sampled run. Set a daily sample budget if you want a predictable spend.
- How fast will I know about a drift?
  Within minutes for sharp drops, depending on sample size. Slow drifts take longer to confirm; that is a statistical floor, not a platform limit.
- Can I run drift detection without a judge LLM?
  For some metrics, yes. Anything rule-based (status code, tool match, output schema) works without a judge. Quality rubrics need a judge.
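The "statistical floor" mentioned in the FAQ can be illustrated with a standard sample-size approximation. The sketch assumes a binary success metric, a known baseline rate, and a one-sided z-test; the platform's actual detection method is not specified here, and `runs_needed` is a hypothetical helper.

```python
import math
from statistics import NormalDist


def runs_needed(baseline: float, drop: float,
                alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sampled runs needed to detect a drop in a success rate.

    Uses the normal approximation for a one-sided, one-sample test of a
    proportion against a known baseline.
    """
    if not 0 < drop < baseline < 1:
        raise ValueError("need 0 < drop < baseline < 1")
    degraded = baseline - drop
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # false-alarm tolerance
    z_beta = NormalDist().inv_cdf(power)       # chance of catching the drop
    numerator = (z_alpha * math.sqrt(baseline * (1 - baseline))
                 + z_beta * math.sqrt(degraded * (1 - degraded)))
    return math.ceil((numerator / drop) ** 2)


# Usage: compare a sharp drop with a slow one from a 90 percent baseline.
print(runs_needed(0.90, 0.05))
print(runs_needed(0.90, 0.01))
```

Under these assumptions a 5-point drop needs a few hundred sampled runs, while a 1-point drop needs several thousand. At a fixed sampling rate, that difference is what makes slow drifts take longer to confirm.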