Eval mode

Drift detection

Models, prompts, and the world all change. Drift detection scores a continuous sample of production traffic and alerts when the quality curve bends.

What you'll learn
  • How drift detection samples production traffic
  • How to calibrate the judge LLM
  • How drift alerts fire and route
  • How to root-cause a drift event

How sampling works

  1. Pick the agent

    Drift detection runs per agent. Enable it from the agent settings; the platform will start sampling immediately.
  2. Configure the rate

    Pick a percentage of runs to sample, or a fixed budget per day. Sampling is uniform and does not affect production latency.
  3. Pick the metrics

    Each sampled run is scored on one or more metrics — success, faithfulness, tone, custom rubric. Scores roll up into time-series charts.
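The sampling described above can be sketched in a few lines. This is illustrative, not the platform's implementation: `rate_pct` and `DailyBudgetSampler` are hypothetical names standing in for the rate and per-day budget settings. The per-run decision is independent, which is why it adds no production latency.

```python
import random

def should_sample(rate_pct: float) -> bool:
    """Uniformly sample a percentage of production runs.

    Each run decides independently, so there is no coordination
    on the hot path and no added latency.
    """
    return random.random() < rate_pct / 100.0

class DailyBudgetSampler:
    """Alternative: sample a fixed number of runs per day.

    A real implementation would reset `used` at day boundaries;
    that bookkeeping is omitted here.
    """
    def __init__(self, budget: int):
        self.budget = budget
        self.used = 0

    def should_sample(self) -> bool:
        if self.used < self.budget:
            self.used += 1
            return True
        return False
```

A fixed budget trades statistical resolution for predictable judge-LLM spend; a percentage rate tracks traffic volume instead.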

Calibrate the judge

Drift detection scores runs with a judge LLM. Calibration is not optional: label a few hundred runs by hand, then run the judge against the same set and confirm agreement above your threshold (80 percent is a common floor). Re-calibrate whenever you change models, prompts, or judge versions.
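The agreement check is a straight label-for-label comparison. A minimal sketch, assuming parallel lists of human and judge labels (the function names are illustrative):

```python
def agreement_rate(human_labels: list, judge_labels: list) -> float:
    """Fraction of runs where the judge LLM matches the human label."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label sets must cover the same runs")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)

def judge_is_calibrated(human_labels, judge_labels, floor: float = 0.80) -> bool:
    """True when agreement clears the floor (80 percent is a common choice)."""
    return agreement_rate(human_labels, judge_labels) >= floor
```

Raw agreement is the simplest acceptable check; if your labels are heavily skewed toward one class, a chance-corrected statistic gives a less flattering but more honest picture.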

Alerts and routing

  1. Threshold

    Set a floor (absolute, e.g. success rate below 90 percent) or a delta (e.g. a 5-point drop within 24 hours). Both can fire independently.
  2. Channel

    Route alerts to Slack, email, PagerDuty, or webhook. Each alert links straight to the failing sample and the relevant trace.
  3. Suppression

    Mute during planned changes (a known prompt rollout, a model swap) so the team only sees real drift.

Investigating a drift

When an alert fires, open the drift report. It shows the time-series curve, the breakpoint, and a sample of failing runs from the affected window. From there: open a run, send it into an annotation queue, or fork an A/B test with a candidate fix.

Frequently asked questions

Does drift detection cost LLM tokens?
Yes — the judge LLM consumes tokens per sampled run. Set a daily sample budget if you want a predictable spend.
How fast will I know about a drift?
Within minutes for sharp drops, depending on sample size. Slow drifts take longer to confirm — that is a statistical floor, not a platform limit.
Can I run drift detection without a judge LLM?
For some metrics yes — anything rule-based (status code, tool match, output schema) works without a judge. Quality rubrics need a judge.
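Rule-based metrics like the ones named above are ordinary deterministic checks. A minimal sketch, assuming a hypothetical run record shaped as a dict (field names are illustrative, not the platform's schema):

```python
import json

def rule_based_checks(run: dict) -> dict:
    """Judge-free checks: status code, tool match, output schema."""
    results = {}
    # Status code: a plain equality check.
    results["status_ok"] = run.get("status_code") == 200
    # Tool match: did the agent call the tool the test expected?
    results["tool_match"] = run.get("tool_called") == run.get("expected_tool")
    # Output schema: output must parse as JSON with an "answer" key.
    try:
        out = json.loads(run.get("output", ""))
        results["schema_ok"] = isinstance(out, dict) and "answer" in out
    except (json.JSONDecodeError, TypeError):
        results["schema_ok"] = False
    return results
```

Because these checks cost no tokens, they can run on every production run rather than a sample; only the quality rubrics need the judge's sampled budget.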