Use case

Incident response and on-call agents

When a pager fires at 3am, the first 10 minutes go to context-gathering. A Dezifi on-call agent does that work the moment the alert lands, so the human jumps straight to action.

What you'll learn
  • How to design a read-mostly observability agent
  • Which Datadog / Grafana / PagerDuty hooks matter
  • How to summarize incident context for fast handoff
  • How to stop the agent from taking destructive actions

The agent design

On-call agents must be fast, accurate and read-only by default. They produce context, not commands.
  1. LLM choice

    GPT-4o or Claude Sonnet — the agent needs to correlate metrics, logs and recent deploys. Latency matters; pick the fastest tier you trust.
  2. Tools

    Datadog (query metrics, fetch dashboard snapshots), Grafana (query Prometheus, render panels), PagerDuty (read incident state), GitHub (recent merges), Slack (post into the incident channel).
  3. Guardrails

    Read-only mode on all infra tools. No production action without explicit approval. Block agent-initiated restarts, rollbacks or deploys outright — those stay manual. A minimal gate sketch follows this list.
  4. Workflow shape

    Trigger: PagerDuty incident created. Step 1: pull error rate, latency and saturation for the affected service. Step 2: pull the last 10 merges to that service. Step 3: post a context summary into Slack with links to the relevant dashboards. Step 4: suggest the most likely cause. A runnable skeleton appears in the setup section below.
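
To make the read-only default mechanical rather than aspirational, put a gate in front of every tool call. A minimal sketch, assuming a simple in-process dispatcher; the tool and action names are illustrative, not Dezifi's actual policy API:

```python
# Allowlist gate in front of every tool call. Tool/action names are
# illustrative; anything not listed never reaches the tool.
from dataclasses import dataclass

READ_ONLY_ACTIONS = {
    ("datadog", "query_metrics"),
    ("grafana", "query_prometheus"),
    ("pagerduty", "get_incident"),
    ("github", "list_commits"),
    ("slack", "post_message"),  # the one intentional write: the summary post
}

@dataclass
class ToolCall:
    tool: str
    action: str
    args: dict

def gate(call: ToolCall) -> None:
    """Raise before dispatch, so restarts, rollbacks and deploys
    are blocked outright rather than logged after the fact."""
    if (call.tool, call.action) not in READ_ONLY_ACTIONS:
        raise PermissionError(
            f"blocked: {call.tool}.{call.action} requires manual approval"
        )
```

An allowlist beats a denylist here: a write action added to a tool later is blocked by default instead of slipping through.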

Tools to connect

  • Datadog — metrics, logs, dashboard snapshots scoped to a service.
  • Grafana — Prometheus / Loki queries, panel screenshots.
  • PagerDuty — incident state, on-call rotation context.
  • GitHub — recent commits and deploys for the affected service.
  • Slack — post into the auto-created incident channel.
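
As a concrete example of a scoped, read-only pull, here is a query against Datadog's v1 timeseries endpoint. A minimal sketch: the metric name is a placeholder for whatever your service actually emits, and the environment variables hold the read-only keys from the setup below.

```python
# Read-only error-rate pull from Datadog's v1 timeseries query API.
# The metric name is a placeholder; swap in what your service emits.
import os
import time

import requests

def fetch_error_rate(service: str, minutes: int = 30) -> list:
    now = int(time.time())
    resp = requests.get(
        "https://api.datadoghq.com/api/v1/query",
        headers={
            "DD-API-KEY": os.environ["DD_API_KEY"],
            "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
        },
        params={
            "from": now - minutes * 60,
            "to": now,
            "query": f"sum:trace.http.request.errors{{service:{service}}}.as_rate()",
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("series", [])
```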

How to set this up in Dezifi

  1. Connect observability tools

    Integrations → Datadog → API key + app key. For Grafana, paste the URL and a service-account token. Both should be read-only credentials.
  2. Connect PagerDuty

    Integrations → PagerDuty → OAuth. Subscribe to incident_created and incident_resolved events. If a webhook ever reaches your own endpoint instead of the managed integration, verify its signature first; see the sketch after this list.
  3. Create the agent

    New Agent → "On-Call Context Agent". Attach Datadog, Grafana, PagerDuty, GitHub, Slack. Use GPT-4o for speed.
  4. Lock down with policies

    Apply a policy with all tools in read-only mode. No write actions allowed without manual override. Add a cost cap per run — incidents are bursty.
  5. Build the incident-context workflow

    Trigger: PagerDuty incident_created webhook. Steps: fetch service metrics → fetch recent deploys → agent synthesizes a 5-line incident summary → post into the Slack incident channel. A skeleton of the whole run follows this list.
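
If a webhook reaches your own endpoint rather than the managed integration, verify PagerDuty's v3 signature before trusting the payload. A minimal check, assuming the X-PagerDuty-Signature header and the subscription's signing secret:

```python
# Constant-time check of a PagerDuty v3 webhook signature. The header
# carries one or more comma-separated "v1=<hex>" HMAC-SHA256 digests
# of the raw request body, keyed by the subscription's signing secret.
import hashlib
import hmac

def verify_pagerduty_signature(body: bytes, header: str, secret: str) -> bool:
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    candidates = [part.split("=", 1)[1] for part in header.split(",") if part.startswith("v1=")]
    return any(hmac.compare_digest(expected, c) for c in candidates)
```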
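And here is the skeleton of the whole run, sketched as a small Flask endpoint rather than Dezifi internals. It reuses verify_pagerduty_signature and fetch_error_rate from the sketches above; fetch_merges, summarize and the channel name are hypothetical stand-ins:

```python
# Webhook in, five-line summary out. fetch_merges and summarize are
# stand-ins for the GitHub and LLM steps; verify_pagerduty_signature
# and fetch_error_rate come from the earlier sketches.
import os

from flask import Flask, abort, request
from slack_sdk import WebClient

app = Flask(__name__)
slack = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

def fetch_merges(service: str) -> list:
    return []  # stand-in: e.g. GitHub GET /repos/{owner}/{repo}/commits

def summarize(incident: dict, metrics: list, merges: list) -> str:
    return f"{incident['title']}: context summary pending"  # stand-in for the LLM call

@app.post("/hooks/pagerduty")
def on_incident():
    if not verify_pagerduty_signature(
        request.get_data(),
        request.headers.get("X-PagerDuty-Signature", ""),
        os.environ["PD_WEBHOOK_SECRET"],
    ):
        abort(401)
    event = request.get_json()["event"]
    if event["event_type"] != "incident.triggered":  # PagerDuty v3 event name
        return "", 204
    incident = event["data"]
    service = incident["service"]["summary"]
    summary = summarize(incident, fetch_error_rate(service), fetch_merges(service))
    slack.chat_postMessage(channel="#incident-response", text=summary)  # placeholder channel
    return "", 204
```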

Frequently asked questions

Should the agent be allowed to take action during incidents?
Start with read-only. Once your team trusts the summaries, you can add narrow, approval-gated runbooks — like restarting a known-flaky job — but never destructive actions without a human.
How does it handle alert storms?
A rate-limit policy caps concurrent runs per service. The agent groups related alerts into a single context post instead of firing once per alert.
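
A minimal sketch of that grouping, assuming a single-process worker; the window length is a tunable, not a recommendation:

```python
# Buffer alerts per service and release one grouped post per window,
# so an alert storm becomes a single context summary.
import time
from collections import defaultdict

WINDOW_SECONDS = 120
_pending: dict[str, list] = defaultdict(list)
_last_post: dict[str, float] = {}

def should_post(service: str, alert: dict) -> bool:
    """Buffer the alert; return True only when the service's window
    has elapsed. The caller then posts and drains _pending[service]."""
    now = time.time()
    _pending[service].append(alert)
    if now - _last_post.get(service, 0.0) < WINDOW_SECONDS:
        return False  # storm in progress; keep buffering
    _last_post[service] = now
    return True
```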
Can it write postmortems?
Yes — run a separate workflow after PagerDuty marks the incident resolved. The agent stitches the timeline from PagerDuty, Slack, and GitHub, and drafts a postmortem template.
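
A sketch of the stitching step, assuming each source has already been fetched and normalized to (timestamp, source, text) tuples sorted by time:

```python
# Merge three pre-sorted event streams into one chronological timeline.
# The fetchers that produce the tuples are left as stand-ins; timestamps
# are datetime objects so isoformat() works.
from heapq import merge

def build_timeline(pd_events, slack_msgs, gh_commits) -> str:
    return "\n".join(
        f"{ts.isoformat()} [{source}] {text}"
        for ts, source, text in merge(pd_events, slack_msgs, gh_commits)
    )
```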
What if the LLM is wrong about the root cause?
Treat the summary as a hypothesis, not a verdict. The agent always links to the underlying dashboards so the on-call can verify in seconds.