Use case
Incident response and on-call agents
When a pager fires at 3am, the first 10 minutes go to context-gathering. A Dezifi on-call agent does that work the moment the alert lands, so the human jumps straight to action.
What you'll learn
- How to design a read-mostly observability agent
- Which Datadog / Grafana / PagerDuty hooks matter
- How to summarize incident context for fast handoff
- How to stop the agent from taking destructive actions
The agent design
On-call agents must be fast, accurate and read-only by default. They produce context, not commands.
- 1
LLM choice
GPT-4o or Claude Sonnet. The agent needs to correlate metrics, logs and recent deploys. Latency matters; pick the fastest tier you trust.
- 2
Tools
Datadog (query metrics, fetch dashboard snapshots), Grafana (query Prometheus, render panels), PagerDuty (read incident state), GitHub (recent merges), Slack (post into the incident channel).
- 3
Guardrails
Read-only mode on all infra tools. No production action without explicit approval. Block agent-initiated restarts, rollbacks or deploys outright; those stay manual.
- 4
Workflow shape
Trigger: PagerDuty incident created. Step 1: pull error rate, latency, saturation for the affected service. Step 2: pull last 10 merges to that service. Step 3: post a context summary into Slack with links to the relevant dashboards. Step 4: suggest the most likely cause.
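The guardrail rules above can be sketched as a small gate that every proposed tool call passes through. This is a minimal illustration, not Dezifi's policy engine: the action names, the `ToolCall` type and the `decide` function are all hypothetical, and a real policy would be configured in the product rather than hand-coded.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical action names, chosen for illustration only.
READ_ACTIONS = {"query_metrics", "fetch_dashboard", "read_incident", "list_merges"}
# Posting the summary is the one write the agent needs; Slack is not an infra tool.
ALLOWED_WRITES = {("slack", "post_message")}
# Blocked outright: never agent-initiated, even if someone approves.
BLOCKED_ACTIONS = {"restart", "rollback", "deploy"}


@dataclass(frozen=True)
class ToolCall:
    tool: str
    action: str


def decide(call: ToolCall, approved_by: Optional[str] = None) -> str:
    """Return 'allow', 'deny' or 'needs_approval' for a proposed tool call."""
    if call.action in BLOCKED_ACTIONS:
        return "deny"
    if call.action in READ_ACTIONS or (call.tool, call.action) in ALLOWED_WRITES:
        return "allow"
    # Anything else is treated as a write and needs an explicit human approval.
    return "allow" if approved_by else "needs_approval"
```

Denying by default for unknown actions keeps the agent read-only even when a new tool is attached before its policy is written.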
Tools to connect
- Datadog — metrics, logs, dashboard snapshots scoped to a service.
- Grafana — Prometheus / Loki queries, panel screenshots.
- PagerDuty — incident state, on-call rotation context.
- GitHub — recent commits and deploys for the affected service.
- Slack — post into the auto-created incident channel.
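To make the read-only scope of these tools concrete, here is a rough sketch of the two GET requests behind "metrics for a service" and "recent merges". It only builds the requests and does not send them. The helper names, the metric query string and the repository names are assumptions for illustration; the Datadog host also depends on your Datadog site, and Dezifi's connectors may call these APIs differently.

```python
from urllib.parse import urlencode
from urllib.request import Request


def datadog_metrics_request(query, start, end, api_key, app_key):
    """Build a GET against Datadog's v1 timeseries query endpoint (epoch seconds)."""
    params = urlencode({"from": start, "to": end, "query": query})
    return Request(
        f"https://api.datadoghq.com/api/v1/query?{params}",
        headers={"DD-API-KEY": api_key, "DD-APPLICATION-KEY": app_key},
        method="GET",
    )


def github_recent_prs_request(owner, repo, token, limit=10):
    """Build a GET for the most recently updated closed pull requests."""
    params = urlencode(
        {"state": "closed", "sort": "updated", "direction": "desc", "per_page": limit}
    )
    return Request(
        f"https://api.github.com/repos/{owner}/{repo}/pulls?{params}",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        method="GET",
    )


def merged_only(pulls):
    """Closed PRs include unmerged ones; keep those with a merge timestamp."""
    return [p for p in pulls if p.get("merged_at")]
```

Both requests are plain GETs, so credentials that cannot write are enough to serve them.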
How to set this up in Dezifi
- 1
Connect observability tools
Integrations → Datadog → API key + app key. For Grafana, paste the URL and a service-account token. Both should be read-only credentials.
- 2
Connect PagerDuty
Integrations → PagerDuty → OAuth. Subscribe to incident_created and incident_resolved events.
- 3
Create the agent
New Agent → "On-Call Context Agent". Attach Datadog, Grafana, PagerDuty, GitHub, Slack. Use GPT-4o for speed.
- 4
Lock down with policies
Apply a policy with all tools in read-only mode. No write actions allowed without manual override. Add a cost cap per run, since incidents are bursty.
- 5
Build the incident-context workflow
Trigger: PagerDuty incident_created webhook. Steps: fetch service metrics → fetch recent deploys → agent synthesizes a 5-line incident summary → post into Slack incident channel.
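The workflow in the last step can be sketched as a plain function with each step injected as a callable. This is an assumption-heavy outline rather than how Dezifi executes workflows: the incident payload shape, the function names and the stub data are all hypothetical, and `summarize` stands in for the LLM call.

```python
from typing import Callable, Dict, List

MAX_SUMMARY_LINES = 5  # the "5-line incident summary" budget


def run_incident_context(
    incident: Dict[str, str],
    fetch_metrics: Callable[[str], Dict[str, float]],
    fetch_deploys: Callable[[str], List[str]],
    summarize: Callable[[Dict[str, str], Dict[str, float], List[str]], str],
    post_to_slack: Callable[[str, str], None],
) -> str:
    """Fetch metrics, fetch deploys, summarize, then post into the incident channel."""
    service = incident["service"]
    metrics = fetch_metrics(service)
    deploys = fetch_deploys(service)
    draft = summarize(incident, metrics, deploys)
    # Enforce the line budget even if the model returns more.
    lines = [line for line in draft.splitlines() if line.strip()]
    summary = "\n".join(lines[:MAX_SUMMARY_LINES])
    post_to_slack(incident["channel"], summary)
    return summary


# Usage with stubs in place of real integrations (all values are made up).
posted = []
summary = run_incident_context(
    {"id": "INC-1", "service": "checkout", "channel": "#inc-checkout"},
    fetch_metrics=lambda service: {"error_rate": 0.12, "p95_latency_ms": 840.0},
    fetch_deploys=lambda service: ["example merge title"],
    summarize=lambda inc, m, d: "\n".join(f"line {i}" for i in range(1, 8)),
    post_to_slack=lambda channel, text: posted.append((channel, text)),
)
```

Passing the steps in as callables means the same function can be exercised with stubs before any credentials are connected.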
Frequently asked questions
- Should the agent be allowed to take action during incidents?
- Start with read-only. Once your team trusts the summaries, you can add narrow, approval-gated runbooks, such as restarting a known-flaky job. Never let the agent take a destructive action without a human approving it.
- How does it handle alert storms?
- A rate-limit policy caps concurrent runs per service. The agent groups related alerts into a single context post instead of firing once per alert.
- Can it write postmortems?
- Yes. Run a separate workflow after PagerDuty marks the incident resolved. The agent stitches together the timeline from PagerDuty, Slack, and GitHub, and fills in a draft postmortem template.
- What if the LLM is wrong about the root cause?
- Treat the summary as a hypothesis, not a verdict. The agent always links to the underlying dashboards so the on-call can verify in seconds.
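The alert-storm answer above describes two mechanisms: a cap on concurrent runs per service and grouping related alerts into one post. A minimal sketch of both follows. The alert shape, the 300-second window and the class and function names are assumptions for illustration, not Dezifi's actual rate-limit policy.

```python
from collections import defaultdict


def group_alerts(alerts, window_seconds=300):
    """Group alerts per service; alerts within the window of a group's first alert share one post."""
    groups = []
    open_group = {}  # service -> most recent group for that service
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        group = open_group.get(alert["service"])
        if group is None or alert["ts"] - group["started"] > window_seconds:
            group = {"service": alert["service"], "started": alert["ts"], "alerts": []}
            open_group[alert["service"]] = group
            groups.append(group)
        group["alerts"].append(alert["name"])
    return groups


class RunLimiter:
    """Caps how many agent runs may be active for one service at a time."""

    def __init__(self, max_per_service=1):
        self.max_per_service = max_per_service
        self.active = defaultdict(int)

    def try_start(self, service):
        if self.active[service] >= self.max_per_service:
            return False  # caller should fold this alert into the existing run
        self.active[service] += 1
        return True

    def finish(self, service):
        self.active[service] = max(0, self.active[service] - 1)
```

With a cap of one run per service, a burst of alerts for the same service produces a single context post that lists every alert in the group.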