Use case

Incident response and on-call agents

When a pager fires at 3am, the first 10 minutes go to context-gathering. A Dezifi on-call agent does that work the moment the alert lands, so the human jumps straight to action.

What you'll learn
  • How to design a read-mostly observability agent
  • Which Datadog / Grafana / PagerDuty hooks matter
  • How to summarize incident context for fast handoff
  • How to stop the agent from taking destructive actions

The agent design

On-call agents must be fast, accurate and read-only by default. They produce context, not commands.
  1. LLM choice

    GPT-4o or Claude Sonnet — the agent needs to correlate metrics, logs and recent deploys. Latency matters; pick the fastest tier you trust.
  2. Tools

    Datadog (query metrics, fetch dashboard snapshots), Grafana (query Prometheus, render panels), PagerDuty (read incident state), GitHub (recent merges), Slack (post into the incident channel).
  3. Guardrails

    Read-only mode on all infra tools. No production action without explicit approval. Block agent-initiated restarts, rollbacks or deploys outright — those stay manual. A minimal gate sketch follows this list.
  4. Workflow shape

    Trigger: PagerDuty incident created. Step 1: pull error rate, latency and saturation for the affected service. Step 2: pull the last 10 merges to that service. Step 3: post a context summary into Slack with links to the relevant dashboards. Step 4: suggest the most likely cause. A runnable skeleton appears in the setup section below.
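
To make the read-only default mechanical rather than aspirational, put a gate in front of every tool call. A minimal sketch, assuming a simple in-process dispatcher; the tool and action names are illustrative, not Dezifi's actual policy API:

```python
# Allowlist gate in front of every tool call. Tool/action names are
# illustrative; anything not listed never reaches the tool.
from dataclasses import dataclass

READ_ONLY_ACTIONS = {
    ("datadog", "query_metrics"),
    ("grafana", "query_prometheus"),
    ("pagerduty", "get_incident"),
    ("github", "list_commits"),
    ("slack", "post_message"),  # the one intentional write: the summary post
}

@dataclass
class ToolCall:
    tool: str
    action: str
    args: dict

def gate(call: ToolCall) -> None:
    """Raise before dispatch, so restarts, rollbacks and deploys
    are blocked outright rather than logged after the fact."""
    if (call.tool, call.action) not in READ_ONLY_ACTIONS:
        raise PermissionError(
            f"blocked: {call.tool}.{call.action} requires manual approval"
        )
```

An allowlist beats a denylist here: a write action added to a tool later is blocked by default instead of slipping through.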

Tools to connect

  • Datadog — metrics, logs, dashboard snapshots scoped to a service.
  • Grafana — Prometheus / Loki queries, panel screenshots.
  • PagerDuty — incident state, on-call rotation context.
  • GitHub — recent commits and deploys for the affected service.
  • Slack — post into the auto-created incident channel.
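
As a concrete example of a scoped, read-only pull, here is a query against Datadog's v1 timeseries endpoint. A minimal sketch: the metric name is a placeholder for whatever your service actually emits, and the environment variables hold the read-only keys from the setup below.

```python
# Read-only error-rate pull from Datadog's v1 timeseries query API.
# The metric name is a placeholder; swap in what your service emits.
import os
import time

import requests

def fetch_error_rate(service: str, minutes: int = 30) -> list:
    now = int(time.time())
    resp = requests.get(
        "https://api.datadoghq.com/api/v1/query",
        headers={
            "DD-API-KEY": os.environ["DD_API_KEY"],
            "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
        },
        params={
            "from": now - minutes * 60,
            "to": now,
            "query": f"sum:trace.http.request.errors{{service:{service}}}.as_rate()",
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("series", [])
```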

How to set this up in Dezifi

  1. Connect observability tools

    Integrations → Datadog → API key + app key. For Grafana, paste the URL and a service-account token. Both should be read-only credentials.
  2. Connect PagerDuty

    Integrations → PagerDuty → OAuth. Subscribe to incident_created and incident_resolved events. If a webhook ever reaches your own endpoint instead of the managed integration, verify its signature first; see the sketch after this list.
  3. Create the agent

    New Agent → "On-Call Context Agent". Attach Datadog, Grafana, PagerDuty, GitHub, Slack. Use GPT-4o for speed.
  4. Lock down with policies

    Apply a policy with all tools in read-only mode. No write actions allowed without manual override. Add a cost cap per run — incidents are bursty.
  5. Build the incident-context workflow

    Trigger: PagerDuty incident_created webhook. Steps: fetch service metrics → fetch recent deploys → agent synthesizes a 5-line incident summary → post into the Slack incident channel. A skeleton of the whole run follows this list.
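
If a webhook reaches your own endpoint rather than the managed integration, verify PagerDuty's v3 signature before trusting the payload. A minimal check, assuming the X-PagerDuty-Signature header and the subscription's signing secret:

```python
# Constant-time check of a PagerDuty v3 webhook signature. The header
# carries one or more comma-separated "v1=<hex>" HMAC-SHA256 digests
# of the raw request body, keyed by the subscription's signing secret.
import hashlib
import hmac

def verify_pagerduty_signature(body: bytes, header: str, secret: str) -> bool:
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    candidates = [part.split("=", 1)[1] for part in header.split(",") if part.startswith("v1=")]
    return any(hmac.compare_digest(expected, c) for c in candidates)
```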
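And here is the skeleton of the whole run, sketched as a small Flask endpoint rather than Dezifi internals. It reuses verify_pagerduty_signature and fetch_error_rate from the sketches above; fetch_merges, summarize and the channel name are hypothetical stand-ins:

```python
# Webhook in, five-line summary out. fetch_merges and summarize are
# stand-ins for the GitHub and LLM steps; verify_pagerduty_signature
# and fetch_error_rate come from the earlier sketches.
import os

from flask import Flask, abort, request
from slack_sdk import WebClient

app = Flask(__name__)
slack = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

def fetch_merges(service: str) -> list:
    return []  # stand-in: e.g. GitHub GET /repos/{owner}/{repo}/commits

def summarize(incident: dict, metrics: list, merges: list) -> str:
    return f"{incident['title']}: context summary pending"  # stand-in for the LLM call

@app.post("/hooks/pagerduty")
def on_incident():
    if not verify_pagerduty_signature(
        request.get_data(),
        request.headers.get("X-PagerDuty-Signature", ""),
        os.environ["PD_WEBHOOK_SECRET"],
    ):
        abort(401)
    event = request.get_json()["event"]
    if event["event_type"] != "incident.triggered":  # PagerDuty v3 event name
        return "", 204
    incident = event["data"]
    service = incident["service"]["summary"]
    summary = summarize(incident, fetch_error_rate(service), fetch_merges(service))
    slack.chat_postMessage(channel="#incident-response", text=summary)  # placeholder channel
    return "", 204
```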

Frequently asked questions

Should the agent be allowed to take action during incidents?
Start with read-only. Once your team trusts the summaries, you can add narrow, approval-gated runbooks — like restarting a known-flaky job — but never destructive actions without a human.
How does it handle alert storms?
A rate-limit policy caps concurrent runs per service. The agent groups related alerts into a single context post instead of firing once per alert.
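
A minimal sketch of that grouping, assuming a single-process worker; the window length is a tunable, not a recommendation:

```python
# Buffer alerts per service and release one grouped post per window,
# so an alert storm becomes a single context summary.
import time
from collections import defaultdict

WINDOW_SECONDS = 120
_pending: dict[str, list] = defaultdict(list)
_last_post: dict[str, float] = {}

def should_post(service: str, alert: dict) -> bool:
    """Buffer the alert; return True only when the service's window
    has elapsed. The caller then posts and drains _pending[service]."""
    now = time.time()
    _pending[service].append(alert)
    if now - _last_post.get(service, 0.0) < WINDOW_SECONDS:
        return False  # storm in progress; keep buffering
    _last_post[service] = now
    return True
```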
Can it write postmortems?
Yes — run a separate workflow after PagerDuty marks the incident resolved. The agent stitches the timeline from PagerDuty, Slack, and GitHub, and drafts a postmortem template.
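
A sketch of the stitching step, assuming each source has already been fetched and normalized to (timestamp, source, text) tuples sorted by time:

```python
# Merge three pre-sorted event streams into one chronological timeline.
# The fetchers that produce the tuples are left as stand-ins; timestamps
# are datetime objects so isoformat() works.
from heapq import merge

def build_timeline(pd_events, slack_msgs, gh_commits) -> str:
    return "\n".join(
        f"{ts.isoformat()} [{source}] {text}"
        for ts, source, text in merge(pd_events, slack_msgs, gh_commits)
    )
```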
What if the LLM is wrong about the root cause?
Treat the summary as a hypothesis, not a verdict. The agent always links to the underlying dashboards so the on-call can verify in seconds.