# Incident Response Automation
The Production tier includes automated incident response tooling that reduces mean time to resolution (MTTR) for AI governance incidents. This includes automated triage, rollback automation, structured alert routing, and incident record generation that feeds into post-incident review processes.
For the normative incident response requirements, see PRD-STD-010: AI Product Safety & Trust and the Incident Response governance guide.
## Automated Triage Scripts
The triage system classifies incidents by type and severity, then routes them to the appropriate response automation:
### Triage Script
```bash
#!/usr/bin/env bash
# scripts/triage.sh
set -euo pipefail

SEVERITY="${1:?Usage: triage.sh <P1|P2|P3|P4> <incident-type>}"
INCIDENT_TYPE="${2:-unknown}"

echo "Triaging incident: severity=$SEVERITY type=$INCIDENT_TYPE"

# Dispatch type-specific response automation
case "$INCIDENT_TYPE" in
  drift)
    echo "Configuration drift detected"
    ./scripts/assess-drift-impact.sh
    ;;
  security)
    echo "Security incident detected"
    ./scripts/isolate-affected-services.sh
    ;;
  agent-violation)
    echo "Agent trust boundary violation"
    ./scripts/suspend-agent.sh
    ;;
  data-breach)
    echo "Potential data breach"
    ./scripts/activate-breach-protocol.sh
    ;;
  *)
    echo "Unknown incident type, routing to on-call"
    ;;
esac

# Create incident record
./scripts/create-incident.sh "$SEVERITY" "$INCIDENT_TYPE"

# Route based on severity
case "$SEVERITY" in
  P1)
    echo "P1: Initiating immediate rollback and paging on-call"
    ./scripts/rollback.sh --to-last-known-good
    ./scripts/page-oncall.sh --severity P1
    ;;
  P2)
    echo "P2: Alerting on-call team"
    ./scripts/page-oncall.sh --severity P2
    ;;
  P3)
    echo "P3: Creating ticket for next business day"
    ./scripts/create-ticket.sh --priority high
    ;;
  P4)
    echo "P4: Logging for review"
    ./scripts/log-incident.sh
    ;;
esac
```
### Incident Type Classification
| Type | Description | Auto-Response |
|---|---|---|
| `drift` | Configuration has deviated from baseline | Impact assessment, baseline comparison report |
| `security` | SAST/SCA finding in production, vulnerability detected | Service isolation, emergency patch workflow |
| `agent-violation` | Agent exceeded trust boundary or permission scope | Agent suspension, audit log extraction |
| `data-breach` | PII exposure or unauthorized data access | Breach protocol activation, regulatory notification prep |
| `quality-degradation` | Coverage/mutation score below threshold | Alert and investigation assignment |
| `deployment-failure` | Production deployment failed health checks | Automatic rollback to last known good |
## Rollback Automation
The rollback script reverts the deployment to the last known good state:
```bash
#!/usr/bin/env bash
# scripts/rollback.sh
set -euo pipefail

TARGET="${1:---to-last-known-good}"
DEPLOYMENT_LOG="deployments/history.json"

case "$TARGET" in
  --to-last-known-good)
    # History is ordered newest-first, so the first healthy entry
    # is the most recent known-good deployment.
    ROLLBACK_SHA=$(jq -r '.deployments[] | select(.status == "healthy") | .commit' \
      "$DEPLOYMENT_LOG" | head -1)
    ;;
  --to-commit)
    ROLLBACK_SHA="${2:?Usage: rollback.sh --to-commit <sha>}"
    ;;
  *)
    echo "ERROR: Unknown rollback target: $TARGET" >&2
    exit 1
    ;;
esac

echo "Rolling back to commit: $ROLLBACK_SHA"

# Verify the target commit passed all quality gates
PROVENANCE="provenance/${ROLLBACK_SHA}.json"
if [ ! -f "$PROVENANCE" ]; then
  echo "ERROR: No provenance record for $ROLLBACK_SHA"
  exit 1
fi

ALL_PASSED=$(jq -r '.stages | to_entries | all(.value.status == "pass")' "$PROVENANCE")
if [ "$ALL_PASSED" != "true" ]; then
  echo "ERROR: Target commit did not pass all quality gates"
  exit 1
fi

# Execute rollback
git checkout "$ROLLBACK_SHA"
docker compose build
docker compose up -d --force-recreate

# Verify health
sleep 10
HEALTH=$(curl -sf http://localhost:8080/health | jq -r '.status')
if [ "$HEALTH" != "ok" ]; then
  echo "ERROR: Rollback health check failed"
  exit 1
fi
echo "Rollback successful to $ROLLBACK_SHA"

# Record rollback event
jq --arg sha "$ROLLBACK_SHA" --arg ts "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  '.rollbacks += [{"commit": $sha, "timestamp": $ts}]' \
  "$DEPLOYMENT_LOG" > tmp.json && mv tmp.json "$DEPLOYMENT_LOG"
```
## Alert Routing Configuration
Alerts are routed based on severity and incident type:
| Severity | Response Time | Notification Channel | Escalation |
|---|---|---|---|
| P1 | Immediate | PagerDuty + Slack #aeef-incidents + Phone | VP Engineering within 15 min |
| P2 | 30 minutes | PagerDuty + Slack #aeef-incidents | Team lead within 1 hour |
| P3 | Next business day | Slack #aeef-warnings | Sprint backlog |
| P4 | Best effort | Slack #aeef-info | Monthly review |
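As a sketch, the severity-to-channel mapping above could live in a small helper sourced by the notification scripts. The file path and function name here are hypothetical; the channel names follow the Slack channel configuration:

```shell
# scripts/lib/routing.sh -- hypothetical helper, path and name are illustrative.
# Maps an AEEF severity (P1-P4) to the Slack channel used for notifications.
severity_to_channel() {
  case "$1" in
    P1|P2) echo "#aeef-incidents" ;;
    P3)    echo "#aeef-warnings" ;;
    P4)    echo "#aeef-info" ;;
    *)     echo "ERROR: unknown severity: $1" >&2; return 1 ;;
  esac
}
```

For example, `severity_to_channel P3` prints `#aeef-warnings`.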
### Slack Integration

Channel mappings and message templates are defined in a JSON configuration:
```json
{
  "channels": {
    "critical": "#aeef-incidents",
    "warning": "#aeef-warnings",
    "info": "#aeef-info"
  },
  "templates": {
    "incident": {
      "blocks": [
        {
          "type": "header",
          "text": "AEEF Incident: {{severity}} - {{type}}"
        },
        {
          "type": "section",
          "text": "**Description:** {{description}}\n**Detected:** {{timestamp}}\n**Auto-response:** {{action_taken}}"
        }
      ]
    }
  }
}
```
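A minimal posting sketch for Slack incoming webhooks, assuming the webhook URL is exported as `SLACK_WEBHOOK_URL` (that variable and both function names are illustrative, not part of the configuration above):

```shell
# Hypothetical helpers; SLACK_WEBHOOK_URL is assumed to hold an incoming-webhook URL.
# Render a simple plain-text incident payload for a Slack incoming webhook.
slack_payload() {
  printf '{"text": "AEEF Incident: %s - %s: %s"}' "$1" "$2" "$3"
}

# POST the rendered payload to the webhook.
notify_slack() {
  slack_payload "$1" "$2" "$3" |
    curl -sf -X POST -H 'Content-Type: application/json' -d @- "$SLACK_WEBHOOK_URL"
}
```

Usage: `notify_slack P2 drift "Runtime config deviated from baseline"`.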
## Incident Record Schema
Every incident generates a structured record for post-incident review:
```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "required": ["incidentId", "severity", "type", "detectedAt", "status"],
  "properties": {
    "incidentId": { "type": "string", "format": "uuid" },
    "severity": { "enum": ["P1", "P2", "P3", "P4"] },
    "type": { "type": "string" },
    "description": { "type": "string" },
    "detectedAt": { "type": "string", "format": "date-time" },
    "resolvedAt": { "type": "string", "format": "date-time" },
    "status": { "enum": ["open", "investigating", "mitigated", "resolved", "closed"] },
    "affectedServices": {
      "type": "array",
      "items": { "type": "string" }
    },
    "rootCause": { "type": "string" },
    "actionsTaken": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "timestamp": { "type": "string", "format": "date-time" },
          "action": { "type": "string" },
          "automated": { "type": "boolean" },
          "actor": { "type": "string" }
        }
      }
    },
    "postMortem": {
      "type": "object",
      "properties": {
        "timeline": { "type": "string" },
        "rootCauseAnalysis": { "type": "string" },
        "actionItems": {
          "type": "array",
          "items": { "type": "string" }
        }
      }
    }
  }
}
```
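For illustration, a record conforming to this schema might look like the following (all values are invented):

```json
{
  "incidentId": "3f6b2a1e-8c0d-4f5a-9b7e-2d1c0a9e8f70",
  "severity": "P2",
  "type": "drift",
  "description": "Runtime configuration deviated from signed baseline",
  "detectedAt": "2024-05-02T14:11:08Z",
  "status": "investigating",
  "affectedServices": ["policy-engine"],
  "actionsTaken": [
    {
      "timestamp": "2024-05-02T14:11:12Z",
      "action": "assess-drift-impact.sh executed",
      "automated": true,
      "actor": "triage-automation"
    }
  ]
}
```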
## Integration with Incident Management Platforms
The incident response system integrates with external platforms:
### Jira Integration
```bash
# scripts/create-ticket.sh
# JIRA_TOKEN is the base64-encoded "email:api_token" pair for Basic auth.

# Map AEEF severity to a Jira priority name (adjust to your priority scheme).
map_severity_to_priority() {
  case "$1" in
    P1) echo "Highest" ;;
    P2) echo "High" ;;
    P3) echo "Medium" ;;
    *)  echo "Low" ;;
  esac
}

# API v2 accepts a plain-text description; v3 requires Atlassian Document Format.
curl -X POST "https://your-org.atlassian.net/rest/api/2/issue" \
  -H "Authorization: Basic ${JIRA_TOKEN}" \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "fields": {
    "project": {"key": "AEEF"},
    "summary": "AEEF Incident: ${SEVERITY} - ${INCIDENT_TYPE}",
    "issuetype": {"name": "Bug"},
    "priority": {"name": "$(map_severity_to_priority "$SEVERITY")"},
    "description": "${DESCRIPTION}"
  }
}
EOF
```
### PagerDuty Integration
```bash
# scripts/page-oncall.sh

# Map AEEF severity to a PagerDuty Events API v2 severity level.
map_severity_to_pd() {
  case "$1" in
    P1) echo "critical" ;;
    P2) echo "error" ;;
    P3) echo "warning" ;;
    *)  echo "info" ;;
  esac
}

curl -X POST "https://events.pagerduty.com/v2/enqueue" \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "routing_key": "${PAGERDUTY_ROUTING_KEY}",
  "event_action": "trigger",
  "payload": {
    "summary": "AEEF ${SEVERITY}: ${INCIDENT_TYPE}",
    "severity": "$(map_severity_to_pd "$SEVERITY")",
    "source": "aeef-monitoring",
    "component": "governance"
  }
}
EOF
```
### OpsGenie Integration
Supported via AlertManager webhook configuration. See the Monitoring Setup page for AlertManager routing.
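A minimal AlertManager receiver sketch for reference; the label names, template, and API key value are assumptions, and the full routing tree lives in the Monitoring Setup configuration:

```yaml
route:
  receiver: opsgenie
receivers:
  - name: opsgenie
    opsgenie_configs:
      - api_key: <opsgenie-integration-api-key>
        message: 'AEEF {{ .CommonLabels.severity }}: {{ .CommonLabels.incident_type }}'
```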
## Next Steps
- Monitoring alerts: Monitoring Setup for configuring alert thresholds and routing
- Sovereign compliance: Sovereign Compliance Overlays for jurisdiction-specific incident protocols
- Governance framework: Incident Response for normative requirements