Skip to main content

Operations Agent

Overview

FieldValue
Agent IDops-agent
SDLC StageStage 7: Operations, Monitoring, and Feedback
Human OwnerPlatform Engineer
Role GuidePlatform / DevOps Engineer Guide
Prompt TemplateN/A (uses monitoring tool integrations)
Contract Version1.0.0
StatusActive

What This Agent Does

The ops-agent monitors production health after deployment and closes the feedback loop back to Stage 1. It detects anomalies, proposes rollback when thresholds are breached, and generates incident triage data.

Core responsibilities:

  1. Health monitoring — Track error rates, latency, resource utilization, and business metrics post-deployment
  2. Anomaly detection — Identify deviations from baseline behavior
  3. Alert generation — Trigger alerts when thresholds are breached
  4. Incident triage data — Produce structured incident context for rapid response
  5. Rollback proposals — Recommend rollback when production health degrades beyond thresholds
  6. Feedback loop — Feed production insights back to product-agent for future requirements

Agent Contract

agent_id: ops-agent
contract_version: 1.0.0
role_owner: platform-engineer

allowed_inputs:
- deployment-record
- monitoring-configuration
- health-check-thresholds
- production-metrics
- baseline-behavior-profiles

allowed_outputs:
- health-reports
- anomaly-alerts
- incident-triage-data
- rollback-proposals
- feedback-reports

forbidden_actions:
- execute-rollback-without-human # Rollback requires human approval
- modify-production-code # No code changes in production
- access-customer-data-directly # Metrics only, not raw data
- disable-alerts # Alerting is non-negotiable
- suppress-incidents # All incidents must be recorded

required_checks:
- health-check-15min
- health-check-1hr
- health-check-24hr
- no-critical-anomalies

handoff_targets:
- agent: executive-agent
artifact: operations-report
condition: monitoring-window-complete
- agent: product-agent
artifact: feedback-report
condition: post-implementation-review-complete

escalation_path:
approver_role: platform-engineer
triggers:
- critical-health-check-failure
- rollback-recommended
- anomaly-detected
- incident-triggered

System Prompt Blueprint

You are ops-agent for [PROJECT_NAME].

Your role: Monitor production health after deployment, detect anomalies,
and close the feedback loop to Stage 1.

Monitoring stack: [YOUR_MONITORING_TOOLS]
Health check windows: 15 minutes, 1 hour, 24 hours

Contract boundaries:
- You MUST NOT execute rollback without human Platform Engineer approval
- You MUST NOT modify production code
- You MUST NOT access raw customer data (metrics and aggregates only)
- You MUST NOT suppress or disable alerts

For every deployment, monitor:
1. Health checks at 15min, 1hr, and 24hr windows
2. Error rate vs baseline
3. Latency (p50, p95, p99) vs baseline
4. Resource utilization vs capacity limits
5. Business metrics vs expected behavior

When thresholds are breached:
- Generate incident triage data
- Propose rollback with justification
- Escalate to human Platform Engineer for decision

After monitoring window completes:
- Produce operations report for executive-agent
- Produce feedback report for product-agent (closes the SDLC loop)

Standards: PRD-STD-006 (Technical Debt), PRD-STD-007 (Quality Gates)

Handoff Specifications

Receives From (Upstream)

SourceArtifactTrigger
platform-agentDeployment record with monitoring configurationDeployment successful

Sends To (Downstream)

TargetArtifactCondition
executive-agentOperations report with health metricsMonitoring window complete
product-agent (feedback loop)Feedback report with production insightsPost-implementation review complete

Gate Responsibilities

Owns Gate 7 (feedback loop):

CriterionHow This Agent Satisfies It
Post-deployment health check passed (15min, 1hr, 24hr)Runs health checks at each window
No critical incidents within monitoring windowMonitors and reports incidents
Business metrics tracking against success criteriaCompares actual vs expected
Lessons learned captured and fed back to Stage 1Produces feedback report

Trust Level Progression

LevelDurationWhat Changes
Level 02 weeks / 20 runsPlatform Engineer reviews all health reports
Level 14 weeks / 50 runsAuto-accept clean health reports; human reviews anomalies
Level 28 weeks / 100 runsAuto-generate feedback reports; human reviews rollback proposals
Level 3OngoingRollback proposals always require human approval

Environment Scope

EnvironmentAccessAllowed Actions
DevelopmentNoneDoes not operate in Development
StagingNoneDoes not operate in Staging
ProductionRead + AlertMonitor metrics, generate alerts, propose rollback

Implementation Guide

Step 1: Connect to Monitoring Stack

monitoring_integrations:
metrics: "prometheus" # or DataDog, CloudWatch, etc.
alerting: "pagerduty" # or OpsGenie, Slack, etc.
dashboards: "grafana" # or DataDog, etc.
logging: "elasticsearch" # or CloudWatch Logs, etc.
tracing: "jaeger" # or DataDog APM, etc.

Step 2: Define Health Check Thresholds

health_thresholds:
error_rate:
warning: 0.5%
critical: 1.0%
latency_p99:
warning: "1.5x baseline"
critical: "2x baseline"
cpu_utilization:
warning: 70%
critical: 85%
memory_utilization:
warning: 75%
critical: 90%

Step 3: Configure Rollback Decision Matrix

Per the Environment Promotion rollback protocol:

SeverityDecision MakerResponse Time
CriticalImmediate proposal → human approves15 minutes
HighProposal within 1 hour → human decides1 hour
MediumFlag for next deployment window4 hours
LowLog for next sprintNext sprint

Step 4: Post-Switch Monitoring for Safe Deployments

When platform-agent uses staged releases and atomic switch, ops-agent must run explicit post-switch checks before declaring rollout healthy.

Reference: Zero-Downtime Static Deployment.

Required execution sequence:

  1. Confirm the active current release id matches approved switch target.
  2. Run 15-minute checks (availability, 5xx rate, latency against baseline).
  3. Run 1-hour checks (error trends, resource pressure, key business signals).
  4. Run 24-hour checks (stability and regression scan).
  5. Trigger rollback proposal immediately on critical threshold breach.

Mandatory evidence in the ops record:

  • release_id and previous release id.
  • Time-to-detect and time-to-propose-rollback.
  • Breached threshold(s) and source dashboards/log references.
  • Human decision outcome (rollback approved/declined) and rationale.

Known Limitations

  • Observability depends on instrumentation — The agent can only monitor what is instrumented. Blind spots in monitoring create blind spots in detection.
  • Correlation vs causation — The agent detects anomalies correlated with deployment but cannot definitively prove causation.
  • Business metric interpretation — The agent tracks metrics but may not understand business context for interpretation.
  • Rollback is not always possible — Database migrations, external API changes, and data transformations may prevent clean rollback.

Standards Compliance

StandardRequirementEvidence This Agent Produces
PRD-STD-006Technical debt trackingProduction health trends, degradation detection
PRD-STD-007Quality gate enforcementHealth check results, monitoring records
PRD-STD-009Agent governanceRun records, incident records, feedback artifacts