Operations Agent

Overview

Field	Value
Agent ID	`ops-agent`
SDLC Stage	Stage 7: Operations, Monitoring, and Feedback
Human Owner	Platform Engineer
Role Guide	Platform / DevOps Engineer Guide
Prompt Template	N/A (uses monitoring tool integrations)
Contract Version	1.0.0
Status	Active

What This Agent Does

The ops-agent monitors production health after deployment and closes the feedback loop back to Stage 1. It detects anomalies, proposes rollback when thresholds are breached, and generates incident triage data.

Core responsibilities:

Health monitoring — Track error rates, latency, resource utilization, and business metrics post-deployment
Anomaly detection — Identify deviations from baseline behavior
Alert generation — Trigger alerts when thresholds are breached
Incident triage data — Produce structured incident context for rapid response
Rollback proposals — Recommend rollback when production health degrades beyond thresholds
Feedback loop — Feed production insights back to product-agent for future requirements
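
The anomaly detection responsibility above can be illustrated with a minimal sketch. It assumes a simple z-score test against recent baseline samples. The function name, the sample values, and the default threshold of 3.0 are illustrative assumptions, not part of the ops-agent contract; a real deployment would likely rely on the anomaly detection built into its monitoring stack.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(baseline: Sequence[float], observed: float,
                 z_threshold: float = 3.0) -> bool:
    """Flag `observed` when it deviates strongly from the baseline samples.

    Hypothetical helper: the z-score test and the default threshold are
    assumptions chosen for illustration only.
    """
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return observed != mu
    return abs(observed - mu) / sigma > z_threshold


# Example: p95 latency samples (ms) from the baseline profile.
baseline_p95_ms = [100, 102, 98, 101, 99]
print(is_anomalous(baseline_p95_ms, 101))  # within normal variation
print(is_anomalous(baseline_p95_ms, 120))  # far outside the baseline
```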

Agent Contract

agent_id: ops-agent
contract_version: 1.0.0
role_owner: platform-engineer

allowed_inputs:
  - deployment-record
  - monitoring-configuration
  - health-check-thresholds
  - production-metrics
  - baseline-behavior-profiles

allowed_outputs:
  - health-reports
  - anomaly-alerts
  - incident-triage-data
  - rollback-proposals
  - feedback-reports

forbidden_actions:
  - execute-rollback-without-human    # Rollback requires human approval
  - modify-production-code            # No code changes in production
  - access-customer-data-directly     # Metrics only, not raw data
  - disable-alerts                    # Alerting is non-negotiable
  - suppress-incidents                # All incidents must be recorded

required_checks:
  - health-check-15min
  - health-check-1hr
  - health-check-24hr
  - no-critical-anomalies

handoff_targets:
  - agent: executive-agent
    artifact: operations-report
    condition: monitoring-window-complete
  - agent: product-agent
    artifact: feedback-report
    condition: post-implementation-review-complete

escalation_path:
  approver_role: platform-engineer
  triggers:
    - critical-health-check-failure
    - rollback-recommended
    - anomaly-detected
    - incident-triggered
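
One way a runtime could enforce this contract is sketched below. The sets mirror the `forbidden_actions`, `allowed_outputs`, and `escalation_path.triggers` entries above; the function names and the idea of a pre-action guard are assumptions for illustration, and action names other than the forbidden ones (such as `propose-rollback`) are hypothetical.

```python
FORBIDDEN_ACTIONS = frozenset({
    "execute-rollback-without-human",
    "modify-production-code",
    "access-customer-data-directly",
    "disable-alerts",
    "suppress-incidents",
})

ALLOWED_OUTPUTS = frozenset({
    "health-reports",
    "anomaly-alerts",
    "incident-triage-data",
    "rollback-proposals",
    "feedback-reports",
})

ESCALATION_TRIGGERS = frozenset({
    "critical-health-check-failure",
    "rollback-recommended",
    "anomaly-detected",
    "incident-triggered",
})


def check_action(action: str) -> None:
    """Raise PermissionError if the contract forbids this action."""
    if action in FORBIDDEN_ACTIONS:
        raise PermissionError(f"ops-agent contract forbids: {action}")


def is_allowed_output(output_type: str) -> bool:
    """Return True when the output type is listed in allowed_outputs."""
    return output_type in ALLOWED_OUTPUTS


def requires_escalation(event: str) -> bool:
    """Return True when the event must go to the platform-engineer."""
    return event in ESCALATION_TRIGGERS


check_action("propose-rollback")  # hypothetical action name; passes silently
print(requires_escalation("rollback-recommended"))
```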

System Prompt Blueprint

You are ops-agent for [PROJECT_NAME].

Your role: Monitor production health after deployment, detect anomalies,
and close the feedback loop to Stage 1.

Monitoring stack: [YOUR_MONITORING_TOOLS]
Health check windows: 15 minutes, 1 hour, 24 hours

Contract boundaries:
- You MUST NOT execute rollback without human Platform Engineer approval
- You MUST NOT modify production code
- You MUST NOT access raw customer data (metrics and aggregates only)
- You MUST NOT disable alerts or suppress incidents

For every deployment, monitor:
1. Health checks at 15min, 1hr, and 24hr windows
2. Error rate vs baseline
3. Latency (p50, p95, p99) vs baseline
4. Resource utilization vs capacity limits
5. Business metrics vs expected behavior

When thresholds are breached:
- Generate incident triage data
- Propose rollback with justification
- Escalate to human Platform Engineer for decision

After monitoring window completes:
- Produce operations report for executive-agent
- Produce feedback report for product-agent (closes the SDLC loop)

Standards: PRD-STD-006 (Technical Debt), PRD-STD-007 (Quality Gates)

Handoff Specifications

Receives From (Upstream)

Source	Artifact	Trigger
`platform-agent`	Deployment record with monitoring configuration	Deployment successful

Sends To (Downstream)

Target	Artifact	Condition
`executive-agent`	Operations report with health metrics	Monitoring window complete
`product-agent` (feedback loop)	Feedback report with production insights	Post-implementation review complete

Gate Responsibilities

Owns Gate 7 (feedback loop):

Criterion	How This Agent Satisfies It
Post-deployment health check passed (15min, 1hr, 24hr)	Runs health checks at each window
No critical incidents within monitoring window	Monitors and reports incidents
Business metrics tracking against success criteria	Compares actual vs expected
Lessons learned captured and fed back to Stage 1	Produces feedback report

Trust Level Progression

Level	Duration	What Changes
Level 0	2 weeks / 20 runs	Platform Engineer reviews all health reports
Level 1	4 weeks / 50 runs	Auto-accept clean health reports; human reviews anomalies
Level 2	8 weeks / 100 runs	Auto-generate feedback reports; human reviews rollback proposals
Level 3	Ongoing	Rollback proposals always require human approval
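
The progression table can be read as a promotion rule, sketched below. The table does not say whether "2 weeks / 20 runs" means both conditions or either one; this sketch assumes both must be met, and the function names are hypothetical.

```python
# Minimum (weeks, runs) at each level before promotion to the next one.
# Assumption: BOTH minimums must be met; the table does not specify this.
PROMOTION_CRITERIA = {
    0: (2, 20),
    1: (4, 50),
    2: (8, 100),
}


def eligible_for_promotion(level: int, weeks_at_level: float,
                           runs_at_level: int) -> bool:
    """Return True when the agent may move from `level` to `level + 1`."""
    if level not in PROMOTION_CRITERIA:
        return False  # Level 3 is ongoing; there is nothing above it.
    min_weeks, min_runs = PROMOTION_CRITERIA[level]
    return weeks_at_level >= min_weeks and runs_at_level >= min_runs


def rollback_requires_human(level: int) -> bool:
    """Rollback proposals need human approval at every trust level."""
    return True


print(eligible_for_promotion(0, weeks_at_level=2, runs_at_level=20))
print(eligible_for_promotion(1, weeks_at_level=5, runs_at_level=30))
```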

Environment Scope

Environment	Access	Allowed Actions
Development	None	Does not operate in Development
Staging	None	Does not operate in Staging
Production	Read + Alert	Monitor metrics, generate alerts, propose rollback

Implementation Guide

Step 1: Connect to Monitoring Stack

monitoring_integrations:
  metrics: "prometheus"        # or DataDog, CloudWatch, etc.
  alerting: "pagerduty"        # or OpsGenie, Slack, etc.
  dashboards: "grafana"        # or DataDog, etc.
  logging: "elasticsearch"     # or CloudWatch Logs, etc.
  tracing: "jaeger"            # or DataDog APM, etc.

Step 2: Define Health Check Thresholds

health_thresholds:
  error_rate:
    warning: "0.5%"
    critical: "1.0%"
  latency_p99:
    warning: "1.5x baseline"
    critical: "2x baseline"
  cpu_utilization:
    warning: "70%"
    critical: "85%"
  memory_utilization:
    warning: "75%"
    critical: "90%"
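
As a rough illustration of how these thresholds might be evaluated, the sketch below classifies one metrics sample. The numbers come from the configuration above; the class and function names are hypothetical, and treating a value exactly equal to a threshold as a breach is an assumption the configuration does not settle.

```python
from dataclasses import dataclass
from typing import Dict

SEVERITY_ORDER = ["ok", "warning", "critical"]


@dataclass(frozen=True)
class Threshold:
    warning: float
    critical: float

    def classify(self, value: float) -> str:
        # Assumption: a value equal to a threshold counts as a breach.
        if value >= self.critical:
            return "critical"
        if value >= self.warning:
            return "warning"
        return "ok"


# Percent-based thresholds, expressed in percent.
PERCENT_THRESHOLDS = {
    "error_rate": Threshold(warning=0.5, critical=1.0),
    "cpu_utilization": Threshold(warning=70.0, critical=85.0),
    "memory_utilization": Threshold(warning=75.0, critical=90.0),
}
# p99 latency is compared as a multiple of its baseline.
LATENCY_P99_RATIO = Threshold(warning=1.5, critical=2.0)


def classify_sample(percent_metrics: Dict[str, float],
                    latency_p99_ms: float,
                    baseline_p99_ms: float) -> Dict[str, str]:
    """Classify each metric in one sample as ok, warning, or critical."""
    if baseline_p99_ms <= 0:
        raise ValueError("baseline p99 latency must be positive")
    results = {name: PERCENT_THRESHOLDS[name].classify(value)
               for name, value in percent_metrics.items()}
    ratio = latency_p99_ms / baseline_p99_ms
    results["latency_p99"] = LATENCY_P99_RATIO.classify(ratio)
    return results


def worst_severity(results: Dict[str, str]) -> str:
    """Return the most severe classification in the sample."""
    return max(results.values(), key=SEVERITY_ORDER.index)


sample = classify_sample(
    {"error_rate": 0.7, "cpu_utilization": 60.0, "memory_utilization": 92.0},
    latency_p99_ms=450.0,
    baseline_p99_ms=200.0,
)
print(sample, worst_severity(sample))
```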

Step 3: Configure Rollback Decision Matrix

Per the Environment Promotion rollback protocol:

Severity	Action and Decision Maker	Response Time
Critical	Immediate proposal → human approves	15 minutes
High	Proposal within 1 hour → human decides	1 hour
Medium	Flag for next deployment window	4 hours
Low	Log for next sprint	Next sprint
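
The matrix above can be encoded as a lookup from severity to response deadline, as in the sketch below. The function name is hypothetical, and representing "Next sprint" as `None` (no fixed clock deadline) is an assumption made for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

RESPONSE_WINDOWS = {
    "critical": timedelta(minutes=15),
    "high": timedelta(hours=1),
    "medium": timedelta(hours=4),
    "low": None,  # "Next sprint": no fixed clock deadline (assumption)
}


def response_deadline(severity: str,
                      detected_at: datetime) -> Optional[datetime]:
    """Return the time by which the matrix expects a response, if any."""
    key = severity.lower()
    if key not in RESPONSE_WINDOWS:
        raise ValueError(f"unknown severity: {severity}")
    window = RESPONSE_WINDOWS[key]
    return None if window is None else detected_at + window


detected = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(response_deadline("Critical", detected))
print(response_deadline("Low", detected))
```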

Step 4: Post-Switch Monitoring for Safe Deployments

When platform-agent uses staged releases with an atomic switch, ops-agent must run explicit post-switch checks before declaring the rollout healthy.

Reference: Zero-Downtime Static Deployment.

Required execution sequence:

1. Confirm that the active (current) release ID matches the approved switch target.
2. Run 15-minute checks (availability, 5xx rate, latency against baseline).
3. Run 1-hour checks (error trends, resource pressure, key business signals).
4. Run 24-hour checks (stability and regression scan).
5. Trigger a rollback proposal immediately on any critical threshold breach.
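
The sequence above might be orchestrated roughly as follows. This is a sketch only: each check is a caller-supplied function returning "ok", "warning", or "critical", the waiting between windows is left out, and stopping without running any checks when the release IDs do not match is an assumption, since the source does not say what should happen in that case. All names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Check = Callable[[], str]  # returns "ok", "warning", or "critical"


def run_post_switch_checks(active_release_id: str,
                           approved_release_id: str,
                           windows: List[Tuple[str, Check]]) -> Dict[str, object]:
    """Run the window checks in order and stop at the first critical result."""
    result: Dict[str, object] = {
        "healthy": False,
        "completed": [],
        "rollback_proposed_at": None,
        "reason": None,
    }
    # Confirm the switch target before any health check runs.
    if active_release_id != approved_release_id:
        result["reason"] = "release-id-mismatch"
        return result
    completed: List[str] = []
    result["completed"] = completed
    for name, check in windows:
        severity = check()
        completed.append(name)
        if severity == "critical":
            # Propose rollback immediately; a human still decides.
            result["rollback_proposed_at"] = name
            result["reason"] = "critical-threshold-breach"
            return result
    result["healthy"] = True
    result["reason"] = "all-windows-passed"
    return result


outcome = run_post_switch_checks(
    "release-42", "release-42",
    [("15min", lambda: "ok"), ("1hr", lambda: "critical"), ("24hr", lambda: "ok")],
)
print(outcome)
```

In this usage example the 24-hour check never runs, because the 1-hour window already produced a rollback proposal.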

Mandatory evidence in the ops record:

release_id and previous release id.
Time-to-detect and time-to-propose-rollback.
Breached threshold(s) and source dashboards/log references.
Human decision outcome (rollback approved/declined) and rationale.
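
A record carrying the mandatory evidence could look like the sketch below. The field names are hypothetical. It also assumes that time-to-detect is measured from the start of the breach to detection and that time-to-propose-rollback is measured from detection to the proposal; the source does not define either interval.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class OpsRecord:
    release_id: str
    previous_release_id: str
    breach_started_at: datetime
    detected_at: datetime
    rollback_proposed_at: datetime
    breached_thresholds: List[str]
    evidence_refs: List[str]      # dashboard or log references
    human_decision: str           # "approved" or "declined"
    decision_rationale: str

    @property
    def time_to_detect(self) -> timedelta:
        # Assumption: measured from breach start to detection.
        return self.detected_at - self.breach_started_at

    @property
    def time_to_propose_rollback(self) -> timedelta:
        # Assumption: measured from detection to the rollback proposal.
        return self.rollback_proposed_at - self.detected_at

    def missing_evidence(self) -> List[str]:
        """List the mandatory items that are absent from this record."""
        missing = []
        if not self.release_id or not self.previous_release_id:
            missing.append("release ids")
        if not self.breached_thresholds:
            missing.append("breached thresholds")
        if not self.evidence_refs:
            missing.append("dashboard/log references")
        if self.human_decision not in ("approved", "declined"):
            missing.append("human decision outcome")
        if not self.decision_rationale.strip():
            missing.append("decision rationale")
        return missing


record = OpsRecord(
    release_id="release-42",
    previous_release_id="release-41",
    breach_started_at=datetime(2024, 1, 1, 12, 0),
    detected_at=datetime(2024, 1, 1, 12, 4),
    rollback_proposed_at=datetime(2024, 1, 1, 12, 6),
    breached_thresholds=["error_rate critical"],
    evidence_refs=["grafana: error-rate panel"],
    human_decision="approved",
    decision_rationale="Error rate stayed above the critical threshold.",
)
print(record.time_to_detect, record.time_to_propose_rollback)
```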

Known Limitations

Observability depends on instrumentation — The agent can only monitor what is instrumented. Blind spots in monitoring create blind spots in detection.
Correlation vs causation — The agent detects anomalies correlated with deployment but cannot definitively prove causation.
Business metric interpretation — The agent tracks metrics but may not understand business context for interpretation.
Rollback is not always possible — Database migrations, external API changes, and data transformations may prevent clean rollback.

Standards Compliance

Standard	Requirement	Evidence This Agent Produces
PRD-STD-006	Technical debt tracking	Production health trends, degradation detection
PRD-STD-007	Quality gate enforcement	Health check results, monitoring records
PRD-STD-009	Agent governance	Run records, incident records, feedback artifacts
