Orchestration Rules and State Machine
The orchestrator is the control layer that routes work between agents, enforces stage order, and handles failures. This document defines the rules that govern the orchestrator's behavior. It is mandatory per PRD-STD-009 REQ-009-05.
State Machine Definition
Every work item in the pipeline exists in exactly one state at any time. The orchestrator manages state transitions.
┌───────────────────────────────────────────────────────────────────────────┐
│                        ORCHESTRATION STATE MACHINE                        │
│                                                                           │
│  ┌──────────┐                                                             │
│  │  INTAKE  │                                                             │
│  └────┬─────┘                                                             │
│       │                                                                   │
│       v                                                                   │
│  ┌────────────┐  PASS   ┌──────────┐  PASS   ┌──────────────┐             │
│  │REQUIREMENTS│────────>│  DESIGN  │────────>│IMPLEMENTATION│             │
│  │ (Stage 1)  │         │(Stage 2) │         │  (Stage 3)   │             │
│  └─────┬──────┘         └────┬─────┘         └──────┬───────┘             │
│        │ FAIL                │ FAIL                 │ FAIL                │
│        v                     v                      v                     │
│     REWORK                REWORK                 REWORK                   │
│    (Stage 1)             (Stage 2)              (Stage 3)                 │
│                                                     │ PASS                │
│                                                     v                     │
│                           ┌──────────┐  PASS   ┌──────────┐               │
│                           │ SECURITY │<────────│ TESTING  │               │
│                           │(Stage 5) │         │(Stage 4) │               │
│                           └────┬─────┘         └────┬─────┘               │
│                                │ FAIL               │ FAIL                │
│                                v                    v                     │
│                             REWORK               REWORK                   │
│                           (Stage 3/5)          (Stage 3/4)                │
│                                │ PASS                                     │
│                                v                                          │
│                           ┌──────────┐  PASS   ┌──────────┐               │
│                           │DEPLOYMENT│────────>│OPERATIONS│               │
│                           │(Stage 6) │         │(Stage 7) │               │
│                           └────┬─────┘         └────┬─────┘               │
│                                │ FAIL               │ INCIDENT            │
│                                v                    v                     │
│                             REWORK              ROLLBACK                  │
│                            (Stage 6)            + REWORK                  │
│                                                     │                     │
│                                                 FEEDBACK                  │
│                                                 TO INTAKE                 │
│                                                                           │
│  Terminal states: DEPLOYED, ROLLED_BACK, CANCELLED                        │
└───────────────────────────────────────────────────────────────────────────┘
State Definitions
| State | Description | Owner Agent | Valid Transitions |
|---|---|---|---|
| INTAKE | Work item received, not yet assigned | Orchestrator | → REQUIREMENTS |
| REQUIREMENTS | Stage 1 active | product-agent, scrum-agent | → DESIGN (pass), → REWORK_REQ (fail) |
| DESIGN | Stage 2 active | architect-agent | → IMPLEMENTATION (pass), → REWORK_DESIGN (fail) |
| IMPLEMENTATION | Stage 3 active | developer-agent | → TESTING (pass), → REWORK_IMPL (fail) |
| TESTING | Stage 4 active | qa-agent, devmgr-agent | → SECURITY (pass), → REWORK_IMPL (code fix), → REWORK_TEST (test fix) |
| SECURITY | Stage 5 active | security-agent, compliance-agent | → DEPLOYMENT (pass), → REWORK_IMPL (remediation), → REWORK_SEC (scan config) |
| DEPLOYMENT | Stage 6 active | platform-agent | → OPERATIONS (pass), → REWORK_DEPLOY (config fix) |
| OPERATIONS | Stage 7 active | ops-agent, executive-agent | → DEPLOYED (stable), → ROLLBACK (incident) |
| REWORK_* | Rework in progress at specific stage | Stage-specific agent | → Return to originating stage |
| DEPLOYED | Successfully in production (terminal) | — | → INTAKE (new work via feedback) |
| ROLLED_BACK | Rolled back from production (terminal) | — | → INTAKE (rework item created) |
| CANCELLED | Work item cancelled (terminal) | — | None |
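The "Valid Transitions" column can be read as a lookup table. The following is a minimal sketch, not the orchestrator's actual implementation. The names `VALID_TRANSITIONS`, `REWORK_RETURNS_TO`, and `can_transition` are hypothetical, and two points are assumptions: each REWORK_* state is assumed to return to the stage being reworked, and the → INTAKE entries on terminal states are treated as creating a new work item rather than as a transition of the existing one.

```python
# Forward and failure transitions copied from the "Valid Transitions" column.
VALID_TRANSITIONS = {
    "INTAKE": {"REQUIREMENTS"},
    "REQUIREMENTS": {"DESIGN", "REWORK_REQ"},
    "DESIGN": {"IMPLEMENTATION", "REWORK_DESIGN"},
    "IMPLEMENTATION": {"TESTING", "REWORK_IMPL"},
    "TESTING": {"SECURITY", "REWORK_IMPL", "REWORK_TEST"},
    "SECURITY": {"DEPLOYMENT", "REWORK_IMPL", "REWORK_SEC"},
    "DEPLOYMENT": {"OPERATIONS", "REWORK_DEPLOY"},
    "OPERATIONS": {"DEPLOYED", "ROLLBACK"},
}

# ASSUMPTION: each REWORK_* state returns to the stage being reworked.
REWORK_RETURNS_TO = {
    "REWORK_REQ": "REQUIREMENTS",
    "REWORK_DESIGN": "DESIGN",
    "REWORK_IMPL": "IMPLEMENTATION",
    "REWORK_TEST": "TESTING",
    "REWORK_SEC": "SECURITY",
    "REWORK_DEPLOY": "DEPLOYMENT",
}


def can_transition(current: str, target: str) -> bool:
    """Return True if the table allows moving from `current` to `target`."""
    if current in REWORK_RETURNS_TO:
        return target == REWORK_RETURNS_TO[current]
    # Terminal and unknown states have no entry, so nothing is allowed.
    return target in VALID_TRANSITIONS.get(current, set())
```

Rejecting any transition that is not listed keeps the "exactly one state at any time" rule enforceable in a single place.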
Transition Rules
Forward Transitions (Happy Path)
Every forward transition requires:
- Gate criteria met — all checks for the current stage passed
- Handoff artifact produced — structured output per PRD-STD-009 REQ-009-06
- Target agent available — next agent has capacity and valid contract
- No blocking escalations — no unresolved escalation requests
# Example transition rule
transition:
  from: IMPLEMENTATION
  to: TESTING
  requires:
    - gate_3_passed: true
    - handoff_artifact: present
    - ai_metadata: [AI-Usage, Agent-IDs, AI-Prompt-Ref]
    - unit_tests: passing
    - lint: passing
  produces:
    - handoff: "HO-developer-agent-qa-agent-{timestamp}"
    - state_change: "IMPLEMENTATION → TESTING"
    - audit_record: true
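The four forward-transition requirements can be checked together before any state change is written. This is an illustrative sketch only. `ForwardCheck` and `blocking_reasons` are hypothetical names, and how each flag is computed (gate evaluation, agent capacity lookup) is outside its scope.

```python
from dataclasses import dataclass


@dataclass
class ForwardCheck:
    """Inputs for one forward transition; field names are assumptions."""
    gate_passed: bool
    handoff_artifact_present: bool
    target_agent_available: bool
    open_escalations: int


def blocking_reasons(check: ForwardCheck) -> list[str]:
    """Return every unmet requirement; an empty list means the move is allowed."""
    reasons = []
    if not check.gate_passed:
        reasons.append("gate criteria not met")
    if not check.handoff_artifact_present:
        reasons.append("handoff artifact missing")
    if not check.target_agent_available:
        reasons.append("target agent unavailable")
    if check.open_escalations > 0:
        reasons.append("unresolved escalations")
    return reasons
```

Returning all unmet requirements, instead of stopping at the first, gives the audit record a complete picture of why a work item did not advance.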
Failure Transitions (Rework Routing)
When a gate fails, the orchestrator must route the work item to the correct agent for rework. The routing depends on the failure type.
| Current Stage | Failure Type | Route To | Rework Agent | Max Rework Iterations |
|---|---|---|---|---|
| Testing (4) | Test failure (code bug) | REWORK_IMPL | developer-agent | 3 |
| Testing (4) | Test gap (missing coverage) | REWORK_TEST | qa-agent | 2 |
| Testing (4) | Acceptance criteria mismatch | REWORK_REQ | product-agent | 1 |
| Security (5) | Vulnerability found | REWORK_IMPL | developer-agent | 3 |
| Security (5) | License violation | REWORK_IMPL | developer-agent | 2 |
| Security (5) | Compliance evidence gap | REWORK_SEC | compliance-agent | 2 |
| Deployment (6) | Configuration error | REWORK_DEPLOY | platform-agent | 2 |
| Deployment (6) | Environment mismatch | REWORK_DEPLOY | platform-agent | 2 |
| Operations (7) | Health check failure | ROLLBACK | ops-agent + platform-agent | 1 (then escalate) |
| Operations (7) | Critical incident | ROLLBACK | ops-agent + human | Immediate |
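The rework rows for Stages 4 to 6 map naturally onto a keyed lookup. The sketch below assumes hypothetical failure-type identifiers such as `code_bug`, which the table does not define. The Operations rows are left out because they trigger a rollback rather than a rework state.

```python
# (stage, failure type) -> (rework state, rework agent, max rework iterations)
# The failure-type keys are hypothetical identifiers for the table's wording.
REWORK_ROUTES = {
    ("TESTING", "code_bug"): ("REWORK_IMPL", "developer-agent", 3),
    ("TESTING", "test_gap"): ("REWORK_TEST", "qa-agent", 2),
    ("TESTING", "acceptance_mismatch"): ("REWORK_REQ", "product-agent", 1),
    ("SECURITY", "vulnerability"): ("REWORK_IMPL", "developer-agent", 3),
    ("SECURITY", "license_violation"): ("REWORK_IMPL", "developer-agent", 2),
    ("SECURITY", "compliance_evidence_gap"): ("REWORK_SEC", "compliance-agent", 2),
    ("DEPLOYMENT", "configuration_error"): ("REWORK_DEPLOY", "platform-agent", 2),
    ("DEPLOYMENT", "environment_mismatch"): ("REWORK_DEPLOY", "platform-agent", 2),
}


def route_failure(stage: str, failure_type: str):
    """Return (state, agent, max_iterations), or None for an unlisted failure."""
    return REWORK_ROUTES.get((stage, failure_type))
```

A `None` result signals a failure the table does not cover, which a caller would most likely treat as a case for human escalation.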
Escalation Transitions
When an agent cannot resolve an issue within its contract, it must escalate to a human.
| Trigger | Escalation Target | Response SLA | Action if SLA Breached |
|---|---|---|---|
| Architecture-impacting decision | Solution Architect | 4 hours | Block pipeline, notify CTO |
| Auth/crypto/PII change | Security Engineer | 2 hours | Block pipeline, notify Security lead |
| Rework iteration limit reached | Development Manager | 4 hours | Block pipeline, create incident |
| Agent contract violation | CTO | 1 hour | Suspend agent immediately |
| Cross-agent conflict (contradictory outputs) | Solution Architect | 4 hours | Block pipeline, convene review |
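An SLA breach check reduces to comparing elapsed time against the table's response window. The following sketch uses hypothetical trigger keys and a hypothetical `breach_action` helper. It only reports which action applies and does not perform the blocking or notification itself.

```python
from datetime import datetime, timedelta, timezone

# trigger -> (escalation target, response SLA, action if SLA breached)
# Trigger keys are hypothetical identifiers for the table rows.
ESCALATIONS = {
    "architecture_decision": ("Solution Architect", timedelta(hours=4), "Block pipeline, notify CTO"),
    "auth_crypto_pii_change": ("Security Engineer", timedelta(hours=2), "Block pipeline, notify Security lead"),
    "rework_limit_reached": ("Development Manager", timedelta(hours=4), "Block pipeline, create incident"),
    "agent_contract_violation": ("CTO", timedelta(hours=1), "Suspend agent immediately"),
    "cross_agent_conflict": ("Solution Architect", timedelta(hours=4), "Block pipeline, convene review"),
}


def breach_action(trigger: str, raised_at: datetime, now: datetime):
    """Return the breach action if the SLA has elapsed, otherwise None."""
    _target, sla, action = ESCALATIONS[trigger]
    return action if now - raised_at > sla else None


raised = datetime(2026, 2, 23, 14, 0, tzinfo=timezone.utc)
three_hours_later = raised + timedelta(hours=3)
```

With these sample values, an agent contract violation raised three hours ago has breached its 1-hour SLA, while an architecture decision raised at the same time is still inside its 4-hour window.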
Iteration Limits and Deadlock Prevention
Maximum Iteration Thresholds
Per PRD-STD-009 REQ-009-07, autonomous loops must enforce maximum iteration limits.
| Loop Type | Max Iterations | On Breach |
|---|---|---|
| Single agent rework (same stage) | 3 | Escalate to human owner of that stage |
| Cross-stage rework (bouncing between stages) | 5 total across all stages | Escalate to Development Manager |
| Full pipeline retry (Stage 1 restart) | 2 | Escalate to CTO, likely needs scope change |
| Deployment retry | 2 | Block deployment, human investigation required |
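A limit check along these lines could run before each new loop iteration starts. This sketch assumes the limit counts completed iterations, so a fourth single-agent rework attempt is refused after three have finished; the table does not state whether the bound is inclusive. The loop-type keys and `check_limit` are hypothetical names.

```python
# loop type -> (max iterations, on-breach action from the table)
ITERATION_LIMITS = {
    "single_agent_rework": (3, "Escalate to human owner of that stage"),
    "cross_stage_rework": (5, "Escalate to Development Manager"),
    "full_pipeline_retry": (2, "Escalate to CTO, likely needs scope change"),
    "deployment_retry": (2, "Block deployment, human investigation required"),
}


def check_limit(loop_type: str, completed_iterations: int):
    """Return None if another iteration may start, else the on-breach action.

    ASSUMPTION: the limit is compared against iterations already completed.
    """
    limit, on_breach = ITERATION_LIMITS[loop_type]
    return None if completed_iterations < limit else on_breach
```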
Deadlock Detection
A deadlock occurs when two or more agents are waiting for each other's output. The orchestrator must detect and resolve these.
Detection rules:
- Circular wait: Agent A waits for Agent B, Agent B waits for Agent A
- Timeout: Any agent state unchanged for >2x its expected execution time
- Contradictory outputs: Two agents produce conflicting recommendations with no resolution path
Resolution protocol:
1. Orchestrator detects deadlock condition
2. Orchestrator pauses all involved agents
3. Orchestrator notifies the Development Manager with:
   - Deadlock type (circular, timeout, contradictory)
   - Involved agents and their current states
   - Last handoff artifacts from each agent
   - Suggested resolution (human decision needed)
4. Development Manager resolves by:
   - Choosing one agent's output over the other
   - Providing additional context to break the tie
   - Escalating to Solution Architect for architecture decisions
   - Cancelling the work item if resolution is not feasible
5. Orchestrator resumes pipeline with resolution applied
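The circular-wait and timeout detection rules can be sketched as two small checks. This example assumes each agent waits on at most one other agent, represented as a hypothetical `waits_for` mapping. Detecting contradictory outputs needs semantic comparison and is not covered here.

```python
from typing import Dict, List, Optional


def find_circular_wait(waits_for: Dict[str, str]) -> Optional[List[str]]:
    """Follow each wait chain; return the agents forming a cycle, if any."""
    for start in waits_for:
        seen: List[str] = []
        current = start
        while current in waits_for and current not in seen:
            seen.append(current)
            current = waits_for[current]
        if current in seen:
            # The cycle begins where the chain first revisits an agent.
            return seen[seen.index(current):]
    return None


def is_timed_out(elapsed_seconds: float, expected_seconds: float) -> bool:
    """Timeout rule: state unchanged for more than 2x the expected time."""
    return elapsed_seconds > 2 * expected_seconds
```

For example, `find_circular_wait({"qa-agent": "developer-agent", "developer-agent": "qa-agent"})` returns both agents, which is the list the orchestrator would pause in step 2.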
Parallel Execution Rules
Some stages can run in parallel to reduce cycle time. The orchestrator manages parallelism.
Allowed Parallel Paths
Stage 3 (Implementation) completes
    │
    ├──> Stage 4 (qa-agent) ───────────────┐
    │                                      │
    └──> Stage 5 (security-agent) ─────────┤ ──> Merge results ──> Gate 5
         Stage 5 (compliance-agent) ───────┘
Rules for parallel execution:
- qa-agent and security-agent MAY run in parallel after Gate 3
- compliance-agent MAY run in parallel with security-agent
- Both paths must complete and pass before Gate 5 is evaluated
- If one path fails, the other continues but the work item cannot advance
- devmgr-agent runs after both qa-agent and security-agent complete (needs both outputs)
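The parallel rules can be illustrated with `asyncio.gather`, which lets every path run to completion before the results are merged. This is a sketch under stated assumptions: `run_agent` is a hypothetical stand-in that returns a pre-set pass/fail outcome instead of doing real agent work.

```python
import asyncio


async def run_agent(name: str, passed: bool):
    """Hypothetical stand-in for real agent work; yields control once."""
    await asyncio.sleep(0)
    return name, passed


async def run_parallel_paths(outcomes: dict):
    """Run all paths concurrently, then merge results for Gate 5.

    A failing path does not cancel the others; the work item simply
    cannot advance unless every path passed.
    """
    pairs = await asyncio.gather(
        *(run_agent(name, passed) for name, passed in outcomes.items())
    )
    results = dict(pairs)
    return results, all(results.values())
```

Calling `asyncio.run(run_parallel_paths({...}))` with one failing agent still returns every agent's result, matching the rule that the other path continues while the work item stays blocked.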
Forbidden Parallel Paths
These stages MUST run sequentially:
| Stage A | Stage B | Reason |
|---|---|---|
| Requirements (1) | Design (2) | Design depends on approved requirements |
| Design (2) | Implementation (3) | Code depends on approved design |
| Security (5) | Deployment (6) | Cannot deploy security-uncleared code |
| Deployment (6) | Operations (7) | Cannot monitor what is not deployed |
Orchestrator Configuration
Work Item Metadata
Every work item tracked by the orchestrator carries this metadata:
work_item:
  id: "WI-{project}-{sequence}"
  title: "{descriptive title}"
  risk_tier: 1|2|3|4
  data_classification: public|internal|confidential|restricted
  current_state: "{state from state machine}"
  current_agent: "{agent-id or null}"
  created_at: "{ISO 8601}"
  updated_at: "{ISO 8601}"
  stage_history:
    - stage: 1
      agent: "product-agent"
      entered_at: "{ISO 8601}"
      exited_at: "{ISO 8601}"
      result: "pass|fail|escalate"
      handoff_id: "HO-{id}"
      iteration: 1
  rework_count: 0
  total_iterations: 0
  escalation_history: []
  trust_levels:
    product-agent: 1
    architect-agent: 0
    developer-agent: 2
    qa-agent: 1
    security-agent: 1
Orchestrator Health Checks
The orchestrator itself must be monitored:
| Metric | Threshold | Action on Breach |
|---|---|---|
| Queue depth | >50 work items | Alert Development Manager, assess capacity |
| Average cycle time | >2x baseline | Investigate bottleneck stages |
| Deadlock rate | >1 per week | Review agent contracts for conflicts |
| Escalation rate | >10% of work items | Review trust levels and agent capabilities |
| Gate failure rate | >30% at any single gate | Investigate root cause, retrain agents |
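The thresholds above could be evaluated by a periodic monitor. The sketch below assumes a hypothetical `metrics` dictionary whose key names are invented for illustration, with rates expressed as fractions (0.10 for 10%).

```python
def health_alerts(metrics: dict) -> list[str]:
    """Return the action for every breached threshold in the table."""
    alerts = []
    if metrics["queue_depth"] > 50:
        alerts.append("Alert Development Manager, assess capacity")
    if metrics["avg_cycle_time"] > 2 * metrics["baseline_cycle_time"]:
        alerts.append("Investigate bottleneck stages")
    if metrics["deadlocks_per_week"] > 1:
        alerts.append("Review agent contracts for conflicts")
    if metrics["escalation_rate"] > 0.10:
        alerts.append("Review trust levels and agent capabilities")
    # The gate rule applies per gate, so check the worst single gate.
    if max(metrics["gate_failure_rates"].values()) > 0.30:
        alerts.append("Investigate root cause, retrain agents")
    return alerts
```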
Event Log Format
Every state transition produces an event in the orchestration log:
{
  "event_id": "EVT-{uuid}",
  "timestamp": "2026-02-23T14:30:00Z",
  "work_item_id": "WI-myproject-042",
  "transition": {
    "from_state": "IMPLEMENTATION",
    "to_state": "TESTING",
    "trigger": "gate_3_passed"
  },
  "agent": {
    "source": "developer-agent",
    "target": "qa-agent"
  },
  "handoff_id": "HO-developer-agent-qa-agent-20260223T143000",
  "gate_results": {
    "lint": "pass",
    "unit_tests": "pass",
    "sast_basic": "pass",
    "ai_metadata": "pass"
  },
  "trust_level": 2,
  "human_approval": null,
  "duration_seconds": 3420
}
This log format satisfies PRD-STD-009 REQ-009-14 (auditable run records) and PRD-STD-005 (documentation requirements).
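An event in this shape could be assembled by a small helper. The sketch below is an assumption about how such a record might be built; `build_event` is a hypothetical function, and the handoff ID pattern is inferred from the single example in this document.

```python
import json
import uuid
from datetime import datetime, timezone


def build_event(work_item_id, from_state, to_state, trigger, source, target,
                gate_results, trust_level, duration_seconds, now=None):
    """Assemble one orchestration log event as a plain dictionary."""
    now = now or datetime.now(timezone.utc)
    return {
        "event_id": f"EVT-{uuid.uuid4()}",
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "work_item_id": work_item_id,
        "transition": {"from_state": from_state, "to_state": to_state, "trigger": trigger},
        "agent": {"source": source, "target": target},
        # ASSUMPTION: ID pattern inferred from the example handoff_id.
        "handoff_id": f"HO-{source}-{target}-{now.strftime('%Y%m%dT%H%M%S')}",
        "gate_results": gate_results,
        "trust_level": trust_level,
        "human_approval": None,
        "duration_seconds": duration_seconds,
    }


event = build_event(
    "WI-myproject-042", "IMPLEMENTATION", "TESTING", "gate_3_passed",
    "developer-agent", "qa-agent",
    {"lint": "pass", "unit_tests": "pass", "sast_basic": "pass", "ai_metadata": "pass"},
    trust_level=2, duration_seconds=3420,
    now=datetime(2026, 2, 23, 14, 30, 0, tzinfo=timezone.utc),
)
line = json.dumps(event)  # one JSON document per log line
```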
Quick Reference: Orchestration Decision Tree
Is the work item new?
├── YES → State: INTAKE → Route to product-agent (Stage 1)
└── NO → Is the current gate passed?
    ├── YES → Is there a next stage?
    │   ├── YES → Can parallel paths run?
    │   │   ├── YES → Launch parallel agents
    │   │   └── NO → Route to next stage's owner agent
    │   └── NO → State: DEPLOYED (terminal)
    └── NO → Is the rework limit reached?
        ├── YES → Escalate to human (Development Manager)
        └── NO → What type of failure?
            ├── Code bug → Route to developer-agent
            ├── Test gap → Route to qa-agent
            ├── Security finding → Route to developer-agent (remediate)
            ├── Compliance gap → Route to compliance-agent
            ├── Config error → Route to platform-agent
            └── Unclear → Escalate to human (Development Manager)
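The decision tree translates directly into nested conditionals. This is a sketch with hypothetical parameter names and failure-type keys. It returns a description of the next action and does not perform it.

```python
# Failure type -> rework agent, following the last branch of the tree.
# Keys are hypothetical identifiers for the tree's labels.
FAILURE_ROUTES = {
    "code_bug": "developer-agent",
    "test_gap": "qa-agent",
    "security_finding": "developer-agent",
    "compliance_gap": "compliance-agent",
    "config_error": "platform-agent",
}
ESCALATE = "escalate to Development Manager"


def next_action(is_new, gate_passed=False, has_next_stage=True,
                parallel_ok=False, rework_limit_reached=False,
                failure_type=None) -> str:
    """Walk the decision tree top to bottom and return the chosen action."""
    if is_new:
        return "route to product-agent"
    if gate_passed:
        if not has_next_stage:
            return "mark DEPLOYED"
        return "launch parallel agents" if parallel_ok else "route to next stage owner"
    if rework_limit_reached:
        return ESCALATE
    agent = FAILURE_ROUTES.get(failure_type)
    # Unclear or unlisted failures fall through to human escalation.
    return f"route to {agent}" if agent else ESCALATE
```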