PRD-STD-019: Agent Swarm Coordination
Standard ID: PRD-STD-019
Version: 1.0
Status: Active
Compliance Level: Level 4 (Managed)
Effective Date: April 2026
Last Reviewed: April 2026
This page is the normative source of requirements for governance of multi-agent swarm workflows.
For implementation support:
- Swarm patterns: Agent Swarm Patterns
- Tool guides: Kimi Code (primary swarm implementation)
- Related standards: PRD-STD-009: Autonomous Agent Governance
Use the Compliance Level metadata to sequence adoption with other PRD-STDs.
1. Purpose
This standard defines governance requirements for Agent Swarm workflows: parallel execution patterns in which a coordinating agent decomposes a complex task into sub-tasks that specialized sub-agents execute concurrently. Agent Swarm capabilities (exemplified by Kimi K2.5's 100-agent coordination) can substantially speed up parallelizable work, but they introduce distinct risks around accountability, state management, and error handling.
Agent Swarm enables:
- Parallel codebase refactoring across modules
- Concurrent test generation for multiple components
- Distributed code review across file sets
- Simultaneous documentation updates
These capabilities require governance frameworks addressing task decomposition, cross-agent coordination, and failure recovery.
2. Scope
This standard applies to:
- All AI workflows involving parallel agent execution (2+ agents)
- All coordinating agents that decompose tasks for sub-agents
- All sub-agents operating under swarm coordination
- All tools with swarm capabilities including Kimi Code Agent Swarm mode
This standard does NOT apply to:
- Single-agent workflows (covered by PRD-STD-009)
- Sequential multi-agent handoffs (covered by PRD-STD-009)
- Human-managed parallel tasks without agent coordination
3. Definitions
| Term | Definition |
|---|---|
| Agent Swarm | A coordinated group of AI agents working in parallel on decomposed sub-tasks |
| Orchestrator Agent | The coordinating agent responsible for task decomposition and result aggregation |
| Sub-Agent | An individual agent executing a specific sub-task within the swarm |
| Task Decomposition | The process of breaking a complex task into independent parallelizable sub-tasks |
| Swarm State | The collective state of all agents in the swarm including progress, intermediate results, and failures |
| Conflict Resolution | Mechanisms for handling competing or contradictory outputs from sub-agents |
| Checkpoint | A recoverable state in the swarm workflow enabling rollback or restart |
| Swarm Owner | The human accountable for swarm execution and outcomes |
4. Requirements
4.1 Governance and Accountability
REQ-019-01: Every swarm MUST have a designated Swarm Owner who is accountable for the swarm's execution and outcomes.
REQ-019-02: The orchestrator agent MUST log all task decomposition decisions including rationale for sub-task boundaries.
REQ-019-03: Sub-agent authorization MUST follow a hierarchical model — sub-agents operate within constraints defined by the orchestrator.
REQ-019-04: Swarm execution MUST be attributable to the Swarm Owner in audit trails (not anonymized behind agent IDs).
REQ-019-05: Swarm Owners SHOULD be senior engineers (Staff+) familiar with both the domain and agentic workflows.
REQ-019-06: Critical swarms SHOULD have a secondary observer agent monitoring for anomalous behavior.
4.2 Task Decomposition
REQ-019-07: Task decomposition MUST create sub-tasks with clear, non-overlapping scopes to prevent conflict.
REQ-019-08: Sub-tasks MUST declare their dependencies explicitly. A sub-task MUST NOT execute in parallel with any sub-task it depends on.
REQ-019-09: The orchestrator MUST validate that decomposed sub-tasks collectively cover the original requirement.
REQ-019-10: Task decomposition for production changes MUST be reviewable by humans before swarm execution.
REQ-019-11: Decomposition strategies SHOULD be documented and reusable (domain-based, component-based, data-based).
REQ-019-12: Sub-task granularity SHOULD balance parallelism against coordination overhead (typically 5-20 sub-tasks).
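The following non-normative sketch illustrates one way an orchestrator could check REQ-019-07 through REQ-019-09 before a plan goes to human review. The `SubTask` record, the idea of expressing scope as a set of file paths, and the function name `validate_decomposition` are assumptions made for illustration; they are not part of any tool referenced in this standard.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SubTask:
    """Hypothetical sub-task record: a name, the items it may touch, and its dependencies."""
    name: str
    scope: frozenset[str]
    depends_on: frozenset[str] = field(default_factory=frozenset)


def validate_decomposition(required: set[str], tasks: list[SubTask]) -> list[str]:
    """Return human-readable problems; an empty list means the plan passes these checks."""
    problems: list[str] = []
    owner: dict[str, str] = {}

    # REQ-019-07: every item in scope may be claimed by at most one sub-task.
    for task in tasks:
        for item in sorted(task.scope):
            if item in owner:
                problems.append(f"overlap: {item} claimed by {owner[item]} and {task.name}")
            else:
                owner[item] = task.name

    # REQ-019-09: the sub-tasks together must cover the original requirement.
    for item in sorted(required - set(owner)):
        problems.append(f"uncovered: {item}")

    # REQ-019-08: declared dependencies must refer to sub-tasks in the plan.
    names = {task.name for task in tasks}
    for task in tasks:
        for dep in sorted(task.depends_on - names):
            problems.append(f"unknown dependency: {task.name} -> {dep}")

    return problems
```

A plan that returns a non-empty list would be corrected before it is submitted for the review required by REQ-019-10.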
4.3 Cross-Agent Coordination
REQ-019-13: All inter-agent communication MUST be logged including message content, sender, receiver, and timestamp.
REQ-019-14: Shared state between agents MUST be versioned with conflict detection (last-write-wins is insufficient).
REQ-019-15: Agents MUST handle messages from other agents as untrusted input — validate before acting.
REQ-019-16: Swarm termination MUST be orderly: sub-agents that have not finished MUST be signaled and given an opportunity to clean up.
REQ-019-17: Shared state SHOULD be minimized — prefer immutable data passing over shared mutable state.
REQ-019-18: Agent communication SHOULD use structured formats (JSON, protobuf) with schema validation.
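As a non-normative illustration of REQ-019-14, the sketch below shows versioned shared state with conflict detection via compare-and-set: a write succeeds only if the writer saw the latest version. The class names `VersionedStore` and `VersionConflict` are hypothetical, and an in-memory dictionary stands in for whatever state store a team actually uses.

```python
import threading
from typing import Any


class VersionConflict(Exception):
    """Raised when a write is based on a stale version of the key."""


class VersionedStore:
    """In-memory stand-in for a shared state store with optimistic concurrency."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._data: dict[str, tuple[int, Any]] = {}

    def read(self, key: str) -> tuple[int, Any]:
        """Return (version, value); a key that was never written is version 0."""
        with self._lock:
            return self._data.get(key, (0, None))

    def write(self, key: str, value: Any, expected_version: int) -> int:
        """Store value only if the caller read the current version; return the new version."""
        with self._lock:
            current, _ = self._data.get(key, (0, None))
            if current != expected_version:
                # Unlike last-write-wins, the stale writer is told about the conflict.
                raise VersionConflict(
                    f"{key}: expected v{expected_version}, found v{current}"
                )
            self._data[key] = (current + 1, value)
            return current + 1
```

An agent that receives `VersionConflict` would re-read the key and either retry or escalate under the conflict resolution rules in section 4.5.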
4.4 Error Handling and Recovery
REQ-019-19: Swarm workflows MUST implement checkpointing by capturing a recoverable state after every N completed sub-tasks or at a fixed time interval.
REQ-019-20: Sub-agent failures MUST NOT cascade: the orchestrator MUST isolate each failure and continue executing sub-tasks that do not depend on the failed one.
REQ-019-21: Partial swarm failures MUST trigger human notification with clear indication of completed vs. failed sub-tasks.
REQ-019-22: Swarm recovery MUST support: (a) retrying failed sub-agents, (b) rolling back to a checkpoint, and (c) manual intervention.
REQ-019-23: Idempotent sub-tasks SHOULD be preferred because they can be retried safely without side effects.
REQ-019-24: Sub-agent timeouts SHOULD be configurable per task type to prevent indefinite blocking.
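The failure-isolation and notification requirements (REQ-019-20 and REQ-019-21) can be sketched as follows. This is a minimal, non-normative example: it assumes dependencies are available as a mapping from each sub-task to the set of sub-tasks it depends on, and the function names `blocked_by_failures` and `triage` are hypothetical.

```python
def blocked_by_failures(depends_on: dict[str, set[str]], failed: set[str]) -> set[str]:
    """Return every sub-task that depends, directly or transitively, on a failed one."""
    blocked: set[str] = set()
    changed = True
    while changed:
        changed = False
        for task, deps in depends_on.items():
            if task in failed or task in blocked:
                continue
            if deps & (failed | blocked):
                blocked.add(task)
                changed = True
    return blocked


def triage(
    depends_on: dict[str, set[str]], failed: set[str], completed: set[str]
) -> dict[str, list[str]]:
    """Build the completed / failed / blocked / continue summary for the Swarm Owner."""
    blocked = blocked_by_failures(depends_on, failed)
    runnable = set(depends_on) - failed - blocked - completed
    return {
        "completed": sorted(completed),
        "failed": sorted(failed),
        "blocked": sorted(blocked),
        "continue": sorted(runnable),
    }
```

Only the sub-tasks under `blocked` are held back; everything under `continue` keeps running while the Swarm Owner decides whether to retry, roll back, or intervene.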
4.5 Conflict Resolution
REQ-019-25: Conflicting outputs from sub-agents (e.g., different solutions to the same pattern) MUST be flagged for human resolution.
REQ-019-26: Automated conflict resolution rules MUST be documented and approved — no opaque arbitration.
REQ-019-27: Code-level conflicts (merge conflicts from parallel edits) MUST follow standard version control resolution workflows.
REQ-019-28: Conflicting approaches SHOULD be preserved as alternatives for human review rather than auto-selecting.
REQ-019-29: Swarm configurations SHOULD specify conflict resolution strategy per task type.
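One possible shape for conflict detection under REQ-019-25 and REQ-019-28 is sketched below. It is non-normative and assumes each sub-agent output can be reduced to an `(agent_id, target, proposed_content)` tuple, which is a simplification chosen for illustration; `find_conflicts` is a hypothetical name.

```python
from collections import defaultdict


def find_conflicts(
    outputs: list[tuple[str, str, str]]
) -> dict[str, dict[str, list[str]]]:
    """Group proposals by target and return targets with more than one distinct proposal.

    Every alternative is kept, together with the agents that proposed it,
    so a human reviewer can compare them instead of one being auto-selected.
    """
    by_target: dict[str, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list))
    for agent_id, target, content in outputs:
        by_target[target][content].append(agent_id)

    return {
        target: dict(alternatives)
        for target, alternatives in by_target.items()
        if len(alternatives) > 1
    }
```

Targets where all agents agree are omitted from the result; anything returned would be routed to the human resolution workflow.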
4.6 Audit and Observability
REQ-019-30: Complete swarm execution logs MUST include: (a) decomposition plan, (b) sub-agent assignments, (c) all inter-agent messages, (d) final aggregated output.
REQ-019-31: Swarm metrics MUST be captured: execution time, token usage per agent, success/failure rates, conflict count.
REQ-019-32: Swarm audit trails MUST be retained for the same duration as the code they produce.
REQ-019-33: Real-time swarm dashboards SHOULD display progress, current agent states, and any blocked tasks.
REQ-019-34: Post-execution analysis SHOULD identify optimization opportunities for future swarms.
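As a non-normative illustration of the metrics listed in REQ-019-31, the sketch below aggregates per-agent records into a swarm summary. The `AgentRun` record and its field names are assumptions for this example, and the longest single agent runtime is used only as a rough proxy for execution time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRun:
    """Hypothetical per-agent record captured at the end of a sub-task."""
    agent_id: str
    seconds: float
    tokens: int
    succeeded: bool


def summarize(runs: list[AgentRun], conflict_count: int) -> dict:
    """Aggregate execution time, token usage per agent, success rate, and conflict count."""
    total = len(runs)
    succeeded = sum(1 for run in runs if run.succeeded)
    return {
        "agents": total,
        # Proxy only: assumes agents started together; a real system would record wall-clock time.
        "longest_agent_seconds": max((run.seconds for run in runs), default=0.0),
        "tokens_per_agent": {run.agent_id: run.tokens for run in runs},
        "total_tokens": sum(run.tokens for run in runs),
        "success_rate": succeeded / total if total else 0.0,
        "conflict_count": conflict_count,
    }
```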
4.7 Resource Management
REQ-019-35: Swarm execution MUST respect token budget limits — orchestrator MUST track aggregate usage.
REQ-019-36: Parallel agent count MUST be limited to prevent resource exhaustion (configurable maximum).
REQ-019-37: Long-running swarms (those exceeding a configured duration threshold) MUST require explicit approval to continue.
REQ-019-38: Cost estimation SHOULD be provided before swarm execution based on decomposition plan.
REQ-019-39: Resource usage SHOULD be optimized — decompose to maximize parallel efficiency without excessive coordination overhead.
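A minimal, non-normative sketch of REQ-019-35 and REQ-019-36 follows. It assumes the orchestrator reserves tokens before dispatching work and runs sub-tasks in batches no larger than the configured agent maximum. `TokenBudget`, `BudgetExceeded`, and `batches` are hypothetical names.

```python
class BudgetExceeded(Exception):
    """Raised when a reservation would push aggregate usage past the limit."""


class TokenBudget:
    """Tracks aggregate token usage across all sub-agents in one swarm."""

    def __init__(self, limit: int) -> None:
        if limit <= 0:
            raise ValueError("limit must be positive")
        self.limit = limit
        self.used_by_agent: dict[str, int] = {}

    @property
    def used(self) -> int:
        return sum(self.used_by_agent.values())

    def reserve(self, agent_id: str, tokens: int) -> int:
        """Reserve tokens for an agent and return the remaining budget."""
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        if self.used + tokens > self.limit:
            raise BudgetExceeded(
                f"{agent_id} requested {tokens}, only {self.limit - self.used} left"
            )
        self.used_by_agent[agent_id] = self.used_by_agent.get(agent_id, 0) + tokens
        return self.limit - self.used


def batches(tasks: list[str], max_agents: int) -> list[list[str]]:
    """Split sub-tasks into groups so that at most max_agents run in parallel."""
    if max_agents <= 0:
        raise ValueError("max_agents must be positive")
    return [tasks[i:i + max_agents] for i in range(0, len(tasks), max_agents)]
```

Rejecting the reservation before dispatch, rather than after the tokens are spent, is a design choice of this sketch; a real orchestrator would also reconcile reservations against actual usage.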
5. Compliance Levels
| Level | Requirement |
|---|---|
| Level 1 (Uncontrolled) | No swarm governance. Parallel agents used ad-hoc without coordination. |
| Level 2 (Exploratory) | Basic swarm logging. REQ-019-01, REQ-019-30 implemented. |
| Level 3 (Defined) | REQ-019-01 through REQ-019-10 implemented. Task decomposition reviewable. |
| Level 4 (Managed) | All MUST requirements implemented, including checkpointing, conflict resolution, and resource limits. |
| Level 5 (AI-First) | All Level 4 + automated optimization, predictive conflict detection, self-healing swarms. |
6. Tool-Specific Guidance
Kimi Code (Agent Swarm)
Capabilities:
- Up to 100 parallel sub-agents
- Automatic task decomposition
- Self-directed coordination
- Up to 4.5x speedup on parallelizable tasks (vendor-reported)
Governance Configuration:
# Enable comprehensive logging
export KIMI_SWARM_AUDIT=true
export KIMI_SWARM_LOG_LEVEL=debug
# Set resource limits
export KIMI_SWARM_MAX_AGENTS=20
export KIMI_SWARM_TOKEN_BUDGET=1000000
export KIMI_SWARM_TIMEOUT=3600
# Require checkpoint approval
export KIMI_SWARM_CHECKPOINT_APPROVAL=true
Execution with Governance:
# Document swarm owner
kimi swarm --owner "senior-dev@company.com" \
    --decomposition-review \
    --checkpoint-interval 5 \
    --mode swarm \
    "Migrate codebase to TypeScript"
Best Practices:
- Start with smaller swarms (5-10 agents) before scaling to 100
- Use checkpoint approval for production changes
- Review decomposition plan before execution
- Monitor token usage — swarms can consume budget rapidly
Custom Swarm Implementations
For teams building custom swarm orchestration:
Required Components:
- Orchestrator Service — Decomposition and coordination
- Agent Registry — Sub-agent lifecycle management
- Message Bus — Inter-agent communication with logging
- State Store — Versioned shared state with conflict detection
- Checkpoint Manager — Recovery point creation and restoration
- Monitoring Dashboard — Real-time visibility and alerting
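To make the Message Bus component more concrete, the non-normative sketch below combines logging (REQ-019-13), treating messages as untrusted input (REQ-019-15), and structured formats (REQ-019-18). The two-field schema in `REQUIRED_FIELDS`, the `LogEntry` record, and the in-memory inboxes are assumptions for illustration only; a production bus would use a real broker and a fuller schema.

```python
import json
import time
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical minimal schema: every message needs a string type and a dict payload.
REQUIRED_FIELDS = {"type": str, "payload": dict}


@dataclass(frozen=True)
class LogEntry:
    timestamp: float
    sender: str
    receiver: str
    content: str
    accepted: bool


class MessageBus:
    """In-memory bus that logs every message and delivers only valid ones."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self.log: list[LogEntry] = []
        self._inboxes: dict[str, list[dict]] = {}
        self._clock = clock

    def send(self, sender: str, receiver: str, raw: str) -> bool:
        """Log the raw message, then deliver it only if it passes validation."""
        message = self._validate(raw)
        self.log.append(LogEntry(self._clock(), sender, receiver, raw, message is not None))
        if message is None:
            return False
        self._inboxes.setdefault(receiver, []).append(message)
        return True

    def receive(self, receiver: str) -> list[dict]:
        """Return and clear the receiver's pending messages."""
        return self._inboxes.pop(receiver, [])

    @staticmethod
    def _validate(raw: str) -> Optional[dict]:
        try:
            message = json.loads(raw)
        except json.JSONDecodeError:
            return None
        if not isinstance(message, dict):
            return None
        for name, kind in REQUIRED_FIELDS.items():
            if not isinstance(message.get(name), kind):
                return None
        return message
```

Rejected messages are still logged, so the audit trail shows what was attempted as well as what was delivered.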
7. Swarm Patterns
Pattern 1: Domain-Based Decomposition
Use Case: Large codebase refactoring
Orchestrator
├── Frontend Agent (React components)
├── Backend Agent (API routes)
├── Database Agent (schema migrations)
└── Test Agent (test updates)
Governance: Each domain agent has specialized context and constraints.
Pattern 2: Component-Based Decomposition
Use Case: Microservice architecture changes
Orchestrator
├── Service A Agent
├── Service B Agent
├── Service C Agent
└── Integration Test Agent (depends on A, B, C)
Governance: Dependency declarations ensure proper execution order.
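The execution order implied by this pattern can be derived from the dependency declarations. The non-normative sketch below uses Python's standard-library `graphlib.TopologicalSorter` to group agents into waves, where every agent in a wave can run in parallel. The function name `execution_waves` and the agent names in the usage note are illustrative.

```python
from graphlib import TopologicalSorter


def execution_waves(depends_on: dict[str, set[str]]) -> list[list[str]]:
    """Group sub-tasks into waves; each wave starts only after the previous one finishes."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are circular

    waves: list[list[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

For the diagram above, mapping the integration test agent to the three service agents yields two waves: the three services first, then the integration tests.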
Pattern 3: Data-Based Decomposition
Use Case: Batch processing across datasets
Orchestrator
├── Shard 1 Agent
├── Shard 2 Agent
├── Shard 3 Agent
└── Aggregation Agent (depends on all shards)
Governance: Idempotency required for safe retry of data agents.
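The idempotency requirement for data agents can be illustrated with the non-normative sketch below, which keys results by shard so that a retry or a duplicate dispatch never processes a finished shard twice. It assumes the `process` callable has no external side effects of its own; `run_shard_idempotently` and the `results` dictionary are hypothetical stand-ins for a durable result store.

```python
from typing import Callable


def run_shard_idempotently(
    shard_id: str,
    process: Callable[[str], int],
    results: dict[str, int],
    max_attempts: int = 3,
) -> int:
    """Process a shard with retries, returning the stored result if it already exists."""
    if shard_id in results:
        # Already done: a repeated call is a no-op that returns the same answer.
        return results[shard_id]

    last_error: Exception | None = None
    for _ in range(max_attempts):
        try:
            value = process(shard_id)
        except Exception as error:  # sketch only; real code would catch narrower errors
            last_error = error
            continue
        results[shard_id] = value
        return value

    raise RuntimeError(
        f"shard {shard_id} failed after {max_attempts} attempts"
    ) from last_error
```

Because the result is written only after a successful run, the aggregation agent can safely wait on `results` without seeing partial output from a failed attempt.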
8. Implementation Checklist
Immediate (Week 1)
- Identify current or planned swarm use cases
- Designate Swarm Owners for each use case
- Enable swarm audit logging in tools
- Document task decomposition review process
Short Term (Month 1)
- Implement checkpoint and recovery mechanisms
- Configure resource limits (token budgets, agent counts)
- Establish conflict resolution workflows
- Create swarm monitoring dashboard
Medium Term (Quarter 1)
- Develop reusable decomposition patterns
- Integrate swarm metrics into engineering KPIs
- Automate cost estimation for swarm tasks
- Conduct tabletop exercises for swarm failure scenarios
9. Anti-Patterns
Anti-Pattern 1: The Uncoordinated Stampede
Problem: Multiple agents work on the same files without coordination.
Solution: Define clear task boundaries with an ownership mapping.
Anti-Pattern 2: The Cascade Failure
Problem: One failed agent causes the entire swarm to fail.
Solution: Isolate dependencies properly and contain failures.
Anti-Pattern 3: The Silent Conflict
Problem: Agents produce conflicting outputs that are auto-merged incorrectly.
Solution: Detect conflicts and require human review.
Anti-Pattern 4: The Runaway Swarm
Problem: The swarm consumes excessive tokens or runs indefinitely.
Solution: Enforce budget limits, timeouts, and checkpoint approvals.
10. Related Standards
- PRD-STD-009: Autonomous Agent Governance — Base agent governance
- Agent Swarm Patterns — Practical implementation patterns
- PRD-STD-001: Prompt Engineering — Prompt standards for agents
11. References
| Reference | Description |
|---|---|
| Kimi K2.5 Technical Report | Moonshot AI Agent Swarm architecture |
| Multi-Agent Reinforcement Learning | Academic foundation for swarm coordination |
| Distributed Systems Theory | Consensus and consistency patterns |