PRD-STD-019: Agent Swarm Coordination

Standard ID: PRD-STD-019
Version: 1.0
Status: Active
Compliance Level: Level 4 (Managed)
Effective Date: April 2026
Last Reviewed: April 2026

How To Use This Standard

This page is the normative source of requirements for governance of multi-agent swarm workflows.

For implementation support:

Swarm patterns: Agent Swarm Patterns
Tool guides: Kimi Code (primary swarm implementation)
Related standards: PRD-STD-009: Agent Governance

Use the Compliance Level metadata to sequence adoption with other PRD-STDs.

1. Purpose

This standard defines governance requirements for Agent Swarm workflows: parallel execution patterns in which a coordinating agent decomposes a complex task into sub-tasks that specialized sub-agents execute concurrently. Agent Swarm capabilities (exemplified by Kimi K2.5's 100-agent coordination) can substantially speed up parallelizable work, but they introduce distinct risks around accountability, state management, and error handling.

Agent Swarm enables:

Parallel codebase refactoring across modules
Concurrent test generation for multiple components
Distributed code review across file sets
Simultaneous documentation updates

These capabilities require governance frameworks addressing task decomposition, cross-agent coordination, and failure recovery.

2. Scope

This standard applies to:

All AI workflows involving parallel agent execution (2+ agents)
All coordinating agents that decompose tasks for sub-agents
All sub-agents operating under swarm coordination
All tools with swarm capabilities including Kimi Code Agent Swarm mode

This standard does NOT apply to:

Single-agent workflows (covered by PRD-STD-009)
Sequential multi-agent handoffs (covered by PRD-STD-009)
Human-managed parallel tasks without agent coordination

3. Definitions

Term	Definition
Agent Swarm	A coordinated group of AI agents working in parallel on decomposed sub-tasks
Orchestrator Agent	The coordinating agent responsible for task decomposition and result aggregation
Sub-Agent	An individual agent executing a specific sub-task within the swarm
Task Decomposition	The process of breaking a complex task into independent parallelizable sub-tasks
Swarm State	The collective state of all agents in the swarm including progress, intermediate results, and failures
Conflict Resolution	Mechanisms for handling competing or contradictory outputs from sub-agents
Checkpoint	A recoverable state in the swarm workflow enabling rollback or restart
Swarm Owner	The human accountable for swarm execution and outcomes

4. Requirements

4.1 Governance and Accountability

MANDATORY

REQ-019-01: Every swarm MUST have a designated Swarm Owner who is accountable for the swarm's execution and outcomes.

REQ-019-02: The orchestrator agent MUST log all task decomposition decisions including rationale for sub-task boundaries.

REQ-019-03: Sub-agent authorization MUST follow a hierarchical model — sub-agents operate within constraints defined by the orchestrator.

REQ-019-04: Swarm execution MUST be attributable to the Swarm Owner in audit trails (not anonymized behind agent IDs).

RECOMMENDED

REQ-019-05: Swarm Owners SHOULD be senior engineers (Staff+) familiar with both the domain and agentic workflows.

REQ-019-06: Critical swarms SHOULD have a secondary observer agent monitoring for anomalous behavior.
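
The hierarchical model in REQ-019-03 and the attribution rule in REQ-019-04 can be illustrated with a minimal sketch. The `Grant` record, the `delegate` function, and the permission strings are hypothetical names chosen for illustration; they are not taken from any tool's API. The idea shown is only that a sub-agent's permissions are a subset of the orchestrator's and that the resulting record names the human Swarm Owner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    """Hypothetical audit record: names the human owner, not only the agent ID."""
    swarm_owner: str
    agent_id: str
    permissions: frozenset[str]


def delegate(swarm_owner: str, orchestrator_permissions: frozenset[str],
             agent_id: str, requested: set[str]) -> Grant:
    """Grant a sub-agent only permissions the orchestrator itself holds."""
    excess = requested - orchestrator_permissions
    if excess:
        raise PermissionError(
            f"{agent_id} requested {sorted(excess)} beyond orchestrator scope"
        )
    return Grant(swarm_owner, agent_id, frozenset(requested))


# Illustrative permission names; a real deployment would define its own.
orchestrator = frozenset({"read:repo", "write:src", "write:tests"})
grant = delegate("senior-dev@company.com", orchestrator,
                 "test-agent", {"read:repo", "write:tests"})
print(grant.swarm_owner, sorted(grant.permissions))
```

A request for anything outside the orchestrator's own set raises `PermissionError`, so escalation has to go back through the Swarm Owner rather than through another agent.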

4.2 Task Decomposition

MANDATORY

REQ-019-07: Task decomposition MUST create sub-tasks with clear, non-overlapping scopes to prevent conflict.

REQ-019-08: Sub-tasks MUST declare their dependencies explicitly — dependent tasks MUST NOT execute in parallel.

REQ-019-09: The orchestrator MUST validate that decomposed sub-tasks collectively cover the original requirement.

REQ-019-10: Task decomposition for production changes MUST be reviewable by humans before swarm execution.

RECOMMENDED

REQ-019-11: Decomposition strategies SHOULD be documented and reusable (domain-based, component-based, data-based).

REQ-019-12: Sub-task granularity SHOULD balance parallelism against coordination overhead (typically 5-20 sub-tasks).
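
REQ-019-07 and REQ-019-08 lend themselves to a mechanical check before execution. The sketch below assumes sub-task scopes are expressed as sets of file paths and uses a hypothetical `SubTask` record; neither is mandated by this standard. It uses Python's standard `graphlib` module to group sub-tasks into waves so that nothing runs alongside a task it depends on.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter
from itertools import combinations


@dataclass(frozen=True)
class SubTask:
    """Hypothetical sub-task: a name, the files it may touch, and its dependencies."""
    name: str
    scope: frozenset[str]
    depends_on: frozenset[str] = frozenset()


def overlapping_scopes(tasks: list[SubTask]) -> list[tuple[str, str, set[str]]]:
    """Return every pair of sub-tasks whose scopes share a file (REQ-019-07)."""
    clashes = []
    for a, b in combinations(tasks, 2):
        shared = a.scope & b.scope
        if shared:
            clashes.append((a.name, b.name, set(shared)))
    return clashes


def execution_waves(tasks: list[SubTask]) -> list[list[str]]:
    """Group sub-tasks into waves; a task only runs after its dependencies (REQ-019-08)."""
    sorter = TopologicalSorter({t.name: t.depends_on for t in tasks})
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


tasks = [
    SubTask("service-a", frozenset({"a/api.py"})),
    SubTask("service-b", frozenset({"b/api.py"})),
    SubTask("integration-tests", frozenset({"tests/test_e2e.py"}),
            frozenset({"service-a", "service-b"})),
]
print(overlapping_scopes(tasks))
print(execution_waves(tasks))
```

Here the two service tasks land in the first wave and the integration tests in the second. A non-empty result from `overlapping_scopes` would be a reason to revise the decomposition before any human review under REQ-019-10.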

4.3 Cross-Agent Coordination

MANDATORY

REQ-019-13: All inter-agent communication MUST be logged including message content, sender, receiver, and timestamp.

REQ-019-14: Shared state between agents MUST be versioned with conflict detection (last-write-wins is insufficient).

REQ-019-15: Agents MUST handle messages from other agents as untrusted input — validate before acting.

REQ-019-16: Swarm termination MUST be orderly: sub-agents that have not finished MUST be signaled and given an opportunity to clean up.

RECOMMENDED

REQ-019-17: Shared state SHOULD be minimized — prefer immutable data passing over shared mutable state.

REQ-019-18: Agent communication SHOULD use structured formats (JSON, protobuf) with schema validation.
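
To make the contrast with last-write-wins in REQ-019-14 concrete, the following is a minimal sketch of versioned state with compare-and-set semantics. `VersionedStore` and `VersionConflict` are hypothetical names, the store is in-memory, and it is not thread-safe; a production State Store would need durable storage and atomic writes.

```python
from dataclasses import dataclass


class VersionConflict(Exception):
    """Raised when a writer's expected version is stale."""


@dataclass
class Entry:
    value: object
    version: int


class VersionedStore:
    """Hypothetical in-memory store using compare-and-set instead of last-write-wins."""

    def __init__(self) -> None:
        self._entries: dict[str, Entry] = {}

    def read(self, key: str) -> Entry:
        # Unknown keys read as version 0, so the first write expects version 0.
        return self._entries.get(key, Entry(None, 0))

    def write(self, key: str, value: object, expected_version: int) -> int:
        current = self.read(key)
        if current.version != expected_version:
            raise VersionConflict(
                f"{key}: expected v{expected_version}, found v{current.version}"
            )
        self._entries[key] = Entry(value, current.version + 1)
        return current.version + 1


store = VersionedStore()
seen_by_a = store.read("plan").version  # agent A reads v0
seen_by_b = store.read("plan").version  # agent B also reads v0
store.write("plan", "A's update", seen_by_a)  # accepted, key is now v1
try:
    store.write("plan", "B's update", seen_by_b)  # rejected: B still expects v0
except VersionConflict as exc:
    print("conflict detected:", exc)
```

Under last-write-wins, agent B's write would silently replace agent A's. Here the stale write is rejected, which gives the orchestrator a concrete event to log and route into conflict resolution.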

4.4 Error Handling and Recovery

MANDATORY

REQ-019-19: Swarm workflows MUST implement checkpointing, recording a recoverable state every N completed sub-tasks or at a fixed time interval.

REQ-019-20: Sub-agent failures MUST NOT cascade: the orchestrator MUST isolate each failure and continue all sub-tasks that do not depend on the failed one.

REQ-019-21: Partial swarm failures MUST trigger human notification with clear indication of completed vs. failed sub-tasks.

REQ-019-22: Swarm recovery MUST support (a) retrying failed sub-agents, (b) rolling back to a checkpoint, and (c) manual intervention.

RECOMMENDED

REQ-019-23: Idempotent sub-tasks SHOULD be preferred because they can be retried safely without duplicating side effects.

REQ-019-24: Sub-agent timeouts SHOULD be configurable per task type to prevent indefinite blocking.
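
Failure isolation (REQ-019-20) and the completed-versus-failed report (REQ-019-21) can be derived from the declared dependency graph. The sketch below assumes dependencies are available as a plain mapping from sub-task name to the names it depends on; the function names and the report shape are illustrative only.

```python
def blocked_by_failure(dependencies: dict[str, set[str]], failed: set[str]) -> set[str]:
    """Return sub-tasks that depend, directly or transitively, on a failed sub-task."""
    blocked: set[str] = set()
    changed = True
    while changed:
        changed = False
        for task, deps in dependencies.items():
            if task in failed or task in blocked:
                continue
            if deps & (failed | blocked):
                blocked.add(task)
                changed = True
    return blocked


def partial_failure_report(dependencies: dict[str, set[str]],
                           completed: set[str], failed: set[str]) -> dict[str, list[str]]:
    """Summarize swarm state for the human notification."""
    blocked = blocked_by_failure(dependencies, failed)
    runnable = set(dependencies) - completed - failed - blocked
    return {
        "completed": sorted(completed),
        "failed": sorted(failed),
        "blocked": sorted(blocked),
        "still_runnable": sorted(runnable),
    }


dependencies = {
    "shard-1": set(),
    "shard-2": set(),
    "shard-3": set(),
    "aggregate": {"shard-1", "shard-2", "shard-3"},
    "docs": set(),
}
report = partial_failure_report(dependencies, completed={"shard-1"}, failed={"shard-2"})
print(report)
```

In this example only `aggregate` is held back by the failure of `shard-2`; `shard-3` and `docs` remain runnable, which is the behavior the requirement asks the orchestrator to preserve.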

4.5 Conflict Resolution

MANDATORY

REQ-019-25: Conflicting outputs from sub-agents (e.g., different solutions to the same pattern) MUST be flagged for human resolution.

REQ-019-26: Automated conflict resolution rules MUST be documented and approved — no opaque arbitration.

REQ-019-27: Code-level conflicts (merge conflicts from parallel edits) MUST follow standard version control resolution workflows.

RECOMMENDED

REQ-019-28: Conflicting approaches SHOULD be preserved as alternatives for human review rather than auto-selecting.

REQ-019-29: Swarm configurations SHOULD specify conflict resolution strategy per task type.
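
A simple way to satisfy REQ-019-25 and REQ-019-28 together is to group sub-agent outputs by what they apply to and surface every group that disagrees, keeping all alternatives. The `Proposal` record and its `target` field are assumptions made for this sketch; real outputs would likely need a richer equality check than string comparison.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    """Hypothetical sub-agent output: who produced it, what it targets, and the content."""
    agent: str
    target: str
    content: str


def find_conflicts(proposals: list[Proposal]) -> dict[str, list[Proposal]]:
    """Return targets with more than one distinct proposal, preserving every alternative."""
    by_target: dict[str, list[Proposal]] = defaultdict(list)
    for proposal in proposals:
        by_target[proposal.target].append(proposal)
    return {
        target: group
        for target, group in by_target.items()
        if len({p.content for p in group}) > 1
    }


proposals = [
    Proposal("agent-1", "retry-helper", "use exponential backoff"),
    Proposal("agent-2", "retry-helper", "use a fixed delay"),
    Proposal("agent-3", "log-format", "use structured JSON logs"),
    Proposal("agent-4", "log-format", "use structured JSON logs"),
]
for target, alternatives in find_conflicts(proposals).items():
    print(f"needs human review: {target}")
    for alt in alternatives:
        print(f"  {alt.agent}: {alt.content}")
```

Agents that independently reach the same answer are not flagged; only `retry-helper` is reported, with both approaches kept for the reviewer rather than one being auto-selected.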

4.6 Audit and Observability

MANDATORY

REQ-019-30: Complete swarm execution logs MUST include: (a) decomposition plan, (b) sub-agent assignments, (c) all inter-agent messages, (d) final aggregated output.

REQ-019-31: Swarm metrics MUST be captured: execution time, token usage per agent, success/failure rates, conflict count.

REQ-019-32: Swarm audit trails MUST be retained for the same duration as the code they produce.

RECOMMENDED

REQ-019-33: Real-time swarm dashboards SHOULD display progress, current agent states, and any blocked tasks.

REQ-019-34: Post-execution analysis SHOULD identify optimization opportunities for future swarms.
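
The metrics listed in REQ-019-31 can be rolled up from per-agent records. The `AgentRun` structure and the summary keys below are hypothetical. The sketch also assumes each record carries the agent's own elapsed time, so it reports the longest single run rather than true wall-clock time for the swarm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRun:
    """Hypothetical per-agent record captured at the end of a sub-task."""
    agent: str
    tokens: int
    seconds: float
    succeeded: bool


def summarize(runs: list[AgentRun], conflict_count: int) -> dict[str, object]:
    """Aggregate token usage, timing, success rate, and conflict count for one swarm."""
    total = len(runs)
    succeeded = sum(run.succeeded for run in runs)
    return {
        "tokens_per_agent": {run.agent: run.tokens for run in runs},
        "total_tokens": sum(run.tokens for run in runs),
        "longest_agent_seconds": max((run.seconds for run in runs), default=0.0),
        "success_rate": succeeded / total if total else 0.0,
        "conflict_count": conflict_count,
    }


runs = [
    AgentRun("frontend", 1200, 30.0, True),
    AgentRun("backend", 800, 42.5, True),
    AgentRun("database", 500, 12.0, False),
    AgentRun("tests", 1500, 20.0, True),
]
summary = summarize(runs, conflict_count=1)
print(summary)
```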

4.7 Resource Management

MANDATORY

REQ-019-35: Swarm execution MUST respect token budget limits — orchestrator MUST track aggregate usage.

REQ-019-36: Parallel agent count MUST be limited to prevent resource exhaustion (configurable maximum).

REQ-019-37: Long-running swarms (those exceeding a configured duration threshold) MUST require explicit approval to continue.

RECOMMENDED

REQ-019-38: Cost estimation SHOULD be provided before swarm execution based on decomposition plan.

REQ-019-39: Resource usage SHOULD be optimized — decompose to maximize parallel efficiency without excessive coordination overhead.
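
REQ-019-35 and REQ-019-36 can be combined in a small scheduler: a semaphore caps how many sub-agents run at once, and a budget object refuses to launch work that would exceed the aggregate token limit. This is a sketch under assumptions. `TokenBudget`, `run_swarm`, and the per-agent token estimates are hypothetical, and `asyncio.sleep(0)` stands in for the real sub-agent call.

```python
import asyncio


class BudgetExceeded(Exception):
    """Raised when launching a sub-agent would exceed the swarm's token budget."""


class TokenBudget:
    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.reserved = 0

    def reserve(self, agent: str, tokens: int) -> None:
        if self.reserved + tokens > self.limit:
            raise BudgetExceeded(
                f"{agent} needs {tokens}, only {self.limit - self.reserved} left"
            )
        self.reserved += tokens


async def run_swarm(estimates: dict[str, int], max_agents: int, budget: TokenBudget):
    slots = asyncio.Semaphore(max_agents)  # configurable maximum of parallel agents
    running = 0
    peak = 0

    async def run_one(agent: str, tokens: int) -> str:
        nonlocal running, peak
        async with slots:
            budget.reserve(agent, tokens)  # refuse before any work starts
            running += 1
            peak = max(peak, running)
            try:
                await asyncio.sleep(0)  # placeholder for the real sub-agent call
                return agent
            finally:
                running -= 1

    results = await asyncio.gather(
        *(run_one(agent, tokens) for agent, tokens in estimates.items()),
        return_exceptions=True,
    )
    return peak, results


budget = TokenBudget(limit=1000)
peak, results = asyncio.run(
    run_swarm({"a": 400, "b": 400, "c": 400}, max_agents=2, budget=budget)
)
refused = [r for r in results if isinstance(r, BudgetExceeded)]
print(f"peak concurrency={peak}, reserved={budget.reserved}, refused={len(refused)}")
```

Reserving tokens before launch, rather than charging afterwards, means a refused sub-agent costs nothing. Because `gather` is called with `return_exceptions=True`, the refusal is returned alongside the successful results instead of aborting the other sub-agents.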

5. Compliance Levels

Level	Requirement
Level 1 (Uncontrolled)	No swarm governance. Parallel agents used ad-hoc without coordination.
Level 2 (Exploratory)	Basic swarm logging. REQ-019-01, REQ-019-30 implemented.
Level 3 (Defined)	REQ-019-01 through REQ-019-10 implemented. Task decomposition reviewable.
Level 4 (Managed)	All MANDATORY requirements. Checkpointing, conflict resolution, resource limits.
Level 5 (AI-First)	All Level 4 + automated optimization, predictive conflict detection, self-healing swarms.

6. Tool-Specific Guidance

Kimi Code (Agent Swarm)

Capabilities:

Up to 100 parallel sub-agents
Automatic task decomposition
Self-directed coordination
Up to 4.5x speedup on parallelizable tasks (vendor-reported)

Governance Configuration:

# Enable comprehensive logging
export KIMI_SWARM_AUDIT=true
export KIMI_SWARM_LOG_LEVEL=debug

# Set resource limits
export KIMI_SWARM_MAX_AGENTS=20
export KIMI_SWARM_TOKEN_BUDGET=1000000
export KIMI_SWARM_TIMEOUT=3600

# Require checkpoint approval
export KIMI_SWARM_CHECKPOINT_APPROVAL=true

Execution with Governance:

# Document swarm owner
kimi swarm --owner "senior-dev@company.com" \
           --decomposition-review \
           --checkpoint-interval 5 \
           --mode swarm \
           "Migrate codebase to TypeScript"

Best Practices:

Start with smaller swarms (5-10 agents) before scaling to 100
Use checkpoint approval for production changes
Review decomposition plan before execution
Monitor token usage — swarms can consume budget rapidly

Custom Swarm Implementations

For teams building custom swarm orchestration:

Required Components:

Orchestrator Service — Decomposition and coordination
Agent Registry — Sub-agent lifecycle management
Message Bus — Inter-agent communication with logging
State Store — Versioned shared state with conflict detection
Checkpoint Manager — Recovery point creation and restoration
Monitoring Dashboard — Real-time visibility and alerting
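
Of the components listed above, the Checkpoint Manager is perhaps the easiest to prototype. The sketch below assumes checkpoints are JSON files written every `interval` completed sub-tasks; the class name, file naming, and payload shape are all illustrative, and a real implementation would also capture intermediate results and write atomically.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


class CheckpointManager:
    """Hypothetical checkpoint manager: persists progress every `interval` sub-tasks."""

    def __init__(self, directory: Path, interval: int) -> None:
        self.directory = directory
        self.interval = interval
        self.completed: list[str] = []

    def record(self, sub_task: str) -> Optional[Path]:
        """Record a completed sub-task; return the checkpoint path if one was written."""
        self.completed.append(sub_task)
        if len(self.completed) % self.interval != 0:
            return None
        path = self.directory / f"checkpoint-{len(self.completed):04d}.json"
        path.write_text(json.dumps({"completed": self.completed}))
        return path

    @staticmethod
    def restore(path: Path) -> list[str]:
        """Load the list of completed sub-tasks from a checkpoint file."""
        return json.loads(path.read_text())["completed"]


with tempfile.TemporaryDirectory() as tmp:
    manager = CheckpointManager(Path(tmp), interval=2)
    written = [manager.record(name) for name in ["t1", "t2", "t3"]]
    last_checkpoint = next(p for p in reversed(written) if p is not None)
    restored = CheckpointManager.restore(last_checkpoint)

print(restored)
```

After three sub-tasks with an interval of two, the latest checkpoint holds the first two. On restart, the orchestrator would skip those and re-run only the remainder.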

7. Swarm Patterns

Pattern 1: Domain-Based Decomposition

Use Case: Large codebase refactoring

Orchestrator
├── Frontend Agent (React components)
├── Backend Agent (API routes)
├── Database Agent (schema migrations)
└── Test Agent (test updates)

Governance: Each domain agent has specialized context and constraints.

Pattern 2: Component-Based Decomposition

Use Case: Microservice architecture changes

Orchestrator
├── Service A Agent
├── Service B Agent
├── Service C Agent
└── Integration Test Agent (depends on A, B, C)

Governance: Dependency declarations ensure proper execution order.

Pattern 3: Data-Based Decomposition

Use Case: Batch processing across datasets

Orchestrator
├── Shard 1 Agent
├── Shard 2 Agent
├── Shard 3 Agent
└── Aggregation Agent (depends on all shards)

Governance: Idempotency required for safe retry of data agents.
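
The idempotency requirement for data agents can be met by recording each shard's result under its shard ID and returning the recorded result on retry. `ShardLedger` is a hypothetical in-memory illustration, and summing rows stands in for the real per-shard work.

```python
class ShardLedger:
    """Hypothetical ledger that makes shard processing safe to retry."""

    def __init__(self) -> None:
        self.results: dict[str, int] = {}
        self.executions = 0  # counts real executions, not retries

    def process(self, shard_id: str, rows: list[int]) -> int:
        if shard_id in self.results:
            # Retry of an already-processed shard: return the recorded result.
            return self.results[shard_id]
        self.executions += 1
        self.results[shard_id] = sum(rows)
        return self.results[shard_id]


ledger = ShardLedger()
first = ledger.process("shard-1", [1, 2, 3])
retry = ledger.process("shard-1", [1, 2, 3])  # e.g. orchestrator retries after a timeout
print(first, retry, ledger.executions)
```

Because the retry does not execute the work again, the Aggregation Agent sees each shard exactly once regardless of how many times the orchestrator re-dispatched it.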

8. Implementation Checklist

Immediate (Week 1)

Identify current or planned swarm use cases
Designate Swarm Owners for each use case
Enable swarm audit logging in tools
Document task decomposition review process

Short Term (Month 1)

Implement checkpoint and recovery mechanisms
Configure resource limits (token budgets, agent counts)
Establish conflict resolution workflows
Create swarm monitoring dashboard

Medium Term (Quarter 1)

Develop reusable decomposition patterns
Integrate swarm metrics into engineering KPIs
Automate cost estimation for swarm tasks
Conduct tabletop exercises for swarm failure scenarios

9. Anti-Patterns

Anti-Pattern 1: The Uncoordinated Stampede

Problem: Multiple agents work on the same files without coordination.
Solution: Define clear task boundaries with an ownership mapping.

Anti-Pattern 2: The Cascade Failure

Problem: One failed agent causes the entire swarm to fail.
Solution: Isolate dependencies properly and contain failures.

Anti-Pattern 3: The Silent Conflict

Problem: Agents produce conflicting outputs that are auto-merged incorrectly.
Solution: Detect conflicts and require human review.

Anti-Pattern 4: The Runaway Swarm

Problem: The swarm consumes excessive tokens or runs indefinitely.
Solution: Apply budget limits, timeouts, and checkpoint approvals.

10. Related Standards

PRD-STD-009: Autonomous Agent Governance (base agent governance)
Agent Swarm Patterns (practical implementation patterns)
PRD-STD-001: Prompt Engineering (prompt standards for agents)

11. References

Reference	Description
Kimi K2.5 Technical Report	Moonshot AI Agent Swarm architecture
Multi-Agent Reinforcement Learning	Academic foundation for swarm coordination
Distributed Systems Theory	Consensus and consistency patterns
