
PRD-STD-019: Agent Swarm Coordination

Standard ID: PRD-STD-019
Version: 1.0
Status: Active
Compliance Level: Level 4 (Managed)
Effective Date: April 2026
Last Reviewed: April 2026

How To Use This Standard

This page is the normative source of requirements for governance of multi-agent swarm workflows.

For implementation support:

Use the Compliance Level metadata to sequence adoption with other PRD-STDs.

1. Purpose

This standard defines governance requirements for Agent Swarm workflows — parallel execution patterns where a coordinating agent decomposes complex tasks into sub-tasks executed concurrently by specialized sub-agents. Agent Swarm capabilities (exemplified by Kimi K2.5's 100-agent coordination) enable significant speedups for parallelizable work but introduce unique risks around accountability, state management, and error handling.

Agent Swarm enables:

  • Parallel codebase refactoring across modules
  • Concurrent test generation for multiple components
  • Distributed code review across file sets
  • Simultaneous documentation updates

These capabilities require governance frameworks addressing task decomposition, cross-agent coordination, and failure recovery.

2. Scope

This standard applies to:

  • All AI workflows involving parallel agent execution (2+ agents)
  • All coordinating agents that decompose tasks for sub-agents
  • All sub-agents operating under swarm coordination
  • All tools with swarm capabilities including Kimi Code Agent Swarm mode

This standard does NOT apply to:

  • Single-agent workflows (covered by PRD-STD-009)
  • Sequential multi-agent handoffs (covered by PRD-STD-009)
  • Human-managed parallel tasks without agent coordination

3. Definitions

Term | Definition
Agent Swarm | A coordinated group of AI agents working in parallel on decomposed sub-tasks
Orchestrator Agent | The coordinating agent responsible for task decomposition and result aggregation
Sub-Agent | An individual agent executing a specific sub-task within the swarm
Task Decomposition | The process of breaking a complex task into independent parallelizable sub-tasks
Swarm State | The collective state of all agents in the swarm including progress, intermediate results, and failures
Conflict Resolution | Mechanisms for handling competing or contradictory outputs from sub-agents
Checkpoint | A recoverable state in the swarm workflow enabling rollback or restart
Swarm Owner | The human accountable for swarm execution and outcomes

4. Requirements

4.1 Governance and Accountability

MANDATORY

REQ-019-01: Every swarm MUST have a designated Swarm Owner who is accountable for the swarm's execution and outcomes.

REQ-019-02: The orchestrator agent MUST log all task decomposition decisions including rationale for sub-task boundaries.

REQ-019-03: Sub-agent authorization MUST follow a hierarchical model — sub-agents operate within constraints defined by the orchestrator.

REQ-019-04: Swarm execution MUST be attributable to the Swarm Owner in audit trails (not anonymized behind agent IDs).

RECOMMENDED

REQ-019-05: Swarm Owners SHOULD be senior engineers (Staff+) familiar with both the domain and agentic workflows.

REQ-019-06: Critical swarms SHOULD have a secondary observer agent monitoring for anomalous behavior.
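The attribution requirement in REQ-019-04 can be illustrated with a minimal sketch of an audit record that always carries the Swarm Owner next to the agent ID. The `AuditRecord` type, the `make_audit_record` helper, and every field name are assumptions made for illustration; this standard does not prescribe a log schema.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditRecord:
    """Hypothetical audit entry; field names are illustrative only."""
    swarm_id: str
    swarm_owner: str  # the accountable human, never just an agent ID
    agent_id: str
    event: str
    detail: str
    timestamp: str


def make_audit_record(swarm_id, swarm_owner, agent_id, event, detail):
    """Build an audit record, refusing entries with no named Swarm Owner."""
    if not swarm_owner.strip():
        raise ValueError("audit record requires a named Swarm Owner")
    timestamp = datetime.now(timezone.utc).isoformat()
    return AuditRecord(swarm_id, swarm_owner, agent_id, event, detail, timestamp)


# Example: log a decomposition decision together with its rationale.
record = make_audit_record(
    "swarm-001", "owner@example.com", "orchestrator",
    "decomposition", "split by module: no shared files between sub-tasks",
)
audit_line = json.dumps(asdict(record), sort_keys=True)
```

Rejecting records without an owner at write time keeps the audit trail attributable by construction rather than by later review.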

4.2 Task Decomposition

MANDATORY

REQ-019-07: Task decomposition MUST create sub-tasks with clear, non-overlapping scopes to prevent conflict.

REQ-019-08: Sub-tasks MUST declare their dependencies explicitly. A sub-task MUST NOT start until every sub-task it depends on has completed; only mutually independent sub-tasks may execute in parallel.

REQ-019-09: The orchestrator MUST validate that decomposed sub-tasks collectively cover the original requirement.

REQ-019-10: Task decomposition for production changes MUST be reviewable by humans before swarm execution.

RECOMMENDED

REQ-019-11: Decomposition strategies SHOULD be documented and reusable (domain-based, component-based, data-based).

REQ-019-12: Sub-task granularity SHOULD balance parallelism against coordination overhead (typically 5-20 sub-tasks).
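One way to check REQ-019-07 and REQ-019-09 before execution is to model each sub-task's scope as a set of file paths. The sketch below assumes that representation; `SubTask`, `find_overlaps`, and `find_uncovered` are hypothetical names, and real scopes may need a richer model than file paths.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class SubTask:
    name: str
    scope: frozenset  # file paths this sub-task is allowed to modify


def find_overlaps(tasks):
    """Return (task_a, task_b, shared_paths) for every pair of overlapping scopes."""
    overlaps = []
    for a, b in combinations(tasks, 2):
        shared = a.scope & b.scope
        if shared:
            overlaps.append((a.name, b.name, sorted(shared)))
    return overlaps


def find_uncovered(required_paths, tasks):
    """Return the required paths that no sub-task claims (coverage gaps)."""
    covered = set()
    for task in tasks:
        covered |= task.scope
    return sorted(set(required_paths) - covered)


plan = [
    SubTask("frontend", frozenset({"a.ts", "b.ts"})),
    SubTask("backend", frozenset({"b.ts", "c.ts"})),
]
overlaps = find_overlaps(plan)
gaps = find_uncovered({"a.ts", "b.ts", "c.ts", "d.ts"}, plan)
```

A plan with a non-empty `overlaps` or `gaps` list would be returned to the orchestrator or the human reviewer instead of being executed.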

4.3 Cross-Agent Coordination

MANDATORY

REQ-019-13: All inter-agent communication MUST be logged including message content, sender, receiver, and timestamp.

REQ-019-14: Shared state between agents MUST be versioned with conflict detection (last-write-wins is insufficient).

REQ-019-15: Agents MUST handle messages from other agents as untrusted input — validate before acting.

REQ-019-16: Swarm termination MUST be orderly — incomplete sub-agents MUST be signaled and given cleanup opportunity.
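Versioned shared state with conflict detection (REQ-019-14) is commonly built on optimistic concurrency: a write succeeds only if the writer saw the latest version. The following is a minimal in-memory sketch under that assumption; `VersionedStore` and `VersionConflict` are hypothetical names, and a production store would also need persistence and locking.

```python
class VersionConflict(Exception):
    """Raised when a writer's view of a key is stale."""


class VersionedStore:
    """In-memory key/value store with per-key version numbers."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        """Return (version, value); unknown keys are at version 0."""
        return self._data.get(key, (0, None))

    def write(self, key, value, expected_version):
        """Write only if the caller read the current version."""
        current_version, _ = self.read(key)
        if current_version != expected_version:
            raise VersionConflict(
                f"{key}: expected v{expected_version}, found v{current_version}"
            )
        new_version = current_version + 1
        self._data[key] = (new_version, value)
        return new_version


store = VersionedStore()
version_seen_by_a, _ = store.read("plan")
version_seen_by_b, _ = store.read("plan")
store.write("plan", "agent A's update", version_seen_by_a)  # succeeds
try:
    store.write("plan", "agent B's update", version_seen_by_b)  # stale view
    conflict_detected = False
except VersionConflict:
    conflict_detected = True
```

Unlike last-write-wins, the second writer is told its view was stale, so the conflict can be logged and resolved rather than silently overwritten.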

RECOMMENDED

REQ-019-17: Shared state SHOULD be minimized — prefer immutable data passing over shared mutable state.

REQ-019-18: Agent communication SHOULD use structured formats (JSON, protobuf) with schema validation.
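Treating inter-agent messages as untrusted input (REQ-019-15) with a structured format (REQ-019-18) might look like the sketch below. The message fields, the allowed `type` values, and the `parse_message` helper are assumptions chosen for illustration, not a schema defined by this standard.

```python
import json

# Hypothetical message schema: field name -> required Python type.
REQUIRED_FIELDS = {
    "sender": str,
    "receiver": str,
    "timestamp": str,
    "type": str,
    "payload": dict,
}
ALLOWED_TYPES = {"result", "status", "error"}


def parse_message(raw):
    """Parse and validate a raw JSON message; raise ValueError if it is malformed."""
    try:
        message = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(message, dict):
        raise ValueError("message must be a JSON object")
    unexpected = set(message) - set(REQUIRED_FIELDS)
    if unexpected:
        raise ValueError(f"unexpected fields: {sorted(unexpected)}")
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(message.get(field), expected_type):
            raise ValueError(f"field {field!r} missing or not {expected_type.__name__}")
    if message["type"] not in ALLOWED_TYPES:
        raise ValueError(f"unknown message type: {message['type']!r}")
    return message


good = json.dumps({
    "sender": "sub-agent-3", "receiver": "orchestrator",
    "timestamp": "2026-04-01T12:00:00Z", "type": "result",
    "payload": {"files_changed": 4},
})
validated = parse_message(good)
```

Rejecting unknown fields and unknown message types keeps a compromised or confused sub-agent from smuggling instructions through the message channel.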

4.4 Error Handling and Recovery

MANDATORY

REQ-019-19: Swarm workflows MUST implement checkpointing: a recoverable state MUST be recorded after every N completed sub-tasks or at a fixed time interval.

REQ-019-20: Sub-agent failures MUST NOT cascade: the orchestrator MUST isolate each failure and continue executing sub-tasks that do not depend on the failed one.

REQ-019-21: Partial swarm failures MUST trigger human notification with clear indication of completed vs. failed sub-tasks.

REQ-019-22: Swarm recovery MUST support: (a) retry failed sub-agents, (b) rollback to checkpoint, (c) manual intervention.
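Failure isolation (REQ-019-20) and the completed-versus-failed summary needed for notification (REQ-019-21) can be sketched as follows. The example runs sub-tasks sequentially in dependency order for clarity; a real orchestrator would run independent sub-tasks in parallel. `run_with_isolation` and `summarize_status` are hypothetical helper names.

```python
from graphlib import TopologicalSorter


def run_with_isolation(deps, run):
    """Run sub-tasks in dependency order, containing failures.

    deps maps each sub-task to the set of sub-tasks it depends on.
    A failure marks only its dependents as skipped; independent
    sub-tasks still run.
    """
    status = {}
    for task in TopologicalSorter(deps).static_order():
        if any(status[dep] != "done" for dep in deps.get(task, ())):
            status[task] = "skipped"  # an upstream sub-task did not complete
            continue
        try:
            run(task)
            status[task] = "done"
        except Exception:
            status[task] = "failed"
    return status


def summarize_status(status):
    """Group sub-tasks by outcome for the human notification."""
    summary = {"done": [], "failed": [], "skipped": []}
    for task, outcome in status.items():
        summary[outcome].append(task)
    return {outcome: sorted(tasks) for outcome, tasks in summary.items()}


def demo_runner(task):
    if task == "a":
        raise RuntimeError("sub-agent crashed")


demo_deps = {"b": {"a"}, "c": set(), "d": {"b"}}
demo_summary = summarize_status(run_with_isolation(demo_deps, demo_runner))
```

In the demo, `a` fails, its dependents `b` and `d` are skipped, and the unrelated sub-task `c` still completes.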

RECOMMENDED

REQ-019-23: Idempotent sub-tasks SHOULD be preferred because they can be retried safely without duplicating side effects.

REQ-019-24: Sub-agent timeouts SHOULD be configurable per task type to prevent indefinite blocking.

4.5 Conflict Resolution

MANDATORY

REQ-019-25: Conflicting outputs from sub-agents (e.g., different solutions to the same pattern) MUST be flagged for human resolution.

REQ-019-26: Automated conflict resolution rules MUST be documented and approved — no opaque arbitration.

REQ-019-27: Code-level conflicts (merge conflicts from parallel edits) MUST follow standard version control resolution workflows.

RECOMMENDED

REQ-019-28: Conflicting approaches SHOULD be preserved as alternatives for human review rather than auto-selecting.

REQ-019-29: Swarm configurations SHOULD specify conflict resolution strategy per task type.
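A simple form of the conflict flagging in REQ-019-25, which also preserves alternatives as REQ-019-28 recommends, is to group sub-agent outputs by target and report every target that received differing proposals. `Proposal` and `find_conflicts` are hypothetical names, and comparing content by exact equality is an assumption; real systems may need semantic comparison.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    agent_id: str
    target: str   # e.g. a file path or a named design decision
    content: str  # the proposed output for that target


def find_conflicts(proposals):
    """Return {target: [proposals]} for targets with differing proposed content.

    All competing proposals are kept so a human can compare them;
    nothing is auto-selected.
    """
    by_target = defaultdict(list)
    for proposal in proposals:
        by_target[proposal.target].append(proposal)
    return {
        target: group
        for target, group in by_target.items()
        if len({p.content for p in group}) > 1
    }


proposals = [
    Proposal("agent-1", "retry-policy", "exponential backoff"),
    Proposal("agent-2", "retry-policy", "fixed delay"),
    Proposal("agent-3", "log-format", "json"),
    Proposal("agent-4", "log-format", "json"),
]
conflicts = find_conflicts(proposals)
```

Here only `retry-policy` is flagged: the two `log-format` proposals agree, so they need no human resolution.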

4.6 Audit and Observability

MANDATORY

REQ-019-30: Complete swarm execution logs MUST include: (a) decomposition plan, (b) sub-agent assignments, (c) all inter-agent messages, (d) final aggregated output.

REQ-019-31: Swarm metrics MUST be captured: execution time, token usage per agent, success/failure rates, conflict count.

REQ-019-32: Swarm audit trails MUST be retained for the same duration as the code they produce.

RECOMMENDED

REQ-019-33: Real-time swarm dashboards SHOULD display progress, current agent states, and any blocked tasks.

REQ-019-34: Post-execution analysis SHOULD identify optimization opportunities for future swarms.
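The metrics listed in REQ-019-31 can be aggregated from per-agent run records. The sketch below assumes a hypothetical `AgentRun` record and a `summarize_metrics` helper; the output key names are illustrative, not mandated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRun:
    agent_id: str
    seconds: float
    tokens: int
    succeeded: bool


def summarize_metrics(runs, wall_clock_seconds, conflict_count):
    """Aggregate execution time, token usage, success rates, and conflicts."""
    tokens_per_agent = {}
    for run in runs:
        tokens_per_agent[run.agent_id] = tokens_per_agent.get(run.agent_id, 0) + run.tokens
    total = len(runs)
    succeeded = sum(1 for run in runs if run.succeeded)
    return {
        "execution_seconds": wall_clock_seconds,
        "tokens_per_agent": tokens_per_agent,
        "total_tokens": sum(tokens_per_agent.values()),
        "success_rate": succeeded / total if total else 0.0,
        "failure_rate": (total - succeeded) / total if total else 0.0,
        "conflict_count": conflict_count,
    }


metrics = summarize_metrics(
    [
        AgentRun("agent-1", 12.0, 100, True),
        AgentRun("agent-2", 9.5, 200, True),
        AgentRun("agent-3", 30.0, 300, False),
    ],
    wall_clock_seconds=31.0,
    conflict_count=1,
)
```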

4.7 Resource Management

MANDATORY

REQ-019-35: Swarm execution MUST respect token budget limits — orchestrator MUST track aggregate usage.

REQ-019-36: Parallel agent count MUST be limited to prevent resource exhaustion (configurable maximum).

REQ-019-37: Long-running swarms (those exceeding a configured duration threshold) MUST require explicit approval to continue.

RECOMMENDED

REQ-019-38: Cost estimation SHOULD be provided before swarm execution based on decomposition plan.

REQ-019-39: Resource usage SHOULD be optimized — decompose to maximize parallel efficiency without excessive coordination overhead.
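Aggregate budget tracking (REQ-019-35), an agent-count cap (REQ-019-36), and a pre-execution estimate (REQ-019-38) can be sketched as below. `TokenBudget`, `BudgetExceeded`, `plan_agent_count`, and `estimate_tokens` are hypothetical names, and the per-sub-task estimates are assumed to come from the decomposition plan.

```python
from collections import defaultdict


class BudgetExceeded(RuntimeError):
    """Raised when a charge would push the swarm past its token budget."""


class TokenBudget:
    """Tracks aggregate and per-agent token usage against a hard limit."""

    def __init__(self, limit):
        self.limit = limit
        self.used = 0
        self.per_agent = defaultdict(int)

    def charge(self, agent_id, tokens):
        """Record usage, refusing any charge that would exceed the limit."""
        if self.used + tokens > self.limit:
            raise BudgetExceeded(
                f"{agent_id} requested {tokens}, only {self.limit - self.used} left"
            )
        self.used += tokens
        self.per_agent[agent_id] += tokens


def plan_agent_count(sub_task_count, max_agents):
    """Never run more agents than the configured maximum."""
    if max_agents < 1:
        raise ValueError("max_agents must be at least 1")
    return min(sub_task_count, max_agents)


def estimate_tokens(per_task_estimates):
    """Sum per-sub-task estimates into a pre-execution total."""
    return sum(per_task_estimates.values())


budget = TokenBudget(limit=1000)
budget.charge("agent-1", 600)
try:
    budget.charge("agent-2", 500)  # would bring the total to 1100
    over_budget = False
except BudgetExceeded:
    over_budget = True
```

Checking the limit before recording the charge means a rejected request leaves the tracked usage unchanged.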

5. Compliance Levels

Level | Requirement
Level 1 (Uncontrolled) | No swarm governance. Parallel agents used ad-hoc without coordination.
Level 2 (Exploratory) | Basic swarm logging. REQ-019-01, REQ-019-30 implemented.
Level 3 (Defined) | REQ-019-01 through REQ-019-10 implemented. Task decomposition reviewable.
Level 4 (Managed) | All MANDATORY requirements. Checkpointing, conflict resolution, resource limits.
Level 5 (AI-First) | All Level 4 + automated optimization, predictive conflict detection, self-healing swarms.

6. Tool-Specific Guidance

Kimi Code (Agent Swarm)

Capabilities:

  • Up to 100 parallel sub-agents
  • Automatic task decomposition
  • Self-directed coordination
  • Vendor-reported speedup of up to 4.5x on parallelizable tasks

Governance Configuration:

# Enable comprehensive logging
export KIMI_SWARM_AUDIT=true
export KIMI_SWARM_LOG_LEVEL=debug

# Set resource limits
export KIMI_SWARM_MAX_AGENTS=20
export KIMI_SWARM_TOKEN_BUDGET=1000000
export KIMI_SWARM_TIMEOUT=3600

# Require checkpoint approval
export KIMI_SWARM_CHECKPOINT_APPROVAL=true

Execution with Governance:

# Document swarm owner
kimi swarm --owner "senior-dev@company.com" \
  --decomposition-review \
  --checkpoint-interval 5 \
  --mode swarm \
  "Migrate codebase to TypeScript"

Best Practices:

  • Start with smaller swarms (5-10 agents) before scaling to 100
  • Use checkpoint approval for production changes
  • Review decomposition plan before execution
  • Monitor token usage — swarms can consume budget rapidly

Custom Swarm Implementations

For teams building custom swarm orchestration:

Required Components:

  1. Orchestrator Service — Decomposition and coordination
  2. Agent Registry — Sub-agent lifecycle management
  3. Message Bus — Inter-agent communication with logging
  4. State Store — Versioned shared state with conflict detection
  5. Checkpoint Manager — Recovery point creation and restoration
  6. Monitoring Dashboard — Real-time visibility and alerting
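As one illustration of the components above, the Message Bus can be sketched as an in-memory queue that logs every message (sender, receiver, content, timestamp) before delivering it. `LoggedMessageBus` and `Message` are hypothetical names; a real deployment would likely use a durable broker and an append-only log.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Message:
    sender: str
    receiver: str
    content: str
    timestamp: str


class LoggedMessageBus:
    """In-memory bus: every message is logged before it is delivered."""

    def __init__(self):
        self.log = []                      # complete message history
        self._inboxes = defaultdict(deque)  # receiver -> pending messages

    def send(self, sender, receiver, content):
        message = Message(
            sender, receiver, content, datetime.now(timezone.utc).isoformat()
        )
        self.log.append(message)  # log first, so no delivery goes unrecorded
        self._inboxes[receiver].append(message)
        return message

    def receive(self, receiver):
        """Return the oldest pending message for receiver, or None."""
        inbox = self._inboxes[receiver]
        return inbox.popleft() if inbox else None


bus = LoggedMessageBus()
bus.send("orchestrator", "agent-1", "refactor module auth")
bus.send("agent-1", "orchestrator", "done: 4 files changed")
first_for_agent = bus.receive("agent-1")
```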

7. Swarm Patterns

Pattern 1: Domain-Based Decomposition

Use Case: Large codebase refactoring

Orchestrator
├── Frontend Agent (React components)
├── Backend Agent (API routes)
├── Database Agent (schema migrations)
└── Test Agent (test updates)

Governance: Each domain agent has specialized context and constraints.
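Domain-based decomposition can be sketched as routing files to domain agents by path prefix. The prefixes and the `assign_files` helper below are purely illustrative assumptions about a repository layout; files that match no domain, or more than one, are set aside for human assignment rather than guessed.

```python
# Hypothetical mapping from domain agent to the path prefixes it owns.
DOMAIN_PREFIXES = {
    "frontend": ("src/components/",),
    "backend": ("src/api/",),
    "database": ("migrations/",),
    "test": ("tests/",),
}


def assign_files(paths):
    """Assign each path to exactly one domain, or to 'unassigned'."""
    assignment = {domain: [] for domain in DOMAIN_PREFIXES}
    assignment["unassigned"] = []
    for path in paths:
        owners = [
            domain
            for domain, prefixes in DOMAIN_PREFIXES.items()
            if path.startswith(prefixes)  # str.startswith accepts a tuple
        ]
        if len(owners) == 1:
            assignment[owners[0]].append(path)
        else:
            assignment["unassigned"].append(path)
    return assignment


assignment = assign_files([
    "src/components/Button.tsx",
    "src/api/users.ts",
    "migrations/001_init.sql",
    "tests/users.test.ts",
    "README.md",
])
```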

Pattern 2: Component-Based Decomposition

Use Case: Microservice architecture changes

Orchestrator
├── Service A Agent
├── Service B Agent
├── Service C Agent
└── Integration Test Agent (depends on A, B, C)

Governance: Dependency declarations ensure proper execution order.
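The execution order implied by the dependency declarations can be derived with a topological sort. The sketch below uses Python's standard `graphlib` module to group sub-tasks into waves, where every sub-task in a wave may run in parallel; the `parallel_waves` helper and the agent names are illustrative.

```python
from graphlib import TopologicalSorter


def parallel_waves(deps):
    """Group sub-tasks into waves; each wave depends only on earlier waves.

    deps maps each sub-task to the set of sub-tasks it depends on.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorter.get_ready()  # everything whose dependencies are done
        waves.append(sorted(ready))
        sorter.done(*ready)
    return waves


waves = parallel_waves({
    "integration-tests": {"service-a", "service-b", "service-c"},
})
# waves == [["service-a", "service-b", "service-c"], ["integration-tests"]]
```

The three service agents form the first wave, and the integration test agent runs only after all of them have finished.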

Pattern 3: Data-Based Decomposition

Use Case: Batch processing across datasets

Orchestrator
├── Shard 1 Agent
├── Shard 2 Agent
├── Shard 3 Agent
└── Aggregation Agent (depends on all shards)

Governance: Idempotency required for safe retry of data agents.
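Idempotent retry for data agents can be sketched by keying each result on its shard ID, so that reprocessing a shard overwrites its entry instead of adding a duplicate. `process_shards` and `aggregate` are hypothetical helpers, and summing row values stands in for whatever work a shard agent actually performs.

```python
def process_shards(shards, process, results, max_attempts=3):
    """Process each shard at most max_attempts times; return the failed shard IDs.

    results is keyed by shard ID, so a retry replaces the entry rather
    than appending to it, and shards that are already done are skipped.
    """
    failed = []
    for shard_id, rows in shards.items():
        if shard_id in results:
            continue
        for _ in range(max_attempts):
            try:
                results[shard_id] = process(rows)
                break
            except Exception:
                continue  # retry the same shard
        else:
            failed.append(shard_id)  # every attempt raised
    return failed


def aggregate(results):
    """Aggregation step: runs only after all shards have a result."""
    return sum(results.values())


attempts = []

def flaky_sum(rows):
    attempts.append(tuple(rows))
    if rows == [3, 4] and attempts.count((3, 4)) == 1:
        raise RuntimeError("transient failure")
    return sum(rows)

shard_results = {}
failed_shards = process_shards(
    {"shard-1": [1, 2], "shard-2": [3, 4], "shard-3": [5]}, flaky_sum, shard_results
)
total = aggregate(shard_results)
```

In the demo, the first attempt on `shard-2` fails and the retry succeeds, yet the aggregate counts that shard exactly once.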

8. Implementation Checklist

Immediate (Week 1)

  • Identify current or planned swarm use cases
  • Designate Swarm Owners for each use case
  • Enable swarm audit logging in tools
  • Document task decomposition review process

Short Term (Month 1)

  • Implement checkpoint and recovery mechanisms
  • Configure resource limits (token budgets, agent counts)
  • Establish conflict resolution workflows
  • Create swarm monitoring dashboard

Medium Term (Quarter 1)

  • Develop reusable decomposition patterns
  • Integrate swarm metrics into engineering KPIs
  • Automate cost estimation for swarm tasks
  • Conduct tabletop exercises for swarm failure scenarios

9. Anti-Patterns

Anti-Pattern 1: The Uncoordinated Stampede

Problem: Multiple agents work on the same files without coordination.

Solution: Define clear task boundaries with an ownership mapping.

Anti-Pattern 2: The Cascade Failure

Problem: One failed agent causes the entire swarm to fail.

Solution: Isolate dependencies properly and contain failures.

Anti-Pattern 3: The Silent Conflict

Problem: Agents produce conflicting outputs that auto-merge incorrectly.

Solution: Detect conflicts and require human review.

Anti-Pattern 4: The Runaway Swarm

Problem: The swarm consumes excessive tokens or runs indefinitely.

Solution: Enforce budget limits, timeouts, and checkpoint approvals.

11. References

Reference | Description
Kimi K2.5 Technical Report | Moonshot AI Agent Swarm architecture
Multi-Agent Reinforcement Learning | Academic foundation for swarm coordination
Distributed Systems Theory | Consensus and consistency patterns