Agent Swarm Patterns

Agent Swarm enables parallel execution of complex development tasks by coordinating multiple AI agents. This guide provides proven patterns for implementing swarm workflows effectively while maintaining governance and quality.

Prerequisites

Before implementing swarm patterns, review PRD-STD-019: Agent Swarm Coordination for mandatory governance requirements.

When to Use Agent Swarm

Appropriate Use Cases

| Scenario | Benefit | Example |
| --- | --- | --- |
| Large codebase refactoring | 4.5x speedup | Migrate 100+ files to TypeScript |
| Multi-component updates | Parallel execution | Update auth across frontend, backend, mobile |
| Comprehensive testing | Coverage in parallel | Generate tests for all API endpoints |
| Documentation updates | Consistency across scope | Update all README files with new API |
| Dependency upgrades | Ripple effect handling | React 17→18 upgrade across codebase |

Inappropriate Use Cases

| Scenario | Why Not Swarm | Better Approach |
| --- | --- | --- |
| Single-file changes | Coordination overhead | Single agent |
| Sequential dependencies | Cannot parallelize | Sequential handoffs |
| Security-critical code | Requires focused review | Senior engineer + single agent |
| Novel architecture design | Needs coherent vision | Single agent with deep reasoning |

The Parallelization Test

Before using swarm, verify your task passes the PARALLEL test:

  • Partitionable — Can be divided into independent sub-tasks?
  • Aggregatable — Can sub-results be combined into coherent output?
  • Reviewable — Can you verify each sub-task independently?
  • Accountable — Can you identify ownership for each sub-task?
  • Limited dependencies — Are cross-task dependencies minimal?
  • Logged — Can you capture full audit trail?
  • Estimable — Can you estimate cost/time for each sub-task?
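The checklist above can be turned into a simple pre-flight gate. The sketch below is illustrative Python; the criterion keys and the `passes_parallel_test` helper are assumptions for the example, not part of any swarm tooling:

```python
# Hypothetical pre-flight gate for the PARALLEL test: every criterion
# must hold before a task is approved for swarm execution.
PARALLEL_CRITERIA = [
    "partitionable", "aggregatable", "reviewable", "accountable",
    "limited_dependencies", "logged", "estimable",
]

def passes_parallel_test(answers: dict) -> tuple:
    """Return (ok, failed_criteria) for a dict of criterion -> bool."""
    failed = [c for c in PARALLEL_CRITERIA if not answers.get(c, False)]
    return (not failed, failed)

ok, failed = passes_parallel_test({
    "partitionable": True, "aggregatable": True, "reviewable": True,
    "accountable": True, "limited_dependencies": False,
    "logged": True, "estimable": True,
})
# A single failing criterion is enough to rule out swarm mode.
```
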

Core Patterns

Pattern 1: Domain-Based Decomposition

Divide work by architectural domain.

Task: Migrate monolith to microservices

Orchestrator
├── Frontend Agent (React components)
├── Backend API Agent (Express routes)
├── Database Agent (schema, migrations)
├── Worker Agent (background jobs)
└── Integration Test Agent (depends on all)

Implementation:

kimi --mode swarm \
  --decomposition domain \
  --checkpoint-interval 3 \
  "Migrate user service to microservices architecture"

Governance Notes:

  • Each domain agent receives only relevant context
  • Integration agent waits for domain agents (explicit dependency)
  • Domain expertise documented in agent configuration

When to Use: Large architectural changes affecting multiple layers
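The explicit dependency in this decomposition (the integration agent waits on every domain agent) can be expressed as wave-based scheduling with Python's standard `graphlib`. The agent names mirror the tree above; the scheduling sketch itself is illustrative, not part of the kimi CLI:

```python
from graphlib import TopologicalSorter

# Dependency graph for the domain decomposition above: each key maps an
# agent to the set of agents it must wait for.
deps = {
    "frontend": set(), "backend_api": set(),
    "database": set(), "worker": set(),
    "integration_test": {"frontend", "backend_api", "database", "worker"},
}

ts = TopologicalSorter(deps)
ts.prepare()
waves = []  # agents in the same wave have no mutual dependencies
while ts.is_active():
    ready = sorted(ts.get_ready())
    waves.append(ready)       # this wave can run in parallel
    ts.done(*ready)
```

With this graph, all four domain agents land in the first wave and the integration-test agent runs alone in the second.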

Pattern 2: Component-Based Decomposition

Divide work by discrete components.

Task: Add OAuth2 authentication

Orchestrator
├── Login Component Agent (UI)
├── Token Service Agent (backend)
├── Middleware Agent (auth checks)
├── Database Agent (user sessions)
└── E2E Test Agent (depends on all)

Implementation:

kimi --mode swarm \
  --decomposition component \
  --max-agents 10 \
  "Implement OAuth2 authentication flow"

Governance Notes:

  • Component interfaces defined before swarm launch
  • Contract tests between components
  • Clear ownership per component

When to Use: Feature implementation spanning multiple services/modules

Pattern 3: Data-Based Decomposition

Divide work by data partitions.

Task: Process and migrate user data

Orchestrator
├── Shard A Agent (users A-F)
├── Shard B Agent (users G-M)
├── Shard C Agent (users N-S)
├── Shard D Agent (users T-Z)
└── Aggregation Agent (depends on all shards)

Implementation:

kimi --mode swarm \
  --decomposition data \
  --shards 4 \
  --shard-key "user_id" \
  "Migrate user preferences to new schema"

Governance Notes:

  • Idempotency required (safe to retry)
  • Shard boundaries must not overlap
  • Aggregation validates completeness

When to Use: Batch processing, data migrations, ETL workflows
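The shard-routing and non-overlap rules in the governance notes can be sketched with the A-F/G-M/N-S/T-Z boundaries from the example. The `shard_for` helper and shard map are hypothetical, for illustration only:

```python
import string

# Shard map mirroring the example above (users A-F, G-M, N-S, T-Z).
SHARDS = {
    "A": ("A", "F"), "B": ("G", "M"), "C": ("N", "S"), "D": ("T", "Z"),
}

def shard_for(user_name: str) -> str:
    """Route a user to exactly one shard by first letter."""
    first = user_name[0].upper()
    for shard, (lo, hi) in SHARDS.items():
        if lo <= first <= hi:
            return shard
    # An uncovered letter means the shard boundaries are incomplete.
    raise ValueError(f"no shard covers {first!r}")

# Completeness check: every letter must route somewhere without raising.
coverage = {letter: shard_for(letter) for letter in string.ascii_uppercase}
```

A real migration would also record per-shard row counts so the aggregation agent can validate completeness against the source table.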

Pattern 4: Stage-Based Decomposition

Divide by pipeline stages.

Task: CI/CD pipeline optimization

Orchestrator
├── Build Agent (compile, bundle)
├── Test Agent (unit, integration)
├── Security Agent (SAST, dependency scan)
├── Deploy Agent (staging)
└── Verify Agent (smoke tests, depends on deploy)

Implementation:

kimi --mode swarm \
  --decomposition pipeline \
  --stage-gates \
  "Optimize CI/CD pipeline for faster builds"

Governance Notes:

  • Stage gates require approval before progression
  • Each stage has defined success criteria
  • Rollback triggers defined per stage

When to Use: CI/CD improvements, release automation, quality gates

Pattern 5: Expertise-Based Decomposition

Divide by specialized knowledge areas.

Task: Comprehensive security audit

Orchestrator
├── Authentication Agent (auth flows)
├── Input Validation Agent (sanitization)
├── Secrets Agent (credential handling)
├── API Security Agent (endpoints)
└── Reporting Agent (aggregate findings)

Implementation:

kimi --mode swarm \
  --decomposition expertise \
  --expert-config ".security-experts.yaml" \
  "Conduct security audit of payment module"

Governance Notes:

  • Expert agents configured with domain-specific rules
  • Findings require human security review
  • Severity classification per finding

When to Use: Security audits, compliance checks, specialized reviews

Advanced Patterns

Pattern 6: Hierarchical Swarm

Nested swarms for very large tasks.

Orchestrator (Level 1)
├── Service A Lead
│   ├── A-Frontend Agent
│   ├── A-Backend Agent
│   └── A-Test Agent
├── Service B Lead
│   ├── B-Frontend Agent
│   ├── B-Backend Agent
│   └── B-Test Agent
└── Integration Lead
    └── Cross-Service Test Agent

Governance Notes:

  • Each lead manages their own sub-swarm
  • Clear escalation paths between levels
  • Aggregated reporting at each level

When to Use: Enterprise-scale migrations (100+ services)

Pattern 7: Competitive Swarm

Multiple agents solve the same problem; the best result is selected.

Orchestrator
├── Algorithm A Agent (recursive approach)
├── Algorithm B Agent (iterative approach)
├── Algorithm C Agent (functional approach)
└── Evaluation Agent (benchmarks, selects winner)

Governance Notes:

  • Objective evaluation criteria defined upfront
  • Human review of selected solution
  • Alternative approaches documented

When to Use: Algorithm optimization, architectural decisions, complex problem-solving
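The evaluation step reduces to: run every candidate, discard incorrect ones, then pick a winner by an objective benchmark. The three candidate functions below are illustrative stand-ins for agent output, and the selection logic is a sketch rather than a real evaluation agent:

```python
import timeit

# Three candidate implementations of the same task (sum of 0..n), one per
# "approach" in the tree above. All are made-up examples.
def sum_recursive(n):
    return 0 if n == 0 else n + sum_recursive(n - 1)

def sum_iterative(n):
    total = 0
    for i in range(n + 1):
        total += i
    return total

def sum_closed_form(n):
    return n * (n + 1) // 2

candidates = {"recursive": sum_recursive, "iterative": sum_iterative,
              "closed_form": sum_closed_form}
expected = sum(range(501))

# Objective criteria, defined upfront: correctness first, then speed.
correct = {name: fn for name, fn in candidates.items() if fn(500) == expected}
timings = {name: timeit.timeit(lambda fn=fn: fn(500), number=1000)
           for name, fn in correct.items()}
winner = min(timings, key=timings.get)
```

Per the governance notes, the selected solution would still go to human review, with the losing approaches documented alongside their benchmark numbers.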

Pattern 8: Verification Swarm

Separate agents implement and verify.

Orchestrator
├── Implementation Agent (generates code)
├── Review Agent 1 (checks correctness)
├── Review Agent 2 (checks performance)
├── Review Agent 3 (checks security)
└── Consolidation Agent (addresses findings)

Governance Notes:

  • Review agents use different criteria
  • Conflicting findings escalated to human
  • Final approval still requires human

When to Use: Critical path code, high-stakes implementations
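A minimal sketch of the consolidation step, assuming findings arrive as records tagged with agent, location, and verdict (all names and data below are made up): any location where reviewers disagree is escalated to a human, as the governance notes require.

```python
# Hypothetical review findings from the three review agents.
findings = [
    {"agent": "correctness", "file": "pay.ts", "line": 42, "verdict": "pass"},
    {"agent": "security",    "file": "pay.ts", "line": 42, "verdict": "fail"},
    {"agent": "performance", "file": "pay.ts", "line": 90, "verdict": "pass"},
]

def needs_human_escalation(findings: list) -> list:
    """Return locations where review agents reached conflicting verdicts."""
    by_location = {}
    for f in findings:
        by_location.setdefault((f["file"], f["line"]), set()).add(f["verdict"])
    return [loc for loc, verdicts in by_location.items() if len(verdicts) > 1]

escalations = needs_human_escalation(findings)
```
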

Implementation Guidelines

Task Decomposition Best Practices

  1. Define Clear Boundaries

    Bad:  "Handle authentication"
    Good: "Implement JWT token generation in auth.service.ts"
  2. Minimize Cross-Dependencies

    • Aim for <10% of sub-tasks having dependencies
    • Document all dependencies explicitly
    • Consider dependency order in agent scheduling
  3. Size Appropriately

    | Total Work | Sub-Tasks | Agents |
    | --- | --- | --- |
    | Small (1-10 files) | 2-3 | 2-3 |
    | Medium (10-50 files) | 5-10 | 5-10 |
    | Large (50-200 files) | 10-20 | 10-20 |
    | Enterprise (200+ files) | 20-50 | 20-50 |
  4. Include Validation Criteria

    Each sub-task must have measurable completion criteria:

    • Tests pass
    • Linting clean
    • Type checking passes
    • Human review checkpoint (for production)
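The criteria above can be encoded as a per-sub-task completion gate. This is a sketch under the assumption that check results arrive as booleans; `SubTaskResult` and `is_complete` are illustrative names, not real tooling:

```python
from dataclasses import dataclass

@dataclass
class SubTaskResult:
    tests_pass: bool
    lint_clean: bool
    types_pass: bool
    production: bool = False       # production work needs human sign-off
    human_approved: bool = False

def is_complete(r: SubTaskResult) -> bool:
    # All automated checks must pass; production work also needs the
    # human review checkpoint.
    automated = r.tests_pass and r.lint_clean and r.types_pass
    return automated and (r.human_approved if r.production else True)
```
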

Error Handling Strategies

Strategy 1: Fail Fast

kimi --mode swarm \
  --fail-fast \
  --notify-on-failure \
  "Critical production fix"

Any sub-agent failure stops the entire swarm. Use for: Critical changes where partial completion is dangerous.

Strategy 2: Continue with Logging

kimi --mode swarm \
  --continue-on-failure \
  --failure-log "/var/log/swarm-failures.log" \
  "Batch documentation updates"

Failed agents are logged; the others continue. Use for: Non-critical batch work where partial completion is acceptable.

Strategy 3: Retry with Backoff

kimi --mode swarm \
  --retry 3 \
  --retry-delay 5s \
  --retry-backoff exponential \
  "API integration updates"

Automatic retry for transient failures. Use for: External dependency work, network-dependent tasks.
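The retry schedule implied by these flags (3 retries, 5-second base delay, exponential backoff) can be sketched as follows. `run_with_retry` is an illustrative helper, not a real kimi API; a real caller would pass `time.sleep` as the `sleep` argument:

```python
def backoff_schedule(retries: int, base_delay: float, factor: float = 2.0):
    """Delay before each retry: base, base*factor, base*factor^2, ..."""
    return [base_delay * factor ** i for i in range(retries)]

def run_with_retry(task, retries=3, base_delay=5.0, sleep=None):
    """Run task, retrying transient failures with exponential backoff."""
    delays = [0.0] + backoff_schedule(retries, base_delay)
    for attempt, delay in enumerate(delays):
        if sleep and delay:
            sleep(delay)           # wait before each retry attempt
        try:
            return task()
        except Exception:
            if attempt == retries:  # out of retries: surface the error
                raise
```
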

Strategy 4: Human Escalation

kimi --mode swarm \
  --escalate-on-failure \
  --escalation-contact "oncall@company.com" \
  --escalation-timeout 300 \
  "Complex database migration"

Human intervention is required on failure. Use for: High-risk changes requiring expert judgment.

Cost Management

Swarm execution can consume significant token budgets:

# Estimate before execution
kimi --mode swarm --estimate-only "Large refactoring task"
# Output: Estimated 500K input tokens, 200K output tokens, ~$4.00

# Set hard budget
kimi --mode swarm \
  --token-budget-input 1000000 \
  --token-budget-output 500000 \
  --action-on-budget-exceed notify-and-pause \
  "Feature implementation"

Cost Optimization Tips:

  1. Use smaller models for sub-agents when possible
  2. Cache common context across agents
  3. Set checkpoints to enable early termination if quality degrades
  4. Review decomposition plan to minimize coordination overhead
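To make the worked estimate above concrete: cost scales linearly with token counts at some per-million-token rate. The rates below are illustrative assumptions chosen to reproduce the ~$4.00 figure, not real pricing:

```python
# Assumed per-million-token rates (illustrative only, not real pricing).
RATE_INPUT_PER_M = 4.00    # USD per 1M input tokens
RATE_OUTPUT_PER_M = 10.00  # USD per 1M output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Linear token-cost model: tokens / 1M * rate, summed per direction."""
    return (input_tokens / 1_000_000 * RATE_INPUT_PER_M
            + output_tokens / 1_000_000 * RATE_OUTPUT_PER_M)

# The worked example's counts: 500K input + 200K output tokens.
cost = estimate_cost(500_000, 200_000)
```
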

Monitoring and Observability

# Real-time dashboard
kimi --mode swarm \
  --dashboard-port 8080 \
  --metrics-prometheus \
  "Production deployment"

# Structured logging
kimi --mode swarm \
  --log-format json \
  --log-destination /var/log/swarm/$(date +%Y%m%d-%H%M%S).json \
  "Audit-required change"

Key Metrics to Track:

  • Execution time per agent
  • Token usage per agent and total
  • Success/failure rates
  • Conflict count
  • Human intervention frequency
  • Time to completion vs. estimate

Anti-Patterns and Mitigations

Anti-Pattern 1: The Uncoordinated Stampede

Problem: Multiple agents editing same files simultaneously.

Symptoms:

  • Merge conflicts
  • Inconsistent changes
  • Lost work

Mitigation:

  • Clear file ownership per agent
  • Pre-execution file locking
  • Conflict detection in orchestrator
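Pre-execution ownership checking can be sketched as a map from file to claiming agents, flagging any file claimed more than once. The plan below (agent and file names) is entirely made up for illustration:

```python
from collections import defaultdict

# Hypothetical execution plan: each agent declares the files it will edit.
plan = {
    "frontend_agent": ["src/App.tsx", "src/Login.tsx"],
    "backend_agent": ["api/auth.ts", "src/Login.tsx"],  # overlaps with frontend
    "db_agent": ["migrations/001_users.sql"],
}

def find_ownership_conflicts(plan: dict) -> dict:
    """Return files claimed by more than one agent."""
    owners = defaultdict(list)
    for agent, files in plan.items():
        for f in files:
            owners[f].append(agent)
    return {f: agents for f, agents in owners.items() if len(agents) > 1}

conflicts = find_ownership_conflicts(plan)
```

An orchestrator would refuse to launch until the plan is conflict-free.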

Anti-Pattern 2: The Cascade Failure

Problem: One failed agent causes the entire swarm to collapse.

Symptoms:

  • All agents stop on first failure
  • Partial work lost
  • No recovery mechanism

Mitigation:

  • --continue-on-failure for non-critical tasks
  • Dependency isolation
  • Checkpoint recovery

Anti-Pattern 3: The Silent Conflict

Problem: Agents produce contradictory outputs that auto-merge.

Symptoms:

  • Inconsistent code style
  • Conflicting implementations
  • Test failures post-merge

Mitigation:

  • Conflict detection rules
  • Human review for divergent outputs
  • Standardized patterns enforced

Anti-Pattern 4: The Runaway Swarm

Problem: Excessive token usage or indefinite execution.

Symptoms:

  • Budget exceeded
  • Agents stuck in loops
  • No progress visibility

Mitigation:

  • Token budgets with enforcement
  • Timeouts per agent and total
  • Progress checkpoints

Anti-Pattern 5: The Over-Decomposition

Problem: Too many sub-tasks create coordination overhead.

Symptoms:

  • More time coordinating than working
  • Excessive inter-agent messaging
  • Diminishing returns

Mitigation:

  • Target 5-20 sub-tasks per swarm
  • Batch small related tasks
  • Measure coordination overhead

Tool-Specific Implementation

Kimi Code Agent Swarm

Configuration File (.kimi/swarm-config.yaml):

swarm:
  max_agents: 20
  decomposition: domain
  checkpoint_interval: 5
  token_budget:
    input: 1000000
    output: 500000
  error_handling:
    strategy: continue_with_logging
    max_retries: 3
  logging:
    level: debug
    destination: ./logs/swarm/
  governance:
    owner: "senior-dev@company.com"
    require_decomposition_review: true
    human_approval_checkpoints: [5, 10, 15]

Execution:

kimi --mode swarm --config .kimi/swarm-config.yaml "Migrate to TypeScript"

Custom Swarm Implementation

For teams building custom orchestration:

# Simplified swarm orchestrator pattern
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timezone

# SubAgent is assumed to be an external class wrapping one sub-task worker.

class SwarmOrchestrator:
    def __init__(self, config, llm):
        self.config = config
        self.llm = llm                  # model client used for decomposition
        self.max_agents = config.max_agents
        self.token_budget = config.token_budget
        self.agents = []
        self.checkpoints = []

    def decompose(self, task):
        # Use AI to decompose the task into independent sub-tasks
        return self.llm.decompose(task)

    def spawn_agents(self, subtasks):
        # Cap the swarm at the configured agent limit
        for subtask in subtasks:
            if len(self.agents) >= self.max_agents:
                break
            self.agents.append(SubAgent(subtask, self.config))

    def coordinate(self):
        # Parallel execution (dependency resolution omitted for brevity)
        with ThreadPoolExecutor(max_workers=self.max_agents) as pool:
            results = list(pool.map(lambda agent: agent.run(), self.agents))
        return self.aggregate(results)

    def aggregate(self, results):
        # Combine sub-results into a single swarm output
        return results

    def checkpoint(self):
        # Save recoverable state for restart after failure
        self.checkpoints.append({
            'agents': [agent.state for agent in self.agents],
            'timestamp': datetime.now(timezone.utc),
            'tokens_used': self.token_usage(),
        })

    def token_usage(self):
        return sum(getattr(agent, 'tokens_used', 0) for agent in self.agents)