Agent Swarm Patterns

Agent Swarm enables parallel execution of complex development tasks by coordinating multiple AI agents. This guide provides proven patterns for implementing swarm workflows effectively while maintaining governance and quality.

Prerequisites

Before implementing swarm patterns, review PRD-STD-019: Agent Swarm Coordination for mandatory governance requirements.

When to Use Agent Swarm

Appropriate Use Cases

| Scenario | Benefit | Example |
| --- | --- | --- |
| Large codebase refactoring | 4.5x speedup | Migrate 100+ files to TypeScript |
| Multi-component updates | Parallel execution | Update auth across frontend, backend, mobile |
| Comprehensive testing | Coverage in parallel | Generate tests for all API endpoints |
| Documentation updates | Consistency across scope | Update all README files with new API |
| Dependency upgrades | Ripple effect handling | React 17→18 upgrade across codebase |

Inappropriate Use Cases

| Scenario | Why Not Swarm | Better Approach |
| --- | --- | --- |
| Single-file changes | Coordination overhead | Single agent |
| Sequential dependencies | Cannot parallelize | Sequential handoffs |
| Security-critical code | Requires focused review | Senior engineer + single agent |
| Novel architecture design | Needs coherent vision | Single agent with deep reasoning |

The Parallelization Test

Before using swarm, verify your task passes the PARALLEL test:

  • Partitionable — Can be divided into independent sub-tasks?
  • Aggregatable — Can sub-results be combined into coherent output?
  • Reviewable — Can you verify each sub-task independently?
  • Accountable — Can you identify ownership for each sub-task?
  • Limited dependencies — Are cross-task dependencies minimal?
  • Logged — Can you capture full audit trail?
  • Estimable — Can you estimate cost/time for each sub-task?
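The checklist above can be turned into a simple pre-flight gate. The sketch below is illustrative Python; the criterion keys and the `passes_parallel_test` helper are assumptions for the example, not part of any swarm tooling:

```python
# Hypothetical pre-flight gate for the PARALLEL test: every criterion
# must hold before a task is approved for swarm execution.
PARALLEL_CRITERIA = [
    "partitionable", "aggregatable", "reviewable", "accountable",
    "limited_dependencies", "logged", "estimable",
]

def passes_parallel_test(answers: dict) -> tuple:
    """Return (ok, failed_criteria) for a dict of criterion -> bool."""
    failed = [c for c in PARALLEL_CRITERIA if not answers.get(c, False)]
    return (not failed, failed)

ok, failed = passes_parallel_test({
    "partitionable": True, "aggregatable": True, "reviewable": True,
    "accountable": True, "limited_dependencies": False,
    "logged": True, "estimable": True,
})
# A single failing criterion is enough to rule out swarm mode.
```
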

Core Patterns

Pattern 1: Domain-Based Decomposition

Divide work by architectural domain.

Task: Migrate monolith to microservices

Orchestrator
├── Frontend Agent (React components)
├── Backend API Agent (Express routes)
├── Database Agent (schema, migrations)
├── Worker Agent (background jobs)
└── Integration Test Agent (depends on all)

Implementation:

kimi --mode swarm \
  --decomposition domain \
  --checkpoint-interval 3 \
  "Migrate user service to microservices architecture"

Governance Notes:

  • Each domain agent receives only relevant context
  • Integration agent waits for domain agents (explicit dependency)
  • Domain expertise documented in agent configuration

When to Use: Large architectural changes affecting multiple layers
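The explicit dependency in this decomposition (the integration agent waits on every domain agent) can be expressed as wave-based scheduling with Python's standard `graphlib`. The agent names mirror the tree above; the scheduling sketch itself is illustrative, not part of the kimi CLI:

```python
from graphlib import TopologicalSorter

# Dependency graph for the domain decomposition above: each key maps an
# agent to the set of agents it must wait for.
deps = {
    "frontend": set(), "backend_api": set(),
    "database": set(), "worker": set(),
    "integration_test": {"frontend", "backend_api", "database", "worker"},
}

ts = TopologicalSorter(deps)
ts.prepare()
waves = []  # agents in the same wave have no mutual dependencies
while ts.is_active():
    ready = sorted(ts.get_ready())
    waves.append(ready)       # this wave can run in parallel
    ts.done(*ready)
```

With this graph, all four domain agents land in the first wave and the integration-test agent runs alone in the second.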

Pattern 2: Component-Based Decomposition

Divide work by discrete components.

Task: Add OAuth2 authentication

Orchestrator
├── Login Component Agent (UI)
├── Token Service Agent (backend)
├── Middleware Agent (auth checks)
├── Database Agent (user sessions)
└── E2E Test Agent (depends on all)

Implementation:

kimi --mode swarm \
  --decomposition component \
  --max-agents 10 \
  "Implement OAuth2 authentication flow"

Governance Notes:

  • Component interfaces defined before swarm launch
  • Contract tests between components
  • Clear ownership per component

When to Use: Feature implementation spanning multiple services/modules

Pattern 3: Data-Based Decomposition

Divide work by data partitions.

Task: Process and migrate user data

Orchestrator
├── Shard A Agent (users A-F)
├── Shard B Agent (users G-M)
├── Shard C Agent (users N-S)
├── Shard D Agent (users T-Z)
└── Aggregation Agent (depends on all shards)

Implementation:

kimi --mode swarm \
  --decomposition data \
  --shards 4 \
  --shard-key "user_id" \
  "Migrate user preferences to new schema"

Governance Notes:

  • Idempotency required (safe to retry)
  • Shard boundaries must not overlap
  • Aggregation validates completeness

When to Use: Batch processing, data migrations, ETL workflows
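The shard-routing and non-overlap rules in the governance notes can be sketched with the A-F/G-M/N-S/T-Z boundaries from the example. The `shard_for` helper and shard map are hypothetical, for illustration only:

```python
import string

# Shard map mirroring the example above (users A-F, G-M, N-S, T-Z).
SHARDS = {
    "A": ("A", "F"), "B": ("G", "M"), "C": ("N", "S"), "D": ("T", "Z"),
}

def shard_for(user_name: str) -> str:
    """Route a user to exactly one shard by first letter."""
    first = user_name[0].upper()
    for shard, (lo, hi) in SHARDS.items():
        if lo <= first <= hi:
            return shard
    # An uncovered letter means the shard boundaries are incomplete.
    raise ValueError(f"no shard covers {first!r}")

# Completeness check: every letter must route somewhere without raising.
coverage = {letter: shard_for(letter) for letter in string.ascii_uppercase}
```

A real migration would also record per-shard row counts so the aggregation agent can validate completeness against the source table.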

Pattern 4: Stage-Based Decomposition

Divide by pipeline stages.

Task: CI/CD pipeline optimization

Orchestrator
├── Build Agent (compile, bundle)
├── Test Agent (unit, integration)
├── Security Agent (SAST, dependency scan)
├── Deploy Agent (staging)
└── Verify Agent (smoke tests, depends on deploy)

Implementation:

kimi --mode swarm \
  --decomposition pipeline \
  --stage-gates \
  "Optimize CI/CD pipeline for faster builds"

Governance Notes:

  • Stage gates require approval before progression
  • Each stage has defined success criteria
  • Rollback triggers defined per stage

When to Use: CI/CD improvements, release automation, quality gates

Pattern 5: Expertise-Based Decomposition

Divide by specialized knowledge areas.

Task: Comprehensive security audit

Orchestrator
├── Authentication Agent (auth flows)
├── Input Validation Agent (sanitization)
├── Secrets Agent (credential handling)
├── API Security Agent (endpoints)
└── Reporting Agent (aggregate findings)

Implementation:

kimi --mode swarm \
  --decomposition expertise \
  --expert-config ".security-experts.yaml" \
  "Conduct security audit of payment module"

Governance Notes:

  • Expert agents configured with domain-specific rules
  • Findings require human security review
  • Severity classification per finding

When to Use: Security audits, compliance checks, specialized reviews

Advanced Patterns

Pattern 6: Hierarchical Swarm

Nested swarms for very large tasks.

Orchestrator (Level 1)
├── Service A Lead
│   ├── A-Frontend Agent
│   ├── A-Backend Agent
│   └── A-Test Agent
├── Service B Lead
│   ├── B-Frontend Agent
│   ├── B-Backend Agent
│   └── B-Test Agent
└── Integration Lead
    └── Cross-Service Test Agent

Governance Notes:

  • Each lead manages their own sub-swarm
  • Clear escalation paths between levels
  • Aggregated reporting at each level

When to Use: Enterprise-scale migrations (100+ services)

Pattern 7: Competitive Swarm

Multiple agents solve the same problem; the best result is selected.

Orchestrator
├── Algorithm A Agent (recursive approach)
├── Algorithm B Agent (iterative approach)
├── Algorithm C Agent (functional approach)
└── Evaluation Agent (benchmarks, selects winner)

Governance Notes:

  • Objective evaluation criteria defined upfront
  • Human review of selected solution
  • Alternative approaches documented

When to Use: Algorithm optimization, architectural decisions, complex problem-solving
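The evaluation step reduces to: run every candidate, discard incorrect ones, then pick a winner by an objective benchmark. The three candidate functions below are illustrative stand-ins for agent output, and the selection logic is a sketch rather than a real evaluation agent:

```python
import timeit

# Three candidate implementations of the same task (sum of 0..n), one per
# "approach" in the tree above. All are made-up examples.
def sum_recursive(n):
    return 0 if n == 0 else n + sum_recursive(n - 1)

def sum_iterative(n):
    total = 0
    for i in range(n + 1):
        total += i
    return total

def sum_closed_form(n):
    return n * (n + 1) // 2

candidates = {"recursive": sum_recursive, "iterative": sum_iterative,
              "closed_form": sum_closed_form}
expected = sum(range(501))

# Objective criteria, defined upfront: correctness first, then speed.
correct = {name: fn for name, fn in candidates.items() if fn(500) == expected}
timings = {name: timeit.timeit(lambda fn=fn: fn(500), number=1000)
           for name, fn in correct.items()}
winner = min(timings, key=timings.get)
```

Per the governance notes, the selected solution would still go to human review, with the losing approaches documented alongside their benchmark numbers.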

Pattern 8: Verification Swarm

Separate agents implement and verify.

Orchestrator
├── Implementation Agent (generates code)
├── Review Agent 1 (checks correctness)
├── Review Agent 2 (checks performance)
├── Review Agent 3 (checks security)
└── Consolidation Agent (addresses findings)

Governance Notes:

  • Review agents use different criteria
  • Conflicting findings escalated to human
  • Final approval still requires human

When to Use: Critical path code, high-stakes implementations
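A minimal sketch of the consolidation step, assuming findings arrive as records tagged with agent, location, and verdict (all names and data below are made up): any location where reviewers disagree is escalated to a human, as the governance notes require.

```python
# Hypothetical review findings from the three review agents.
findings = [
    {"agent": "correctness", "file": "pay.ts", "line": 42, "verdict": "pass"},
    {"agent": "security",    "file": "pay.ts", "line": 42, "verdict": "fail"},
    {"agent": "performance", "file": "pay.ts", "line": 90, "verdict": "pass"},
]

def needs_human_escalation(findings: list) -> list:
    """Return locations where review agents reached conflicting verdicts."""
    by_location = {}
    for f in findings:
        by_location.setdefault((f["file"], f["line"]), set()).add(f["verdict"])
    return [loc for loc, verdicts in by_location.items() if len(verdicts) > 1]

escalations = needs_human_escalation(findings)
```
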

Implementation Guidelines

Task Decomposition Best Practices

  1. Define Clear Boundaries

    Bad:  "Handle authentication"
    Good: "Implement JWT token generation in auth.service.ts"
  2. Minimize Cross-Dependencies

    • Aim for <10% of sub-tasks having dependencies
    • Document all dependencies explicitly
    • Consider dependency order in agent scheduling
  3. Size Appropriately

    | Total Work | Sub-Tasks | Agents |
    | --- | --- | --- |
    | Small (1-10 files) | 2-3 | 2-3 |
    | Medium (10-50 files) | 5-10 | 5-10 |
    | Large (50-200 files) | 10-20 | 10-20 |
    | Enterprise (200+ files) | 20-50 | 20-50 |
  4. Include Validation Criteria

    Each sub-task must have measurable completion criteria:

    • Tests pass
    • Linting clean
    • Type checking passes
    • Human review checkpoint (for production)
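The criteria above can be encoded as a per-sub-task completion gate. This is a sketch under the assumption that check results arrive as booleans; `SubTaskResult` and `is_complete` are illustrative names, not real tooling:

```python
from dataclasses import dataclass

@dataclass
class SubTaskResult:
    tests_pass: bool
    lint_clean: bool
    types_pass: bool
    production: bool = False       # production work needs human sign-off
    human_approved: bool = False

def is_complete(r: SubTaskResult) -> bool:
    # All automated checks must pass; production work also needs the
    # human review checkpoint.
    automated = r.tests_pass and r.lint_clean and r.types_pass
    return automated and (r.human_approved if r.production else True)
```
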

Error Handling Strategies

Strategy 1: Fail Fast

kimi --mode swarm \
  --fail-fast \
  --notify-on-failure \
  "Critical production fix"

Any sub-agent failure stops the entire swarm. Use for: Critical changes where partial completion is dangerous.

Strategy 2: Continue with Logging

kimi --mode swarm \
  --continue-on-failure \
  --failure-log "/var/log/swarm-failures.log" \
  "Batch documentation updates"

Failed agents are logged; the others continue. Use for: Non-critical batch work where partial completion is acceptable.

Strategy 3: Retry with Backoff

kimi --mode swarm \
  --retry 3 \
  --retry-delay 5s \
  --retry-backoff exponential \
  "API integration updates"

Automatic retry for transient failures. Use for: External dependency work, network-dependent tasks.
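The retry schedule implied by these flags (3 retries, 5-second base delay, exponential backoff) can be sketched as follows. `run_with_retry` is an illustrative helper, not a real kimi API; a real caller would pass `time.sleep` as the `sleep` argument:

```python
def backoff_schedule(retries: int, base_delay: float, factor: float = 2.0):
    """Delay before each retry: base, base*factor, base*factor^2, ..."""
    return [base_delay * factor ** i for i in range(retries)]

def run_with_retry(task, retries=3, base_delay=5.0, sleep=None):
    """Run task, retrying transient failures with exponential backoff."""
    delays = [0.0] + backoff_schedule(retries, base_delay)
    for attempt, delay in enumerate(delays):
        if sleep and delay:
            sleep(delay)           # wait before each retry attempt
        try:
            return task()
        except Exception:
            if attempt == retries:  # out of retries: surface the error
                raise
```
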

Strategy 4: Human Escalation

kimi --mode swarm \
  --escalate-on-failure \
  --escalation-contact "oncall@company.com" \
  --escalation-timeout 300 \
  "Complex database migration"

Human intervention is required on failure. Use for: High-risk changes requiring expert judgment.

Cost Management

Swarm execution can consume significant token budgets:

# Estimate before execution
kimi --mode swarm --estimate-only "Large refactoring task"
# Output: Estimated 500K input tokens, 200K output tokens, ~$4.00

# Set hard budget
kimi --mode swarm \
  --token-budget-input 1000000 \
  --token-budget-output 500000 \
  --action-on-budget-exceed notify-and-pause \
  "Feature implementation"

Cost Optimization Tips:

  1. Use smaller models for sub-agents when possible
  2. Cache common context across agents
  3. Set checkpoints to enable early termination if quality degrades
  4. Review decomposition plan to minimize coordination overhead
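To make the worked estimate above concrete: cost scales linearly with token counts at some per-million-token rate. The rates below are illustrative assumptions chosen to reproduce the ~$4.00 figure, not real pricing:

```python
# Assumed per-million-token rates (illustrative only, not real pricing).
RATE_INPUT_PER_M = 4.00    # USD per 1M input tokens
RATE_OUTPUT_PER_M = 10.00  # USD per 1M output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Linear token-cost model: tokens / 1M * rate, summed per direction."""
    return (input_tokens / 1_000_000 * RATE_INPUT_PER_M
            + output_tokens / 1_000_000 * RATE_OUTPUT_PER_M)

# The worked example's counts: 500K input + 200K output tokens.
cost = estimate_cost(500_000, 200_000)
```
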

Monitoring and Observability

# Real-time dashboard
kimi --mode swarm \
  --dashboard-port 8080 \
  --metrics-prometheus \
  "Production deployment"

# Structured logging
kimi --mode swarm \
  --log-format json \
  --log-destination /var/log/swarm/$(date +%Y%m%d-%H%M%S).json \
  "Audit-required change"

Key Metrics to Track:

  • Execution time per agent
  • Token usage per agent and total
  • Success/failure rates
  • Conflict count
  • Human intervention frequency
  • Time to completion vs. estimate

Anti-Patterns and Mitigations

Anti-Pattern 1: The Uncoordinated Stampede

Problem: Multiple agents editing same files simultaneously.

Symptoms:

  • Merge conflicts
  • Inconsistent changes
  • Lost work

Mitigation:

  • Clear file ownership per agent
  • Pre-execution file locking
  • Conflict detection in orchestrator
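Pre-execution ownership checking can be sketched as a map from file to claiming agents, flagging any file claimed more than once. The plan below (agent and file names) is entirely made up for illustration:

```python
from collections import defaultdict

# Hypothetical execution plan: each agent declares the files it will edit.
plan = {
    "frontend_agent": ["src/App.tsx", "src/Login.tsx"],
    "backend_agent": ["api/auth.ts", "src/Login.tsx"],  # overlaps with frontend
    "db_agent": ["migrations/001_users.sql"],
}

def find_ownership_conflicts(plan: dict) -> dict:
    """Return files claimed by more than one agent."""
    owners = defaultdict(list)
    for agent, files in plan.items():
        for f in files:
            owners[f].append(agent)
    return {f: agents for f, agents in owners.items() if len(agents) > 1}

conflicts = find_ownership_conflicts(plan)
```

An orchestrator would refuse to launch until the plan is conflict-free.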

Anti-Pattern 2: The Cascade Failure

Problem: One failed agent causes the entire swarm to collapse.

Symptoms:

  • All agents stop on first failure
  • Partial work lost
  • No recovery mechanism

Mitigation:

  • --continue-on-failure for non-critical tasks
  • Dependency isolation
  • Checkpoint recovery

Anti-Pattern 3: The Silent Conflict

Problem: Agents produce contradictory outputs that auto-merge.

Symptoms:

  • Inconsistent code style
  • Conflicting implementations
  • Test failures post-merge

Mitigation:

  • Conflict detection rules
  • Human review for divergent outputs
  • Standardized patterns enforced

Anti-Pattern 4: The Runaway Swarm

Problem: Excessive token usage or indefinite execution.

Symptoms:

  • Budget exceeded
  • Agents stuck in loops
  • No progress visibility

Mitigation:

  • Token budgets with enforcement
  • Timeouts per agent and total
  • Progress checkpoints

Anti-Pattern 5: The Over-Decomposition

Problem: Too many sub-tasks create coordination overhead.

Symptoms:

  • More time coordinating than working
  • Excessive inter-agent messaging
  • Diminishing returns

Mitigation:

  • Target 5-20 sub-tasks per swarm
  • Batch small related tasks
  • Measure coordination overhead

Tool-Specific Implementation

Kimi Code Agent Swarm

Configuration File (.kimi/swarm-config.yaml):

swarm:
  max_agents: 20
  decomposition: domain
  checkpoint_interval: 5
  token_budget:
    input: 1000000
    output: 500000
  error_handling:
    strategy: continue_with_logging
    max_retries: 3
  logging:
    level: debug
    destination: ./logs/swarm/
  governance:
    owner: "senior-dev@company.com"
    require_decomposition_review: true
    human_approval_checkpoints: [5, 10, 15]

Execution:

kimi --mode swarm --config .kimi/swarm-config.yaml "Migrate to TypeScript"

Custom Swarm Implementation

For teams building custom orchestration:

# Simplified swarm orchestrator pattern
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timezone

# SubAgent is assumed to be an external class wrapping one sub-task worker.

class SwarmOrchestrator:
    def __init__(self, config, llm):
        self.config = config
        self.llm = llm                  # model client used for decomposition
        self.max_agents = config.max_agents
        self.token_budget = config.token_budget
        self.agents = []
        self.checkpoints = []

    def decompose(self, task):
        # Use AI to decompose the task into independent sub-tasks
        return self.llm.decompose(task)

    def spawn_agents(self, subtasks):
        # Cap the swarm at the configured agent limit
        for subtask in subtasks:
            if len(self.agents) >= self.max_agents:
                break
            self.agents.append(SubAgent(subtask, self.config))

    def coordinate(self):
        # Parallel execution (dependency resolution omitted for brevity)
        with ThreadPoolExecutor(max_workers=self.max_agents) as pool:
            results = list(pool.map(lambda agent: agent.run(), self.agents))
        return self.aggregate(results)

    def aggregate(self, results):
        # Combine sub-results into a single swarm output
        return results

    def checkpoint(self):
        # Save recoverable state for restart after failure
        self.checkpoints.append({
            'agents': [agent.state for agent in self.agents],
            'timestamp': datetime.now(timezone.utc),
            'tokens_used': self.token_usage(),
        })

    def token_usage(self):
        return sum(getattr(agent, 'tokens_used', 0) for agent in self.agents)