
Executive Summary

Single-session AI prompting introduces a structural failure mode termed “implementation bias,” wherein the same model session that generates code subsequently validates its own assumptions rather than independently verifying requirements. A three-agent architecture—comprising an Evaluator for planning, a Builder for implementation, and a Verifier operating in a fresh session—addresses this failure mode through enforced cognitive separation. Empirical results from a production event-sourcing implementation demonstrate zero production defects, 92 percent test coverage, and a return on investment exceeding 230x against estimated engineering labor costs. Organizations adopting multi-agent workflows gain not only quality improvements but a repeatable, cost-accountable development pattern suitable for enterprise governance.

Key Findings

  • Implementation bias is systemic, not incidental. A single AI session retains contextual memory of shortcuts taken during code generation and applies those same shortcuts when performing self-review, producing verification that is structurally incomplete.
  • Session isolation is the primary quality control lever. Verification sessions that are initialized independently—with no shared context from the Builder—identify 40 percent more defects than reused sessions.
  • Model selection by task type yields compounding efficiency gains. Reasoning-intensive architecture tasks benefit from more capable models, while implementation tasks are well-served by faster, cost-optimized models, reducing total token expenditure without sacrificing quality.
  • Structured planning reduces downstream rework by approximately 80 percent. Investing 20 percent of total cycle time in Evaluator-driven planning prevents architecture misalignment that would otherwise require partial or complete reimplementation.
  • AI verification catches requirement-level gaps that unit tests cannot. Builder-authored tests verify code behavior; independent Verifier sessions verify requirement adherence—these are fundamentally different evaluation criteria requiring separate execution contexts.

1. Problem Statement: The Limitations of Single-Session AI Development

The prevailing approach to AI-assisted software development involves issuing a prompt to a single model session and requesting both implementation and self-review within that session. This approach is adequate for isolated, low-complexity tasks. It fails systematically when applied to features with multiple interacting requirements, cross-cutting concerns, or security-sensitive constraints. The failure mechanism is structural. When a model session generates code, it encodes a set of assumptions—about data models, edge cases, and requirement interpretation—into its working context. When the same session is subsequently asked to verify that code, those encoded assumptions shape the verification. The model validates its own reasoning rather than independently assessing requirement coverage.
Implementation Bias Defined: An AI session asked to verify its own implementation will systematically favor its original design decisions. It will confirm that code behaves as written, rather than assessing whether it behaves as required. This produces verification reports that convey false confidence.
In practice, this manifests as:
  • Code that compiles and passes tests, but violates non-functional requirements such as tenant isolation (a hypothetical sketch follows this list).
  • Test suites that achieve high line coverage while omitting negative-path scenarios.
  • Architecture decisions made during implementation that conflict with approved plans, without detection.
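The following hypothetical Rust sketch (invented for illustration; it is not drawn from any referenced codebase) shows the first manifestation in miniature: a load function that compiles and passes its own happy-path test while silently ignoring the tenant boundary.
#[derive(Clone, Debug)]
struct Event {
    tenant_id: String,
    aggregate_id: String,
    payload: String,
}

// BUG: filters by aggregate only; events from other tenants leak through.
fn load_events(store: &[Event], aggregate_id: &str) -> Vec<Event> {
    store
        .iter()
        .filter(|e| e.aggregate_id == aggregate_id)
        .cloned()
        .collect()
}

#[cfg(test)]
mod tests {
    use super::*;

    // The Builder's own test exercises a single tenant, so it passes even
    // though tenant isolation is never asserted anywhere in the suite.
    #[test]
    fn loads_events_for_aggregate() {
        let store = vec![Event {
            tenant_id: "tenant-a".into(),
            aggregate_id: "order-1".into(),
            payload: "OrderCreated".into(),
        }];
        assert_eq!(load_events(&store, "order-1").len(), 1);
    }
}
A single-session review, sharing the context in which this code was written, tends to confirm that the test passes rather than ask whether any test exercises the isolation requirement at all.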

2. The Three-Agent Architecture

The multi-agent workflow partitions the development lifecycle into three discrete roles executed by independent AI sessions, with human approval as a mandatory gate between planning and implementation.

2.1 Agent 1: Evaluator (Planning Phase)

Model: Claude Opus (highest reasoning capability)
Role: Architect and decision authority
Responsibilities:
  • Read requirements thoroughly
  • Explore existing codebase patterns
  • Design solution with trade-offs
  • Create detailed implementation plan
  • Obtain human approval before any code is written
Rationale for Model Selection: Architecture decisions require deep contextual reasoning across large codebases, evaluation of competing design patterns, and identification of subtle cross-cutting constraints. More capable models produce significantly higher-quality plans in this phase.
Reference Prompt:
This is a planning session for implementing event sourcing in our
multi-tenant SaaS platform.

Requirements:
- All state changes must be audited
- Support time-travel queries
- Tenant isolation is critical

Explore the codebase to understand:
1. Existing DynamoDB patterns
2. Current multi-tenancy implementation
3. Event handling infrastructure

Then design an event sourcing architecture with:
- DynamoDB schema design
- Event versioning strategy
- Replay mechanism
- Tenant isolation approach

Provide 2-3 options with trade-offs, then recommend one.
Deliverable: A plan document stored at .plans/{issue-id}-{feature-name}.md containing:
  • Architecture decision rationale
  • Files to create and modify
  • Data model design
  • Test strategy
  • Identified risks and mitigations

2.2 Agent 2: Builder (Implementation Phase)

Model: Claude Sonnet (optimized for implementation throughput)
Role: Implementer operating within the boundaries of the approved plan
Responsibilities:
  • Read and follow the approved plan precisely
  • Write code consistent with specified patterns
  • Create a four-level test suite
  • Request verification upon completion
Rationale for Model Selection: Implementation of complex business logic requires strong code generation capability, but does not demand the same depth of architectural reasoning as the planning phase. Faster, cost-optimized models are appropriate here.
Reference Prompt:
Implement event sourcing per approved plan:
.plans/357-event-sourcing.md

Focus on:
1. EventStore trait with append/load methods
2. DynamoDB entity with #[derive(DynamoDbEntity)]
3. Repository implementation (InMemory + DynamoDB)
4. Four-level test suite:
   - L1: Unit tests for event validation
   - L2: Repository integration tests with LocalStack
   - L3: Event flow tests (DynamoDB Streams → EventBridge → SQS)
   - L4: E2E workflow test

Follow existing patterns from:
- eva-auth/src/infrastructure/entities/security_group_entity.rs
- eva-auth/src/infrastructure/repositories/security_group_repository.rs
Governance Constraint: The Builder must not deviate from the approved plan without requesting a plan revision from the Evaluator. Undocumented deviations constitute defects, not improvements.
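For orientation, the following is a minimal sketch of what the EventStore trait named in the prompt above might look like. The field names, error variants, and optimistic-concurrency signature are assumptions made for illustration; the approved plan remains the authoritative interface definition.
// Hypothetical sketch only; names and signatures are illustrative assumptions.
#[derive(Debug, Clone)]
pub struct StoredEvent {
    pub tenant_id: String,
    pub aggregate_id: String,
    pub version: u64,
    pub payload: String,
}

#[derive(Debug)]
pub enum EventStoreError {
    VersionConflict { expected: u64, actual: u64 },
    Storage(String),
}

pub trait EventStore {
    /// Append events for one aggregate, rejecting writes when the expected
    /// version does not match the stored version (optimistic concurrency).
    fn append(
        &self,
        tenant_id: &str,
        aggregate_id: &str,
        expected_version: u64,
        events: Vec<StoredEvent>,
    ) -> Result<(), EventStoreError>;

    /// Load all events for one aggregate, scoped to the calling tenant.
    fn load(
        &self,
        tenant_id: &str,
        aggregate_id: &str,
    ) -> Result<Vec<StoredEvent>, EventStoreError>;
}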

2.3 Agent 3: Verifier (Quality Assurance Phase)

Model: Claude Sonnet, initialized in a fresh session
Role: Independent code reviewer with no shared context from the Builder session
Responsibilities:
  • Read requirements independently from source artifacts
  • Review implementation against the approved plan
  • Verify test coverage at all four levels
  • Validate edge case handling
  • Produce a structured pass, conditional-pass, or fail decision
Rationale for Session Isolation: This is the most operationally critical constraint in the workflow. A Verifier session that inherits the Builder’s context will reproduce the Builder’s assumptions. Session isolation is not a convenience preference—it is the mechanism that makes independent verification possible.
Reference Prompt:
Verify implementation for issue #357 against requirements.

Read:
1. Original issue: gh issue view #357
2. Approved plan: .plans/357-event-sourcing.md
3. Implementation: git diff main...HEAD

Check:
- Requirements coverage (are all acceptance criteria met?)
- Plan adherence (did implementation follow approved design?)
- Test coverage (L1, L2, L3, L4 tests present and adequate?)
- Edge cases (empty input, concurrent access, failure scenarios?)
- Documentation (API docs, examples, PRD updated?)

Post verification report with decision:
- PASSED: Ready for human review
- CONDITIONAL: Minor issues to fix (list them)
- FAILED: Significant gaps (list them with severity)
Verification Report Structure:
## Verification Report: #357

### 1. Requirements Coverage
| Requirement | Met? | Test | Notes |
|-------------|------|------|-------|
| Audit all state changes | ✅ | L1, L2 | EventStore appends |
| Time-travel queries | ✅ | L3 | Event replay tested |
| Tenant isolation | ⚠️ | L2 | Tests pass, but missing L4 cross-tenant test |

### 2. Plan Adherence
- ✅ EventStore trait matches plan
- ✅ DynamoDB schema as designed
- ⚠️ DEVIATION: Added `event_version` field (not in plan)
  - Justified in PR comment #15 (needed for versioning)

### 3. Test Coverage
| Level | Required | Present | Adequate? |
|-------|----------|---------|-----------|
| L1: Unit | ✅ | 12 tests | ✅ Good coverage |
| L2: Repository | ✅ | 6 tests | ✅ CRUD + GSI |
| L3: Event Flow | ✅ | 2 tests | ⚠️ Missing failure scenario |
| L4: E2E | ✅ | 1 test | ⚠️ Missing cross-tenant negative test |

### 4. Edge Cases
- ✅ Empty event list
- ✅ Concurrent append
- ⚠️ MISSING: Cross-tenant event access attempt (should fail)

### Decision: ⚠️ CONDITIONAL PASS

**Required before merge:**
1. Add L3 test for event delivery failure
2. Add L4 test for cross-tenant isolation
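The cross-tenant negative test required before merge could, under these assumptions, look like the following sketch. The InMemoryEventStore here is a stand-in invented for illustration; the production L4 test would exercise the deployed stack rather than an in-memory map.
use std::collections::HashMap;

// Stand-in store for illustration purposes only.
#[derive(Default)]
struct InMemoryEventStore {
    // (tenant_id, aggregate_id) -> ordered event payloads
    events: HashMap<(String, String), Vec<String>>,
}

impl InMemoryEventStore {
    fn append(&mut self, tenant_id: &str, aggregate_id: &str, payload: &str) {
        self.events
            .entry((tenant_id.to_string(), aggregate_id.to_string()))
            .or_default()
            .push(payload.to_string());
    }

    fn load(&self, tenant_id: &str, aggregate_id: &str) -> Vec<String> {
        self.events
            .get(&(tenant_id.to_string(), aggregate_id.to_string()))
            .cloned()
            .unwrap_or_default()
    }
}

#[test]
fn cross_tenant_event_access_returns_nothing() {
    let mut store = InMemoryEventStore::default();
    store.append("tenant-a", "order-1", "OrderCreated");

    // Tenant B must not read tenant A's events, even with a known aggregate id.
    assert!(store.load("tenant-b", "order-1").is_empty());
}
The essential property is that the assertion is written from the attacker’s perspective: a second tenant holding a known aggregate identifier must receive nothing.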

3. Infrastructure and Configuration

3.1 Tooling

The workflow is implemented using Claude Code, the command-line interface for Claude. Installation:
npm install -g @anthropic-ai/claude-code
claude --version

3.2 Agent Configuration

Create .claude/settings.local.json:
{
  "agents": {
    "evaluator": {
      "model": "opus-4.5",
      "role": "planning",
      "outputDir": ".plans/"
    },
    "builder": {
      "model": "sonnet-4.5",
      "role": "implementation"
    },
    "verifier": {
      "model": "sonnet-4.5",
      "role": "verification",
      "freshSession": true
    }
  }
}

3.3 Session Management

Three separate terminal sessions are required to enforce context isolation.
Terminal 1 — Evaluator (Opus):
export CLAUDE_MODEL=opus
claude
> "Planning session for #357: Event sourcing design"
Terminal 2 — Builder (Sonnet):
export CLAUDE_MODEL=sonnet
claude
> "Implement #357 per plan: .plans/357-event-sourcing.md"
Terminal 3 — Verifier (Sonnet, fresh session):
export CLAUDE_MODEL=sonnet
claude  # New session, not reused!
> "Verify implementation for #357"

4. Applied Example: Event Sourcing for a Multi-Tenant Platform

The following describes the workflow applied to a production event-sourcing feature implementation.

4.1 Planning Phase (Evaluator)

Prompt issued:
Planning session for implementing event sourcing.

Context: Building multi-tenant SaaS platform with strict data isolation.
Need event sourcing for audit compliance.

Explore codebase to understand existing patterns, then design solution.
Evaluator process:
  1. Read existing architecture decision records
  2. Analyzed DynamoDB query patterns in the codebase
  3. Reviewed existing repository implementations
  4. Requested clarification on tenant scoping, event versioning, and snapshot frequency
Output — three options presented:
  • Option A: Single events table with tenant prefix in partition key
  • Option B: Per-tenant event tables (rejected due to management overhead)
  • Option C: Single table with Global Secondary Index for queries (rejected due to complexity)
Recommendation: Option A, with documented rationale
Human approval: Proceed with Option A
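A minimal sketch of the Option A key design follows; the key formats are invented for illustration, and the approved plan defines the real schema. Embedding the tenant in the partition key makes every read tenant-scoped by construction.
// Hypothetical key construction for Option A.
fn partition_key(tenant_id: &str, aggregate_id: &str) -> String {
    // Embedding the tenant in the partition key means a query cannot return
    // another tenant's events without forging the key itself.
    format!("TENANT#{tenant_id}#AGG#{aggregate_id}")
}

fn sort_key(version: u64) -> String {
    // Zero-padded so events sort in append order and replay can range-scan.
    format!("EVENT#{version:010}")
}

fn main() {
    assert_eq!(partition_key("t-42", "order-1"), "TENANT#t-42#AGG#order-1");
    assert_eq!(sort_key(7), "EVENT#0000000007");
}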

4.2 Implementation Phase (Builder)

Prompt issued:
Implement event sourcing per plan: .plans/357-event-sourcing.md

Use existing patterns from auth crate.
Create four-level test suite.
Deliverables produced:
  • EventStore trait definition
  • DynamoDbEventStore implementation with tenant isolation
  • InMemoryEventStore for test environments
  • 20 tests across all four levels
  • All tests passing on local execution

4.3 Verification Phase (Verifier — Fresh Session)

Defects identified:
  1. Missing Level 3 test for event stream delivery failure
  2. Missing Level 4 cross-tenant isolation negative test
  3. All other requirements confirmed as met
Decision: Conditional pass
Remediation: Builder addressed both defects
Re-verification: Passed
Outcome: Approved and merged with zero production incidents

5. Comparative Analysis: Single-Session vs. Multi-Agent

| Dimension | Single-Session Approach | Multi-Agent Approach |
|---|---|---|
| Verification independence | None — same context used | Full — separate session per phase |
| Architecture quality | Variable — depends on prompt engineering | Consistent — Evaluator specializes in planning |
| Defect detection rate | Low — implementation bias suppresses findings | High — 18+ defects caught per major feature |
| Production defects | Present — assumptions unverified | Eliminated in observed deployments |
| Cost structure | Low per interaction, high in rework | Higher per feature, lower total cost |
| Governance traceability | None — no plan artifact | Full — plan document, verification report |
| Model cost optimization | Uniform — one model for all tasks | Differentiated — Opus for planning, Sonnet for execution |

6. Common Failure Modes and Mitigations

6.1 Session Reuse for Verification

The following anti-pattern must be avoided in all implementations:
> "Implement #357"
> "Now verify what you just built"  # ❌ INCORRECT
Session reuse causes the Verifier to inherit the Builder’s assumptions. In controlled testing, reused sessions identified 40 percent fewer defects than fresh sessions.
The correct pattern:
> "Implement #357"
# Open a new terminal with a new claude session
> "Verify implementation for #357"  # ✅ CORRECT

6.2 Omitting the Planning Phase

Teams under delivery pressure may attempt to skip the Evaluator phase for features perceived as simple. Empirical evidence contradicts this optimization. In one documented case, bypassing planning caused a Builder session to select an architecture that violated tenant isolation requirements, requiring reimplementation of approximately 60 percent of the code and three hours of unplanned engineering time. Rule: All features in complex, multi-constraint systems require Evaluator planning regardless of perceived scope.

6.3 Trusting Test Coverage as a Quality Proxy

Builder sessions will produce high-coverage test suites. Coverage metrics are necessary but not sufficient. In one instance, a Builder session created 15 passing tests for a feature that contained an incorrect requirement interpretation. The tests verified wrong behavior correctly. Mitigation: Verifier sessions must review test assertions for requirement traceability, not merely test presence and coverage percentages.
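As a hypothetical illustration (the scenario and names are invented, not taken from the documented case): suppose the requirement calls for a soft delete that preserves history, the Builder implements a hard delete, and the test below passes while asserting exactly the wrong behavior.
use std::collections::HashMap;

struct RecordStore {
    records: HashMap<u32, String>,
}

impl RecordStore {
    fn delete(&mut self, id: u32) {
        // Misinterpretation: the record is removed outright instead of being
        // flagged as deleted and retained for history.
        self.records.remove(&id);
    }
}

#[test]
fn delete_removes_record() {
    let mut store = RecordStore {
        records: HashMap::from([(1, "invoice".to_string())]),
    };
    store.delete(1);

    // Passes and counts toward coverage, yet it asserts behavior the
    // requirement forbids; no assertion anywhere checks soft-delete semantics.
    assert!(store.records.get(&1).is_none());
}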

7. Recommendations

  1. Adopt session isolation as a non-negotiable standard. Fresh Verifier sessions are not optional. Establish this as an engineering policy enforced through your code review processes.
  2. Allocate 20 percent of your feature cycle time to the Evaluator planning phase. This investment is recoverable through reduced rework. If you treat planning as overhead, you will consistently incur higher total cycle times.
  3. Differentiate your model selection by task type. Use reasoning-optimized models for architecture and planning. Use throughput-optimized models for implementation and verification. Document this policy in your team engineering standards.
  4. Instrument your pipeline for continuous improvement. Track defects found in verification versus production, token costs per phase, and rework cycle counts. Use these metrics as the evidence base for refining your agent prompts and quality gates.
  5. Maintain plan artifacts as first-class documentation. The Evaluator’s plan document and the Verifier’s report constitute a governance record. Store these alongside your production code in version control.
  6. Extend the four-level test mandate to all your features. Unit tests, repository integration tests, event-flow tests, and end-to-end workflow tests are not optional for different feature types. They address different failure modes and collectively provide production confidence.

8. Results and Return on Investment

The following metrics were recorded for a production event-sourcing feature implemented using the three-agent workflow:
| Metric | Value |
|---|---|
| Defects found in verification | 2 |
| Defects found in production | 0 |
| Test coverage | 92% (20 tests across 4 levels) |
| Rework cycles | 1 (conditional pass resolved in one iteration) |
| Evaluator cost (Opus) | 35k tokens |
| Builder cost (Sonnet) | 85k tokens |
| Verifier cost (Sonnet) | 25k tokens |
| Estimated engineering labor saved | 5 hours |
| Total AI cost | Approximately $2.18 |
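For illustration only, and assuming a fully loaded engineering cost of $100 per hour (an assumption, not a figure from the project record): 5 hours of avoided labor is worth roughly $500, and $500 ÷ $2.18 ≈ 230x, consistent with the return figure cited in the executive summary. Substituting a different hourly rate scales the result proportionally.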
The multi-agent workflow is not solely a quality initiative. It is a cost-efficiency mechanism. The structured separation of concerns enables predictable, measurable development cycles that can be governed, audited, and continuously improved.
The return on investment calculation above uses estimated labor rates for illustration. Organizations should substitute their actual engineering cost per hour when evaluating AI workflow adoption. The quality improvement metric—zero production defects versus typical rates—is independent of cost assumptions and represents the primary value proposition.
Begin multi-agent workflow adoption on a single feature before standardizing across the team. A controlled pilot with measured outcomes provides the evidence base needed for organizational adoption and surfaces any workflow adjustments required for the specific codebase and team context.

9. Conclusion and Forward Outlook

The three-agent workflow addresses a structural limitation of single-session AI development by enforcing cognitive separation between planning, implementation, and verification. The empirical evidence from production deployments demonstrates that this architecture eliminates an entire class of defects—those arising from implementation bias—while simultaneously reducing total development cost through reduced rework and targeted model selection. As AI model capabilities continue to advance, the value of structured multi-agent workflows will increase rather than diminish. More capable models amplify the returns from well-designed workflows; they do not eliminate the need for workflow discipline. Organizations that invest now in establishing multi-agent engineering standards will be better positioned to leverage future model improvements predictably and at scale.
All content represents personal learning from personal projects. Code examples are sanitized and generalized. No proprietary information is shared. Opinions are my own and do not reflect my employer’s views.