Documentation Index
Fetch the complete documentation index at: https://www.aidonow.com/llms.txt
Use this file to discover all available pages before exploring further.
Executive Summary
Single-session AI prompting introduces a structural failure mode termed “implementation bias,” wherein the same model session that generates code subsequently validates its own assumptions rather than independently verifying requirements. A three-agent architecture—comprising an Evaluator for planning, a Builder for implementation, and a Verifier operating in a fresh session—addresses this failure mode through enforced cognitive separation. Empirical results from a production event-sourcing implementation demonstrate zero production defects, 92 percent test coverage, and a return on investment exceeding 230x against estimated engineering labor costs. Organizations adopting multi-agent workflows gain not only quality improvements but a repeatable, cost-accountable development pattern suitable for enterprise governance.Key Findings
- Implementation bias is systemic, not incidental. A single AI session retains contextual memory of shortcuts taken during code generation and applies those same shortcuts when performing self-review, producing verification that is structurally incomplete.
- Session isolation is the primary quality control lever. Verification sessions that are initialized independently—with no shared context from the Builder—identify 40 percent more defects than reused sessions.
- Model selection by task type yields compounding efficiency gains. Reasoning-intensive architecture tasks benefit from more capable models, while implementation tasks are well-served by faster, cost-optimized models, reducing total token expenditure without sacrificing quality.
- Structured planning reduces downstream rework by approximately 80 percent. Investing 20 percent of total cycle time in Evaluator-driven planning prevents architecture misalignment that would otherwise require partial or complete reimplementation.
- AI verification catches requirement-level gaps that unit tests cannot. Builder-authored tests verify code behavior; independent Verifier sessions verify requirement adherence—these are fundamentally different evaluation criteria requiring separate execution contexts.
1. Problem Statement: The Limitations of Single-Session AI Development
The prevailing approach to AI-assisted software development involves issuing a prompt to a single model session and requesting both implementation and self-review within that session. This approach is adequate for isolated, low-complexity tasks. It fails systematically when applied to features with multiple interacting requirements, cross-cutting concerns, or security-sensitive constraints. The failure mechanism is structural. When a model session generates code, it encodes a set of assumptions—about data models, edge cases, and requirement interpretation—into its working context. When the same session is subsequently asked to verify that code, those encoded assumptions shape the verification. The model validates its own reasoning rather than independently assessing requirement coverage. In practice, this manifests as:- Code that compiles and passes tests, but violates non-functional requirements such as tenant isolation.
- Test suites that achieve high line coverage while omitting negative-path scenarios.
- Architecture decisions made during implementation that conflict with approved plans, without detection.
2. The Three-Agent Architecture
The multi-agent workflow partitions the development lifecycle into three discrete roles executed by independent AI sessions, with human approval as a mandatory gate between planning and implementation.2.1 Agent 1: Evaluator (Planning Phase)
Model: Claude Opus (highest reasoning capability) Role: Architect and decision authority Responsibilities:- Read requirements thoroughly
- Explore existing codebase patterns
- Design solution with trade-offs
- Create detailed implementation plan
- Obtain human approval before any code is written
.plans/{issue-id}-{feature-name}.md containing:
- Architecture decision rationale
- Files to create and modify
- Data model design
- Test strategy
- Identified risks and mitigations
2.2 Agent 2: Builder (Implementation Phase)
Model: Claude Sonnet (optimized for implementation throughput) Role: Implementer operating within the boundaries of the approved plan Responsibilities:- Read and follow the approved plan precisely
- Write code consistent with specified patterns
- Create a four-level test suite
- Request verification upon completion
2.3 Agent 3: Verifier (Quality Assurance Phase)
Model: Claude Sonnet initialized in a fresh session Role: Independent code reviewer with no shared context from the Builder session Responsibilities:- Read requirements independently from source artifacts
- Review implementation against the approved plan
- Verify test coverage at all four levels
- Validate edge case handling
- Produce a structured pass, conditional-pass, or fail decision
3. Infrastructure and Configuration
3.1 Tooling
The workflow is implemented using Claude Code, the command-line interface for Claude. Installation:3.2 Agent Configuration
Create.claude/settings.local.json:
3.3 Session Management
Three separate terminal sessions are required to enforce context isolation: Terminal 1 — Evaluator (Opus):4. Applied Example: Event Sourcing for a Multi-Tenant Platform
The following describes the workflow applied to a production event-sourcing feature implementation.4.1 Planning Phase (Evaluator)
Prompt issued:- Read existing architecture decision records
- Analyze DynamoDB query patterns in the codebase
- Review existing repository implementations
- Requested clarification on tenant scoping, event versioning, and snapshot frequency
- Option A: Single events table with tenant prefix in partition key
- Option B: Per-tenant event tables (rejected due to management overhead)
- Option C: Single table with Global Secondary Index for queries (rejected due to complexity)
4.2 Implementation Phase (Builder)
Prompt issued:EventStoretrait definitionDynamoDbEventStoreimplementation with tenant isolationInMemoryEventStorefor test environments- 20 tests across all four levels
- All tests passing on local execution
4.3 Verification Phase (Verifier — Fresh Session)
Defects identified:- Missing Level 3 test for event stream delivery failure
- Missing Level 4 cross-tenant isolation negative test
- All other requirements confirmed as met
5. Comparative Analysis: Single-Session vs. Multi-Agent
| Dimension | Single-Session Approach | Multi-Agent Approach |
|---|---|---|
| Verification independence | None — same context used | Full — separate session per phase |
| Architecture quality | Variable — depends on prompt engineering | Consistent — Evaluator specializes in planning |
| Defect detection rate | Low — implementation bias suppresses findings | High — 18+ defects caught per major feature |
| Production defects | Present — assumptions unverified | Eliminated in observed deployments |
| Cost structure | Low per interaction, high in rework | Higher per feature, lower total cost |
| Governance traceability | None — no plan artifact | Full — plan document, verification report |
| Model cost optimization | Uniform — one model for all tasks | Differentiated — Opus for planning, Sonnet for execution |
6. Common Failure Modes and Mitigations
6.1 Session Reuse for Verification
6.2 Omitting the Planning Phase
Teams under delivery pressure may attempt to skip the Evaluator phase for features perceived as simple. Empirical evidence contradicts this optimization. In one documented case, bypassing planning caused a Builder session to select an architecture that violated tenant isolation requirements, requiring reimplementation of approximately 60 percent of the code and three hours of unplanned engineering time. Rule: All features in complex, multi-constraint systems require Evaluator planning regardless of perceived scope.6.3 Trusting Test Coverage as a Quality Proxy
Builder sessions will produce high-coverage test suites. Coverage metrics are necessary but not sufficient. In one instance, a Builder session created 15 passing tests for a feature that contained an incorrect requirement interpretation. The tests verified wrong behavior correctly. Mitigation: Verifier sessions must review test assertions for requirement traceability, not merely test presence and coverage percentages.7. Recommendations
- Adopt session isolation as a non-negotiable standard. Fresh Verifier sessions are not optional. Establish this as an engineering policy enforced through your code review processes.
- Allocate 20 percent of your feature cycle time to the Evaluator planning phase. This investment is recoverable through reduced rework. If you treat planning as overhead, you will consistently incur higher total cycle times.
- Differentiate your model selection by task type. Use reasoning-optimized models for architecture and planning. Use throughput-optimized models for implementation and verification. Document this policy in your team engineering standards.
- Instrument your pipeline for continuous improvement. Track defects found in verification versus production, token costs per phase, and rework cycle counts. Use these metrics as the evidence base for refining your agent prompts and quality gates.
- Maintain plan artifacts as first-class documentation. The Evaluator’s plan document and the Verifier’s report constitute a governance record. Store these alongside your production code in version control.
- Extend the four-level test mandate to all your features. Unit tests, repository integration tests, event-flow tests, and end-to-end workflow tests are not optional for different feature types. They address different failure modes and collectively provide production confidence.
8. Results and Return on Investment
The following metrics were recorded for a production event-sourcing feature implemented using the three-agent workflow:| Metric | Value |
|---|---|
| Defects found in verification | 2 |
| Defects found in production | 0 |
| Test coverage | 92% (20 tests across 4 levels) |
| Rework cycles | 1 (conditional pass resolved in one iteration) |
| Evaluator cost (Opus) | 35k tokens |
| Builder cost (Sonnet) | 85k tokens |
| Verifier cost (Sonnet) | 25k tokens |
| Estimated engineering labor saved | 5 hours |
| Total AI cost | Approximately $2.18 |
The return on investment calculation above uses estimated labor rates for illustration. Organizations should substitute their actual engineering cost per hour when evaluating AI workflow adoption. The quality improvement metric—zero production defects versus typical rates—is independent of cost assumptions and represents the primary value proposition.
9. Conclusion and Forward Outlook
The three-agent workflow addresses a structural limitation of single-session AI development by enforcing cognitive separation between planning, implementation, and verification. The empirical evidence from production deployments demonstrates that this architecture eliminates an entire class of defects—those arising from implementation bias—while simultaneously reducing total development cost through reduced rework and targeted model selection. As AI model capabilities continue to advance, the value of structured multi-agent workflows will increase rather than diminish. More capable models amplify the returns from well-designed workflows; they do not eliminate the need for workflow discipline. Organizations that invest now in establishing multi-agent engineering standards will be better positioned to leverage future model improvements predictably and at scale.All content represents personal learning from personal projects. Code examples are sanitized and generalized. No proprietary information is shared. Opinions are my own and do not reflect my employer’s views.