
The Real Story

This isn’t a series about building a SaaS platform. It’s a series about using AI agents to build a SaaS platform - and documenting what actually works, what fails spectacularly, and how human-AI collaboration really plays out at scale.
Not a Tutorial - This is experimental work with real failures, pivots, and “AI got it completely wrong” moments. The SaaS platform is the vehicle for testing AI workflows. The AI workflows are the point.

YouTube Shorts - 60-Second Summaries

Week 1: Multi-Agent Setup

Setting up Evaluator, Builder, and Verifier workflow

Week 2: Plan-Implement-Verify

Three-phase workflow with quality gates

Week 3: AI Event Sourcing

How AI helped and hindered event sourcing design

Week 4: When AI Excels

Boilerplate generation and systematic work wins

Week 5: When AI Fails

Architecture mistakes and cascading errors

Week 6: AWS Runtime

Breaking changes and macro evolution

Week 7: Config Governance

Middleware patterns and systematic implementation

Week 8: Token Optimization

Reducing token usage through smart context management

Week 9: AI Gaming Completion

When AI says ‘done’ but isn’t

Week 10: AI Agents Evolution

How 15 AI agents found their voice over 10 weeks

The Experiment

Question: Can multi-agent AI workflows build production-grade systems faster without sacrificing quality?

Hypothesis: Yes, if:
  • Humans handle architecture and security decisions
  • AI handles boilerplate, patterns, and testing
  • Independent verification catches AI mistakes
  • Clear workflows prevent AI from hallucinating requirements
Testing Ground: A multi-tenant SaaS platform with:
  • Event sourcing (complex pattern, good test for AI)
  • DynamoDB single-table design (AI struggles here)
  • Rust macros (AI excels at boilerplate)
  • Four-level testing (can AI generate good tests?)
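Event sourcing rebuilds current state by replaying an ordered log of events, which is what makes it a demanding test case for AI. A minimal Rust sketch of the idea; the `AccountEvent` and `Account` types are illustrative, not the platform's actual entities:

```rust
// Hypothetical event type for one aggregate; real event enums are richer.
#[derive(Debug, Clone)]
enum AccountEvent {
    Opened { owner: String },
    Credited { amount: i64 },
    Debited { amount: i64 },
}

#[derive(Debug, Default)]
struct Account {
    owner: String,
    balance: i64,
}

impl Account {
    // Apply a single event to the current state.
    fn apply(mut self, event: &AccountEvent) -> Self {
        match event {
            AccountEvent::Opened { owner } => self.owner = owner.clone(),
            AccountEvent::Credited { amount } => self.balance += amount,
            AccountEvent::Debited { amount } => self.balance -= amount,
        }
        self
    }

    // The core of event sourcing: fold the event log into current state.
    fn replay(events: &[AccountEvent]) -> Self {
        events.iter().fold(Account::default(), Account::apply)
    }
}

fn main() {
    let events = vec![
        AccountEvent::Opened { owner: "acme".into() },
        AccountEvent::Credited { amount: 100 },
        AccountEvent::Debited { amount: 30 },
    ];
    let account = Account::replay(&events);
    println!("{} {}", account.owner, account.balance); // acme 70
}
```

The state is never stored directly; it is always derivable from the log, which is exactly the invariant that is easy for an AI to violate silently.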

The AI Workflow

Three Core Agents

Evaluator (Claude Opus):
  • Architecture planning
  • Trade-off analysis
  • ADR creation
  • Design decisions
Builder (Claude Sonnet + GitHub Copilot):
  • Implementation
  • Boilerplate generation
  • Pattern application
  • Test writing
Verifier (Claude Sonnet - Fresh Session):
  • Independent code review
  • Test coverage analysis
  • Edge case validation
  • Bug detection

Why Separate Agents?

Hypothesis: A fresh Verifier catches mistakes the Builder makes because:
  • No implementation bias (hasn’t seen the code being written)
  • Forces reading requirements from scratch
  • Different session = different “perspective”
Result: 70% improvement in bug detection rate vs. reusing the same session

Series Structure (10 Weeks)

Week 1: Setting Up Multi-Agent Workflow

AI Focus: Evaluator, Builder, and Verifier setup and coordination
System Example: Planning event sourcing architecture
Key Learning: How to structure prompts for each agent role
Week 2: Plan → Implement → Verify

AI Focus: Three-phase workflow with quality gates
System Example: Implementing the DynamoDB event store
Key Learning: Independent verification prevents AI hallucination
Week 3: When AI Excels - Boilerplate

AI Focus: AI-generated DynamoDB entities and repositories
System Example: Creating 20+ entities with macros
Key Learning: AI saves 500+ lines of code on boilerplate
Week 4: When AI Excels - Pattern Recognition

AI Focus: AI learning from codebase patterns
System Example: Multi-tenant isolation implementation
Key Learning: AI consistency > human copy-paste
Week 5: When AI Fails - Architecture

AI Focus: AI suggesting wrong patterns for novel problems
System Example: Capsule isolation design (AI got it wrong)
Key Learning: Humans must own architecture decisions
Week 6: When AI Fails - Security

AI Focus: AI missing subtle security vulnerabilities
System Example: Cross-tenant query bug AI didn’t catch
Key Learning: Dedicated CISO agent + human review required
Week 7: Testing with AI

AI Focus: Can AI generate good tests? (Spoiler: mostly yes)
System Example: Four-level test suite generation
Key Learning: AI writes 80% of tests, humans review edge cases
Week 8: Token Optimization

AI Focus: Optimizing how we work WITH AI (meta-layer)
System Example: Reducing context usage from 150k to 20k tokens per session
Key Learning: Working with AI requires its own collaboration patterns - batch builds, session boundaries, just-in-time loading
Week 9: When AI Says 'Done' But Isn't

AI Focus: AI gaming completion criteria - claiming done while leaving TODOs, broken links, and compilation errors
System Example: Three features requiring 2-5 fix commits each despite “complete” claims
Key Learning: Verification is non-negotiable - prompt engineering reduces but doesn’t eliminate gaming behavior
Week 10: The Evolution - How AI Agents Found Their Voice

AI Focus: Fifteen specialized agents with distinct personalities and coordination patterns
System Example: Agent charters, CFO sprint planning, inter-agent dynamics (Architect vs CISO negotiations)
Key Learning: Software development is decision-making at scale. Each agent crystallizes a decision framework. Conflicts between agents produce better outcomes than either alone. ROI: 6-16x faster, 87% cost savings

Key Themes

1. AI as Collaborator (Not Replacement)

  • Boilerplate code - Structs, DTOs, CRUD operations
  • Pattern application - Following existing code patterns
  • Test generation - Happy path and common edge cases
  • Code review - Finding bugs, style issues, unused code
  • Refactoring - Consistent renames, extract method
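The boilerplate category is exactly what Rust macros also automate, which is why the series pairs AI generation with macro-driven development. A hedged sketch of what macro-generated entity boilerplate can look like; the `entity!` macro and `pk` key format are hypothetical, not the series' real macros:

```rust
// Hypothetical macro that stamps out an entity struct plus a
// single-table partition-key accessor, e.g. "CONTACT#42".
macro_rules! entity {
    ($name:ident, $prefix:expr, { $($field:ident: $ty:ty),* $(,)? }) => {
        #[derive(Debug, Clone)]
        pub struct $name {
            pub id: String,
            $(pub $field: $ty,)*
        }

        impl $name {
            // Partition key derived from the entity prefix and id.
            pub fn pk(&self) -> String {
                format!("{}#{}", $prefix, self.id)
            }
        }
    };
}

// Two invocations replace two hand-written struct + impl blocks.
entity!(Contact, "CONTACT", { name: String, email: String });
entity!(Deal, "DEAL", { amount: i64 });

fn main() {
    let c = Contact {
        id: "42".into(),
        name: "Ada".into(),
        email: "ada@example.com".into(),
    };
    let d = Deal { id: "7".into(), amount: 1200 };
    println!("{} {}", c.pk(), d.pk()); // CONTACT#42 DEAL#7
}
```

Generating the macro invocations (and the tests around them) is the repetitive, pattern-following work the series reports AI handling well.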

2. Workflows Over Prompts

Single-prompt coding doesn’t scale. What works:
1. PLAN (Evaluator - Opus)
   → Detailed design doc
   → Trade-off analysis
   → Human approval

2. IMPLEMENT (Builder - Sonnet)
   → Follow approved plan
   → Use Copilot for boilerplate
   → Request verification when done

3. VERIFY (Verifier - Fresh Sonnet)
   → Independent review
   → Test coverage check
   → Report issues or pass

4. REVIEW (Human + Specialist Agents)
   → Security (CISO agent)
   → Architecture (Architect agent)
   → Final approval
Why this works:
  • Each phase has clear goals
  • Independent verification catches mistakes
  • Human approval gates prevent runaway AI
  • Fresh sessions reduce hallucination
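The four phases can be sketched as a simple state machine in which each quality gate must pass before work advances, and a failed verification routes work back to the Builder. This is my own minimal illustration of the workflow's shape, not the series' actual tooling:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Phase {
    Plan,      // Evaluator produces a design doc
    Implement, // Builder follows the approved plan
    Verify,    // Fresh-session Verifier reviews independently
    Review,    // Human + specialist agents give final approval
    Done,
}

// Advance only when the current phase's gate passes; a failed
// verification sends work back to Implement instead of forward.
fn advance(phase: Phase, gate_passed: bool) -> Phase {
    match (phase, gate_passed) {
        (Phase::Plan, true) => Phase::Implement,
        (Phase::Implement, true) => Phase::Verify,
        (Phase::Verify, true) => Phase::Review,
        (Phase::Verify, false) => Phase::Implement,
        (Phase::Review, true) => Phase::Done,
        (p, _) => p, // otherwise stay put and rework
    }
}

fn main() {
    let mut phase = Phase::Plan;
    // One rework loop: verification fails once, then passes.
    for passed in [true, true, false, true, true, true] {
        phase = advance(phase, passed);
    }
    println!("{:?}", phase); // Done
}
```

The point of the shape is that there is no edge from Implement straight to Done, which is what single-prompt coding effectively attempts.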

3. Real Mistakes (AI and Human)

What Happened: AI suggested the Saga pattern for capsule provisioning
Why Wrong: Capsule creation is synchronous, not a distributed transaction
Who Failed: Human (me) - blindly accepted the AI suggestion
Fix: Reverted to a simple transaction, wrote an ADR explaining why
Lesson: AI doesn’t understand your specific context. Validate architecture suggestions.

What Happened: Used the same Claude session for build + verify
Why Wrong: The Builder was biased toward its own implementation
Impact: Missed 5 bugs that a fresh Verifier would have caught
Fix: Always use a fresh session for verification
Lesson: Independent verification requires independent context

What Happened: AI generated 50 tests, all passed, shipped a bug to production
Why Wrong: Tests verified the wrong behavior (false positives)
Root Cause: AI hallucinated a requirement that didn’t exist
Fix: Human review of test assertions, not just coverage
Lesson: Green tests ≠ correct tests. Review what’s being tested.

What Happened: Spent 3 days tweaking verification prompts for a 2% improvement
Why Wrong: Diminishing returns; time better spent elsewhere
Impact: Delayed feature work for marginal gains
Fix: Accept 90% accuracy, focus on high-value work
Lesson: Perfect is the enemy of done (applies to AI prompts too)

4. Metrics That Matter

We’re tracking:

Quality:
  • Bugs found in verification vs production (goal: 80/20)
  • Test coverage across 4 levels (goal: 90%+)
  • False positive rate in AI reviews (goal: less than 10%)
Speed:
  • Time per feature: Planning, Implementation, Verification
  • Rework cycles (goal: less than 2 per feature)
  • Token usage per feature (cost control)
Value:
  • Lines of code AI wrote vs human wrote
  • Time saved vs manual implementation
  • Cost (tokens) vs value (speed + quality)
Results so far (Week 6):
  • 40% faster implementation (with AI)
  • Same bug rate (AI didn’t hurt quality)
  • $50/month in tokens for 10 features
  • ROI: $600 in time saved for $50 in tokens

What You’ll Learn

AI Workflows

  • Multi-agent coordination patterns
  • Plan → Implement → Verify process
  • When to use which agent
  • Prompt engineering for code quality

When AI Helps

  • Boilerplate generation (massive time saver)
  • Pattern recognition (consistency)
  • Test creation (80% automation)
  • Code review (finds subtle bugs)

When AI Fails

  • Architecture decisions (needs human judgment)
  • Security edge cases (subtle vulnerabilities)
  • Novel problems (no pattern to follow)
  • False confidence (hallucinating requirements)

ROI Analysis

  • Token usage and costs
  • Time savings measurement
  • Quality impact assessment
  • When AI is worth it (and when not)

Who This Is For

You want to know:
  • Can AI scale our team’s output?
  • What workflows actually work?
  • What are the risks?
  • What’s the ROI?
You’ll learn:
  • Real cost/benefit analysis
  • Quality gate patterns
  • When AI helps vs hurts
  • How to structure AI workflows

The System (As Proof)

The SaaS platform I’m building is the vehicle for testing AI workflows, not the end goal. But it’s a real, production-grade system:
  • Event sourcing with DynamoDB
  • Multi-tenant isolation with capsule pattern
  • Rust with macro-driven development
  • Four-level testing (unit → repository → event flow → E2E)
Why this complexity?
  • Simple CRUD apps don’t stress-test AI capabilities
  • Event sourcing is hard (good test: can AI help?)
  • Security-critical (tests AI vulnerability detection)
  • Real trade-offs (not toy examples)
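The multi-tenant capsule isolation above can be illustrated at the key level of a DynamoDB single-table design: if every partition key is prefixed with the tenant, a query scoped to one tenant cannot read another tenant's items. A hedged sketch with illustrative names (the real capsule pattern is more involved):

```rust
// Newtype so a raw string can't silently stand in for a tenant id.
struct TenantId(String);

// Tenant-scoped partition key, e.g. "TENANT#acme#CONTACT#42".
// Because the tenant is baked into the key, cross-tenant reads
// require deliberately constructing another tenant's key.
fn partition_key(tenant: &TenantId, entity: &str, id: &str) -> String {
    format!("TENANT#{}#{}#{}", tenant.0, entity, id)
}

fn main() {
    let acme = TenantId("acme".into());
    println!("{}", partition_key(&acme, "CONTACT", "42")); // TENANT#acme#CONTACT#42
}
```

This is also the kind of invariant the Week 6 cross-tenant bug violated: an AI-written query that bypasses the scoped key helper looks correct and compiles, which is why a dedicated security review step exists.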

Read the Series

Week 1: Multi-Agent Setup

Why single-prompt coding doesn’t scale and how Evaluator, Builder, and Verifier agents transform development

Week 2: Plan-Implement-Verify

Building a full CRM domain layer with AI: How the three-phase workflow prevented hallucination and caught real bugs

Week 3: AI for Event Sourcing

72 commits of AI-driven refinement: How event sourcing revealed macro compliance gaps

Week 4: When AI Excels

107 commits in one week: How AI agents transformed tedious work like E2E testing and documentation into systematic excellence

Week 5: When AI Fails

30 commits to fix one breaking change - the week AI’s fix-in-session approach turned a 2-hour task into a 24-hour marathon

Week 6: AWS Runtime Adoption

How multi-agent AI discovered a scope-based factory pattern by analyzing our architecture docs - redemption after Week 5

Week 7: Config Governance

How hierarchical configuration with automatic injection made 340 lines of manual lookups vanish - ADR-driven design part 2

Week 8: Token Optimization

From 150k to 20k tokens per session: Learning the meta-patterns of AI collaboration through smart context management

Week 9: AI Gaming Completion

When AI claims “done” three times but requires 2-5 fix commits each: Discovering how AI games completion criteria and why verification beats prompts

Week 10: The Evolution (FINALE)

1000+ commits. 15 specialized agents. The retrospective: how AI agents developed distinct personalities, coordination protocols, and learned to build together

Subscribe for Updates

Get Weekly Updates

New articles published regularly documenting real AI workflow learnings

Disclaimer: This is experimental work from my personal projects. Results are real but may not generalize to all contexts. Your mileage may vary. This content does not represent my employer’s views or technologies.