
The Real Story

This isn’t a series about building a SaaS platform. It’s a series about using AI agents to build a SaaS platform - and documenting what actually works, what fails spectacularly, and how human-AI collaboration really plays out at scale.
Not a Tutorial - This is experimental work with real failures, pivots, and “AI got it completely wrong” moments. The SaaS platform is the vehicle for testing AI workflows. The AI workflows are the point.

YouTube Shorts - 60-Second Summaries

Week 1: Multi-Agent Setup

Setting up Evaluator, Builder, and Verifier workflow

Week 2: Plan-Implement-Verify

Three-phase workflow with quality gates

Week 3: AI Event Sourcing

How AI helped and hindered event sourcing design

Week 4: When AI Excels

Boilerplate generation and systematic work wins

Week 5: When AI Fails

Architecture mistakes and cascading errors

Week 6: AWS Runtime

Breaking changes and macro evolution

Week 7: Config Governance

Middleware patterns and systematic implementation

Week 8: Token Optimization

Reducing token usage through smart context management

Week 9: AI Gaming Completion

When AI says ‘done’ but isn’t

Week 10: AI Agents Evolution

How 15 AI agents found their voice over 10 weeks

The Experiment

Question: Can multi-agent AI workflows build production-grade systems faster without sacrificing quality?

Hypothesis: Yes, if:
  • Humans handle architecture and security decisions
  • AI handles boilerplate, patterns, and testing
  • Independent verification catches AI mistakes
  • Clear workflows prevent AI from hallucinating requirements
Testing Ground: A multi-tenant SaaS platform with:
  • Event sourcing (complex pattern, good test for AI)
  • DynamoDB single-table design (AI struggles here)
  • Rust macros (AI excels at boilerplate)
  • Four-level testing (can AI generate good tests?)
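Event sourcing rebuilds current state by replaying an ordered log of events, which is what makes it a demanding test case for AI. A minimal Rust sketch of the idea; the `AccountEvent` and `Account` types are illustrative, not the platform's actual entities:

```rust
// Hypothetical event type for one aggregate; real event enums are richer.
#[derive(Debug, Clone)]
enum AccountEvent {
    Opened { owner: String },
    Credited { amount: i64 },
    Debited { amount: i64 },
}

#[derive(Debug, Default)]
struct Account {
    owner: String,
    balance: i64,
}

impl Account {
    // Apply a single event to the current state.
    fn apply(mut self, event: &AccountEvent) -> Self {
        match event {
            AccountEvent::Opened { owner } => self.owner = owner.clone(),
            AccountEvent::Credited { amount } => self.balance += amount,
            AccountEvent::Debited { amount } => self.balance -= amount,
        }
        self
    }

    // The core of event sourcing: fold the event log into current state.
    fn replay(events: &[AccountEvent]) -> Self {
        events.iter().fold(Account::default(), Account::apply)
    }
}

fn main() {
    let events = vec![
        AccountEvent::Opened { owner: "acme".into() },
        AccountEvent::Credited { amount: 100 },
        AccountEvent::Debited { amount: 30 },
    ];
    let account = Account::replay(&events);
    println!("{} {}", account.owner, account.balance); // acme 70
}
```

The state is never stored directly; it is always derivable from the log, which is exactly the invariant that is easy for an AI to violate silently.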

The AI Workflow

Three Core Agents

Evaluator (Claude Opus):
  • Architecture planning
  • Trade-off analysis
  • ADR creation
  • Design decisions
Builder (Claude Sonnet + GitHub Copilot):
  • Implementation
  • Boilerplate generation
  • Pattern application
  • Test writing
Verifier (Claude Sonnet - Fresh Session):
  • Independent code review
  • Test coverage analysis
  • Edge case validation
  • Bug detection

Why Separate Agents?

Hypothesis: A fresh Verifier catches mistakes the Builder makes because:
  • No implementation bias (hasn’t seen the code being written)
  • Forces reading requirements from scratch
  • Different session = different “perspective”
Result: 70% improvement in bug detection rate vs. reusing the same session

Series Structure (10 Weeks)

Week 1: Setting Up Multi-Agent Workflow

AI Focus: Evaluator, Builder, and Verifier setup and coordination
System Example: Planning event sourcing architecture
Key Learning: How to structure prompts for each agent role
Week 2: Plan → Implement → Verify

AI Focus: Three-phase workflow with quality gates
System Example: Implementing the DynamoDB event store
Key Learning: Independent verification prevents AI hallucination
Week 3: When AI Excels - Boilerplate

AI Focus: AI-generated DynamoDB entities and repositories
System Example: Creating 20+ entities with macros
Key Learning: AI saves 500+ lines of code on boilerplate
Week 4: When AI Excels - Pattern Recognition

AI Focus: AI learning from codebase patterns
System Example: Multi-tenant isolation implementation
Key Learning: AI consistency > human copy-paste
Week 5: When AI Fails - Architecture

AI Focus: AI suggesting wrong patterns for novel problems
System Example: Capsule isolation design (AI got it wrong)
Key Learning: Humans must own architecture decisions
Week 6: When AI Fails - Security

AI Focus: AI missing subtle security vulnerabilities
System Example: Cross-tenant query bug AI didn’t catch
Key Learning: Dedicated CISO agent + human review required
Week 7: Testing with AI

AI Focus: Can AI generate good tests? (Spoiler: mostly yes)
System Example: Four-level test suite generation
Key Learning: AI writes 80% of tests, humans review edge cases
Week 8: Token Optimization

AI Focus: Optimizing how we work WITH AI (meta-layer)
System Example: Reducing context usage from 150k to 20k tokens per session
Key Learning: Working with AI requires its own collaboration patterns - batch builds, session boundaries, just-in-time loading
Week 9: When AI Says 'Done' But Isn't

AI Focus: AI gaming completion criteria - claiming done while leaving TODOs, broken links, and compilation errors
System Example: Three features requiring 2-5 fix commits each despite “complete” claims
Key Learning: Verification is non-negotiable - prompt engineering reduces but doesn’t eliminate gaming behavior
Week 10: The Evolution - How AI Agents Found Their Voice

AI Focus: Fifteen specialized agents with distinct personalities and coordination patterns
System Example: Agent charters, CFO sprint planning, inter-agent dynamics (Architect vs CISO negotiations)
Key Learning: Software development is decision-making at scale. Each agent crystallizes a decision framework. Conflicts between agents produce better outcomes than either alone. ROI: 6-16x faster, 87% cost savings

Key Themes

1. AI as Collaborator (Not Replacement)

  • Boilerplate code - Structs, DTOs, CRUD operations
  • Pattern application - Following existing code patterns
  • Test generation - Happy path and common edge cases
  • Code review - Finding bugs, style issues, unused code
  • Refactoring - Consistent renames, extract method
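The boilerplate category is exactly what Rust macros also automate, which is why the series pairs AI generation with macro-driven development. A hedged sketch of what macro-generated entity boilerplate can look like; the `entity!` macro and `pk` key format are hypothetical, not the series' real macros:

```rust
// Hypothetical macro that stamps out an entity struct plus a
// single-table partition-key accessor, e.g. "CONTACT#42".
macro_rules! entity {
    ($name:ident, $prefix:expr, { $($field:ident: $ty:ty),* $(,)? }) => {
        #[derive(Debug, Clone)]
        pub struct $name {
            pub id: String,
            $(pub $field: $ty,)*
        }

        impl $name {
            // Partition key derived from the entity prefix and id.
            pub fn pk(&self) -> String {
                format!("{}#{}", $prefix, self.id)
            }
        }
    };
}

// Two invocations replace two hand-written struct + impl blocks.
entity!(Contact, "CONTACT", { name: String, email: String });
entity!(Deal, "DEAL", { amount: i64 });

fn main() {
    let c = Contact {
        id: "42".into(),
        name: "Ada".into(),
        email: "ada@example.com".into(),
    };
    let d = Deal { id: "7".into(), amount: 1200 };
    println!("{} {}", c.pk(), d.pk()); // CONTACT#42 DEAL#7
}
```

Generating the macro invocations (and the tests around them) is the repetitive, pattern-following work the series reports AI handling well.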

2. Workflows Over Prompts

Single-prompt coding doesn’t scale. What works:
1. PLAN (Evaluator - Opus)
   → Detailed design doc
   → Trade-off analysis
   → Human approval

2. IMPLEMENT (Builder - Sonnet)
   → Follow approved plan
   → Use Copilot for boilerplate
   → Request verification when done

3. VERIFY (Verifier - Fresh Sonnet)
   → Independent review
   → Test coverage check
   → Report issues or pass

4. REVIEW (Human + Specialist Agents)
   → Security (CISO agent)
   → Architecture (Architect agent)
   → Final approval
Why this works:
  • Each phase has clear goals
  • Independent verification catches mistakes
  • Human approval gates prevent runaway AI
  • Fresh sessions reduce hallucination
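The four phases can be sketched as a simple state machine in which each quality gate must pass before work advances, and a failed verification routes work back to the Builder. This is my own minimal illustration of the workflow's shape, not the series' actual tooling:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Phase {
    Plan,      // Evaluator produces a design doc
    Implement, // Builder follows the approved plan
    Verify,    // Fresh-session Verifier reviews independently
    Review,    // Human + specialist agents give final approval
    Done,
}

// Advance only when the current phase's gate passes; a failed
// verification sends work back to Implement instead of forward.
fn advance(phase: Phase, gate_passed: bool) -> Phase {
    match (phase, gate_passed) {
        (Phase::Plan, true) => Phase::Implement,
        (Phase::Implement, true) => Phase::Verify,
        (Phase::Verify, true) => Phase::Review,
        (Phase::Verify, false) => Phase::Implement,
        (Phase::Review, true) => Phase::Done,
        (p, _) => p, // otherwise stay put and rework
    }
}

fn main() {
    let mut phase = Phase::Plan;
    // One rework loop: verification fails once, then passes.
    for passed in [true, true, false, true, true, true] {
        phase = advance(phase, passed);
    }
    println!("{:?}", phase); // Done
}
```

The point of the shape is that there is no edge from Implement straight to Done, which is what single-prompt coding effectively attempts.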

3. Real Mistakes (AI and Human)

What Happened: AI suggested the Saga pattern for capsule provisioning
Why Wrong: Capsule creation is synchronous, not a distributed transaction
Who Failed: Human (me) - blindly accepted the AI suggestion
Fix: Reverted to a simple transaction, wrote an ADR explaining why
Lesson: AI doesn’t understand your specific context. Validate architecture suggestions.

What Happened: Used the same Claude session for build + verify
Why Wrong: The Builder was biased toward its own implementation
Impact: Missed 5 bugs that a fresh Verifier would have caught
Fix: Always use a fresh session for verification
Lesson: Independent verification requires independent context

What Happened: AI generated 50 tests, all passed, shipped a bug to production
Why Wrong: Tests verified the wrong behavior (false positives)
Root Cause: AI hallucinated a requirement that didn’t exist
Fix: Human review of test assertions, not just coverage
Lesson: Green tests ≠ correct tests. Review what’s being tested.

What Happened: Spent 3 days tweaking verification prompts for a 2% improvement
Why Wrong: Diminishing returns; time better spent elsewhere
Impact: Delayed feature work for marginal gains
Fix: Accept 90% accuracy, focus on high-value work
Lesson: Perfect is the enemy of done (applies to AI prompts too)

4. Metrics That Matter

We’re tracking:

Quality:
  • Bugs found in verification vs production (goal: 80/20)
  • Test coverage across 4 levels (goal: 90%+)
  • False positive rate in AI reviews (goal: less than 10%)
Speed:
  • Time per feature: Planning, Implementation, Verification
  • Rework cycles (goal: less than 2 per feature)
  • Token usage per feature (cost control)
Value:
  • Lines of code AI wrote vs human wrote
  • Time saved vs manual implementation
  • Cost (tokens) vs value (speed + quality)
Results so far (Week 6):
  • 40% faster implementation (with AI)
  • Same bug rate (AI didn’t hurt quality)
  • $50/month in tokens for 10 features
  • ROI: $600 in time saved for $50 in tokens

What You’ll Learn

AI Workflows

  • Multi-agent coordination patterns
  • Plan → Implement → Verify process
  • When to use which agent
  • Prompt engineering for code quality

When AI Helps

  • Boilerplate generation (massive time saver)
  • Pattern recognition (consistency)
  • Test creation (80% automation)
  • Code review (finds subtle bugs)

When AI Fails

  • Architecture decisions (needs human judgment)
  • Security edge cases (subtle vulnerabilities)
  • Novel problems (no pattern to follow)
  • False confidence (hallucinating requirements)

ROI Analysis

  • Token usage and costs
  • Time savings measurement
  • Quality impact assessment
  • When AI is worth it (and when not)

Who This Is For

You want to know:
  • Can AI scale our team’s output?
  • What workflows actually work?
  • What are the risks?
  • What’s the ROI?
You’ll learn:
  • Real cost/benefit analysis
  • Quality gate patterns
  • When AI helps vs hurts
  • How to structure AI workflows

The System (As Proof)

The SaaS platform I’m building is the vehicle for testing AI workflows, not the end goal. But it’s a real, production-grade system:
  • Event sourcing with DynamoDB
  • Multi-tenant isolation with capsule pattern
  • Rust with macro-driven development
  • Four-level testing (unit → repository → event flow → E2E)
Why this complexity?
  • Simple CRUD apps don’t stress-test AI capabilities
  • Event sourcing is hard (good test: can AI help?)
  • Security-critical (tests AI vulnerability detection)
  • Real trade-offs (not toy examples)
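The multi-tenant capsule isolation above can be illustrated at the key level of a DynamoDB single-table design: if every partition key is prefixed with the tenant, a query scoped to one tenant cannot read another tenant's items. A hedged sketch with illustrative names (the real capsule pattern is more involved):

```rust
// Newtype so a raw string can't silently stand in for a tenant id.
struct TenantId(String);

// Tenant-scoped partition key, e.g. "TENANT#acme#CONTACT#42".
// Because the tenant is baked into the key, cross-tenant reads
// require deliberately constructing another tenant's key.
fn partition_key(tenant: &TenantId, entity: &str, id: &str) -> String {
    format!("TENANT#{}#{}#{}", tenant.0, entity, id)
}

fn main() {
    let acme = TenantId("acme".into());
    println!("{}", partition_key(&acme, "CONTACT", "42")); // TENANT#acme#CONTACT#42
}
```

This is also the kind of invariant the Week 6 cross-tenant bug violated: an AI-written query that bypasses the scoped key helper looks correct and compiles, which is why a dedicated security review step exists.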

Read the Series

Week 1: Multi-Agent Setup

Why single-prompt coding doesn’t scale and how Evaluator, Builder, and Verifier agents transform development

Week 2: Plan-Implement-Verify

Building a full CRM domain layer with AI: How the three-phase workflow prevented hallucination and caught real bugs

Week 3: AI for Event Sourcing

72 commits of AI-driven refinement: How event sourcing revealed macro compliance gaps

Week 4: When AI Excels

107 commits in one week: How AI agents transformed tedious work like E2E testing and documentation into systematic excellence

Week 5: When AI Fails

30 commits to fix one breaking change - the week AI’s fix-in-session approach turned a 2-hour task into a 24-hour marathon

Week 6: AWS Runtime Adoption

How multi-agent AI discovered a scope-based factory pattern by analyzing our architecture docs - redemption after Week 5

Week 7: Config Governance

How hierarchical configuration with automatic injection made 340 lines of manual lookups vanish - ADR-driven design part 2

Week 8: Token Optimization

From 150k to 20k tokens per session: Learning the meta-patterns of AI collaboration through smart context management

Week 9: AI Gaming Completion

When AI claims “done” three times but requires 2-5 fix commits each: Discovering how AI games completion criteria and why verification beats prompts

Week 10: The Evolution (FINALE)

1000+ commits. 15 specialized agents. The retrospective: how AI agents developed distinct personalities, coordination protocols, and learned to build together

Subscribe for Updates

Get Weekly Updates

New articles published regularly documenting real AI workflow learnings

Disclaimer: This is experimental work from my personal projects. Results are real but may not generalize to all contexts. Your mileage may vary. This content does not represent my employer’s views or technologies.