This is Episode 6 of the Autonomous Dev Org series — an honest account of building a development organization where AI handles implementation and humans handle direction. Each episode covers what we attempted, what broke, and what we learned.
The Empty Registry
The internal operations tenant’s bootstrap script called the subscription endpoint on its first run. Unit tests passing. Integration tests passing. And the endpoint returned an empty entitlement list — the tenant existed but couldn’t do anything.

That’s when it clicked: every test had populated the product registry at setup and torn it down at teardown. They’d been testing a world that never existed outside of test setup. The first real consumer hit the real world — an empty registry that nobody had ever committed to a shared config file. That failure, invisible to every test written, is what dog-fooding is for.

Episode 5 closed the process gate problem: tasks can no longer execute against requirements in draft. The loop has gates at every layer — process, blast radius, baseline, verification. Gates enforce what “ready” looks like before and during implementation. They say nothing about whether the thing that was built actually works for a real user.

That question required a different mechanism. Not a gate. A tenant. The platform’s internal operations team became the first tenant to subscribe through the platform’s own API. Not hardcoded. Not provisioned manually. The same POST /api/v1/products/subscribe endpoint any future customer will use.
When we created the tenant repo and wired the bootstrap script, the first call
it made was to that endpoint — and it was the empty registry that made us fix
what two thousand tests had missed.
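A toy model makes the failure mode concrete. This is not the platform’s code: the class, function, and product names below are all hypothetical, and the “endpoint” is reduced to a dictionary lookup. The point is only how a seeded test registry and a cold-start registry diverge.

```python
# A toy model of the gap: the subscription lookup is "correct", but its
# correctness is relative to a registry that tests always pre-seed.
# All names here are hypothetical, not the platform's real code.

class ProductRegistry:
    def __init__(self):
        self.products = {}  # product_id -> list of entitlements

    def seed(self, product_id, entitlements):
        self.products[product_id] = entitlements

def subscribe(registry, tenant_id, product_id):
    """Roughly what the endpoint did: look up the product's entitlements.
    Returns [] when the product was never registered."""
    return registry.products.get(product_id, [])

# How every test saw the world: setup seeds the registry first.
seeded = ProductRegistry()
seeded.seed("audit-log", ["audit-log:read"])
assert subscribe(seeded, "ops-tenant", "audit-log") == ["audit-log:read"]

# How the first real consumer saw the world: nothing ever seeded.
cold = ProductRegistry()
assert subscribe(cold, "ops-tenant", "audit-log") == []  # the silent gap
```

Both assertions pass, which is exactly the trap: the code behaves consistently in both worlds, and only the consumer that starts from the cold world notices anything wrong.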
What Dog-Fooding Reveals That Tests Don’t
Tests are written by the same people who wrote the code. They embed the same assumptions. If the developer assumed the registry would be pre-populated, the test populates it at setup. The test passes. The gap is invisible.

A real consumer doesn’t share those assumptions. It follows the path a user would follow: provision the tenant, subscribe to products, receive the entitlements. If step two hits an empty registry, that’s the bug.

This is not a testing failure. The tests were thorough. The integration suite covered subscription creation, idempotency, approval gating, and webhook delivery. What it didn’t cover was the bootstrap sequence: the ordered set of operations that happen before a tenant is a real tenant. That sequence had never been run by anything other than test setup code.

Dog-fooding forces you to run the bootstrap for real. And “for real” means starting from the state the world actually starts from, not the state your tests set up.

For a human engineering team, this is a well-understood discipline. Ship to yourself before you ship to customers. Find the gaps. Fix them. The cost is low because the feedback loop is fast — internal users find issues the same day, and the team that finds them is the team that can fix them.

For an autonomous loop, the stakes are higher. The loop can implement thousands of lines across dozens of tasks and never run the full sequence from zero. Each task completes correctly in isolation. The composition — the actual product behavior — is nobody’s job to verify unless you make it somebody’s job.

The Subscription-First Refactor
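As a sketch of what a manifest-driven bootstrap can look like: the endpoint path comes from this post, but the manifest fields, function names, and the injected client are illustrative assumptions, not the platform’s actual code.

```python
# Hypothetical sketch of a manifest-driven bootstrap. The endpoint path
# appears in this post; the manifest fields and function names are
# illustrative only.

# Stand-in for the committed YAML manifest in the tenant repo.
MANIFEST = {
    "tenant": "internal-operations",
    "products": ["audit-log", "cost-reports"],
}

def bootstrap_subscriptions(subscribe, manifest):
    """Subscribe the tenant to every product in its manifest through the
    same client any customer would use (whatever wraps
    POST /api/v1/products/subscribe). Fails loudly on an empty grant,
    which is the empty-registry failure mode this episode describes."""
    granted = []
    for product in manifest["products"]:
        response = subscribe(tenant_id=manifest["tenant"],
                             product_id=product)
        granted.extend(response.get("entitlements", []))
    if not granted:
        raise RuntimeError("no entitlements granted; was the product "
                           "registry ever seeded outside test setup?")
    return granted
```

The design choice worth noting: the bootstrap takes the subscription client as a parameter rather than importing internals, so the only path into the platform is the public one a customer would use.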
The empty registry failure forced a refactor that had been on the horizon anyway. The internal operations tenant’s onboarding had been wired through a separate provisioner — a dedicated script that knew the platform’s internals and bypassed the customer-facing API. It worked. It also meant the customer-facing API had never been exercised by a real onboarding.

The fix: replace the special-case provisioner with a standard subscription manifest. The internal operations tenant subscribes to products like any customer. Its bootstrap script calls the subscription endpoint. Its product list lives in a committed YAML file that the endpoint reads at runtime — not at test setup.

Customer Zero vs. Customer One
There’s a principle in hardware manufacturing: the first unit off the line is always yours. You run it through every test, every use case, every edge path. You don’t give it to a customer.

Software teams know this principle but often apply it partially. Internal tooling gets dog-fooded. The customer-facing onboarding flow sometimes doesn’t, because nobody on the engineering team is a first-time user. They know too much. They bypass the friction without noticing it.

The autonomous loop has a version of this problem. The agents that build the onboarding flow know the platform’s internals. They wrote the subscription endpoint. They know what state it expects. When the executor verifies a task, it confirms the behavior the task spec described. It doesn’t verify whether a naive first-time consumer can get through the door.

Customer zero solves this by being structurally identical to customer one — same API, same auth, same product manifest format — but operated by the people who built it. When customer zero hits a gap, the team finds it cheaply and fixes it before customer one arrives.

To make this systematic, we committed the first simulated customer tenant’s full profile alongside the platform itself: persona, department structure, use cases, and a lifecycle test specification that describes what a successful onboarding looks like from the customer’s perspective, not the platform’s.

A Four-Quarter Validation Ladder
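To make the automated lifecycle runs concrete, here is a hypothetical sketch: the tenant personas, step names, and stub platform are all illustrative, standing in for the kind of cold-start onboarding check that turns every gap into a backlog item.

```python
# Hypothetical sketch of an automated tenant lifecycle run. Persona
# fields, step names, and the stub platform are illustrative, not the
# real specification.

TENANTS = [
    {"id": "retail-ops",  "products": ["orders", "inventory"]},
    {"id": "fin-reports", "products": ["cost-reports"]},
]

def run_lifecycle(tenant, platform):
    """Walk one tenant through onboarding from a cold start. Returns a
    list of gaps; an empty list means the persona onboarded cleanly."""
    gaps = []
    platform.provision(tenant["id"])
    for product in tenant["products"]:
        entitlements = platform.subscribe(tenant["id"], product)
        if not entitlements:
            gaps.append(f"{tenant['id']}: no entitlements for {product}")
    return gaps

def validate_all(platform):
    """Run every simulated tenant and collect the combined backlog."""
    backlog = []
    for tenant in TENANTS:
        backlog.extend(run_lifecycle(tenant, platform))
    return backlog

class StubPlatform:
    """Cold-start stand-in; in practice this would be the real API,
    e.g. running against emulated infrastructure."""
    def __init__(self, registry):
        self.registry = registry      # product_id -> entitlements
        self.tenants = set()

    def provision(self, tenant_id):
        self.tenants.add(tenant_id)

    def subscribe(self, tenant_id, product_id):
        return self.registry.get(product_id, [])
```

Run against a fully seeded registry the backlog is empty; run against a cold one, every tenant-product pair surfaces as a gap instead of passing silently.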
The internal operations tenant’s first full run gave us a prioritized backlog of real gaps. But a single internal tenant only exercises the paths that tenant uses. Eleven simulated customer tenants, each with distinct personas, requirements, and usage patterns, cover the surface area that a diverse real customer base would expose. The validation plan spans four quarters.

Q1 and Q2 validate that the platform works correctly — that the API chains produce the right state for each tenant persona. This is automated and agent-driven, running against LocalStack for AWS service emulation before any real infrastructure is involved.

Q3 and Q4 are different in kind. In Q3, the platform’s own operational agents use the product they built and document how to use it. Not documentation written from the source code. Documentation written by an agent following the actual UX path. In Q4, a tenant-scope agent reads those guides and uses them to navigate the product with a real browser. If the guides are wrong, the tenant agent gets stuck. If the product behaves differently from what the guide says, the tenant agent catches it.

The feedback loop runs without human involvement, but the signal is the closest approximation to real-user behavior we can generate before real users exist.

What This Means When AI Writes the Code
With a human team, dog-fooding is quality discipline. “Use what you build” is a heuristic for catching gaps between what engineers assumed and what users experience.

With an autonomous loop, it’s a trust mechanism. The loop can implement thousands of lines correctly — correct against the spec, correct in tests, correct in verifier review — and still produce software that doesn’t work for a real user. The correctness guarantees are relative to what was specified. They say nothing about what wasn’t specified, what was assumed, or what only becomes visible when someone actually goes through the door.

The gate hierarchy from Episodes 3–5 ensures the loop builds what was specified. Dog-fooding ensures what was specified is actually what was needed. Both are necessary. Neither replaces the other.
What’s Next
The series has covered the loop’s infrastructure: orchestration, memory, blast radius, baseline awareness, process gates, and now end-to-end validation. The next questions are operational. What does the loop do when it’s wrong? How does the correction cycle work? And what does it look like when a real external tenant arrives — one who didn’t build it, doesn’t know the internals, and brings requirements the simulation didn’t anticipate?

Series Overview — Autonomous Dev Org

All six episodes plus what’s coming. The arc from first loop to first external tenant.
All content represents personal learning from personal and side projects. Code examples are sanitized and generalized. No proprietary information is shared. Opinions are my own and do not reflect my employer’s views.