This is Episode 4 of the Autonomous Dev Org series — an honest account of building a development organization where AI handles implementation and humans handle direction. Each episode covers what we attempted, what broke, and what we learned.
The Number That Didn’t Add Up
Episode 3 gave the loop blast radius awareness. Agents now query the impact graph before changing a function signature. They know which modules call what before touching anything. That solved one class of silent failure. A different class was already hiding in the output.

We were 11 weeks into a 19-week refactor. The executor had been running daily tasks — splitting a module, migrating types, updating callers. Each session ended the same way: “1405/1405 tests passing ✅.”

At some point I asked: “Why 1405?” The agent didn’t have an answer. It reported the current state accurately. What it didn’t know — couldn’t know — was that the baseline was 1408. Three tests had quietly disappeared over several weeks. They weren’t failing. They weren’t erroring. They were just gone.

The agent had no concept of “gone.” It tracked what existed. It had no record of what should exist. This is baseline drift: output that passes all checks and is nonetheless wrong.

The Asymmetry
This is a different problem from the ones Episodes 2 and 3 addressed. Episode 2 was about task memory — agents starting new sessions with no context about past work. The bead format solved that by encoding friction points and outcomes directly into GitHub Issues. Episode 3 was about spatial awareness — agents making changes without knowing what else depended on what they were touching. The impact graph solved that.

This is baseline awareness. It’s not about past tasks or code dependencies. It’s about the agent’s relationship to a known-good state.

AI is excellent at measuring current state. It runs the tests, counts the results, reports the number — perfectly, every time, without fatigue. What it doesn’t carry is expectation. Every session is a fresh observation. “1405 tests passing” is not a surprising number to the agent. It’s just the number.

Humans carry expected state across sessions. Not consciously, not as a spreadsheet — as intuition. When I asked “why 1405?”, I wasn’t doing arithmetic. I had a vague sense that the number felt smaller than it used to. That sense flagged a real problem. The agent had run the same check dozens of times and never flagged anything. Not because it was wrong. Because it had nothing to compare against.

The asymmetry: AI as executor, human as anomaly detector. The executor runs the check consistently. The human catches when the check passes but the result is still wrong.

What Long Projects Reveal
This asymmetry hides in short sessions. Long projects make it undeniable.

A 10-hour session starts fresh and ends fresh. If 3 tests go missing in that session, you probably notice — you’re present for the whole arc. You have context.

A 19-week refactor is different. No single session holds the full picture. Week 6 introduces a type migration. Week 9 removes a deprecated module. Week 12 the test file for that module gets caught in a cleanup commit that was slightly too broad. Week 13 the agent reports 1405 passing and nobody has held the number 1408 in mind since week 4.

The implicit baseline — how many tests should we have, roughly — lives in human memory across sessions the executor doesn’t have access to. As a refactor gets longer and the sessions get more distributed, that implicit knowledge has more ways to leak out.

This isn’t a model capability problem. A capable model running a single session with full history would catch this. The problem is that no single session has full history. The baseline that matters was established weeks ago in a session that no longer exists in the current context window. Long projects are where the asymmetry becomes a reliability problem rather than a quirk.

Canary Metrics: What We Tracked
After the test count incident, we started explicitly recording four numbers at each verified milestone. Not as a formal system — as a discipline. A check before marking a task complete.

| Metric | What It Measures | What a Delta Flagged |
|---|---|---|
| Test count | Number of tests that exist, not just pass | Caught 3 silently deleted tests over 11 weeks |
| LOC delta | Lines added vs expected for the task scope | Caught a copy-paste of an entire module (1,400 unexpected lines) |
| Compilation time | Build time in seconds | Flagged a circular dependency regression (+40% build time) |
| Module count | Number of modules/crates in the project | Confirmed the split was progressing, not inadvertently reversing |
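
As a rough illustration, a recorded snapshot can be as small as a serialized struct committed next to the code. This is a minimal sketch, assuming serde and serde_json as dependencies; the `BaselineSnapshot` name, its fields, and the `baseline.json` file are hypothetical, not the manifest format the series actually uses.

```rust
// Hypothetical shape for the recorded milestone snapshot. Struct name,
// fields, and file name are illustrative assumptions, not the series' format.
use serde::{Deserialize, Serialize};

#[derive(Debug, Serialize, Deserialize)]
struct BaselineSnapshot {
    milestone: String,   // which verified milestone this snapshot belongs to
    test_count: usize,   // tests that exist, not just pass
    loc: u64,            // total lines of code, for computing deltas
    build_secs: f64,     // clean-build wall time in seconds
    module_count: usize, // modules/crates in the workspace
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Values are illustrative except 1408, the baseline from the incident.
    let snapshot = BaselineSnapshot {
        milestone: "week-4-type-migration".into(),
        test_count: 1408,
        loc: 84_000,
        build_secs: 90.0,
        module_count: 12,
    };
    // Committing the snapshot next to the code gives every future session
    // something concrete to compare against.
    std::fs::write("baseline.json", serde_json::to_string_pretty(&snapshot)?)?;
    Ok(())
}
```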
`cargo test` gives test count. `git diff --stat` gives LOC. `time cargo build` gives compilation time. The tools already existed. What was missing was the habit of comparing current output against a recorded expectation.
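
Capturing the current values is mostly shelling out to those same tools. A sketch under the same assumptions: the helper names are mine, the test counting relies on the default libtest `--list` output format, and module count (obtainable via `cargo metadata`) is omitted for brevity.

```rust
use std::process::Command;
use std::time::Instant;

/// Count tests that exist (not just pass): `cargo test -- --list` prints one
/// line per test case ending in ": test" under the default libtest harness.
fn current_test_count() -> usize {
    let out = Command::new("cargo")
        .args(["test", "--", "--list"])
        .output()
        .expect("failed to run cargo test -- --list");
    String::from_utf8_lossy(&out.stdout)
        .lines()
        .filter(|line| line.ends_with(": test"))
        .count()
}

/// Net LOC delta against a base ref via `git diff --numstat`, which prints
/// "added deleted path" per file and is easier to sum than --stat output.
fn loc_delta(base_ref: &str) -> i64 {
    let out = Command::new("git")
        .args(["diff", "--numstat", base_ref])
        .output()
        .expect("failed to run git diff");
    String::from_utf8_lossy(&out.stdout)
        .lines()
        .filter_map(|line| {
            let mut cols = line.split_whitespace();
            // Binary files report "-", which fails to parse and is skipped.
            let added: i64 = cols.next()?.parse().ok()?;
            let deleted: i64 = cols.next()?.parse().ok()?;
            Some(added - deleted)
        })
        .sum()
}

/// Wall-clock build time. Run after `cargo clean` to measure a cold build.
fn build_secs() -> f64 {
    let start = Instant::now();
    let status = Command::new("cargo")
        .arg("build")
        .status()
        .expect("failed to run cargo build");
    assert!(status.success(), "build failed; the timing is meaningless");
    start.elapsed().as_secs_f64()
}

fn main() {
    println!("tests: {}", current_test_count());
    println!("loc delta vs main: {}", loc_delta("main"));
    println!("build: {:.1}s", build_secs());
}
```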
The LOC example is worth dwelling on. A task to migrate a single type shouldn’t
change line count by 1,400 lines. The agent completed the migration correctly and
then — reasoning that related code would be needed — copied an entire utility
module rather than importing it. The task passed. The tests passed. The LOC delta
surfaced what the test suite couldn’t: the scope had drifted from what the task
actually called for.
Canary metrics don’t replace tests. They catch what tests aren’t designed to
catch.
The Division of Cognitive Labor
The right mental model isn’t “AI does the work, human reviews.” It’s a more specific division.

The executor runs checks consistently: test count, LOC delta, compilation time, module count — every session, without forgetting. That’s where AI is genuinely better than a human: it never skips the check because it’s tired or because the session is running long.

The human carries something the executor can’t: the sense of what normal looks like at a given project state. Not a specific number remembered across weeks, but the judgment to recognize when a reported number is inconsistent with where the project should be.

Human intuition without consistent measurement produces missed signals — the number felt off but nobody looked. AI measurement without a baseline produces false confidence — “1405/1405 passing ✅” on a system that lost tests it shouldn’t have. The combination that actually works: the agent consistently measures and compares to a recorded snapshot, the human sets what “normal” looks like at each verified milestone, and anomalies surface as explicit flags rather than gut feelings.

This is a different layer from the hooks described in
Claude Code enforcement. Hooks enforce
architectural constraints — “don’t do this.” Baseline awareness catches drift from
expected state — “something changed that shouldn’t have.” Both matter. They target
orthogonal failure modes.
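
Here is a minimal sketch of that comparison step, assuming hypothetical `Baseline` and `Current` structs that mirror the canary metrics; the thresholds are illustrative guesses, since the series doesn’t prescribe tolerances.

```rust
// Illustrative comparison of current measurements against the recorded
// snapshot. Thresholds (500 lines, 25%) are assumptions, not prescribed.
struct Baseline {
    test_count: usize,
    build_secs: f64,
    module_count: usize,
}

struct Current {
    test_count: usize,
    build_secs: f64,
    module_count: usize,
    loc_delta: i64,          // net lines changed since the baseline
    expected_loc_delta: i64, // rough scope estimate recorded with the task
}

/// Return explicit anomaly flags instead of a bare pass/fail.
fn anomaly_flags(base: &Baseline, cur: &Current) -> Vec<String> {
    let mut flags = Vec::new();

    // Tests may be added during a refactor, never silently lost.
    if cur.test_count < base.test_count {
        flags.push(format!(
            "test count dropped {} -> {}: {} tests are gone, not failing",
            base.test_count,
            cur.test_count,
            base.test_count - cur.test_count,
        ));
    }

    // A large deviation from the task's expected scope suggests drift,
    // e.g. a module copied wholesale instead of imported.
    if (cur.loc_delta - cur.expected_loc_delta).abs() > 500 {
        flags.push(format!(
            "LOC delta {} far from expected {}",
            cur.loc_delta, cur.expected_loc_delta,
        ));
    }

    // Build-time regressions can indicate structural problems such as a
    // reintroduced circular dependency.
    if cur.build_secs > base.build_secs * 1.25 {
        flags.push(format!(
            "build time {:.0}s vs baseline {:.0}s (>25% regression)",
            cur.build_secs, base.build_secs,
        ));
    }

    // A module split should add modules; a shrinking count means reversal.
    if cur.module_count < base.module_count {
        flags.push("module count decreased: split may be reversing".into());
    }

    flags
}

fn main() {
    // Replays the incident: 1405 passing looks fine in isolation, but
    // against the recorded baseline of 1408 it raises a flag. Other
    // values are illustrative.
    let base = Baseline { test_count: 1408, build_secs: 90.0, module_count: 12 };
    let cur = Current {
        test_count: 1405,
        build_secs: 92.0,
        module_count: 12,
        loc_delta: 120,
        expected_loc_delta: 100,
    };
    for flag in anomaly_flags(&base, &cur) {
        println!("FLAG: {flag}");
    }
}
```

The point is the shape of the output: not a bare pass/fail, but named deltas a human can judge.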
What This Means for the Autonomous Loop
In the loop as built through Episode 3, executor failures divide into two categories:

- Loud failures: compile errors, test failures, explicit verifier rejections.
- Silent failures: blast radius surprises — things that compile and pass but broke something the agent didn’t know it was touching.

Baseline drift fits neither category: every check passes, nothing the agent touched appears broken, and the output is still wrong because nobody recorded what should exist.
What’s Next
The loop has memory (Episode 2). It has blast radius awareness (Episode 3). It needs baseline awareness to close the third gap: passing output that is nonetheless wrong.

That requires the manifest (the recorded snapshot of canary metrics) to evolve alongside the codebase — established at Phase 0, updated at each verified milestone, checked before every PR, used as a rejection signal when deltas are unexpected. Episode 5 covers the implementation: how the manifest gets created, who updates it, what the verifier does with a flag, and where human judgment is still necessary at the boundary between “unexpected but intentional” and “unexpected and wrong.”

Episode 5: The Gate That Wasn’t There
An autonomous loop needs hard process gates the same way a compiler needs
type checks. Soft norms don’t hold when there’s no human per task.
All content represents personal learning from personal and side projects. Code examples are sanitized and generalized. No proprietary information is shared. Opinions are my own and do not reflect my employer’s views.