
This is Episode 4 of the Autonomous Dev Org series — an honest account of building a development organization where AI handles implementation and humans handle direction. Each episode covers what we attempted, what broke, and what we learned.

The Number That Didn’t Add Up

Episode 3 gave the loop blast radius awareness. Agents now query the impact graph before changing a function signature. They know which modules call what before touching anything. That solved one class of silent failure. A different class was already hiding in the output.

We were 11 weeks into a 19-week refactor. The executor had been running daily tasks — splitting a module, migrating types, updating callers. Each session ended the same way: “1405/1405 tests passing ✅.”

At some point I asked: “Why 1405?” The agent didn’t have an answer. It reported the current state accurately. What it didn’t know — couldn’t know — was that the baseline was 1408. Three tests had quietly disappeared over several weeks. They weren’t failing. They weren’t erroring. They were just gone.

The agent had no concept of “gone.” It tracked what existed. It had no record of what should exist. This is baseline drift: output that passes all checks and is nonetheless wrong.

The Asymmetry

This is a different problem from the ones Episodes 2 and 3 addressed. Episode 2 was about task memory — agents starting new sessions with no context about past work. The bead format solved that by encoding friction points and outcomes directly into GitHub Issues. Episode 3 was about spatial awareness — agents making changes without knowing what else depended on what they were touching. The impact graph solved that.

This is baseline awareness. It’s not about past tasks or code dependencies. It’s about the agent’s relationship to a known-good state.

AI is excellent at measuring current state. It runs the tests, counts the results, reports the number — perfectly, every time, without fatigue. What it doesn’t carry is expectation. Every session is a fresh observation. “1405 tests passing” is not a surprising number to the agent. It’s just the number.

Humans carry expected state across sessions. Not consciously, not as a spreadsheet — as intuition. When I asked “why 1405?”, I wasn’t doing arithmetic. I had a vague sense that the number felt smaller than it used to. That sense flagged a real problem. The agent had run the same check dozens of times and never flagged anything. Not because it was wrong. Because it had nothing to compare against.

The asymmetry: AI as executor, human as anomaly detector. The executor runs the check consistently. The human catches when the check passes but the result is still wrong.

What Long Projects Reveal

This asymmetry hides in short sessions. Long projects make it undeniable.

A 10-hour session starts fresh and ends fresh. If 3 tests go missing in that session, you probably notice — you’re present for the whole arc. You have context.

A 19-week refactor is different. No single session holds the full picture. Week 6 introduces a type migration. Week 9 removes a deprecated module. Week 12 the test file for that module gets caught in a cleanup commit that was slightly too broad. Week 13 the agent reports 1405 passing and nobody has held the number 1408 in mind since week 4. The implicit baseline — how many tests should we have, roughly — lives in human memory across sessions the executor doesn’t have access to. As a refactor gets longer and the sessions get more distributed, that implicit knowledge has more ways to leak out.

This isn’t a model capability problem. A capable model running a single session with full history would catch this. The problem is that no single session has full history. The baseline that matters was established weeks ago in a session that no longer exists in the current context window. Long projects are where the asymmetry becomes a reliability problem rather than a quirk.

Canary Metrics: What We Tracked

After the test count incident, we started explicitly recording four numbers at each verified milestone. Not as a formal system — as a discipline. A check before marking a task complete.
  • Test count — the number of tests that exist, not just pass. A delta caught 3 silently deleted tests over 11 weeks.
  • LOC delta — lines added vs. expected for the task scope. A delta caught a copy-paste of an entire module (1,400 unexpected lines).
  • Compilation time — build time in seconds. A delta flagged a circular dependency regression (+40% build time).
  • Module count — the number of modules/crates in the project. Tracking it confirmed the split was progressing, not inadvertently reversing.
None of these are sophisticated. cargo test gives test count. git diff --stat gives LOC. time cargo build gives compilation time. The tools already existed. What was missing was the habit of comparing current output against a recorded expectation.

The LOC example is worth dwelling on. A task to migrate a single type shouldn’t change line count by 1,400 lines. The agent completed the migration correctly and then — reasoning that related code would be needed — copied an entire utility module rather than importing it. The task passed. The tests passed. The LOC delta surfaced what the test suite couldn’t: the scope had drifted from what the task actually called for.

Canary metrics don’t replace tests. They catch what tests aren’t designed to catch.

The Division of Cognitive Labor

The right mental model isn’t “AI does the work, human reviews.” It’s a more specific division.

The executor runs checks consistently: test count, LOC delta, compilation time, module count — every session, without forgetting. That’s where AI is genuinely better than a human: it never skips the check because it’s tired or because the session is running long. The human carries something the executor can’t: the sense of what normal looks like at a given project state. Not a specific number remembered across weeks, but the judgment to recognize when a reported number is inconsistent with where the project should be.

Human intuition without consistent measurement produces missed signals — the number felt off but nobody looked. AI measurement without a baseline produces false confidence — “1405/1405 passing ✅” on a system that lost tests it shouldn’t have. The combination that actually works: the agent consistently measures and compares to a recorded snapshot, the human sets what “normal” looks like at each verified milestone, and anomalies surface as explicit flags rather than gut feelings.
This is a different layer from the hooks described in Claude Code enforcement. Hooks enforce architectural constraints — “don’t do this.” Baseline awareness catches drift from expected state — “something changed that shouldn’t have.” Both matter. They target orthogonal failure modes.

What This Means for the Autonomous Loop

In the loop as built through Episode 3, executor failures divide into two categories:
  • Loud failures: compile errors, test failures, explicit verifier rejections.
  • Silent failures: blast radius surprises — things that compile and pass but broke something the agent didn’t know it was touching.
Episode 3’s impact graph addressed the second category. But baseline drift is a third: output that passes all checks and is nonetheless incomplete. The tests pass. The count is wrong. There’s no signal — unless something is watching the count relative to what it should be. The fix is to encode “normal” as a committed artifact. A baseline manifest — written at each verified milestone, checked by the executor before submitting a PR.
# baseline.yaml — committed at verified milestones, checked before each PR

snapshot_date: "2026-01-15"
git_ref: "abc1234f"
milestone: "Phase 0 — pre-refactor baseline"

metrics:
  test_count: 1408
  loc_total: 47230
  module_count: 23
  compilation_time_seconds: 34
  crate_count: 8

thresholds:
  test_count:
    min_ratio: 0.98           # flag if tests drop more than 2%
    direction: non-decreasing
  loc_total:
    max_delta_percent: 15     # flag LOC changes beyond expected task scope
  compilation_time_seconds:
    max_delta_percent: 25     # flag significant regression
  module_count:
    direction: non-decreasing # splitting should add modules, not remove
Before the executor opens a PR, it runs the canary checks against the most recent snapshot. If test count dropped more than 2%, if LOC delta is outside the range implied by the task scope, if compilation time jumped — it halts and flags. The verifier receives the flag alongside the diff.

This doesn’t require a human to remember a number across weeks. The manifest carries it. The executor compares against it. The human’s role shifts from “remember what normal looks like” to “set what normal looks like at each milestone.” That’s a better division of cognitive labor. The executor handles consistent measurement. The human handles intentional baseline updates. Anomalies surface as flags before the PR is reviewed — not weeks later when someone asks why the test count feels low.
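A minimal sketch of that pre-PR comparison, under stated assumptions: the manifest is already parsed into a plain dict whose keys mirror the baseline.yaml above, the current metrics were gathered by the canary commands, and the function and variable names are mine.

```python
def canary_flags(baseline: dict, current: dict) -> list[str]:
    """Compare current metrics to the baseline snapshot; return human-readable flags."""
    flags = []
    metrics, thresholds = baseline["metrics"], baseline["thresholds"]
    for name, rules in thresholds.items():
        old, new = metrics[name], current[name]
        if rules.get("direction") == "non-decreasing" and new < old:
            flags.append(f"{name} decreased: {old} -> {new}")
        if "min_ratio" in rules and new < old * rules["min_ratio"]:
            flags.append(f"{name} below {rules['min_ratio']:.0%} of baseline: {old} -> {new}")
        if "max_delta_percent" in rules:
            delta = abs(new - old) / old * 100
            if delta > rules["max_delta_percent"]:
                flags.append(f"{name} moved {delta:.0f}% (limit {rules['max_delta_percent']}%)")
    return flags

# Shape mirrors baseline.yaml above; values are the episode's numbers.
baseline = {
    "metrics": {"test_count": 1408, "loc_total": 47230, "compilation_time_seconds": 34},
    "thresholds": {
        "test_count": {"min_ratio": 0.98, "direction": "non-decreasing"},
        "loc_total": {"max_delta_percent": 15},
        "compilation_time_seconds": {"max_delta_percent": 25},
    },
}
current = {"test_count": 1405, "loc_total": 47600, "compilation_time_seconds": 48}
for flag in canary_flags(baseline, current):
    print(flag)
# test_count decreased: 1408 -> 1405
# compilation_time_seconds moved 41% (limit 25%)
```

Note the design choice the two test_count rules encode: 1405/1408 is within the 2% ratio, so min_ratio alone would have stayed silent; the non-decreasing direction rule is what catches the three missing tests.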

What’s Next

The loop has memory (Episode 2). It has blast radius awareness (Episode 3). It needs baseline awareness to close the third gap: passing output that is nonetheless wrong. That requires the manifest to evolve alongside the codebase — established at Phase 0, updated at each verified milestone, checked before every PR, used as a rejection signal when deltas are unexpected. Episode 5 covers the implementation: how the manifest gets created, who updates it, what the verifier does with a flag, and where human judgment is still necessary at the boundary between “unexpected but intentional” and “unexpected and wrong.”

Episode 5: The Gate That Wasn't There

An autonomous loop needs hard process gates the same way a compiler needs type checks. Soft norms don’t hold when there’s no human per task.

All content represents personal learning from personal and side projects. Code examples are sanitized and generalized. No proprietary information is shared. Opinions are my own and do not reflect my employer’s views.