
This is Episode 5 of the Autonomous Dev Org series — an honest account of building a development organization where AI handles implementation and humans handle direction. Each episode covers what we attempted, what broke, and what we learned.

The Task That Shouldn’t Have Moved

Episode 4 introduced baseline drift — the failure mode where output passes all checks but is nonetheless wrong. The fix was a manifest: encode “normal” as a committed artifact and check against it before closing. This episode is about a failure at a higher layer. Not “is the output normal?” but “should this task have started at all?”

The loop was running. Tasks were moving through the queue. PRs were opening on schedule. From the outside, it looked healthy. Then we reviewed a PR and found it had implemented against a requirement that was still in draft. Not technically wrong — the code was correct for the requirement as written at task creation. But the requirement had changed three times since then. The implementation was correct for a version of the requirement that no longer existed.

Nobody had enforced that requirements needed to be stable before tasks ran against them. It was understood. A norm. The kind of thing a human team manages naturally, because engineers check whether a spec is settled before building against it. An autonomous agent doesn’t check whether the spec is settled. It reads the task, reads the requirement, implements, opens a PR. The norm that humans carry implicitly is invisible to the loop.

What a Draft Requirement Costs

The immediate cost was an architectural correction: discard the PR, update the requirement to its current state, requeue the task, and implement again. Maybe two hours of executor time and a failed verifier review.

The real cost was trust. A loop that implements against draft requirements will eventually produce a PR that passes tests, gets approved, and merges — but implements behavior the product no longer needs. Tests pass because they test the implementation, not the intent. The verifier compares the diff to the requirement; if both are wrong in the same direction, it can pass cleanly.

The only reliable defense is to not start. Once implementation is underway against a draft requirement, every downstream step — blast radius analysis, baseline checks, test coverage, verifier review — is optimizing toward the wrong target. Catching the problem early is free. Catching it after merge is expensive.

This is the same pattern as Episodes 3 and 4, one level up. Episode 3 asked: does the agent know what it will break before writing? Episode 4 asked: does the agent know whether the output is normal? Episode 5 asks: does the agent know whether the task should start?

Soft Norms Don’t Survive Automation

In a human team, soft norms work because humans read the room. An engineer about to implement a feature checks Slack, looks at the open comments on the spec, senses that the PM is still revising — and waits. None of that is written down. It’s tacit coordination: knowledge that exists in the team but isn’t encoded anywhere the loop can read.

An autonomous agent can’t read the room. It reads the task queue. If the task is queued, the agent’s model of the world says: this is ready to implement.

This isn’t a model capability limitation. You could prompt the agent to check requirement status before starting. But a prompt-level check is still a soft norm: a reminder that can be omitted, misunderstood, or overridden by a context-window edge case. The failure we observed happened in a loop that already had instructions to work from approved requirements. The instruction wasn’t enough.

The difference between a soft norm and a hard gate is what happens when the check fails. A soft norm produces a warning that gets ignored under load. A hard gate stops execution. A compiler doesn’t warn you that a type doesn’t match. It refuses to build. That’s not a design choice rooted in philosophy — it’s an engineering property. The failure mode of an unchecked type mismatch is too expensive and too silent to leave to norms. The same logic applies here. The failure mode of an agent implementing against a draft requirement is expensive and silent. It doesn’t produce an error. It produces a plausible-looking PR.
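The contrast is easy to make concrete. A minimal sketch — the function names and the `"approved"` state are hypothetical, not the loop’s real code: the soft norm emits a warning and execution continues; the hard gate raises and nothing downstream runs.

```python
import warnings

def soft_norm_check(status: str) -> None:
    # Soft norm: emit a warning the loop is free to ignore under load.
    if status != "approved":
        warnings.warn(f"requirement is '{status}', expected 'approved'")
    # Execution continues either way.

def hard_gate_check(status: str) -> None:
    # Hard gate: refuse to continue, the way a compiler refuses to build.
    if status != "approved":
        raise RuntimeError(f"requirement is '{status}'; refusing to execute")
```

The only structural difference is `warn` versus `raise`, but it is the difference that matters: one path can be ignored, the other cannot.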

Gate 0

The fix was a single check at the start of every task execution. Before any implementation begins — before blast radius analysis, before any file is touched — the executor verifies the requirement lifecycle state.
# Gate 0 — requirement lifecycle check
# Runs before any task implementation begins

EXECUTABLE_STATES = {"approved", "active"}

class GateFailure(Exception):
    """Raised when a pre-execution gate refuses to let a task run."""
    def __init__(self, gate: int, reason: str):
        super().__init__(reason)
        self.gate = gate

def check_requirement_gate(requirement_id: str) -> None:
    """
    Hard gate: refuse execution if the requirement is not in an approved state.
    This is a compiler error, not a linter warning.
    """
    # get_requirement_status queries the requirement store (elided here)
    status = get_requirement_status(requirement_id)

    if status not in EXECUTABLE_STATES:
        raise GateFailure(
            gate=0,
            reason=(
                f"Requirement {requirement_id} is in '{status}' state. "
                f"Tasks cannot execute against {status} requirements. "
                f"Approve the requirement before queuing tasks against it."
            ),
        )
No argument. No override. No “proceed anyway” flag. If the requirement isn’t in an executable state, the task returns a gate failure and notifies the reviewer. The task stays queued. Nothing is implemented. The gate is cheap to implement and expensive to not have. Every task that would have implemented against a draft requirement now fails fast and visibly instead of failing slowly and silently.
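What happens on the failure path can be sketched too. This is an illustrative executor skeleton, not the loop’s real interfaces — `execute_task`, the callback parameters, and the `GateFailure` stand-in are all hypothetical — showing the behavior described above: on gate failure, notify the reviewer, leave the task queued, implement nothing.

```python
class GateFailure(Exception):
    """Minimal stand-in for a gate-failure exception (hypothetical)."""
    def __init__(self, gate: int, reason: str):
        super().__init__(reason)
        self.gate = gate
        self.reason = reason

def execute_task(task, get_status, implement, notify_reviewer):
    """Run Gate 0 before anything else; fail fast and visibly, not slowly and silently."""
    status = get_status(task["requirement_id"])
    if status not in {"approved", "active"}:
        failure = GateFailure(gate=0, reason=f"requirement is in '{status}' state")
        notify_reviewer(task, failure)
        return "queued"  # nothing implemented; the task stays in the queue
    implement(task)
    return "done"
```

The key design property is that the gate sits inside the executor’s control flow, not in a prompt: there is no code path from a non-executable state to `implement`.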
This pattern extends directly to the blast radius and baseline checks from Episodes 3 and 4. All three are pre-execution checks that encode “ready” as a verifiable state rather than a tacit assumption.

The Org Structure Changed Too

Gate 0 was half the fix. The other half addressed why draft requirements were being queued in the first place. The requirement review process was a sequential chain: three reviewers in order, each signing off before the next could start. It was rigorous on paper. In practice, it created a bottleneck the autonomous loop amplified. When the executor ran ahead of the review chain, tasks queued against requirements that were technically in review but practically still open.

The sequential chain was designed for human-paced deliberation — thoughtful handoffs with time to reflect between each stage. What we needed instead was a faster, collaborative model: the relevant stakeholders review simultaneously, facilitated by whoever is most context-rich on that requirement at that moment.

The change: replace the rigid ordered chain with a facilitated parallel review. All reviewers get the requirement at once. One person is responsible for synthesizing disagreements and calling the requirement approved when consensus is reached. The gate still exists — requirements still require review before moving to approved. But the path to approved is faster and doesn’t create an ordered queue that falls behind the task executor.

This is the part that surprised us. Adding Gate 0 revealed that the constraint wasn’t the executor; it was the review process. The autonomous loop exposed a bottleneck in the human coordination layer that a slower-paced team would have worked around without noticing. The loop didn’t just need a gate. It needed the org structure to match the pace the gate enabled.
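The approval rule itself is small enough to state as code. A sketch under stated assumptions — the `Review` record, the verdict strings, and the facilitator flag are all illustrative, not the actual review tooling:

```python
from dataclasses import dataclass

@dataclass
class Review:
    reviewer: str
    verdict: str  # "approve" or "request_changes"

def facilitated_review(reviews: list[Review], facilitator_confirms: bool) -> str:
    # Every reviewer sees the requirement at once: no ordered chain,
    # so no reviewer blocks another. The facilitator synthesizes
    # disagreements and calls it approved only when all have signed off.
    if reviews and all(r.verdict == "approve" for r in reviews) and facilitator_confirms:
        return "approved"
    return "in_review"
```

Note what is preserved: the requirement still cannot reach `approved` without every sign-off. Only the ordering constraint is gone.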

A Hierarchy of Gates

Looking across Episodes 3–5, a pattern has emerged. The loop has accumulated a stack of pre-execution and pre-close checks, each targeting a different failure mode.
| Gate | When It Runs | What It Checks | Failure Mode Prevented |
| --- | --- | --- | --- |
| Gate 0: Requirement | Before implementation starts | Requirement is approved | Implementing against a moving spec |
| Gate 1: Blast Radius | Before any file is touched | Impact of the planned change | Cascading breakage across dependents |
| Gate 2: Baseline | Before PR is submitted | Output matches snapshot | Tests that pass but hide deleted tests |
| Gate 3: Verifier | After PR opens | Tests pass + requirement satisfied | Merging code that doesn’t meet intent |
Each gate is independent. Each addresses a class of failure the others can’t catch. Gate 0 doesn’t help you if you’re implementing against the right requirement but breaking fifteen callers. Gate 1 doesn’t help you if you’re fixing the right things but silently deleting tests. Gate 2 doesn’t help you if the output is normal but the spec you approved was never correct. Together, they define what “ready to merge” means in a loop with no human per task.

The insight that runs through all four: the loop doesn’t know what “ready” looks like at any level unless you encode it explicitly. Ready to start (Gate 0). Ready to write (Gate 1). Ready to submit (Gate 2). Ready to merge (Gate 3). None of these are self-evident to an autonomous executor. All of them can be encoded as hard checks.
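Structurally, the stack is just an ordered list of independent checks where the first failure stops execution. A minimal sketch — the per-gate predicates here are simplified stand-ins keyed on a task dict, not the real blast-radius or baseline logic:

```python
def run_gates(task: dict, gates: list) -> dict:
    """Run gates in order; the first failure stops execution before the next gate."""
    for number, (name, check) in enumerate(gates):
        ok, detail = check(task)
        if not ok:
            return {"passed": False, "gate": number, "name": name, "detail": detail}
    return {"passed": True}

# Simplified stand-ins for the real checks; each returns (ok, detail).
GATES = [
    ("requirement", lambda t: (t.get("requirement_status") == "approved",
                               "requirement not approved")),
    ("blast_radius", lambda t: (t.get("impact_analyzed", False),
                                "impact of planned change not analyzed")),
    ("baseline", lambda t: (t.get("baseline_matches", False),
                            "output diverges from committed baseline")),
]
```

Because each gate is an independent predicate, adding a new readiness condition is a one-line append to the list rather than a change to the executor.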

What’s Next

The gates are in place. The loop has process awareness, spatial awareness, and baseline awareness — built on the orchestration foundation from Episode 1 and the memory layer from Episode 2. The next question is whether the product they’re building actually works — not just “does it implement the requirement correctly” but “does it behave correctly when someone actually uses it.” The answer to that requires eating your own dog food: using the platform as a real tenant, running the full onboarding lifecycle, finding what breaks before a real customer does. That’s Episode 6.

Episode 6: The Autonomous Loop Eats Its Own Dog Food

The first thing the internal operations tenant did was call the subscription endpoint. Two thousand tests had passed. It returned an empty entitlement list. Here’s what dog-fooding reveals that gates can’t catch.

All content represents personal learning from personal and side projects. Code examples are sanitized and generalized. No proprietary information is shared. Opinions are my own and do not reflect my employer’s views.