

Director: We had a governance bug in production this week. Not in the code. In the org chart.

Architect: The CPO problem. Two agents, two interpretations of the same three letters. The Control Plane Officer was issuing infrastructure mandates. The product roadmap was getting blocked on platform decisions that had nothing to do with product.

Director: So we filed an ADR.

Builder: You filed an ADR. For a job title.

Director: We filed an ADR for a governance ambiguity that was causing conflicting decisions at the architecture layer. That it involved a job title is incidental.

Architect: The title was the bug. “Control Plane Officer” made it sound like the CPO owned the control plane. Agents were routing infrastructure questions to that role when they should have gone to the CTO. The confusion wasn’t in anyone’s head — it was baked into the name.

Builder: So what did you actually do?

Director: Renamed it. “Chief Product Officer.” Archived the old role definition. Wrote ADR-008 formalizing that the VP of Engineering reports to the CTO, not the CPO. Two hours of deliberate governance work. That’s what prevents weeks of silent misalignment accumulating in the queue.

Builder: And the DevOps Engineer?

Director: Added to the RACI matrix. CI and infrastructure accountability now has an explicit owner. “Someone should handle the pipeline” is not a policy. “Someone does handle it” is.

The Governance Bug

An autonomous development organization depends on every role being unambiguous. When roles are ambiguous, agents don’t argue — they guess. And guesses in a distributed system don’t surface as debates; they surface as silent misalignment: two agents, independently interpreting the same directive, making locally rational decisions that contradict each other at the boundary. That’s what happened with the CPO designation. “Control Plane Officer” meant something specific to the infrastructure layer — the role managing platform services, provisioning, and the control surface between tenants. But in every other organizational context, CPO means “Chief Product Officer.” Product ownership. Roadmap authority. The agents were not confused. They were correctly following their respective interpretations of an ambiguous term. That’s not a bug in the agents; it’s a bug in the specification.

The fix required three things: rename the role, archive the old definition so it doesn’t linger as a ghost in the knowledge base, and write ADR-008 to make the reporting structure explicit. VP of Engineering reports to the CTO, not the CPO. Infrastructure decisions flow up through the engineering chain. Product decisions flow through product ownership. Two chains. No crossing.

We also committed D-001 and D-002 — the first formal Founder Directives in the repository. These sit above the ADR layer: constraints that aren’t up for architectural debate. The distinction matters. Some decisions get reviewed, amended, and superseded as the system evolves. Others define the invariants of the system itself. Treating them the same collapses a hierarchy that exists for a reason.

ADRs as Org Code

Here’s the thing that took us a while to internalize: an org chart is software. Not metaphorically. Literally. An org chart is a specification that agents execute. When the spec is ambiguous, agents produce undefined behavior. When the spec is clear, agents produce consistent behavior. The compiler doesn’t care whether the spec describes a data structure or a reporting relationship — it parses what you give it. This week we ran the same process on org structure that we’d been running on the codebase for months:
  1. Identify the ambiguity (CPO means two things)
  2. Write the fix (rename, archive, ADR)
  3. Peer review (does this conflict with any existing decision?)
  4. Merge to main
  5. Update affected agents
The ADR format — context, decision, consequences — turns out to be exactly the right structure for organizational decisions. “Context” documents what was unclear and why it mattered. “Decision” states the outcome precisely. “Consequences” maps the downstream effects: who gets added to RACI, who loses decision-making scope.

The DevOps Engineer was added to the RACI matrix as a direct consequence of ADR-008’s clarity on the infrastructure chain. Once the reporting structure was unambiguous, the accountability gap became visible. No DevOps accountability on CI and infrastructure had been an implicit assumption. Now it’s an explicit assignment.

Version-controlled governance means you can trace every organizational decision back to the reasoning that produced it. In a year, when someone asks why billing endpoints live in the billing service and not the platform layer, there’s ADR-007 with full context, alternatives considered, and trade-offs accepted. Institutional memory that doesn’t live in anyone’s head — and can’t walk out the door. Read more on ADRs as Architecture.
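If an org chart is software, an ADR can be encoded as data. Here is a minimal sketch of what that might look like — the `ADR` schema and the ADR-008 field values are illustrative reconstructions from the text, not the project’s actual record format:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ADR:
    """A version-controlled organizational decision record (hypothetical schema)."""
    number: int
    title: str
    context: str                 # what was unclear and why it mattered
    decision: str                # the outcome, stated precisely
    consequences: list = field(default_factory=list)  # downstream effects (RACI, scope)
    status: str = "accepted"     # accepted | superseded | archived

# ADR-008, as described above, encoded as data:
adr_008 = ADR(
    number=8,
    title="VP of Engineering reports to CTO",
    context="'CPO' was ambiguous: Control Plane Officer vs. Chief Product Officer.",
    decision="Rename the role to Chief Product Officer; VP of Engineering reports to the CTO.",
    consequences=[
        "DevOps Engineer added to RACI matrix for CI and infrastructure",
        "Old 'Control Plane Officer' role definition archived",
    ],
)
```

Because the record is frozen, amending it means superseding it with a new ADR — the same append-only discipline the text describes for the decision log itself.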

The Simulation Runner

Architect: ADRs fix the governance layer. But governance is only half the problem. How do you trust an agent to act before it acts?

Director: You test it. The same way you test any code before shipping it.

Builder: The simulation runner.

Director: Exactly. If we’re hiring a DevOps Engineer agent to manage CI and infrastructure, we don’t want to discover its failure modes in production. We want to discover them in a test environment — isolated Kubernetes Job, mock API, full observability. Before it ever touches a cluster.

The simulation runner is a staging environment for digital employees. A human engineer joining a team gets onboarding, pair programming, and low-stakes exposure before they’re trusted with critical paths. An agent joining the org now gets a simulation harness: a controlled environment where it demonstrates its behavior before it’s granted real authority.

The mock API layer is the key. It captures every request the agent makes, returns realistic responses, and surfaces the full interaction log. You can see exactly what the DevOps Engineer agent reaches for when given a “provision new tenant” task — which endpoints it calls, in what order, with what parameters. You can compare that against the expected behavior defined in its charter, and catch the gaps before they become incidents.

We also added 24 smoke tests for the MCP server layer — the datastore agents use to retrieve and write persistent knowledge. These aren’t edge-case tests; they’re baseline confidence checks. If the datastore can’t reliably serve memory operations, the entire pipeline runs on a broken foundation. Smoke tests run first because they’re the canary: if twenty-four basic operations don’t pass, nothing else should proceed.

The simulation runner changed the trust model. Before, an agent was trusted until it did something wrong. Now, an agent earns trust by demonstrating correct behavior under controlled conditions first. Same model we use for code. Should have been obvious from the start.

See also Episode 9: The Self-Healing Loop for the earlier iteration of this idea.
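The capture-and-compare idea behind the mock API layer can be sketched in a few lines. Everything here is hypothetical — `MockAPI`, the charter endpoint list, and the stand-in agent are illustrations of the technique, not the project’s real interfaces:

```python
# Sketch of a simulation-runner mock API: capture every request the agent
# makes, then compare the interaction log against charter expectations.

class MockAPI:
    """Captures each request an agent makes and returns canned responses."""
    def __init__(self, responses):
        self.responses = responses   # endpoint -> canned response
        self.log = []                # full interaction log

    def call(self, endpoint, **params):
        self.log.append((endpoint, params))
        return self.responses.get(endpoint, {"status": "ok"})

# Hypothetical charter: the endpoints (in order) a "provision new tenant"
# task is expected to touch.
CHARTER_ORDER = ["auth/token", "tenants/create", "dns/register", "ci/bootstrap"]

def provisioning_agent(api):
    """Stand-in for the DevOps Engineer agent under test."""
    api.call("auth/token", scope="provision")
    api.call("tenants/create", name="acme")
    api.call("dns/register", tenant="acme")
    api.call("ci/bootstrap", tenant="acme")

api = MockAPI(responses={"auth/token": {"token": "fake"}})
provisioning_agent(api)

# Compare observed behavior against the charter before granting real authority.
observed = [endpoint for endpoint, _ in api.log]
gaps = [e for e in CHARTER_ORDER if e not in observed]
assert observed == CHARTER_ORDER, f"charter mismatch: {gaps or observed}"
```

The point is the ordering of trust: the assertion at the end runs in the harness, before the agent ever sees a real cluster.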

The Autoresearch Loop

Architect: So we test agents before they act. What about the intelligence layer? The system still needs to stay current — new libraries, emerging patterns, relevant papers. That’s not a problem you solve once.

Director: We solved it by making research a scheduled process with human escalation on failure. Not human involvement by default.

Builder: The human is the exception handler.

Director: The human is the exception handler. The autoresearch loop runs autonomously using Gemma3:27b — a local model capable enough for research synthesis without requiring external API calls on every cycle. It monitors specified domains, synthesizes findings, and feeds them directly into sprint planning. When new findings are relevant to an open task, they surface in the task context automatically. Sprint plans update without human intervention.

The human gets involved in exactly two cases: when the loop finds something that requires a decision above its authority level, and when the loop finds nothing.

That second case took deliberate design work to get right. A zero-findings result isn’t a success state — it’s a signal. Either the search parameters are too narrow, the domain has gone quiet in a way worth knowing, or something broke in the research pipeline itself. Any of those outcomes warrants human attention. So the loop sends a Telegram alert on zero findings, not silence. Silence would mean “all good.” An alert means “I looked, and finding nothing is itself information you should evaluate.”

Failure alerts work the same way. If the research loop hits an error — model timeout, malformed response, broken feed — it doesn’t silently fail and resume next cycle. It alerts immediately. The human knows within minutes, not at the next morning’s standup.

The net effect: research happens continuously. Sprint planning incorporates current knowledge automatically. The human reviews the sprint plan — not the research process that produced it. That’s the right layer of abstraction.
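The escalation logic described above — alert on failure, alert on zero findings, silence only on success — can be sketched as follows. `notify` here is a hypothetical stand-in for the real Telegram alert, and `research_cycle` / `update_sprint_plan` are placeholder callables, not the project’s actual API:

```python
# Sketch of the autoresearch loop's escalation policy:
# failure alerts immediately, zero findings alerts too, success is silent.

def notify(message):
    """Hypothetical stand-in for a Telegram alert to the human."""
    print(f"[telegram] {message}")

def run_autoresearch(research_cycle, update_sprint_plan):
    try:
        findings = research_cycle()
    except Exception as exc:
        # Failure is never silent: alert immediately, don't wait a cycle.
        notify(f"autoresearch FAILED: {exc}")
        raise
    if not findings:
        # Zero findings is a signal, not a success state.
        notify("autoresearch found nothing: check search params and pipeline")
        return []
    # Success path: feed findings into sprint planning, no human in the loop.
    update_sprint_plan(findings)
    return findings

# A cycle that returns nothing triggers the zero-findings alert:
empty_result = run_autoresearch(lambda: [], lambda f: None)
# A cycle with findings updates the plan silently:
full_result = run_autoresearch(lambda: ["new paper"], lambda f: None)
```

Note the asymmetry: the only path that produces no alert is the one where research actually fed the sprint plan, which is exactly the "silence means all good" contract described above.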
Related: Multi-Agent Workflow.

The Promise, Revisited

The pivot-draft ended with: “Make the GitHub Actions pipeline actually work. End to end. Task picked, implemented, verified, merged — without a human in the implementation loop. When that runs for real, we’ll write about it.” Here we are, writing about it. And we want to be honest about what “ran” means.

The pipeline ran. The governance layer is real: ADRs formalize decisions, Founder Directives establish invariants, the RACI matrix has explicit owners. The simulation runner is real: agents can be tested before they act. The autoresearch loop is real: research feeds sprint planning without human intermediation, and escalation happens on failure, not by default.

What isn’t finished: the full end-to-end loop — task picked, implemented, verified, merged — running reliably at scale across all bounded contexts. The infrastructure is in place. The governance model holds. The simulation harness validates agent behavior before deployment. But “reliably at scale” is a different bar than “we saw it work.” That’s not a failure. That’s accurate reporting.

The pivot-draft also noted: “The problem isn’t that the approach is wrong. It’s that the orchestration layer between the pieces doesn’t exist yet.” It exists now. The question is whether it holds.

What’s Next

The simulation runner needs real agent profiles — not just smoke tests, but full charter-level behavioral validation. We want the DevOps Engineer agent handling a tenant provisioning request end-to-end before it ever touches a production cluster. That means fleshing out the mock API layer to cover the full surface area of what a provisioning task actually touches.

The autoresearch loop needs longer operating history before we understand its failure modes. Zero-findings and failure alerts are implemented — but we don’t yet know what the failure rate looks like at steady state, or whether the escalation threshold is calibrated correctly. A few more cycles will tell us.

And the governance layer needs its first real test: an architectural decision that the org catches and corrects without human initiation. The ADR process exists. The institutional memory exists. The question is whether agents use them correctly when a novel situation arises — or find a way around them.

The org is starting to govern itself. Episode 12 will tell us if it holds.

Episode 10: The Leadership Evolution

From bots to a boardroom — how we built the federated governance model that Episode 11 stress-tested.

Series Overview

The full arc from orchestration problem to autonomous org.

All examples are sanitized and generalized. No proprietary information is shared. Opinions are my own and do not reflect my employer’s views.