We shipped a Kubernetes provisioning backend, a full customer onboarding saga, and a DynamoDB Streams fan-out materializer this week. None of it compiled in CI. That’s not a dramatic exaggeration. Our self-hosted Gitea runners were crashing — out of memory — on any
cargo command. Compilation, test binary linking, cargo-audit, clippy: anything that touched the Rust toolchain was enough to kill the process. For a codebase built on Rust across five services, that’s not a flaky test. That’s a production incident in your development pipeline.
This is the story of how we kept shipping while the build environment fought back, what we tried to fix it, what we got gloriously wrong, and what it finally taught us about CI debt.
The War Room
Before the week was over, a conversation happened that captures it perfectly.

Director: The CI runners are OOMing on cargo clippy. We can’t merge anything. Every service is red.

Architect: I’m not surprised. We’ve been piling Rust compilation load onto runners that were sized for YAML parsing. The feedback loop has been getting slower for weeks — we just didn’t name it as a problem until it became a wall.

Builder: I can work around it. Disable parallelism, skip compilation checks in CI, manually run cargo test before pushing. Features are still shippable.

Director: We’re going to keep working around it — right up to the moment the workarounds collapse. The build environment is the bottleneck now. It deserves the same attention we’d give a production outage.

Architect: Agreed. And while we’re fixing it — can we extract the hard-won CI logic into a shared repo? DRY applies to pipelines too.

Builder: Famous last words.
What We Built
Week 11 was 68 meaningful commits across seven themes. Here’s the inventory:
- Kubernetes provisioning backend: A KubernetesProvisioningBackend in the foundation layer — the first Kubernetes-native provisioner, converting Rust business logic into live tenant namespaces via a Helm chart.
- Customer onboarding saga: Sign-up → ITSM approval → subscription provisioning → live tenant namespace, as a single coordinated saga with eight E2E lifecycle test scenarios.
- Capsule-scoped data filter middleware: Tenant isolation enforced at the query layer with adversarial tests that deliberately try to escape tenant boundaries.
- Feed materializer: DynamoDB Streams → EventBridge Pipes fan-out, completing the write path for the global activity feed.
- DevOps Engineer digital employee: A new AI agent with a CI/infrastructure runbook, wired into the RACI matrix and sprint planning.
- CI/CD infrastructure battle: Thirty-five commits across five services fighting OOM runners, a broken SonarQube token, a promising reusable workflow repo, and a Gitea 1.24 incompatibility that made us roll all of it back.
The Journey
Day 1–3: The OOM Cascade
It started Monday morning. The pipeline turned red. Not a failing test — the runner itself was dying. The error signature was always the same: the Gitea runner process exhausting memory mid-compilation, leaving a job summary that simply said “process killed.” We started with the obvious mitigations:
- Disable parallelism. Set CARGO_BUILD_JOBS=1. The runner still died, just more slowly.
- Skip compilation in clippy. Add --no-deps to clippy invocations. Helped a little. Not enough.
- Replace actions/checkout with manual git clone. The stock checkout action was pulling more than we needed, and its token scoping was leaking into subsequent steps in ways we hadn’t fully audited. Manual clone gave us tighter control.
- Relocate CARGO_HOME and registry credentials. We moved the Cargo registry cache to a persistent volume path, reducing per-job I/O overhead that was contributing to memory spikes.
- Serialize test suites with Mutex-style locking. For integration tests sharing database state, we moved from parallel to explicit sequential execution via a test-level lock macro. Less parallelism, more predictability.
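The lock itself is simple. Here is a minimal sketch of the pattern in plain Rust (the names db_lock and with_db are illustrative; our real suite hides this behind a test-level macro):

```rust
use std::sync::{Mutex, MutexGuard, OnceLock};

/// One global lock shared by every integration test that touches
/// database state.
fn db_lock() -> MutexGuard<'static, ()> {
    static LOCK: OnceLock<Mutex<()>> = OnceLock::new();
    LOCK.get_or_init(|| Mutex::new(()))
        .lock()
        // A panicking test poisons the mutex; recover instead of
        // wedging the rest of the suite.
        .unwrap_or_else(|poisoned| poisoned.into_inner())
}

/// Run a test body while holding the lock, forcing sequential execution
/// even when the test harness runs tests on multiple threads.
fn with_db<T>(body: impl FnOnce() -> T) -> T {
    let _guard = db_lock();
    body()
}
```

The poisoned-lock recovery matters: without it, one failing test would cascade into spurious failures across the whole suite.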
By mid-week we had working CI again — but it was a tangle of workarounds, not a solution. For more context on building and hardening self-hosted Gitea runner infrastructure from scratch, see our earlier guide: The CI Pipeline That Runs in Your Cluster.
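Rolled together, the guards looked roughly like this (a sketch; the job name and cache path are illustrative, not our exact pipeline):

```yaml
jobs:
  rust-ci:
    env:
      CARGO_BUILD_JOBS: "1"          # no parallel rustc processes
      CARGO_HOME: /data/cargo-home   # registry cache on a persistent volume
    steps:
      # Manual clone instead of actions/checkout: tighter token scoping,
      # less pulled down per job.
      - run: git clone --depth 1 "${REPO_URL}" .
      - run: cargo clippy --no-deps   # lint our code without compiling deps
      - run: cargo test
```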
Day 4: The YAML Heredoc That Broke the YAML Parser
In the middle of diagnosing the OOM issues, a pipeline step refused to parse entirely — cryptic YAML syntax error, unchanged for two weeks. Tracking it down took three hours. The culprit was a multi-line shell command using a heredoc (<<EOF). The Gitea YAML parser treats << in a run: field as a YAML literal block scalar sequence, not a shell heredoc. The result surfaces as “invalid YAML” rather than “your shell syntax is wrong.”
The fix is counterintuitive: rewrite the heredoc as a regular multi-line string using | block scalar notation.
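For illustration, a before/after sketch (the step name and file contents are invented):

```yaml
# Before: the runner reports "invalid YAML" on the shell heredoc (<<EOF)
- name: write-config
  run: |
    cat <<EOF > config.toml
    key = "value"
    EOF

# After: the same content written without a heredoc, as a plain | block
- name: write-config
  run: |
    printf 'key = "value"\n' > config.toml
```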
Day 5: The Reusable Workflow Experiment
By Thursday, CI was stable enough to ship features. Our architect proposed a structural fix: extract all the hard-won CI logic into a shared ci-workflows repository. One source of truth for Rust CI, Docker publishing, Node.js builds, and SonarQube scanning.
We built four reusable workflow files:
- rust-ci.yml: cargo fmt, clippy, test, cargo-audit (with OOM guards)
- docker-publish.yml: Kaniko build → distroless image → push to our internal container registry
- node-ci.yml: lint, test, type-check
- sonar.yml: SonarQube scan with explicit token passing
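Each service pipeline then collapsed to a thin caller along these lines (the org name, workflow path, and ref are placeholders, not our real layout):

```yaml
# A service's workflow file after migration: all real logic lives in the
# shared ci-workflows repo.
name: ci
on: [push]
jobs:
  rust:
    uses: our-org/ci-workflows/.gitea/workflows/rust-ci.yml@main
```

This cross-repository uses: reference is exactly the shape that our Gitea version later refused to resolve.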
Day 6: The Rollback
Friday morning, every service CI was broken in a new way. Not OOM — the runner simply refused to resolve the external workflow reference. The same error across all five services: our Gitea 1.24 instance could not fetch uses: references to external repositories. The feature exists in the UI. The YAML is syntactically valid. The runner accepts the job definition. Then it silently fails to fetch the reference and exits with an opaque error. You can track this behavior in the Gitea issue tracker — search for reusable workflow cross-repository support. (Note: verify this behavior against your own instance before drawing conclusions — it may be version-specific.)
This is not a configuration error. It’s a platform limitation in the version we’re running.
We rolled back. All five services reverted to inline CI within two hours. The ci-workflows repo still exists — with a README that says “parked until Gitea upgrade.” Total lifespan: twenty-three hours.
What We Shipped Despite the Chaos
The Kubernetes Provisioner
While CI was on fire, we merged the KubernetesProvisioningBackend into the foundation layer — the infrastructure primitive that makes multi-tenant Kubernetes deployments programmable from Rust business logic.
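In spirit, the backend slots in behind a provisioning abstraction. A minimal sketch, with the caveat that the trait, error type, and method names below are invented for illustration and are not the real foundation-layer API:

```rust
/// Illustrative types only; the real API differs.
#[derive(Debug, PartialEq)]
struct Namespace(String);

#[derive(Debug)]
enum ProvisionError {
    HelmFailed(String),
}

/// Abstraction the foundation layer programs against.
trait ProvisioningBackend {
    fn provision_tenant(&self, tenant_id: &str) -> Result<Namespace, ProvisionError>;
}

/// Kubernetes-native implementation. In production this renders the
/// tenant Helm chart (namespace, RBAC, per-tenant manifests); here it
/// is stubbed to just derive the namespace name.
struct KubernetesProvisioningBackend;

impl ProvisioningBackend for KubernetesProvisioningBackend {
    fn provision_tenant(&self, tenant_id: &str) -> Result<Namespace, ProvisionError> {
        Ok(Namespace(format!("tenant-{tenant_id}")))
    }
}
```

The point of the trait boundary is that saga code stays ignorant of Helm and Kubernetes entirely.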
1. Tenant sign-up received: The platform’s onboarding saga receives a sign-up request and creates a pending provisioning record.
2. ITSM approval gate: The request routes to the ITSM service for approval. No self-serve bypass — every tenant provisioning flows through a tracked approval gate.
3. Subscription provisioned: On approval, the subscription record is created and product entitlements are assigned.
4. Kubernetes backend invoked:
The KubernetesProvisioningBackend applies the tenant Helm chart, creating the namespace, RBAC, and per-tenant resource manifests in Kubernetes.

Capsule-Scoped Data Filter Middleware
The capsule-scoped middleware enforces tenant isolation between the API handler and the database query. Every query — regardless of handler code — is automatically scoped to the authenticated tenant’s capsule before hitting the database. We wrote adversarial tests for this. Tests that deliberately construct queries attempting to escape the capsule boundary. Tests that pass a different tenant ID in the query than in the auth context. Tests that try to enumerate cross-tenant data via sort key prefix scans.

The Feed Materializer
The DynamoDB Streams → EventBridge Pipes fan-out materializer completes the write path for the platform’s global activity feed. When any domain event lands in DynamoDB, the change stream is captured, EventBridge Pipes routes it to the appropriate downstream consumers, and the read models are updated.

The Meta-Story: Fighting a Fire While Building the Fire Station
Here’s what’s slightly absurd about this week: while we were drowning in CI firefighting, we built an AI agent to own exactly that class of problem. The DevOps Engineer digital employee was created Thursday — mid-firefight — with a full CI/infrastructure runbook covering OOM failures, container registry management, Kubernetes deployments, SonarQube configuration, and runner triage. It was wired into the RACI matrix alongside existing digital employees. The MCP server expanded with 28 new tools, readOnlyHint flags to prevent accidental mutations during investigation, and an autoresearch upgrade to Gemma3 27B with mandatory human escalation on zero findings.
We were encoding the institutional knowledge for handling CI failures into an agent while we were actively experiencing those failures.
It felt less absurd by Friday. Having a structured runbook — even one written and read by the same person — cut the second OOM incident response from three hours to thirty minutes. This mirrors what we wrote about in The Runbook Is a Failure Ledger: organizing institutional knowledge has value before you hand it to an agent. Turns out it’s also valuable just for you.
What We Learned
1. CI OOM is a production incident, not a flaky test. We treated the OOM runners as an annoyance for two weeks before it forced a stop. The cost was a full week of degraded developer feedback loops across five services. When your build environment is failing, stop and fix it — it compounds.

2. Validate platform compatibility before org-wide rollout. The reusable CI workflows idea was correct. The execution — adopt everywhere in one day without verifying platform support — was overconfident. The right sequence: create → test in one service → confirm platform compatibility → roll out. We skipped step three.

3. Adversarial tests are a different discipline than coverage tests. The capsule-scoped middleware has 100% test coverage. That’s not what we’re proud of. We’re proud that the suite includes tests specifically designed to break the boundary. Coverage tells you code paths were executed. Adversarial tests tell you the security property held under attack.

4. Build environment debt taxes every commit. Every feature shipped this week cost extra validation overhead. No fast CI loop means longer manual test cycles, higher merge anxiety, slower iteration. Technical debt in the build environment is not invisible — it compounds across every team member on every push.

5. Runbooks pay dividends immediately. The DevOps runbook — even written and read by the same person — cut the second OOM incident response by 90%. Organizing institutional knowledge has value before you hand it to an agent.

What Didn’t Work
The Reusable Workflow Repo
This deserves a full post-mortem. The idea was right. Reusable CI workflows are a genuine DRY win. Consistent tooling across five services reduces the surface area for drift. One place to update OOM guard settings means one fix reaches every service simultaneously. We executed it well: four structured workflow files, clean uses: invocations, same-day adoption. By Thursday afternoon, every service CI was green and per-service pipeline files had shrunk by 80%.
Then Friday morning happened.
The failure mode is frustrating because it’s invisible until it breaks. The YAML is valid. The workflow references resolve in the Gitea UI. The runner accepts the job. And then it silently fails to fetch the external workflow file — no pre-flight check would have caught this without actually running the pipeline against the real platform version.
The lesson is not “don’t build shared CI infrastructure.” It’s: test reusable workflow references on a throwaway pipeline before migrating production services. One hour of validation would have saved a morning of rollback.
The ci-workflows repo is parked. When we upgrade Gitea, it’s the first thing we’ll re-enable.
How AI Helped (and Where It Struggled)
Where AI excelled: Describing the onboarding saga semantics — “a sign-up that must survive ITSM approval, subscription provisioning, and K8s provisioning, each of which can fail independently” — and getting back eight well-structured E2E test scenarios saved several hours. The test structure required only minor adjustment. The DevOps Engineer’s runbook was co-authored with AI. We described the CI problems we’d encountered — OOM on Rust compilation, SonarQube token scoping, Kaniko build failures, distroless ENTRYPOINT issues — and the AI organized them into a structured runbook with diagnostic steps and recovery procedures. That runbook is now the starting context for the DevOps digital employee.

Where AI struggled: The Gitea 1.24 reusable workflow bug was outside the AI’s knowledge. We described the failure mode, got reasonable suggestions — file path casing, secret scoping, YAML indentation — but none were the root cause. The answer required reading the Gitea 1.24 release notes and issue tracker. Live knowledge problem; training data can’t solve it. The OOM diagnosis also required understanding how Rust’s compilation model interacts with memory-constrained environments. AI could suggest CARGO_BUILD_JOBS=1 immediately. It couldn’t tell us that pre-warming the Cargo registry on a persistent volume was the real fix — that came from profiling the runner’s I/O during a failed job.
The pattern from previous weeks holds: AI is strong on patterns and boilerplate, weak on novel platform-specific failure modes. The DevOps digital employee will face the same limitation. Its value is organizing institutional knowledge the team has already acquired — not independently diagnosing novel infrastructure failures. For a deeper look at where AI assistance breaks down, see Episode 10: The Leadership Evolution.
What’s Next
Week 12: Gitea upgrade and ci-workflows re-enablement. The shared workflow infrastructure deserves to actually work.
Beyond CI: the global activity feed now has a complete write path. Week 12 will wire up the read side — feed query APIs and real-time update propagation. The onboarding saga shipped with ITSM approval as a required gate; we’ll be hardening the ITSM service’s HTTP layer and integrating its first Docker image into Kubernetes deployment manifests.
One more thing we’re committing to: stop treating “the build environment is broken” as something you work around. The DevOps digital employee now owns CI health as a first-class operational concern. We’ll see if structured ownership makes a measurable difference.
The engine is running. The toolchain, finally, is next.
68 meaningful commits. 5 services touched. 35 CI commits just to keep the lights on. 1 reusable workflow repo, created and rolled back in 23 hours. This is what building looks like when the infrastructure fights back.
All content represents personal learning from personal projects. Code examples are sanitized and generalized. No proprietary information is shared. Opinions are my own and do not reflect my employer’s views.