
We shipped a Kubernetes provisioning backend, a full customer onboarding saga, and a DynamoDB Streams fan-out materializer this week. None of it compiled in CI. That’s not a dramatic exaggeration. Our self-hosted Gitea runners were crashing — out-of-memory — on any cargo command. Compilation, test binary linking, cargo-audit, clippy: anything that touched the Rust toolchain was enough to kill the process. For a codebase built on Rust across five services, that’s not a flaky test. That’s a production incident in your development pipeline. This is the story of how we kept shipping while the build environment fought back, what we tried to fix it, what we got gloriously wrong, and what it finally taught us about CI debt.

The War Room

Before the week was over, a conversation happened that captures it perfectly.
Director: The CI runners are OOMing on cargo clippy. We can’t merge anything. Every service is red.
Architect: I’m not surprised. We’ve been piling Rust compilation load onto runners that were sized for YAML parsing. The feedback loop has been getting slower for weeks — we just didn’t name it as a problem until it became a wall.
Builder: I can work around it. Disable parallelism, skip compilation checks in CI, manually run cargo test before pushing. Features are still shippable.
Director: We’re going to keep working around it — right up to the moment the workarounds collapse. The build environment is the bottleneck now. It deserves the same attention we’d give a production outage.
Architect: Agreed. And while we’re fixing it — can we extract the hard-won CI logic into a shared repo? DRY applies to pipelines too.
Builder: Famous last words.
Spoiler: the shared CI repo lasted exactly twenty-three hours.

What We Built

Week 11 was 68 meaningful commits across six themes. Here’s the inventory:
  • Kubernetes provisioning backend: A KubernetesProvisioningBackend in the foundation layer — the first Kubernetes-native provisioner, converting Rust business logic into live tenant namespaces via a Helm chart.
  • Customer onboarding saga: Sign-up → ITSM approval → subscription provisioning → live tenant namespace, as a single coordinated saga with eight E2E lifecycle test scenarios.
  • Capsule-scoped data filter middleware: Tenant isolation enforced at the query layer with adversarial tests that deliberately try to escape tenant boundaries.
  • Feed materializer: DynamoDB Streams → EventBridge Pipes fan-out, completing the write path for the global activity feed.
  • DevOps Engineer digital employee: A new AI agent with a CI/infrastructure runbook, wired into the RACI matrix and sprint planning.
  • CI/CD infrastructure battle: Thirty-five commits across five services fighting OOM runners, a broken SonarQube token, a promising reusable workflow repo, and a Gitea 1.24 incompatibility that made us roll all of it back.
The last item dominated the week. Everything else was shipped in the gaps.

The Journey

Day 1–3: The OOM Cascade

It started Monday morning. The pipeline turned red. Not a failing test — the runner itself was dying. The error signature was always the same: the Gitea runner process exhausting memory mid-compilation, leaving a job summary that simply said “process killed.” We started with the obvious mitigations:
  • Disable parallelism. Set CARGO_BUILD_JOBS=1. The runner still died, just more slowly.
  • Skip compilation in clippy. Add --no-deps to clippy invocations. Helped a little. Not enough.
  • Replace actions/checkout with manual git clone. The stock checkout action was pulling more than we needed, and its token scoping was leaking into subsequent steps in ways we hadn’t fully audited. Manual clone gave us tighter control.
# Before: opaque checkout action
- uses: actions/checkout@v4

# After: explicit clone with scoped credentials
- name: Clone repository
  run: |
    git clone --depth=1 \
      "https://oauth2:${GITEA_TOKEN}@${GITEA_HOST}/${GITEA_REPO}.git" .
    git checkout ${GITHUB_SHA}
This also fixed a parallel issue: SonarQube scans were failing because the checkout step wasn’t receiving the SonarQube scan token. The manual clone made credential passing explicit and auditable. Two more mitigations rounded out the triage:
  • Hardcode CARGO_HOME and registry credentials. We moved the Cargo registry cache to a persistent volume path, reducing the per-job I/O overhead that was contributing to memory spikes.
  • Serialize test suites with Mutex-style locking. For integration tests sharing database state, we moved from parallel to explicit sequential execution via a test-level lock macro (sketched below). Less parallelism, more predictability.
By mid-week we had working CI again — but it was a tangle of workarounds, not a solution. For more context on building and hardening self-hosted Gitea runner infrastructure from scratch, see our earlier guide: The CI Pipeline That Runs in Your Cluster.
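The test-level lock itself is conceptually simple. Here is a minimal sketch of the underlying pattern, assuming a tokio test harness; the names and the example test are illustrative, and the real version wraps this in a macro so each test opts in with one line.
use std::sync::LazyLock;
use tokio::sync::Mutex;

// One process-wide lock shared by every test that touches the database.
// LazyLock keeps this a plain static without needing a const constructor.
static DB_TEST_LOCK: LazyLock<Mutex<()>> = LazyLock::new(|| Mutex::new(()));

#[tokio::test]
async fn onboarding_writes_then_reads_shared_state() {
    // Hold the guard for the whole test body: database-touching tests now
    // run one at a time even though cargo test schedules them in parallel.
    let _guard = DB_TEST_LOCK.lock().await;

    // ... exercise the shared database here ...
}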

Day 4: The YAML Heredoc That Broke the YAML Parser

In the middle of diagnosing the OOM issues, a pipeline step refused to parse entirely — a cryptic YAML syntax error in a step that hadn’t changed in two weeks. Tracking it down took three hours. The culprit was a multi-line shell command using a heredoc (<<EOF). The Gitea YAML parser interprets the << in a run: field as YAML syntax rather than as a shell heredoc, so the failure surfaces as “invalid YAML” rather than “your shell syntax is wrong.” The fix is counterintuitive: drop the heredoc entirely and write the file with a plain command instead:
# Breaks in Gitea YAML parser
- run: |
    cat <<EOF > config.toml
    [section]
    key = "value"
    EOF

# Works: write the file differently
- run: |
    printf '[section]\nkey = "value"\n' > config.toml
Three hours for a one-line fix. We added a comment in the CI file explaining exactly why the heredoc form is forbidden.

Day 5: The Reusable Workflow Experiment

By Thursday, CI was stable enough to ship features. Our architect proposed a structural fix: extract all the hard-won CI logic into a shared ci-workflows repository. One source of truth for Rust CI, Docker publishing, Node.js builds, and SonarQube scanning. We built four reusable workflow files:
  • rust-ci.yml: cargo fmt, clippy, test, cargo-audit (with OOM guards)
  • docker-publish.yml: Kaniko build → distroless image → push to our internal container registry
  • node-ci.yml: lint, test, type-check
  • sonar.yml: SonarQube scan with explicit token passing
Then we migrated all five services in a single afternoon. Per-service CI files shrank from 80–120 lines to under 20. It was beautiful.
# Per-service CI after migration (20 lines vs 100+)
jobs:
  rust-ci:
    uses: org/ci-workflows/.gitea/workflows/rust-ci.yml@main
    secrets:
      CARGO_REGISTRY_TOKEN: ${{ secrets.CARGO_REGISTRY_TOKEN }}
      SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
We wrote it up as a win in the daily standup notes. We went home.

Day 6: The Rollback

Friday morning, every service CI was broken in a new way. Not OOM — the runner simply refused to resolve the external workflow reference. The same error across all five services:
Error: Unable to fetch workflow file from org/ci-workflows
Root cause: Gitea 1.24 has a known incompatibility with reusable workflow uses: references to external repositories. The feature exists in the UI. The YAML is syntactically valid. The runner accepts the job definition. Then it silently fails to fetch the reference and exits with an opaque error. You can track this behavior in the Gitea issue tracker — search for reusable workflow cross-repository support. (Note: verify this behavior against your own instance before drawing conclusions — it may be version-specific.) This is not a configuration error. It’s a platform limitation in the version we’re running. We rolled back. All five services reverted to inline CI within two hours. The ci-workflows repo still exists — with a README that says “parked until Gitea upgrade.” Total lifespan: twenty-three hours.

What We Shipped Despite the Chaos

The Kubernetes Provisioner

While CI was on fire, we merged the KubernetesProvisioningBackend into the foundation layer — the infrastructure primitive that makes multi-tenant Kubernetes deployments programmable from Rust business logic.
  1. Tenant sign-up received: The platform’s onboarding saga receives a sign-up request and creates a pending provisioning record.
  2. ITSM approval gate: The request routes to the ITSM service for approval. No self-serve bypass — every tenant provisioning flows through a tracked approval gate.
  3. Subscription provisioned: On approval, the subscription record is created and product entitlements are assigned.
  4. Kubernetes backend invoked: The KubernetesProvisioningBackend applies the tenant Helm chart, creating the namespace, RBAC, and per-tenant resource manifests in Kubernetes.
  5. Tenant namespace live: The saga completes. The tenant has a live, isolated Kubernetes namespace. Eight E2E scenarios validate the full path.
The fact that this landed during peak OOM chaos is worth noting. We had no fast CI feedback loop while implementing it. Every validation was local. The merge was on trust and manual testing.
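To make the shape of the backend concrete, here is a minimal sketch of a Kubernetes-native provisioner that shells out to Helm. The trait, the type names, and the exact Helm invocation are illustrative assumptions, not the actual foundation-layer API.
use std::process::Command;

// Illustrative types; the real foundation layer defines its own.
pub struct TenantId(pub String);

#[derive(Debug)]
pub struct ProvisionError(pub String);

/// A provisioning backend turns an approved tenant into live infrastructure.
pub trait ProvisioningBackend {
    fn provision(&self, tenant: &TenantId) -> Result<(), ProvisionError>;
}

/// Hypothetical Kubernetes-native backend: applies a per-tenant Helm release
/// into a dedicated namespace.
pub struct KubernetesProvisioningBackend {
    pub chart_path: String,
}

impl ProvisioningBackend for KubernetesProvisioningBackend {
    fn provision(&self, tenant: &TenantId) -> Result<(), ProvisionError> {
        let namespace = format!("tenant-{}", tenant.0);
        let set_arg = format!("tenant.id={}", tenant.0);

        // helm upgrade --install is idempotent, so re-running this saga step
        // after a partial failure converges on the same namespace state.
        let status = Command::new("helm")
            .args([
                "upgrade",
                "--install",
                namespace.as_str(),
                self.chart_path.as_str(),
                "--namespace",
                namespace.as_str(),
                "--create-namespace",
                "--set",
                set_arg.as_str(),
            ])
            .status()
            .map_err(|e| ProvisionError(e.to_string()))?;

        if status.success() {
            Ok(())
        } else {
            Err(ProvisionError(format!("helm exited with {status}")))
        }
    }
}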

Capsule-Scoped Data Filter Middleware

The capsule-scoped middleware enforces tenant isolation between the API handler and the database query. Every query — regardless of handler code — is automatically scoped to the authenticated tenant’s capsule before hitting the database. We wrote adversarial tests for this. Tests that deliberately construct queries attempting to escape the capsule boundary. Tests that pass a different tenant ID in the query than in the auth context. Tests that try to enumerate cross-tenant data via sort key prefix scans.
#[tokio::test]
async fn adversarial_cross_tenant_escape_attempt() {
    let tenant_a = TenantId::new("tenant-a");
    let tenant_b = TenantId::new("tenant-b");

    // Authenticate as tenant-a
    let ctx = CapsuleContext::new(tenant_a.clone());

    // Attempt to query tenant-b's data through the middleware
    let filter = CapsuleDataFilter::new(ctx);
    let query = QueryBuilder::new()
        .partition_key(tenant_b.as_pk())  // Explicit cross-tenant key
        .build();

    let result = filter.apply(query).await;

    // Middleware must rewrite or reject — never pass through
    assert!(result.is_err() || result.unwrap().partition_key() == tenant_a.as_pk());
}
Every adversarial test passes. The boundary holds. We’re not declaring victory — we’re establishing a baseline to regress against.
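For intuition about what the middleware enforces, here is a minimal sketch of a capsule filter that rejects cross-tenant keys. The types, the TENANT# key prefix, and the reject-rather-than-rewrite policy are assumptions for illustration; the real middleware sits between the handler and the database client and may rewrite instead.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct TenantId(pub String);

pub struct CapsuleContext {
    pub tenant: TenantId,
}

/// A query already reduced to the one field that matters for isolation.
pub struct Query {
    pub partition_key: String,
}

#[derive(Debug)]
pub enum CapsuleViolation {
    CrossTenantKey { requested: String },
}

/// Core invariant: whatever the handler asked for, the query that reaches
/// the database is scoped to the authenticated tenant's partition.
pub fn apply_capsule_filter(
    ctx: &CapsuleContext,
    query: Query,
) -> Result<Query, CapsuleViolation> {
    let expected_pk = format!("TENANT#{}", ctx.tenant.0);
    if query.partition_key != expected_pk {
        // Reject loudly rather than rewriting silently, so adversarial
        // callers show up in logs instead of being quietly corrected.
        return Err(CapsuleViolation::CrossTenantKey {
            requested: query.partition_key,
        });
    }
    Ok(query)
}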

The Feed Materializer

The DynamoDB Streams → EventBridge Pipes fan-out materializer completes the write path for the platform’s global activity feed. When any domain event lands in DynamoDB, the change stream is captured, EventBridge Pipes routes it to the appropriate downstream consumers, and the read models are updated.
We designed this architecture three weeks ago. It took this long because the write path — events landing reliably in DynamoDB — had to be solid before we wired up the read path. Events are facts. The fan-out is the consequence.
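As a sketch of the materialization step itself, here is what the routing decision can look like in Rust. The event types, field names, and feed-entry shape are invented for the example; in the real pipeline the fan-out targets are configured in EventBridge Pipes rather than hand-rolled in code.
/// Illustrative change record as it arrives off the stream; the field
/// names are assumptions, not the actual event schema.
#[derive(Debug, Clone)]
pub struct DomainEventRecord {
    pub tenant_id: String,
    pub event_type: String,
    pub body: String,
}

/// One entry in the global activity feed read model.
#[derive(Debug, Clone)]
pub struct FeedEntry {
    pub tenant_id: String,
    pub summary: String,
}

/// The materializer's core decision: which feed entries does this event
/// fan out to?
pub fn materialize(record: &DomainEventRecord) -> Vec<FeedEntry> {
    match record.event_type.as_str() {
        // Tenant lifecycle events land in that tenant's own feed.
        "tenant.provisioned" | "subscription.created" => vec![FeedEntry {
            tenant_id: record.tenant_id.clone(),
            summary: format!("{}: {}", record.event_type, record.body),
        }],
        // Unknown event types are skipped rather than failing the batch,
        // so one unexpected record cannot stall the stream.
        _ => Vec::new(),
    }
}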

The Meta-Story: Fighting a Fire While Building the Fire Station

Here’s what’s slightly absurd about this week: while we were drowning in CI firefighting, we built an AI agent to own exactly that class of problem. The DevOps Engineer digital employee was created Thursday — mid-firefight — with a full CI/infrastructure runbook covering OOM failures, container registry management, Kubernetes deployments, SonarQube configuration, and runner triage. It was wired into the RACI matrix alongside existing digital employees. The MCP server expanded with 28 new tools, readOnlyHint flags to prevent accidental mutations during investigation, and an autoresearch upgrade to Gemma3 27B with mandatory human escalation on zero findings. We were encoding the institutional knowledge for handling CI failures into an agent while we were actively experiencing those failures. It felt less absurd by Friday. Having a structured runbook — even one written and read by the same person — cut the second OOM incident response from three hours to thirty minutes. This mirrors what we wrote about in The Runbook Is a Failure Ledger: organizing institutional knowledge has value before you hand it to an agent. Turns out it’s also valuable just for you.

What We Learned

  1. CI OOM is a production incident, not a flaky test. We treated the OOM runners as an annoyance for two weeks before it forced a stop. The cost was a full week of degraded developer feedback loops across five services. When your build environment is failing, stop and fix it — it compounds.
  2. Validate platform compatibility before org-wide rollout. The reusable CI workflows idea was correct. The execution — adopt everywhere in one day without verifying platform support — was overconfident. The right sequence: create → test in one service → confirm platform compatibility → roll out. We skipped step three.
  3. Adversarial tests are a different discipline than coverage tests. The capsule-scoped middleware has 100% test coverage. That’s not what we’re proud of. We’re proud that the suite includes tests specifically designed to break the boundary. Coverage tells you code paths were executed. Adversarial tests tell you the security property held under attack.
  4. Build environment debt taxes every commit. Every feature shipped this week cost extra validation overhead. No fast CI loop means longer manual test cycles, higher merge anxiety, slower iteration. Technical debt in the build environment is not invisible — it compounds across every team member on every push.
  5. Runbooks pay dividends immediately. The DevOps runbook — even written and read by the same person — cut the second OOM incident response from three hours to thirty minutes. Organizing institutional knowledge has value before you hand it to an agent.

What Didn’t Work

The Reusable Workflow Repo

This deserves a full post-mortem. The idea was right. Reusable CI workflows are a genuine DRY win. Consistent tooling across five services reduces the surface area for drift. One place to update OOM guard settings means one fix reaches every service simultaneously. We executed it well: four structured workflow files, clean uses: invocations, same-day adoption. By Thursday afternoon, every service CI was green and per-service pipeline files had shrunk by 80%. Then Friday morning happened. The failure mode is frustrating because it’s invisible until it breaks. The YAML is valid. The workflow references resolve in the Gitea UI. The runner accepts the job. And then it silently fails to fetch the external workflow file — no pre-flight check would have caught this without actually running the pipeline against the real platform version. The lesson is not “don’t build shared CI infrastructure.” It’s: test reusable workflow references on a throwaway pipeline before migrating production services. One hour of validation would have saved a morning of rollback. The ci-workflows repo is parked. When we upgrade Gitea, it’s the first thing we’ll re-enable.

How AI Helped (and Where It Struggled)

Where AI excelled: Describing the onboarding saga semantics — “a sign-up that must survive ITSM approval, subscription provisioning, and K8s provisioning, each of which can fail independently” — and getting back eight well-structured E2E test scenarios saved several hours. The test structure required only minor adjustment. The DevOps Engineer’s runbook was co-authored with AI. We described the CI problems we’d encountered — OOM on Rust compilation, SonarQube token scoping, Kaniko build failures, distroless ENTRYPOINT issues — and the AI organized them into a structured runbook with diagnostic steps and recovery procedures. That runbook is now the starting context for the DevOps digital employee.

Where AI struggled: The Gitea 1.24 reusable workflow bug was outside the AI’s knowledge. We described the failure mode, got reasonable suggestions — file path casing, secret scoping, YAML indentation — but none were the root cause. The answer required reading the Gitea 1.24 release notes and issue tracker. Live knowledge problem; training data can’t solve it. The OOM diagnosis also required understanding how Rust’s compilation model interacts with memory-constrained environments. AI could suggest CARGO_BUILD_JOBS=1 immediately. It couldn’t tell us that pre-warming the Cargo registry on a persistent volume was the real fix — that came from profiling the runner’s I/O during a failed job.

The pattern from previous weeks holds: AI is strong on patterns and boilerplate, weak on novel platform-specific failure modes. The DevOps digital employee will face the same limitation. Its value is organizing institutional knowledge the team has already acquired — not independently diagnosing novel infrastructure failures. For a deeper look at where AI assistance breaks down, see Episode 10: The Leadership Evolution.

What’s Next

Week 12: Gitea upgrade and ci-workflows re-enablement. The shared workflow infrastructure deserves to actually work. Beyond CI: the global activity feed now has a complete write path. Week 12 will wire up the read side — feed query APIs and real-time update propagation. The onboarding saga shipped with ITSM approval as a required gate; we’ll be hardening the ITSM service’s HTTP layer and integrating its first Docker image into Kubernetes deployment manifests. One more thing we’re committing to: stop treating “the build environment is broken” as something you work around. The DevOps digital employee now owns CI health as a first-class operational concern. We’ll see if structured ownership makes a measurable difference. The engine is running. The toolchain, finally, is next.
68 meaningful commits. 5 services touched. 35 CI commits just to keep the lights on. 1 reusable workflow repo, created and rolled back in 23 hours. This is what building looks like when the infrastructure fights back.
All content represents personal learning from personal projects. Code examples are sanitized and generalized. No proprietary information is shared. Opinions are my own and do not reflect my employer’s views.