Executive Summary

A self-hosted continuous integration pipeline running entirely within a Kubernetes cluster eliminates dependency on external CI services, removes exposure to public registry rate limits, and provides complete control over the execution environment. This paper documents the architecture of such a pipeline built on Gitea Actions, act_runner, Docker-in-Docker, Harbor, and ArgoCD running inside k3s — achieving an 11-second push-to-result latency with no public internet dependency in the execution path. The operational investment required to reach a working state is substantially higher than self-hosted CI guides suggest, primarily due to a class of failure mode in which correct individual components produce incorrect behavior in combination. Six such failure modes are documented here with root cause analysis and resolutions. Organizations evaluating self-hosted CI should treat these failure modes as known costs to be planned for, not as edge cases to be discovered.

Key Findings

  • Self-hosted CI eliminates two categories of external dependency: public internet availability in the execution path and per-minute billing from managed CI providers. Both eliminations represent operational reliability improvements, not merely cost reductions.
  • Harbor proxy cache eliminates Docker Hub rate limit exposure entirely once the cache is warm. The configuration investment — approximately one afternoon — pays dividends on the first busy CI day.
  • Port allocation is cluster-wide in k3s, not per-node. klipper-lb installs iptables DNAT rules on every node for LoadBalancer services, affecting pod-to-pod traffic on the same node. This produces silent routing failures that present as network issues and require an understanding of klipper-lb mechanics to diagnose.
  • Missing registry credentials produce misleading timeout errors, not authentication errors. A docker push that hangs for exactly six seconds before failing with a network error is almost certainly a credentials problem, not a network problem.
  • Service-to-service communication inside k3s should always use internal cluster DNS. External ingress URLs for internal traffic introduce TLS complexity and DNAT overhead that produce intermittent failures and add unnecessary operational risk.
  • The DaemonSet deployment pattern for CI runners is operationally inferior to a single Deployment for moderate task volumes, due to resource contention, rolling update complexity, and runner registration overhead.

1. Introduction

Self-hosted CI guides consistently present the implementation as straightforward: install Gitea, deploy a runner, write workflow YAML, and the pipeline is operational. This description is accurate at the component level. The gap between component-level accuracy and system-level functionality is approximately two weeks of debugging failures that arise from incorrect assumptions at the intersection of components. These intersection failures are rarely documented because the engineers who resolve them do not write them down. The failure knowledge transfers informally or is lost. The next team building a similar pipeline discovers the same failures independently.

This paper documents the architecture of a fully self-hosted CI pipeline, the design decisions made during construction, and, most importantly, the six failure modes that determined the project timeline. Each failure mode is presented with the observable symptom, root cause analysis, and resolution.

2. System Architecture

2.1 Component Stack

The pipeline comprises six components:
  • Gitea — self-hosted git server; primary remote for all infrastructure code
  • Gitea Actions — GitHub Actions-compatible workflow runner, built into Gitea
  • act_runner — the agent that executes workflows, deployed as a Kubernetes Deployment
  • Docker-in-Docker (DinD) — sidecar container that enables CI jobs to build and push Docker images
  • Harbor — private container registry with proxy cache for Docker Hub and GHCR
  • ArgoCD — GitOps controller that automatically synchronizes Kubernetes manifests on merge to main

2.2 Pipeline Flow

  1. PR opened on Gitea
  2. Gitea webhook → act_runner pod
  3. CI job pulls the ci-k8s image from the Harbor proxy cache (no Docker Hub call)
  4. yamllint + kubectl dry-run on changed manifests
  5. Pass/fail status posted back to the PR
  6. Merge → ArgoCD webhook → manifests applied to cluster

From push to result: approximately 11 seconds. No external CI service in the execution path. The “air-gapped” property of this architecture is specific: after Harbor’s proxy cache is warm, CI job execution does not require public internet access. Image pulls, dependency fetches, and tool invocations all resolve from Harbor’s cache. A public internet outage does not interrupt in-progress or queued CI jobs.

2.3 act_runner Deployment

The runner is deployed as a standard Deployment in the gitea-runner namespace. A single replica is sufficient for moderate task volumes and eliminates the resource contention and rolling-update complexity of a DaemonSet deployment. The following manifest represents the production configuration. Two properties require particular attention: the DinD sidecar runs in privileged mode, which is required for Docker-in-Docker operation; and the Harbor credentials are mounted directly as a Docker configuration file, not injected as environment variables.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: act-runner
  namespace: gitea-runner
spec:
  replicas: 1
  selector:
    matchLabels:
      app: act-runner
  template:
    metadata:
      labels:
        app: act-runner
    spec:
      containers:
        - name: runner
          image: gitea/act_runner:latest
          env:
            - name: GITEA_INSTANCE_URL
              # Internal cluster DNS, not the external ingress URL (see Section 4.3)
              value: "http://gitea.gitea.svc.cluster.local:3000"
            - name: GITEA_RUNNER_REGISTRATION_TOKEN
              valueFrom:
                secretKeyRef:
                  name: runner-secret
                  key: token
            - name: DOCKER_HOST
              # The runner's docker client reaches the DinD sidecar over plain TCP;
              # port 2375 because TLS is disabled below (see Section 4.2)
              value: "tcp://localhost:2375"
          volumeMounts:
            - name: harbor-credentials
              # subPath replaces only config.json without shadowing /root/.docker
              mountPath: /root/.docker/config.json
              subPath: config.json
        - name: dind
          image: docker:dind
          securityContext:
            # Required for Docker-in-Docker operation
            privileged: true
          env:
            - name: DOCKER_TLS_CERTDIR
              # Empty value disables TLS; the daemon listens on plain TCP 2375
              value: ""
      volumes:
        - name: harbor-credentials
          secret:
            secretName: harbor-credentials

2.4 Harbor Proxy Cache

Harbor’s proxy cache feature configures upstream registries — Docker Hub, GHCR, quay.io — and serves pulls through Harbor transparently. CI workflow images reference registry.example.com/dockerhub-proxy/library/alpine:3.19 instead of docker.io/library/alpine:3.19. Harbor fetches from the upstream registry on the first pull and serves from cache on all subsequent pulls. Docker Hub’s anonymous pull rate limits are a practical constraint for any active CI pipeline. Harbor proxy cache eliminates this exposure entirely. See the Harbor proxy cache documentation for configuration details.
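As a minimal illustration, a hedged sketch of a workflow job that pulls its container image through the proxy project rather than from Docker Hub directly (the job name, runner label, and step are illustrative; the registry hostname and dockerhub-proxy project follow the example above):

jobs:
  lint:
    runs-on: ubuntu-latest
    container:
      # Served from Harbor's cache after the first pull; no docker.io call
      image: registry.example.com/dockerhub-proxy/library/alpine:3.19
    steps:
      - name: Confirm the image resolved
        run: cat /etc/alpine-release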

2.5 CI Image Specification

Three lightweight Alpine-based images are built by a dedicated workflow and pushed to Harbor:
  • ci-base: curl, git, jq, yq, python3, docker-cli, openssh-client — 45MB
  • ci-k8s: ci-base + kubectl v1.32.3 + helm v3.17.1 — 120MB
  • ci-ansible: ci-base + ansible + ansible-lint — 180MB
Job startup time from Harbor cache is 2–3 seconds, compared to 20–40 seconds pulling from Docker Hub on a cold cache.
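The dedicated build workflow is not reproduced in this paper; the following is a hedged sketch of one plausible shape, assuming a docker build and push through the DinD sidecar (file paths, tags, and the Harbor project name ci are illustrative):

name: build-ci-images
on:
  push:
    paths:
      - "images/**"
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
        # Token-based clone over internal DNS, mirroring Section 3; URL form is an assumption
        run: git clone http://oauth2:${{ secrets.CLONE_TOKEN }}@gitea.gitea.svc.cluster.local:3000/org/repo.git .
      - name: Build and push ci-base
        # The push authenticates via the harbor-credentials mount (Section 4.2)
        run: |
          docker build -t registry.example.com/ci/ci-base:latest images/ci-base
          docker push registry.example.com/ci/ci-base:latest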

3. Validated Workflow Specifications

Two workflows execute on pull request events. validate-k8s.yml is triggered when pull requests modify files matching k8s/**:
  1. Clone the repository using the CLONE_TOKEN secret
  2. Run yamllint on all modified YAML files
  3. Run kubectl apply --dry-run=server against the live cluster API
The server-side dry run validates against the cluster’s live state, catching errors that static linting cannot detect: a Service referencing a non-existent Deployment, a PersistentVolumeClaim specifying a non-existent StorageClass, an Ingress referencing an incorrect backend. These are runtime failures that manifest only when the manifest is processed against actual cluster resources. (A sketch of validate-k8s.yml appears at the end of this section.)

validate-ansible.yml is triggered when pull requests modify files matching ansible/**:
  1. Clone the repository
  2. Run ansible-lint on all playbooks (soft fail — execution continues on lint warnings)
  3. Run ansible-playbook --syntax-check on all playbooks (hard fail — blocks merge on syntax errors)
The distinction between soft-fail lint and hard-fail syntax check reflects a deliberate policy: ansible-lint flags valid patterns as warnings depending on configuration; syntax errors are always real failures. Merge gates enforce only the latter.
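A hedged reconstruction of validate-k8s.yml based on the three steps above (the runner label, the clone URL form, and the file layout are assumptions; the real workflow lints only modified files, which this sketch simplifies to the whole k8s/ tree):

name: validate-k8s
on:
  pull_request:
    paths:
      - "k8s/**"
jobs:
  validate:
    runs-on: ubuntu-latest
    container:
      # The ci-k8s image from Section 2.5, served from Harbor
      image: registry.example.com/ci/ci-k8s:latest
    steps:
      - name: Clone via CLONE_TOKEN
        run: git clone http://oauth2:${{ secrets.CLONE_TOKEN }}@gitea.gitea.svc.cluster.local:3000/org/repo.git .
      - name: Lint manifests
        run: yamllint k8s/
      - name: Server-side dry run against the live cluster
        run: kubectl apply --dry-run=server -R -f k8s/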

4. Documented Failure Modes

The following six failure modes represent the primary sources of unplanned time during the initial implementation. Each is documented with the observable symptom, root cause, resolution, and generalized lesson.

4.1 klipper-lb DNAT Intercepts Port 22 and 443

Observable symptom: SSH clone from within the cluster hangs indefinitely. No error message. No timeout indication. The runner attempts to clone the repository and does not return.

Root cause: k3s ships with klipper-lb, which installs iptables DNAT rules on every node for LoadBalancer services. These rules apply to all traffic arriving at the node, including traffic originating from pods on the same node. If Gitea SSH listens on port 22 and any LoadBalancer service claims port 22, klipper-lb intercepts SSH connections from within the cluster and redirects them to the LoadBalancer backend. The runner cannot clone the repository via SSH on port 22.

Resolution: Gitea SSH configured on port 222. The Gitea deployment sets SSH_DOMAIN: 192.168.x.241 and SSH_PORT: 222. Workflows clone using ssh://git@192.168.x.241:222/org/repo.git. The same conflict affects port 443: all external traffic on 443 routes to the nginx ingress via MetalLB. Internal services use HTTP on non-standard ports and rely on the ingress for TLS termination.
Generalized lesson: In k3s, every node is a load balancer. klipper-lb DNAT rules apply cluster-wide, not per node. Port 22 and port 443 are effectively reserved for LoadBalancer services if any service claims them. Plan service port assignments before deployment; conflicts present as silent routing failures that are difficult to diagnose without knowledge of klipper-lb mechanics.
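For concreteness, the corresponding Gitea settings, shown here as helm-chart-style values (the exact key layout depends on how Gitea is deployed, so treat this as a sketch; SSH_DOMAIN and SSH_PORT are the values given above):

gitea:
  config:
    server:
      SSH_DOMAIN: 192.168.x.241
      SSH_PORT: 222          # Port advertised in clone URLs
      SSH_LISTEN_PORT: 222   # Port the SSH server actually binds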

4.2 DinD + Harbor Credentials = 6-Second Timeout

Observable symptom: docker push hangs for exactly six seconds, then fails with a network error. The error presents as a DNS problem.

Root cause: The DinD sidecar requires Harbor credentials to push images. Without credentials, the TCP connection attempt runs until the six-second connection timeout and fails with a generic network error. The error message does not indicate missing credentials; it directs investigation to the network layer.

Resolution: Create a harbor-credentials secret containing a Docker configuration JSON:
{
  "auths": {
    "registry.example.com": {
      "auth": "base64(user:password)"
    }
  }
}
Mount at /root/.docker/config.json in the runner container — not the DinD sidecar. The runner’s docker-cli connects to DinD via TCP and uses the runner container’s credentials. The mount uses subPath: config.json to replace only the file without affecting the remainder of the directory.
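A minimal sketch of the corresponding Secret manifest (the namespace matches the runner Deployment in Section 2.3; the auth value remains the base64 placeholder from above):

apiVersion: v1
kind: Secret
metadata:
  name: harbor-credentials
  namespace: gitea-runner
type: Opaque
stringData:
  config.json: |
    {
      "auths": {
        "registry.example.com": {
          "auth": "base64(user:password)"
        }
      }
    }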
Generalized lesson: If a docker push step hangs for exactly six seconds before failing, verify the credentials mount before investigating networking. The six-second duration is the TCP connection timeout; it is diagnostic of a connection refused by an unauthenticated endpoint, not a network routing failure.

4.3 ArgoCD repoURL Must Use Internal Cluster DNS

Observable symptom: ArgoCD repository synchronization fails. Depending on the diagnostic layer examined first, the failure presents as certificate errors, intermittent connectivity, or elevated latency.

Root cause: Pointing ArgoCD at Gitea’s external ingress URL introduces two problems: the TLS certificate is self-signed, requiring CA configuration in ArgoCD; and all traffic routes through the ingress load balancer, adding a DNAT traversal, TLS termination, and re-entry into the cluster, an unnecessary round trip that introduces additional failure points.

Resolution: Use the internal cluster DNS service name:
spec:
  source:
    repoURL: http://gitea.gitea.svc.cluster.local:3000/homelab/infrastructure.git
Plain HTTP. No TLS. No DNAT. No ingress traversal. Direct pod-to-pod communication via CoreDNS resolution. ArgoCD connects to Gitea in under 100ms.

Generalized lesson: Service-to-service communication inside k3s should always use <service>.<namespace>.svc.cluster.local. External ingress URLs are for traffic originating outside the cluster; internal services that use them add unnecessary complexity.
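In context, a minimal ArgoCD Application sketch using the internal URL (the application name, path, destination, and sync policy are illustrative; the repoURL is the one above):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: infrastructure
  namespace: argocd
spec:
  project: default
  source:
    # Internal cluster DNS: no TLS, no DNAT, no ingress traversal
    repoURL: http://gitea.gitea.svc.cluster.local:3000/homelab/infrastructure.git
    targetRevision: main
    path: k8s
  destination:
    server: https://kubernetes.default.svc
  syncPolicy:
    automated: {}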

4.4 yamllint Configuration Rule Names Are Non-Intuitive

Observable symptom: yamllint rejects valid Ansible YAML. Error messages reference rule names that do not appear in the documentation consulted.

Root cause: yamllint’s rule names do not correspond to intuitive English equivalents. The following configuration illustrates the non-obvious names:
rules:
  document-start: disable          # NOT "present: false"
  indentation:
    indent-sequences: whatever     # NOT "true" or "false"
  truthy:
    allowed-values: [true, false, yes, no]  # Required for Ansible booleans
  commas:
    max-spaces-after: -1           # -1 means "allow any" — enables alignment
The truthy rule permits only true and false by default. Ansible uses yes and no extensively. Without the explicit allowed-values override, every Ansible playbook fails yamllint validation. The error message is unambiguous once the rule name is known; before that, it presents as a parsing failure.
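For example, a task like the following is valid Ansible but fails the default truthy rule until yes and no are added to allowed-values (the task itself is illustrative):

- name: Enable nginx at boot
  ansible.builtin.service:
    name: nginx
    enabled: yes   # rejected by yamllint's default truthy rule; permitted by the override above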
Generalized lesson: Copy yamllint configurations from verified working projects rather than constructing them from documentation. When an unexpected rule failure occurs, read the yamllint source rule definitions directly rather than relying on third-party documentation.

4.5 ExpressVPN Blocks Go TCP Sockets

Observable symptom: kubectl hangs with no output, no error message, and no timeout. The cluster is confirmed healthy by other means. Only kubectl and other Go binaries are affected.

Root cause: Go’s network stack uses a TCP connection approach that ExpressVPN’s network driver intercepts. This affects any Go binary making outbound TCP connections, including kubectl, helm, and related tooling. Diagnosing the VPN dependency requires recognizing that the hang is specific to Go binaries, which is not obvious when the initial investigation focuses on cluster health.

Resolution: SSH tunnel. The following session initialization script establishes the tunnel on the development machine:
# Use a dedicated kubeconfig that points at the tunnel endpoint
export KUBECONFIG="$HOME/.kube/cluster-config"
# Forward local port 16443 to the k3s API server (6443) on the control node.
# -N: no remote command; -f: background after authentication.
# Node SSH listens on 2223, not 22 (see Section 4.1).
ssh -i ~/.ssh/homelab \
  -p 2223 \
  -L 16443:127.0.0.1:6443 \
  -N -f ubuntu@192.168.x.201
This forwards localhost:16443 to k3s-control:6443 (the k3s API server). The kubeconfig specifies https://localhost:16443 as the server URL. kubectl connects to localhost; the SSH tunnel carries the traffic; ExpressVPN observes only an SSH connection to a known host. The control node’s SSH operates on port 2223 — see Section 4.1.
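The matching kubeconfig entry, sketched below (cluster, context, and user names are illustrative; credential fields are elided, and certificate verification details depend on how the k3s CA is distributed):

apiVersion: v1
kind: Config
current-context: homelab
clusters:
  - name: homelab
    cluster:
      # All API traffic enters the SSH tunnel established above
      server: https://localhost:16443
contexts:
  - name: homelab
    context:
      cluster: homelab
      user: homelab-admin
users:
  - name: homelab-admin
    user:
      client-certificate-data: "<elided>"
      client-key-data: "<elided>"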
Generalized lesson: If kubectl hangs without error, suspect VPN interference before investigating cluster health. This failure mode is specific to Go binaries and will not manifest for tools that use different network stack implementations.

4.6 Gitea Tokens Are PBKDF2, Not SHA256

Observable symptom: Authentication fails with a token that appears correct. Regenerating the token does not resolve the failure.

Root cause: Gitea stores API tokens as PBKDF2-SHA256 hashes with a separate salt, not as plain SHA256 hashes. The token value generated in the Gitea UI is the raw token; the stored representation is a PBKDF2 hash. This distinction is irrelevant for the common case (storing the raw token in a Kubernetes Secret and passing it to services as a credential) but surfaces in automation scenarios that compare tokens against stored hash values, specifically cluster initialization scripts that pre-generate tokens.

Generalized lesson: When Gitea authentication fails and the token value appears correct, verify the hash format before investigating the token value itself.

5. Implementation Constraints and Failure Analysis

5.1 DaemonSet vs. Deployment for Runner

An initial design used a DaemonSet to deploy the runner, with one runner pod per node. The intent was improved job parallelism. In practice, the DaemonSet approach introduced resource contention between concurrent runners on the same node, complicated rolling updates — updating pods on nodes that might be executing jobs — and required separate runner registration tokens per pod. A single Deployment with one replica is simpler, sufficient for the task volume in the deployment environment, and easier to update. The DaemonSet approach adds complexity that is not justified unless job concurrency requirements exceed what a single runner can provide.

5.2 External URL for ArgoCD

An initial configuration pointed ArgoCD at Gitea’s external ingress URL. Half a day was invested configuring ArgoCD to trust the self-signed TLS certificate before the internal DNS approach was identified. The internal URL requires no TLS configuration, has lower latency, and eliminates an ingress traversal from every repository synchronization operation.

5.3 Direct Docker Hub Pulls in CI

Before Harbor was deployed, CI jobs pulled images directly from Docker Hub. Rate limits were reached by midday on busy CI days. Harbor proxy cache, configured in approximately one afternoon, eliminated the constraint entirely.

5.4 Comparative Analysis: Deployment Approaches

Configuration Dimension       | Recommended Approach | Alternative Considered | Reason for Decision
Runner deployment type        | Single Deployment    | DaemonSet              | Simpler updates, no resource contention, sufficient concurrency
ArgoCD repository URL         | Internal cluster DNS | External ingress URL   | No TLS configuration, lower latency, fewer failure points
Registry source for CI images | Harbor proxy cache   | Direct Docker Hub      | Rate limit immunity, lower latency from local cache
Gitea SSH port                | Non-standard (222)   | Standard (22)          | klipper-lb DNAT conflict on port 22
kubectl connectivity with VPN | SSH tunnel           | Direct API access      | Go TCP interception by VPN driver

6. Recommendations

  1. Treat klipper-lb port allocation as a first-class architectural constraint in any k3s deployment. Document the port allocation plan for LoadBalancer services before deploying any service that requires a specific port. Port 22 and port 443 in particular should be explicitly assigned or explicitly avoided.
  2. Deploy Harbor proxy cache as part of the initial CI infrastructure, not as a later optimization. Docker Hub rate limits will constrain CI activity before they are expected. The configuration investment is small relative to the operational disruption of hitting rate limits on an active development day.
  3. Use internal cluster DNS for all service-to-service communication. Establish this as a deployment standard, enforced by code review if necessary. External ingress URLs for internal traffic are a consistent source of TLS configuration complexity and DNAT-related failures.
  4. Verify the credentials mount as the first diagnostic step for any docker push failure. The six-second TCP timeout pattern is diagnostic of missing or incorrect credentials; it is not a network routing failure. This diagnostic heuristic should be documented in the team’s operational runbook.
  5. Copy yamllint configurations from verified working projects. The yamllint rule naming convention is not self-documenting. Building configurations from the official documentation results in non-obvious failures; building from working examples does not.
  6. Add SSH tunnel setup to the standard session initialization script for any development environment that uses a VPN. VPN interference with Go TCP sockets is a known failure mode; encoding the tunnel setup as a standard initialization step eliminates it as a recurring diagnostic distraction.

7. Conclusion

A self-hosted CI pipeline running inside Kubernetes provides genuine operational advantages: elimination of external service dependencies, immunity to public registry rate limits, and complete control over the execution environment. The implementation complexity is higher than commonly documented, primarily because the most consequential failure modes arise from interactions between correctly implemented components rather than from individual component misconfiguration.

The six failure modes documented in this paper are finite. Each failure mode, once encountered and resolved, does not recur. The operational characteristics of the resulting pipeline (11-second push-to-result latency, no external dependencies in the execution path, complete environment control) represent a stable foundation for infrastructure automation.

As self-hosted Kubernetes deployments proliferate beyond cloud-native organizations into homelab, research, and enterprise on-premises environments, the operational knowledge required to build reliable internal CI infrastructure will become baseline practice rather than specialized expertise. The failure modes documented here will be encountered by any team following this architecture; documenting them in advance reduces the implementation cost from weeks to days.

The operational discipline required to maintain a self-hosted pipeline, including the incident documentation practice described in The Runbook Is a Failure Ledger, compounds over time. Each documented failure mode reduces the diagnostic cost of future incidents and provides context for autonomous agents operating within the same infrastructure.
All content represents personal learning from personal and side projects. Infrastructure details are generalized. No proprietary information is shared. Opinions are my own and do not reflect my employer’s views.