Executive Summary
A routine configuration update applied to a self-hosted workflow automation service running on Kubernetes resulted in the complete loss of PVC-backed application data. The incident was not caused by a tool defect or infrastructure failure; it was caused by applying a stateless change management discipline to a stateful workload. This analysis examines the failure sequence, identifies the root cause as a category error in operational mental model, and presents the remediation architecture and change protocol adopted in response. The central finding is actionable and broadly applicable: organizations operating PVC-backed Kubernetes workloads must establish a separate change management discipline for those services, distinct from the procedures applied to stateless deployments. Without that separation, data loss incidents of this type are a matter of when, not whether.
Key Findings
- PVC-backed services exhibit fundamentally different failure modes than stateless workloads. A deployment operation that is entirely safe for a stateless service can produce irreversible data loss when applied to a stateful one. Standard Kubernetes tooling does not differentiate between the two cases.
- The “cattle not pets” operational model, while appropriate for stateless workloads, is actively hazardous when applied to services with persistent volume claims. The mental model failure precedes and enables the technical failure.
- The coupling between a Kubernetes Deployment and its associated PVC is looser than it appears. Deployment manifests describe compute configuration, not data lifecycle. Volume mount behavior during rollout transitions is not surfaced as a risk in standard deployment workflows.
- Version-controlling application state definitions — such as workflow JSON, configuration exports, or schema migrations — transforms a multi-hour manual recovery into a sub-thirty-minute automated restore. The incident demonstrated that treating state definitions as source artifacts is a first-order reliability concern, not an optimization.
- Automated PVC backup infrastructure (CronJob-based snapshots) provides a recoverable baseline but is insufficient without a pre-change manual backup discipline. Scheduled backups reduce the recovery window; they do not eliminate point-in-time risk for changes applied between backup intervals.
- Explicit pre-change backup, change classification, rollback planning, and post-change volume verification — applied as a sequenced protocol — prevent the conditions that produced this incident. No new tooling is required; discipline and procedure are the control.
1. Introduction
Kubernetes has become the dominant orchestration platform for containerized workloads. Its operational model — ephemeral pods, declarative configuration, automated scheduling — is well-suited to stateless services, where no individual pod carries state that cannot be reconstructed or discarded. A significant and growing category of Kubernetes deployments, however, involves stateful workloads: databases, workflow engines, content management systems, monitoring stacks, and other services that persist application state to Persistent Volume Claims (PVCs). These services occupy a fundamentally different operational category than their stateless counterparts, yet the tooling used to manage them — kubectl, Deployment manifests, rollout strategies — is identical in appearance and syntax to the tooling used for stateless services.
This surface-level equivalence is a latent risk. It enables operators to apply stateless change management procedures to stateful workloads without any warning, friction, or tooling-level guard against the resulting failure modes.
This analysis documents a data loss incident that arose from precisely this condition: a PVC-backed workflow automation service managed as if it were a stateless application. The incident resulted in total loss of application data — workflow definitions, credentials, execution history — stored on the PVC. The data was not recoverable from backup because no backup infrastructure existed at the time.
The document proceeds as follows: Section 2 provides incident analysis; Section 3 identifies root cause; Section 4 describes the remediation architecture implemented; Section 5 defines the Stateful Service Change Protocol; Section 6 offers recommendations for organizations operating similar environments.
2. Incident Analysis
2.1 Service Architecture
The affected service was n8n, an open-source workflow automation platform (n8n.io). n8n runs as a Node.js process and persists all application state — workflow definitions, credential mappings, execution history — to a database. In the configuration under analysis, this database was a SQLite file stored on a Kubernetes PVC. The deployment architecture followed a pattern common in self-hosted Kubernetes environments (sketched in the manifests after this list):
- A Kubernetes Deployment managed the compute layer (the n8n container process)
- A PersistentVolumeClaim provided the storage layer (the SQLite database file)
- The Deployment referenced the PVC through a volume mount
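A minimal manifest sketch of this pattern follows. The resource names, image tag, storage size, and mount path are illustrative placeholders, not values taken from the affected environment:

```yaml
# PersistentVolumeClaim: the storage layer holding the SQLite database file.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: n8n-data                     # illustrative name
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
---
# Deployment: the compute layer. It references the PVC by name but says
# nothing about the lifecycle of the data on that volume.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: n8n
spec:
  replicas: 1
  selector:
    matchLabels:
      app: n8n
  template:
    metadata:
      labels:
        app: n8n
    spec:
      containers:
        - name: n8n
          image: n8nio/n8n:latest    # illustrative tag
          volumeMounts:
            - name: data
              mountPath: /home/node/.n8n   # application data directory
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: n8n-data
```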
This compute/storage separation is architecturally sound. The failure was not in the design of the system but in the operational procedures applied during changes. A well-designed architecture does not protect against a procedure gap.
2.2 The Change Event
A minor configuration change was applied to the n8n Deployment — specifically, an environment variable update. This category of change is routine for stateless services: update the manifest, apply it, observe the rollout, confirm the new pod is healthy. The operator applied the change using standard Kubernetes tooling (kubectl apply). The Deployment’s rollout strategy was configured as Recreate, which terminates the existing pod before starting the replacement. This strategy is appropriate for services where two simultaneous instances would cause conflicts — a reasonable choice for a single-instance SQLite-backed application.
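In manifest terms, the change touched only the environment block of the Deployment. The excerpt below uses an illustrative variable name rather than the actual setting that was updated:

```yaml
# Excerpt of the Deployment spec (illustrative values).
spec:
  strategy:
    type: Recreate                   # terminate the existing pod before starting the replacement
  template:
    spec:
      containers:
        - name: n8n
          image: n8nio/n8n:latest
          env:
            - name: GENERIC_TIMEZONE # the routine change: a single environment variable
              value: "Europe/Berlin"
```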
The rollout proceeded as follows:
- The existing n8n pod was terminated.
- A new pod was scheduled and started.
- The new pod mounted the PVC.
- The new pod’s startup sequence initialized the application against what it encountered on the volume mount path.
2.3 Discovery
The data loss was discovered during a post-change verification check. The n8n dashboard loaded successfully and presented an empty workflow list. The initial interpretation was a rendering or API error. Direct inspection of the PVC confirmed the data was absent: the directory structure was present; the SQLite database file contents were not in the expected state.
2.4 Recovery
No automated backup infrastructure existed at the time of the incident. Recovery required manual reconstruction of all workflow definitions from memory and documentation. This process required multiple hours and was incomplete: some workflows were reconstructed accurately; others required rediscovery of the original integration logic. A subset of execution history and credential configurations was not recoverable.
3. Root Cause Analysis
3.1 Primary Root Cause: Operational Category Error
The primary root cause was the application of a stateless operational discipline to a stateful workload. The operator’s mental model — shaped by extensive experience with stateless Kubernetes services — did not include a distinct procedure for services with persistent storage dependencies. This is not an operator error in the sense of a mistake or oversight. It is a systemic condition: Kubernetes tooling does not surface the stateful/stateless distinction at the point of change. The commands kubectl apply -f n8n-deployment.yaml and kubectl apply -f nginx-deployment.yaml are syntactically and procedurally identical. The consequences are not.
3.2 Contributing Factor: Absent Backup Infrastructure
The absence of automated PVC backup infrastructure transformed a recoverable incident into a data loss event. Had a recent backup existed, the incident’s impact would have been limited to the recovery window — the interval between the last backup and the change event. This contributing factor is downstream of the primary root cause. An organization that correctly classifies stateful services would, as a consequence of that classification, provision backup infrastructure. The backup gap and the procedural gap share a common origin.
3.3 Contributing Factor: Loose Deployment-PVC Coupling
The Kubernetes Deployment resource describes compute configuration. It does not describe, constrain, or protect the data lifecycle of its associated PVC. This design is intentional and appropriate — Deployments and PVCs are separate resources — but it means that changes to a Deployment manifest can have unintended consequences for PVC state without any tooling-level warning. The specific failure mechanism involved the interaction between the Recreate rollout strategy and the application’s initialization behavior on a volume mount. This interaction is not surfaced as a risk by standard Kubernetes tooling or documentation.
3.4 Risk Matrix: Stateless vs. Stateful Change Risk
The following table summarizes the divergence in risk profile between stateless and stateful Kubernetes workloads across common change operations.
| Change Operation | Stateless Service Risk | Stateful Service Risk | Notes |
|---|---|---|---|
| Environment variable update | Low | Medium | Can trigger application re-initialization against PVC |
| Image tag update | Low | Medium | New image may behave differently against existing volume data |
| Replica count change | Low | Low–High | Depends on whether service supports multi-instance access to shared PVC |
| Volume mount path change | N/A | Critical | Redirects application away from existing data |
| Storage class change | N/A | Critical | May require PVC recreation, destroying data |
| Rollout strategy change | Low | High | Changes pod lifecycle behavior relative to PVC availability |
| Namespace migration | Low | High | PVCs are namespace-scoped; migration requires data movement |
| Node drain / rescheduling | Low | Low | Storage backends typically handle pod migration cleanly |
4. Remediation Architecture
Following the incident, three remediation components were implemented: version-controlled state definitions, automated PVC backup infrastructure, and a formal change protocol. This section describes the first two; Section 5 addresses the protocol.
4.1 Version-Controlled State Definitions
The workflow automation platform supports export of workflow definitions as structured JSON. A CI workflow was implemented to treat these JSON definitions as source artifacts: on any commit to the designated workflows directory, the CI process automatically imports the definitions into the running service instance.
This architecture applies a well-established principle from database engineering: the data on disk is a materialized view; the authoritative source is the definition in version control. If the materialized state is lost, it can be reconstructed from the source. Recovery time is bounded by import execution time rather than manual reconstruction effort. This change alone would have reduced the incident’s impact from a multi-hour manual reconstruction to a sub-thirty-minute automated restore.
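A sketch of such a CI job follows, in GitHub Actions syntax. The repository layout, secret name, label selector, and the n8n CLI invocation are assumptions about a typical setup rather than the exact pipeline in use:

```yaml
name: import-workflows
on:
  push:
    branches: [main]
    paths:
      - "workflows/**"               # exported workflow JSON lives here (assumed layout)
jobs:
  import:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Configure cluster access
        run: echo "${{ secrets.KUBECONFIG }}" > kubeconfig   # assumed secret holding a kubeconfig
      - name: Import workflow definitions into the running instance
        env:
          KUBECONFIG: kubeconfig
        run: |
          POD=$(kubectl get pod -l app=n8n -o jsonpath='{.items[0].metadata.name}')
          kubectl cp workflows "$POD":/tmp/workflows
          kubectl exec "$POD" -- n8n import:workflow --separate --input=/tmp/workflows
```

The same job doubles as the restore path: re-running it against an empty instance repopulates the workflow definitions from version control.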
4.2 Automated PVC Backup Infrastructure
A CronJob-based backup pattern was implemented for all PVC-backed services in the environment. Each stateful service is assigned a corresponding CronJob that mounts the service PVC and copies its contents to an off-volume backup location on a defined schedule. The following CronJob manifest implements a daily backup pattern applicable to a generic PVC-backed service:
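Resource names, the schedule, and the backup destination PVC below are placeholders to adapt per service:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: n8n-backup                    # one CronJob per stateful service
spec:
  schedule: "0 2 * * *"               # daily at 02:00
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: busybox:1.36
              command:
                - /bin/sh
                - -c
                - |
                  set -e
                  STAMP=$(date +%Y%m%d-%H%M%S)
                  mkdir -p /backup/n8n
                  # file-level copy of the PVC contents; for SQLite, pausing writes
                  # or using the database's own backup command gives a cleaner copy
                  tar czf /backup/n8n/data-$STAMP.tar.gz -C /data .
                  # retain the 14 most recent archives
                  ls -1t /backup/n8n/data-*.tar.gz | tail -n +15 | while read -r f; do rm -f "$f"; done
              volumeMounts:
                - name: data
                  mountPath: /data
                  readOnly: true
                - name: backup
                  mountPath: /backup
          volumes:
            - name: data
              persistentVolumeClaim:
                claimName: n8n-data          # the service PVC
            - name: backup
              persistentVolumeClaim:
                claimName: backup-storage    # off-volume backup destination (placeholder)
```

Restore is the reverse operation: extract the chosen archive onto the service PVC while the application pod is stopped.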
4.3 Backup Strategy Comparison
The following table compares backup approaches applicable to PVC-backed Kubernetes services.
| Strategy | Implementation Complexity | Recovery Granularity | Protection Against Planned Changes | Suitable For |
|---|---|---|---|---|
| Scheduled CronJob copy | Low | Backup interval (e.g., daily) | No — gap between schedule and change | Baseline recovery; unplanned failures |
| Pre-change manual backup | Low | Point-in-time (before change) | Yes | Planned change events |
| Volume snapshot (CSI) | Medium | Point-in-time | Yes, if automated pre-change | Environments with CSI-compatible storage |
| Application-level export to VCS | Medium | Per-commit | Yes | Services that support structured export |
| Velero cluster backup | High | Scheduled or on-demand | Yes, if triggered pre-change | Full cluster backup/restore requirements |
The remediation architecture described in this document uses scheduled CronJob backups combined with pre-change manual backups. This combination does not require CSI snapshot support or third-party backup tooling, making it applicable in environments where storage driver capabilities are limited.
5. The Stateful Service Change Protocol
The Stateful Service Change Protocol defines the required procedure for any change operation applied to a PVC-backed Kubernetes workload. It is structured around four principles.
Principle 1: Backup Before Every Change
A manual backup of the PVC must be completed before any change is applied to a stateful service Deployment. This requirement admits no exceptions based on the perceived scope or risk of the change. Environment variable updates, image tag bumps, and resource limit adjustments are all subject to this requirement. The rationale is not that these changes are high-risk in isolation. It is that the operator’s assessment of risk for a stateful service change is systematically unreliable — as demonstrated by this incident, in which a low-risk change produced a high-impact outcome. The backup is not a risk mitigation; it is an escape hatch that exists regardless of risk assessment.
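Where the CronJob pattern from Section 4.2 is in place, one low-friction way to satisfy this requirement is a one-off Job that runs the same copy logic immediately before the change and writes to a clearly labelled pre-change archive. The manifest below is a sketch using the placeholder names from Section 4.2; an equivalent shortcut is kubectl create job --from=cronjob/n8n-backup with a distinguishable job name.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: n8n-backup-prechange
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: backup
          image: busybox:1.36
          command:
            - /bin/sh
            - -c
            - |
              set -e
              mkdir -p /backup/n8n
              # labelled explicitly as the pre-change restore source
              tar czf /backup/n8n/pre-change-$(date +%Y%m%d-%H%M%S).tar.gz -C /data .
          volumeMounts:
            - name: data
              mountPath: /data
              readOnly: true
            - name: backup
              mountPath: /backup
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: n8n-data
        - name: backup
          persistentVolumeClaim:
            claimName: backup-storage
```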
Principle 2: Classify the Change Before Execution
Change operations against stateful services are classified into two categories with distinct procedural requirements.
Configuration changes include environment variable updates, replica count adjustments, resource limits and requests, and image tag updates. These changes are lower risk but are not zero risk. They require a pre-change backup and post-change volume verification.
Structural changes include volume mount modifications, storage class changes, PVC configuration updates, and namespace migrations. These changes are high risk and must be treated with the same deliberation applied to database schema migrations: explicit rollback plans, staging validation where possible, and documented recovery procedures before execution begins.
Principle 3: Define the Rollback Before the Change
A concrete, executable rollback procedure must be documented before any change to a stateful service is applied. If the operator cannot describe in specific terms how the change would be reversed — including what state the PVC would be in after rollback and how that state would be verified — the change is not ready to proceed. This requirement serves two purposes. First, it forces explicit reasoning about failure modes before the operator is in a degraded state attempting to recover. Second, it identifies gaps in reversibility that may indicate the change requires further planning or staging.
Principle 4: Verify the Volume, Not the Pod
Post-change verification must include explicit confirmation that the PVC contents are intact and in the expected state. Pod health status — including readiness probe success and pod phase Running — is not sufficient evidence of data integrity. The verification step requires connecting to the pod and confirming that the persistent state is what the operator expects.
The following sequence defines the complete change execution flow:
Step 1: Pre-Change Backup
Trigger a manual backup of the service PVC. Confirm the backup completed successfully and that the backup artifact is accessible before proceeding.
Step 2: Change Classification
Classify the intended change as configuration or structural. If structural, confirm that a staging validation has been completed and that a maintenance window is in effect.
Step 3: Rollback Definition
Document the specific rollback procedure, including PVC restore steps. Confirm the backup from Step 1 is the restore source.
Step 4: Change Execution
Apply the change using the standard tooling, observe the rollout, and confirm the new pod reaches a healthy state before moving to verification.
Step 5: Volume Verification
Connect to the new pod. Verify that the volume mount is correct and that the persistent state — database contents, file structure, application configuration — is intact and as expected.
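The check itself can be as simple as listing the mount path inside the pod and confirming the database file is present and non-empty. Where a scriptable gate is preferred, a short-lived Job that mounts the same PVC read-only can perform the check and fail visibly if the expected state is missing; the names, mount path, and database file name below are illustrative and follow the placeholders used earlier:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: n8n-postchange-verify
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: verify
          image: busybox:1.36
          command:
            - /bin/sh
            - -c
            - |
              set -e
              echo "Mount contents:"
              ls -l /data
              # fail the Job if the database file is missing or empty
              test -s /data/database.sqlite
          volumeMounts:
            - name: data
              mountPath: /data
              readOnly: true
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: n8n-data
```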
6. Recommendations
The following recommendations are offered for organizations operating PVC-backed Kubernetes workloads in self-hosted environments.
1. Establish a formal stateful/stateless service classification for every workload in the cluster. The presence of a PVC is the operative criterion. Any service with a PVC is a stateful service and must be managed under the stateful discipline. This classification should be documented and reviewed when new services are introduced.
2. Treat PVC-backed services as stateful databases, not pods. This reframing is the most important operational change available at zero infrastructure cost. It changes the instinct applied at the moment of a change event — from “apply and observe” to “backup, plan, apply, verify.”
3. Implement automated PVC backup infrastructure for every stateful service before the service enters production. Backup infrastructure provisioned after a data loss incident is backup infrastructure provisioned too late. The CronJob pattern described in Section 4.2 is low-complexity and broadly applicable.
4. Version-control application state definitions wherever the application supports structured export. Workflow definitions, configuration exports, schema migration files, and similar artifacts should be committed to version control and used as the authoritative restore source. This reduces recovery time from hours to minutes.
5. Require pre-change manual backups as a gated step in the change process for stateful services. Scheduled backups are not a substitute for point-in-time pre-change backups. The pre-change backup closes the gap between the backup schedule and the change event.
6. Include explicit volume and data verification in post-change runbooks. Healthy pod status is not a proxy for data integrity. Post-change verification procedures must include a direct check of PVC contents.
7. Apply the migration mindset to all structural changes on stateful Deployments. Volume mount changes, storage class modifications, and PVC configuration updates should be treated with the same deliberation as database schema migrations: staged, reversible, and executed with explicit rollback plans.
8. Document every data loss incident and update operational protocols accordingly. Incidents not documented become incidents repeated. The protocol described in Section 5 emerged from a specific failure; it is more reliable and more credible for having that origin.
Conclusion
The incident analyzed in this document was caused by a gap between the operational mental model applied and the operational category of the workload being managed. Kubernetes tooling provided no friction at the point of the erroneous operation. Recovery required multi-hour manual reconstruction. The data loss was total and, in the absence of backup infrastructure, irreversible.
The remediation is straightforward: classify stateful and stateless workloads distinctly, provision backup infrastructure for all PVC-backed services, and apply a change protocol that treats stateful service modifications as database migration events. These measures require no new tooling and no infrastructure investment beyond the CronJob backup pattern. They require only a change in operational discipline.
As Kubernetes adoption in self-hosted environments continues to increase, the gap between stateless-native tooling assumptions and stateful operational requirements will widen. The platform’s defaults are optimized for the stateless case. Organizations operating stateful workloads must compensate for that optimization gap through explicit procedure and classification — or absorb the cost of incidents that the platform was not designed to prevent.
The incident described here was expensive in time and in recovered state quality. It was also, ultimately, instructive: it made a gap in the operational model visible and created the conditions for closing it. Every incident not documented is a lesson paid for twice.
For related operational frameworks, see The Runbook Is a Failure Ledger on converting incidents into permanent operational records, and Self-Hosted CI Pipeline for additional operational considerations in self-hosted Kubernetes environments.
All content represents personal learning from personal and side projects. Infrastructure details are generalized. No proprietary information is shared. Opinions are my own and do not reflect my employer’s views.