The /tmp Durability Gap: Automated Lessons-Learned in AI Development

Three-layer knowledge capture: in-session log, session-end notification, nightly wiki write — plus task lifecycle automation

Executive Summary

AI coding agents operating in iterative development workflows accumulate corrections, constraint discoveries, and process adjustments throughout a session — but standard ephemeral storage (/tmp) does not survive session restarts, server reboots, or CI container teardowns. Without a deliberate persistence mechanism, each new session begins with no memory of what the previous session learned. This paper documents a three-layer knowledge capture architecture that solves the durability problem: corrections are logged in-session to /tmp, surfaced to the human operator via a session-end notification hook, and written durably to a project wiki by a nightly automated sweep. Combined with search-based parent task checkbox automation and dependency unblocking, the system produces a self-maintaining knowledge base and task graph that requires no manual curation. The architectural principle underlying all components is fire-and-forget: knowledge capture must never block or fail the primary development workflow.

Key Findings

The /tmp durability gap is the central failure mode in AI agent knowledge management: corrections logged during a session are lost on restart unless an explicit persistence mechanism exists, causing the same mistakes to recur across sessions.
A three-layer capture architecture provides durability without coupling: each layer (in-session log, session-end notification, nightly wiki write) can fail independently without affecting the others, and the nightly write provides the durable record regardless of session-end behavior.
The ## Closure comment is a more reliable DONE signal than issue close events: issues are closed for many reasons — duplicates, spam, wontfix. Only the presence of a structured closure comment, written by the agent that completed the work, indicates legitimate task completion. All automation that fires on task close gates on this signal.
Search-based parent discovery eliminates maintenance overhead for task hierarchy tracking: rather than requiring tasks to explicitly declare parent relationships, a search query against open issue bodies finds parent tasks containing an unchecked reference to the closing task at close time. No manual linking is required.
Dependency unblocking should be automatic: when a task in a Blocked state is waiting on a specific completed task, the completion event should propagate automatically through the dependency graph, transitioning dependents without human intervention.
The 30-minute CI poll interval is the primary latency bottleneck in autonomous task execution: replacing it with an event-driven dispatch on task approval reduces time-to-start from up to 30 minutes to near-real-time.

1. The Problem: Ephemeral Sessions, Persistent Mistakes

An AI agent operating in a development workflow receives corrections throughout each session. A human engineer observes that the agent is approaching a problem incorrectly, intervenes, and the agent adjusts. The corrected approach succeeds. The session ends. In the next session, the agent begins with no memory of the correction. It approaches the same problem the same way. The same intervention is required. The same correction is applied. This is not a failure of the AI model — it is a failure of the operational environment. The model has no mechanism to carry forward what it learned. The human must provide the same correction repeatedly, which is expensive and eventually goes unprovided. The uncorrected mistake propagates to production. The standard development tooling provides /tmp for ephemeral session state, and git for durable state. The gap between them — corrections that are known within a session but not written to git — is the durability problem this architecture addresses.

2. Architecture: Three-Layer Knowledge Capture

2.1 Layer 1 — In-Session Capture

log-correction.sh is a six-line script. It writes a timestamped entry to the current day’s correction log in /tmp:

LOGFILE="/tmp/corrections-$(date +%Y%m%d).log"
TIMESTAMP=$(date -u '+%Y-%m-%dT%H:%MZ')
echo "[$TIMESTAMP] $*" >> "$LOGFILE"

The agent’s operational constitution (CLAUDE.md) contains the behavioral rule: after any correction from the human operator, call log-correction.sh with a brief description. The human triggers the correction; the agent is responsible for logging it immediately. This establishes the input discipline without requiring the human to do additional work. The log is deliberately simple: append-only, plain text, no schema. Simplicity is intentional — any write failure is silent and non-blocking, and any parsing approach works on the output.

2.2 Layer 2 — Session-End Notification

A Claude Code stop hook fires automatically when the agent session ends. lessons-capture.sh reads the day’s correction log, builds a structured JSON payload, and POSTs it to an n8n webhook. n8n fans out the notification to Telegram, delivering the human operator a summary of the session’s corrections with a direct link to the project wiki. After posting, the script deletes the log file. This prevents the same corrections from appearing in a subsequent session-end notification if the agent is restarted the same day.

if [[ ! -s "$LOGFILE" ]]; then exit 0; fi
# build JSON payload from log entries
curl -s -X POST "$N8N_WEBHOOK_URL" \
  -H "Content-Type: application/json" \
  -d "$payload" || true  # fire-and-forget: failure is non-fatal
rm -f "$LOGFILE"

The || true on the curl call is load-bearing: a Telegram notification failure must not cause the hook to exit non-zero. The stop hook failing would surface as an error to the human at the moment the session closes — adding noise to a workflow that should be invisible.

2.3 Layer 3 — Nightly Wiki Write

write-lessons.sh performs the durable persistence step. It is called by the nightly gate-resolved sweep (03:00 UTC) as a fire-and-forget subprocess after all Resolved tasks have been processed to Closed. It reads the day’s /tmp log, formats each entry, and writes to the project wiki via the Redmine REST API:

# GET current wiki content
# Append new entries in format:
# [YYYY-MM-DD] WHAT: <correction> → FIX: (see task journal for remediation)
# PUT updated content back
url = os.environ["WIKI_URL"]  # /projects/<project>/wiki/Lessons-Learned.json
req = urllib.request.Request(url, data=body, headers={
    "X-Redmine-API-Key": api_key,
    "Content-Type": "application/json",
}, method="PUT")

The FIX field intentionally defers to the task journal rather than duplicating remediation detail. The wiki is the index — a searchable ledger of what went wrong and when. The task journal holds the full corrective action. The nightly sweep calls this script even if the session-end hook failed: the log file may still exist in /tmp from earlier in the day. This redundancy is deliberate — the layers are independent, not sequential.

The wiki write uses a read-modify-write pattern: GET the full current page, append new entries, PUT the entire updated page. The Redmine wiki REST API does not support appending to a page — it requires the complete page body on every write. For high-volume correction environments, this creates a contention risk if multiple agents write simultaneously. The nightly sweep serializes all writes through a single process, which eliminates this risk at the cost of a delay between correction and wiki persistence.

Layer Comparison

Property	Layer 1 (in-session)	Layer 2 (session-end)	Layer 3 (nightly)
Durability	Ephemeral — lost on restart	Semi-durable — Telegram history	Permanent — wiki
Latency	Immediate	End of session	Next 03:00 UTC
Failure impact	Silent — no entry logged	Notification missed	Delay, not loss
Human visibility	None	High — Telegram alert	Searchable wiki
Survives reboot	No	No (/tmp deleted)	Yes

3. Task Lifecycle Automation

3.1 State Machine Overview

The task lifecycle is a formal state machine with legal transitions enforced by a GitHub Actions workflow at every state change. The states relevant to automation:

New → Proposed → Breakdown → Skeptic Review → Approved
  → In Progress → In Review → Resolved → Closed

Automation fires at two transitions: → Approved (near-real-time executor dispatch) and → Closed (wiki write, parent checkbox update, dependency unblocking).

3.2 Near-Real-Time Executor Dispatch

Before event-driven dispatch, the agent responsible for executing tasks polled the task tracker on a 30-minute Gitea Actions schedule. An approved task sat idle for up to 30 minutes before an agent picked it up. Under sprint conditions with 10–20 tasks approved per day, this introduced hours of aggregate idle time. The dispatch workflow fires on the labeled event for any task receiving the Approved label. It immediately sends a repository_dispatch event to the management repository with the task’s identifier. The executor processes the event on receipt:

Task labeled APPROVED → webhook fires → executor receives task-ready event
  → executor begins within seconds (previously: up to 30 minutes)

The same dispatch fires on REOPENED — tasks that failed review and were returned to the executor for rework. Reopened tasks are processed before newly approved tasks, ensuring that rework does not queue behind new work indefinitely.

3.3 Parent Task Checkbox Automation

Epic and parent issues track sub-task completion via Markdown checkboxes in the issue body:

## Sprint 12 — Auth Hardening
- [ ] #624 — Track Cargo.lock for audit blind spot
- [ ] #625 — Replace inline token in git clone URLs
- [ ] #626 — Per-agent Vault credential injection

When a child issue closes with a ## Closure comment, the automation searches for all open parent issues whose body contains an unchecked reference to the closing issue number:

search_query = f"repo:{REPO} is:issue is:open \"- [ ] #{ISSUE_NUMBER}\" in:body"

For each parent found, it patches the body to replace - [ ] #N with - [x] #N and posts a comment on the closed child issue: "Checked off in #<parent-number>". The search-based discovery pattern is the critical design choice. No explicit parent link is required in either the child or parent issue. A parent written weeks before the child is created automatically receives the checkbox update when the child closes. The maintenance burden of keeping parent-child relationships current is eliminated.

The search query uses a literal string match on "- [ ] #N" rather than a regex. This is intentional — GitHub’s search index handles the literal match efficiently, while regex-based API queries are either unavailable or slow. The regex-safe replacement pattern is applied client-side after the parent issue body is retrieved.

3.4 Dependency Unblocking

When a task in Blocked state is waiting on a specific task, that relationship is recorded in the blocked task’s journal. When the blocking task closes, gate-resolved.py calls notify_dependents(), which posts a ## Dependency Resolved comment on any task in Blocked state that references the closed task number. If the dependent task is in Blocked state, it is automatically transitioned to Approved.

for dep in find_blocked_dependents(closed_issue_number):
    post_comment(dep, "## Dependency Resolved\n\nBlocking task #N is now closed.")
    if get_status(dep) == "Blocked":
        set_status(dep, "Approved")
        dispatch_task_ready(dep)

The dependency graph propagates automatically. A sprint where three tasks are blocked on a single infrastructure task proceeds without human intervention when that task closes.

4. The Fire-and-Forget Principle

Every component in this architecture is fire-and-forget relative to the primary development workflow. log-correction.sh fails silently. lessons-capture.sh catches all curl errors. write-lessons.sh is wrapped in exception handling that never re-raises. Parent checkbox updates post failure comments but do not surface errors to the workflow. This is not carelessness — it is an explicit design decision based on a priority ordering: development work is critical-path; knowledge capture is valuable but not critical-path. A failing wiki write must never cause a gate sweep to fail. A failing Telegram notification must never cause a session close to error. The consequence of this design is that knowledge capture degrades gracefully under failure rather than failing hard. An operator can audit the /tmp log directly, check Telegram history, or query the wiki — each layer provides independent access to the captured corrections.

The fire-and-forget design means knowledge capture failures are silent by default. In high-velocity environments where corrections accumulate rapidly, a monitoring check on wiki page last-modified timestamp is advisable. If the nightly write is silently failing, the wiki will appear up to date (from prior successful writes) while new corrections accumulate unrecorded.

5. Recommendations

Establish the behavioral rule before the automation. The CLAUDE.md rule — “call log-correction.sh after every correction” — is the foundation. Without consistent in-session capture, the downstream layers have nothing to persist. The rule must be explicit and in the operational constitution, not an informal expectation.
Use the ## Closure comment as the authoritative DONE signal, not the issue close event. Close events are unreliable indicators of legitimate completion. A structured closure comment written by the agent that completed the work provides the structured signal that all downstream automation can reliably key on.
Implement search-based parent discovery rather than explicit parent linking. Explicit linking requires discipline at issue creation time and maintenance as sprint structure evolves. Search-based discovery requires no discipline — it works on issues written before the automation existed and requires no retroactive modification.
Replace polling-based task pickup with event-driven dispatch. The 30-minute poll interval is the primary latency bottleneck in autonomous task execution. Event-driven dispatch on task approval is a straightforward improvement that requires a single webhook workflow and a repository dispatch handler.
Commit the lessons-learned wiki to a searchable, indexed system. A flat append-only ledger in a project wiki is sufficient for logging. What matters is searchability — the ability to query “what correction was applied to problems involving X” across the full history. Redmine wiki, Confluence, Notion, and GitHub wikis all support this. A text file in git does not.
Audit knowledge capture completeness periodically. Query the wiki for the most recent entry. If the date is more than two days behind the current date in an active development period, the nightly write is likely failing silently. Add this check to the regular operational review.

Conclusion

The /tmp durability gap is not unique to AI agent workflows — it affects any iterative process where corrections accumulate within bounded sessions and need to survive across restarts. What is distinctive about AI agent workflows is the scale of the problem: an agent corrected on the same mistake five times in five sessions is producing five times the remediation cost that a persistent memory system would eliminate. The three-layer architecture described here is not sophisticated — it is three shell scripts and a wiki. The sophistication is in the principle: every correction captured in-session must have a durable path to persistent storage, and that path must be non-blocking. Knowledge capture that can fail the development workflow will eventually be disabled to prevent disruptions. Knowledge capture that is fire-and-forget continues to accumulate regardless. As AI agents take on longer-horizon autonomous work, the value of this accumulated correction history compounds. An agent operating in week 12 of a project with access to corrections from weeks 1–11 has a qualitatively different operational context than one starting cold each session. The investment in capture infrastructure pays forward proportionally to the project’s duration.

Code examples are sanitized and generalized. No proprietary information is shared. Opinions are my own and do not reflect my employer’s views.

Overview

Workflows

Process

Infrastructure

The /tmp Durability Gap: Automated Lessons-Learned Capture in an AI Development Workflow

Executive Summary

Key Findings

1. The Problem: Ephemeral Sessions, Persistent Mistakes

2. Architecture: Three-Layer Knowledge Capture

2.1 Layer 1 — In-Session Capture

2.2 Layer 2 — Session-End Notification

2.3 Layer 3 — Nightly Wiki Write

Layer Comparison

3. Task Lifecycle Automation

3.1 State Machine Overview

3.2 Near-Real-Time Executor Dispatch

3.3 Parent Task Checkbox Automation

3.4 Dependency Unblocking

4. The Fire-and-Forget Principle

5. Recommendations

Conclusion

Overview

Workflows

Process

Infrastructure

Documentation Index

​Executive Summary

​Key Findings

​1. The Problem: Ephemeral Sessions, Persistent Mistakes

​2. Architecture: Three-Layer Knowledge Capture

​2.1 Layer 1 — In-Session Capture

​2.2 Layer 2 — Session-End Notification

​2.3 Layer 3 — Nightly Wiki Write

​Layer Comparison

​3. Task Lifecycle Automation

​3.1 State Machine Overview

​3.2 Near-Real-Time Executor Dispatch

​3.3 Parent Task Checkbox Automation

​3.4 Dependency Unblocking

​4. The Fire-and-Forget Principle

​5. Recommendations

​Conclusion

Executive Summary

Key Findings

1. The Problem: Ephemeral Sessions, Persistent Mistakes

2. Architecture: Three-Layer Knowledge Capture

2.1 Layer 1 — In-Session Capture

2.2 Layer 2 — Session-End Notification

2.3 Layer 3 — Nightly Wiki Write

Layer Comparison

3. Task Lifecycle Automation

3.1 State Machine Overview

3.2 Near-Real-Time Executor Dispatch

3.3 Parent Task Checkbox Automation

3.4 Dependency Unblocking

4. The Fire-and-Forget Principle

5. Recommendations

Conclusion