Executive Summary
AI coding agents operating in iterative development workflows accumulate corrections, constraint discoveries, and process adjustments throughout a session — but standard ephemeral storage (/tmp) does not survive session restarts, server reboots, or CI container teardowns. Without a deliberate persistence mechanism, each new session begins with no memory of what the previous session learned. This paper documents a three-layer knowledge capture architecture that solves the durability problem: corrections are logged in-session to /tmp, surfaced to the human operator via a session-end notification hook, and written durably to a project wiki by a nightly automated sweep. Combined with search-based parent task checkbox automation and dependency unblocking, the system produces a self-maintaining knowledge base and task graph that requires no manual curation. The architectural principle underlying all components is fire-and-forget: knowledge capture must never block or fail the primary development workflow.
Key Findings
- The /tmp durability gap is the central failure mode in AI agent knowledge management: corrections logged during a session are lost on restart unless an explicit persistence mechanism exists, causing the same mistakes to recur across sessions.
- A three-layer capture architecture provides durability without coupling: each layer (in-session log, session-end notification, nightly wiki write) can fail independently without affecting the others, and the nightly write provides the durable record regardless of session-end behavior.
- The ## Closure comment is a more reliable DONE signal than issue close events: issues are closed for many reasons — duplicates, spam, wontfix. Only the presence of a structured closure comment, written by the agent that completed the work, indicates legitimate task completion. All automation that fires on task close gates on this signal.
- Search-based parent discovery eliminates maintenance overhead for task hierarchy tracking: rather than requiring tasks to explicitly declare parent relationships, a search query against open issue bodies finds parent tasks containing an unchecked reference to the closing task at close time. No manual linking is required.
- Dependency unblocking should be automatic: when a task in a Blocked state is waiting on a specific completed task, the completion event should propagate automatically through the dependency graph, transitioning dependents without human intervention.
- The 30-minute CI poll interval is the primary latency bottleneck in autonomous task execution: replacing it with an event-driven dispatch on task approval reduces time-to-start from up to 30 minutes to near-real-time.
1. The Problem: Ephemeral Sessions, Persistent Mistakes
An AI agent operating in a development workflow receives corrections throughout each session. A human engineer observes that the agent is approaching a problem incorrectly, intervenes, and the agent adjusts. The corrected approach succeeds. The session ends. In the next session, the agent begins with no memory of the correction. It approaches the same problem the same way. The same intervention is required. The same correction is applied.
This is not a failure of the AI model — it is a failure of the operational environment. The model has no mechanism to carry forward what it learned. The human must provide the same correction repeatedly, which is expensive and eventually goes unprovided. The uncorrected mistake propagates to production.
The standard development tooling provides /tmp for ephemeral session state, and git for durable state. The gap between them — corrections that are known within a session but not written to git — is the durability problem this architecture addresses.
2. Architecture: Three-Layer Knowledge Capture
2.1 Layer 1 — In-Session Capture
log-correction.sh is a six-line script. It writes a timestamped entry to the current day’s correction log in /tmp:
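The script body is not reproduced in the source. A minimal sketch consistent with the description — an append-only, timestamped, fail-silent write — might look like this; the log path and entry format are assumptions:

```shell
#!/usr/bin/env bash
# log-correction.sh (sketch) — append one timestamped correction to today's log.
# The path and line format are illustrative, not the original six lines.
log_correction() {
  local log="/tmp/corrections-$(date -u +%F).log"
  # Fail-silent append: a full disk or missing /tmp must never block the agent.
  printf '%s | %s\n' "$(date -u +%H:%M:%SZ)" "$*" >> "$log" 2>/dev/null || true
}
log_correction "$@"
```

Invocation is a single line: `log-correction.sh "prefer rg over grep in CI scripts"`.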
The operational constitution (CLAUDE.md) contains the behavioral rule: after any correction from the human operator, call log-correction.sh with a brief description. The human triggers the correction; the agent is responsible for logging it immediately. This establishes the input discipline without requiring the human to do additional work.
The log is deliberately simple: append-only, plain text, no schema. Simplicity is intentional — any write failure is silent and non-blocking, and any parsing approach works on the output.
2.2 Layer 2 — Session-End Notification
A Claude Code stop hook fires automatically when the agent session ends. lessons-capture.sh reads the day’s correction log, builds a structured JSON payload, and POSTs it to an n8n webhook. n8n fans out the notification to Telegram, delivering the human operator a summary of the session’s corrections with a direct link to the project wiki.
After posting, the script deletes the log file. This prevents the same corrections from appearing in a subsequent session-end notification if the agent is restarted the same day.
The || true on the curl call is load-bearing: a Telegram notification failure must not cause the hook to exit non-zero. A failing stop hook would surface as an error to the human at the moment the session closes — adding noise to a workflow that should be invisible.
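A sketch of the hook body under these constraints. The webhook URL is passed in as a parameter, and the payload shape and jq usage are assumptions — the source describes the behavior, not the code:

```shell
# lessons-capture.sh (sketch): stop-hook body. Payload fields are illustrative.
capture_lessons() {  # capture_lessons <webhook-url>
  local log="/tmp/corrections-$(date -u +%F).log"
  [ -s "$log" ] || return 0            # nothing to report today
  local payload
  payload=$(jq -Rs --arg date "$(date -u +%F)" \
              '{date: $date, corrections: .}' "$log")
  # || true is load-bearing: a failed notification must not fail the hook.
  curl -sf -X POST -H 'Content-Type: application/json' \
       -d "$payload" "$1" || true
  rm -f "$log"   # avoid duplicate notifications if a session restarts today
}
```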
2.3 Layer 3 — Nightly Wiki Write
write-lessons.sh performs the durable persistence step. It is called by the nightly gate-resolved sweep (03:00 UTC) as a fire-and-forget subprocess after all Resolved tasks have been processed to Closed. It reads the day’s /tmp log, formats each entry, and writes to the project wiki via the Redmine REST API:
Entries written here may already have been surfaced by the Layer 2 notification, which read the same log in /tmp from earlier in the day. This redundancy is deliberate — the layers are independent, not sequential.
The wiki write uses a read-modify-write pattern: GET the full current page, append new entries, PUT the entire updated page. The Redmine wiki REST API does not support appending to a page — it requires the complete page body on every write. For high-volume correction environments, this creates a contention risk if multiple agents write simultaneously. The nightly sweep serializes all writes through a single process, which eliminates this risk at the cost of a delay between correction and wiki persistence.
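The cycle above can be sketched as follows, with the merge step factored out as a pure function. The project identifier, page name, and authentication variables are assumptions; the endpoint shape follows the Redmine wiki REST API:

```shell
# Redmine's wiki API has no append operation, so the full page body is
# rebuilt client-side before the PUT. Pure merge step, shown separately:
merge_page() {  # merge_page <current-page-text> <log-file>
  printf '%s\n' "$1"
  sed 's/^/* /' "$2"    # render each correction as a wiki list item
}

# write-lessons.sh (sketch): GET full page, merge, PUT full page.
write_lessons() {
  local page="$REDMINE_URL/projects/myproj/wiki/Lessons_Learned.json"
  local current
  current=$(curl -sf -H "X-Redmine-API-Key: $API_KEY" "$page" |
              jq -r '.wiki_page.text')
  merge_page "$current" "/tmp/corrections-$(date -u +%F).log" |
    jq -Rs '{wiki_page: {text: .}}' |
    curl -sf -X PUT -H "X-Redmine-API-Key: $API_KEY" \
         -H 'Content-Type: application/json' -d @- "$page" || true
}
```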
Layer Comparison
| Property | Layer 1 (in-session) | Layer 2 (session-end) | Layer 3 (nightly) |
|---|---|---|---|
| Durability | Ephemeral — lost on restart | Semi-durable — Telegram history | Permanent — wiki |
| Latency | Immediate | End of session | Next 03:00 UTC |
| Failure impact | Silent — no entry logged | Notification missed | Delay, not loss |
| Human visibility | None | High — Telegram alert | Searchable wiki |
| Survives reboot | No | No (/tmp deleted) | Yes |
3. Task Lifecycle Automation
3.1 State Machine Overview
The task lifecycle is a formal state machine with legal transitions enforced by a GitHub Actions workflow at every state change. The two transitions relevant to automation are → Approved (near-real-time executor dispatch) and → Closed (wiki write, parent checkbox update, dependency unblocking).
3.2 Near-Real-Time Executor Dispatch
Before event-driven dispatch, the agent responsible for executing tasks polled the task tracker on a 30-minute Gitea Actions schedule. An approved task sat idle for up to 30 minutes before an agent picked it up. Under sprint conditions with 10–20 tasks approved per day, this introduced hours of aggregate idle time.
The dispatch workflow fires on the labeled event for any task receiving the Approved label. It immediately sends a repository_dispatch event to the management repository with the task’s identifier. The executor processes the event on receipt, giving priority to tasks marked REOPENED — tasks that failed review and were returned to the executor for rework. Reopened tasks are processed before newly approved tasks, ensuring that rework does not queue behind new work indefinitely.
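The dispatch step itself reduces to one authenticated POST. A sketch, assuming a GitHub-style repository_dispatch endpoint; the repository path, token variable, and event name are illustrative:

```shell
# Build the dispatch payload separately so it can be inspected in isolation.
build_dispatch_payload() {  # build_dispatch_payload <issue-number>
  jq -n --arg n "$1" \
     '{event_type: "task-approved", client_payload: {issue: $n}}'
}

# Fire-and-forget dispatch to the management repository (sketch).
dispatch_task() {  # dispatch_task <issue-number>
  build_dispatch_payload "$1" |
    curl -sf -X POST \
         -H "Authorization: Bearer $DISPATCH_TOKEN" \
         -H "Accept: application/vnd.github+json" \
         -d @- "https://api.github.com/repos/example-org/management/dispatches" || true
}
```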
3.3 Parent Task Checkbox Automation
Epic and parent issues track sub-task completion via Markdown checkboxes in the issue body. When a child issue closes with a ## Closure comment, the automation searches for all open parent issues whose body contains an unchecked reference to the closing issue number. It replaces - [ ] #N with - [x] #N in each matching parent and posts a comment on the closed child issue: "Checked off in #<parent-number>".
The search-based discovery pattern is the critical design choice. No explicit parent link is required in either the child or parent issue. A parent written weeks before the child is created automatically receives the checkbox update when the child closes. The maintenance burden of keeping parent-child relationships current is eliminated.
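The matching and rewriting steps can be shown as pure text operations. In production the discovery runs as a tracker search query; the function names and the exact regex here are illustrative:

```shell
# Does a parent body (on stdin) contain an unchecked reference to the child?
has_unchecked_ref() {  # has_unchecked_ref <child-number>
  grep -qE "^- \[ \] .*#$1([^0-9]|$)"
}

# Rewrite the matching checkbox. The trailing ([^0-9]|$) group keeps a
# search for #102 from also matching inside #1021.
check_off() {          # check_off <child-number>  (body on stdin)
  sed -E "s/^- \[ \](.*#$1)([^0-9]|$)/- [x]\1\2/"
}
```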
3.4 Dependency Unblocking
When a task in Blocked state is waiting on a specific task, that relationship is recorded in the blocked task’s journal. When the blocking task closes, gate-resolved.py calls notify_dependents(), which posts a ## Dependency Resolved comment on any Blocked task whose journal references the closed task number, and automatically transitions that task to Approved.
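The decision step can be sketched as a pure predicate. The journal marker format `blocked-by #N` and the function name are assumptions — the source specifies only that the relationship lives in the journal:

```shell
# notify_dependents() decision step (sketch): given a Blocked task's
# journal text on stdin, decide whether closing issue <n> unblocks it.
unblocked_by() {  # unblocked_by <closed-issue-number>
  grep -qE "blocked-by #$1([^0-9]|$)"
}
```

When the predicate succeeds, the automation posts the ## Dependency Resolved comment and transitions the task to Approved.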
4. The Fire-and-Forget Principle
Every component in this architecture is fire-and-forget relative to the primary development workflow. log-correction.sh fails silently. lessons-capture.sh catches all curl errors. write-lessons.sh is wrapped in exception handling that never re-raises. Parent checkbox updates post failure comments but do not surface errors to the workflow.
This is not carelessness — it is an explicit design decision based on a priority ordering: development work is critical-path; knowledge capture is valuable but not critical-path. A failing wiki write must never cause a gate sweep to fail. A failing Telegram notification must never cause a session close to error.
The consequence of this design is that knowledge capture degrades gracefully under failure rather than failing hard. An operator can audit the /tmp log directly, check Telegram history, or query the wiki — each layer provides independent access to the captured corrections.
5. Recommendations
- Establish the behavioral rule before the automation. The CLAUDE.md rule — “call log-correction.sh after every correction” — is the foundation. Without consistent in-session capture, the downstream layers have nothing to persist. The rule must be explicit and in the operational constitution, not an informal expectation.
- Use the ## Closure comment as the authoritative DONE signal, not the issue close event. Close events are unreliable indicators of legitimate completion. A structured closure comment written by the agent that completed the work provides the reliable signal that all downstream automation can key on.
- Implement search-based parent discovery rather than explicit parent linking. Explicit linking requires discipline at issue creation time and maintenance as sprint structure evolves. Search-based discovery requires no discipline — it works on issues written before the automation existed and requires no retroactive modification.
- Replace polling-based task pickup with event-driven dispatch. The 30-minute poll interval is the primary latency bottleneck in autonomous task execution. Event-driven dispatch on task approval is a straightforward improvement that requires a single webhook workflow and a repository dispatch handler.
- Commit the lessons-learned wiki to a searchable, indexed system. A flat append-only ledger in a project wiki is sufficient for logging. What matters is searchability — the ability to query “what correction was applied to problems involving X” across the full history. Redmine wiki, Confluence, Notion, and GitHub wikis all support this. A text file in git does not.
- Audit knowledge capture completeness periodically. Query the wiki for the most recent entry. If the date is more than two days behind the current date in an active development period, the nightly write is likely failing silently. Add this check to the regular operational review.
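This check is scriptable. A sketch, assuming each wiki entry begins with an ISO date and the page text arrives on stdin; GNU date is assumed for the arithmetic:

```shell
# lessons_stale: succeed (exit 0) if the newest dated entry on stdin is
# more than two days old — the signal that the nightly write is failing.
lessons_stale() {
  local last last_s now_s
  last=$(grep -oE '^[0-9]{4}-[0-9]{2}-[0-9]{2}' | sort | tail -n 1)
  last_s=$(date -u -d "${last:-1970-01-01}" +%s)   # GNU date -d
  now_s=$(date -u +%s)
  [ $(( now_s - last_s )) -gt $(( 2 * 86400 )) ]
}
```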
Conclusion
The /tmp durability gap is not unique to AI agent workflows — it affects any iterative process where corrections accumulate within bounded sessions and need to survive across restarts. What is distinctive about AI agent workflows is the scale of the problem: an agent corrected on the same mistake five times in five sessions is producing five times the remediation cost that a persistent memory system would eliminate.
The three-layer architecture described here is not sophisticated — it is three shell scripts and a wiki. The sophistication is in the principle: every correction captured in-session must have a durable path to persistent storage, and that path must be non-blocking. Knowledge capture that can fail the development workflow will eventually be disabled to prevent disruptions. Knowledge capture that is fire-and-forget continues to accumulate regardless.
As AI agents take on longer-horizon autonomous work, the value of this accumulated correction history compounds. An agent operating in week 12 of a project with access to corrections from weeks 1–11 has a qualitatively different operational context than one starting cold each session. The investment in capture infrastructure pays forward proportionally to the project’s duration.
Code examples are sanitized and generalized. No proprietary information is shared. Opinions are my own and do not reflect my employer’s views.