Executive Summary
Engineering organizations operating in fast-moving domains — LLM tooling, AI infrastructure, competitive SaaS markets — face a structural intelligence deficit: the rate at which relevant developments emerge exceeds the rate at which manual review processes can surface, evaluate, and route them. Weekly human review cycles are inconsistent in execution, expensive in attention, and chronically behind the information frontier. The Autoresearch skill addresses this deficit through an eight-phase autonomous execution loop running nightly on a scheduled cron, with a hard ninety-minute budget and no human approval required at any phase. Relevance scoring is performed entirely by a locally-hosted LLM, eliminating per-run API cost for the highest-frequency operation in the loop. Score-nine-and-above findings auto-create formal requirements and trigger immediate notification, closing the gap between intelligence surfacing and organizational response without human mediation. This paper documents the architecture, scoring model, escalation ladder, and operational constraints of the Autoresearch loop, with recommendations for organizations seeking continuous intelligence coverage at sustainable cost.
Key Findings
- Manual research review processes cannot maintain research currency at the pace that modern AI and SaaS domains evolve. The information half-life of LLM tooling changes, competitor feature releases, and relevant research is days to weeks — far shorter than the typical weekly review cadence.
- Local inference for relevance scoring eliminates the primary per-run cost driver. Routing all scoring through an on-premise Ollama instance rather than a cloud API converts the highest-frequency operation in the loop from a per-token expense into a sunk infrastructure cost, enabling nightly cadence without cost compounding.
- The cross-validation bonus — awarding a point to findings that appear in two or more channels — is a lightweight ensemble signal that surfaces consensus without requiring complex aggregation. A finding corroborated across GitHub Trending and HackerNews carries meaningfully more signal than one appearing in either channel alone.
- Budget enforcement via “finish-current-item, then post” ensures partial results are always published. A run that exceeds the ninety-minute budget but produces and posts twelve scored findings is operationally superior to a run that times out silently with nothing posted.
- A run summary is a mandatory output — not a conditional one. Zero high-signal findings is a valid research outcome. Zero run summary is a monitoring blind spot. The distinction is critical for operational reliability.
- The research brief decouples organizational focus from execution mechanics. A human-editable file that defines scoring context changes what the agent prioritizes without touching the execution loop — enabling weekly focus updates without code changes.
1. The Research Currency Problem: Manual Review Cannot Keep Pace With Autonomous Execution Speed
Research currency — the degree to which an engineering organization’s awareness of its domain accurately reflects the current state of that domain — is a prerequisite for competitive positioning. Organizations that identify relevant developments late pay a lag tax: competitor features observed after implementation rather than during planning, research papers absorbed after the methodology has become industry standard, tooling improvements discovered after the organization has already built a workaround.
The traditional response to this problem is scheduled human review: a weekly or biweekly session in which a team member surveys selected sources, extracts relevant items, and routes them to appropriate stakeholders. This approach has four structural weaknesses that compound at scale.
Consistency degradation. Human review sessions are the first casualty of execution pressure. When sprint commitments intensify, the research review is skipped or abbreviated. The organizations that most need current intelligence — those in the highest-velocity competitive environments — are also the ones most likely to deprioritize the review under load.
Channel coverage limits. A human reviewer can meaningfully engage with a bounded set of sources per session. GitHub Trending, HackerNews front page, relevant subreddits, competitor changelogs, and recent arXiv preprints collectively represent more content than a thorough weekly session can evaluate with consistent quality. Practical human review covers a subset of the available signal space.
Routing latency. Even when a relevant finding is surfaced, the path from identification to organizational response involves human judgment at each step: is this worth flagging? who should see it? does it warrant a formal requirement or informal discussion? Each judgment step adds latency between signal and response.
Scoring inconsistency. Without a defined relevance rubric, human review produces inconsistent triage. The same finding evaluated by two reviewers on different weeks may receive different prioritization — not because the finding changed, but because the reviewers’ implicit weighting criteria differed.
Autonomous continuous execution resolves each of these weaknesses. A nightly loop runs regardless of sprint pressure. It covers a defined channel set exhaustively. It routes findings by score without human intermediation. It applies a consistent scoring rubric on every run. The constraint is not willingness — it is the cost of running a capable LLM for relevance scoring at nightly cadence. Local inference eliminates that constraint.
2. Architecture: Eight Phases, Ninety Minutes, Zero Human Approval
The Autoresearch loop executes eight sequential phases within a hard ninety-minute budget. The phases are not parallelized; each phase depends on the outputs of its predecessors. The agent decides, posts, and exits without requesting human continuation at any point.
| Phase | Name | Budget | Description |
|---|---|---|---|
| 1 | INITIALIZE | 2 min | Load focus configuration, product profiles, deduplication logs |
| 2 | ENGINEERING SEARCH | 25 min | Fetch from GitHub Trending, HackerNews, arXiv, release channels |
| 3 | PRODUCT INTELLIGENCE SEARCH | 25 min | Fetch from competitor changelogs, Reddit, review platforms |
| 4 | ENGINEERING SCORE | 10 min | Score engineering findings via local LLM; deduplicate |
| 5 | PRODUCT SCORE | 10 min | Score product findings via local LLM; deduplicate |
| 6 | POST | 15 min | Write findings, create requirements, verify escalation counts |
| 7 | SUMMARY | 3 min | Post run summary regardless of finding count |
| 8 | AUTONOMOUS DECISION | — | Handle empty results, budget overrun, or channel failure |
Phase sequencing matters for correctness. Engineering and product search phases run before scoring so that the full corpus of raw findings is assembled before deduplication is applied. Running scoring on each finding as it is fetched would produce a deduplication set that grows during the fetch phase, creating ordering-dependent outcomes where the same finding would be deduped or not depending on which channel was fetched first.
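A minimal sketch of how the budget discipline might be enforced is shown below. The phase interface, state shape, and function names are illustrative assumptions rather than the skill's actual code; the point is the ordering: finish the item in flight, check the clock, and on overrun jump straight to posting partial results and the mandatory summary.

```python
import time

RUN_BUDGET_SECONDS = 90 * 60  # hard ninety-minute budget


def run_loop(phases, post_phase, summary_phase):
    """Run phases in order; on budget overrun, finish the current item,
    then jump straight to POST and SUMMARY with whatever has been produced."""
    started = time.monotonic()
    state = {"findings": [], "overrun": False}

    def over_budget():
        return time.monotonic() - started >= RUN_BUDGET_SECONDS

    for phase in phases:
        for item in phase.items(state):
            phase.process(item, state)   # finish the current item first...
            if over_budget():            # ...then check the budget
                state["overrun"] = True
                break
        if state["overrun"]:
            break

    post_phase(state)       # partial results are still published
    summary_phase(state)    # the run summary is mandatory, overrun or not
    return state
```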
3. Source Coverage: Engineering Channels, Product Intelligence, and Cross-Validation
The Autoresearch loop draws from two distinct source categories, each with its own channel set, scoring ruleset, and output destination.
3.1 Engineering Channels
The engineering search phase covers four primary sources:
GitHub Trending (top 20 repositories). GitHub Trending surfaces repositories gaining sustained attention within a time window. The top 20 provides sufficient coverage of high-momentum projects without capturing the long tail of marginal activity. Repository metadata — language, description, star velocity — provides the scoring LLM with sufficient signal to evaluate relevance against the current research brief.
HackerNews (top 50 items). The HackerNews front page aggregates community-validated technical content. The top-50 threshold captures the items with demonstrated engagement without over-indexing on velocity-gaming tactics that occasionally affect lower-ranked positions.
arXiv (top 10 papers). Research preprints from arXiv represent the leading edge of methodology before peer review and before industry adoption. Ten papers per run provides meaningful coverage of daily output in relevant categories (cs.AI, cs.LG, cs.SE) without overwhelming the scoring budget.
Release channels. Major platform release announcements — LLM providers, foundational infrastructure, relevant open-source projects — are fetched from official release feeds. This channel is the highest-precision source in the engineering set: a release announcement from a directly relevant platform is almost always worth scoring.
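As an illustration of one engineering fetch step, the sketch below pulls the HackerNews top 50 through the public HackerNews Firebase API. The finding dictionary shape is an assumption made for this article, not the skill's actual schema.

```python
import requests

HN_TOP = "https://hacker-news.firebaseio.com/v0/topstories.json"
HN_ITEM = "https://hacker-news.firebaseio.com/v0/item/{id}.json"


def fetch_hackernews(limit=50, timeout=10):
    """Fetch the top `limit` HackerNews stories as raw findings for scoring."""
    ids = requests.get(HN_TOP, timeout=timeout).json()[:limit]
    findings = []
    for item_id in ids:
        item = requests.get(HN_ITEM.format(id=item_id), timeout=timeout).json()
        if not item or item.get("type") != "story":
            continue  # skip jobs, polls, and deleted items
        findings.append({
            "channel": "hackernews",
            "title": item.get("title", ""),
            "url": item.get("url") or f"https://news.ycombinator.com/item?id={item_id}",
            "points": item.get("score", 0),
        })
    return findings
```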
3.2 Product Intelligence Channels
The product intelligence search phase covers competitor and market signal sources, applied per-profile against each enabled product:
Competitor changelog feeds (RSS and scrape). Official changelogs represent the most authoritative signal for competitor feature activity. Where RSS feeds are available, they are preferred for structural reliability. Where scraping is required, the phase applies basic structural heuristics to extract dated changelog entries.
Reddit sentiment queries. Targeted queries — constructed from the product profile’s competitor list and sentiment terms — surface organic user frustration that does not appear in official channels. A query pattern such as “[Competitor] frustrating” or “[Competitor] broken” retrieves recent posts expressing pain points. These findings score lower on the escalation ladder but provide leading indicators of competitor weakness before those weaknesses appear in formal review channels.
HackerNews Algolia API. The HackerNews Algolia search API enables targeted retrieval of discussion threads mentioning specific competitors or products. Unlike the front-page fetch in the engineering phase, this query is product-profile-specific and retrieves historical discussion depth not limited to the current front page.
Review platforms (1–3 star filters). Low-star reviews on software review platforms surface structured negative feedback about competitors. The 1–3 star filter concentrates the query on dissatisfied users who are most likely to describe specific pain points rather than general impressions.
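The competitor-specific HackerNews query might look like the sketch below. The endpoint and parameters belong to the public HN Algolia search API; the result shape and the seven-day window are illustrative assumptions.

```python
import time
import requests

ALGOLIA_SEARCH = "https://hn.algolia.com/api/v1/search"


def search_competitor_mentions(competitor, days=7, timeout=10):
    """Retrieve recent HackerNews discussions mentioning a competitor,
    independent of whether they reached the current front page."""
    since = int(time.time()) - days * 86400
    params = {
        "query": competitor,
        "tags": "story",
        "numericFilters": f"created_at_i>{since}",
    }
    hits = requests.get(ALGOLIA_SEARCH, params=params, timeout=timeout).json()["hits"]
    return [
        {
            "channel": "hn_algolia",
            "title": h.get("title", ""),
            "url": h.get("url") or f"https://news.ycombinator.com/item?id={h['objectID']}",
            "points": h.get("points", 0),
        }
        for h in hits
    ]
```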
3.3 Cross-Validation Bonus
A finding that appears in two or more channels within the same run receives a +1 score bonus before the final scoring pass. This cross-validation bonus implements a lightweight ensemble signal: independent corroboration across channels indicates that the finding has passed multiple distinct attention-filtering mechanisms, increasing the probability that it represents genuine signal rather than noise in any individual channel. The bonus is applied after individual channel scoring and before the final deduplication pass. Deduplication uses URL identity as the primary key, with a secondary fuzzy-match step for findings that appear across channels with different canonical URLs but identical content.
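One way to combine the cross-validation bonus with URL-primary, fuzzy-title-secondary deduplication in a single pass is sketched below; the similarity threshold and field names are assumptions, not values from the actual loop.

```python
from difflib import SequenceMatcher


def apply_cross_validation_and_dedup(findings, fuzzy_threshold=0.9):
    """Group findings by URL (with a fuzzy title fallback), keep one entry
    per group, and award +1 to anything corroborated by 2+ channels."""
    groups = []  # each group: {"finding": dict, "channels": set of channel names}
    for f in findings:
        match = None
        for g in groups:
            same_url = f["url"] == g["finding"]["url"]
            similar_title = SequenceMatcher(
                None, f["title"].lower(), g["finding"]["title"].lower()
            ).ratio() >= fuzzy_threshold
            if same_url or similar_title:
                match = g
                break
        if match:
            match["channels"].add(f["channel"])
        else:
            groups.append({"finding": dict(f), "channels": {f["channel"]}})

    deduped = []
    for g in groups:
        item = g["finding"]
        item["cross_validated"] = len(g["channels"]) >= 2
        item["score_bonus"] = 1 if item["cross_validated"] else 0
        deduped.append(item)
    return deduped
```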
4. Local Inference as the Scoring Layer: Zero Marginal Cost per Run
Relevance scoring is the highest-frequency operation in the Autoresearch loop. A single run evaluates up to eighty raw findings — twenty from GitHub Trending, fifty from HackerNews, ten from arXiv — against the current research brief and product profiles. At nightly cadence, this produces approximately 560 scoring operations per week. Routing these operations through a cloud inference API would generate measurable per-run cost that compounds to significant monthly expense at scale.
The Autoresearch loop routes all scoring through a locally-hosted Ollama instance using gemma3:27b as the primary model. The scoring request is a structured prompt containing the research brief context, the finding metadata, and the scoring rubric from the relevant rules file. A fallback chain covers primary-model failure: qwen3:8b, followed by gemma4:latest. Fallback triggers on Ollama connection failure or a response that does not parse as valid JSON after two retries. The fallback chain is ordered by inference quality rather than speed: the primary model produces the most nuanced scoring against the research brief; the fallback models trade nuance for reliability when the primary is unavailable.
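A minimal sketch of the scoring call with the fallback chain follows. The /api/generate endpoint and the format: "json" option are standard Ollama API features; the prompt layout and the expected JSON response contract are assumptions for illustration.

```python
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL_CHAIN = ["gemma3:27b", "qwen3:8b", "gemma4:latest"]  # primary, then fallbacks


def score_finding(brief, rubric, finding, retries=2, timeout=120):
    """Score one finding against the research brief; fall back down the
    model chain on connection failure or unparseable JSON."""
    prompt = (
        f"Research brief:\n{brief}\n\nScoring rubric:\n{rubric}\n\n"
        f"Finding:\n{json.dumps(finding)}\n\n"
        'Respond with JSON: {"score": <1-10>, "reason": "<one sentence>"}'
    )
    for model in MODEL_CHAIN:
        for _ in range(retries):
            try:
                resp = requests.post(
                    OLLAMA_URL,
                    json={"model": model, "prompt": prompt,
                          "stream": False, "format": "json"},
                    timeout=timeout,
                )
                resp.raise_for_status()
                return json.loads(resp.json()["response"])
            except (requests.RequestException, json.JSONDecodeError, KeyError):
                continue  # retry this model, then fall through to the next one
    return None  # scoring unavailable; caller decides how to handle (see section 9)
```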
The operational implication of local inference is significant: the scoring layer operates at zero marginal cost per run. The hardware running the Ollama instance is already provisioned. The models are already downloaded. Running eighty scoring requests costs nothing beyond electricity. This cost structure makes nightly cadence economically equivalent to weekly cadence — the frequency decision is a product decision, not a cost decision.
The only operations in the Autoresearch loop that incur external cost are requirement creation events on score-9+ findings. These MCP tool calls hit a server endpoint and are by design rare: the scoring rubric is calibrated to produce 9+ scores only for findings that directly address a current priority in the research brief with high specificity. In practice, most runs produce zero to two requirement-creation events.
5. The Score-to-Action Escalation Ladder: From Informational to Auto-Requirement
The Autoresearch loop applies a four-level escalation ladder that maps numeric scores to specific, automated actions. The ladder is deterministic: the score determines the action without agent judgment or human approval (a minimal sketch of this mapping follows the table).
| Score | Classification | Actions Taken |
|---|---|---|
| 9–10 | High-priority signal | Auto-create Proposed requirement via MCP + immediate Telegram notification |
| 7–8 | Planning-relevant signal | Post to engineering findings log + Telegram alert flagged for sprint planning |
| 6 | Informational | Post to engineering findings log only; no notification |
| < 6 | Below threshold | Discard; deduplication log still updated |
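Because the ladder is deterministic, it can be expressed as a small pure function, as sketched below; the action labels are illustrative, not the loop's actual identifiers.

```python
def actions_for_score(score):
    """Return the deterministic action set for a numeric relevance score."""
    if score >= 9:
        return ["create_proposed_requirement", "notify_telegram_immediate",
                "update_dedup_log"]
    if score >= 7:
        return ["append_findings_log", "notify_telegram_sprint_planning",
                "update_dedup_log"]
    if score == 6:
        return ["append_findings_log", "update_dedup_log"]  # informational only
    return ["update_dedup_log"]  # below threshold: discard, still record for dedup
```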
6. The Research Brief: Human-Editable Focus Without Code Changes
The research brief is a Markdown file maintained by the engineering or product team that defines the scoring context for the current research period. It contains the focus areas, keywords, and priority signals that the scoring LLM uses to evaluate relevance. Two scoring rules files, relevance-scoring-rules.md for engineering findings and product-scoring-rules.md for product intelligence, define the rubric that the LLM applies to the brief context and the raw finding. These files are less frequently updated than the brief: they represent stable organizational judgments about what constitutes a relevant finding in each category, while the brief represents current tactical focus. Separating the rubric from the context prevents frequent brief updates from accidentally eroding the scoring rubric’s integrity.
7. Silence Is Not Success: Run Summary as a Mandatory Output
The Autoresearch loop mandates a run summary at the end of every execution, regardless of finding count, scoring outcomes, or channel availability. This mandate is architecturally significant. A zero-findings run has two distinct explanations: the channels contained no high-signal items this cycle, or the channels failed to return data. A loop that posts findings only when findings exist cannot distinguish between these explanations. An operator monitoring the findings log sees no new entries and cannot determine whether the loop ran successfully with nothing to report, or whether it failed silently. The run summary resolves this ambiguity. Phase 7 posts a structured summary to the configured notification channel containing (a minimal sketch of this payload appears after the list):
- Run start and end timestamps
- Channels successfully fetched and channels that failed
- Total findings fetched per channel
- Findings scored by threshold bracket (9+, 7–8, 6, discarded)
- Requirements created (count)
- Budget status (completed within budget / budget exceeded, posted partial results)
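A minimal sketch of what this summary payload and its rendered notification text might look like, with field names assumed for illustration:

```python
from dataclasses import dataclass, field


@dataclass
class RunSummary:
    """Mandatory end-of-run summary, posted even when nothing was found."""
    started_at: str
    finished_at: str
    channels_ok: list[str] = field(default_factory=list)
    channels_failed: list[str] = field(default_factory=list)
    fetched_per_channel: dict[str, int] = field(default_factory=dict)
    scored_9_plus: int = 0
    scored_7_8: int = 0
    scored_6: int = 0
    discarded: int = 0
    requirements_created: int = 0
    budget_exceeded: bool = False

    def to_message(self):
        """Render the summary as plain text suitable for a Telegram post."""
        budget = ("budget exceeded, posted partial results"
                  if self.budget_exceeded else "completed within budget")
        return (
            f"Autoresearch run {self.started_at} -> {self.finished_at}\n"
            f"Channels OK: {', '.join(self.channels_ok) or 'none'}\n"
            f"Channels failed: {', '.join(self.channels_failed) or 'none'}\n"
            f"Fetched per channel: {self.fetched_per_channel}\n"
            f"Scores: 9+={self.scored_9_plus}, 7-8={self.scored_7_8}, "
            f"6={self.scored_6}, discarded={self.discarded}\n"
            f"Requirements created: {self.requirements_created}\n"
            f"Budget: {budget}"
        )
```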
The run summary Telegram notification serves a secondary function beyond monitoring: it establishes a daily baseline for what “normal” looks like. After two weeks of nightly operation, an operator looking at the summary stream can identify patterns — which channels fail most frequently, which days consistently produce more findings, whether the current research brief is producing useful signal or should be tightened. The summary is not just a completion acknowledgment; it is longitudinal operational data.
8. Secret Management: Bootstrap Tokens, Runtime Fetch, Revoke After Job
The Autoresearch loop interacts with authenticated services: the project management system for requirement creation, the notification API for Telegram delivery, and any authenticated source channels. Secrets required for these interactions must be available to the running agent without being stored in YAML configuration or committed to the repository. The loop uses a Vault AppRole bootstrap pattern (a minimal sketch of the fetch and revoke calls follows the four steps below):
Bootstrap credentials injected at CI level
Two values — a role identifier and a secret identifier — are provided to the CI job as environment variables from the secrets store. These bootstrap credentials are not themselves functional API tokens; they are credentials that authorize the retrieval of functional tokens from Vault.
Runtime token fetch at loop initialization
Phase 1 of the loop executes a Vault API call using the bootstrap credentials to retrieve the live tokens required for the current run. Tokens are fetched with the minimum TTL required to complete the ninety-minute run, reducing the exposure window of any retrieved credential.
Immediate masking in logs
Retrieved tokens are immediately registered with the CI log masking mechanism before any further use. Any subsequent logging that would inadvertently include a token value produces masked output. This step runs synchronously before the token is passed to any downstream function.
Token revocation after job completion
After the run summary is posted and the loop exits, the tokens retrieved during phase 1 are explicitly revoked via the Vault API. A token that has been revoked cannot be used even if captured from an intermediate log or artifact. Revocation is a hard requirement, not a best-effort operation — the loop treats revocation failure as a security event requiring notification.
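A minimal sketch of the runtime fetch and revocation steps against the Vault HTTP API is shown below. The AppRole login and token revoke-self endpoints are standard Vault API paths; the secret path, the environment variable names, and the ::add-mask:: line (a GitHub Actions masking convention) are assumptions that would differ by CI platform and Vault layout.

```python
import os
import requests

VAULT_ADDR = os.environ["VAULT_ADDR"]  # e.g. https://vault.internal:8200 (assumed)


def fetch_run_tokens():
    """Exchange the CI-injected bootstrap credentials for a short-lived
    Vault token, read the run secrets, and return both."""
    login = requests.post(
        f"{VAULT_ADDR}/v1/auth/approle/login",
        json={
            "role_id": os.environ["VAULT_ROLE_ID"],      # bootstrap credential 1
            "secret_id": os.environ["VAULT_SECRET_ID"],  # bootstrap credential 2
        },
        timeout=10,
    )
    login.raise_for_status()
    vault_token = login.json()["auth"]["client_token"]

    # Register the token with the CI log masking mechanism before any further
    # use; the exact mechanism is CI-specific (GitHub Actions shown here).
    print(f"::add-mask::{vault_token}")

    secrets = requests.get(
        f"{VAULT_ADDR}/v1/secret/data/autoresearch",     # hypothetical KV v2 path
        headers={"X-Vault-Token": vault_token},
        timeout=10,
    ).json()["data"]["data"]
    return vault_token, secrets


def revoke_run_token(vault_token):
    """Revoke the run token after the summary is posted; treat failure here
    as a security event, not a warning."""
    resp = requests.post(
        f"{VAULT_ADDR}/v1/auth/token/revoke-self",
        headers={"X-Vault-Token": vault_token},
        timeout=10,
    )
    resp.raise_for_status()
```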
9. Implementation Constraints
Local inference model availability is a single point of failure for the scoring phase. If the Ollama instance is unavailable when phases 4 and 5 execute, the fallback chain must be exhausted before the loop can proceed. If all fallback models are also unavailable, the loop has two options: skip scoring and post all fetched findings as unscored, or abort the scoring phase and jump to the summary. The operationally correct choice depends on the volume of findings — posting hundreds of unscored findings is less useful than a summary explaining that scoring was unavailable. The implementation should define this threshold explicitly rather than leaving it as a runtime judgment.
Channel format changes break fetches silently. GitHub Trending and competitor changelog pages are scraped rather than retrieved from stable APIs. When the source page structure changes — a common occurrence during platform redesigns — the scraper returns empty results rather than an error. An operator reviewing the run summary sees zero findings from the affected channel, which is indistinguishable from a valid zero-finding result until the pattern repeats across multiple runs. Implement a minimum-findings threshold alert per channel: if a channel that has consistently returned ten or more items per run returns zero for three consecutive runs, generate an alert rather than treating the result as a valid data point (a minimal sketch of this check appears at the end of this section).
The scoring rubric must be recalibrated periodically. A rubric written against the finding landscape of six months ago may produce systematically miscalibrated scores against today’s landscape. Topics that were novel and high-signal six months ago may now be commodity — and the rubric’s weighting of those topics will elevate what is now routine activity to apparent significance. Schedule a rubric review every eight to twelve weeks, comparing auto-created requirements against the finding scores that generated them. Requirements that proved low-value when reviewed by the engineering team indicate rubric miscalibration at the 9–10 threshold.
Deduplication log growth requires periodic pruning. The deduplication log retains URLs from the past thirty days. Without pruning, the log grows without bound and introduces lookup latency into the scoring phase. A nightly pruning job — run at the end of the POST phase — should remove entries older than the configured retention window. This is a maintenance task, not a business logic concern, but it must be implemented before the log grows large enough to affect run-time performance.
Budget overrun handling requires explicit phase state tracking. When the ninety-minute budget expires mid-phase, the loop must jump to the POST phase. This jump is only safe if the loop maintains explicit state for each phase: which items have been processed, which have been scored, and which have already been posted. A loop that tracks phase state as a side effect of execution order — rather than as explicit persistent state — may re-post findings already published in a partial previous run, or skip findings that were fetched but not yet scored. Phase state must be written to a persistent store (not held in memory) so that budget-overrun recovery is deterministic.
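A minimal sketch of the minimum-findings threshold check described above; the baseline average, zero-run count, and history shape are assumptions for illustration.

```python
def channel_health_alerts(history, min_expected=10, zero_runs=3):
    """Flag channels that normally return findings but have gone silent.

    `history` maps channel name to per-run finding counts, most recent last,
    e.g. {"github_trending": [18, 20, 17, 0, 0, 0]}.
    """
    alerts = []
    for channel, counts in history.items():
        if len(counts) <= zero_runs:
            continue  # not enough history to establish a baseline
        recent, baseline = counts[-zero_runs:], counts[:-zero_runs]
        average = sum(baseline) / len(baseline)
        if average >= min_expected and all(c == 0 for c in recent):
            alerts.append(
                f"{channel}: zero findings for {zero_runs} consecutive runs "
                f"after averaging {average:.0f}/run; probable scraper or feed breakage"
            )
    return alerts
```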
10. Recommendations
- Calibrate the scoring rubric before enabling auto-requirement creation. Run the loop in dry-run mode — scoring and logging findings without posting or creating requirements — for at least two weeks before enabling the escalation ladder. Review the findings that scored 9+ during the dry-run period. If more than 5% of total scored findings reach the 9–10 bracket, tighten the rubric before activating auto-creation. A requirement signal that fires frequently loses its escalation value.
- Treat channel failure patterns as a maintenance backlog, not acceptable variance. After the first month of operation, review the run summaries and compile a channel failure frequency table. Channels that fail more than 20% of runs represent structural fragility — scraper breakage, rate limits, or RSS feed deprecation — that should be addressed as a maintenance task. Schedule channel health reviews on the same cadence as rubric recalibration.
- Assign explicit ownership of the research brief to a named team member. A brief that is owned by everyone is updated by no one. Designate a brief owner who is responsible for reviewing and updating focus areas weekly. The brief’s commit history should reflect weekly updates; a brief that has not been updated in three weeks is a brief that no longer reflects current organizational priorities.
- Implement the minimum-findings threshold alert before deploying to production. The silent-zero failure mode — a scraper returning empty results due to a page format change — is the most operationally deceptive failure in the loop. It produces summaries that look like valid data points rather than failure indicators. This alert is not optional infrastructure; it is a prerequisite for trusting the system’s zero-findings reports.
- Test the token revocation path before your first production run. Revocation is the hardest part of the secret management lifecycle to test because it requires deliberately inducing a post-run state. Run a dry-run that completes the full initialization, token fetch, and revocation cycle before the loop handles any real data. Confirm that the revocation API returns success and that a subsequent attempt to use the revoked token produces the expected rejection. Do not discover that revocation is broken during an actual security incident.
- Log scoring decisions at the finding level, not just the aggregate level. The run summary reports score distribution counts. The finding-level log should record each finding’s raw score, the cross-validation status, the one-sentence reason from the LLM, and the action taken. This log is the primary tool for rubric calibration — without per-finding scoring rationale, distinguishing a miscalibrated rubric from a genuinely low-signal week is guesswork (a minimal sketch of such a record follows this list).
- Version the research brief and scoring rules files alongside the loop itself. A change to the research brief or scoring rules file produces a behavior change in the loop equivalent to a code change. Brief and rules file history should be maintained in the same repository as the loop implementation, with commit messages that describe what changed and why. This version history is essential for understanding why a particular run produced unexpectedly high or low finding counts.
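A minimal sketch of a finding-level scoring record written as JSON Lines, with field names assumed for illustration; each call appends one record for later rubric-calibration review.

```python
import json
from datetime import datetime, timezone


def log_scoring_decision(path, finding, score, cross_validated, reason, action):
    """Append one finding-level scoring record (JSON Lines) for rubric review."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "title": finding.get("title"),
        "url": finding.get("url"),
        "channel": finding.get("channel"),
        "score": score,
        "cross_validated": cross_validated,
        "reason": reason,   # one-sentence rationale returned by the scoring LLM
        "action": action,   # e.g. "requirement_created", "findings_log", "discarded"
    }
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
```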
Conclusion and Forward Outlook
The Autoresearch loop demonstrates that continuous intelligence coverage at nightly cadence is achievable without significant operational cost when the most expensive component — relevance scoring — is moved to local inference. The design choices documented here — eight sequential phases, a hard budget with partial-result publication, mandatory run summaries, conservative auto-escalation, and bootstrap-fetch-mask-revoke secret management — are not arbitrary; they represent the minimum viable set of structural commitments required to make an autonomous research loop operationally trustworthy rather than merely technically functional.
The pattern’s most transferable insight is the separation between the research brief and the execution loop. A research agent that cannot be redirected without code changes is a rigid tool; one that reads its focus from a committed, human-editable file is an organizational asset. The loop runs the same eight phases every night. The brief determines what those phases look for. This decoupling is what makes continuous intelligence coverage sustainable over months and quarters, as organizational priorities shift and the competitive landscape evolves.
As local inference models improve in quality at the 7B–27B parameter range — a trajectory that has been consistent and rapid — the quality ceiling of local-inference scoring will rise without any change to the loop’s architecture or cost structure. Organizations that establish local-inference scoring pipelines now will inherit those quality improvements automatically, while organizations that deferred deployment waiting for “good enough” local models will find themselves building infrastructure that their competitors have been running for a year. The organizations best positioned for that transition are the ones that have already deployed the loop, operated it through its calibration period, and built the operational discipline — brief ownership, channel maintenance, rubric review cadence — that makes the technical infrastructure meaningful.
All content represents personal learning from personal projects. Code examples are sanitized and generalized. No proprietary information is shared. Opinions are my own and do not reflect my employer’s views.