Executive Summary
Engineering organizations operating in fast-moving domains — LLM tooling, AI infrastructure, competitive SaaS markets — face a structural intelligence deficit: the rate at which relevant developments emerge exceeds the rate at which manual review processes can surface, evaluate, and route them. Weekly human review cycles are inconsistent in execution, expensive in attention, and chronically behind the information frontier. The Autoresearch skill addresses this deficit through an eight-phase autonomous execution loop running nightly on a scheduled cron, with a hard ninety-minute budget and no human approval required at any phase. Relevance scoring is performed entirely by a locally-hosted LLM, eliminating per-run API cost for the highest-frequency operation in the loop. Score-nine-and-above findings auto-create formal requirements and trigger immediate notification, closing the gap between intelligence surfacing and organizational response without human mediation. This paper documents the architecture, scoring model, escalation ladder, and operational constraints of the Autoresearch loop, with recommendations for organizations seeking continuous intelligence coverage at sustainable cost.
Key Findings
- Manual research review processes cannot maintain research currency at the pace that modern AI and SaaS domains evolve. The information half-life of LLM tooling changes, competitor feature releases, and relevant research is days to weeks — far shorter than the typical weekly review cadence.
- Local inference for relevance scoring eliminates the primary per-run cost driver. Routing all scoring through an on-premise Ollama instance rather than a cloud API converts the highest-frequency operation in the loop from a per-token expense into a sunk infrastructure cost, enabling nightly cadence without cost compounding.
- The cross-validation bonus — awarding a point to findings that appear in two or more channels — is a lightweight ensemble signal that surfaces consensus without requiring complex aggregation. A finding corroborated across GitHub Trending and HackerNews carries meaningfully more signal than one appearing in either channel alone.
- Budget enforcement via “finish-current-item, then post” ensures partial results are always published. A run that exceeds the ninety-minute budget but produces and posts twelve scored findings is operationally superior to a run that times out silently with nothing posted.
- A run summary is a mandatory output — not a conditional one. Zero high-signal findings is a valid research outcome. Zero run summary is a monitoring blind spot. The distinction is critical for operational reliability.
- The research brief decouples organizational focus from execution mechanics. A human-editable file that defines scoring context changes what the agent prioritizes without touching the execution loop — enabling weekly focus updates without code changes.
1. The Research Currency Problem: Manual Review Cannot Keep Pace With Autonomous Execution Speed
Research currency — the degree to which an engineering organization’s awareness of its domain accurately reflects the current state of that domain — is a prerequisite for competitive positioning. Organizations that identify relevant developments late pay a lag tax: competitor features observed after implementation rather than during planning, research papers absorbed after the methodology has become industry standard, tooling improvements discovered after the organization has already built a workaround.
The traditional response to this problem is scheduled human review: a weekly or biweekly session in which a team member surveys selected sources, extracts relevant items, and routes them to appropriate stakeholders. This approach has four structural weaknesses that compound at scale.
Consistency degradation. Human review sessions are the first casualty of execution pressure. When sprint commitments intensify, the research review is skipped or abbreviated. The organizations that most need current intelligence — those in the highest-velocity competitive environments — are also the ones most likely to deprioritize the review under load.
Channel coverage limits. A human reviewer can meaningfully engage with a bounded set of sources per session. GitHub Trending, HackerNews front page, relevant subreddits, competitor changelogs, and recent arXiv preprints collectively represent more content than a thorough weekly session can evaluate with consistent quality. Practical human review covers a subset of the available signal space.
Routing latency. Even when a relevant finding is surfaced, the path from identification to organizational response involves human judgment at each step: is this worth flagging? who should see it? does it warrant a formal requirement or informal discussion? Each judgment step adds latency between signal and response.
Scoring inconsistency. Without a defined relevance rubric, human review produces inconsistent triage. The same finding evaluated by two reviewers on different weeks may receive different prioritization — not because the finding changed, but because the reviewers’ implicit weighting criteria differed.
Autonomous continuous execution resolves each of these weaknesses. A nightly loop runs regardless of sprint pressure. It covers a defined channel set exhaustively. It routes findings by score without human intermediation. It applies a consistent scoring rubric on every run. The constraint is not willingness — it is the cost of running a capable LLM for relevance scoring at nightly cadence. Local inference eliminates that constraint.
2. Architecture: Eight Phases, Ninety Minutes, Zero Human Approval
The Autoresearch loop executes eight sequential phases within a hard ninety-minute budget. The phases are not parallelized; each phase depends on the outputs of its predecessors. The agent decides, posts, and exits without requesting human continuation at any point.
| Phase | Name | Budget | Description |
|---|---|---|---|
| 1 | INITIALIZE | 2 min | Load focus configuration, product profiles, deduplication logs |
| 2 | ENGINEERING SEARCH | 25 min | Fetch from GitHub Trending, HackerNews, arXiv, release channels |
| 3 | PRODUCT INTELLIGENCE SEARCH | 25 min | Fetch from competitor changelogs, Reddit, review platforms |
| 4 | ENGINEERING SCORE | 10 min | Score engineering findings via local LLM; deduplicate |
| 5 | PRODUCT SCORE | 10 min | Score product findings via local LLM; deduplicate |
| 6 | POST | 15 min | Write findings, create requirements, verify escalation counts |
| 7 | SUMMARY | 3 min | Post run summary regardless of finding count |
| 8 | AUTONOMOUS DECISION | — | Handle empty results, budget overrun, or channel failure |
Phase sequencing matters for correctness. Engineering and product search phases run before scoring so that the full corpus of raw findings is assembled before deduplication is applied. Running scoring on each finding as it is fetched would produce a deduplication set that grows during the fetch phase, creating ordering-dependent outcomes where the same finding would be deduped or not depending on which channel was fetched first.
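A minimal sketch of how the budget discipline might be enforced is shown below. The phase interface, state shape, and function names are illustrative assumptions rather than the skill's actual code; the point is the ordering: finish the item in flight, check the clock, and on overrun jump straight to posting partial results and the mandatory summary.

```python
import time

RUN_BUDGET_SECONDS = 90 * 60  # hard ninety-minute budget


def run_loop(phases, post_phase, summary_phase):
    """Run phases in order; on budget overrun, finish the current item,
    then jump straight to POST and SUMMARY with whatever has been produced."""
    started = time.monotonic()
    state = {"findings": [], "overrun": False}

    def over_budget():
        return time.monotonic() - started >= RUN_BUDGET_SECONDS

    for phase in phases:
        for item in phase.items(state):
            phase.process(item, state)   # finish the current item first...
            if over_budget():            # ...then check the budget
                state["overrun"] = True
                break
        if state["overrun"]:
            break

    post_phase(state)       # partial results are still published
    summary_phase(state)    # the run summary is mandatory, overrun or not
    return state
```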
3. Source Coverage: Engineering Channels, Product Intelligence, and Cross-Validation
The Autoresearch loop draws from two distinct source categories, each with its own channel set, scoring ruleset, and output destination.
3.1 Engineering Channels
The engineering search phase covers four primary sources:
GitHub Trending (top 20 repositories). GitHub Trending surfaces repositories gaining sustained attention within a time window. The top 20 provides sufficient coverage of high-momentum projects without capturing the long tail of marginal activity. Repository metadata — language, description, star velocity — provides the scoring LLM with sufficient signal to evaluate relevance against the current research brief.
HackerNews (top 50 items). The HackerNews front page aggregates community-validated technical content. The top-50 threshold captures the items with demonstrated engagement without over-indexing on velocity-gaming tactics that occasionally affect lower-ranked positions.
arXiv (top 10 papers). Research preprints from arXiv represent the leading edge of methodology before peer review and before industry adoption. Ten papers per run provides meaningful coverage of daily output in relevant categories (cs.AI, cs.LG, cs.SE) without overwhelming the scoring budget.
Release channels. Major platform release announcements — LLM providers, foundational infrastructure, relevant open-source projects — are fetched from official release feeds. This channel is the highest-precision source in the engineering set: a release announcement from a directly relevant platform is almost always worth scoring.
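As an illustration of one engineering fetch step, the sketch below pulls the HackerNews top 50 through the public HackerNews Firebase API. The finding dictionary shape is an assumption made for this article, not the skill's actual schema.

```python
import requests

HN_TOP = "https://hacker-news.firebaseio.com/v0/topstories.json"
HN_ITEM = "https://hacker-news.firebaseio.com/v0/item/{id}.json"


def fetch_hackernews(limit=50, timeout=10):
    """Fetch the top `limit` HackerNews stories as raw findings for scoring."""
    ids = requests.get(HN_TOP, timeout=timeout).json()[:limit]
    findings = []
    for item_id in ids:
        item = requests.get(HN_ITEM.format(id=item_id), timeout=timeout).json()
        if not item or item.get("type") != "story":
            continue  # skip jobs, polls, and deleted items
        findings.append({
            "channel": "hackernews",
            "title": item.get("title", ""),
            "url": item.get("url") or f"https://news.ycombinator.com/item?id={item_id}",
            "points": item.get("score", 0),
        })
    return findings
```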
3.2 Product Intelligence Channels
The product intelligence search phase covers competitor and market signal sources, applied per-profile against each enabled product:
Competitor changelog feeds (RSS and scrape). Official changelogs represent the most authoritative signal for competitor feature activity. Where RSS feeds are available, they are preferred for structural reliability. Where scraping is required, the phase applies basic structural heuristics to extract dated changelog entries.
Reddit sentiment queries. Targeted queries — constructed from the product profile’s competitor list and sentiment terms — surface organic user frustration that does not appear in official channels. A query pattern such as “[Competitor] frustrating” or “[Competitor] broken” retrieves recent posts expressing pain points. These findings score lower on the escalation ladder but provide leading indicators of competitor weakness before those weaknesses appear in formal review channels.
HackerNews Algolia API. The HackerNews Algolia search API enables targeted retrieval of discussion threads mentioning specific competitors or products. Unlike the front-page fetch in the engineering phase, this query is product-profile-specific and retrieves historical discussion depth not limited to the current front page.
Review platforms (1–3 star filters). Low-star reviews on software review platforms surface structured negative feedback about competitors. The 1–3 star filter concentrates the query on dissatisfied users who are most likely to describe specific pain points rather than general impressions.
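The competitor-specific HackerNews query might look like the sketch below. The endpoint and parameters belong to the public HN Algolia search API; the result shape and the seven-day window are illustrative assumptions.

```python
import time
import requests

ALGOLIA_SEARCH = "https://hn.algolia.com/api/v1/search"


def search_competitor_mentions(competitor, days=7, timeout=10):
    """Retrieve recent HackerNews discussions mentioning a competitor,
    independent of whether they reached the current front page."""
    since = int(time.time()) - days * 86400
    params = {
        "query": competitor,
        "tags": "story",
        "numericFilters": f"created_at_i>{since}",
    }
    hits = requests.get(ALGOLIA_SEARCH, params=params, timeout=timeout).json()["hits"]
    return [
        {
            "channel": "hn_algolia",
            "title": h.get("title", ""),
            "url": h.get("url") or f"https://news.ycombinator.com/item?id={h['objectID']}",
            "points": h.get("points", 0),
        }
        for h in hits
    ]
```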
3.3 Cross-Validation Bonus
A finding that appears in two or more channels within the same run receives a +1 score bonus before the final scoring pass. This cross-validation bonus implements a lightweight ensemble signal: independent corroboration across channels indicates that the finding has passed multiple distinct attention-filtering mechanisms, increasing the probability that it represents genuine signal rather than noise in any individual channel. The bonus is applied after individual channel scoring and before the final deduplication pass. Deduplication uses URL identity as the primary key, with a secondary fuzzy-match step for findings that appear across channels with different canonical URLs but identical content.
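One way to combine the cross-validation bonus with URL-primary, fuzzy-title-secondary deduplication in a single pass is sketched below; the similarity threshold and field names are assumptions, not values from the actual loop.

```python
from difflib import SequenceMatcher


def apply_cross_validation_and_dedup(findings, fuzzy_threshold=0.9):
    """Group findings by URL (with a fuzzy title fallback), keep one entry
    per group, and award +1 to anything corroborated by 2+ channels."""
    groups = []  # each group: {"finding": dict, "channels": set of channel names}
    for f in findings:
        match = None
        for g in groups:
            same_url = f["url"] == g["finding"]["url"]
            similar_title = SequenceMatcher(
                None, f["title"].lower(), g["finding"]["title"].lower()
            ).ratio() >= fuzzy_threshold
            if same_url or similar_title:
                match = g
                break
        if match:
            match["channels"].add(f["channel"])
        else:
            groups.append({"finding": dict(f), "channels": {f["channel"]}})

    deduped = []
    for g in groups:
        item = g["finding"]
        item["cross_validated"] = len(g["channels"]) >= 2
        item["score_bonus"] = 1 if item["cross_validated"] else 0
        deduped.append(item)
    return deduped
```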
4. Local Inference as the Scoring Layer: Zero Marginal Cost per Run
Relevance scoring is the highest-frequency operation in the Autoresearch loop. A single run evaluates up to eighty raw findings — twenty from GitHub Trending, fifty from HackerNews, ten from arXiv — against the current research brief and product profiles. At nightly cadence, this produces approximately 560 scoring operations per week. Routing these operations through a cloud inference API would generate measurable per-run cost that compounds to significant monthly expense at scale.
The Autoresearch loop routes all scoring through a locally-hosted Ollama instance using gemma3:27b as the primary model. The scoring request is a structured prompt containing the research brief context, the finding metadata, and the scoring rubric from the relevant rules file. A fallback chain covers primary-model failure: qwen3:8b, followed by gemma4:latest. Fallback triggers on Ollama connection failure or a response that does not parse as valid JSON after two retries. The fallback chain is ordered by inference quality rather than speed: the primary model produces the most nuanced scoring against the research brief; the fallback models trade nuance for reliability when the primary is unavailable.
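A minimal sketch of the scoring call with the fallback chain follows. The /api/generate endpoint and the format: "json" option are standard Ollama API features; the prompt layout and the expected JSON response contract are assumptions for illustration.

```python
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL_CHAIN = ["gemma3:27b", "qwen3:8b", "gemma4:latest"]  # primary, then fallbacks


def score_finding(brief, rubric, finding, retries=2, timeout=120):
    """Score one finding against the research brief; fall back down the
    model chain on connection failure or unparseable JSON."""
    prompt = (
        f"Research brief:\n{brief}\n\nScoring rubric:\n{rubric}\n\n"
        f"Finding:\n{json.dumps(finding)}\n\n"
        'Respond with JSON: {"score": <1-10>, "reason": "<one sentence>"}'
    )
    for model in MODEL_CHAIN:
        for _ in range(retries):
            try:
                resp = requests.post(
                    OLLAMA_URL,
                    json={"model": model, "prompt": prompt,
                          "stream": False, "format": "json"},
                    timeout=timeout,
                )
                resp.raise_for_status()
                return json.loads(resp.json()["response"])
            except (requests.RequestException, json.JSONDecodeError, KeyError):
                continue  # retry this model, then fall through to the next one
    return None  # scoring unavailable; caller decides how to handle (see section 9)
```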
The operational implication of local inference is significant: the scoring layer operates at zero marginal cost per run. The hardware running the Ollama instance is already provisioned. The models are already downloaded. Running eighty scoring requests costs nothing beyond electricity. This cost structure makes nightly cadence economically equivalent to weekly cadence — the frequency decision is a product decision, not a cost decision.
The only operations in the Autoresearch loop that incur external cost are requirement creation events on score-9+ findings. These MCP tool calls hit a server endpoint and are by design rare: the scoring rubric is calibrated to produce 9+ scores only for findings that directly address a current priority in the research brief with high specificity. In practice, most runs produce zero to two requirement-creation events.
5. The Score-to-Action Escalation Ladder: From Informational to Auto-Requirement
The Autoresearch loop applies a four-level escalation ladder that maps numeric scores to specific, automated actions. The ladder is deterministic: the score determines the action without agent judgment or human approval (a minimal sketch of this mapping follows the table).
| Score | Classification | Actions Taken |
|---|---|---|
| 9–10 | High-priority signal | Auto-create Proposed requirement via MCP + immediate Telegram notification |
| 7–8 | Planning-relevant signal | Post to engineering findings log + Telegram alert flagged for sprint planning |
| 6 | Informational | Post to engineering findings log only; no notification |
| < 6 | Below threshold | Discard; deduplication log still updated |
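Because the ladder is deterministic, it can be expressed as a small pure function, as sketched below; the action labels are illustrative, not the loop's actual identifiers.

```python
def actions_for_score(score):
    """Return the deterministic action set for a numeric relevance score."""
    if score >= 9:
        return ["create_proposed_requirement", "notify_telegram_immediate",
                "update_dedup_log"]
    if score >= 7:
        return ["append_findings_log", "notify_telegram_sprint_planning",
                "update_dedup_log"]
    if score == 6:
        return ["append_findings_log", "update_dedup_log"]  # informational only
    return ["update_dedup_log"]  # below threshold: discard, still record for dedup
```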
6. The Research Brief: Human-Editable Focus Without Code Changes
The research brief is a Markdown file maintained by the engineering or product team that defines the scoring context for the current research period. It contains the focus areas, keywords, and priority signals that the scoring LLM uses to evaluate relevance. Two scoring rules files, relevance-scoring-rules.md for engineering findings and product-scoring-rules.md for product intelligence, define the rubric that the LLM applies to the brief context and the raw finding. These files are less frequently updated than the brief: they represent stable organizational judgments about what constitutes a relevant finding in each category, while the brief represents current tactical focus. Separating the rubric from the context prevents frequent brief updates from accidentally eroding the scoring rubric’s integrity.
7. Silence Is Not Success: Run Summary as a Mandatory Output
The Autoresearch loop mandates a run summary at the end of every execution, regardless of finding count, scoring outcomes, or channel availability. This mandate is architecturally significant. A zero-findings run has two distinct explanations: the channels contained no high-signal items this cycle, or the channels failed to return data. A loop that posts findings only when findings exist cannot distinguish between these explanations. An operator monitoring the findings log sees no new entries and cannot determine whether the loop ran successfully with nothing to report, or whether it failed silently. The run summary resolves this ambiguity. Phase 7 posts a structured summary to the configured notification channel containing (a minimal sketch of this payload appears after the list):
- Run start and end timestamps
- Channels successfully fetched and channels that failed
- Total findings fetched per channel
- Findings scored by threshold bracket (9+, 7–8, 6, discarded)
- Requirements created (count)
- Budget status (completed within budget / budget exceeded, posted partial results)
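A minimal sketch of what this summary payload and its rendered notification text might look like, with field names assumed for illustration:

```python
from dataclasses import dataclass, field


@dataclass
class RunSummary:
    """Mandatory end-of-run summary, posted even when nothing was found."""
    started_at: str
    finished_at: str
    channels_ok: list[str] = field(default_factory=list)
    channels_failed: list[str] = field(default_factory=list)
    fetched_per_channel: dict[str, int] = field(default_factory=dict)
    scored_9_plus: int = 0
    scored_7_8: int = 0
    scored_6: int = 0
    discarded: int = 0
    requirements_created: int = 0
    budget_exceeded: bool = False

    def to_message(self):
        """Render the summary as plain text suitable for a Telegram post."""
        budget = ("budget exceeded, posted partial results"
                  if self.budget_exceeded else "completed within budget")
        return (
            f"Autoresearch run {self.started_at} -> {self.finished_at}\n"
            f"Channels OK: {', '.join(self.channels_ok) or 'none'}\n"
            f"Channels failed: {', '.join(self.channels_failed) or 'none'}\n"
            f"Fetched per channel: {self.fetched_per_channel}\n"
            f"Scores: 9+={self.scored_9_plus}, 7-8={self.scored_7_8}, "
            f"6={self.scored_6}, discarded={self.discarded}\n"
            f"Requirements created: {self.requirements_created}\n"
            f"Budget: {budget}"
        )
```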
The run summary Telegram notification serves a secondary function beyond monitoring: it establishes a daily baseline for what “normal” looks like. After two weeks of nightly operation, an operator looking at the summary stream can identify patterns — which channels fail most frequently, which days consistently produce more findings, whether the current research brief is producing useful signal or should be tightened. The summary is not just a completion acknowledgment; it is longitudinal operational data.
8. Secret Management: Bootstrap Tokens, Runtime Fetch, Revoke After Job
The Autoresearch loop interacts with authenticated services: the project management system for requirement creation, the notification API for Telegram delivery, and any authenticated source channels. Secrets required for these interactions must be available to the running agent without being stored in YAML configuration or committed to the repository. The loop uses a Vault AppRole bootstrap pattern (a minimal sketch of the fetch and revoke calls follows the four steps below):
Bootstrap credentials injected at CI level
Two values — a role identifier and a secret identifier — are provided to the CI job as environment variables from the secrets store. These bootstrap credentials are not themselves functional API tokens; they are credentials that authorize the retrieval of functional tokens from Vault.
Runtime token fetch at loop initialization
Phase 1 of the loop executes a Vault API call using the bootstrap credentials to retrieve the live tokens required for the current run. Tokens are fetched with the minimum TTL required to complete the ninety-minute run, reducing the exposure window of any retrieved credential.
Immediate masking in logs
Retrieved tokens are immediately registered with the CI log masking mechanism before any further use. Any subsequent logging that would inadvertently include a token value produces masked output. This step runs synchronously before the token is passed to any downstream function.
Token revocation after job completion
After the run summary is posted and the loop exits, the tokens retrieved during phase 1 are explicitly revoked via the Vault API. A token that has been revoked cannot be used even if captured from an intermediate log or artifact. Revocation is a hard requirement, not a best-effort operation — the loop treats revocation failure as a security event requiring notification.
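A minimal sketch of the runtime fetch and revocation steps against the Vault HTTP API is shown below. The AppRole login and token revoke-self endpoints are standard Vault API paths; the secret path, the environment variable names, and the ::add-mask:: line (a GitHub Actions masking convention) are assumptions that would differ by CI platform and Vault layout.

```python
import os
import requests

VAULT_ADDR = os.environ["VAULT_ADDR"]  # e.g. https://vault.internal:8200 (assumed)


def fetch_run_tokens():
    """Exchange the CI-injected bootstrap credentials for a short-lived
    Vault token, read the run secrets, and return both."""
    login = requests.post(
        f"{VAULT_ADDR}/v1/auth/approle/login",
        json={
            "role_id": os.environ["VAULT_ROLE_ID"],      # bootstrap credential 1
            "secret_id": os.environ["VAULT_SECRET_ID"],  # bootstrap credential 2
        },
        timeout=10,
    )
    login.raise_for_status()
    vault_token = login.json()["auth"]["client_token"]

    # Register the token with the CI log masking mechanism before any further
    # use; the exact mechanism is CI-specific (GitHub Actions shown here).
    print(f"::add-mask::{vault_token}")

    secrets = requests.get(
        f"{VAULT_ADDR}/v1/secret/data/autoresearch",     # hypothetical KV v2 path
        headers={"X-Vault-Token": vault_token},
        timeout=10,
    ).json()["data"]["data"]
    return vault_token, secrets


def revoke_run_token(vault_token):
    """Revoke the run token after the summary is posted; treat failure here
    as a security event, not a warning."""
    resp = requests.post(
        f"{VAULT_ADDR}/v1/auth/token/revoke-self",
        headers={"X-Vault-Token": vault_token},
        timeout=10,
    )
    resp.raise_for_status()
```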
9. Implementation Constraints
Local inference model availability is a single point of failure for the scoring phase. If the Ollama instance is unavailable when phases 4 and 5 execute, the fallback chain must be exhausted before the loop can proceed. If all fallback models are also unavailable, the loop has two options: skip scoring and post all fetched findings as unscored, or abort the scoring phase and jump to the summary. The operationally correct choice depends on the volume of findings — posting hundreds of unscored findings is less useful than a summary explaining that scoring was unavailable. The implementation should define this threshold explicitly rather than leaving it as a runtime judgment.
Channel format changes break fetches silently. GitHub Trending and competitor changelog pages are scraped rather than retrieved from stable APIs. When the source page structure changes — a common occurrence during platform redesigns — the scraper returns empty results rather than an error. An operator reviewing the run summary sees zero findings from the affected channel, which is indistinguishable from a valid zero-finding result until the pattern repeats across multiple runs. Implement a minimum-findings threshold alert per channel: if a channel that has consistently returned ten or more items per run returns zero for three consecutive runs, generate an alert rather than treating the result as a valid data point (a minimal sketch of this check appears at the end of this section).
The scoring rubric must be recalibrated periodically. A rubric written against the finding landscape of six months ago may produce systematically miscalibrated scores against today’s landscape. Topics that were novel and high-signal six months ago may now be commodity — and the rubric’s weighting of those topics will elevate what is now routine activity to apparent significance. Schedule a rubric review every eight to twelve weeks, comparing auto-created requirements against the finding scores that generated them. Requirements that proved low-value when reviewed by the engineering team indicate rubric miscalibration at the 9–10 threshold.
Deduplication log growth requires periodic pruning. The deduplication log retains URLs from the past thirty days. Without pruning, the log grows without bound and introduces lookup latency into the scoring phase. A nightly pruning job — run at the end of the POST phase — should remove entries older than the configured retention window. This is a maintenance task, not a business logic concern, but it must be implemented before the log grows large enough to affect run-time performance.
Budget overrun handling requires explicit phase state tracking. When the ninety-minute budget expires mid-phase, the loop must jump to the POST phase. This jump is only safe if the loop maintains explicit state for each phase: which items have been processed, which have been scored, and which have already been posted. A loop that tracks phase state as a side effect of execution order — rather than as explicit persistent state — may re-post findings already published in a partial previous run, or skip findings that were fetched but not yet scored. Phase state must be written to a persistent store (not held in memory) so that budget-overrun recovery is deterministic.
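A minimal sketch of the minimum-findings threshold check described above; the baseline average, zero-run count, and history shape are assumptions for illustration.

```python
def channel_health_alerts(history, min_expected=10, zero_runs=3):
    """Flag channels that normally return findings but have gone silent.

    `history` maps channel name to per-run finding counts, most recent last,
    e.g. {"github_trending": [18, 20, 17, 0, 0, 0]}.
    """
    alerts = []
    for channel, counts in history.items():
        if len(counts) <= zero_runs:
            continue  # not enough history to establish a baseline
        recent, baseline = counts[-zero_runs:], counts[:-zero_runs]
        average = sum(baseline) / len(baseline)
        if average >= min_expected and all(c == 0 for c in recent):
            alerts.append(
                f"{channel}: zero findings for {zero_runs} consecutive runs "
                f"after averaging {average:.0f}/run; probable scraper or feed breakage"
            )
    return alerts
```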
10. Recommendations
- Calibrate the scoring rubric before enabling auto-requirement creation. Run the loop in dry-run mode — scoring and logging findings without posting or creating requirements — for at least two weeks before enabling the escalation ladder. Review the findings that scored 9+ during the dry-run period. If more than 5% of total scored findings reach the 9–10 bracket, tighten the rubric before activating auto-creation. A requirement signal that fires frequently loses its escalation value.
- Treat channel failure patterns as a maintenance backlog, not acceptable variance. After the first month of operation, review the run summaries and compile a channel failure frequency table. Channels that fail more than 20% of runs represent structural fragility — scraper breakage, rate limits, or RSS feed deprecation — that should be addressed as a maintenance task. Schedule channel health reviews on the same cadence as rubric recalibration.
- Assign explicit ownership of the research brief to a named team member. A brief that is owned by everyone is updated by no one. Designate a brief owner who is responsible for reviewing and updating focus areas weekly. The brief’s commit history should reflect weekly updates; a brief that has not been updated in three weeks is a brief that no longer reflects current organizational priorities.
- Implement the minimum-findings threshold alert before deploying to production. The silent-zero failure mode — a scraper returning empty results due to a page format change — is the most operationally deceptive failure in the loop. It produces summaries that look like valid data points rather than failure indicators. This alert is not optional infrastructure; it is a prerequisite for trusting the system’s zero-findings reports.
- Test the token revocation path before your first production run. Revocation is the hardest part of the secret management lifecycle to test because it requires deliberately inducing a post-run state. Run a dry-run that completes the full initialization, token fetch, and revocation cycle before the loop handles any real data. Confirm that the revocation API returns success and that a subsequent attempt to use the revoked token produces the expected rejection. Do not discover that revocation is broken during an actual security incident.
- Log scoring decisions at the finding level, not just the aggregate level. The run summary reports score distribution counts. The finding-level log should record each finding’s raw score, the cross-validation status, the one-sentence reason from the LLM, and the action taken. This log is the primary tool for rubric calibration — without per-finding scoring rationale, distinguishing a miscalibrated rubric from a genuinely low-signal week is guesswork (a minimal sketch of such a record follows this list).
- Version the research brief and scoring rules files alongside the loop itself. A change to the research brief or scoring rules file produces a behavior change in the loop equivalent to a code change. Brief and rules file history should be maintained in the same repository as the loop implementation, with commit messages that describe what changed and why. This version history is essential for understanding why a particular run produced unexpectedly high or low finding counts.
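A minimal sketch of a finding-level scoring record written as JSON Lines, with field names assumed for illustration; each call appends one record for later rubric-calibration review.

```python
import json
from datetime import datetime, timezone


def log_scoring_decision(path, finding, score, cross_validated, reason, action):
    """Append one finding-level scoring record (JSON Lines) for rubric review."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "title": finding.get("title"),
        "url": finding.get("url"),
        "channel": finding.get("channel"),
        "score": score,
        "cross_validated": cross_validated,
        "reason": reason,   # one-sentence rationale returned by the scoring LLM
        "action": action,   # e.g. "requirement_created", "findings_log", "discarded"
    }
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
```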
Conclusion and Forward Outlook
The Autoresearch loop demonstrates that continuous intelligence coverage at nightly cadence is achievable without significant operational cost when the most expensive component — relevance scoring — is moved to local inference. The design choices documented here — eight sequential phases, a hard budget with partial-result publication, mandatory run summaries, conservative auto-escalation, and bootstrap-fetch-mask-revoke secret management — are not arbitrary; they represent the minimum viable set of structural commitments required to make an autonomous research loop operationally trustworthy rather than merely technically functional.
The pattern’s most transferable insight is the separation between the research brief and the execution loop. A research agent that cannot be redirected without code changes is a rigid tool; one that reads its focus from a committed, human-editable file is an organizational asset. The loop runs the same eight phases every night. The brief determines what those phases look for. This decoupling is what makes continuous intelligence coverage sustainable over months and quarters, as organizational priorities shift and the competitive landscape evolves.
As local inference models improve in quality at the 7B–27B parameter range — a trajectory that has been consistent and rapid — the quality ceiling of local-inference scoring will rise without any change to the loop’s architecture or cost structure. Organizations that establish local-inference scoring pipelines now will inherit those quality improvements automatically, while organizations that deferred deployment waiting for “good enough” local models will find themselves building infrastructure that their competitors have been running for a year. The organizations best positioned for that transition are the ones that have already deployed the loop, operated it through its calibration period, and built the operational discipline — brief ownership, channel maintenance, rubric review cadence — that makes the technical infrastructure meaningful.
All content represents personal learning from personal projects. Code examples are sanitized and generalized. No proprietary information is shared. Opinions are my own and do not reflect my employer’s views.