
Executive Summary

End-to-end test suites and manual exploratory testing represent two complementary but structurally incomplete approaches to software quality assurance. Regression suites excel at confirming that previously verified behavior remains stable; they are architecturally incapable of detecting novel failure modes because they only find what they were written to find. Manual exploratory testing surfaces genuinely novel bugs but is inconsistent in coverage and frequency. The Autoexplore pattern closes this gap by deploying a structured sub-agent — powered by Playwright for browser automation and Claude Code itself as the triage decision-maker — that runs 15 categorized detection rules against a live development environment on a repeating schedule. The critical architectural insight is that the triage phase, which determines which findings warrant bug reports, runs inside the Claude Code session that already exists: no separate LLM API call is made, and no additional API cost is incurred. Organizations running active development against a live dev environment should treat systematic automated exploration as a third pillar of quality assurance alongside regression testing and manual review.

Key Findings

  • Regression suites are structurally incapable of detecting novel bugs because each test assertion was written to verify a known behavior — a test suite that passes perfectly can coexist with a UI that has never been manually explored for new failure patterns.
  • The triage orchestrator costs zero API tokens because it executes as a task inside the Claude Code session that is already running — the same agent that reads the findings file and decides which bugs to file is the LLM, and no additional inference call is required.
  • The entity triple-check (Rule 13) addresses a failure mode unique to polyrepo systems: when a backend struct field is renamed but the frontend form field is not updated, all per-repo test suites pass independently while the integration is silently broken.
  • Deduplication against the open bug tracker is not optional — without it, the same findings are re-filed on every run, destroying the signal-to-noise ratio of the bug tracker within days.
  • Running as a background sub-agent with run_in_background: true adds zero wall-clock time to the parent orchestration loop — detection and triage proceed in parallel with other scheduled operations.
  • Detection coverage is organized into three distinct layers — Playwright browser rules, CMS audit rules, and deep cross-repo integrity checks — each targeting failure modes that the other layers cannot surface.

1. The Coverage Gap Between Regression Tests and Manual Exploration

A mature test suite is evidence of historical quality, not current quality. Each assertion in a regression suite was authored at a specific moment to verify a specific behavior that was known to matter at that time. The suite grows as known behaviors accumulate. It does not grow in response to behaviors that have never been observed or considered — because no mechanism exists to write assertions for failure modes that have not yet been identified.

This structural constraint produces a persistent coverage gap. A system with comprehensive regression coverage and active development will accumulate novel failure modes in the space between the last exploratory testing session and the present moment. UI elements that were not present when the last manual test was performed may now be empty, broken, or misaligned. API error rates on new routes may be elevated. Form fields added in the most recent sprint may lack validation. CMS content may be registered in the navigation without a corresponding published page. None of these failure modes will cause a regression suite to fail — the suite has no assertions covering them.

Manual exploratory testing addresses this gap but introduces a different failure mode: inconsistency. The coverage achieved in a manual session depends on the tester’s attention, available time, and knowledge of recent changes. Routes that the tester does not visit in a given session are not covered. Patterns the tester does not check — accessibility attributes, pagination behavior, breadcrumb hierarchy — are not evaluated. The coverage is real but non-systematic, and the same failure mode may be missed in consecutive sessions if it does not surface in the areas the tester chose to explore.

The Autoexplore pattern addresses both failure modes simultaneously. By codifying the detection rules that manual explorers apply — and running them programmatically against the live development environment on every cycle — systematic coverage is achieved without the consistency problem of manual testing and without the structural limitation of assertion-based regression suites.
Autoexplore is not a replacement for either regression testing or manual exploration. It is a third pillar. Regression suites verify known behaviors efficiently at scale. Manual exploration surfaces findings that require human judgment and contextual knowledge. Autoexplore provides systematic coverage of structured detection rules that are too numerous and too repetitive for manual application but too heuristic for assertion-based implementation.

2. Architecture: Detection Sub-Agent Plus Zero-Cost Triage Orchestrator

The Autoexplore architecture divides the QA automation problem into two phases with distinct resource profiles: a detection phase that is computationally expensive but runs without LLM inference, and a triage phase that requires LLM reasoning yet costs nothing in API tokens.

2.1 Detection Agent

The detection agent runs as a Playwright-driven browser automation session. It operates against the live development environment, loading pages, interacting with UI elements, intercepting network traffic, and recording observations. Its time budget is 15 minutes. Within that budget, it executes 15 structured detection rules across three layers and writes all findings to a structured output file at /tmp/autoexplore-findings.json. The detection agent can be invoked in two ways. The primary invocation is as a background sub-agent spawned from an autoresearch or nightly orchestration loop, using run_in_background: true. In this configuration, it runs in parallel with other orchestration phases and adds no wall-clock time to the parent loop. The secondary invocation is via the /autoexplore skill, which runs the detection agent as a standalone foreground process when targeted QA coverage is required outside the scheduled loop.
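The findings file is the contract between the two phases. Its exact schema is not specified here; a minimal sketch of one plausible shape — the field names and the "status" terminal marker are illustrative assumptions, not a documented format — might look like this:

# Hypothetical sketch of the findings-file shape. Field names and the
# "status" terminal marker are illustrative assumptions, not a documented schema.
import json

finding = {
    "rule": 2,                  # detection rule number (1-15)
    "layer": 1,                 # 1 = browser, 2 = CMS audit, 3 = deep integrity
    "severity": "P2",           # rule's default severity
    "route": "/admin/orders",   # hypothetical route where the observation occurred
    "description": "GET /api/orders returned 500 during page load",
}

with open("/tmp/autoexplore-findings.json", "w") as f:
    json.dump({"findings": [finding], "status": "complete"}, f, indent=2)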

2.2 Triage Orchestrator

The triage orchestrator reads /tmp/autoexplore-findings.json, fetches all open bugs from the project tracker, and makes filing decisions: which findings are net-new and warrant a bug report, and which findings are already represented in the open bug tracker and should be suppressed.

The triage orchestrator is not a separate LLM. It is Claude Code — the same agent session that is already running. When the detection agent completes and writes its findings file, the parent session reads that file and executes the triage logic. The fuzzy-matching against open bug titles and descriptions, the severity classification, and the filing decisions are all made by the inference that is already in context. No additional API call is initiated. No marginal token cost is incurred.

This architectural decision is the cost unlock of the entire pattern. The computationally expensive operation — Playwright browser automation with page rendering, network interception, and DOM traversal — runs locally at zero API cost. The LLM reasoning required to interpret findings and make filing decisions reuses the Claude Code session that orchestrated the detection agent. The result is a complete QA automation loop with near-zero marginal cost per run.
| Phase | Execution | Time Budget | API Cost |
| --- | --- | --- | --- |
| Detection agent (Playwright) | Local browser automation | 15 minutes | Zero |
| Triage orchestrator (Claude Code) | In-session reasoning | 5 minutes | Zero (reuses existing session) |
| Bug filing | Project tracker API calls | Included in triage budget | Zero |

3. Three Detection Layers: Playwright Rules, CMS Audit, and Deep Integrity Checks

The 15 detection rules are organized into three layers. Each layer targets a category of failure mode that the other layers cannot surface, and each layer uses a different data source and detection mechanism.

3.1 Layer 1 — Playwright Browser Rules (Rules 1–11)

Layer 1 rules are executed through browser automation against the rendered UI. They cover the failure modes that are only detectable by loading pages and observing their state. The following rules constitute Layer 1:
| Rule | Detection Target |
| --- | --- |
| Rule 1 | Stub or placeholder data visible in the UI (hardcoded strings such as “Lorem ipsum,” “Test User,” or “Sample Name” persisting into rendered views) |
| Rule 2 | API 4xx or 5xx errors in network traffic during page load (captured via request interception) |
| Rule 3 | UI defects: broken layouts, missing icons, overlapping elements |
| Rule 4 | Required form fields missing client-side validation |
| Rule 5 | Navigation gaps: menu items that resolve to 404 responses |
| Rule 6 | Console errors emitted during page interaction |
| Rule 7 | Empty states without meaningful content or call-to-action guidance |
| Rule 8 | Accessibility violations: missing alt text, unlabeled form inputs |
| Rule 9 | Loading states that do not resolve within a defined timeout threshold |
| Rule 10 | Broken pagination or list rendering (truncated results, missing items, incorrect counts) |
| Rule 11 | Missing breadcrumbs or incorrect breadcrumb hierarchy |
Rules 1–11 share a common detection mechanism: Playwright navigates to each route registered in the application’s route manifest, performs a standardized interaction sequence (page load, form focus, list scroll), and records observations at each step.
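As an illustration of that shared mechanism, the following sketch wires up two of the eleven rules — network error capture (Rule 2) and console error capture (Rule 6) — using Playwright’s Python API. The base URL, route list, and finding shape are assumptions for illustration.

# Illustrative wiring of two Layer 1 rules (Rules 2 and 6) using
# Playwright's Python API. Base URL, route list, and finding shape
# are assumptions.
from playwright.sync_api import sync_playwright

BASE_URL = "https://dev.example.internal"            # hypothetical dev environment
ROUTES = ["/", "/admin/orders", "/admin/customers"]  # from the route manifest

findings = []

def on_response(response):
    # Rule 2: API 4xx/5xx errors captured via request interception.
    if response.status >= 400:
        findings.append({"rule": 2, "severity": "P2",
                         "description": f"{response.status} on {response.url}"})

def on_console(msg):
    # Rule 6: console errors emitted during page interaction.
    if msg.type == "error":
        findings.append({"rule": 6, "severity": "P2", "description": msg.text})

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.on("response", on_response)
    page.on("console", on_console)

    for route in ROUTES:
        page.goto(BASE_URL + route, wait_until="networkidle")
        page.mouse.wheel(0, 2000)   # standardized interaction: scroll lists

    browser.close()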

3.2 Layer 2 — CMS Audit (Rule 12)

Layer 2 contains a single rule that audits the consistency between the content management system and the application’s route and component registries. CMS-related failures are invisible to browser automation because they manifest as content gaps rather than rendering errors — a page that renders correctly with no content is indistinguishable from a broken page at the DOM level without semantic knowledge of what content is expected. Rule 12 checks three conditions:
  1. A route registered in the navigation manifest has no corresponding published CMS page at the expected slug.
  2. A CMS component is present in the component library but is not referenced in any published page.
  3. A page is published in the CMS but its slug is not registered in the application’s route manifest.
Each of these conditions represents a different failure mode in the CMS integration: a navigation entry pointing to an empty endpoint, an orphaned component that cannot be reached, and published content that is unreachable through the application’s navigation.
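Each of the three conditions reduces to a set difference. A minimal sketch, assuming the five inventories have already been fetched from the navigation manifest, the route manifest, the CMS publish API, and the component library:

# Sketch of Rule 12 — each condition reduces to a set difference.
# The input sets are assumptions; in practice they are fetched from
# the navigation manifest, route manifest, CMS, and component library.

def audit_cms(nav_slugs: set[str], route_slugs: set[str],
              published_slugs: set[str], library_components: set[str],
              referenced_components: set[str]) -> list[dict]:
    findings = []
    # Condition 1: navigation entry with no published page behind it.
    for slug in sorted(nav_slugs - published_slugs):
        findings.append({"rule": 12, "description": f"Nav entry '{slug}' has no published CMS page"})
    # Condition 2: component in the library but referenced by no published page.
    for comp in sorted(library_components - referenced_components):
        findings.append({"rule": 12, "description": f"Component '{comp}' is orphaned"})
    # Condition 3: published page whose slug is not a registered route.
    for slug in sorted(published_slugs - route_slugs):
        findings.append({"rule": 12, "description": f"Published page '{slug}' is unreachable via the route manifest"})
    return findings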

3.3 Layer 3 — Deep Integrity Checks (Rules 13–15)

Layer 3 rules operate below the UI level, against source code and API registries. They detect structural misalignments that produce correct-looking UIs with incorrect underlying behavior — failure modes that are invisible to both browser automation and CMS audits.

Rule 13 — Entity Triple-Check: Validates consistency across three representations of the same domain entity: the backend struct definition, the frontend form field configuration, and the metadata schema. A mismatch across these three representations — for example, a field that exists in the backend struct but not in the frontend form — indicates that a cross-repository change was applied incompletely.

Rule 14 — Events Inventory: Compares the events registered in the application’s source code against the events registered in the management API. Drift between these two registries indicates that schema changes have been deployed to the codebase without being reflected in the event registry, or vice versa. This condition produces silent failures when the application attempts to emit or consume events that the registry does not recognize.

Rule 15 — Foreign Key Relationship Pickers: Performs generic repository discovery and form field classification to identify foreign key fields — fields that reference entities in other tables or services — that lack an associated picker or selection component. A foreign key field without a picker is a data entry dead end: a user filling out the form must know the exact identifier of the related entity rather than being able to select it from a list.
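A minimal sketch of the Rule 14 registry diff, assuming the two inventories have already been collected as sets (in practice, one from a source-code scan, the other from a read-only management API call):

# Sketch of Rule 14 — the two input sets are assumptions standing in
# for a source-code scan and a management API query, respectively.

def check_events_inventory(code_events: set[str], api_events: set[str]) -> list[dict]:
    # Drift in either direction is a finding: code emitting unregistered
    # events, or registry entries no longer referenced anywhere in code.
    findings = []
    for name in sorted(code_events - api_events):
        findings.append({"rule": 14, "severity": "P2",
                         "description": f"Event '{name}' referenced in code but absent from the registry"})
    for name in sorted(api_events - code_events):
        findings.append({"rule": 14, "severity": "P2",
                         "description": f"Event '{name}' registered but never referenced in code"})
    return findings

Rule 15 admits the same shape: diff the set of classified foreign-key fields against the set of fields that have an associated picker component.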
Rules 13–15 require read access to source code repositories and management APIs. In environments where the detection agent runs under a service account, ensure that account has the minimum necessary access: read-only access to the relevant repositories and read-only access to the management API’s entity and event registry endpoints. Do not grant write access to the detection agent — it should observe and record, not modify.

4. The Entity Triple-Check: Catching Cross-Repo Drift That Unit Tests Miss

Rule 13 addresses a failure mode that is structurally undetectable by standard testing approaches in polyrepo systems. In a system where the backend and frontend live in separate repositories, each repository’s test suite validates that repository’s code in isolation. A backend test suite confirms that the backend struct fields are correctly serialized. A frontend test suite confirms that the frontend form fields correctly bind to state. Neither test suite has visibility into the other repository’s field definitions.

This isolation produces a specific failure mode: a cross-repository change applied incompletely. When a backend struct field is renamed — a common operation during domain model refinement — the rename must propagate to the frontend form field and to the metadata schema. If it propagates to the frontend but not the metadata schema, or to the metadata schema but not the frontend, both the backend and frontend test suites pass. The system appears to be working. The incomplete propagation manifests only at runtime when a user attempts to create or edit an entity and the field mapping fails.

The entity triple-check performs a structural comparison across all three representations simultaneously:
# Simplified illustration of the triple-check detection pattern.
# In production, each source (struct, form config, schema) is parsed
# from its respective repository or registry endpoint.

from dataclasses import dataclass


@dataclass
class Finding:
    rule: int
    severity: str
    description: str


def check_entity_triple(entity_name: str) -> list[Finding]:
    struct_fields = parse_backend_struct(entity_name)
    form_fields = parse_frontend_form_config(entity_name)
    schema_fields = parse_metadata_schema(entity_name)

    findings = []

    # Only the struct -> form/schema direction is shown here; the full
    # check compares all directions so drift in any one of the three
    # representations is caught.
    for field in struct_fields:
        if field not in form_fields:
            findings.append(Finding(
                rule=13,
                severity="P2",
                description=f"Field '{field}' present in backend struct but absent from frontend form config for entity '{entity_name}'",
            ))
        if field not in schema_fields:
            findings.append(Finding(
                rule=13,
                severity="P2",
                description=f"Field '{field}' present in backend struct but absent from metadata schema for entity '{entity_name}'",
            ))

    return findings
The triple-check does not evaluate whether the field definitions are correct in any individual repository — it evaluates whether they are consistent across all three. A field that is incorrectly defined in all three repositories passes the triple-check. A field that is correctly defined in two repositories but absent from the third fails it. The check targets synchronization, not correctness. This distinction makes the triple-check an appropriate complement to, not a replacement for, per-repository unit tests. Unit tests verify correctness within a boundary; the triple-check verifies consistency across boundaries.
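A hypothetical driver shows how the check would sweep the full domain model; discover_entities() is an assumed helper that lists the entities found in the backend repository:

# Hypothetical Rule 13 driver — discover_entities() is an assumed helper.
all_findings: list[Finding] = []
for entity in discover_entities():
    all_findings.extend(check_entity_triple(entity))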

5. Zero-Cost Triage: Claude Code as the LLM at No API Expense

The triage phase of Autoexplore is the phase that transforms raw findings into filed bug reports. It requires LLM reasoning: interpreting the natural-language descriptions in the findings file, comparing them semantically against open bug titles and descriptions, classifying severity, and generating bug report content. This is not a task that can be performed by a deterministic rule engine — the comparison between a finding description and an existing bug title requires fuzzy semantic matching, not string equality.

The zero-cost architecture exploits a property of the Claude Code execution model: when Autoexplore runs as a sub-agent within an orchestration loop, the parent Claude Code session is already active when the detection agent completes. The triage logic runs as a continuation of that session. The LLM that performs the semantic matching and filing decisions is the same inference context that launched the detection agent — no new session is initiated, and no additional API call is made.

The triage sequence is as follows:
  1. Load findings file. Read /tmp/autoexplore-findings.json. Parse all findings into a structured list. Count total findings by rule and layer.
  2. Fetch open bugs. Query the project tracker’s API for all open bugs. Extract titles, descriptions, and severity classifications into a comparison corpus.
  3. Fuzzy-match each finding. For each finding, perform semantic comparison against all open bug titles and descriptions. A finding is considered a duplicate if a substantially similar bug — matching on the affected component, the failure mode, and the general symptom — is already open. The matching threshold is intentionally conservative: a finding that might be related to an open bug is treated as potentially duplicative and flagged for human review rather than auto-filed.
  4. File net-new findings. For each finding with no match in the open bug corpus, generate a structured bug report and file it via the project tracker API. Severity classification is based on the rule’s default severity (which encodes the detection layer’s risk profile) adjusted by contextual signals in the finding description.
  5. Post run summary. Write the run summary to the activity feed with total counts across all categories.
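A sketch of that sequence as in-session logic. The semantic comparison is performed by the Claude Code session itself rather than by library code, so it appears below as an assumed helper, as do the tracker API wrappers:

# Sketch of the five-step triage sequence. is_semantic_duplicate()
# stands in for the LLM's in-session fuzzy match; fetch_open_bugs()
# and file_bug() are assumed project-tracker API wrappers.
import json

def run_triage(findings_path: str = "/tmp/autoexplore-findings.json") -> dict:
    with open(findings_path) as f:
        findings = json.load(f)["findings"]            # step 1: load findings

    open_bugs = fetch_open_bugs()                      # step 2: comparison corpus

    filed = deduplicated = 0
    for finding in findings:
        if any(is_semantic_duplicate(finding, bug) for bug in open_bugs):
            deduplicated += 1                          # step 3: suppress duplicate
        else:
            file_bug(finding)                          # step 4: file net-new
            filed += 1

    # Step 5: these counts feed the run summary posted to the activity feed.
    return {"total": len(findings), "filed": filed, "deduplicated": deduplicated}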
The run summary format provides a consistent signal for evaluating the detection run at a glance:
## Autoexplore Run — {date}
Environment: dev | Routes scanned: N | API endpoints probed: N | CMS slugs audited: N
Findings: {total} | Net-new filed: {filed} | Deduplicated: {deduplicated}
### P1 filed: {p1} | P2 filed: {p2}
The distinction between “total findings” and “net-new filed” is the primary quality signal of the run. A run with 12 total findings and 10 deduplicated is healthy — most findings are already tracked. A run with 12 total findings and 12 filed is a signal worth investigating: either previously filed bugs are being closed without the underlying defects being fixed, or the detection rules are surfacing new failure modes faster than the engineering team is resolving them.

6. Deduplication: Why Filing Without It Destroys Bug Tracker Signal

The consequence of omitting deduplication from the triage phase is immediate and severe. Without fuzzy-matching against open bugs, the detection agent will re-file every persisting finding on every run. A bug that takes two weeks to resolve — a common timeline for non-critical UI defects — will generate fourteen duplicate bug reports before it is closed. A bug tracker with fourteen copies of the same finding does not provide a clearer signal than a bug tracker with one — it provides a significantly noisier signal, because the team must now mentally filter duplicate entries to understand the actual shape of the open issue set.

The consequence is not merely aesthetic. Bug tracker signal-to-noise ratio directly affects prioritization quality. When the tracker contains numerous duplicates of the same findings, high-priority novel bugs filed in the same period are proportionally harder to identify. The deduplication step is not a quality enhancement — it is a prerequisite for the bug tracker to remain useful as a prioritization tool.

The fuzzy-matching approach is preferable to exact-match deduplication for a specific reason: the same underlying failure mode will produce findings with different natural-language descriptions depending on the route, the entity, and the specific state of the UI when the detection agent encountered it. An exact-match approach would treat two descriptions of the same failure as distinct findings. A fuzzy-match approach that operates on semantic similarity rather than string equality correctly identifies them as duplicates.
Tune the deduplication threshold conservatively at first — prefer false negatives (treating a genuinely novel finding as a duplicate and suppressing it) over false positives (filing a duplicate). A suppressed finding can be recovered by a human reviewer who reads the run summary. A filed duplicate degrades the bug tracker immediately and must be manually cleaned up. Once the deduplication logic has been validated against several weeks of run data, the threshold can be adjusted based on observed false negative and false positive rates.
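The matching itself is LLM-based in this pattern, but the conservative-band policy can be illustrated with a deterministic stand-in — string similarity via Python’s difflib, with thresholds that are illustrative rather than tuned values:

# Deterministic stand-in for the LLM's semantic comparison, to show the
# conservative-band policy. Thresholds are illustrative, not tuned values.
from difflib import SequenceMatcher

def classify_match(finding_desc: str, bug_title: str) -> str:
    ratio = SequenceMatcher(None, finding_desc.lower(), bug_title.lower()).ratio()
    if ratio >= 0.75:
        return "duplicate"   # suppress — almost certainly already tracked
    if ratio >= 0.55:
        return "review"      # borderline — flag for human review, never auto-file
    return "novel"           # file as net-new

The wide "review" band encodes the preference stated above: when in doubt, suppress and surface to a human rather than file.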

7. Background Execution: Parallel QA Without Wall-Clock Overhead

Autoexplore is designed to run as a sub-agent within a nightly or scheduled orchestration loop. The invocation pattern uses run_in_background: true, which spawns the detection agent as a concurrent process rather than a blocking operation within the parent loop.

The consequence of this design is that the detection phase — all 15 rules, all routes, all API probes — runs in parallel with other phases of the orchestration loop. A nightly research loop that also runs competitor analysis, requirement generation, and wiki updates does not extend its total runtime by 15 minutes when Autoexplore is added to it. Autoexplore runs alongside those phases, and the triage step executes after all phases complete. The invocation pattern within the orchestration loop:
# Simplified illustration of the sub-agent invocation pattern.
# The detection agent is spawned as a background process.
# The parent loop continues executing other phases concurrently.

spawn_subagent(
    skill="/autoexplore",
    run_in_background=True,
    output_path="/tmp/autoexplore-findings.json",
    environment="dev",
    time_budget_minutes=15,
)

# Other orchestration phases run concurrently:
run_competitor_analysis()
run_requirement_generation()
run_wiki_updates()

# Triage runs after all phases complete — findings file is ready.
run_autoexplore_triage(
    findings_path="/tmp/autoexplore-findings.json",
    time_budget_minutes=5,
)
The background execution model also means that Autoexplore can be added to an existing orchestration loop without restructuring it. The detection agent is an additive parallel phase; the triage step is a terminal step that runs after all parallel phases have completed. Neither requires changes to the existing loop structure beyond adding the spawn call at the beginning and the triage call at the end. For standalone invocation — when targeted QA coverage is required outside the scheduled loop — the /autoexplore skill runs the full detection-and-triage sequence synchronously as a foreground process, blocking until both phases complete.

8. Implementation Constraints

Playwright requires a running development environment. The detection agent cannot run against a non-responsive environment. If the dev environment is unavailable — due to a failed deployment, an in-progress migration, or infrastructure maintenance — the detection agent will record connection failures for every route and produce a findings file dominated by Rule 2 errors (API 5xx responses). The triage step should detect this condition from the findings distribution and suppress filing, reporting the environment state as a run status rather than generating bug reports.

Rule 13 (entity triple-check) requires up-to-date repository access. If the detection agent’s repository clone is stale relative to the current HEAD of either the backend or frontend repository, the triple-check will operate against outdated field definitions and may produce false positives or miss genuine drift. The detection agent should pull the latest from all relevant repositories before executing Layer 3 rules.

Deduplication quality degrades as the open bug count grows. Fuzzy-matching against a corpus of 50 open bugs is computationally trivial and semantically reliable. Fuzzy-matching against a corpus of 500 open bugs increases both the compute cost and the false-negative rate, as the probability that any given finding has a semantically similar existing bug increases. Teams that allow the open bug count to grow unchecked will observe that deduplication becomes increasingly aggressive — suppressing findings that are genuinely novel because the expanded corpus contains superficially similar entries. Regular bug triage and closure discipline is a prerequisite for effective Autoexplore deduplication.

The 15-minute detection budget is a constraint, not a target. On small applications with few routes, the detection agent may complete all 15 rules in significantly less than 15 minutes. On large applications with many routes, 15 minutes may be insufficient. The budget should be calibrated to the application’s route count: a rough heuristic is two minutes per ten routes for Layer 1 rules, plus five minutes for Layer 2 and Layer 3 rules regardless of route count (a worked instance of this heuristic follows below).

Background sub-agent spawning requires the parent session to remain active. If the orchestration loop’s parent Claude Code session terminates before the detection agent completes, the findings file may be incomplete or absent. The triage step should validate that the findings file exists and contains a terminal marker before proceeding.
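As a worked instance of the budget heuristic above (a restatement of the rough guideline, not a documented formula):

# Rough budget heuristic from the text: two minutes per ten routes for
# Layer 1, plus a flat five minutes for Layers 2 and 3.
import math

def detection_budget_minutes(route_count: int) -> int:
    return 2 * math.ceil(route_count / 10) + 5

# e.g. 40 routes  -> 2 * 4 + 5  = 13 minutes (fits the 15-minute default)
#      120 routes -> 2 * 12 + 5 = 29 minutes (default budget is insufficient)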
| Constraint | Risk | Mitigation |
| --- | --- | --- |
| Dev environment unavailable | Detection agent produces noise, not signal | Detect from findings distribution; suppress filing |
| Stale repository access | Triple-check produces false positives | Pull latest before Layer 3 execution |
| Growing open bug corpus | Deduplication becomes overly aggressive | Enforce regular bug triage and closure |
| Detection budget exceeded | Layer 3 rules may be skipped | Calibrate budget to route count |
| Parent session terminates early | Incomplete findings file | Validate file and terminal marker before triage |
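A sketch of the pre-triage guards implied by the first and last rows of this table — the "status" terminal marker and the Rule 2 dominance ratio are illustrative assumptions:

# Sketch of the pre-triage guards. The "status" terminal marker and the
# Rule 2 dominance threshold are illustrative assumptions.
import json
import os

def pre_triage_check(path: str = "/tmp/autoexplore-findings.json") -> str:
    if not os.path.exists(path):
        return "alert: findings file missing — detection agent likely failed"

    with open(path) as f:
        data = json.load(f)

    if data.get("status") != "complete":    # terminal marker absent
        return "alert: findings file incomplete — parent session may have ended early"

    findings = data.get("findings", [])
    rule2_errors = sum(1 for item in findings if item.get("rule") == 2)
    if findings and rule2_errors / len(findings) > 0.9:
        return "suppress: environment appears down — report run status, do not file"

    return "proceed"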

9. Recommendations

  1. Treat automated exploration as a third pillar of quality assurance, alongside regression testing and manual review. Do not position Autoexplore as a replacement for either existing approach. Regression suites verify known behaviors; Autoexplore surfaces novel failure modes; manual review applies contextual judgment that neither automated approach can replicate. All three are necessary for a complete quality posture.
  2. Deploy deduplication before deploying detection. The detection rules will produce findings immediately. Without deduplication in place, those findings will generate duplicate bug reports from the first run. Implement and validate the fuzzy-matching logic against your project tracker’s API before enabling the detection agent on any schedule.
  3. Implement the entity triple-check (Rule 13) even if your test coverage appears comprehensive. Per-repository test suites pass independently when a cross-repo change is applied incompletely. The triple-check is the only automated mechanism that validates consistency across repository boundaries. Treat any triple-check finding as P1 until proven otherwise — a field that exists in the backend but not the frontend is a data loss risk, not a cosmetic defect.
  4. Calibrate the detection budget to your environment’s route count before adding Autoexplore to a scheduled loop. Run the detection agent once as a foreground process and observe the actual elapsed time. Set the budget to 20% above the observed time to accommodate variance. A budget that is too tight will produce incomplete runs that miss Layer 3 rules; a budget that is too generous will idle the detection agent at the cost of wall-clock time in the orchestration loop.
  5. Monitor the deduplication ratio as a health signal. A deduplication ratio above 80% over several consecutive runs indicates that the engineering team is not resolving bugs at a rate that keeps pace with detection. A deduplication ratio of 0% over several consecutive runs indicates either that the detection rules are not surfacing known failure modes (a calibration problem) or that the project tracker is not being used to track bugs from previous runs (a process problem). Both conditions warrant investigation.
  6. Validate the findings file structure before triage on every run. Implement a schema validation step between detection completion and triage initiation. A findings file that is malformed, empty, or missing the terminal marker should trigger an alert rather than silently producing a zero-finding triage result. Distinguishing between “the detection agent ran and found nothing” and “the detection agent failed to complete” requires explicit validation.
  7. Run /autoexplore as a foreground skill after any significant sprint deployment. The nightly background execution provides systematic coverage over time. A targeted foreground run immediately after deploying a sprint’s worth of changes provides coverage precisely when novel failure modes are most likely to be present — before the nightly schedule would have caught them.

Conclusion and Forward Outlook

The Autoexplore pattern demonstrates a broader principle for AI-assisted quality automation: the most significant cost in an automated QA loop is often not the LLM reasoning but the local execution — browser rendering, network interception, repository parsing. When the LLM reasoning can be reused from an already-active session rather than initiated as a new API call, the cost structure of the entire loop changes fundamentally. Detection runs that would otherwise require a per-run inference budget instead incur zero marginal cost in the triage phase.

The three-layer detection architecture — browser rules, CMS audit, and deep integrity checks — reflects a categorization of failure modes by detection mechanism rather than by severity. This organization is extensible: new detection rules can be added to any layer without restructuring the others, and new layers can be introduced for categories of failure mode that the current three layers do not address. Teams building on event-driven architectures, for instance, might introduce a fourth layer that validates event schema consistency against a central schema registry.

As AI-assisted development organizations scale and the surface area of deployed code grows, the gap between what regression suites cover and what a real user can encounter will widen. The pressure on manual exploratory testing will increase as the application grows while the team’s capacity to cover it manually does not scale proportionally. The pattern documented here will transition from an optimization to a baseline requirement — systematic automated exploration as a non-negotiable component of the quality assurance function in any continuously deployed system.
All content represents personal learning from personal projects. Code examples are sanitized and generalized. No proprietary information is shared. Opinions are my own and do not reflect my employer’s views.