Executive Summary
This analysis examines the total cost structure of a hybrid local and cloud inference architecture deployed within an autonomous software development pipeline. The central finding is that “free per-token” local inference is not free: it carries measurable costs in electricity, hardware depreciation, inference latency, and a model quality ceiling that creates downstream rework costs when routing decisions are incorrect. The analysis documents two distinct cost failure modes — over-reliance on cloud inference exhausting API credits mid-cycle, and mis-routing quality-sensitive tasks to local inference, where quality variance produces incorrect outputs that cost more to remediate than the API call that would have prevented them. A task-type routing architecture that assigns work to inference tiers based on the marginal value of model quality produces lower total cost than either extreme. Planned migration to a unified routing interface will reduce the operational overhead of maintaining per-model credential management.

Key Findings
- “Free per-token” local inference carries real costs: electricity consumption, hardware depreciation, per-call inference latency, and a model quality ceiling that imposes rework costs on incorrectly routed tasks.
- An autonomous pipeline running 24/7 on all-cloud inference will exhaust a monthly API budget within days during high-volume cycles — experienced directly when a sustained implementation backlog depleted credits mid-cycle.
- Approximately 80 percent of pipeline tasks can run on local inference without meaningful quality loss; the 20 percent that benefit materially from stronger models are disproportionately consequential — PR review, architectural decisions, and complex debugging.
- Quality variance compounds asymmetrically: API billing is linear and predictable; a wrong architectural decision from a weaker model propagates forward and produces remediation costs that exceed the cost of the stronger model by an order of magnitude.
- Cost-aware routing by task type, not by default model choice, is the architectural answer — neither all-local nor all-cloud routing produces optimal outcomes across the task distribution of a software development pipeline.
1. Introduction
1.1 Infrastructure Configuration
The pipeline analyzed in this report runs two inference tiers.

Local inference (Mac Mini, local inference node):
- Model: qwen3:32b, served via Ollama on the local network
- Hardware: Mac Mini with unified memory sufficient for the 32B model
- Consumers: Open WebUI chat interface, orchestration agent, voice and workflow service, automation tasks
- Marginal cost per token: $0
- Real costs: electricity, hardware depreciation, inference latency, model quality ceiling

Cloud inference (Claude API):
- Primary model: claude-sonnet-4-6
- Consumers: task execution service, PR review service, Claude Code sessions
- Marginal cost per token: published per-token billing rates
- Real costs: credit exhaustion during high-volume autonomous cycles, rate limits
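For illustration, the two tiers can be captured as a small declarative registry that routing code reads from. This is a hedged sketch in Python; the field names, structure, and endpoint value are assumptions for demonstration, not the pipeline's actual configuration.

```python
# Illustrative tier registry. Field names, the endpoint URL, and the structure are
# assumptions for demonstration; they do not reflect the pipeline's actual config files.
INFERENCE_TIERS = {
    "local": {
        "model": "qwen3:32b",
        "endpoint": "http://localhost:11434",  # default Ollama port on the local node
        "marginal_cost_per_token": 0.0,
        "real_costs": ["electricity", "depreciation", "latency", "quality ceiling"],
    },
    "cloud": {
        "model": "claude-sonnet-4-6",
        "endpoint": None,  # managed API; billed at published per-token rates
        "marginal_cost_per_token": None,
        "real_costs": ["credit exhaustion", "rate limits"],
    },
}
```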
1.2 Scope
This analysis covers task routing decisions, observed quality outcomes, the pipeline pause incident caused by credit exhaustion, and the routing architecture developed in response. It does not provide exact cost figures, as these are hardware-dependent and electricity-rate-dependent; it provides the cost structure and the principles for evaluating routing decisions.

2. Task Routing Analysis
2.1 Tasks Appropriate for Local Inference
The following task categories have been validated for local inference routing based on observed output quality meeting the threshold required for the task:

Conversational research and ideation. Open WebUI chat sessions involving summarization, question answering from context, and ideation. At conversational volume and for these task types, qwen3:32b performs comparably to cloud models for most queries. Routing this traffic locally produces substantial cost savings with no measurable quality impact.

Code scaffolding and boilerplate generation. Generating structs with specified fields, writing test harnesses for defined interfaces, implementing standard patterns. These tasks have well-defined correct outputs. The output requires human review regardless of the generating model; the review catches any quality variance before it propagates.

Log analysis and structured output parsing. Tasks with structured inputs and structured outputs — log extraction, JSON summarization, pattern identification — where the work is parsing rather than reasoning. Model size has the least impact on quality in this category.

High-frequency workflow automation calls. Classification, tagging, and routing decisions in automated workflows. High task volume and low per-call value make local inference the appropriate tier. Downstream behavior provides implicit validation of output quality.

Voice transcription. whisper-cpp runs locally for speech-to-text. Not Ollama-based, but governed by the same principle: a task with measurable correctness criteria, running at high frequency, where local inference is faster and carries no per-call cost.
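As a concrete example of the high-frequency, low-stakes tier, the sketch below sends one classification request to a local Ollama server. It is a minimal illustration assuming Ollama's default port and its /api/generate endpoint; the prompt and label set are hypothetical, not the pipeline's actual workflow code.

```python
import json
import urllib.request

# Minimal classification call against a local Ollama server (default port 11434).
# The prompt and label set are illustrative; only the endpoint shape is Ollama's.
def classify_locally(text: str, labels: list[str]) -> str:
    payload = {
        "model": "qwen3:32b",
        "prompt": f"Classify the following text as one of {labels}. "
                  f"Reply with the label only.\n\n{text}",
        "stream": False,
    }
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"].strip()

# Example routing decision for an inbound task description (hypothetical labels):
# print(classify_locally("Fix flaky integration test in auth service", ["bug", "feature", "chore"]))
```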
2.2 Tasks Requiring Cloud Inference

The following task categories are routed to cloud inference because the marginal value of stronger model quality materially affects outcomes:

PR review in the autonomous pipeline. The pr-verifier service sends a diff to Claude and requests an assessment of implementation correctness, regression risk, and test coverage adequacy. The correctness requirement is strict. A subtle issue that a 32B model misses — a race condition, a security boundary violation, an unhandled edge case in error paths — costs more to identify and remediate after merge than the API call that would have caught it at review time. This task is routed to Claude without cost optimization.

Architectural decisions. When the task executor encounters a decision affecting system structure — module placement for a new component, consistency with established patterns, cross-repository dependency handling — it routes to cloud inference. The reasoning chain for these decisions is longer and less templated than boilerplate generation; quality variance at this task type has forward-compounding effects on system coherence.

Complex debugging. Multi-step causal reasoning tasks — identifying what conditions could produce an observed symptom, determining which conditions are consistent with the current system state, specifying the minimal change that addresses the root cause without side effects. Smaller models produce plausible-sounding analyses that do not hold up under inspection, resulting in second debugging sessions.

Interactive Claude Code sessions. Design conversations, architecture reviews, and behavior analysis of autonomous loop output run on Claude. These sessions represent the highest-leverage human-in-the-loop interactions in the pipeline; using the strongest available model is appropriate here.
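For the PR review category described above, a call to the cloud tier might look like the following sketch. It assumes the official anthropic Python SDK with an API key in the environment; the prompt wording is illustrative and this is not the pr-verifier service's actual implementation.

```python
import anthropic

# Sketch of a PR-review style call routed to the cloud tier. Assumes the official
# anthropic SDK and ANTHROPIC_API_KEY in the environment; prompt text is illustrative.
client = anthropic.Anthropic()

def review_diff(diff: str) -> str:
    message = client.messages.create(
        model="claude-sonnet-4-6",   # model ID from the routing discussion above
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": (
                "Review the following diff for implementation correctness, "
                "regression risk, and test coverage adequacy. Be specific about "
                "race conditions, security boundaries, and error-path edge cases.\n\n"
                f"{diff}"
            ),
        }],
    )
    return message.content[0].text
```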
3. The Credit Exhaustion Incident
3.1 Incident Description
The task executor operates autonomously. During a sustained high-volume cycle — processing a backlog of more than twenty implementation tasks involving multiple cross-repository changes, PR review loops, and revision cycles — API credit consumption was proportional to task complexity rather than calendar time. A single such cycle exhausted the month’s credit allocation within one week. The result: the pipeline paused mid-cycle. Tasks in flight did not complete. Manual triage was required to determine what had been completed, what was partially complete, and what required a restart.

3.2 Root Cause
The pipeline was not routing aggressively enough to local inference. The local model was available; the routing logic was not directing sufficient task volume to it. The default behavior tilted toward cloud inference even for task categories where local quality was adequate.

3.3 Architectural Response
The credit exhaustion event forced the routing architecture that should have been in place from the start. The model selector was revised to implement the following routing logic:
- Route to local inference for tasks below a defined complexity threshold
- Route to a cost-effective cloud model for standard implementation tasks above the local threshold
- Route to Claude for PR review, architectural decisions, and complex debugging
- Claude Code sessions use Claude directly — this is not a configurable routing decision
The routing architecture is not a choice between local and cloud inference — it is a tiered system in which both operate simultaneously. The design question is not “which tier?” but “which tier for which task type?” The answer to the latter question determines both cost and quality outcomes.
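A minimal sketch of this tiered selection logic follows, assuming a numeric complexity score and the model IDs named elsewhere in this document; the threshold value and task-type labels are illustrative and do not reproduce the actual model selector script.

```python
# Sketch of the tiered model selector described above. Task-type labels, the
# complexity threshold, and the scale are assumptions chosen to mirror the list.
LOCAL_COMPLEXITY_THRESHOLD = 3  # assumed scale, e.g. 1 (trivial) to 10 (architectural)

HIGH_STAKES_TASKS = {"pr_review", "architectural_decision", "complex_debugging"}

def select_model(task_type: str, complexity: int) -> str:
    if task_type in HIGH_STAKES_TASKS:
        return "claude-sonnet-4-6"   # strongest tier: quality here changes outcomes
    if complexity <= LOCAL_COMPLEXITY_THRESHOLD:
        return "qwen3:32b"           # local tier: zero marginal per-token cost
    return "claude-haiku-4-5"        # cost-effective cloud tier for standard work

# Both tiers operate simultaneously; the selector only answers
# "which tier for which task type?"
assert select_model("classification", 1) == "qwen3:32b"
assert select_model("pr_review", 2) == "claude-sonnet-4-6"
```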
4. Planned Routing Architecture
4.1 Current State Limitations
The current model selector script is functional but manually maintained. Each new model addition requires code changes. Cost tracking occurs per model outside the routing system. Adding a new inference tier for a new task class requires a routing script update.

4.2 OpenRouter Migration
OpenRouter addresses these limitations by providing a single API key and a model-agnostic routing interface. Model selection is expressed as a model ID parameter; OpenRouter handles credential management and billing. The routing logic simplifies to: select the appropriate model ID for the task type. Everything else is a uniform API call. The planned routing table under this architecture:

| Task Type | Model | Routing Rationale |
|---|---|---|
| Boilerplate generation | deepseek-v2 | Fast, low cost, adequate quality for templated code |
| Standard implementation | claude-haiku-4-5 | Capable, cost-effective for non-trivial tasks |
| PR review | claude-sonnet-4-6 | Strong reasoning required; quality here changes outcomes |
| Architectural decisions | claude-opus-4-6 | Highest-stakes reasoning; quality compounds forward |
| Classification and tagging | local (qwen3:32b) | High volume, low per-call value, easily validated |
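Under this architecture, every cloud call goes through one OpenAI-compatible client and differs only in the model ID. The sketch below assumes OpenRouter's documented base URL and the openai Python SDK; the model ID strings are placeholders that would need to be checked against OpenRouter's catalogue, and local classification traffic would continue to hit Ollama directly rather than pass through OpenRouter.

```python
import os
from openai import OpenAI

# One client for every cloud tier. OPENROUTER_API_KEY and the model ID strings are
# assumptions for illustration; verify IDs against OpenRouter's model catalogue.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

ROUTING_TABLE = {
    "boilerplate": "deepseek/deepseek-v2",
    "standard_implementation": "anthropic/claude-haiku-4-5",
    "pr_review": "anthropic/claude-sonnet-4-6",
    "architectural_decision": "anthropic/claude-opus-4-6",
    # classification/tagging stays on the local Ollama tier and bypasses this client
}

def run_task(task_type: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model=ROUTING_TABLE[task_type],
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```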
5. Full Cost Accounting
5.1 Local Inference Cost Components
Hardware. Amortized depreciation over a three-year useful life produces a monthly cost figure. The hardware is already owned; the question is whether the inference workload justifies attributing the full depreciation cost to local inference, or whether the hardware serves other purposes that share the depreciation.

Electricity. A Mac Mini under inference load draws approximately 30–50 W above idle. At reasonable electricity rates and typical daily inference hours, this produces a small but non-zero monthly cost. The figure is not large; it is not zero.

Inference latency. qwen3:32b on Mac Mini hardware generates tokens more slowly than the Claude API. For interactive chat, the latency is perceptible. For batch pipeline tasks, the effect is extended wall-clock time per task. At current pipeline volume, this is manageable. At higher volume, latency compounds across sequential task chains and becomes a scheduling constraint.

Quality ceiling. The model quality ceiling is the most consequential cost component and the hardest to quantify. For the tasks routed locally, the ceiling is acceptable and has been validated empirically. But the ceiling exists: tasks initially routed locally have produced subtly incorrect outputs that required re-running on cloud inference. The re-run cost includes both the cloud API call and the latency of the additional cycle.
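To make the electricity and depreciation components concrete, here is a back-of-the-envelope calculation. The hardware price, duty cycle, and electricity rate are assumed values for illustration, not measured figures from the pipeline.

```python
# Back-of-the-envelope monthly cost of the local tier. All inputs are assumptions;
# substitute measured values for a real estimate.
hardware_price_usd = 1400          # assumed Mac Mini configuration price
useful_life_months = 36            # three-year depreciation, as in the text
inference_draw_watts = 40          # midpoint of the 30-50 W above-idle range
inference_hours_per_day = 8        # assumed duty cycle
electricity_rate_per_kwh = 0.15    # assumed rate, USD

depreciation_per_month = hardware_price_usd / useful_life_months
kwh_per_month = inference_draw_watts / 1000 * inference_hours_per_day * 30
electricity_per_month = kwh_per_month * electricity_rate_per_kwh

print(f"Depreciation: ${depreciation_per_month:.2f}/month")   # ~$38.89 under these assumptions
print(f"Electricity:  ${electricity_per_month:.2f}/month")    # ~$1.44 under these assumptions
```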
5.2 Cloud Inference Cost Components

Per-token billing. Published rates, linear with volume under normal conditions.

Tail cost concentration. The average cost per task is not the relevant metric for budget management. The tail — a debugging session that runs long, a PR review that requires multiple passes, an architectural discussion that explores multiple options before converging — is where budget variance is concentrated. The tail is predictable in aggregate but not in timing.
5.3 The Compound Cost of Quality Variance

The asymmetry between the two cost structures is important for routing decisions:

| Cost Type | Characteristic | Budget Implication |
|---|---|---|
| API billing | Linear, predictable, real-time visible | Easy to project; exhaustion is detectable before it occurs |
| Wrong answer from weak model | Non-linear, delayed, often invisible until downstream | Hard to project; remediation cost arrives after the routing decision |
6. Implementation Constraints
6.1 Latency in Sequential Task Chains
An autonomous task that makes fifteen LLM calls in sequence — plan, implement step, verify step, implement next step, verify — is sensitive to per-call latency. At qwen3:32b inference rates, a complex task that completes in two minutes on cloud inference can take fifteen minutes on local inference. For interactive work, this represents a qualitatively different feedback loop. At pipeline scale, it affects total throughput.
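The throughput effect is simple arithmetic over per-call latency. The per-call figures below are assumptions chosen to be consistent with the two-minute versus fifteen-minute example above, not measured benchmarks.

```python
# Rough wall-clock estimate for a sequential chain of LLM calls. Per-call latencies
# are assumed values consistent with the example above, not benchmarks.
calls_in_chain = 15
cloud_seconds_per_call = 8      # assumed average per cloud call
local_seconds_per_call = 60     # assumed average for qwen3:32b on a Mac Mini

print(f"Cloud chain: {calls_in_chain * cloud_seconds_per_call / 60:.0f} min")  # ~2 min
print(f"Local chain: {calls_in_chain * local_seconds_per_call / 60:.0f} min")  # ~15 min
```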
6.2 Context Management Is Invariant to Inference Cost

“Free per-token” does not imply unlimited context. A 32B model with a 32K context window fills that window on complex tasks. Context management — what to include, what to summarize, what to discard — remains a required engineering concern regardless of whether the model is billed per token or hosted locally.
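A minimal sketch of one such concern, a context-budget guard, is shown below. It assumes a rough four-characters-per-token heuristic and a drop-oldest policy; a real implementation would use the model's tokenizer and summarize rather than discard.

```python
# Naive context-budget guard for a 32K-window model. The 4-chars-per-token heuristic
# is a rough assumption; a real implementation would use the model's tokenizer.
CONTEXT_WINDOW_TOKENS = 32_000
RESERVED_FOR_OUTPUT = 4_000

def fits_in_window(chunks: list[str]) -> bool:
    estimated_tokens = sum(len(c) // 4 for c in chunks)
    return estimated_tokens <= CONTEXT_WINDOW_TOKENS - RESERVED_FOR_OUTPUT

def trim_to_window(chunks: list[str]) -> list[str]:
    # Drop the oldest chunks first; summarization is the better strategy for
    # chunks that still carry decision-relevant information.
    kept = list(chunks)
    while kept and not fits_in_window(kept):
        kept.pop(0)
    return kept
```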
6.3 Routing Table Accuracy Is an Approximation

The routing table is an empirical approximation, not a theoretical guarantee. Routing mismatches have been observed in both directions:
- Tasks assumed to require cloud inference running acceptably on local inference — indicating that the routing table is conservative in some task categories
- Tasks assumed to be within local inference capability producing subtly incorrect outputs not caught until later in the pipeline — indicating that the quality ceiling is lower than expected on some task types
Routing table calibration is an ongoing process, not a one-time configuration decision. As the task distribution changes and as new models become available, the routing table requires empirical validation rather than assumption.
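Calibration can be kept lightweight. The sketch below runs the same candidate tasks through both tiers and collects outputs side by side for human comparison; the tier functions are passed in as callables (for example, the illustrative local and cloud call sketches earlier in this document), and the structure is an assumption rather than existing pipeline tooling.

```python
from typing import Callable

# Sketch of empirical routing validation: run a sample of candidate tasks on both
# tiers and record the outputs side by side for human review.
def compare_tiers(
    tasks: list[str],
    run_local: Callable[[str], str],
    run_cloud: Callable[[str], str],
) -> list[dict]:
    results = []
    for task in tasks:
        results.append({
            "task": task,
            "local_output": run_local(task),
            "cloud_output": run_cloud(task),
        })
    return results

# The comparison stays human-in-the-loop: the point is to verify quality adequacy
# per task category before committing that category to the local tier.
```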
7. Recommendations
- Implement explicit task-type routing rather than relying on default model selection. The costs of unrouted inference — both API budget exhaustion and quality-variance-induced rework — exceed the engineering cost of maintaining a routing table.
- Route PR review, architectural decisions, and complex debugging to the strongest available model. The quality ceiling on these task types has measurable downstream cost implications. Optimizing cost on these specific tasks is false economy.
- Validate local inference routing decisions empirically before committing to a routing category. Run candidate local-inference task types on both tiers and compare outputs. Do not assume quality adequacy — verify it.
- Monitor API credit consumption at the task level, not only at the account level. Pipeline-level credit monitoring identifies exhaustion only after it occurs. Task-level cost attribution enables proactive routing adjustments when high-volume cycles are anticipated (see the sketch after this list).
- Plan the migration to a unified routing interface (OpenRouter or equivalent). Per-model credential management creates operational overhead that scales with the number of inference tiers. A unified interface reduces this to model ID selection per task type.
- Account for tail cost concentration when projecting cloud inference budgets. Average per-task cost underestimates total cost during high-complexity cycles. Budget reserves for tail events — long debugging sessions, multi-pass review cycles, architectural explorations — are a required planning input.
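A sketch of the task-level cost attribution mentioned in the monitoring recommendation above: a small ledger that records token usage per task and per model. The per-million-token prices are placeholders, and the structure is illustrative rather than an existing pipeline component.

```python
import time
from dataclasses import dataclass, field

# Placeholder prices in USD per million tokens; substitute current published rates.
PRICE_PER_MTOK = {
    "claude-sonnet-4-6": {"input": 3.00, "output": 15.00},  # assumed; check current pricing
    "qwen3:32b": {"input": 0.0, "output": 0.0},             # local tier: zero marginal cost
}

@dataclass
class TaskCostLedger:
    entries: list = field(default_factory=list)

    def record(self, task_id: str, model: str, input_tokens: int, output_tokens: int) -> None:
        rates = PRICE_PER_MTOK[model]
        cost = (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000
        self.entries.append({"ts": time.time(), "task": task_id, "model": model, "usd": cost})

    def total_for(self, task_id: str) -> float:
        return sum(e["usd"] for e in self.entries if e["task"] == task_id)

# Per-task totals make it possible to spot a high-burn cycle while it is running,
# rather than discovering exhaustion at the account level after the fact.
```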
8. Conclusion
The analysis confirms that “free per-token” local inference is a useful cost reduction lever within a hybrid routing architecture, not a replacement for cloud inference. The costs it eliminates — per-call billing on high-volume, low-stakes tasks — are real and meaningful at pipeline scale. The costs it introduces — electricity, hardware depreciation, latency, and quality ceiling — are smaller but non-zero, and the quality ceiling in particular has asymmetric cost implications that are easy to underestimate.

As autonomous AI development pipelines become more prevalent and as local inference hardware continues to improve in capability and availability, the routing architectures documented here will become baseline engineering practice rather than advanced optimization. The principles — route by task type based on the marginal value of model quality, validate routing decisions empirically, and monitor both API cost and quality-variance cost — are applicable across a wide range of inference configurations beyond the specific hardware and models described in this analysis.

All content represents personal learning from personal and side projects. Cost figures are approximate and based on publicly available pricing. No proprietary information is shared. Opinions are my own and do not reflect my employer’s views.