
Executive Summary

This paper presents a structured analysis of seven categories in which AI-assisted development fails consistently and materially, based on observation of a multi-month production SaaS platform engineering engagement. The failure categories are not random or recoverable through prompt refinement; they map directly to structural gaps between AI training data composition and the knowledge required for production-quality software development. Understanding these boundaries is a prerequisite for designing effective human-AI collaboration workflows. Organizations that treat AI capability as uniform across task types will consistently encounter the highest-cost failure modes — security vulnerabilities, cascading errors, and operational incidents — in production environments where remediation is most expensive.

Key Findings

  • Seven failure categories are consistent and structurally grounded, not artifacts of prompt quality or model version: novel pattern design, framework execution semantics, performance optimization, security hardening, operational resilience, domain-specific business logic, and cascading error resolution.
  • Novel problems expose the fundamental boundary of pattern matching: AI defaults to runtime validation where compile-time enforcement is required, because type-level safety patterns are underrepresented in training data.
  • Framework execution semantics produce silent failures: middleware ordering bugs produced by misapplication of Actix-web’s reverse-execution model manifest as runtime 500 errors, not compilation failures, and pass AI review without flagging.
  • Production operational knowledge — caching requirements, service call latency profiles, item size limits, throttling behavior — is absent from AI training data because it originates in incident post-mortems and operational runbooks, not public source repositories.
  • Cascading error scenarios expose the limits of local context optimization: AI’s error-by-error approach in a representative incident produced 30 commits over 24 hours without convergence; human batch remediation resolved the same problem in 90 minutes.
  • A structured production readiness checklist applied before deployment catches an estimated 90 percent of AI implementation gaps in the operational, performance, and framework categories.

1. Introduction

The value proposition of AI-assisted development is well established for a class of tasks: systematic implementation of well-specified patterns, boilerplate generation, test scaffolding from established templates. The limits of that value proposition are less precisely characterized. This paper addresses the characterization gap. The seven failure categories documented here were identified through direct observation over a multi-month platform engineering engagement in which AI-generated code was systematically reviewed, bugs were categorized by origin, and remediation approaches were compared. The goal is a precise enough description of each boundary that engineering organizations can design intervention protocols in advance, rather than discovering failure modes in production.
All code examples in this paper are sanitized representations of patterns observed in production development. They preserve structural failure characteristics without exposing proprietary implementation details.

2. The Seven Failure Categories

2.1 Novel Problems

Failure description: AI defaults to patterns represented in training data even when requirements demand novel type-level or architectural constructs that are not represented. Representative case: design of a scope-based AWS client factory enforcing multi-tenant isolation boundaries through the Rust type system. Requirements:
  • Four operational scopes (platform, tenant, capsule, operator)
  • Automatic table name prefixing per scope
  • Compile-time enforcement of scope correctness
AI’s initial proposal used runtime validation:
// AI's approach: Runtime validation
fn get_client(&self, scope: Scope) -> Client {
    if scope.is_valid() {  // Runtime check
        Client::new(scope)
    } else {
        panic!("Invalid scope")
    }
}
Implementation constraint: Runtime checks can be bypassed through incorrect caller code that compiles successfully. The requirement was for the compiler to reject invalid scope usage, not for the runtime to panic on it. The human-designed solution used separate client types per scope:
// Compile-time enforcement
impl CapsuleClient {
    fn table_name(&self, base: &str) -> String {
        format!("{}_{}", self.capsule.code, base)
    }
}

impl PlatformClient {
    fn table_name(&self, base: &str) -> String {
        base.to_string()  // No prefix
    }
}
The type system now prevents passing the wrong client type to any function. Errors are caught at compilation, not at runtime.

Root cause analysis: Type-level safety patterns for multi-tenant enforcement in domain-specific contexts are not represented in public training data at sufficient density for AI to propose them as a default approach.

Intervention indicator: Requirements containing the phrase “enforce at compile time” or “prevent architecturally” indicate a novel pattern design task. Human design is required; AI implements from the completed specification.
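To make the compile-time rejection concrete, the following is a minimal self-contained sketch of the pattern. The struct definitions and the load_tenant_rows function are illustrative additions, not the production API:
// Hypothetical sketch: one client type per scope turns misuse into a type error.
struct Capsule { code: String }
struct CapsuleClient { capsule: Capsule }
#[allow(dead_code)]
struct PlatformClient;

impl CapsuleClient {
    fn table_name(&self, base: &str) -> String {
        format!("{}_{}", self.capsule.code, base)
    }
}

// Only a capsule-scoped client can reach tenant-prefixed tables.
fn load_tenant_rows(client: &CapsuleClient, base: &str) -> String {
    client.table_name(base)
}

fn main() {
    let client = CapsuleClient { capsule: Capsule { code: "acme".into() } };
    assert_eq!(load_tenant_rows(&client, "events"), "acme_events");
    // load_tenant_rows(&PlatformClient, "events");
    // ^ rejected at compile time: expected `&CapsuleClient`, found `&PlatformClient`
}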
Do not use AI to design novel compile-time enforcement patterns without a human-authored architectural specification. AI will produce functionally equivalent runtime code that satisfies the stated requirement but not the stated constraint.

2.2 Framework Execution Semantics

Failure description: AI applies logical ordering analysis to middleware, lifecycle hooks, and configuration sequencing without access to the runtime execution model, producing code that reads correctly but executes incorrectly. Representative case:
// ❌ AI's initial code (wrong order)
App::new()
    .wrap(CapsuleExtractor::new())                // Registered first, so it executes last
    .wrap(ConfigMiddleware::new(config_service))  // Registered last, executes first, but no capsule exists yet

// ✅ Human fix
App::new()
    .wrap(ConfigMiddleware::new(config_service))  // Registered first, executes last, has capsule
    .wrap(CapsuleExtractor::new())                // Registered last, executes first, provides capsule
In Actix-web, middleware registered with .wrap() executes in reverse registration order: the last middleware registered is the first to run on an incoming request. ConfigMiddleware requires the CapsuleContext that CapsuleExtractor provides. The AI registered the middleware in top-to-bottom logical order, so under reverse execution ConfigMiddleware ran before any capsule context existed, returning a 500 error. The bug manifests at runtime and passes compilation and static analysis.

Root cause analysis: Documentation for middleware execution order is inconsistently explicit across frameworks. Training data contains registration syntax at high density but runtime execution semantics at low density, because execution order is experienced operationally, not visible in static code.

Intervention indicator: Any feature involving middleware registration, lifecycle hooks, dependency injection scope, or transaction boundary configuration requires human review of execution behavior, not just registration syntax.
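The reversal is easy to see outside the framework. The following plain-Rust sketch (not Actix code) mimics what .wrap() does: each registration layers a new function around everything registered before it, so the last layer added sees the request first. Registered in the AI's order, the config layer runs before any capsule context exists:
fn main() {
    let handler = |req: &str| println!("handler sees {req}");

    // First registration (inner layer), like .wrap(CapsuleExtractor::new())
    let with_capsule = move |req: &str| {
        println!("CapsuleExtractor runs");
        handler(req);
    };

    // Last registration (outer layer), like .wrap(ConfigMiddleware::new(...))
    let with_config = move |req: &str| {
        println!("ConfigMiddleware runs"); // executes before the capsule layer
        with_capsule(req);
    };

    with_config("GET /");
    // Prints: ConfigMiddleware runs, CapsuleExtractor runs, handler sees GET /
}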

2.3 Performance Optimization

Failure description: AI generates correct implementations without awareness of the latency profiles, caching requirements, or resource consumption characteristics of the external services it calls. Representative case: AWS cross-account credential management. AI implementation:
// AI's code: Assume role on every request
pub async fn operator_client(&self) -> Result<Client, Error> {
    let sts = self.sts_client();
    let creds = sts.assume_role()  // 200-500ms latency
        .role_arn(&self.role_arn)
        .send()
        .await?;

    Ok(Client::new_with_credentials(creds))
}
Every cross-account operation made a fresh AssumeRole call. AWS STS AssumeRole adds 200–500ms of network latency, and the resulting credentials are valid for up to 60 minutes. The AI implementation was functionally correct and passed all tests; the performance cost was invisible until profiled against production traffic patterns. Human-added credential caching:
// Human optimization: Cache credentials
pub async fn operator_client(&self) -> Result<Client, Error> {
    // Serve from cache while the credentials remain valid
    if let Some(creds) = self.cache.get(&self.role_arn) {
        return Ok(Client::new_with_credentials(creds));
    }

    let creds = self.sts_client()
        .assume_role()
        .role_arn(&self.role_arn)
        .send()
        .await?;

    self.cache.insert(self.role_arn.clone(), creds.clone());
    Ok(Client::new_with_credentials(creds))
}
Result: Latency decreased from 250ms to 2ms on cache hits, with a 98% cache hit rate observed under production traffic.

Root cause analysis: AWS STS latency characteristics, DynamoDB item size limits, and EventBridge throttling behavior originate in operational experience and AWS service documentation addenda — not in public source code. These characteristics are absent from AI training data.

Intervention indicator: Any code path that calls an external service in a potential loop or per-request context requires human review for caching requirements before deployment.
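Because the assumed-role credentials expire (up to 60 minutes here), a production cache must also honor expiry. Below is a minimal expiry-aware sketch using std types; CachedCreds and the safety-margin policy are illustrative, where real code would use the AWS SDK's credential type and its actual expiration timestamp:
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Hypothetical stand-in for STS credentials.
#[derive(Clone)]
struct CachedCreds {
    token: String,
    expires_at: Instant,
}

struct CredentialCache {
    entries: HashMap<String, CachedCreds>,
    // Refresh slightly before real expiry so credentials never lapse mid-request.
    safety_margin: Duration,
}

impl CredentialCache {
    fn get(&self, role_arn: &str) -> Option<CachedCreds> {
        self.entries
            .get(role_arn)
            .filter(|c| Instant::now() + self.safety_margin < c.expires_at)
            .cloned()
    }

    fn insert(&mut self, role_arn: String, creds: CachedCreds) {
        self.entries.insert(role_arn, creds);
    }
}

fn main() {
    let mut cache = CredentialCache {
        entries: HashMap::new(),
        safety_margin: Duration::from_secs(60),
    };
    cache.insert("arn:aws:iam::123:role/op".into(), CachedCreds {
        token: "example".into(),
        expires_at: Instant::now() + Duration::from_secs(3600),
    });
    assert!(cache.get("arn:aws:iam::123:role/op").is_some());
}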

2.4 Security Hardening

Failure description: AI generates isolation logic that satisfies stated functional requirements while missing the attack vectors, bypass conditions, and boundary validations that a threat model would surface. Representative gaps identified:
  • Cross-tenant queries via timestamp-based Global Secondary Index returning data from multiple tenants
  • Missing tenant validation in batch operation handlers
  • Race conditions in tenant-scoped lock acquisition
  • Token substitution vectors: tenant_id substitution in JWT payloads without server-side binding validation
Root cause analysis: Security hardening requires explicit adversarial reasoning — systematically enumerating what an attacker could do with a given construct, not what a legitimate user is expected to do. This reasoning mode is not embedded in standard implementation pattern generation.

Intervention indicator: All security-critical code — authentication, authorization, multi-tenant data isolation, audit logging — requires a human-authored threat model before AI implementation begins. The threat model defines the attack surface; AI implements the controls.
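As a concrete illustration of the last gap above, a server-side binding check refuses to trust the tenant_id claim on its own. This is a hedged sketch; the types and the lookup function are hypothetical stand-ins, not the production API:
// Hypothetical sketch: never trust the tenant_id carried in the JWT payload
// alone; re-derive the binding server-side and compare before authorizing.
struct Claims {
    subject: String,
    tenant_id: String,
}

#[derive(Debug)]
enum AuthError {
    TenantMismatch,
}

// `lookup_bound_tenant` stands in for a server-side source of truth, e.g. the
// tenant the authenticated subject is actually bound to in the user store.
fn authorize_tenant_access(
    claims: &Claims,
    lookup_bound_tenant: impl Fn(&str) -> Option<String>,
) -> Result<(), AuthError> {
    match lookup_bound_tenant(&claims.subject) {
        Some(bound) if bound == claims.tenant_id => Ok(()),
        _ => Err(AuthError::TenantMismatch), // substituted tenant_id is rejected
    }
}

fn main() {
    let claims = Claims { subject: "user-1".into(), tenant_id: "tenant-a".into() };
    // Server-side binding says user-1 belongs to tenant-b: request is rejected.
    assert!(authorize_tenant_access(&claims, |_| Some("tenant-b".into())).is_err());
}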

2.5 Operational Resilience

Failure description: AI generates architecturally correct implementations that lack the caching strategies, circuit breakers, observability instrumentation, and graceful degradation paths required for reliable production operation. Representative failures by category:
| Gap Category | Failure Mode | Production Impact |
| --- | --- | --- |
| Caching strategy | DynamoDB 400KB item size limit not handled | Event sourcing failures under production-scale event aggregation |
| Circuit breakers | No DynamoDB throttling protection | Bulk operations exhausted write capacity without backoff |
| Observability | No structured logging with correlation IDs | Request tracing across services impossible |
| Graceful degradation | No fallback when configuration service unavailable | All requests failed under a single dependency outage |
Root cause analysis: Production operational concerns originate in incident post-mortems, operational runbooks, and deployment experience. These documents are confidential and are not represented in public training data. AI designs to pass tests in development environments where these failure modes do not manifest.

Intervention indicator: Before deploying any AI implementation, apply a production readiness checklist covering: caching for all external service calls, circuit breakers for downstream dependencies, structured logging with correlation ID propagation, and explicit fallback behavior for each dependency failure mode.

The following production readiness checklist, applied before deployment, catches an estimated 90 percent of AI implementation gaps in the operational, performance, and framework categories (the graceful degradation item is sketched after the list):
  • Performance: What needs caching?
  • Security: What is the threat model?
  • Monitoring: Are metrics, logs, and traces instrumented?
  • Error handling: Are circuit breakers and retries present?
  • Limits: Are rate limits and batch sizes configured?
  • Framework: Is execution order correct?
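As an example of the fallback item on this checklist, the configuration-service outage from the table above can be degraded rather than fatal. A minimal sketch follows, with hypothetical names and a stubbed remote call:
// Hypothetical sketch: serve the last-known-good configuration when the
// configuration service is unavailable, instead of failing every request.
#[derive(Clone)]
struct Config {
    feature_flags: Vec<String>,
}

struct ConfigSource {
    last_known_good: Option<Config>,
}

impl ConfigSource {
    // Stand-in for the real remote call, which can fail or time out.
    fn fetch_remote(&self) -> Result<Config, String> {
        Err("config service unavailable".into())
    }

    fn current(&mut self) -> Option<Config> {
        match self.fetch_remote() {
            Ok(cfg) => {
                self.last_known_good = Some(cfg.clone()); // refresh the fallback
                Some(cfg)
            }
            // Degrade gracefully: use the cached copy rather than returning 500s.
            Err(_) => self.last_known_good.clone(),
        }
    }
}

fn main() {
    let mut source = ConfigSource {
        last_known_good: Some(Config { feature_flags: vec!["search".into()] }),
    };
    // The remote call fails, but requests still get a usable configuration.
    assert!(source.current().is_some());
}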

2.6 Business Logic

Failure description: AI implements business rules as stated in requirements documents but cannot infer the implicit domain knowledge that domain experts hold and have not documented. Representative case: Billing calculation implementation. AI implements from stated requirements:
  • Charge per API call
  • Monthly aggregation
  • Pro-rated refunds
Implicit domain rules not present in requirements:
  • Failed requests (HTTP 500) must not generate charges
  • Maintenance window usage must not generate charges
  • Health check endpoint calls must not generate charges
  • Monthly charges are capped at the contracted limit
  • “Monthly” boundaries must account for timezone offsets
Each of these rules represents domain knowledge carried by billing specialists, not software engineers. None was stated in the written requirements. All would produce billing disputes or compliance exposure if deployed uncorrected.

Root cause analysis: Implicit business rules reside in domain expert knowledge, contract terms, and regulatory requirements — not in requirements documents or source code. AI cannot infer rules that have not been made explicit.

Intervention indicator: Any code path with regulatory, financial, or compliance implications requires validation against domain expert knowledge before deployment. The question to ask is: “What happens in this system that the requirements document does not address?”
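One way to surface such rules is to encode them as an explicit, testable predicate once the domain expert has stated them. The sketch below is illustrative only: the field names, health-check path, and cap mechanics are assumptions, and timezone-aware month boundaries are elided:
// Hypothetical sketch: encode the unwritten billing rules as an explicit,
// testable predicate so they are visible and reviewable.
struct ApiCall {
    status: u16,
    path: String,
    during_maintenance: bool,
}

fn is_chargeable(call: &ApiCall) -> bool {
    call.status < 500                // failed requests (HTTP 500) are not charged
        && !call.during_maintenance  // maintenance-window usage is not charged
        && call.path != "/health"    // health check calls are not charged
}

// The monthly cap is applied at aggregation time, after filtering.
fn monthly_charge(calls: &[ApiCall], per_call_cents: u64, cap_cents: u64) -> u64 {
    let billable = calls.iter().filter(|c| is_chargeable(c)).count() as u64;
    (billable * per_call_cents).min(cap_cents)
}

fn main() {
    let calls = vec![
        ApiCall { status: 200, path: "/v1/items".into(), during_maintenance: false },
        ApiCall { status: 500, path: "/v1/items".into(), during_maintenance: false },
        ApiCall { status: 200, path: "/health".into(), during_maintenance: false },
    ];
    // Only the first call is billable; the cap would bound larger totals.
    assert_eq!(monthly_charge(&calls, 10, 1_000), 10);
}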

2.7 Cascading Error Resolution

Failure description: When a single change produces errors across many dependent files, AI applies local error correction without understanding the global pattern required to resolve all errors consistently. This produces a diminishing-returns cycle that degrades without converging. Representative incident: A macro change produced 214 compilation errors across 30 files. AI’s approach: fix errors individually. Edge cases AI failed to resolve:
  1. Method name collision: generated save() method conflicted with hand-written save() in 12 existing implementations
  2. Error type mismatch: new macro returned EventStoreError; call sites expected RepositoryError
  3. Factory pattern mismatch: macro expected a client() method; existing code used a client field
  4. Dependency chain: fixing one file broke imports in three dependent files
AI result: 30 commits, 24 hours, 14 errors remaining. The human recovery process:
# Step 1: Understand all breaking changes (30 min)
# - save() → db_save()
# - client field → client() method
# - RepositoryError → EventStoreError

# Step 2: Batch fix (45 min)
rg "\.save\(" -t rust | xargs sd '\.save\(' '.db_save('
rg "\.client\b" -t rust | xargs sd 'self\.client' 'self.client()'
# Add error type conversions

# Step 3: Verify (15 min)
cargo check --workspace  # ✅ Clean
cargo test --workspace   # ✅ 142 tests passing
Human result: 3 commits, 90 minutes, zero errors.

Root cause analysis: Cascading changes require understanding of system-wide dependency relationships. AI optimizes locally — correcting the identified error — without modeling how that correction affects dependent files outside the current context window.

Intervention indicator: When error counts across consecutive commits show diminishing returns (fewer than three errors resolved per commit over three consecutive commits), or when AI produces “partial fix” commits, halt AI-assisted remediation and apply batch manual remediation.

3. Why These Boundaries Exist

3.1 Training Data Composition

AI learns from publicly available source code repositories. The following categories of production knowledge are structurally absent from this data:
| Knowledge Category | Why It Is Absent |
| --- | --- |
| Production deployment configurations | Contains secrets; not committed to public repositories |
| Incident post-mortems | Confidential internal documents |
| Performance profiling results | Runtime data; not present in source code |
| Security threat models | Confidential; not open-sourced |
| Business domain knowledge | Resides in expert knowledge and private documentation |
Implication: AI generates architecturally clean patterns that satisfy development environment test conditions. Operational reality is invisible to it.

3.2 Context Window Constraints

Even with 200,000-token context windows, workspace-level analysis remains impractical for large codebases:
| Fits Within Context | Does Not Fit |
| --- | --- |
| Single-crate implementation | Entire workspace (9 crates, 180+ files) |
| Related test files | Cross-crate dependency chain |
| Architecture documentation | Full historical evolution of code changes |
The context window boundary is the proximate cause of cross-entity consistency failures and cascading error resolution failures. AI sees local correctness; it cannot evaluate global impact.

4. Effective Workflow Design

4.1 The Human-AI Division of Labor

The evidence from this engagement suggests the following division of labor produces reliable outcomes:
| Task Type | Recommended Approach |
| --- | --- |
| Novel pattern design | Human designs; AI implements from specification |
| Framework-specific configuration | AI implements; human verifies execution behavior |
| Performance-sensitive paths | Human specifies caching and latency requirements; AI implements |
| Security-critical paths | Human threat models; AI implements controls |
| Operational instrumentation | AI builds infrastructure; human adds monitoring, caching, circuit breakers |
| Domain-critical business logic | Human validates with domain expert; AI implements |
| Cascading errors (>10 affected files) | Manual batch remediation |

4.2 Proactive Design Principles

The engagement data supports three operational principles that consistently improve AI implementation quality:

Design Before Build. Documenting constraints in an Architecture Decision Record before requesting AI implementation eliminates the majority of failures in the novel pattern, security, and performance categories. The ADR serves as a specification that closes the gap between what AI knows and what the system requires.

Atomic Change Scope. Migrating one component completely before proceeding to the next prevents cascading error scenarios. Committing broken intermediate states allows dependency chains to compound errors faster than AI can resolve them.

Operational Layer Addition. AI-generated infrastructure requires a post-implementation pass to add caching for external service calls, structured logging with correlation identifiers, circuit breakers for downstream dependencies, and explicit degradation behavior. This pass should be treated as a fixed cost of AI implementation, not an optional enhancement.

4.3 Progress Monitoring

The following thresholds indicate that AI-assisted remediation is degrading rather than converging:
| Metric | Healthy | Warning | Critical (Intervene) |
| --- | --- | --- | --- |
| Errors resolved per commit | 5–10 | 2–4 | Fewer than 2 |
| Consecutive stalled commits | 0 | 1 | 3 |
| Error count trend | Decreasing | Flat | Increasing |
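These thresholds are mechanical enough to encode directly in tooling. Below is a minimal sketch of the errors-resolved-per-commit row only, assuming errors resolved per commit are tracked; the function name and aggregation are illustrative:
#[derive(Debug, PartialEq)]
enum RemediationHealth {
    Healthy,
    Warning,
    Critical,
}

// `resolved_per_commit`: errors resolved by each commit, most recent last.
fn classify(resolved_per_commit: &[i64]) -> RemediationHealth {
    let last_three: Vec<i64> =
        resolved_per_commit.iter().rev().take(3).copied().collect();
    if !last_three.is_empty() && last_three.iter().all(|&r| r < 2) {
        RemediationHealth::Critical // intervene: switch to manual batch remediation
    } else if last_three.iter().any(|&r| r < 5) {
        RemediationHealth::Warning
    } else {
        RemediationHealth::Healthy
    }
}

fn main() {
    assert_eq!(classify(&[7, 6, 8]), RemediationHealth::Healthy);
    assert_eq!(classify(&[5, 3, 4]), RemediationHealth::Warning);
    assert_eq!(classify(&[1, 1, 0]), RemediationHealth::Critical);
}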

5. Recommendations

  1. Classify all development tasks against the seven failure categories before assigning them to AI. Tasks in the novel pattern, security, and business logic categories require human specification before AI implementation begins. Tasks in the operational, performance, and framework categories require human review after AI implementation is complete.
  2. Establish and enforce a production readiness review stage for all AI-generated code before deployment. This review is distinct from functional code review and specifically addresses the operational, performance, and framework failure categories.
  3. Implement a cascade intervention threshold: if error counts do not decrease by at least three per commit over three consecutive commits, stop AI-assisted remediation and apply manual batch remediation immediately.
  4. Require Architecture Decision Records for any task involving novel patterns, security boundaries, or cross-entity type contracts. The ADR is the primary mechanism for closing the gap between AI training data and system-specific requirements.
  5. Track errors resolved per AI commit as an ongoing progress metric. Diminishing returns in this metric are a leading indicator of the local optimization failure mode; early detection reduces total remediation cost.

6. Conclusion

The seven failure categories documented in this paper represent structural boundaries, not capability limitations that model improvements alone will resolve. Novel architectural patterns, production operational knowledge, adversarial security reasoning, and system-wide dependency context each require knowledge forms that are not present in AI training data by construction. Recognizing this provides a principled basis for workflow design: assign AI to the categories where it performs reliably, and assign human expertise to the categories where it does not.

As AI capabilities continue to advance, the specific boundaries documented here will shift. Context window expansion will partially address the cascading error and cross-entity consistency categories. Improved instruction following will reduce some framework execution errors. However, the categories grounded in training data composition — operational knowledge, adversarial security reasoning, and implicit business logic — are unlikely to resolve through model scaling alone. Engineering organizations should expect these boundaries to persist as a structural feature of current AI development paradigms and design their workflows accordingly.
Disclaimer: This content represents personal learning from building with AI on a personal project. It does not represent my employer’s views, technologies, or approaches. All code examples are generic patterns for educational purposes.