
Executive Summary

End-to-end event flow testing in event-sourced distributed systems is a high-value, high-effort engineering task that organizations consistently under-invest in due to its manual implementation cost. This analysis documents an AI-assisted implementation of comprehensive end-to-end testing for a seven-entity event-sourced CRM system, delivering 21 test scenarios in 1.5 engineering days against a manual effort estimate of 2–3 weeks. The implementation covered the complete event delivery pipeline from DynamoDB Streams through EventBridge and SQS to consumer verification, including cross-entity workflow scenarios and negative assertions for tenant isolation. A structured verification pass identified three critical issues — a race condition in event collection, an incomplete negative assertion pattern, and cross-test state pollution — none of which the AI generation phase had surfaced. The central finding is that AI tooling makes comprehensive testing economically viable; the secondary finding is that AI-generated test suites require rigorous structured verification to meet production quality standards.

Key Findings

  • AI tooling reduced comprehensive E2E test implementation from an estimated 2–3 weeks to 1.5 days. This is not a marginal efficiency gain; it represents a category change in the economic viability of thorough coverage.
  • Systematic, pattern-consistent testing across multiple entities is the highest-leverage AI testing application. The same verification logic applied to seven entity types is precisely the task profile where AI generation is most reliable.
  • Three critical defects in the AI-generated test suite were identified only through a dedicated verification phase. AI generation produced functionally correct tests that nonetheless contained a timing vulnerability, an incomplete assertion pattern, and a test isolation failure.
  • Test coverage percentage does not equal requirement coverage percentage. AI-generated tests that achieve high code coverage may not map to the business requirements they are intended to verify; explicit requirement traceability is a separate discipline.
  • Negative assertions — verifying that prohibited events do not occur — require explicit specification. AI generation does not produce negative assertions autonomously; they must be requested or the gap will propagate to production.
  • The economic threshold for “worth doing” has shifted. Testing investments previously evaluated as too expensive relative to benefit must be re-evaluated against AI-assisted implementation costs.

1. System Context and Testing Requirements

The system under test was a CRM domain layer implemented with event sourcing across seven entity types: Account, Contact, Lead, Opportunity, Activity, Product, and Address. The event delivery architecture was as follows:
DynamoDB Streams → EventBridge → SQS → Consumers
In event-sourced systems, defects in event flow have compounding consequences:
  • A missing event breaks the audit trail and may corrupt downstream state reconstructions.
  • An incorrect event order produces invalid state when events are replayed.
  • A cross-tenant event leak constitutes a compliance violation regardless of whether the data is acted upon.
The testing requirements were correspondingly demanding: independent verification of each entity’s event flow, verification of multi-entity workflow sequences, failure scenario coverage, and execution against a local infrastructure emulator (LocalStack) rather than live AWS resources.
The manual effort estimate for comprehensive coverage, 2–3 weeks, reflects the genuine complexity of the task: each test scenario requires test data setup, asynchronous event delivery with non-deterministic timing, payload and ordering verification, and resource cleanup. Multiplied across 21 scenarios covering seven entity types and multi-entity workflows, the repetitive implementation burden is substantial.
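For reference in the code examples that follow, the event envelope can be assumed to have roughly the following shape. This is a sketch: the field names are inferred from the test assertions later in this analysis, not taken from the actual generated code.
use serde::{Deserialize, Serialize};

/// Assumed event envelope (fields inferred from the test assertions below).
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Event {
    pub entity_type: String,        // e.g. "Lead", "Opportunity"
    pub event_type: String,         // e.g. "LeadConverted"
    pub tenant_id: String,          // tenant scoping, used by isolation tests
    pub payload: serde_json::Value, // entity-specific event data
}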

2. AI-Assisted Planning and Implementation

2.1 Planning Phase Output

The planning phase produced a 24-page specification covering test architecture, scenario inventory, utility design, and assertion patterns. The specification defined the following components.
Test Architecture:
  • TestEventBus abstraction wrapping EventBridge and SQS clients for test isolation.
  • EventCollector for asynchronous event aggregation with configurable timeout.
  • Trait-based event matchers supporting flexible field-level assertions (a sketch follows the specification lists below).
  • Separate test suite organization for per-entity and cross-entity scenarios.
Scenario Inventory (21 scenarios):
  • Contact events: 5 scenarios
  • Lead events: 6 scenarios including lead conversion workflow
  • Opportunity events: 4 scenarios
  • Partner events: 3 scenarios
  • Cross-entity workflows: 3 scenarios
Test Utility Specifications:
  • Event comparison with field-level diff output
  • Timeout-based asynchronous event waiting with structured failure messages
  • Test data builders for each entity type
  • Event payload normalization for deterministic assertions
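As an illustration of the trait-based matcher item above, a minimal sketch of what such a matcher could look like; the trait and type names here are hypothetical rather than taken from the generated code:
pub trait EventMatcher {
    /// True if the event satisfies this matcher.
    fn matches(&self, event: &Event) -> bool;
    /// Human-readable description used in field-level diff output.
    fn describe(&self) -> String;
}

/// Field-level matcher: asserts one JSON payload field has an exact value.
pub struct FieldEquals {
    pub field: &'static str,
    pub expected: serde_json::Value,
}

impl EventMatcher for FieldEquals {
    fn matches(&self, event: &Event) -> bool {
        event.payload.get(self.field) == Some(&self.expected)
    }
    fn describe(&self) -> String {
        format!("payload.{} == {}", self.field, self.expected)
    }
}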

2.2 Implementation Phase Output

The implementation phase produced the following file structure:
eva-crm/tests/integration/event_flow/
├── mod.rs                      (shared utilities)
├── contact_events_test.rs      (5 scenarios)
├── lead_events_test.rs         (6 scenarios)
├── opportunity_events_test.rs  (4 scenarios)
└── partner_events_test.rs      (3 scenarios)

eva-crm/tests/e2e/
└── lead_conversion_workflow_test.rs (multi-entity)
All 21 test scenarios passed against LocalStack on initial execution.
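Running against LocalStack requires pointing the AWS SDK clients at the local endpoint rather than at AWS itself. A minimal sketch of that wiring, assuming LocalStack’s default edge port and dummy static credentials:
// Sketch: build an SQS client against a local LocalStack endpoint.
// The endpoint URL, region, and credentials are assumptions for local runs.
async fn localstack_sqs_client() -> aws_sdk_sqs::Client {
    let config = aws_config::defaults(aws_config::BehaviorVersion::latest())
        .endpoint_url("http://localhost:4566") // LocalStack edge port
        .region(aws_config::Region::new("us-east-1"))
        .credentials_provider(aws_sdk_sqs::config::Credentials::new(
            "test", "test", None, None, "localstack", // dummy static creds
        ))
        .load()
        .await;
    aws_sdk_sqs::Client::new(&config)
}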

3. The EventCollector Implementation

The EventCollector utility is the core infrastructure component enabling reliable asynchronous event verification. The implementation uses exponential backoff with jitter to handle the variable latency inherent in the EventBridge-to-SQS delivery path:
// `Result`/`Error` below are the test crate's own alias and error enum
// (definitions elided); imports added for clarity.
use std::time::{Duration, Instant};
use rand::Rng; // jitter for the polling backoff

/// Async event collector with timeout and filtering
pub struct EventCollector {
    queue_url: String,
    sqs_client: aws_sdk_sqs::Client,
    timeout: Duration,
}

impl EventCollector {
    /// Collect events matching predicate within timeout
    pub async fn collect_events<F>(
        &self,
        predicate: F,
        expected_count: usize,
    ) -> Result<Vec<Event>>
    where
        F: Fn(&Event) -> bool,
    {
        let start = Instant::now();
        let mut collected = Vec::new();

        // Initial polling delay; doubled after each pass, capped below
        let mut delay = Duration::from_millis(100);

        while start.elapsed() < self.timeout {
            // Poll SQS with short long-polling to batch message receipt
            let messages = self.sqs_client
                .receive_message()
                .queue_url(&self.queue_url)
                .max_number_of_messages(10)
                .wait_time_seconds(1)
                .send()
                .await?
                .messages
                .unwrap_or_default();

            for msg in messages {
                // SQS message bodies are optional in the SDK model
                let body = msg.body.as_deref().unwrap_or_default();
                let event: Event = serde_json::from_str(body)?;

                if predicate(&event) {
                    collected.push(event);

                    if collected.len() >= expected_count {
                        return Ok(collected);
                    }
                }
            }

            // Exponential backoff with jitter to avoid synchronized polling
            let jitter = Duration::from_millis(rand::thread_rng().gen_range(0..100));
            tokio::time::sleep(delay + jitter).await;
            delay = (delay * 2).min(Duration::from_secs(5));
        }

        Err(Error::EventCollectionTimeout {
            expected: expected_count,
            received: collected.len(),
            elapsed: start.elapsed(),
        })
    }
}
Test usage example:
#[tokio::test]
async fn test_lead_conversion_emits_events() -> Result<()> {
    // Constructor builds the SQS client internally (e.g. against LocalStack)
    let collector = EventCollector::new("test-queue-url", Duration::from_secs(10));

    // Hypothetical test-data builder per the utility spec in §2.1
    let lead_id = build_test_lead().await?;

    // Trigger lead conversion
    convert_lead_to_opportunity(&lead_id).await?;

    // Collect events
    let events = collector
        .collect_events(
            |e| e.entity_type == "Lead" || e.entity_type == "Opportunity",
            2, // Expect: LeadConverted + OpportunityCreated
        )
        .await?;

    // Verify event ordering and payload
    assert_eq!(events[0].event_type, "LeadConverted");
    assert_eq!(events[1].event_type, "OpportunityCreated");
    assert_eq!(events[1].payload["lead_id"], lead_id);

    Ok(())
}
The exponential backoff addresses a genuine operational characteristic of the EventBridge-to-SQS delivery path: delivery latency is variable and cannot be handled with a fixed sleep interval. The predicate-based filtering allows each test to specify exactly which events it is waiting for, preventing false positives from unrelated events in the queue.

4. Verification Phase: Critical Issues Identified

The AI generation phase produced 21 passing tests. A structured verification review identified three critical issues that would have caused failures or false confidence in a production environment.
Issue 1: Race Condition in Event Collection

The initial implementation used fixed polling intervals. Tests occasionally failed due to variable EventBridge-to-SQS delivery delays that exceeded the fixed interval timing. This manifested as intermittent test failures with no deterministic reproduction pattern.

Resolution: Replaced fixed-interval polling with exponential backoff and jitter, as shown in the EventCollector implementation above. Intermittent failures were eliminated.
Issue 2: Incomplete Negative Assertions for Tenant Isolation

The cross-tenant isolation tests verified that Tenant A’s events were delivered to Tenant A’s consumer. They did not verify that Tenant A’s events were absent from Tenant B’s queue. A partial event leak — where events arrive at the correct destination but also arrive at an incorrect destination — would have passed the test suite as written.

Resolution: Added should_not_receive_event assertions to all tenant isolation scenarios. The absence of an event is as important to verify as its presence, particularly in compliance-sensitive isolation contexts.
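A sketch of what the added absence assertion could look like, built on the EventCollector shown earlier; success is the collection timing out with zero matches (the UnexpectedEvent error variant is illustrative):
impl EventCollector {
    /// Negative assertion: succeeds only if no event matching `predicate`
    /// arrives before the collector's timeout elapses.
    pub async fn should_not_receive_event<F>(&self, predicate: F) -> Result<()>
    where
        F: Fn(&Event) -> bool,
    {
        match self.collect_events(predicate, 1).await {
            // Timing out with zero matches is the success case here.
            Err(Error::EventCollectionTimeout { received: 0, .. }) => Ok(()),
            // Any matching event is a leak; surface it as a failure.
            Ok(events) => Err(Error::UnexpectedEvent { count: events.len() }),
            Err(other) => Err(other),
        }
    }
}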
Issue 3: Cross-Test State Pollution

Events emitted during one test scenario could be present in the SQS queue when a subsequent test scenario ran. Tests that relied on event counts were non-deterministic when prior test events were present. This caused order-dependent test failures that were difficult to diagnose.

Resolution: Implemented per-test event queue isolation, ensuring each test scenario operates against a clean queue state. Test ordering ceased to affect test outcomes.
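A sketch of the per-test isolation approach: each test creates a uniquely named queue and deletes it on teardown, so no prior test’s events can be observed. The helper names and the uuid dependency are assumptions, and the EventBridge rule/target binding for the new queue is elided:
/// Create a uniquely named queue for one test so residual events from
/// earlier tests can never appear in it.
async fn isolated_test_queue(
    sqs: &aws_sdk_sqs::Client,
    test_name: &str,
) -> Result<String> {
    let queue_name = format!("e2e-{}-{}", test_name, uuid::Uuid::new_v4());
    let out = sqs.create_queue().queue_name(&queue_name).send().await?;
    Ok(out.queue_url.expect("SQS returns a queue URL on creation"))
}

/// Teardown: remove the per-test queue so nothing leaks into later runs.
async fn delete_test_queue(sqs: &aws_sdk_sqs::Client, queue_url: &str) -> Result<()> {
    sqs.delete_queue().queue_url(queue_url).send().await?;
    Ok(())
}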
These three issues share a common characteristic: they are invisible to the AI generation phase because they concern execution context rather than test logic. The AI generates tests that are correct in isolation. The verification phase must evaluate tests in their collective execution context.

5. Principles Established

Principle 1: Systematic Coverage Is the Highest-Value AI Testing Application

Work that requires consistent application of a verification pattern across many cases is the task profile where AI test generation is most reliable and offers the greatest leverage. Twenty-one scenarios following the same structural pattern across seven entity types is the ideal AI generation task: the pattern is well-defined, the variation is data-driven, and human judgment is required primarily for scenario selection rather than for implementation. The anti-pattern is using AI for exploratory testing, where success criteria are not defined in advance. AI generation requires a clear specification of what a passing test looks like.

Principle 2: Economic Viability Thresholds Have Shifted

The E2E testing implementation described in this analysis would not have been undertaken at its manual effort estimate. At 1.5 days with AI assistance versus 2–3 weeks without, the implementation crossed the threshold from “not worth the effort” to “clearly worth doing.” The confidence level in the event delivery system — and the ability to detect regressions — is materially higher as a result. Engineering organizations should re-evaluate any testing investment previously classified as too expensive. The effort equation has changed, and decisions made before AI-assisted implementation was available may no longer be correct.

Principle 3: Requirement Traceability Is a Separate Discipline From Coverage

AI-generated tests achieve high code coverage. They do not automatically map to business requirements. A test that exercises a code path may not verify the behavior that the product specification requires. The following pattern enforces explicit requirement traceability in test naming:
#[test]
fn test_account_name_validation_per_prd_section_2_1() {
    // Test specific requirement from PRD
}
This naming convention makes the relationship between test and requirement explicit, auditable, and reviewable by non-engineering stakeholders.

6. AI Capability Assessment

| Task | AI Performance | Required Human Input |
| --- | --- | --- |
| Test architecture design | High — produced complete specification | Requirement context and scope |
| Scenario inventory generation | High — identified 21 scenarios systematically | Entity and workflow inventory |
| Test utility implementation | High — EventCollector and matchers correct | Specification of async patterns |
| Test data fixture generation | High — 40+ reusable fixtures across entity types | Entity schema definitions |
| Race condition handling | Not autonomous — required explicit specification | Human identification of timing concern |
| Negative assertion generation | Not autonomous — required explicit request | Human identification of absence requirement |
| Cross-test isolation | Not autonomous — required explicit specification | Human identification of pollution risk |
| Requirement traceability | Not autonomous — requires explicit convention | Organizational standard definition |
The capability boundary is consistent with the pattern observed in other AI-assisted development contexts: AI excels at systematic, pattern-consistent generation from well-specified inputs. It does not autonomously apply production operational knowledge, identify absence-class requirements, or enforce organizational standards that are not part of its input specification.

7. Coverage Metrics

| Metric | Value |
| --- | --- |
| Event flow test scenarios | 21 |
| Integration test scenarios | 47 |
| Unit tests | 156 |
| Overall code coverage | 89% |
| Critical issues identified in verification | 3 |
| Issues that would have reached production without verification | 3 |
| Estimated manual implementation time | 2–3 weeks |
| Actual AI-assisted implementation time | 1.5 days |

8. Recommendations

Recommendation 1: Adopt AI-assisted test generation for all systematic coverage tasks in event-sourced or message-driven systems. The effort reduction is sufficient to make comprehensive coverage economically viable in contexts where it was previously cost-prohibitive. The investment in AI-assisted test generation pays immediate returns in regression detection capability and ongoing returns in developer confidence.

Recommendation 2: Require a dedicated verification phase for all AI-generated test suites before merging to the main branch. AI-generated tests that pass in isolation may contain race conditions, incomplete assertion patterns, or cross-test state pollution that is only visible in collective execution context. A structured verification phase that specifically targets timing, isolation, and negative assertion coverage is a mandatory quality gate, not an optional enhancement.
Structure the verification checklist around the three failure modes documented in this analysis:
  • Timing-dependent assertions: does this test assume a fixed delay?
  • Absence assertions: does this test verify that prohibited events do not occur?
  • State isolation: does this test assume a clean environment that prior tests may have polluted?
These three categories cover the majority of AI test suite defects.
Recommendation 3: Implement negative assertions for all compliance boundary tests, explicitly and as a required standard. Tests that verify cross-tenant isolation must assert the absence of events in prohibited queues, not only the presence of events in permitted queues. This requirement must be stated explicitly in test generation specifications; it will not be included by default.

Recommendation 4: Establish requirement traceability conventions before AI-assisted test generation begins. Test naming and annotation conventions that establish explicit links between tests and product requirements must be defined as organizational standards and included in the AI generation specification. Retroactive addition of traceability to generated test suites is significantly more costly than building it in from the start.

Recommendation 5: Re-evaluate all testing investments previously classified as cost-prohibitive. Any testing investment whose deferral was justified by manual implementation cost should be re-evaluated against current AI-assisted implementation estimates. The cost reduction is substantial enough to reverse many of those decisions.

9. Conclusion and Forward-Looking Assessment

The implementation documented in this analysis demonstrates that AI-assisted test generation represents a genuine capability expansion for engineering organizations, not merely a productivity improvement. Comprehensive testing that was previously unaffordable is now affordable. This shifts the constraint on testing quality from effort to discipline: the organizations that will extract the most value from AI-assisted testing are those with the procedural rigor to specify requirements clearly, verify generated output systematically, and enforce quality standards that AI tooling does not autonomously apply.

As AI generation capabilities continue to improve, the gap between “what AI can generate” and “what production requires” will narrow for implementation correctness. It will not narrow for contextual operational knowledge — understanding timing characteristics of delivery pipelines, recognizing absence requirements in compliance contexts, enforcing organizational standards. Human judgment in these domains will remain the differentiating factor between test suites that provide genuine confidence and test suites that provide the appearance of confidence.
All content represents personal learning from personal projects. Code examples are sanitized and generalized. No proprietary information is shared. Opinions are my own and do not reflect my employer’s views.