Executive Summary

Large-scale breaking changes in distributed data systems represent a category of engineering risk that ad-hoc approaches consistently underestimate. This paper documents a structured migration methodology applied to a six-entity capsule isolation enforcement initiative affecting 1,003 tests, 47 API handlers, and 127 call sites across a production multi-tenant platform. The systematic approach—comprising comprehensive planning via a long-context reasoning model, template-entity pattern validation, parallel AI-assisted implementation, and staged verification—completed the migration in 32 hours against a manual estimate of 160 to 240 hours, yielding a 5 to 7x time reduction. A critical finding is that the planning investment of 8 hours produced a 464-page migration specification that prevented six production-class defects. Organizations executing breaking changes without upfront comprehensive planning incur compounding rework costs that frequently exceed the original migration effort by an order of magnitude. This paper presents the decision framework, execution methodology, empirical metrics, and replicable patterns for engineering teams facing analogous migration challenges.

Key Findings

  • Comprehensive upfront planning eliminates downstream rework. An 8-hour planning investment produced a migration specification that identified the critical GSI pattern defect before implementation began, preventing the defect from propagating across all six entities.
  • Template-entity validation is the highest-leverage quality gate in multi-entity migrations. Migrating the simplest entity first and verifying it completely before parallelizing catches pattern defects at unit cost rather than multiplied cost.
  • AI-assisted parallel implementation achieved a 5 to 7x time reduction relative to manual sequential migration, primarily through systematic call-site updates and bulk test data modifications.
  • AI agents require explicit dependency ordering for hierarchical entity migrations. Without human-specified migration sequences, agents default to alphabetical or arbitrary ordering that violates parent-child data relationships.
  • AI-generated data migration plans are insufficient for concurrent-access scenarios. Human engineers must supply atomic transaction strategies; AI agents do not independently reason about race conditions in distributed data stores.
  • Verification must be applied to every entity without exception. Skipping verification for entities presumed to follow an established template resulted in a staging-environment defect that required 2 additional hours to diagnose and resolve.

1. Problem Statement: Data Contamination Across Isolation Boundaries

1.1 The Production Anomaly

During the third week of integration testing, monitoring logs surfaced a critical isolation violation.
[WARN] Capsule isolation violation detected
Entity: FinancialConfig
Event: FinancialConfigUpdated
Issue: Missing capsule_id in partition key
Impact: DEVUS test data appearing in PRODUS production queries
A financial configuration record created in a development environment appeared in production query results because the entity was scoped at the tenant level rather than the capsule level. For financial services workloads, this constitutes a compliance violation with direct implications for SOC 2 audit outcomes.

1.2 Scope Analysis

Systematic analysis revealed that the defect was not isolated to a single entity. Six entities shared the same structural deficiency.
Entity             Scope    Issue                                    Risk
FinancialConfig    Tenant   Test configs in prod queries             High
Contract           Tenant   Test contracts in revenue reports        Critical
ContractLineItem   Tenant   Test line items in billing               Critical
ContractAmendment  Tenant   Test amendments in audit trail           High
RevenueSchedule    Tenant   Test revenue in financial reporting      Critical
AccessEntity       Tenant   Test access grants in security queries   Medium
The aggregate migration scope comprised 6 entities, 21 repository methods per entity, 47 API handlers, and 1,003 tests, with a manual effort estimate of 4 to 6 weeks.

1.3 Root Cause

The vulnerable entity definition illustrates the structural problem.
// WRONG: Tenant-scoped only
#[derive(DynamoDbEntity)]
#[pk = "TENANT#{tenant_id}#CONFIG#FINANCIAL"]
pub struct FinancialConfigEntity {
    pub tenant_id: TenantId,
    // capsule_id missing!
    pub industry: String,
    // ...
}
The absence of capsule_id in the partition key meant that query operations could not enforce isolation between environments sharing the same tenant identifier.
// Queries for "Technology" industry returned results from BOTH capsules
let configs = repo.query_by_industry(tenant_id, "Technology").await?;
// Returns: [DEVUS test config, PRODUS production config]
Compliance Implication: The commingling of test and production data in query results constitutes a breach of the security boundary between environments. In regulated industries, this pattern fails SOC 2 Type II controls and may trigger audit findings. The defect must be remediated comprehensively, not incrementally, because a partially enforced security boundary provides no meaningful protection.
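The key-pattern difference at the heart of the defect can be sketched in isolation. The following is a minimal illustration, not the platform's actual code: the newtype wrappers and helper functions are hypothetical, but the PK formats mirror those shown above.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct TenantId(pub String);

#[derive(Debug, Clone, PartialEq)]
pub struct CapsuleId(pub String);

// Old, tenant-scoped pattern: DEVUS and PRODUS collide in one partition.
fn tenant_scoped_pk(tenant: &TenantId) -> String {
    format!("TENANT#{}#CONFIG#FINANCIAL", tenant.0)
}

// New, capsule-scoped pattern: each capsule gets its own partition.
fn capsule_scoped_pk(tenant: &TenantId, capsule: &CapsuleId) -> String {
    format!("TENANT#{}#CAPSULE#{}#CONFIG#FINANCIAL", tenant.0, capsule.0)
}

fn main() {
    let tenant = TenantId("t-123".into());
    let devus = CapsuleId("DEVUS".into());
    let produs = CapsuleId("PRODUS".into());

    // Old pattern: the key is identical regardless of capsule, so a
    // query scoped to PRODUS can still surface DEVUS rows.
    assert_eq!(tenant_scoped_pk(&tenant), "TENANT#t-123#CONFIG#FINANCIAL");

    // New pattern: the keys diverge, so the partition itself enforces
    // the isolation boundary.
    assert_ne!(
        capsule_scoped_pk(&tenant, &devus),
        capsule_scoped_pk(&tenant, &produs)
    );
}
```

Because the capsule identifier is baked into the partition key, no application-level filtering is needed: a query against the PRODUS key simply cannot match DEVUS items.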

2. Migration Strategy Selection

2.1 Option Evaluation

Two migration strategies were evaluated against the constraints of zero downtime, no data loss, backward compatibility during transition, and atomic completion across all six entities.

Option A: Incremental Migration. This approach would migrate one entity at a time, deploying after each entity completion over a six-week period. It was rejected because a security boundary in a partially enforced state offers no isolation guarantee. Additionally, conditional query logic to determine whether each entity is capsule-scoped introduces complexity proportional to the number of migration stages, and six deployment cycles create six independent risk windows.

Option B: Coordinated Comprehensive Migration (Selected). This approach plans all six entities before implementation begins, executes in parallel, and deploys in a single coordinated release. The breaking change is absorbed once rather than distributed across six increments.

Selection Rationale: Capsule isolation is a security boundary, and the property is binary: either enforced or not. Incremental enforcement provides a false sense of security and introduces query complexity that must subsequently be removed. A coordinated migration accepts higher upfront planning cost in exchange for a single, verifiable transition.

3. Planning Methodology

3.1 Specification Development

A comprehensive migration specification was produced through a long-context reasoning session with the following input.
Planning session for capsule isolation migration.

Context:
- 6 entities currently tenant-scoped, need capsule scope
- ADR-0010 defines capsule isolation requirements
- Migration is breaking change (PK patterns change)
- Cannot break existing data

Requirements:
1. Migrate entity schemas (add capsule_id field)
2. Update PK/SK patterns (TENANT#...#CAPSULE#...)
3. Update all repositories (add capsule_id parameters)
4. Update all API handlers (pass capsule_id)
5. Migrate existing DynamoDB data
6. Update all tests (1,003 tests)

Constraints:
- Zero downtime
- Backward compatibility during migration
- All 6 entities migrate together (atomic)
- No data loss

Design migration strategy with:
- Entity modification plan
- Data migration plan
- Rollback plan
- Test plan
The resulting specification encompassed 464 pages for the Contract entity group alone, with three key architectural decisions.

Decision 1: Dual-Write Migration Strategy

Phase 1: Add capsule_id, maintain old PK pattern
  • Add capsule_id field to entities
  • Still use old PK: TENANT#{tenant_id}#CONTRACT#{id}
  • Dual-write: Write to both old and new patterns
  • Queries use old pattern (no behavior change)
Phase 2: Flip queries to new pattern
  • Start querying new PK: TENANT#{tenant_id}#CAPSULE#{capsule_id}#CONTRACT#{id}
  • Still dual-write to both patterns
  • Monitor for issues
Phase 3: Drop old pattern
  • Stop writing to old pattern
  • Clean up old data
  • Remove dual-write code
Rationale: Gradual cutover prevents big-bang deployment risk.
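The three phases above can be sketched as a small routing decision in the repository layer. This is a hypothetical illustration: the `MigrationPhase` enum and helper names are invented for this sketch, and the real code presumably threads the phase through configuration.

```rust
// Hypothetical sketch of the three-phase dual-write cutover.
#[derive(Debug, Clone, Copy, PartialEq)]
enum MigrationPhase {
    DualWriteReadOld, // Phase 1: write both patterns, query old
    DualWriteReadNew, // Phase 2: write both patterns, query new
    NewOnly,          // Phase 3: write new, query new
}

// Which PK patterns receive writes in each phase.
fn write_targets(phase: MigrationPhase, old_pk: &str, new_pk: &str) -> Vec<String> {
    match phase {
        MigrationPhase::NewOnly => vec![new_pk.to_string()],
        _ => vec![old_pk.to_string(), new_pk.to_string()],
    }
}

// Which PK pattern serves queries in each phase.
fn read_target<'a>(phase: MigrationPhase, old_pk: &'a str, new_pk: &'a str) -> &'a str {
    match phase {
        MigrationPhase::DualWriteReadOld => old_pk,
        _ => new_pk,
    }
}

fn main() {
    let old_pk = "TENANT#t1#CONTRACT#c1";
    let new_pk = "TENANT#t1#CAPSULE#PRODUS#CONTRACT#c1";

    // Phase 1: both patterns written, old pattern still read.
    assert_eq!(write_targets(MigrationPhase::DualWriteReadOld, old_pk, new_pk).len(), 2);
    assert_eq!(read_target(MigrationPhase::DualWriteReadOld, old_pk, new_pk), old_pk);

    // Phase 3: only the new pattern remains.
    assert_eq!(write_targets(MigrationPhase::NewOnly, old_pk, new_pk), vec![new_pk.to_string()]);
}
```

Flipping a single phase value, rather than editing query code, is what makes each cutover step independently deployable and reversible.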

Decision 2: GSI Pattern Update

Old pattern:
GSI1PK: CAPSULE#{capsule_id}#ACCOUNT#{account_id}
Problem: The missing tenant_id prefix makes cross-tenant isolation unverifiable.
New pattern:
GSI1PK: TENANT#{tenant_id}#CAPSULE#{capsule_id}#ACCOUNT#{account_id}
Impact: All GSI helper methods need tenant_id parameter:
// Old signature
fn gsi1pk_for_account(capsule_id: CapsuleId, account_id: AccountId) -> String;

// New signature
fn gsi1pk_for_account(tenant_id: TenantId, capsule_id: CapsuleId, account_id: AccountId) -> String;
This signature change affects 127 call sites across the codebase.
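A minimal implementation of the new helper signature follows directly from the GSI pattern above. The newtype wrappers here are assumptions standing in for the platform's real ID types.

```rust
// Sketch of the updated GSI helper; the pattern matches the new
// GSI1PK format, but the ID types are illustrative newtypes.
pub struct TenantId(pub String);
pub struct CapsuleId(pub String);
pub struct AccountId(pub String);

fn gsi1pk_for_account(tenant_id: TenantId, capsule_id: CapsuleId, account_id: AccountId) -> String {
    format!(
        "TENANT#{}#CAPSULE#{}#ACCOUNT#{}",
        tenant_id.0, capsule_id.0, account_id.0
    )
}

fn main() {
    let pk = gsi1pk_for_account(
        TenantId("t1".into()),
        CapsuleId("PRODUS".into()),
        AccountId("a9".into()),
    );
    assert_eq!(pk, "TENANT#t1#CAPSULE#PRODUS#ACCOUNT#a9");
}
```

Because the tenant_id now leads the GSI key, every index query is forced to supply it, which is precisely what makes cross-tenant isolation verifiable at the key level.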

Decision 3: Test Data Migration

Challenge: 1,003 tests use a hard-coded tenant_id and no capsule_id.
Options:
  1. Update all tests to include capsule_id (manual)
  2. Create default capsule for tests (automated)
  3. Generate migration script for test data
Choice: Option 2, with fallback to Option 1 for critical tests.
Implementation:
  • Test helper: test_capsule() returns default CapsuleId for all tests
  • Critical tests (cross-capsule scenarios): Explicit capsule_id values
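The Option 2 helper can be sketched in a few lines. The default capsule value used here is an assumption for illustration; any fixed, deterministic value works, since the point is that all non-cross-capsule tests share one capsule.

```rust
// Hypothetical sketch of the shared test helper. The constant
// "TEST-DEFAULT" is illustrative, not the platform's real default.
#[derive(Debug, Clone, PartialEq)]
pub struct CapsuleId(pub String);

pub fn test_capsule() -> CapsuleId {
    CapsuleId("TEST-DEFAULT".to_string())
}

fn main() {
    // Every call returns the same capsule, so updating 1,003 tests
    // reduces to inserting `test_capsule()` at each call site rather
    // than hand-authoring a capsule_id per test.
    assert_eq!(test_capsule(), test_capsule());
}
```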

4. Execution Methodology

4.1 Template Entity Migration

The migration began with FinancialConfig, selected as the template entity for two reasons: it has the fewest call sites (21 versus 40+ for Contract) and no foreign key dependencies on other entities in the migration scope. This property makes it the lowest-risk candidate for pattern validation. The template migration produced one critical finding.
Issue: GSI Pattern Inconsistency
The entity was updated to:
PK: TENANT#{tenant_id}#CAPSULE#{capsule_id}#CONFIG#FINANCIAL
But GSI pattern still used old format:
GSI1PK: CAPSULE#{capsule_id}#INDUSTRY#{industry}  // Missing TENANT#
Root Cause: The migration plan presented the PK update and the GSI update in separate sections. The implementation agent applied the PK update but did not correlate it with the GSI section.
Impact: Without correction, GSI queries would not enforce tenant isolation, defeating the purpose of the migration.
Resolution: Updated the GSI pattern to include the TENANT# prefix. This finding was propagated to the specification for all subsequent entities before parallel implementation began.
This single finding justified the template-entity approach. Had all six entities been migrated in parallel without prior validation, the same GSI defect would have appeared in all six, requiring a second remediation pass across the entire scope.

4.2 Parallel Implementation

With the corrected pattern validated in the template entity, concurrent implementation sessions were launched to cover the five remaining entities.
Session 1 — Contract Entity Group:
  • ContractEntity
  • ContractLineItemEntity
  • ContractAmendmentEntity
  • RevenueScheduleEntryEntity
Session 2 — Access Entity Group:
  • AccessEntity
Shared files (error type definitions, API common code) required coordinated merge sequencing to avoid conflicts. The merge order followed the rule of simplest-first: FinancialConfig established patterns in shared files; subsequent entities adopted those established patterns.

4.3 AI Performance Assessment

The following table documents AI agent performance by task category during the migration.
Task Category          Scope                             AI Time  Manual Estimate  AI Efficacy
Call-site updates      127 sites, 7 function signatures  2 hours  8 hours          High
Test data updates      1,003 tests                       3 hours  2–3 weeks        High
Dependency ordering    Entity hierarchy analysis         N/A      N/A              Insufficient — human required
Atomic data migration  Concurrent-access strategy        N/A      N/A              Insufficient — human required
AI Agent Limitation: Dependency Ordering. The Contract entity group has a parent-child-grandchild hierarchy. The implementation agent's initial ordering was alphabetical, which is incorrect because child entities cannot be migrated before their parents. Human intervention was required to specify the correct sequence: ContractEntity first, ContractLineItemEntity second, then ContractAmendmentEntity and RevenueScheduleEntryEntity.

AI Agent Limitation: Concurrency Reasoning. The agent proposed a scan-read-write-delete sequence for data migration, which contains a race condition between the read and write operations. Human engineers specified the correct approach using atomic transactions.
// Atomic migration: the copy to the new PK and the delete of the old PK
// commit together, so a concurrent writer can never observe a half-moved
// item. Note that condition expressions belong on the Put/Delete
// builders, not on the TransactWriteItem itself.
client.transact_write_items()
    .transact_items(
        TransactWriteItem::builder()
            .put(
                Put::builder()
                    // table name + item keyed by the new PK
                    .condition_expression("attribute_not_exists(PK)")
                    .build()?,
            )
            .build(),
    )
    .transact_items(
        TransactWriteItem::builder()
            .delete(
                Delete::builder()
                    // table name + key for the old PK
                    .condition_expression("attribute_exists(PK)")
                    .build()?,
            )
            .build(),
    )
    .send()
    .await?;
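The dependency-ordering limitation can also be made concrete. A topological sort over the entity dependency graph reproduces the parent-first sequence that had to be supplied by hand; the (child, parent) edges below are inferred from the hierarchy described in this section, and the function itself is an illustrative sketch, not the project's tooling.

```rust
use std::collections::HashMap;

// Kahn's algorithm over (child, parent) edges: an entity becomes
// eligible for migration only once all of its parents are ordered.
fn migration_order(entities: &[&str], deps: &[(&str, &str)]) -> Vec<String> {
    // Count how many unmigrated parents each entity still has.
    let mut indegree: HashMap<&str, usize> =
        entities.iter().map(|e| (*e, 0)).collect();
    for (child, _parent) in deps {
        *indegree.get_mut(child).unwrap() += 1;
    }
    let mut ready: Vec<&str> = entities
        .iter()
        .copied()
        .filter(|e| indegree[e] == 0)
        .collect();
    let mut order = Vec::new();
    while let Some(next) = ready.pop() {
        order.push(next.to_string());
        // Releasing a parent unblocks its children.
        for (child, parent) in deps {
            if *parent == next {
                let d = indegree.get_mut(child).unwrap();
                *d -= 1;
                if *d == 0 {
                    ready.push(*child);
                }
            }
        }
    }
    order
}

fn main() {
    let entities = [
        "ContractEntity",
        "ContractLineItemEntity",
        "ContractAmendmentEntity",
        "RevenueScheduleEntryEntity",
    ];
    // (child, parent) edges inferred from the hierarchy above.
    let deps = [
        ("ContractLineItemEntity", "ContractEntity"),
        ("ContractAmendmentEntity", "ContractEntity"),
        ("RevenueScheduleEntryEntity", "ContractLineItemEntity"),
    ];
    let order = migration_order(&entities, &deps);
    let pos = |name: &str| order.iter().position(|e| e.as_str() == name).unwrap();
    // The parent always precedes its children, never alphabetical order.
    assert_eq!(order[0], "ContractEntity");
    assert!(pos("ContractEntity") < pos("ContractLineItemEntity"));
    assert!(pos("ContractLineItemEntity") < pos("RevenueScheduleEntryEntity"));
}
```

Encoding the dependency graph explicitly, as above, is one way to turn the human-supplied ordering into a reviewable input rather than an instruction buried in a session prompt.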

5. The Migration Pattern

The following code sequence represents the validated migration pattern established through the template entity process and subsequently applied to all remaining entities.

The Consolidated Migration Pattern

The consolidated pattern combines entity schema update, repository signature update, API handler extraction, and cross-capsule isolation verification in a single reference implementation.
// Step 1: Update entity definition
#[derive(DynamoDbEntity, Debug, Clone)]
#[capsule_isolated]  // Enforces capsule_id field + PK pattern
#[table_name = "platform_data"]
#[pk = "TENANT#{tenant_id}#CAPSULE#{capsule_id}#ENTITY#{entity_type}#{id}"]
#[sk = "METADATA"]
#[gsi1 = "TENANT#{tenant_id}#CAPSULE#{capsule_id}#GSI1#{field}"]
pub struct MyEntity {
    pub tenant_id: TenantId,
    pub capsule_id: CapsuleId,  // Required by #[capsule_isolated]
    pub id: EntityId,
    // ...
}

// Step 2: Update repository trait
pub trait MyEntityRepository {
    async fn get(&self, tenant_id: TenantId, capsule_id: CapsuleId, id: EntityId)
        -> Result<Option<MyEntity>>;
    async fn save(&self, tenant_id: TenantId, capsule_id: CapsuleId, entity: MyEntity)
        -> Result<()>;
}

// Step 3: Update API handler
pub async fn get_entity(
    Extension(context): Extension<RequestContext>,
    Path((tenant_id, entity_id)): Path<(TenantId, EntityId)>,
) -> Result<Json<EntityResponse>> {
    let capsule_id = context.capsule_id()?;  // Extract from context
    let entity = repo.get(tenant_id, capsule_id, entity_id).await?;
    Ok(Json(entity.into()))
}

// Step 4: Add negative test
#[tokio::test]
async fn test_cross_capsule_isolation() -> Result<()> {
    let repo = DynamoDbMyEntityRepository::new(/* ... */);

    // Create in PRODUS capsule; capture the id before save consumes the entity
    let entity = MyEntity::new(tenant_id(), capsule_id("PRODUS"), /* ... */);
    let entity_id = entity.id.clone();
    repo.save(tenant_id(), capsule_id("PRODUS"), entity).await?;

    // Try to fetch from DEVUS capsule
    let result = repo.get(tenant_id(), capsule_id("DEVUS"), entity_id).await?;

    // Should NOT find it (different capsule)
    assert!(result.is_none());
    Ok(())
}

6. Empirical Results

Entities migrated: 6
Files modified: 47
Lines changed:
  • Added: 1,247 lines
  • Removed: 721 lines
  • Net: +526 lines (additional isolation code)
Call sites updated: 127
Tests updated: 1,003
All tests passing.

7. Established Principles

The following principles emerged from the migration and apply to analogous multi-entity breaking change scenarios.
  • Comprehensive planning. Rule: Complete the full migration specification before writing any implementation code. Rationale: Ad-hoc migration of six entities would have required 12 migration passes to correct propagated defects.
  • Template entity first. Rule: Migrate the simplest entity completely and verify it before parallelizing. Rationale: Defects found in the template entity are corrected once; defects found after parallelization are corrected N times.
  • Merge order discipline. Rule: Identify shared files upfront; establish patterns in the first-merged entity; subsequent entities adopt those patterns. Rationale: Parallel sessions generating conflicting changes to shared files require manual conflict resolution that negates velocity gains.
  • Tests as part of migration. Rule: Migrate entity tests concurrently with entity schemas, not afterward. Rationale: Tests are the primary evidence that the migration succeeded; deferring them defers verification.
  • Durable migration artifacts. Rule: Preserve migration specifications alongside the codebase in version control. Rationale: Future engineers, auditors, and onboarding personnel require documented rationale for breaking change decisions.

8. Recommendations

  1. Require comprehensive migration specifications before implementation begins. For breaking changes affecting more than two entities or more than 50 call sites, a detailed migration plan is not optional. The planning investment consistently returns 5 to 10 times its cost in prevented rework.
  2. Designate a template entity for every multi-entity migration. Select the entity with the fewest dependencies and call sites. Complete its migration and verification fully before parallelizing. Treat any defects found in the template entity as specification defects requiring plan correction before proceeding.
  3. Supply explicit dependency ordering to AI implementation agents. AI agents do not independently infer entity hierarchies. Engineering leads must analyze the dependency graph and provide migration sequencing as an explicit input to implementation sessions.
  4. Require human review of all data migration plans involving concurrent access. AI agents produce logically correct migration sequences for single-writer scenarios but do not account for distributed concurrency. All data migration plans involving live systems must include human review of atomicity and race condition handling.
  5. Apply verification to every migrated entity without exception. Verification shortcuts based on confidence in established patterns have been demonstrated to allow defects to reach staging environments. A consistent verification checklist applied to every entity is the only reliable quality gate.
  6. Preserve migration specifications as durable artifacts. Migration plans stored alongside the codebase provide institutional knowledge for future engineers, serve as the basis for post-migration audits, and accelerate onboarding for engineers joining after the migration is complete.

9. Forward-Looking Considerations

The migration framework described in this paper addresses the present state of AI-assisted development, in which agents excel at systematic pattern application but require human guidance for dependency analysis, concurrency strategy, and cross-entity coordination. As AI reasoning capabilities mature, the boundary between human and AI responsibility in migration planning will shift. However, the structural requirement for comprehensive upfront specification before implementation will persist, regardless of which party authors it. Organizations that institutionalize rigorous migration planning practices now will be able to leverage more capable AI agents as they become available, without accumulating the architectural debt that results from ad-hoc migration approaches. The compounding cost of unplanned breaking changes—measured in this study as 18 times the duration of a planned migration—provides a durable financial argument for sustained investment in migration methodology.
Disclaimer: All content represents personal learning from personal projects. Code examples are sanitized and generalized. No proprietary information is shared. Opinions are my own and do not reflect my employer’s views.