Executive Summary

Large-scale breaking changes in distributed data systems represent a category of engineering risk that ad-hoc approaches consistently underestimate. This paper documents a structured migration methodology applied to a six-entity capsule isolation enforcement initiative affecting 1,003 tests, 47 API handlers, and 127 call sites across a production multi-tenant platform. The systematic approach—comprising comprehensive planning via a long-context reasoning model, template-entity pattern validation, parallel AI-assisted implementation, and staged verification—completed the migration in 32 hours against a manual estimate of 160 to 240 hours, yielding a 5 to 7x time reduction. A critical finding is that the planning investment of 8 hours produced a 464-page migration specification that prevented six production-class defects. Organizations executing breaking changes without upfront comprehensive planning incur compounding rework costs that frequently exceed the original migration effort by an order of magnitude. This paper presents the decision framework, execution methodology, empirical metrics, and replicable patterns for engineering teams facing analogous migration challenges.

Key Findings

  • Comprehensive upfront planning eliminates downstream rework. An 8-hour planning investment produced a migration specification that identified the critical GSI pattern defect before implementation began, preventing the defect from propagating across all six entities.
  • Template-entity validation is the highest-leverage quality gate in multi-entity migrations. Migrating the simplest entity first and verifying it completely before parallelizing catches pattern defects at unit cost rather than multiplied cost.
  • AI-assisted parallel implementation achieved a 5 to 7x time reduction relative to manual sequential migration, primarily through systematic call-site updates and bulk test data modifications.
  • AI agents require explicit dependency ordering for hierarchical entity migrations. Without human-specified migration sequences, agents default to alphabetical or arbitrary ordering that violates parent-child data relationships.
  • AI-generated data migration plans are insufficient for concurrent-access scenarios. Human engineers must supply atomic transaction strategies; AI agents do not independently reason about race conditions in distributed data stores.
  • Verification must be applied to every entity without exception. Skipping verification for entities presumed to follow an established template resulted in a staging-environment defect that required 2 additional hours to diagnose and resolve.

1. Problem Statement: Data Contamination Across Isolation Boundaries

1.1 The Production Anomaly

During the third week of integration testing, monitoring logs surfaced a critical isolation violation.
[WARN] Capsule isolation violation detected
Entity: FinancialConfig
Event: FinancialConfigUpdated
Issue: Missing capsule_id in partition key
Impact: DEVUS test data appearing in PRODUS production queries
A financial configuration record created in a development environment appeared in production query results because the entity was scoped at the tenant level rather than the capsule level. For financial services workloads, this constitutes a compliance violation with direct implications for SOC 2 audit outcomes.

1.2 Scope Analysis

Systematic analysis revealed that the defect was not isolated to a single entity. Six entities shared the same structural deficiency.
Entity             Scope    Issue                                    Risk
FinancialConfig    Tenant   Test configs in prod queries             High
Contract           Tenant   Test contracts in revenue reports        Critical
ContractLineItem   Tenant   Test line items in billing               Critical
ContractAmendment  Tenant   Test amendments in audit trail           High
RevenueSchedule    Tenant   Test revenue in financial reporting      Critical
AccessEntity       Tenant   Test access grants in security queries   Medium
The aggregate migration scope comprised 6 entities, 21 repository methods per entity, 47 API handlers, and 1,003 tests, with a manual effort estimate of 4 to 6 weeks.

1.3 Root Cause

The vulnerable entity definition illustrates the structural problem.
// WRONG: Tenant-scoped only
#[derive(DynamoDbEntity)]
#[pk = "TENANT#{tenant_id}#CONFIG#FINANCIAL"]
pub struct FinancialConfigEntity {
    pub tenant_id: TenantId,
    // capsule_id missing!
    pub industry: String,
    // ...
}
The absence of capsule_id in the partition key meant that query operations could not enforce isolation between environments sharing the same tenant identifier.
// Queries for "Technology" industry returned results from BOTH capsules
let configs = repo.query_by_industry(tenant_id, "Technology").await?;
// Returns: [DEVUS test config, PRODUS production config]
Compliance Implication: The commingling of test and production data in query results constitutes a breach of the security boundary between environments. In regulated industries, this pattern fails SOC 2 Type II controls and may trigger audit findings. The defect must be remediated comprehensively, not incrementally, because a partially enforced security boundary provides no meaningful protection.
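The key-pattern difference at the heart of the defect can be sketched in isolation. The following is a minimal illustration, not the platform's actual code: the newtype wrappers and helper functions are hypothetical, but the PK formats mirror those shown above.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct TenantId(pub String);

#[derive(Debug, Clone, PartialEq)]
pub struct CapsuleId(pub String);

// Old, tenant-scoped pattern: DEVUS and PRODUS collide in one partition.
fn tenant_scoped_pk(tenant: &TenantId) -> String {
    format!("TENANT#{}#CONFIG#FINANCIAL", tenant.0)
}

// New, capsule-scoped pattern: each capsule gets its own partition.
fn capsule_scoped_pk(tenant: &TenantId, capsule: &CapsuleId) -> String {
    format!("TENANT#{}#CAPSULE#{}#CONFIG#FINANCIAL", tenant.0, capsule.0)
}

fn main() {
    let tenant = TenantId("t-123".into());
    let devus = CapsuleId("DEVUS".into());
    let produs = CapsuleId("PRODUS".into());

    // Old pattern: the key is identical regardless of capsule, so a
    // query scoped to PRODUS can still surface DEVUS rows.
    assert_eq!(tenant_scoped_pk(&tenant), "TENANT#t-123#CONFIG#FINANCIAL");

    // New pattern: the keys diverge, so the partition itself enforces
    // the isolation boundary.
    assert_ne!(
        capsule_scoped_pk(&tenant, &devus),
        capsule_scoped_pk(&tenant, &produs)
    );
}
```

Because the capsule identifier is baked into the partition key, no application-level filtering is needed: a query against the PRODUS key simply cannot match DEVUS items.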

2. Migration Strategy Selection

2.1 Option Evaluation

Two migration strategies were evaluated against the constraints of zero downtime, no data loss, backward compatibility during transition, and atomic completion across all six entities.

Option A: Incremental Migration. This approach would migrate one entity at a time, deploying after each entity completion over a six-week period. It was rejected because a security boundary in a partially enforced state offers no isolation guarantee. Additionally, conditional query logic to determine whether each entity is capsule-scoped introduces complexity proportional to the number of migration stages, and six deployment cycles create six independent risk windows.

Option B: Coordinated Comprehensive Migration (Selected). This approach plans all six entities before implementation begins, executes in parallel, and deploys in a single coordinated release. The breaking change is absorbed once rather than distributed across six increments.

Selection Rationale: Capsule isolation is a security boundary, and the property is binary: either enforced or not. Incremental enforcement provides a false sense of security and introduces query complexity that must subsequently be removed. A coordinated migration accepts higher upfront planning cost in exchange for a single, verifiable transition.

3. Planning Methodology

3.1 Specification Development

A comprehensive migration specification was produced through a long-context reasoning session with the following input.
Planning session for capsule isolation migration.

Context:
- 6 entities currently tenant-scoped, need capsule scope
- ADR-0010 defines capsule isolation requirements
- Migration is breaking change (PK patterns change)
- Cannot break existing data

Requirements:
1. Migrate entity schemas (add capsule_id field)
2. Update PK/SK patterns (TENANT#...#CAPSULE#...)
3. Update all repositories (add capsule_id parameters)
4. Update all API handlers (pass capsule_id)
5. Migrate existing DynamoDB data
6. Update all tests (1,003 tests)

Constraints:
- Zero downtime
- Backward compatibility during migration
- All 6 entities migrate together (atomic)
- No data loss

Design migration strategy with:
- Entity modification plan
- Data migration plan
- Rollback plan
- Test plan
The resulting specification encompassed 464 pages for the Contract entity group alone, with three key architectural decisions.

Decision 1: Dual-Write Migration Strategy

Phase 1: Add capsule_id, maintain old PK pattern
  • Add capsule_id field to entities
  • Still use old PK: TENANT#{tenant_id}#CONTRACT#{id}
  • Dual-write: Write to both old and new patterns
  • Queries use old pattern (no behavior change)
Phase 2: Flip queries to new pattern
  • Start querying new PK: TENANT#{tenant_id}#CAPSULE#{capsule_id}#CONTRACT#{id}
  • Still dual-write to both patterns
  • Monitor for issues
Phase 3: Drop old pattern
  • Stop writing to old pattern
  • Clean up old data
  • Remove dual-write code
Rationale: Gradual cutover prevents big-bang deployment risk.
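The three phases above can be sketched as a small routing decision in the repository layer. This is a hypothetical illustration: the `MigrationPhase` enum and helper names are invented for this sketch, and the real code presumably threads the phase through configuration.

```rust
// Hypothetical sketch of the three-phase dual-write cutover.
#[derive(Debug, Clone, Copy, PartialEq)]
enum MigrationPhase {
    DualWriteReadOld, // Phase 1: write both patterns, query old
    DualWriteReadNew, // Phase 2: write both patterns, query new
    NewOnly,          // Phase 3: write new, query new
}

// Which PK patterns receive writes in each phase.
fn write_targets(phase: MigrationPhase, old_pk: &str, new_pk: &str) -> Vec<String> {
    match phase {
        MigrationPhase::NewOnly => vec![new_pk.to_string()],
        _ => vec![old_pk.to_string(), new_pk.to_string()],
    }
}

// Which PK pattern serves queries in each phase.
fn read_target<'a>(phase: MigrationPhase, old_pk: &'a str, new_pk: &'a str) -> &'a str {
    match phase {
        MigrationPhase::DualWriteReadOld => old_pk,
        _ => new_pk,
    }
}

fn main() {
    let old_pk = "TENANT#t1#CONTRACT#c1";
    let new_pk = "TENANT#t1#CAPSULE#PRODUS#CONTRACT#c1";

    // Phase 1: both patterns written, old pattern still read.
    assert_eq!(write_targets(MigrationPhase::DualWriteReadOld, old_pk, new_pk).len(), 2);
    assert_eq!(read_target(MigrationPhase::DualWriteReadOld, old_pk, new_pk), old_pk);

    // Phase 3: only the new pattern remains.
    assert_eq!(write_targets(MigrationPhase::NewOnly, old_pk, new_pk), vec![new_pk.to_string()]);
}
```

Flipping a single phase value, rather than editing query code, is what makes each cutover step independently deployable and reversible.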

Decision 2: GSI Pattern Update

Old pattern:
GSI1PK: CAPSULE#{capsule_id}#ACCOUNT#{account_id}
Problem: The missing tenant_id prefix makes cross-tenant isolation unverifiable.
New pattern:
GSI1PK: TENANT#{tenant_id}#CAPSULE#{capsule_id}#ACCOUNT#{account_id}
Impact: All GSI helper methods need tenant_id parameter:
// Old signature
fn gsi1pk_for_account(capsule_id: CapsuleId, account_id: AccountId) -> String;

// New signature
fn gsi1pk_for_account(tenant_id: TenantId, capsule_id: CapsuleId, account_id: AccountId) -> String;
This signature change affects 127 call sites across the codebase.
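A minimal implementation of the new helper signature follows directly from the GSI pattern above. The newtype wrappers here are assumptions standing in for the platform's real ID types.

```rust
// Sketch of the updated GSI helper; the pattern matches the new
// GSI1PK format, but the ID types are illustrative newtypes.
pub struct TenantId(pub String);
pub struct CapsuleId(pub String);
pub struct AccountId(pub String);

fn gsi1pk_for_account(tenant_id: TenantId, capsule_id: CapsuleId, account_id: AccountId) -> String {
    format!(
        "TENANT#{}#CAPSULE#{}#ACCOUNT#{}",
        tenant_id.0, capsule_id.0, account_id.0
    )
}

fn main() {
    let pk = gsi1pk_for_account(
        TenantId("t1".into()),
        CapsuleId("PRODUS".into()),
        AccountId("a9".into()),
    );
    assert_eq!(pk, "TENANT#t1#CAPSULE#PRODUS#ACCOUNT#a9");
}
```

Because the tenant_id now leads the GSI key, every index query is forced to supply it, which is precisely what makes cross-tenant isolation verifiable at the key level.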

Decision 3: Test Data Migration

Challenge: 1,003 tests use a hard-coded tenant_id and no capsule_id.
Options:
  1. Update all tests to include capsule_id (manual)
  2. Create default capsule for tests (automated)
  3. Generate migration script for test data
Choice: Option 2, with fallback to Option 1 for critical tests.
Implementation:
  • Test helper: test_capsule() returns default CapsuleId for all tests
  • Critical tests (cross-capsule scenarios): Explicit capsule_id values
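The Option 2 helper can be sketched in a few lines. The default capsule value used here is an assumption for illustration; any fixed, deterministic value works, since the point is that all non-cross-capsule tests share one capsule.

```rust
// Hypothetical sketch of the shared test helper. The constant
// "TEST-DEFAULT" is illustrative, not the platform's real default.
#[derive(Debug, Clone, PartialEq)]
pub struct CapsuleId(pub String);

pub fn test_capsule() -> CapsuleId {
    CapsuleId("TEST-DEFAULT".to_string())
}

fn main() {
    // Every call returns the same capsule, so updating 1,003 tests
    // reduces to inserting `test_capsule()` at each call site rather
    // than hand-authoring a capsule_id per test.
    assert_eq!(test_capsule(), test_capsule());
}
```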

4. Execution Methodology

4.1 Template Entity Migration

The migration began with FinancialConfig, selected as the template entity for two reasons: it has the fewest call sites (21 versus 40+ for Contract) and no foreign key dependencies on other entities in the migration scope. This property makes it the lowest-risk candidate for pattern validation. The template migration produced one critical finding.
Issue: GSI Pattern Inconsistency
The entity was updated to:
PK: TENANT#{tenant_id}#CAPSULE#{capsule_id}#CONFIG#FINANCIAL
But GSI pattern still used old format:
GSI1PK: CAPSULE#{capsule_id}#INDUSTRY#{industry}  // Missing TENANT#
Root Cause: The migration plan presented the PK update and the GSI update in separate sections. The implementation agent applied the PK update but did not correlate it with the GSI section.
Impact: Without correction, GSI queries would not enforce tenant isolation, defeating the purpose of the migration.
Resolution: Updated the GSI pattern to include the TENANT# prefix. This finding was propagated to the specification for all subsequent entities before parallel implementation began.
This single finding justified the template-entity approach. Had all six entities been migrated in parallel without prior validation, the same GSI defect would have appeared in all six, requiring a second remediation pass across the entire scope.

4.2 Parallel Implementation

With the corrected pattern validated in the template entity, concurrent implementation sessions were launched to cover the five remaining entities.
Session 1 — Contract Entity Group:
  • ContractEntity
  • ContractLineItemEntity
  • ContractAmendmentEntity
  • RevenueScheduleEntryEntity
Session 2 — Access Entity Group:
  • AccessEntity
Shared files (error type definitions, API common code) required coordinated merge sequencing to avoid conflicts. The merge order followed the rule of simplest-first: FinancialConfig established patterns in shared files; subsequent entities adopted those established patterns.

4.3 AI Performance Assessment

The following table documents AI agent performance by task category during the migration.
Task Category          Scope                             AI Time  Manual Estimate  AI Efficacy
Call-site updates      127 sites, 7 function signatures  2 hours  8 hours          High
Test data updates      1,003 tests                       3 hours  2–3 weeks        High
Dependency ordering    Entity hierarchy analysis         N/A      N/A              Insufficient — human required
Atomic data migration  Concurrent-access strategy        N/A      N/A              Insufficient — human required
AI Agent Limitation: Dependency Ordering. The Contract entity group has a parent-child-grandchild hierarchy. The implementation agent's initial ordering was alphabetical, which is incorrect because child entities cannot be migrated before their parents. Human intervention was required to specify the correct sequence: ContractEntity first, ContractLineItemEntity second, then ContractAmendmentEntity and RevenueScheduleEntryEntity.

AI Agent Limitation: Concurrency Reasoning. The agent proposed a scan-read-write-delete sequence for data migration, which contains a race condition between the read and write operations. Human engineers specified the correct approach using atomic transactions.
// Atomic migration: the copy to the new PK and the delete of the old PK
// commit together, so a concurrent writer can never observe a half-moved
// item. Note that condition expressions belong on the Put/Delete
// builders, not on the TransactWriteItem itself.
client.transact_write_items()
    .transact_items(
        TransactWriteItem::builder()
            .put(
                Put::builder()
                    // table name + item keyed by the new PK
                    .condition_expression("attribute_not_exists(PK)")
                    .build()?,
            )
            .build(),
    )
    .transact_items(
        TransactWriteItem::builder()
            .delete(
                Delete::builder()
                    // table name + key for the old PK
                    .condition_expression("attribute_exists(PK)")
                    .build()?,
            )
            .build(),
    )
    .send()
    .await?;
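The dependency-ordering limitation can also be made concrete. A topological sort over the entity dependency graph reproduces the parent-first sequence that had to be supplied by hand; the (child, parent) edges below are inferred from the hierarchy described in this section, and the function itself is an illustrative sketch, not the project's tooling.

```rust
use std::collections::HashMap;

// Kahn's algorithm over (child, parent) edges: an entity becomes
// eligible for migration only once all of its parents are ordered.
fn migration_order(entities: &[&str], deps: &[(&str, &str)]) -> Vec<String> {
    // Count how many unmigrated parents each entity still has.
    let mut indegree: HashMap<&str, usize> =
        entities.iter().map(|e| (*e, 0)).collect();
    for (child, _parent) in deps {
        *indegree.get_mut(child).unwrap() += 1;
    }
    let mut ready: Vec<&str> = entities
        .iter()
        .copied()
        .filter(|e| indegree[e] == 0)
        .collect();
    let mut order = Vec::new();
    while let Some(next) = ready.pop() {
        order.push(next.to_string());
        // Releasing a parent unblocks its children.
        for (child, parent) in deps {
            if *parent == next {
                let d = indegree.get_mut(child).unwrap();
                *d -= 1;
                if *d == 0 {
                    ready.push(*child);
                }
            }
        }
    }
    order
}

fn main() {
    let entities = [
        "ContractEntity",
        "ContractLineItemEntity",
        "ContractAmendmentEntity",
        "RevenueScheduleEntryEntity",
    ];
    // (child, parent) edges inferred from the hierarchy above.
    let deps = [
        ("ContractLineItemEntity", "ContractEntity"),
        ("ContractAmendmentEntity", "ContractEntity"),
        ("RevenueScheduleEntryEntity", "ContractLineItemEntity"),
    ];
    let order = migration_order(&entities, &deps);
    let pos = |name: &str| order.iter().position(|e| e.as_str() == name).unwrap();
    // The parent always precedes its children, never alphabetical order.
    assert_eq!(order[0], "ContractEntity");
    assert!(pos("ContractEntity") < pos("ContractLineItemEntity"));
    assert!(pos("ContractLineItemEntity") < pos("RevenueScheduleEntryEntity"));
}
```

Encoding the dependency graph explicitly, as above, is one way to turn the human-supplied ordering into a reviewable input rather than an instruction buried in a session prompt.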

5. The Migration Pattern

The following code sequence represents the validated migration pattern established through the template entity process and subsequently applied to all remaining entities.

The Consolidated Migration Pattern

The consolidated pattern combines entity schema update, repository signature update, API handler extraction, and cross-capsule isolation verification in a single reference implementation.
// Step 1: Update entity definition
#[derive(DynamoDbEntity, Debug, Clone)]
#[capsule_isolated]  // Enforces capsule_id field + PK pattern
#[table_name = "platform_data"]
#[pk = "TENANT#{tenant_id}#CAPSULE#{capsule_id}#ENTITY#{entity_type}#{id}"]
#[sk = "METADATA"]
#[gsi1 = "TENANT#{tenant_id}#CAPSULE#{capsule_id}#GSI1#{field}"]
pub struct MyEntity {
    pub tenant_id: TenantId,
    pub capsule_id: CapsuleId,  // Required by #[capsule_isolated]
    pub id: EntityId,
    // ...
}

// Step 2: Update repository trait
pub trait MyEntityRepository {
    async fn get(&self, tenant_id: TenantId, capsule_id: CapsuleId, id: EntityId)
        -> Result<Option<MyEntity>>;
    async fn save(&self, tenant_id: TenantId, capsule_id: CapsuleId, entity: MyEntity)
        -> Result<()>;
}

// Step 3: Update API handler
pub async fn get_entity(
    Extension(context): Extension<RequestContext>,
    Path((tenant_id, entity_id)): Path<(TenantId, EntityId)>,
) -> Result<Json<EntityResponse>> {
    let capsule_id = context.capsule_id()?;  // Extract from context
    let entity = repo.get(tenant_id, capsule_id, entity_id).await?;
    Ok(Json(entity.into()))
}

// Step 4: Add negative test
#[tokio::test]
async fn test_cross_capsule_isolation() -> Result<()> {
    let repo = DynamoDbMyEntityRepository::new(/* ... */);

    // Create in PRODUS capsule; capture the id before save consumes the entity
    let entity = MyEntity::new(tenant_id(), capsule_id("PRODUS"), /* ... */);
    let entity_id = entity.id.clone();
    repo.save(tenant_id(), capsule_id("PRODUS"), entity).await?;

    // Try to fetch from DEVUS capsule
    let result = repo.get(tenant_id(), capsule_id("DEVUS"), entity_id).await?;

    // Should NOT find it (different capsule)
    assert!(result.is_none());
    Ok(())
}

6. Empirical Results

Entities migrated: 6
Files modified: 47
Lines changed:
  • Added: 1,247 lines
  • Removed: 721 lines
  • Net: +526 lines (additional isolation code)
Call sites updated: 127
Tests updated: 1,003
All tests passing.

7. Established Principles

The following principles emerged from the migration and apply to analogous multi-entity breaking change scenarios.
  • Comprehensive planning. Rule: Complete the full migration specification before writing any implementation code. Rationale: Ad-hoc migration of six entities would have required 12 migration passes to correct propagated defects.
  • Template entity first. Rule: Migrate the simplest entity completely and verify it before parallelizing. Rationale: Defects found in the template entity are corrected once; defects found after parallelization are corrected N times.
  • Merge order discipline. Rule: Identify shared files upfront; establish patterns in the first-merged entity; subsequent entities adopt those patterns. Rationale: Parallel sessions generating conflicting changes to shared files require manual conflict resolution that negates velocity gains.
  • Tests as part of migration. Rule: Migrate entity tests concurrently with entity schemas, not afterward. Rationale: Tests are the primary evidence that the migration succeeded; deferring them defers verification.
  • Durable migration artifacts. Rule: Preserve migration specifications alongside the codebase in version control. Rationale: Future engineers, auditors, and onboarding personnel require documented rationale for breaking change decisions.

8. Recommendations

  1. Require comprehensive migration specifications before implementation begins. For breaking changes affecting more than two entities or more than 50 call sites, a detailed migration plan is not optional. The planning investment consistently returns 5 to 10 times its cost in prevented rework.
  2. Designate a template entity for every multi-entity migration. Select the entity with the fewest dependencies and call sites. Complete its migration and verification fully before parallelizing. Treat any defects found in the template entity as specification defects requiring plan correction before proceeding.
  3. Supply explicit dependency ordering to AI implementation agents. AI agents do not independently infer entity hierarchies. Engineering leads must analyze the dependency graph and provide migration sequencing as an explicit input to implementation sessions.
  4. Require human review of all data migration plans involving concurrent access. AI agents produce logically correct migration sequences for single-writer scenarios but do not account for distributed concurrency. All data migration plans involving live systems must include human review of atomicity and race condition handling.
  5. Apply verification to every migrated entity without exception. Verification shortcuts based on confidence in established patterns have been demonstrated to allow defects to reach staging environments. A consistent verification checklist applied to every entity is the only reliable quality gate.
  6. Preserve migration specifications as durable artifacts. Migration plans stored alongside the codebase provide institutional knowledge for future engineers, serve as the basis for post-migration audits, and accelerate onboarding for engineers joining after the migration is complete.

9. Forward-Looking Considerations

The migration framework described in this paper addresses the present state of AI-assisted development, in which agents excel at systematic pattern application but require human guidance for dependency analysis, concurrency strategy, and cross-entity coordination. As AI reasoning capabilities mature, the boundary between human and AI responsibility in migration planning will shift. However, the structural requirement for comprehensive upfront specification before implementation will persist, regardless of which party authors it. Organizations that institutionalize rigorous migration planning practices now will be able to leverage more capable AI agents as they become available, without accumulating the architectural debt that results from ad-hoc migration approaches. The compounding cost of unplanned breaking changes—measured in this study as 18 times the duration of a planned migration—provides a durable financial argument for sustained investment in migration methodology.
Disclaimer: All content represents personal learning from personal projects. Code examples are sanitized and generalized. No proprietary information is shared. Opinions are my own and do not reflect my employer’s views.