Building Workflows That Can't Half-Fail

Executive Summary

Multi-step business processes in distributed systems are vulnerable to partial failure: a sequence of operations that succeeds through step N and fails at step N+1 leaves the system in an inconsistent state. The saga pattern addresses this by pairing each forward step with a compensating transaction that restores prior state if a downstream step fails. This analysis documents the design and implementation of a saga-based workflow system for a SaaS subscription management platform, including the AI-assisted generation of orchestration boilerplate, the SagaStep macro abstraction that eliminated approximately 150 lines of boilerplate per workflow, and the critical production hardening requirement — idempotency in compensation steps — that AI tooling did not address autonomously. Implementations that deploy saga patterns without idempotent compensation will encounter data integrity failures on retry.

Key Findings

Partial failure is the default failure mode for multi-step distributed workflows. Without compensation logic, a failure at any step after the first produces inconsistent state that manual intervention cannot reliably resolve.
The saga pattern makes consistency guarantees explicit and enforceable. Forward steps and their corresponding compensation steps are defined structurally, not as scattered error handling.
AI tooling generates correct forward and compensation flows from business rule specifications. The architectural pattern and its boilerplate are well within AI generation capability; business logic definition is not.
AI tooling does not autonomously address retry idempotency in compensation steps. This is the primary production hardening gap in AI-generated saga implementations and requires explicit human design.
Macro-based boilerplate elimination scales proportionally with workflow count. At approximately 150 lines of boilerplate saved per saga, the leverage increases as the workflow library grows.
Separation of business logic from orchestration is the foundational design principle. Pure functions implementing domain rules compose cleanly with saga orchestration; mixed implementations do not.

1. The Partial Failure Problem

A subscription management workflow requiring bundle modification illustrates the problem class. The operation requires five sequential steps:

Validate the modification request against business rules.
Calculate adjusted pricing based on the resulting configuration.
Create an approval record if the modification exceeds defined thresholds.
Execute the modification operation in the persistence layer.
Emit domain events to the audit trail.

In a naive implementation, these steps execute sequentially without coordination:

pub async fn unbundle_product(
    opportunity_id: Uuid,
    items_to_remove: Vec<Uuid>,
) -> Result<()> {
    // Step 1: Validate
    validate_unbundle(&opportunity_id, &items_to_remove)?;

    // Step 2: Calculate pricing
    let new_price = calculate_pricing(&opportunity_id, &items_to_remove)?;

    // Step 3: Create approval if needed
    let approval_id = if requires_approval(&items_to_remove) {
        Some(create_approval(&opportunity_id).await?)  // ❌ What if this succeeds...
    } else {
        None
    };

    // Step 4: Execute unbundle
    execute_unbundle(&opportunity_id, &items_to_remove).await?;  // ❌ ...but this fails?

    // Step 5: Emit event
    emit_unbundled_event(&opportunity_id).await?;

    Ok(())
}
// Result: Approval exists but unbundle never happened. Corruption!

The failure scenario that motivates the saga pattern is precise:

Step 1: ✅ Validation passed
Step 2: ✅ Pricing calculated
Step 3: ✅ Approval created
Step 4: ❌ DynamoDB write failed
Result: Approval exists but operation didn't execute

When step 4 fails, the brittle implementation leaves an approval record referencing an operation that did not occur. The saga implementation executes compensation steps in reverse order, restoring system consistency:

Forward:       Validate → Price → Approve → Execute → Emit
                                             ↑
                                             FAIL!

Compensation:  Validate ← Revert Price ← Cancel Approval
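
The reverse-order rollback shown in the diagram can be sketched as a small runner. This is a minimal, synchronous illustration using only the standard library, not the platform's orchestrator: the `Step` struct, the `run_saga` function, and the logging step functions are hypothetical names introduced for this example, and step 4 is rigged to fail so that the compensation order is visible.

```rust
/// Shared log so the example can show the order in which things ran.
type Log = Vec<String>;
type Forward = fn(&mut Log) -> Result<(), String>;
type Undo = fn(&mut Log);

/// One saga step: a forward action plus an optional compensating action.
struct Step {
    name: &'static str,
    forward: Forward,
    compensate: Option<Undo>,
}

/// Runs steps in order. On the first failure, compensates the steps that
/// already completed, newest first, then returns the original error.
fn run_saga(steps: &[Step], log: &mut Log) -> Result<(), String> {
    for (index, step) in steps.iter().enumerate() {
        if let Err(err) = (step.forward)(log) {
            log.push(format!("FAILED {}", step.name));
            for done in steps[..index].iter().rev() {
                if let Some(undo) = done.compensate {
                    undo(log);
                }
            }
            return Err(err);
        }
    }
    Ok(())
}

// Hypothetical step bodies: each one only records that it ran.
fn validate(log: &mut Log) -> Result<(), String> { log.push("validate".into()); Ok(()) }
fn price(log: &mut Log) -> Result<(), String> { log.push("price".into()); Ok(()) }
fn approve(log: &mut Log) -> Result<(), String> { log.push("approve".into()); Ok(()) }
fn execute(_log: &mut Log) -> Result<(), String> { Err("simulated write failure".into()) }
fn revert_price(log: &mut Log) { log.push("revert price".into()); }
fn cancel_approval(log: &mut Log) { log.push("cancel approval".into()); }

/// Runs the first four workflow steps with step 4 rigged to fail.
fn failing_unbundle_run() -> (Result<(), String>, Log) {
    let steps = [
        Step { name: "validate", forward: validate, compensate: None },
        Step { name: "price", forward: price, compensate: Some(revert_price as Undo) },
        Step { name: "approve", forward: approve, compensate: Some(cancel_approval as Undo) },
        Step { name: "execute", forward: execute, compensate: None },
    ];
    let mut log = Log::new();
    let result = run_saga(&steps, &mut log);
    (result, log)
}
```

Calling `failing_unbundle_run()` returns an error together with a log in which the approval is canceled before the price is reverted, matching the right-to-left compensation row in the diagram.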

2. Saga State Machine Design

The complete state machine for the bundle modification workflow is as follows:

                    ┌─────────────┐
                    │  Initiated  │
                    └──────┬──────┘
                           │
                    ┌──────▼──────┐
                    │  Validated  │
                    └──────┬──────┘
                           │
                    ┌──────▼──────┐
                    │ Priced      │ ◄──── Compensation: Revert Price
                    └──────┬──────┘
                           │
                    ┌──────▼──────┐
                    │ Approved    │ ◄──── Compensation: Cancel Approval
                    └──────┬──────┘
                           │
                    ┌──────▼──────┐
                    │  Executed   │
                    └──────┬──────┘
                           │
                    ┌──────▼──────┐
                    │  Completed  │
                    └─────────────┘

If any step fails:
    Run compensation for previous steps (in reverse order)
    Transition to Failed state
    Database remains consistent

Each state with a compensation path maintains the data required to execute that compensation. The original_price and approval_id fields on UnbundleSaga exist specifically to enable compensation; they carry no business logic function.
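
One way to make "each state carries its compensation data" concrete is an enum whose variants hold exactly what rollback needs. The sketch below is an assumption-laden illustration, not the platform's types: `SagaState`, `Compensation`, and the integer price and identifier fields are hypothetical simplifications (the article's code uses `Uuid` and `Decimal`).

```rust
/// Saga states from the diagram. The two states annotated with a
/// compensation path carry the data that compensation will need.
#[derive(Debug, Clone, PartialEq)]
enum SagaState {
    Initiated,
    Validated,
    Priced { original_price_cents: u64 },
    Approved { original_price_cents: u64, approval_id: Option<u64> },
    Executed,
    Completed,
    Failed { reason: String },
}

/// A compensating action together with the data it operates on.
#[derive(Debug, PartialEq)]
enum Compensation {
    RevertPrice { to_cents: u64 },
    CancelApproval { approval_id: u64 },
}

impl SagaState {
    /// Compensations to run, newest first, if the next step fails while
    /// the saga is in this state.
    fn compensations(&self) -> Vec<Compensation> {
        match self {
            SagaState::Priced { original_price_cents } => {
                vec![Compensation::RevertPrice { to_cents: *original_price_cents }]
            }
            SagaState::Approved { original_price_cents, approval_id } => {
                let mut plan = Vec::new();
                // The approval step may have been skipped (below threshold).
                if let Some(id) = approval_id {
                    plan.push(Compensation::CancelApproval { approval_id: *id });
                }
                plan.push(Compensation::RevertPrice { to_cents: *original_price_cents });
                plan
            }
            // The diagram shows no compensation path for the other states.
            _ => Vec::new(),
        }
    }
}
```

Because the rollback data lives inside the state value, a compensation plan can be derived from the current state alone, without re-reading anything the forward steps may have overwritten.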

3. The SagaStep Macro

Saga orchestration code follows a regular structure: a mapping from step index to forward method, a mapping from step index to compensation method, and metadata about step count and naming. This structure is a candidate for code generation. The SagaStep macro generates this orchestration layer from the struct definition.

Input (authored):

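#[derive(SagaStep)]
pub struct UnbundleSaga {
    opportunity_id: Uuid,
    items_to_remove: Vec<Uuid>,
    // Captured during forward steps so compensation can undo them
    original_price: Option<Decimal>,
    approval_id: Option<Uuid>,
}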

Output (generated):

impl TransactionalStep for UnbundleSaga {
    fn step_name(&self) -> &str {
        "UnbundleSaga"
    }

    fn step_count(&self) -> usize {
        5  // validate, price, approve, execute, emit
    }

    async fn execute_step(&mut self, step: usize) -> Result<()> {
        match step {
            0 => self.validate().await,
            1 => self.calculate_pricing().await,
            2 => self.create_approval().await,
            3 => self.execute_unbundle().await,
            4 => self.emit_event().await,
            _ => Err(StepOutOfBounds),
        }
    }

    async fn compensate_step(&mut self, step: usize) -> Result<()> {
        match step {
            1 => self.compensate_calculate_pricing().await,
            2 => self.compensate_create_approval().await,
            _ => Ok(())  // Steps without compensation
        }
    }
}

At approximately 150 lines of boilerplate eliminated per saga, the savings grow in proportion to the number of workflows in the library. The pattern also enforces a structural discipline: step methods and compensation methods must be defined before the macro can generate the dispatch layer, which makes incomplete compensation implementations a compile-time failure rather than a runtime one.
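
The SagaStep derive itself is a procedural macro and is not reproduced in this article. As a rough illustration of the same idea, the sketch below uses a declarative `macro_rules!` stand-in: the `saga_steps!` macro, its `forward => compensation` syntax, and the `no_compensation` placeholder method are all hypothetical, and the saga is synchronous and logs step names instead of doing real work.

```rust
/// Declarative stand-in for a derive macro: generates the dispatch layer
/// from a list of `forward => compensation` method pairs.
macro_rules! saga_steps {
    ($saga:ident { $( $forward:ident => $compensate:ident ),+ $(,)? }) => {
        impl $saga {
            const STEP_NAMES: &'static [&'static str] = &[ $( stringify!($forward) ),+ ];

            fn step_count(&self) -> usize {
                Self::STEP_NAMES.len()
            }

            fn execute_step(&mut self, step: usize) -> Result<(), String> {
                let table: &[fn(&mut Self) -> Result<(), String>] = &[ $( Self::$forward ),+ ];
                match table.get(step) {
                    Some(run) => run(self),
                    None => Err(format!("step {} out of bounds", step)),
                }
            }

            fn compensate_step(&mut self, step: usize) -> Result<(), String> {
                let table: &[fn(&mut Self) -> Result<(), String>] = &[ $( Self::$compensate ),+ ];
                match table.get(step) {
                    Some(undo) => undo(self),
                    None => Err(format!("step {} out of bounds", step)),
                }
            }
        }
    };
}

/// Simplified saga that records which methods were called.
struct UnbundleSaga {
    log: Vec<&'static str>,
}

impl UnbundleSaga {
    fn validate(&mut self) -> Result<(), String> { self.log.push("validate"); Ok(()) }
    fn calculate_pricing(&mut self) -> Result<(), String> { self.log.push("calculate_pricing"); Ok(()) }
    fn create_approval(&mut self) -> Result<(), String> { self.log.push("create_approval"); Ok(()) }
    fn execute_unbundle(&mut self) -> Result<(), String> { self.log.push("execute_unbundle"); Ok(()) }
    fn emit_event(&mut self) -> Result<(), String> { self.log.push("emit_event"); Ok(()) }
    fn compensate_calculate_pricing(&mut self) -> Result<(), String> { self.log.push("compensate_calculate_pricing"); Ok(()) }
    fn compensate_create_approval(&mut self) -> Result<(), String> { self.log.push("compensate_create_approval"); Ok(()) }
    /// Placeholder for steps that have nothing to undo.
    fn no_compensation(&mut self) -> Result<(), String> { Ok(()) }
}

saga_steps! {
    UnbundleSaga {
        validate => no_compensation,
        calculate_pricing => compensate_calculate_pricing,
        create_approval => compensate_create_approval,
        execute_unbundle => no_compensation,
        emit_event => no_compensation,
    }
}
```

In this sketch, removing `compensate_create_approval` from the `impl` block causes the macro expansion to fail to compile, which is the compile-time enforcement property described above. A real derive macro would additionally need a way to discover the step list, which this stand-in sidesteps by taking it as explicit input.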

4. Business Logic Implementation

The domain rules governing the workflow are implemented as pure functions independent of the orchestration layer. This separation is structural, not stylistic: pure business logic functions can be tested in isolation, composed into different orchestration contexts, and reasoned about without understanding the saga machinery.

Business Rules:

pub mod bundle_operations {
    /// Check if unbundle is allowed
    pub fn validate_unbundle_allowed(bundle: &Product) -> Result<()> {
        if !bundle.unbundle_allowed {
            return Err(UnbundleNotAllowed);
        }
        if bundle.min_components > remaining_components(&bundle) {
            return Err(RequiredComponentsMissing);
        }
        Ok(())
    }

    /// Calculate price adjustment with reduced discount
    pub fn calculate_unbundle_pricing(
        bundle: &Product,
        items_to_remove: &[Uuid],
    ) -> Decimal {
        // Assumes the Product price and discount fields are Decimal, so the
        // arithmetic stays in Decimal instead of mixing in f64.
        let original_discount = bundle.bundle_discount_value;
        let penalty_factor =
            Decimal::from(items_to_remove.len()) / Decimal::from(bundle.components.len());

        // Reduce discount proportionally
        let new_discount =
            original_discount * (Decimal::ONE - penalty_factor * bundle.unbundle_price_adjustment);

        bundle.base_price * (Decimal::ONE - new_discount)
    }

    /// Check if approval workflow needed
    pub fn requires_approval(
        bundle: &Product,
        items_to_remove: &[Uuid],
    ) -> bool {
        let removal_percentage = items_to_remove.len() as f64 / bundle.components.len() as f64;
        removal_percentage > 0.5  // >50% removal requires approval
    }
}

The approval threshold (greater than 50% of components removed) is a business rule, not an architectural rule. AI tooling can generate the orchestration that enforces this threshold. It cannot determine what the threshold should be. Business rule definition remains a human responsibility regardless of the sophistication of the code generation tool chain.
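
A worked example makes the pricing formula and the approval boundary easier to check by hand. The sketch below is a simplification under stated assumptions: `Bundle` is a hypothetical stand-in for the article's `Product` type, prices are plain `f64` values instead of `Decimal`, removals are passed as a count, and the sample figures (base price 1000, discount 0.20, adjustment 0.5) are invented for illustration.

```rust
/// Hypothetical, simplified stand-in for the article's `Product` type.
struct Bundle {
    base_price: f64,
    bundle_discount_value: f64,     // discount as a fraction, e.g. 0.20
    unbundle_price_adjustment: f64, // how strongly removals erode the discount
    component_count: usize,
}

/// Share of the bundle's components being removed.
fn removal_fraction(bundle: &Bundle, removed: usize) -> f64 {
    removed as f64 / bundle.component_count as f64
}

/// Same formula as `calculate_unbundle_pricing`, on plain floats.
fn unbundle_price(bundle: &Bundle, removed: usize) -> f64 {
    let penalty_factor = removal_fraction(bundle, removed);
    let new_discount =
        bundle.bundle_discount_value * (1.0 - penalty_factor * bundle.unbundle_price_adjustment);
    bundle.base_price * (1.0 - new_discount)
}

/// Removing strictly more than half of the components requires approval.
fn requires_approval(bundle: &Bundle, removed: usize) -> bool {
    removal_fraction(bundle, removed) > 0.5
}

/// Invented sample data for the worked example.
fn sample_bundle(component_count: usize) -> Bundle {
    Bundle {
        base_price: 1000.0,
        bundle_discount_value: 0.20,
        unbundle_price_adjustment: 0.5,
        component_count,
    }
}
```

With the sample five-component bundle, removing two components gives a penalty factor of 0.4, a reduced discount of 0.20 × (1 − 0.4 × 0.5) = 0.16, and a price of 840, compared with 800 when nothing is removed. The threshold is a strict inequality, so removing exactly half (2 of 4) does not trigger approval while 3 of 5 does. Because these functions touch no I/O, such cases can be asserted directly without any saga machinery.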

5. Test Coverage

The implementation included 11 test scenarios covering the happy path, approval trigger boundaries, compensation execution, and edge cases. The following examples illustrate the happy path, the approval trigger, and the compensation test pattern:

#[tokio::test]
async fn test_unbundle_happy_path() {
    // Remove 1 of 5 components, no approval needed
    let result = unbundle_product(opportunity_id, vec![component_1]).await;
    assert!(result.is_ok());
    assert_eq!(approval_created, false);
}

#[tokio::test]
async fn test_unbundle_requires_approval() {
    // Remove 3 of 5 components (60%), approval required
    let result = unbundle_product(opportunity_id, vec![c1, c2, c3]).await;
    assert!(result.is_ok());
    assert_eq!(approval_created, true);
}

#[tokio::test]
async fn test_unbundle_compensation() -> Result<()> {
    // Simulate failure after approval creation
    let mut saga = UnbundleSaga::new(opportunity_id, items_to_remove);
    saga.execute_until_step(2).await?;  // Approval created
    saga.fail_at_step(3).await?;  // Execute unbundle fails

    // Verify compensation ran
    assert!(approval_canceled());
    assert_eq!(price, original_price);
    Ok(())
}

6. Production Hardening: The Idempotency Gap

Critical Production Issue: The initial saga implementation did not handle idempotency in compensation steps. If a compensation step failed and was retried, it would attempt to cancel an approval that was already canceled. DynamoDB raises ConditionalCheckFailedException on the second cancel attempt, causing the compensation itself to fail.

Root Cause: AI-generated compensation logic assumes single-pass execution. It does not account for retry scenarios, which are a normal condition in distributed systems operating under network unreliability or transient failure.

Resolution: Compensation methods require explicit idempotency guards:

async fn compensate_create_approval(&mut self) -> Result<()> {
    if let Some(approval_id) = self.approval_id {
        // Check if already canceled
        match get_approval_status(approval_id).await? {
            ApprovalStatus::Canceled => return Ok(()),  // Already done
            ApprovalStatus::Pending => cancel_approval(approval_id).await?,
            _ => {}  // Other states don't need cancellation
        }
    }
    Ok(())
}

Implication for AI-assisted saga implementation: Every compensation method generated by AI must be reviewed for idempotency. This is not an optional review. Compensation steps that are not idempotent will fail under retry conditions that occur in all production distributed systems.
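
A complementary guard is to attempt the conditional write first and treat a failed condition as "nothing left to cancel", which avoids a gap between reading the status and writing the cancellation. The sketch below illustrates that idea against an in-memory stand-in, not DynamoDB: `ApprovalStore`, `cancel_if_pending`, and `StoreError` are hypothetical names, and the store only imitates the behavior of a conditional update.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
enum ApprovalStatus {
    Pending,
    Canceled,
}

#[derive(Debug, PartialEq)]
enum StoreError {
    /// Imitates a failed condition on a conditional write.
    ConditionFailed,
    NotFound,
}

/// In-memory stand-in for the approvals table.
struct ApprovalStore {
    approvals: HashMap<u64, ApprovalStatus>,
}

impl ApprovalStore {
    fn with_pending(id: u64) -> Self {
        let mut approvals = HashMap::new();
        approvals.insert(id, ApprovalStatus::Pending);
        ApprovalStore { approvals }
    }

    fn status(&self, id: u64) -> Option<ApprovalStatus> {
        self.approvals.get(&id).copied()
    }

    /// Conditional update: succeeds only while the approval is still Pending.
    fn cancel_if_pending(&mut self, id: u64) -> Result<(), StoreError> {
        match self.approvals.get_mut(&id) {
            None => Err(StoreError::NotFound),
            Some(status) => {
                if *status == ApprovalStatus::Pending {
                    *status = ApprovalStatus::Canceled;
                    Ok(())
                } else {
                    Err(StoreError::ConditionFailed)
                }
            }
        }
    }
}

/// Idempotent compensation: a failed condition means the approval is no
/// longer Pending, so there is nothing left to cancel and the step succeeds.
fn compensate_create_approval(
    store: &mut ApprovalStore,
    approval_id: Option<u64>,
) -> Result<(), StoreError> {
    let id = match approval_id {
        Some(id) => id,
        None => return Ok(()), // No approval was created, nothing to undo.
    };
    match store.cancel_if_pending(id) {
        Ok(()) | Err(StoreError::ConditionFailed) => Ok(()),
        Err(other) => Err(other),
    }
}
```

Running the compensation twice against the same approval succeeds both times and leaves the status as `Canceled`, which is the property a retry requires. Treating every non-pending state as success follows the same choice made in the guard above; a stricter policy could re-read the status and fail on unexpected states.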

7. AI Contribution Assessment

Capability Area	AI Performance	Human Requirement
Saga pattern design	High — proposed correct forward/backward flows from requirements	Business rule specification
Orchestration boilerplate generation	High — generated all dispatch and registration code	Validation and review
SagaStep macro implementation	High — generated boilerplate elimination correctly	Specification of target API
Business rule definition	None — approval thresholds and pricing formulas require domain knowledge	Full ownership
Retry idempotency in compensation	Not addressed autonomously	Full ownership; required explicit addition
Test scenario coverage	High — generated 11 scenarios systematically	Review for business rule coverage

The pattern that emerges from this assessment: AI tooling excels at generating the structural and mechanical components of saga implementation. It does not supply domain knowledge or production operational awareness. Human contribution is required at the intersection of business rule definition and distributed systems failure mode handling.

8. Implementation Metrics

Metric	Value
Workflow implementation	2,073 lines
Integration tests	457 lines
Test scenarios	11
Boilerplate saved per saga (macro)	~150 lines
Partial failures in production	0
Successful compensation executions	12 (during testing)
Failed compensations	0
Data consistency violations	0
Average workflow execution	180ms (5 steps)
Compensation execution	95ms (2 steps rollback)
Database transactions per workflow	1 (DynamoDB TransactWriteItems)

9. Recommendations

Recommendation 1: Adopt the saga pattern as the standard implementation approach for any multi-step business process that modifies more than one persistent entity. Sequential operations without compensation guarantees are not an acceptable design for workflows where partial failure produces inconsistent state. The saga pattern imposes a design discipline (paired forward and compensation steps) that makes consistency guarantees explicit and testable.

Recommendation 2: Define compensation logic before writing forward logic. The compensation design should be specified before any forward implementation begins. AI assistance can help generate compensation logic from a complete specification. Generating compensation logic after forward implementation is complete tends to produce gaps, particularly around intermediate state that forward logic does not explicitly track.

Recommendation 3: Apply idempotency guards to every compensation method prior to production deployment. AI-generated compensation steps must be reviewed for idempotency before any production deployment. Each compensation method should be able to execute multiple times without producing incorrect state. This review should be a formal checklist item in the implementation process, not an optional enhancement.

Apply the SagaStep macro pattern to workflow types that do not require full saga compensation. The structural discipline of separating business logic from orchestration, and the compile-time enforcement of complete step definitions, produces value independent of whether compensation paths are ever exercised.

Recommendation 4: Maintain strict separation between pure business logic functions and saga orchestration code. Pure functions implementing domain rules (validation, pricing calculation, threshold evaluation) must not contain orchestration logic. Saga orchestration code must not contain domain logic. This separation enables independent testing and reuse and is the design property that makes AI-assisted generation of the orchestration layer viable.

Recommendation 5: Implement state machine representations for all saga states and require explicit transition definitions. Each saga should have a corresponding state machine with named states and defined valid transitions. This makes the workflow auditable and makes invalid state transitions detectable at compile time rather than at runtime.
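
One common way to get compile-time detection of invalid transitions in Rust is the typestate pattern, where each state is its own type and each transition consumes the previous state. The sketch below is an illustrative assumption, not the platform's implementation: the marker types, the `State` trait, and the integer identifiers are hypothetical, and the transitions do no real work.

```rust
// Marker types, one per saga state. States with a compensation path
// carry the data that compensation would need.
struct Initiated;
struct Validated;
struct Priced { original_price_cents: u64 }
struct Approved { original_price_cents: u64, approval_id: Option<u64> }
struct Executed;
struct Completed;

/// Gives each state a name for logging and auditing.
trait State { const NAME: &'static str; }
impl State for Initiated { const NAME: &'static str = "Initiated"; }
impl State for Validated { const NAME: &'static str = "Validated"; }
impl State for Priced { const NAME: &'static str = "Priced"; }
impl State for Approved { const NAME: &'static str = "Approved"; }
impl State for Executed { const NAME: &'static str = "Executed"; }
impl State for Completed { const NAME: &'static str = "Completed"; }

struct Saga<S: State> {
    opportunity_id: u64,
    state: S,
}

impl<S: State> Saga<S> {
    fn state_name(&self) -> &'static str { S::NAME }
}

// Each transition exists only on the state it is valid from.
impl Saga<Initiated> {
    fn new(opportunity_id: u64) -> Self {
        Saga { opportunity_id, state: Initiated }
    }
    fn validate(self) -> Saga<Validated> {
        Saga { opportunity_id: self.opportunity_id, state: Validated }
    }
}

impl Saga<Validated> {
    fn price(self, original_price_cents: u64) -> Saga<Priced> {
        Saga { opportunity_id: self.opportunity_id, state: Priced { original_price_cents } }
    }
}

impl Saga<Priced> {
    fn approve(self, approval_id: Option<u64>) -> Saga<Approved> {
        let original_price_cents = self.state.original_price_cents;
        Saga {
            opportunity_id: self.opportunity_id,
            state: Approved { original_price_cents, approval_id },
        }
    }
}

impl Saga<Approved> {
    fn execute(self) -> Saga<Executed> {
        Saga { opportunity_id: self.opportunity_id, state: Executed }
    }
}

impl Saga<Executed> {
    fn complete(self) -> Saga<Completed> {
        Saga { opportunity_id: self.opportunity_id, state: Completed }
    }
}

// `Saga::<Initiated>::new(1).execute()` does not compile: `execute` is
// defined only on `Saga<Approved>`.
```

A valid run chains the transitions in order, for example `Saga::<Initiated>::new(42).validate().price(10_000).approve(Some(7)).execute().complete()`. The trade-off is that a saga whose state is only known at runtime, such as one reloaded from storage, still needs a runtime representation alongside the typed one.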

10. Conclusion and Forward-Looking Assessment

The saga pattern addresses a fundamental challenge in distributed system design: the impossibility of atomic multi-step operations across independent persistence boundaries. The implementation documented here demonstrates that the pattern is tractable, that AI tooling materially reduces the cost of implementation, and that the residual human responsibility — business rule definition and production hardening — is non-negotiable. As distributed systems become increasingly common in SaaS architectures and as AI code generation tools become more capable, the adoption cost of rigorous consistency patterns will continue to decline. The primary constraint will shift from implementation effort to design discipline: ensuring that the right questions are asked before generation begins. Teams that establish saga patterns and supporting macro abstractions early in platform development will find that the marginal cost of adding new workflows approaches the cost of defining business rules alone, with orchestration and boilerplate generated automatically and reliably.

Resources and Further Reading

Disclaimer: This content represents personal learning from personal projects. Code examples are sanitized and generalized. No proprietary information is shared. Opinions are my own and do not reflect my employer’s views. All code examples are generic patterns or pseudocode for educational purposes.
