Skip to main content

Documentation Index

Fetch the complete documentation index at: https://www.aidonow.com/llms.txt

Use this file to discover all available pages before exploring further.

Executive Summary

Production performance degradation in multi-tenant SaaS platforms typically originates from architectural decisions that perform acceptably at low data volumes but degrade predictably as tenant data grows. A systematic analysis of seven bottlenecks in a DynamoDB-backed platform identified five high-leverage optimization patterns: query scoping to database level, Global Secondary Index utilization, in-memory caching with TTL, HTTP connection pooling, and derive-macro code generation. Implementing these patterns required fewer than 500 lines of total code changes and produced latency improvements ranging from 33 percent to 1000x, with monthly infrastructure cost savings of $420. This article documents each pattern, its applicability conditions, its implementation, and the empirical results observed in production.

Key Findings

  • In-memory filtering of database results is the highest-leverage performance anti-pattern to eliminate. Moving filter logic to the database layer reduces latency and read costs by one or more orders of magnitude, with minimal code changes.
  • Global Secondary Indices convert O(n) table scans into O(1) partition lookups. The latency differential between a full table scan of 100,000 records and a direct GSI lookup is approximately 50x in observed production conditions.
  • Appropriate caching of low-churn reference data reduces database read volume by over 90 percent. An in-memory cache with a five-minute TTL achieved a 96 percent hit rate for configuration data that changes at most weekly.
  • Connection setup overhead accounts for one-third of HTTP request latency when pooling is absent. Enabling connection pooling is typically a configuration change, not a code change, and should be the first optimization applied to any HTTP-heavy workload.
  • Derive macros eliminate an entire class of performance bugs by preventing missing GSI annotations that cause accidental full table scans undetectable in development environments.
  • Profiling consistently reveals that the highest-impact bottleneck differs from initial intuition. Measurement is a prerequisite for effective optimization, not a best practice.

1. Development Data Volumes Mask Performance Failures That Become Critical as Tenant Data Grows — Production Degradation in Multi-Tenant Platforms Is Predictable, Not Accidental

Multi-tenant platforms are particularly susceptible to performance degradation at scale because development and testing typically occur with data volumes that are orders of magnitude smaller than production. A query that retrieves 50 records from a 10-record development dataset is indistinguishable from the same query retrieving 50 records from a 10,000-record production dataset—until the production query must load and filter 10,000 records to find the 50. A production platform built on DynamoDB with event sourcing exhibited the following symptoms following growth in tenant data volume:
  • Session queries: 500ms average against a target of 50ms
  • Contact lookups: Full table scans on a 100,000-record table at 800ms to 2,000ms
  • Webhook execution: 200ms per call, of which 50ms was connection establishment overhead
  • Memory usage: 50MB per simple query, caused by full dataset loading
The following sections document each optimization pattern applied, the conditions under which it applies, and the measured outcomes.

2. Pattern 1: Moving Filter Logic to the Database Layer Eliminates the Primary Source of Unnecessary Memory Allocation and Read Amplification

2.1 In-Memory Filtering of Full Partition Loads Produced 10x Latency Overhead and 50MB Memory Allocation Per Request at Production Data Volumes

Session retrieval for a specific tenant context was implemented as a full tenant load followed by in-memory filtering:
// Load ALL sessions for the tenant, then filter in memory
pub async fn get_sessions_for_capsule(
    &self,
    tenant_id: &str,
    capsule_id: &str,
) -> Result<Vec<Session>> {
    // Query DynamoDB for ALL tenant sessions
    let all_sessions = self.repository
        .query_by_tenant(tenant_id)
        .await?;

    // Filter in memory to find capsule sessions
    let filtered: Vec<Session> = all_sessions
        .into_iter()
        .filter(|s| s.capsule_id == capsule_id)
        .collect();

    Ok(filtered)
}
At development data volumes (10 sessions per tenant), this pattern performed adequately. At production volumes (1,000 sessions per tenant), the query required loading 1,000 records and allocating 50MB of memory to return 50 records.

2.2 Pushing the Filter Predicate Into the DynamoDB Query Expression Reduces Data Transfer Without Requiring Index Changes

Push the filtering predicate to the DynamoDB query layer:
// Filter at the database level using DynamoDB expressions
pub async fn get_sessions_for_capsule(
    &self,
    tenant_id: &str,
    capsule_id: &str,
) -> Result<Vec<Session>> {
    // Query with filter expression - DynamoDB does the filtering
    let sessions = self.repository
        .query_by_tenant(tenant_id)
        .filter_expression("capsule_id = :capsule_id")
        .expression_values(hashmap! {
            ":capsule_id" => AttributeValue::S(capsule_id.to_string())
        })
        .await?;

    Ok(sessions)
}

2.3 Query Scoping Delivered 10x Latency Reduction, 90 Percent Memory Savings, and 20x Fewer DynamoDB Read Units

MetricBeforeAfterImprovement
Query latency500ms50ms10x
Memory per request50MB5MB90% reduction
DynamoDB read units1,000 items50 items20x fewer
DynamoDB filter expressions reduce data transfer between database and application but still consume read capacity units for all scanned items. For workloads requiring true O(1) lookups regardless of partition size, a Global Secondary Index is required (see Pattern 2). Filter expressions are appropriate when scan-to-result ratios are manageable and the overhead of an additional GSI is not justified.

2.4 Filter Expressions Are Appropriate When Scan-to-Result Ratios Are Manageable — True O(1) Lookups Require a Global Secondary Index

This pattern yields significant returns when:
  1. The filter attribute has high cardinality relative to the partition being scanned
  2. The query pattern is executed with sufficient frequency to amortize optimization effort
  3. The unfiltered result set contains more than 100 records

3. Pattern 2: Global Secondary Indices Convert O(n) Full Table Scans Into O(1) Partition Lookups, Producing 53x Latency Improvements on High-Frequency Query Paths

3.1 Contact Retrieval Without an Index Required Scanning 100,000 Records, Producing 800ms to 2,000ms Latency That Scaled Linearly With Table Growth

Contact retrieval by account identifier required a full table scan:
// Find contact by account_id - requires full table scan
pub async fn get_contact_by_account(
    &self,
    account_id: &str,
) -> Result<Option<Contact>> {
    let all_contacts = self.repository
        .scan()  // ⚠️ FULL TABLE SCAN
        .await?;

    all_contacts
        .into_iter()
        .find(|c| c.account_id == account_id)
        .ok_or(Error::NotFound)
}
With 100,000 contacts in the table, every lookup required scanning the entire table. Best-case latency was 800ms; worst-case exceeded 2,000ms. Query complexity was O(n), scaling linearly with table growth.

3.2 Adding GSIs for the Three Highest-Frequency Query Patterns Replaced Full Table Scans With Direct Partition Reads

Add Global Secondary Indices for high-frequency query patterns:
#[derive(DynamoDbEntity)]
#[dynamodb(table = "contacts")]
pub struct ContactEntity {
    #[dynamodb(partition_key)]
    pub id: String,

    #[dynamodb(sort_key)]
    pub tenant_id: String,

    // GSI5: PrimaryContactIndex for account lookups
    #[dynamodb(gsi5_partition_key)]
    pub account_id: String,

    // GSI6: ExecutiveIndex for role-based queries
    #[dynamodb(gsi6_partition_key)]
    pub role: String,

    #[dynamodb(gsi6_sort_key)]
    pub department: String,
}

// Generated query method (from derive macro)
pub async fn query_by_account(
    &self,
    account_id: &str,
) -> Result<Vec<Contact>> {
    self.repository
        .query_gsi5(account_id)  // Direct GSI lookup
        .await
}

3.3 GSI Adoption Reduced Contact Lookup Latency From 800ms to 15ms and Converted Query Complexity From O(n) to O(1)

MetricBeforeAfterImprovement
Query latency800ms15ms53x
Query complexityO(n)O(1)Constant time
Read costFull table scanSingle partition readProportional to result set

3.4 Three Indices Cover Account Lookups, Role-Based Queries, and Temporal Status Filtering — the Three Access Patterns With the Highest Observed Query Frequency

GSI indices were created for the three highest-frequency query patterns:
Query PatternIndexPrimary Use Case
Account lookupsGSI5 (account_id)Contact retrieval by account
Role queriesGSI6 (role + department)Executive and role-based dashboards
Status filtersGSI7 (status + created_at)Active contact lists with temporal ordering

3.5 GSI Provisioning Is Justified When the Query Attribute Is Outside the Primary Key, Frequency Exceeds 100 Executions Per Day, and the Table Exceeds 10,000 Items

Create a GSI when:
  1. The query attribute is not part of the primary key
  2. The query pattern executes more than 100 times per day
  3. The table contains more than 10,000 items
Each GSI consumes additional write capacity and storage. GSI count per table should be bounded to 3–4 in most implementations to balance query performance against cost and write amplification. Evaluate each proposed GSI against actual query frequency data before provisioning.

4. Pattern 3: In-Memory Caching With TTL Achieves 96 Percent Hit Rates and 1000x Latency Improvement for Low-Churn Configuration Data

4.1 Webhook Configuration Was Fetched From DynamoDB on Every Invocation Despite Changing at Most Weekly, Producing 10,000 Unnecessary Reads Per Day

Webhook execution required loading hook configuration from DynamoDB on every invocation:
// Every webhook call hits DynamoDB
pub async fn execute_webhook(
    &self,
    hook_id: &str,
    payload: &str,
) -> Result<()> {
    // Load hook config from DynamoDB - 100ms
    let hook = self.repository
        .get_hook(hook_id)
        .await?;

    // Execute webhook - 150ms
    self.http_client
        .post(&hook.url)
        .body(payload)
        .send()
        .await?;

    Ok(())
}
Hook configuration changes at most weekly, yet was being retrieved on every execution. At 10,000 webhook calls per day, this produced 10,000 unnecessary DynamoDB reads representing 100ms of cumulative latency per call.

4.2 A Five-Minute TTL Cache Reduces Database Reads to 288 Per Day While Bounding Maximum Configuration Staleness to an Operationally Acceptable Window

In-memory cache with a five-minute TTL:
use std::sync::Arc;
use tokio::sync::RwLock;
use std::time::{Duration, Instant};

pub struct CachedHookRepository {
    repository: Arc<DynamoDbRepository>,
    cache: Arc<RwLock<HashMap<String, CachedHook>>>,
    ttl: Duration,
}

struct CachedHook {
    hook: Hook,
    expires_at: Instant,
}

impl CachedHookRepository {
    pub fn new(repository: DynamoDbRepository) -> Self {
        Self {
            repository: Arc::new(repository),
            cache: Arc::new(RwLock::new(HashMap::new())),
            ttl: Duration::from_secs(300), // 5 minutes
        }
    }

    pub async fn get_hook(&self, hook_id: &str) -> Result<Hook> {
        // Check cache first
        {
            let cache = self.cache.read().await;
            if let Some(cached) = cache.get(hook_id) {
                if cached.expires_at > Instant::now() {
                    return Ok(cached.hook.clone()); // Cache hit: 0.1ms
                }
            }
        }

        // Cache miss or expired - fetch from DynamoDB
        let hook = self.repository.get_hook(hook_id).await?;

        // Update cache
        {
            let mut cache = self.cache.write().await;
            cache.insert(hook_id.to_string(), CachedHook {
                hook: hook.clone(),
                expires_at: Instant::now() + self.ttl,
            });
        }

        Ok(hook)
    }
}

4.3 Caching Reduced Cache-Hit Latency From 100ms to 0.1ms and Daily DynamoDB Read Volume From 10,000 to 288 at a 96 Percent Hit Rate

MetricBeforeAfterImprovement
Cache hit latency100ms0.1ms1000x
Daily DynamoDB reads10,00028897% reduction
Cache hit rateN/A96%

4.4 A Five-Minute TTL Optimizes the Hit-Rate-to-Staleness Trade-off — In-Memory Cache Is Appropriate When a Single Instance Handles the Workload and Cross-Instance Consistency Is Not Required

A five-minute TTL balances the 83 percent hit rate of a one-minute window against the staleness risk of a 15-minute window. An in-memory cache is appropriate here because the application runs as a single instance, the dataset is smaller than 10MB, and cross-instance consistency is not required. Distributed cache infrastructure would add operational overhead without benefit for this workload.

4.5 In-Memory Caching Is Appropriate When Read-to-Write Ratio Exceeds 10:1, Staleness of Several Minutes Is Acceptable, and the Dataset Fits Within 100MB

In-memory caching is appropriate when:
  1. The read-to-write ratio exceeds 10:1
  2. Staleness of up to several minutes is acceptable
  3. The cached dataset fits within 100MB of application memory
  4. The application runs as a single instance or consistency across instances is not required
Do not apply in-memory caching to user session data, financial transaction records, or real-time analytics. These data classes require either real-time consistency guarantees or cross-instance synchronization that in-memory caching cannot provide.

5. Pattern 4: Connection Pooling Eliminates TLS Handshake and DNS Lookup Overhead That Would Otherwise Account for One-Third of Per-Request Latency

5.1 Per-Request HTTP Client Instantiation Incurred 50ms of Connection Overhead on Every Webhook Call — One-Third of Total Request Latency

Webhook execution established a new HTTP client on every call, incurring 30–50ms TLS handshake and 10–20ms DNS lookup overhead on each request. Combined with a 100ms HTTP request, connection setup represented one-third of total per-call latency.

5.2 A Shared HTTP Client With Connection Pool Reuses Established Connections, Reducing Per-Call Overhead to Under 1ms on Pool Hits

Shared HTTP client with connection pool:
use reqwest::Client;
use std::time::Duration;

pub struct WebhookExecutor {
    // Shared HTTP client with connection pool
    client: Client,
}

impl WebhookExecutor {
    pub fn new() -> Self {
        let client = Client::builder()
            .pool_max_idle_per_host(32)  // Keep 32 idle connections per host
            .pool_idle_timeout(Duration::from_secs(90))
            .timeout(Duration::from_secs(30))
            .build()
            .expect("Failed to build HTTP client");

        Self { client }
    }

    pub async fn execute_webhook(
        &self,
        url: &str,
        payload: &str,
    ) -> Result<()> {
        // Reuses connection from pool - no TLS handshake
        let response = self.client
            .post(url)
            .body(payload)
            .send()  // Only ~100ms (50ms saved)
            .await?;

        Ok(())
    }
}

5.3 Connection Pooling Reduced Connection Overhead by 50x at a 92 Percent Pool Hit Rate, Delivering 33 Percent Total Per-Call Latency Reduction

MetricBeforeAfterImprovement
Per-call latency150ms100ms33%
Connection overhead50msLess than 1ms (on pool hit)50x
Pool hit rateN/A92%

5.4 Connection Pooling Applies to All Services Making Frequent HTTP Requests and Is Typically a Configuration Change, Not a Code Change

Connection pooling applies to all services that make frequent HTTP requests, all database connections, and any external API calls with non-trivial TLS overhead. Most HTTP client libraries, including reqwest in Rust, include built-in connection pooling. Enabling it requires configuration, not code.
Connection pooling is one of the highest-return, lowest-effort optimizations available for HTTP-heavy workloads. If the application creates HTTP clients per request, this optimization should be the first change applied before any other latency reduction work.

6. Pattern 5: Derive Macro Code Generation Eliminates Missing GSI Annotations — an Accidental Full Table Scan Defect Class Undetectable in Development

6.1 Manual Boilerplate Across Seven Entities Produced 1,050 Lines of Error-Prone Code, Copy-Paste Defects, and Missing GSI Annotations That Caused Accidental Table Scans

Every DynamoDB entity required 100–200 lines of manual boilerplate for attribute mapping and query methods. Manual implementation across seven entities produced 1,050 lines of boilerplate with copy-paste errors, missing GSI annotations (causing accidental full table scans), and no compile-time validation of field names.

6.2 A DynamoDB Derive Macro Generates Attribute Mapping, Query Methods, and Compile-Time Validation From 15-Line Annotated Struct Definitions

Derive macro generating entity boilerplate at compile time:
// Macro-driven implementation - 15 lines total
#[derive(DynamoDbEntity)]
#[dynamodb(table = "contacts")]
pub struct ContactEntity {
    #[dynamodb(partition_key)]
    pub id: String,

    #[dynamodb(sort_key)]
    pub tenant_id: String,

    #[dynamodb(gsi5_partition_key)]
    pub account_id: String,

    #[dynamodb(gsi6_partition_key)]
    pub role: String,
}

// Macro generates:
// - from_item() / to_item() methods
// - query_gsi5() / query_gsi6() methods
// - Type-safe field accessors
// - Compile-time validation

6.3 Code Generation Reduced Boilerplate by 90 Percent per Entity and Moved Missing GSI Detection From Runtime Failures to Compile-Time Errors

MetricBeforeAfterImprovement
Lines per entity1501590% reduction
Total boilerplate (7 entities)1,050 lines105 lines945 lines eliminated
Time to add new entity30 minutes3 minutes10x
Copy-paste errorsPresentEliminated
Missing GSI annotations detectedAt runtimeAt compile time
The runtime performance of generated code is identical to equivalent hand-written code. The primary performance benefit is the elimination of a class of bugs—missing GSI annotations—that would otherwise produce accidental full table scans undetectable in development environments.

6.4 Code Generation Through Macros Is Appropriate When Structural Patterns Repeat Across Multiple Entities and Compile-Time Validation Can Prevent Runtime Errors

Code generation through macros or equivalent tooling is appropriate when:
  1. The same structural pattern repeats across multiple entities
  2. Manual implementation introduces error-prone boilerplate
  3. Compile-time validation can prevent runtime errors

7.1 Session Queries — Not Contact Queries — Were the True Primary Bottleneck, Identified Only Through Frequency-Weighted Profiling

The initial assumption in this engagement was that contact queries were the primary bottleneck, as they were the most visible in application logs. Profiling revealed that session queries were 10 times more frequent and constituted a larger share of total latency. Measurement is a prerequisite for effective optimization, not a recommended practice. Recommended instrumentation: application-level latency at P50, P95, and P99 by operation; database slow query logs; memory allocation profiling per request; distributed tracing for cross-service request flows.

7.2 Optimization Impact Is the Product of Frequency, Latency, and Business Criticality — Not Latency Alone

Not all slow queries merit equal optimization effort. Impact = Frequency × Latency × Business Criticality
QueryFrequencyLatencyBusiness CriticalityPriority
Session by context10,000/day500msHigh (API path)1
Contact by account5,000/day800msHigh (dashboard)2
Hook configuration load10,000/day100msMedium (background)3
Administrative reports10/day2,000msLow (internal)4

7.3 Architecture Decision Records Prevent Performance Regression by Preserving the Rationale Behind Cache TTLs, GSI Choices, and Filter Strategies

Each significant optimization decision should be captured in an Architecture Decision Record documenting the problem context, the decision made, positive and negative consequences, measured before-and-after metrics, and references to related decisions. ADRs preserve institutional knowledge and prevent regression when engineers unfamiliar with the original decision later modify the system. For a detailed ADR template, see ADRs as Architecture Documentation.

8. Five Patterns Delivered Latency Improvements From 33 Percent to 1000x, $420 Monthly Cost Savings, and 27 Developer Hours Saved per Month From Fewer Than 500 Lines of Code Changes

OptimizationMetricBeforeAfterImprovement
Query ScopingLatency500ms50ms10x
Memory50MB5MB10x
DynamoDB reads1,000 items50 items20x
Contact GSILatency800ms15ms53x
ComplexityO(n)O(1)
Hook CachingCache hit latency100ms0.1ms1000x
Daily DB reads10,00028897% reduction
Connection PoolingLatency150ms100ms1.5x
Connection overhead50msLess than 1ms50x
Code GenerationLines of code1,05010595% reduction
Dev time per entity30 min3 min10x
Total savings: $420/month in reduced DynamoDB read capacity; 27 developer hours/month saved; fewer than 500 lines of code changed across all five patterns.

9. Recommendations

  1. Establish performance budgets per operation type before production launch. Define P95 latency targets for API endpoints, background jobs, and database queries; alert before user impact.
  2. Instrument every DynamoDB query with read unit consumption metrics. Full table scans introduced by code changes are undetectable in development. Production instrumentation is the only reliable mechanism.
  3. Require connection pool configuration in all HTTP client initialization. Treat single-use client instantiation as a blocking defect in code review.
  4. Limit GSI proliferation through a formal review process. Each GSI represents write amplification cost. Require benchmark evidence of query frequency before approving additions.
  5. Document significant performance decisions in Architecture Decision Records. Cache TTL values, GSI choices, and filter strategies are not self-evident from code. ADRs prevent regression and preserve institutional knowledge.
  6. Measure P95 and P99 latency, not averages. Average metrics mask tail latency. Optimization targets must be defined against percentile thresholds.
The fastest query is the one that is never executed. Before applying caching or indexing to a query, evaluate whether the query can be eliminated entirely. In one case, 60 percent of hook configuration queries were eliminated by embedding configuration in the triggering event payload, removing the lookup requirement entirely.

10. Resources and Further Reading


11. Conclusion and Forward Outlook

The five patterns documented here address distinct performance failure modes that are predictable, measurable, and correctable. Their value lies not in novelty but in systematic application: identifying the correct bottleneck through measurement, selecting the appropriate pattern, implementing it with verification, and documenting the rationale for future engineers. As multi-tenant data volumes continue to grow, the cost of deferred optimization increases. Organizations that instrument performance from initial deployment and establish optimization standards before performance incidents occur will maintain system reliability and cost efficiency at scale. The patterns described here—particularly database-level filtering and connection pooling—require minimal engineering investment relative to their impact and should be treated as baseline standards rather than optional enhancements.
Disclaimer: This content represents my personal learning journey using AI for a personal project. It does not represent my employer’s views, technologies, or approaches.All code examples are generic patterns or pseudocode for educational purposes. Performance numbers are from real implementations but have been sanitized and rounded for clarity.