
Executive Summary

Integrating a large language model into a production multi-tenant SaaS application is not primarily an AI engineering problem — it is a systems engineering problem. Four constraints arrive simultaneously: per-tenant cost control, audit compliance, correctness under concurrent retries, and graceful degradation when the LLM is unavailable. This analysis documents the complete vertical slice of an LLM-powered AI suggestions feature in a production CRM service, from DynamoDB entity design through an eight-step Rust service layer to a polling React interface. The implementation demonstrates that each constraint has a specific, composable solution: three-layer rate control addresses cost, initials-only prompt construction addresses PII, SHA-256 deterministic batch IDs address idempotency, and a 202 async pattern addresses degradation. Organizations planning LLM feature additions to existing SaaS platforms should treat these four requirements as non-negotiable prerequisites rather than post-launch concerns.

Key Findings

  • LLM integration into multi-tenant SaaS requires solving four orthogonal production constraints simultaneously: cost control, audit compliance, retry idempotency, and graceful degradation. Solving any three while deferring the fourth creates an incident surface that typically manifests under load.
  • A SHA-256 hash of opportunity ID and UTC date yields a deterministic idempotency key that eliminates coordination overhead. Two concurrent generate requests for the same opportunity on the same calendar day produce identical batch IDs; a single DynamoDB conditional write ensures only one suggestion set persists.
  • Three independent rate-control layers are required because each layer defends a different failure boundary. Feature flags disable the capability per tenant, HTTP-layer rate limiting prevents API abuse, and per-opportunity cooldown records in DynamoDB prevent redundant LLM calls within a generation window.
  • Event payloads must be structurally incapable of containing LLM output. Omitting suggestion text from the AiSuggestionGenerated domain event at the type level — not by convention — ensures downstream consumers cannot reconstruct raw model output from the audit stream, regardless of future development.
  • DynamoDB UUID v7 sort keys and a 24-hour TTL are architectural decisions, not implementation details. Time-sortable suggestion IDs eliminate secondary sort operations; automatic expiry removes the engineering burden of stale-suggestion cleanup and reduces storage cost.
  • The 202 async response pattern decouples perceived latency from actual LLM round-trip time. Clients receive an immediate acknowledgment with a polling reference; LLM unavailability becomes a retryable background state rather than a blocking user-facing failure.

1. The Four Production Requirements for LLM Integration in Multi-Tenant SaaS

Adding LLM inference to an existing multi-tenant platform introduces four production requirements that do not apply to deterministic features. Each requires deliberate architectural investment; none can be addressed retroactively without structural change to the persistence and event models.

Cost control is the most immediately visible requirement. In a multi-tenant system, a single tenant calling the generate endpoint in a tight loop will exhaust LLM budget for all tenants. Standard HTTP rate limiting is necessary but insufficient — it does not prevent the same opportunity from being submitted repeatedly within a short window, each call producing a billable LLM request.

Audit compliance surfaces when legal or security reviews examine the event stream. LLM outputs frequently contain synthesized information derived from customer data. If that output is logged verbatim in domain events, it becomes part of the audit record and may trigger data residency or PII handling obligations that the original event design did not anticipate. The solution is structural: define event schemas that cannot hold LLM output.

Correctness under retries is the idempotency requirement. Network failures, client timeouts, and load balancer retries mean that any single user action may produce multiple inbound requests. A naive implementation creates duplicate suggestion records per retry. The correct solution eliminates duplication at the key level rather than through locking or deduplication queries.

Graceful degradation is the availability requirement. The LLM is an external service with its own SLA, which is lower than the SLA of the CRM product embedding it. Users must be able to view opportunities, update records, and advance pipeline stages regardless of LLM availability. The suggestion feature must fail independently of the core product.

These four requirements shape every layer of the implementation described in the sections that follow.

2. Entity Model: Time-Sortable Keys, Status Index, and TTL-Based Expiry

The DynamoDB table design for AI suggestions enforces three properties at the storage layer: tenant isolation, efficient status queries, and automatic expiry.
Table: crm_ai_suggestions
PK:   TENANT#{tenant_id}#CAPSULE#{capsule_id}#OPP#{opp_id}
SK:   SUGG#{suggestion_id}   // UUID v7 — time-sortable
All suggestions for one opportunity within one tenant-capsule context reside in a single partition. This layout enables a single Query call to retrieve all suggestions for an opportunity, ordered by generation time, without a sort step. UUID v7 sort keys are critical here: UUID v4 is random, so suggestions sort in arbitrary order within the partition and require client-side sorting by timestamp. UUID v7 embeds a millisecond-precision timestamp in the most-significant bits, making lexicographic sort identical to chronological sort.

The status Global Secondary Index enables the second access pattern — querying all suggestions in a given status across all opportunities for a tenant:
GSI: status-index
PK:  TENANT#{tenant_id}#STATUS#{status}
SK:  created_at  // epoch seconds
This index supports the moderation and monitoring use case: operators can query all Pending or Generated suggestions for a tenant without scanning the full table.

Three additional storage-layer decisions address operational concerns:

  • TTL: Each record includes a generated_at epoch-second attribute. DynamoDB TTL is set to generated_at + 86400 (24 hours). Stale suggestions are removed automatically without a scheduled cleanup job. The 24-hour window reflects a deliberate product decision: AI suggestions that are more than a day old are likely to be based on stale opportunity context and may mislead users. Automatic expiry enforces this policy structurally.
  • SSE with AES256: Server-side encryption is enabled on the table. This is baseline hygiene for any table containing customer-derived content, but it is especially important for LLM outputs, which may contain synthesized information that is sensitive in combination even when individual inputs are not.
  • Stream: NEW_AND_OLD_IMAGES: DynamoDB Streams is enabled with full image capture. This supports downstream audit consumers that need to observe suggestion lifecycle transitions — particularly the Generated → Accepted and Generated → Dismissed transitions, which are relevant to understanding which suggestions influenced user decisions.
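The key construction and TTL arithmetic above are small enough to sketch directly. The following is an illustration rather than the production code: the function names are hypothetical, and it assumes a recent uuid crate with the v7 feature enabled.

use uuid::Uuid; // requires the `v7` feature of the uuid crate

// Hypothetical key construction mirroring the PK/SK layout above.
fn suggestion_keys(tenant_id: &str, capsule_id: &str, opp_id: &str) -> (String, String) {
    let suggestion_id = Uuid::now_v7(); // millisecond timestamp in the high bits
    let pk = format!("TENANT#{tenant_id}#CAPSULE#{capsule_id}#OPP#{opp_id}");
    let sk = format!("SUGG#{suggestion_id}");
    (pk, sk)
}

// TTL attribute: DynamoDB removes the item roughly 24 hours after generation.
fn expires_at(generated_at_epoch_secs: i64) -> i64 {
    generated_at_epoch_secs + 86_400
}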

3. The Eight-Step Generation Flow: Feature Gate to Event Emission

The AiSuggestionService::generate method executes eight steps in strict sequence. Each step is a guard that can short-circuit the flow with a specific error response. Understanding each step’s purpose clarifies why the ordering matters.

Step 1 — Feature flag guard. The service checks tenant_config.ai_assist_enabled before any other work. If the flag is absent or false, the service returns HTTP 403 immediately. This is the coarsest control lever: it allows the AI suggestions feature to be enabled per tenant without deployment changes, and it allows a tenant to be disabled without affecting others.

Step 2 — Batch ID derivation. The service computes a deterministic batch ID:
// Requires `use sha2::{Digest, Sha256};` from the sha2 crate.
let batch_id = {
    // date_utc_ymd(): the current UTC date, e.g. "2026-05-08" (see Section 4)
    let raw = format!("{}{}", opp_id, date_utc_ymd());
    let digest = Sha256::digest(raw.as_bytes());
    format!("{:x}", digest)
};
This ID is the idempotency key for the generation request. Section 4 examines this decision in detail.

Step 3 — Cooldown enforcement. The service calls cooldown_repo.check_cooldown(opp_id). If a cooldown record exists for this opportunity, the service returns HTTP 429 with a retry_after_secs field. If no cooldown is active, the service calls cooldown_repo.record_generation(opp_id) with a 60-second TTL before proceeding.

Step 4 — PII-safe prompt construction. The service assembles the LLM prompt using only privacy-safe fields. Section 7 describes this in detail.

Step 5 — LLM call with timeout. The service invokes the LlmClient trait implementation with a hard 30-second timeout enforced by Tokio:
let response = tokio::time::timeout(
    Duration::from_secs(30),
    self.llm_client.complete(prompt),
).await
.map_err(|_| AiSuggestionError::LlmTimeout)?
.map_err(AiSuggestionError::LlmError)?;
A timeout error at this layer returns HTTP 503. The LLM’s absence does not propagate further into the service.

Step 6 — Response parsing. The service parses the LLM JSON response, extracting lead_score, score_factors, and next_action. A malformed response returns HTTP 400 to the caller.

Step 7 — Conditional persistence. The AiSuggestion record is written to DynamoDB with a conditional expression that rejects writes if the item already exists. This is the write-side idempotency guard: if two concurrent requests arrive with the same batch_id, only one write succeeds.

Step 8 — Event emission. The service emits an AiSuggestionGenerated domain event. The event payload contains the suggestion ID, opportunity ID, tenant ID, status, and timestamp — but not the suggestion text. This constraint is discussed in Section 7.
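The Step 6 parse is a plain typed deserialization. A minimal sketch, assuming serde and serde_json; the field names come from the description above, while the exact types are assumptions:

use serde::Deserialize;

// Hypothetical shape of the parsed LLM response.
#[derive(Deserialize)]
struct ParsedSuggestion {
    lead_score: u8,
    score_factors: Vec<String>,
    next_action: String,
}

fn parse_llm_response(body: &str) -> Result<ParsedSuggestion, serde_json::Error> {
    // A malformed or truncated model response fails here and is mapped to an
    // error response by the caller, as Step 6 describes.
    serde_json::from_str(body)
}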

4. Idempotency Without Coordination: SHA-256 Batch IDs as Deterministic Keys

The idempotency requirement for LLM generation is more subtle than for standard CRUD operations. Standard idempotency patterns use client-supplied request IDs or server-generated tokens stored in a coordination table. Both approaches introduce coordination overhead: the client must manage and reuse the ID across retries, or the server must maintain a separate lookup table with its own consistency requirements. The SHA-256 batch ID approach eliminates coordination by deriving the idempotency key deterministically from the inputs that define what is being requested:
// Two concurrent requests for the same opportunity on the same day
// produce the same batch_id
SHA-256("opp_01HX7F3K2N4M5P6Q7R8S9T0UVW" + "2026-05-08") = "3a7f9c..."
This design has a specific scope: one batch of suggestions per opportunity per calendar day. The calendar-day granularity reflects the TTL policy — suggestions expire after 24 hours, so generating more than once per day would produce duplicates that expire before they are used. The conditional write in Step 7 is the enforcement mechanism:
// DynamoDB conditional expression — rejects if item already exists
let condition = "attribute_not_exists(pk) AND attribute_not_exists(sk)";
If two concurrent requests survive Steps 1–6 and both attempt to write the same suggestion record, exactly one will succeed and the other will receive a ConditionalCheckFailedException. The service treats this exception as a success — the suggestion was generated — and returns the batch ID to both callers.

This pattern has one important limitation: the determinism depends on clock alignment. If two requests arrive at the UTC day boundary, one may compute 2026-05-08 and the other 2026-05-09. The result is two distinct batch IDs and two distinct suggestion sets. This is correct behavior — the suggestions were generated on different calendar dates with potentially different opportunity context — but implementors should be aware of the boundary condition.
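A sketch of the write-plus-exception handling described above, assuming the aws-sdk-dynamodb crate; the error-inspection helpers vary by SDK version, so treat this as a fragment for illustration rather than the production code. Here client is a DynamoDB client, item the fully built suggestion record, and batch_id the deterministic key.

let write = client
    .put_item()
    .table_name("crm_ai_suggestions")
    .set_item(Some(item))
    .condition_expression("attribute_not_exists(pk) AND attribute_not_exists(sk)")
    .send()
    .await;

match write {
    Ok(_) => Ok(batch_id),
    // Another request already persisted this batch: treat it as success and
    // hand the same batch ID back to the caller.
    Err(e) if e.as_service_error().map_or(false, |s| s.is_conditional_check_failed_exception()) => {
        Ok(batch_id)
    }
    Err(e) => Err(e.into()),
}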

5. Three-Layer Cost Control: Cooldown, Rate Limit, and Feature Gate

The three rate-control layers in this implementation are not redundant. Each defends a distinct failure boundary, and the failure modes they prevent do not overlap.
Layer           | Mechanism                                | Scope           | Failure Prevented
Feature gate    | tenant_config.ai_assist_enabled          | Per-tenant      | Unauthorized LLM usage, budget overrun for disabled tenants
HTTP rate limit | rate_limit = "60/min" on handler         | Per-API-key     | API abuse, scripted hammering of the generate endpoint
Cooldown        | DynamoDB TTL record, 60s per opportunity | Per-opportunity | Redundant LLM calls for same context within generation window
The feature gate is the coarsest control. It enables the AI feature to be offered as a paid tier add-on and provides an immediate kill switch if a tenant is suspended or if LLM costs exceed acceptable thresholds. The flag check in Step 1 precedes all other work, including the cooldown lookup, so a disabled tenant incurs no DynamoDB read cost on the generation path.

HTTP rate limiting is applied at the handler level via a macro annotation:
#[eva_api(
    method = "POST",
    path = "/opportunities/{opp_id}/ai-suggestions",
    rate_limit = "60/min"
)]
async fn generate_suggestions(/* ... */) -> Result<impl IntoResponse, ApiError> {
    // ...
}
This limit applies per API key. It prevents a single integration or script from exhausting the platform’s LLM budget, but it does not prevent legitimate high-frequency usage from generating LLM calls for the same opportunity repeatedly.

Per-opportunity cooldown addresses the gap HTTP rate limiting leaves open. The CooldownRepository writes a DynamoDB item with a 60-second TTL each time generation succeeds. For the duration of that TTL, subsequent generate requests for the same opportunity return HTTP 429 with a retry_after_secs field. This prevents the case where a user triggers generation, the LLM responds slowly, the user triggers generation again — and both requests proceed to the LLM.

The three layers compose independently. An operator can tighten the HTTP rate limit without touching cooldown behavior. A tenant can be disabled at the feature gate without affecting the rate limit configuration. The cooldown window can be adjusted per deployment without changes to the API handler.
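The cooldown layer's repository surface is small. A hypothetical sketch of its shape, using the method names from Section 3; the signatures and error type here are assumptions, not the production definitions:

use async_trait::async_trait;
use std::time::Duration;

#[async_trait]
pub trait CooldownRepository: Send + Sync {
    /// Returns Some(seconds_remaining) when an active cooldown record exists.
    async fn check_cooldown(&self, opp_id: &str) -> Result<Option<u64>, CooldownError>;
    /// Writes a cooldown record whose DynamoDB TTL expires after `window`.
    async fn record_generation(&self, opp_id: &str, window: Duration) -> Result<(), CooldownError>;
}

#[derive(Debug)]
pub struct CooldownError(pub String);

Because the cooldown lives in its own repository, the 60-second window can be tuned without touching the handler or the feature gate, which is what keeps the three layers independently adjustable.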

6. The LlmClient Trait: Abstraction Without Premature Extraction

The LlmClient abstraction is defined as a Rust async trait:
#[async_trait]
pub trait LlmClient: Send + Sync {
    async fn complete(&self, req: CompletionRequest) -> Result<CompletionResponse, LlmError>;
}

pub struct CompletionRequest {
    pub system_prompt: Option<String>,
    pub user_prompt: String,
    pub max_tokens: u32,      // default: 1024
    pub temperature: f32,     // default: 0.3
    pub model: Option<String>, // default: claude-haiku-4-5-20251001
}

pub enum LlmError {
    RateLimit,                // 429 — back off and retry
    Unauthorized,             // 401/403 — fix credentials, do not retry
    ModelNotFound(String),    // 404 — fix model name
    ContextTooLong,           // 400 — truncate input
    Timeout,                  // retry with back-off
    Upstream(String),         // unexpected — surface to operator
}
The trait provides two properties that justify its existence independently of extraction: testability and provider abstraction. With LlmClient as a trait, test suites inject a mock implementation that returns controlled responses, enabling exhaustive testing of the eight-step flow without network calls. Provider abstraction means a future migration from one LLM provider to another requires only a new struct implementing the trait, not changes to the service logic.

Implementation constraint: The CRM service is currently the sole consumer of this abstraction. A principled refactoring impulse would extract LlmClient into a shared platform crate for reuse. That extraction was deliberately deferred, following YAGNI discipline. The trait exists in the CRM service crate. If a second service requires LLM access in the future, extraction is a straightforward, mechanical change — copy the trait definition and its single implementation to a shared crate, update the import paths. Extracting before a second consumer exists creates maintenance surface without delivering value.

The LlmError enum is the error vocabulary that the service layer maps to HTTP responses:
LlmError         | HTTP Response                    | Retry?
RateLimit        | 429 with retry_after_secs        | Yes, with backoff
Unauthorized     | 500 (operator error)             | No
ModelNotFound(_) | 500 (operator error)             | No
ContextTooLong   | 503 (prompt truncation required) | No, requires fix
Timeout          | 503                              | Yes, with backoff
Upstream(_)      | 503                              | Conditional
The Unauthorized and ModelNotFound variants map to HTTP 500 rather than 4xx because they represent operator configuration errors, not client errors. A client cannot fix a missing API credential or an incorrect model name — these require operator intervention. Returning 500 routes them through the alerting path rather than silently returning a client error that may be ignored.
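As noted earlier in this section, the trait's immediate payoff is mock injection in tests. A minimal test double, assuming CompletionResponse carries a single text field (the source does not show that struct's fields):

use async_trait::async_trait;

struct MockLlmClient {
    canned_json: String,
}

#[async_trait]
impl LlmClient for MockLlmClient {
    async fn complete(&self, _req: CompletionRequest) -> Result<CompletionResponse, LlmError> {
        // Return a fixed response so the eight-step flow can be exercised
        // without a network call; vary canned_json to hit the error paths.
        Ok(CompletionResponse { text: self.canned_json.clone() })
    }
}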

7. PII Discipline: Prompt Construction and Event Payload Constraints

Two independent PII controls operate in this implementation. They address different surfaces in the data flow.

7.1 Prompt Construction

The LLM prompt is assembled in Step 4 using a dedicated prompt builder that enforces a strict field allowlist. Contact data — email addresses, phone numbers, full names — is excluded. Only initials are included:
fn build_suggestion_prompt(opp: &Opportunity, contacts: &[Contact]) -> String {
    let contact_summary = contacts
        .iter()
        .map(|c| format!("  - {} ({})", c.initials(), c.role))
        .collect::<Vec<_>>()
        .join("\n");

    format!(
        "Opportunity: {title}\nValue: {value}\nStage: {stage}\nContacts:\n{contacts}\n\
         Provide: lead_score (0-100), score_factors (list), next_action (string).",
        title = opp.title,
        value = opp.estimated_value,
        stage = opp.stage,
        contacts = contact_summary,
    )
}
The function accepts typed domain objects and extracts only what the allowlist permits. There is no general serialization of the opportunity or contact structs — no risk of a new field being silently included if the domain model grows.
Using a general-purpose serializer (e.g., serde_json::to_string(&opportunity)) to construct LLM prompts is dangerous in multi-tenant systems. As domain models evolve, new fields — potentially containing PII — may be included in prompts without a deliberate review step. A typed extraction function with an explicit allowlist is the correct pattern.
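The initials() call is the only contact-derived data that reaches the prompt. Its body is not shown here; a plausible sketch, assuming the Contact struct carries a full_name field:

impl Contact {
    // Reduce "Ada Lovelace" to "AL"; nothing else from the contact record
    // reaches the prompt builder.
    fn initials(&self) -> String {
        self.full_name
            .split_whitespace()
            .filter_map(|word| word.chars().next())
            .flat_map(char::to_uppercase)
            .collect()
    }
}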

7.2 Event Payload Constraints

The AiSuggestionGenerated domain event carries metadata about the generation event but does not carry the generated suggestion:
pub struct AiSuggestionGenerated {
    pub suggestion_id: SuggestionId,
    pub batch_id: String,
    pub opp_id: OpportunityId,
    pub tenant_id: TenantId,
    pub capsule_id: CapsuleId,
    pub status: SuggestionStatus,
    pub generated_at: DateTime<Utc>,
    // NOTE: suggestion_text is intentionally absent.
    // Downstream consumers must retrieve suggestion content
    // from the crm_ai_suggestions table directly.
}
The absence of suggestion_text from this struct is a CISO-level mandate, and its enforcement is structural rather than conventional. A future developer cannot accidentally add the field and have it silently flow into the event stream — they must make an explicit change to the event schema, which will be visible in code review. Downstream consumers that need suggestion content for legitimate purposes (analytics, quality assessment) must query the crm_ai_suggestions table directly, where access controls and TTL apply. This design preserves the audit utility of the event stream while preventing it from becoming a secondary store of LLM-synthesized content.
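A sketch of that direct read path, assuming the aws-sdk-dynamodb crate and the key layout from Section 2; this is a fragment for illustration, with client and the identifier variables assumed to be in scope:

use aws_sdk_dynamodb::types::AttributeValue;

// Fetch every suggestion for one opportunity; the UUID v7 sort key keeps the
// results in generation order at the storage layer.
let pk = format!("TENANT#{tenant_id}#CAPSULE#{capsule_id}#OPP#{opp_id}");
let resp = client
    .query()
    .table_name("crm_ai_suggestions")
    .key_condition_expression("pk = :pk AND begins_with(sk, :prefix)")
    .expression_attribute_values(":pk", AttributeValue::S(pk))
    .expression_attribute_values(":prefix", AttributeValue::S("SUGG#".to_string()))
    .send()
    .await?;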

8. The 202 Async Pattern: Immediate Acceptance, Polled Completion

LLM inference is slow relative to the latency budget of a synchronous HTTP response. Round-trip times of 2–10 seconds are common; under load or provider degradation, 30 seconds or more. A synchronous response pattern — POST, wait for LLM, return result — creates two problems: the client connection is held open for the duration of the LLM call, and LLM unavailability directly causes user-facing errors in the core product. The 202 async pattern separates request acceptance from result availability:
  1. Client POSTs a generate request. The service executes Steps 1–3 (feature gate, batch ID, cooldown) synchronously and returns HTTP 202 immediately with the batch ID and a Generating status. The LLM call is dispatched asynchronously.
  2. Client polls the list endpoint. The client uses GET /opportunities/{opp_id}/ai-suggestions at an appropriate interval. The endpoint returns an AiSuggestionListResponse containing all suggestions for the opportunity and their current statuses.
  3. Client detects completion. When a suggestion transitions from Pending to Generated, the client renders the suggestion. If the status transitions to an error state, the client surfaces a retry option.
The immediate 202 response body:
{
  "batch_id": "3a7f9c4b...",
  "status": "Generating"
}
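A minimal sketch of how a handler can produce that body while pushing the LLM round trip off the request path. The axum types and the tokio::spawn dispatch are assumptions about the surrounding framework, not details confirmed by the source:

use axum::{http::StatusCode, Json};
use serde_json::json;

// Steps 1–3 have already run; batch_id is the deterministic SHA-256 key.
async fn accept_generation(batch_id: String) -> (StatusCode, Json<serde_json::Value>) {
    // Dispatch Steps 4–8 (prompt build, LLM call, parse, conditional write,
    // event emission) as a background task so the client is not held open.
    tokio::spawn(async {
        // run_generation(...).await  -- hypothetical background worker
    });

    (
        StatusCode::ACCEPTED, // HTTP 202
        Json(json!({ "batch_id": batch_id, "status": "Generating" })),
    )
}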
The poll response shape:
pub struct AiSuggestionListResponse {
    pub items: Vec<AiSuggestionSummary>,
    pub count: usize,
}

pub struct AiSuggestionSummary {
    pub suggestion_id: String,
    pub batch_id: String,
    pub status: SuggestionStatus,
    pub lead_score: Option<u8>,
    pub next_action: Option<String>,
    pub created_at: DateTime<Utc>,
}
The suggestion status state machine governs all valid transitions:
Pending ──► Generated ──► Accepted
                     └──► Dismissed
Any ─────────────────────► Expired (TTL)
The Expired transition is not driven by application code — it is driven by DynamoDB TTL removal. The React UI should treat a suggestion that disappears from the list (was present on a previous poll, absent on the current poll) as expired and prompt the user to regenerate if needed.
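The transitions are few enough to encode directly. A sketch of the status enum with a guard method; the variants come from the diagram above, the method name is illustrative:

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum SuggestionStatus { Pending, Generated, Accepted, Dismissed, Expired }

impl SuggestionStatus {
    // True only for the transitions in the diagram. Expired is not a transition
    // the application performs (DynamoDB TTL removal drives it), but it can be
    // observed from any state.
    pub fn can_transition_to(self, next: SuggestionStatus) -> bool {
        use SuggestionStatus::*;
        matches!(
            (self, next),
            (Pending, Generated) | (Generated, Accepted) | (Generated, Dismissed) | (_, Expired)
        )
    }
}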
On the React side, the polling logic should implement exponential backoff with a cap to avoid hammering the list endpoint:
// fetchSuggestions(oppId) is assumed to wrap GET /opportunities/{opp_id}/ai-suggestions
// and return the AiSuggestionListResponse shape shown above; its item type is
// referred to here as AiSuggestionSummary to match the Rust struct.
function delay(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}

async function pollUntilGenerated(oppId: string, batchId: string): Promise<AiSuggestionSummary | null> {
  let delayMs = 1000;
  const maxDelayMs = 10000;
  const maxAttempts = 12; // ~90 seconds total at max backoff

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const suggestions = await fetchSuggestions(oppId);
    const match = suggestions.items.find(s => s.batch_id === batchId && s.status === 'Generated');
    if (match) return match;

    await delay(delayMs);
    delayMs = Math.min(delayMs * 1.5, maxDelayMs); // exponential backoff, capped at 10 seconds
  }

  return null; // Timed out — surface retry option to user
}
The 202 pattern has a concrete degradation behavior under LLM unavailability: the suggestion remains in Pending status until the LLM call succeeds or the TTL expires. The core CRM product — opportunity management, contact records, pipeline stages — continues to function. The AI suggestions panel shows a loading state. No user-facing error appears in the core workflow.

9. Implementation Constraints

Several constraints in this implementation are specific to the production environment and represent trade-offs rather than general recommendations.

Cooldown granularity is per-opportunity, not per-user. This means that if one user triggers generation for an opportunity, a second user is also blocked for the 60-second cooldown window. In the CRM context this is acceptable — two users editing the same opportunity simultaneously is uncommon, and the suggestion content would be identical regardless of which user requested it. For collaborative real-time systems, per-user cooldowns may be more appropriate.

The batch ID rotates daily at UTC midnight. Teams in UTC+N time zones may observe the rotation at a time that coincides with business hours. The practical impact is minimal — users who triggered generation just before midnight and attempt to view suggestions just after will find their batch_id no longer matches any current Generating suggestion. The solution is a UI that falls back to listing all suggestions by creation time when a batch_id match is not found.

The 30-second LLM timeout is a hard limit. This timeout was selected to remain within API gateway timeout budgets common in managed cloud environments. If the LLM call regularly approaches 30 seconds for complex opportunities, the correct solution is to reduce prompt context rather than extend the timeout, which would risk breaching downstream timeout constraints.

Suggestion content is not indexed or searchable. The 24-hour TTL and the decision to keep suggestion text out of the event stream mean that suggestion content cannot be searched after the fact. If retrospective analysis of AI suggestion quality is required, a dedicated analytics pipeline that queries the DynamoDB table before TTL expiry must be designed.

10. Recommendations

  1. Treat cost control as a first-class architectural requirement, not a monitoring concern. Implement all three rate-control layers — feature gate, HTTP rate limit, and per-opportunity cooldown — before the feature reaches production. Removing runaway LLM cost after the fact requires emergency changes under pressure; preventing it requires one TTL field in a DynamoDB item.
  2. Define event schemas that structurally cannot contain LLM output. Do not rely on code review or convention to keep AI-generated content out of domain events. Write the event struct without the suggestion_text field. If a future developer wants to add it, the change is explicit and reviewable. If the struct always had it, the addition is invisible.
  3. Use a typed prompt extraction function, not a general serializer. Extract only allowlisted fields from domain objects when constructing LLM prompts. Audit the allowlist whenever the underlying domain model adds fields that could contain PII.
  4. Implement the 202 pattern before any LLM feature reaches users. Designing a synchronous LLM integration and converting it to async later requires changes to the client, the API contract, and the persistence model. Building async from the start costs one additional polling endpoint and approximately 50 lines of React polling logic.
  5. Set the TTL before the first write, not after. DynamoDB TTL must be included in the initial item write. There is no cost-effective way to add TTL to existing items at scale. If TTL was omitted from the initial design, stale suggestions will accumulate until a backfill job is run.
  6. Use UUID v7, not UUID v4, for sort keys in time-ordered collections. UUID v4 in a sort key produces random ordering within a partition, requiring client-side sort after every query. UUID v7 encodes millisecond precision in the most-significant bits, making lexicographic and chronological order identical. In DynamoDB, where sort order is determined at the storage layer, this eliminates a class of query post-processing permanently (see the sketch after this list).
  7. Keep the LlmClient trait in the consuming service until a second consumer exists. Premature extraction to a shared crate creates a dependency that must be versioned and coordinated across services. The trait definition is small; when a second consumer appears, the extraction is mechanical.
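The ordering property behind recommendation 6 is easy to verify. A small demonstration, assuming a recent uuid crate with the v7 feature enabled:

use uuid::Uuid;

fn main() {
    let earlier = Uuid::now_v7();
    std::thread::sleep(std::time::Duration::from_millis(2));
    let later = Uuid::now_v7();

    // The first 48 bits of a v7 UUID are a big-endian millisecond timestamp,
    // so string and byte comparison follow creation time. DynamoDB's
    // storage-layer sort relies on exactly this property.
    assert!(earlier.to_string() < later.to_string());
    println!("{earlier} < {later}");
}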

Forward-Looking Statement

As LLM inference latency decreases and pricing continues to fall, the temptation will grow to treat AI features as lightweight additions to existing services rather than as first-class architectural concerns. The cost, compliance, and correctness requirements documented here do not diminish as inference becomes cheaper — they intensify as adoption broadens and regulatory scrutiny of AI-assisted business decisions increases. The patterns established at initial implementation — deterministic idempotency keys, structurally enforced PII boundaries, layered cost controls, async response contracts — become harder to retrofit as the feature accumulates production usage and downstream consumers take dependencies on its behavior. Teams planning LLM integration should build the production architecture on the first deployment, not the second.
All content represents personal learning from personal and open-source projects. Code examples are sanitized and generalized. No proprietary information is shared. Opinions are my own and do not reflect my employer’s views.