Executive Summary
Integrating a large language model into a production multi-tenant SaaS application is not primarily an AI engineering problem — it is a systems engineering problem. Four constraints arrive simultaneously: per-tenant cost control, audit compliance, correctness under concurrent retries, and graceful degradation when the LLM is unavailable. This analysis documents the complete vertical slice of an LLM-powered AI suggestions feature in a production CRM service, from DynamoDB entity design through an eight-step Rust service layer to a polling React interface. The implementation demonstrates that each constraint has a specific, composable solution: three-layer rate control addresses cost, initials-only prompt construction addresses PII, SHA-256 deterministic batch IDs address idempotency, and a 202 async pattern addresses degradation. Organizations planning LLM feature additions to existing SaaS platforms should treat these four requirements as non-negotiable prerequisites rather than post-launch concerns.
Key Findings
- LLM integration into multi-tenant SaaS requires solving four orthogonal production constraints simultaneously: cost control, audit compliance, retry idempotency, and graceful degradation. Solving any three while deferring the fourth creates an incident surface that typically manifests under load.
- A SHA-256 hash of opportunity ID and UTC date yields a deterministic idempotency key that eliminates coordination overhead. Two concurrent generate requests for the same opportunity on the same calendar day produce identical batch IDs; a single DynamoDB conditional write ensures only one suggestion set persists.
- Three independent rate-control layers are required because each layer defends a different failure boundary. Feature flags disable the capability per tenant, HTTP-layer rate limiting prevents API abuse, and per-opportunity cooldown records in DynamoDB prevent redundant LLM calls within a generation window.
- Event payloads must be structurally incapable of containing LLM output. Omitting suggestion text from the AiSuggestionGenerated domain event at the type level — not by convention — ensures downstream consumers cannot reconstruct raw model output from the audit stream, regardless of future development.
- DynamoDB UUID v7 sort keys and a 24-hour TTL are architectural decisions, not implementation details. Time-sortable suggestion IDs eliminate secondary sort operations; automatic expiry removes the engineering burden of stale-suggestion cleanup and reduces storage cost.
- The 202 async response pattern decouples perceived latency from actual LLM round-trip time. Clients receive an immediate acknowledgment with a polling reference; LLM unavailability becomes a retryable background state rather than a blocking user-facing failure.
1. The Four Production Requirements for LLM Integration in Multi-Tenant SaaS
Adding LLM inference to an existing multi-tenant platform introduces four production requirements that do not apply to deterministic features. Each requires deliberate architectural investment; none can be addressed retroactively without structural change to the persistence and event models.

Cost control is the most immediately visible requirement. In a multi-tenant system, a single tenant calling the generate endpoint in a tight loop will exhaust LLM budget for all tenants. Standard HTTP rate limiting is necessary but insufficient — it does not prevent the same opportunity from being submitted repeatedly within a short window, each call producing a billable LLM request.

Audit compliance surfaces when legal or security reviews examine the event stream. LLM outputs frequently contain synthesized information derived from customer data. If that output is logged verbatim in domain events, it becomes part of the audit record and may trigger data residency or PII handling obligations that the original event design did not anticipate. The solution is structural: define event schemas that cannot hold LLM output.

Correctness under retries is the idempotency requirement. Network failures, client timeouts, and load balancer retries mean that any single user action may produce multiple inbound requests. A naive implementation creates duplicate suggestion records per retry. The correct solution eliminates duplication at the key level rather than through locking or deduplication queries.

Graceful degradation is the availability requirement. The LLM is an external service with its own SLA, which is lower than the SLA of the CRM product embedding it. Users must be able to view opportunities, update records, and advance pipeline stages regardless of LLM availability. The suggestion feature must fail independently of the core product.

These four requirements shape every layer of the implementation described in the sections that follow.
2. Entity Model: Time-Sortable Keys, Status Index, and TTL-Based Expiry
The DynamoDB table design for AI suggestions enforces three properties at the storage layer: tenant isolation, efficient status queries, and automatic expiry. A composite partition key scopes every item to its tenant and opportunity, and the UUID v7 sort key allows a single Query call to retrieve all suggestions for an opportunity, ordered by generation time, without a sort step. UUID v7 sort keys are critical here: UUID v4 is random, which scatters suggestions within the partition and requires client-side sorting. UUID v7 embeds a millisecond-precision timestamp in the most-significant bits, making lexicographic sort identical to chronological sort.
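To make the sort-order claim concrete, the sketch below builds a UUID v7-shaped value by hand. The key format, helper names, and explicit entropy parameter are illustrative assumptions rather than the production schema (a real generator draws the low bits from a random source):

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Illustrative partition key scoping items to one tenant + opportunity.
/// (This exact format is an assumption, not the production schema.)
fn partition_key(tenant_id: &str, opp_id: &str) -> String {
    format!("TENANT#{tenant_id}#OPP#{opp_id}")
}

/// Build a UUID v7-shaped string: 48-bit unix-millis timestamp in the
/// most-significant bits, version nibble 7, then caller-supplied entropy
/// (a stand-in for the random bits a real generator would use).
fn uuid_v7(unix_ms: u64, entropy: u128) -> String {
    let ts = (unix_ms as u128) & 0xFFFF_FFFF_FFFF; // 48 bits
    let rand_a = (entropy >> 64) & 0x0FFF;         // 12 bits
    let rand_b = entropy & 0x3FFF_FFFF_FFFF_FFFF;  // 62 bits
    let v: u128 = (ts << 80) | (0x7 << 76) | (rand_a << 64) | (0b10 << 62) | rand_b;
    let h = format!("{v:032x}");
    format!("{}-{}-{}-{}-{}", &h[0..8], &h[8..12], &h[12..16], &h[16..20], &h[20..32])
}

fn now_ms() -> u64 {
    SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_millis() as u64
}
```

Because the timestamp occupies the top bits and the hex encoding is fixed-width, plain string comparison of two sort keys agrees with their generation order; a UUID v4 sort key gives no such guarantee.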
The status Global Secondary Index enables the second access pattern — querying all suggestions in a given status across all opportunities for a tenant. The index supports retrieving all Pending or Generated suggestions for a tenant without scanning the full table.
Two additional storage-layer decisions address operational concerns:
TTL: Each record includes a generated_at epoch-second attribute. DynamoDB TTL is set to generated_at + 86400 (24 hours). Stale suggestions are removed automatically without a scheduled cleanup job. The 24-hour window reflects a deliberate product decision: AI suggestions that are more than a day old are likely to be based on stale opportunity context and may mislead users. Automatic expiry enforces this policy structurally.
SSE with AES256: Server-side encryption is enabled on the table. This is baseline hygiene for any table containing customer-derived content, but it is especially important for LLM outputs, which may contain synthesized information that is sensitive in combination even when individual inputs are not.
Stream: NEW_AND_OLD_IMAGES: DynamoDB Streams is enabled with full image capture. This supports downstream audit consumers that need to observe suggestion lifecycle transitions — particularly the Generated → Accepted and Generated → Dismissed transitions, which are relevant to understanding which suggestions influenced user decisions.
3. The Eight-Step Generation Flow: Feature Gate to Event Emission
The AiSuggestionService::generate method executes eight steps in strict sequence. Each step is a guard that can short-circuit the flow with a specific error response. Understanding each step’s purpose clarifies why the ordering matters.
Step 1 — Feature flag guard. The service checks tenant_config.ai_assist_enabled before any other work. If the flag is absent or false, the service returns HTTP 403 immediately. This is the coarsest control lever: it allows the AI suggestions feature to be enabled per tenant without deployment changes, and it allows a tenant to be disabled without affecting others.
Step 2 — Batch ID derivation. The service computes a deterministic batch ID from the opportunity ID and the current UTC date; Section 4 covers the construction in detail.
Step 3 — Cooldown check. The service calls cooldown_repo.check_cooldown(opp_id). If a cooldown record exists for this opportunity, the service returns HTTP 429 with a retry_after_secs field. If no cooldown is active, the service calls cooldown_repo.record_generation(opp_id) with a 60-second TTL before proceeding.
Step 4 — PII-safe prompt construction. The service assembles the LLM prompt using only privacy-safe fields. Section 7 describes this in detail.
Step 5 — LLM call with timeout. The service invokes the LlmClient trait implementation with a hard 30-second timeout enforced by Tokio.
Step 6 — Response parsing. The service parses and validates the LLM response into typed fields: lead_score, score_factors, and next_action. A malformed response returns HTTP 400 to the caller.
Step 7 — Conditional persistence. The AiSuggestion record is written to DynamoDB with a conditional expression that rejects writes if the item already exists. This is the write-side idempotency guard: if two concurrent requests arrive with the same batch_id, only one write succeeds.
Step 8 — Event emission. The service emits an AiSuggestionGenerated domain event. The event payload contains the suggestion ID, opportunity ID, tenant ID, status, and timestamp — but not the suggestion text. This constraint is discussed in Section 7.
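The short-circuit ordering of the early guards can be sketched as a single sequence. The config struct, error variants, and elided steps below are simplified stand-ins, not the production types:

```rust
/// Simplified error vocabulary for the early guards (illustrative only).
#[derive(Debug, PartialEq)]
enum GenerateError {
    FeatureDisabled,                          // Step 1 -> HTTP 403
    CooldownActive { retry_after_secs: u64 }, // Step 3 -> HTTP 429
}

struct TenantConfig {
    ai_assist_enabled: bool,
}

/// Steps 1-3 of the generate flow; Steps 4-8 (prompt build, LLM call,
/// parse, conditional write, event emit) are elided as a comment.
fn generate(
    cfg: &TenantConfig,
    opp_id: &str,
    utc_date: &str,
    cooldown_active: bool,
) -> Result<String, GenerateError> {
    // Step 1: feature flag guard runs before any other work.
    if !cfg.ai_assist_enabled {
        return Err(GenerateError::FeatureDisabled);
    }
    // Step 2: deterministic batch ID (the real build hashes opp_id + UTC date).
    let batch_id = format!("{opp_id}:{utc_date}");
    // Step 3: cooldown guard, so no LLM spend occurs inside the window.
    if cooldown_active {
        return Err(GenerateError::CooldownActive { retry_after_secs: 60 });
    }
    // Steps 4-8 would run here.
    Ok(batch_id)
}
```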
4. Idempotency Without Coordination: SHA-256 Batch IDs as Deterministic Keys
The idempotency requirement for LLM generation is more subtle than for standard CRUD operations. Standard idempotency patterns use client-supplied request IDs or server-generated tokens stored in a coordination table. Both approaches introduce coordination overhead: the client must manage and reuse the ID across retries, or the server must maintain a separate lookup table with its own consistency requirements.

The SHA-256 batch ID approach eliminates coordination by deriving the idempotency key deterministically from the inputs that define what is being requested: the opportunity ID and the UTC date. Every retry of the same logical request computes the same batch ID, so the conditional write in Step 7 admits exactly one record; a concurrent second writer receives ConditionalCheckFailedException. The service treats this exception as a success — the suggestion was generated — and returns the batch ID to both callers.
This pattern has one important limitation: the determinism depends on clock alignment. If two requests arrive at the UTC day boundary, one may compute 2026-05-08 and the other 2026-05-09. The result is two distinct batch IDs and two distinct suggestion sets. This is correct behavior — the suggestions were generated at different calendar dates with potentially different opportunity context — but implementors should be aware of the boundary condition.
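A minimal sketch of the two halves of the pattern follows. To keep it dependency-free, std's DefaultHasher stands in for SHA-256 (production would use a real SHA-256 implementation, e.g. from the sha2 crate), and a HashMap stands in for DynamoDB's attribute_not_exists conditional write; the properties demonstrated — same inputs yield the same key, and only the first writer succeeds — are the ones that matter:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Deterministic batch ID from opportunity ID + UTC date.
/// DefaultHasher is a dependency-free stand-in for SHA-256 here.
fn batch_id(opp_id: &str, utc_date: &str) -> String {
    let mut h = DefaultHasher::new();
    opp_id.hash(&mut h);
    utc_date.hash(&mut h);
    format!("{:016x}", h.finish())
}

/// Insert-if-absent, mimicking DynamoDB's attribute_not_exists condition:
/// returns true for the first writer, false (ConditionalCheckFailed) after.
fn conditional_put(store: &mut HashMap<String, String>, key: String, item: String) -> bool {
    if store.contains_key(&key) {
        return false;
    }
    store.insert(key, item);
    true
}
```

Both retries compute the same key without exchanging any state, so the losing conditional_put is simply reported as success to its caller.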
5. Three-Layer Cost Control: Cooldown, Rate Limit, and Feature Gate
The three rate-control layers in this implementation are not redundant. Each defends a distinct failure boundary, and the failure modes they prevent do not overlap.
| Layer | Mechanism | Scope | Failure Prevented |
|---|---|---|---|
| Feature gate | tenant_config.ai_assist_enabled | Per-tenant | Unauthorized LLM usage, budget overrun for disabled tenants |
| HTTP rate limit | rate_limit = "60/min" on handler | Per-API-key | API abuse, scripted hammering of the generate endpoint |
| Cooldown | DynamoDB TTL record, 60s per opportunity | Per-opportunity | Redundant LLM calls for same context within generation window |
The CooldownRepository writes a DynamoDB item with a 60-second TTL each time generation succeeds. For the duration of that TTL, subsequent generate requests for the same opportunity return HTTP 429 with a retry_after_secs field. This prevents the case where a user triggers generation, the LLM responds slowly, the user triggers generation again — and both requests proceed to the LLM.
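The cooldown mechanism can be sketched with an in-memory stand-in for the DynamoDB-backed table. The struct and method names mirror the text, but the implementation is illustrative, not the production repository:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// In-memory stand-in for the DynamoDB cooldown table.
struct CooldownRepo {
    ttl: Duration,
    entries: HashMap<String, Instant>, // opp_id -> expiry instant
}

impl CooldownRepo {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Some(retry_after_secs) while the cooldown is active, else None.
    fn check_cooldown(&self, opp_id: &str) -> Option<u64> {
        let expiry = self.entries.get(opp_id)?;
        let now = Instant::now();
        if *expiry > now {
            Some((*expiry - now).as_secs().max(1))
        } else {
            None
        }
    }

    /// Record a successful generation, starting the cooldown window.
    fn record_generation(&mut self, opp_id: &str) {
        self.entries.insert(opp_id.to_string(), Instant::now() + self.ttl);
    }
}
```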
The three layers compose independently. An operator can tighten the HTTP rate limit without touching cooldown behavior. A tenant can be disabled at the feature gate without affecting the rate limit configuration. The cooldown window can be adjusted per deployment without changes to the API handler.
6. The LlmClient Trait: Abstraction Without Premature Extraction
The LlmClient abstraction is defined as a Rust async trait. Because LlmClient is a trait, test suites inject a mock implementation that returns controlled responses, enabling exhaustive testing of the eight-step flow without network calls. Provider abstraction means a future migration from one LLM provider to another requires only a new struct implementing the trait, not changes to the service logic.
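A sketch of the trait-plus-mock shape follows. The production trait is async; async is dropped here so the sketch runs without an executor, and the method name, signature, and trimmed error set are assumptions for illustration:

```rust
/// Trimmed error set for the sketch; the full vocabulary is larger.
#[derive(Clone, Debug, PartialEq)]
enum LlmError {
    RateLimit,
    Timeout,
}

/// Provider abstraction. The real trait is async (async fn), and its
/// method name and shape here are assumptions, not the production API.
trait LlmClient {
    fn complete(&self, prompt: &str) -> Result<String, LlmError>;
}

/// Mock used by test suites: returns a controlled response, no network.
struct MockLlm {
    response: Result<String, LlmError>,
}

impl LlmClient for MockLlm {
    fn complete(&self, _prompt: &str) -> Result<String, LlmError> {
        self.response.clone()
    }
}
```

With the mock in place, every branch of the eight-step flow, including timeout and rate-limit paths, can be exercised deterministically.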
Implementation constraint: The CRM service is currently the sole consumer of this abstraction. A principled refactoring impulse would extract LlmClient into a shared platform crate for reuse. That extraction was deliberately deferred, following YAGNI discipline. The trait exists in the CRM service crate. If a second service requires LLM access in the future, extraction is a straightforward, mechanical change — copy the trait definition and its single implementation to a shared crate, update the import paths. Extracting before a second consumer exists creates maintenance surface without delivering value.
The LlmError enum is the error vocabulary that the service layer maps to HTTP responses:
| LlmError | HTTP Response | Retry? |
|---|---|---|
| RateLimit | 429 with retry_after_secs | Yes, with backoff |
| Unauthorized | 500 (operator error) | No |
| ModelNotFound(_) | 500 (operator error) | No |
| ContextTooLong | 503 (prompt truncation required) | No, requires fix |
| Timeout | 503 | Yes, with backoff |
| Upstream(_) | 503 | Conditional |
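The table's mapping can be expressed as a single match. The variant names follow the table; the function itself is a sketch of the service-layer mapping, not the production code:

```rust
/// Error vocabulary from the table; variant names match the document.
#[derive(Debug)]
enum LlmError {
    RateLimit,
    Unauthorized,
    ModelNotFound(String),
    ContextTooLong,
    Timeout,
    Upstream(String),
}

/// Map each variant to the HTTP status the service returns.
fn http_status(err: &LlmError) -> u16 {
    match err {
        LlmError::RateLimit => 429,
        // Operator configuration errors route through alerting as 500s.
        LlmError::Unauthorized | LlmError::ModelNotFound(_) => 500,
        LlmError::ContextTooLong | LlmError::Timeout | LlmError::Upstream(_) => 503,
    }
}
```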
The Unauthorized and ModelNotFound variants map to HTTP 500 rather than 4xx because they represent operator configuration errors, not client errors. A client cannot fix a missing API credential or an incorrect model name — these require operator intervention. Returning 500 routes them through the alerting path rather than silently returning a client error that may be ignored.
7. PII Discipline: Prompt Construction and Event Payload Constraints
Two independent PII controls operate in this implementation. They address different surfaces in the data flow.
7.1 Prompt Construction
The LLM prompt is assembled in Step 4 using a dedicated prompt builder that enforces a strict field allowlist. Contact data — email addresses, phone numbers, full names — is excluded. Only initials are included.
7.2 Event Payload Constraints
The AiSuggestionGenerated domain event carries metadata about the generation event but does not carry the generated suggestion. The omission of suggestion_text from the event struct is a CISO-level mandate, and its enforcement is structural rather than conventional. A future developer cannot accidentally add the field and have it silently flow into the event stream — they must make an explicit change to the event schema, which will be visible in code review.
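A minimal sketch of the event struct follows. The field names are assumptions based on the Step 8 description (suggestion ID, opportunity ID, tenant ID, status, timestamp); the deliberate absence of suggestion_text is the structural control the text describes:

```rust
/// Sketch of the event payload. Field names are illustrative assumptions;
/// the point is what is absent: the type cannot carry LLM output.
#[derive(Debug)]
struct AiSuggestionGenerated {
    suggestion_id: String,
    opportunity_id: String,
    tenant_id: String,
    status: String,    // e.g. "Generated"
    generated_at: u64, // epoch seconds
    // NOTE: no suggestion_text field. Adding one would require an
    // explicit, reviewable change to the event schema.
}
```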
Downstream consumers that need suggestion content for legitimate purposes (analytics, quality assessment) must query the crm_ai_suggestions table directly, where access controls and TTL apply. This design preserves the audit utility of the event stream while preventing it from becoming a secondary store of LLM-synthesized content.
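Returning to the prompt construction described in 7.1, the initials-only extraction can be sketched as below. The helper names and prompt wording are assumptions for illustration; the design point is that the builder accepts only pre-approved fields, never a whole contact record:

```rust
/// Derive initials from a full name so the prompt never carries the
/// name itself (illustrative helper, not the production builder).
fn initials(full_name: &str) -> String {
    full_name
        .split_whitespace()
        .filter_map(|word| word.chars().next())
        .map(|c| format!("{}.", c.to_ascii_uppercase()))
        .collect::<Vec<_>>()
        .join("")
}

/// Allowlist-style prompt builder: it takes only individual approved
/// fields, so PII cannot leak into the prompt by accident.
fn build_prompt(contact_initials: &str, stage: &str, amount_usd: u64) -> String {
    format!(
        "Opportunity stage: {stage}. Amount: ${amount_usd}. \
         Primary contact: {contact_initials}. Suggest a next action."
    )
}
```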
8. The 202 Async Pattern: Immediate Acceptance, Polled Completion
LLM inference is slow relative to the latency budget of a synchronous HTTP response. Round-trip times of 2–10 seconds are common; under load or provider degradation, 30 seconds or more. A synchronous response pattern — POST, wait for LLM, return result — creates two problems: the client connection is held open for the duration of the LLM call, and LLM unavailability directly causes user-facing errors in the core product.

The 202 async pattern separates request acceptance from result availability:

1. Client POSTs a generate request. The service executes Steps 1–3 (feature gate, batch ID, cooldown) synchronously and returns HTTP 202 immediately with the batch ID and a Generating status. The LLM call is dispatched asynchronously.
2. Client polls the list endpoint. The client calls GET /opportunities/{opp_id}/ai-suggestions at an appropriate interval. The endpoint returns an AiSuggestionListResponse containing all suggestions for the opportunity and their current statuses.

The Expired transition is not driven by application code — it is driven by DynamoDB TTL removal. The React UI should treat a suggestion that disappears from the list (present on a previous poll, absent on the current poll) as expired and prompt the user to regenerate if needed.

Under LLM unavailability, a suggestion remains in Pending status until the LLM call succeeds or the TTL expires. The core CRM product — opportunity management, contact records, pipeline stages — continues to function. The AI suggestions panel shows a loading state. No user-facing error appears in the core workflow.
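The disappeared-between-polls check is a set difference. The sketch below uses Rust for consistency with the rest of this analysis, though the production client is React; the function name is an assumption:

```rust
use std::collections::HashSet;

/// IDs present on the previous poll but absent on the current one have
/// been removed by DynamoDB TTL: the UI should treat them as expired.
fn expired_between_polls(previous: &[&str], current: &[&str]) -> Vec<String> {
    let now: HashSet<&str> = current.iter().copied().collect();
    previous
        .iter()
        .filter(|id| !now.contains(**id))
        .map(|id| id.to_string())
        .collect()
}
```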
9. Implementation Constraints
Several constraints in this implementation are specific to the production environment and represent trade-offs rather than general recommendations.

Cooldown granularity is per-opportunity, not per-user. This means that if one user triggers generation for an opportunity, a second user is also blocked for the 60-second cooldown window. In the CRM context this is acceptable — two users editing the same opportunity simultaneously is uncommon, and the suggestion content would be identical regardless of which user requested it. For collaborative real-time systems, per-user cooldowns may be more appropriate.

The batch ID rotates daily at UTC midnight. Teams in UTC+N time zones may observe the rotation at a time that coincides with business hours. The practical impact is minimal — users who triggered generation just before midnight and attempt to view suggestions just after will find their batch_id no longer matches any current Generating suggestion. The solution is a UI that falls back to listing all suggestions by creation time when a batch_id match is not found.
The 30-second LLM timeout is a hard limit. This timeout was selected to remain within API gateway timeout budgets common in managed cloud environments. If the LLM call regularly approaches 30 seconds for complex opportunities, the correct solution is to reduce prompt context rather than extend the timeout, which would risk breaching downstream timeout constraints.
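The hard-limit shape can be illustrated with a std-only stand-in: run the call on a worker thread and give up waiting after the limit. Production uses a Tokio timeout around the async LLM call; this sketch only demonstrates the cut-off behavior:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Std-only stand-in for a Tokio-enforced hard timeout: the caller
/// stops waiting after `limit`, regardless of how long `call` runs.
fn call_with_timeout<F>(call: F, limit: Duration) -> Result<String, &'static str>
where
    F: FnOnce() -> String + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let _ = tx.send(call()); // receiver may already be gone after timeout
    });
    rx.recv_timeout(limit).map_err(|_| "timeout -> HTTP 503")
}
```

Note that in this thread-based sketch the worker keeps running after the deadline; a cancellation-aware async runtime is one reason the production implementation uses Tokio instead.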
Suggestion content is not indexed or searchable. The 24-hour TTL and the decision to keep suggestion text out of the event stream mean that suggestion content cannot be searched after the fact. If retrospective analysis of AI suggestion quality is required, a separate analytics pipeline that queries the DynamoDB table before TTL expiry must be designed separately.
10. Recommendations
- Treat cost control as a first-class architectural requirement, not a monitoring concern. Implement all three rate-control layers — feature gate, HTTP rate limit, and per-opportunity cooldown — before the feature reaches production. Removing runaway LLM cost after the fact requires emergency changes under pressure; preventing it requires one TTL field in a DynamoDB item.
- Define event schemas that structurally cannot contain LLM output. Do not rely on code review or convention to keep AI-generated content out of domain events. Write the event struct without the suggestion_text field. If a future developer wants to add it, the change is explicit and reviewable. If the struct always had it, the addition is invisible.
- Use a typed prompt extraction function, not a general serializer. Extract only allowlisted fields from domain objects when constructing LLM prompts. Audit the allowlist whenever the underlying domain model adds fields that could contain PII.
- Implement the 202 pattern before any LLM feature reaches users. Designing a synchronous LLM integration and converting it to async later requires changes to the client, the API contract, and the persistence model. Building async from the start costs one additional polling endpoint and approximately 50 lines of React polling logic.
- Set the TTL before the first write, not after. DynamoDB TTL must be included in the initial item write. There is no cost-effective way to add TTL to existing items at scale. If TTL was omitted from the initial design, stale suggestions will accumulate until a backfill job is run.
- Use UUID v7, not UUID v4, for sort keys in time-ordered collections. UUID v4 in a sort key produces random ordering within a partition, requiring client-side sort after every query. UUID v7 encodes millisecond precision in the most-significant bits, making lexicographic and chronological order identical. In DynamoDB, where sort order is determined at the storage layer, this eliminates a class of query post-processing permanently.
- Keep the LlmClient trait in the consuming service until a second consumer exists. Premature extraction to a shared crate creates a dependency that must be versioned and coordinated across services. The trait definition is small; when a second consumer appears, the extraction is mechanical.
Forward-Looking Statement
As LLM inference latency decreases and pricing continues to fall, the temptation will grow to treat AI features as lightweight additions to existing services rather than as first-class architectural concerns. The cost, compliance, and correctness requirements documented here do not diminish as inference becomes cheaper — they intensify as adoption broadens and regulatory scrutiny of AI-assisted business decisions increases. The patterns established at initial implementation — deterministic idempotency keys, structurally enforced PII boundaries, layered cost controls, async response contracts — become harder to retrofit as the feature accumulates production usage and downstream consumers take dependencies on its behavior. Teams planning LLM integration should build the production architecture on the first deployment, not the second.

All content represents personal learning from personal and open-source projects. Code examples are sanitized and generalized. No proprietary information is shared. Opinions are my own and do not reflect my employer’s views.