Tenant Isolation: What It Means and Why It Matters
In a multi-tenant AI SaaS, tenant isolation means one customer cannot see, influence, or interfere with another customer's data or model behavior. The isolation requirements are subtler than in traditional web apps because anything placed in an LLM's context window can surface in, or steer, its output:
- Context isolation: Each API call must contain only the current tenant's data. Conversation history, retrieved documents, and user profile data must be scoped to the authenticated tenant ID and never mixed across tenants.
- Vector store isolation: If tenants share a vector database, queries must filter by tenant ID. Use payload-indexed filtering in Qdrant or namespace-level isolation in Pinecone. Never rely on the application layer alone to scope results — enforce it at the query level.
- Fine-tuned model isolation: If tenants have fine-tuned models, ensure model routing maps tenant IDs to their specific model versions. A mismatch silently serves the wrong model.
- Prompt isolation: If your system prompt contains any per-tenant configuration, generate it dynamically per request from a tenant configuration store. Never cache a system prompt across tenants.
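The model routing and prompt isolation points above can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: `TenantConfig`, `TENANT_CONFIGS`, `resolve_tenant`, `build_request`, the tenant IDs, and the model name `ft:acme-support-v3` are all hypothetical, and the in-memory dictionary stands in for a real tenant configuration store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantConfig:
    """Per-tenant settings; in production these would come from a config store."""
    model: str
    tone: str


# Hypothetical in-memory stand-in for the tenant configuration store.
TENANT_CONFIGS: dict[str, TenantConfig] = {
    "acme": TenantConfig(model="ft:acme-support-v3", tone="formal"),
    "globex": TenantConfig(model="gpt-4o-mini", tone="casual"),
}


class UnknownTenantError(LookupError):
    """Raised instead of falling back to a default model or prompt."""


def resolve_tenant(tenant_id: str) -> TenantConfig:
    # Fail closed: an unknown tenant must never be served another tenant's setup.
    try:
        return TENANT_CONFIGS[tenant_id]
    except KeyError:
        raise UnknownTenantError(tenant_id) from None


def build_request(tenant_id: str, user_message: str) -> dict:
    """Build the model name and messages for one request, from scratch."""
    config = resolve_tenant(tenant_id)
    # The system prompt is generated per request and never stored in a
    # variable shared across tenants.
    system_prompt = f"You are a support assistant. Use a {config.tone} tone."
    return {
        "model": config.model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
    }
```

Raising on an unknown tenant, rather than falling back to a default model, is a deliberate choice here: a loud failure is easier to detect than a silently served wrong model.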
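# Tenant-scoped RAG query: enforce isolation at the query level
from qdrant_client.models import FieldCondition, Filter, MatchValue, ScoredPoint

# qdrant_client (a QdrantClient instance) and embed() are assumed to be defined elsewhere.
def retrieve_for_tenant(query: str, tenant_id: str, top_k: int = 5) -> list[ScoredPoint]:
    results = qdrant_client.search(
        collection_name="documents",
        query_vector=embed(query),
        query_filter=Filter(
            must=[
                FieldCondition(
                    key="tenant_id",
                    match=MatchValue(value=tenant_id),
                )
            ]
        ),
        limit=top_k,
    )
    # Verify every result belongs to the correct tenant (defense in depth).
    # An explicit check is used because assert statements are skipped under python -O.
    if any(r.payload["tenant_id"] != tenant_id for r in results):
        raise RuntimeError("cross-tenant result returned from vector store")
    return results

Cost Attribution: Knowing What Each Tenant Costs You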
LLM inference costs are variable per request — a long context costs more than a short one. Without per-tenant cost tracking, you cannot price correctly, cannot identify unprofitable customers, and cannot enforce usage limits before they become expensive.
- Log every LLM call with tenant ID, model name, prompt tokens, completion tokens, and computed cost.
- Store this in a usage_events table in your database. This is the source of truth for billing, limits, and analytics.
- Calculate cost server-side from token counts — do not rely on provider invoices for per-tenant attribution. The math is simple: (prompt_tokens / 1M) × prompt_price + (completion_tokens / 1M) × completion_price.
- Expose a usage dashboard to tenants — customers with visibility into their usage self-regulate. Those without visibility will always be surprised by their bill.
# Log LLM usage to database after every call
from decimal import Decimal

PRICING = {  # per 1M tokens, in USD
    "gpt-4o": {"prompt": Decimal("2.50"), "completion": Decimal("10.00")},
    "gpt-4o-mini": {"prompt": Decimal("0.15"), "completion": Decimal("0.60")},
}

def log_llm_usage(tenant_id: str, model: str, usage: dict):
    rates = PRICING[model]
    cost = (
        Decimal(usage["prompt_tokens"]) / 1_000_000 * rates["prompt"]
        + Decimal(usage["completion_tokens"]) / 1_000_000 * rates["completion"]
    )
    db.execute(
        """INSERT INTO usage_events
               (tenant_id, model, prompt_tokens, completion_tokens, cost_usd, created_at)
           VALUES (%s, %s, %s, %s, %s, NOW())""",
        (tenant_id, model, usage["prompt_tokens"], usage["completion_tokens"], cost),
    )

Prompt Caching: The Easiest Cost Reduction
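If your system prompt is long (500+ tokens) and the same for every request from a given tenant, you are paying full price for it on every call. Prompt caching lets you pay full price once and a reduced rate on the calls that follow. Providers only cache prompts above a minimum length (1,024 tokens is a common threshold), so check that your prompt qualifies: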
- Anthropic prompt caching: add a cache_control marker to the system prompt block to mark the end of the reusable prefix. The first call processes and caches the prompt; subsequent calls within 5 minutes reuse the cache at ~10% of the normal input token cost.
- OpenAI automatic caching: prompts of 1,024 tokens or more are automatically eligible for caching at 50% off when the same prefix is repeated shortly afterwards. There is no explicit session; cached prefixes are evicted after a period of inactivity.
- Application-level caching: for identical queries (same prompt, same tenant context), cache the full LLM response in Redis with a TTL. FAQ-style queries can have hit rates above 40% in high-traffic applications.
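The application-level caching point above can be sketched as follows. This is a minimal illustration, not a definitive implementation: `cache_key`, `cached_completion`, `InMemoryCache`, and the `call_llm` callback are hypothetical names. `InMemoryCache` stands in for a Redis client so the example runs without a server; it mimics only `get` and `setex`, which redis-py's client also provides.

```python
import hashlib
import json
from typing import Callable, Dict, Optional


class InMemoryCache:
    """Test double exposing the two calls used below (get, setex).

    This stand-in ignores the TTL so the example runs without a Redis server.
    """

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}

    def get(self, name: str) -> Optional[bytes]:
        return self._store.get(name)

    def setex(self, name: str, time: int, value: str) -> None:
        self._store[name] = value.encode("utf-8")


def cache_key(tenant_id: str, model: str, prompt: str) -> str:
    # The tenant ID is part of the hashed payload, so identical prompts from
    # different tenants never share a cache entry.
    payload = json.dumps([tenant_id, model, prompt])
    return "llmcache:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_completion(
    cache,
    tenant_id: str,
    model: str,
    prompt: str,
    call_llm: Callable[[str], str],
    ttl_seconds: int = 3600,
) -> str:
    """Return a cached response if present; otherwise call the model and store it."""
    key = cache_key(tenant_id, model, prompt)
    hit = cache.get(key)
    if hit is not None:
        # Redis returns bytes unless the client decodes responses itself.
        return hit.decode("utf-8") if isinstance(hit, bytes) else hit
    response = call_llm(prompt)
    cache.setex(key, ttl_seconds, response)
    return response
```

Including the tenant ID in the key keeps the cache consistent with the isolation rules earlier in this article: a hit can only ever return a response that was generated for the same tenant.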
# Anthropic prompt caching
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": long_system_prompt,  # 2,000+ token system prompt
            "cache_control": {"type": "ephemeral"},  # cache this prefix
        }
    ],
    messages=[{"role": "user", "content": user_message}],
)
# system prompt tokens are cached for 5 minutes
# subsequent calls pay ~10% for the cached portion

Rate Limiting
Rate limiting in AI SaaS has two layers: protecting your upstream API quota from being exhausted by a single tenant, and enforcing your product's pricing tiers:
- Token-based rate limits: track tokens used per tenant per minute and per day. Token limits are more accurate than request limits because request sizes vary enormously.
- Redis sliding window: use a sorted set in Redis to track usage over a rolling time window. More accurate than a fixed-window counter for bursty traffic.
- Tier enforcement: map each pricing tier to a daily token budget. When a tenant exceeds their budget, return a 429 with a clear message rather than silently degrading quality.
- Graceful degradation: for tenants near their limit, optionally route to a cheaper or smaller model instead of hard-blocking, and tell the user that this is happening. Users then get lower-quality responses rather than an error, without the silent degradation the previous point warns against.
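The token-based sliding window described above can be sketched as follows. This is an illustrative, single-process version under stated assumptions: `SlidingWindowTokenLimiter`, `check_request`, and the `TIER_TOKENS_PER_MINUTE` budgets are hypothetical, and an in-memory deque stands in for the Redis sorted set. In a Redis version, appending an event would correspond to ZADD and eviction to ZREMRANGEBYSCORE.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict, Tuple

# Hypothetical per-tier budgets, in tokens per rolling minute.
TIER_TOKENS_PER_MINUTE = {"free": 10_000, "pro": 100_000}


class SlidingWindowTokenLimiter:
    """Tracks tokens used per tenant over a rolling time window."""

    def __init__(self, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.window_seconds = window_seconds
        self.clock = clock
        self._events: Dict[str, Deque[Tuple[float, int]]] = {}
        self._totals: Dict[str, int] = {}

    def _evict(self, tenant_id: str, now: float) -> None:
        # Drop events that have fallen out of the window and subtract their tokens.
        events = self._events.setdefault(tenant_id, deque())
        cutoff = now - self.window_seconds
        while events and events[0][0] <= cutoff:
            _, tokens = events.popleft()
            self._totals[tenant_id] -= tokens

    def used(self, tenant_id: str) -> int:
        self._evict(tenant_id, self.clock())
        return self._totals.get(tenant_id, 0)

    def try_consume(self, tenant_id: str, tokens: int, limit: int) -> bool:
        """Record the usage and return True if it fits; otherwise return False."""
        now = self.clock()
        self._evict(tenant_id, now)
        used = self._totals.get(tenant_id, 0)
        if used + tokens > limit:
            return False
        self._events[tenant_id].append((now, tokens))
        self._totals[tenant_id] = used + tokens
        return True


def check_request(limiter: SlidingWindowTokenLimiter, tenant_id: str,
                  tier: str, tokens: int) -> Tuple[int, str]:
    """Map the limiter decision to an HTTP-style status and a clear message."""
    limit = TIER_TOKENS_PER_MINUTE[tier]
    if limiter.try_consume(tenant_id, tokens, limit):
        return 200, "ok"
    return 429, f"Token limit of {limit} per minute reached for the {tier} tier."
```

The clock is injectable so the window logic can be tested without sleeping. A production version would need shared storage such as Redis, since per-process counters undercount when requests are spread across several workers.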
Observability: The Layer That Saves You
An AI SaaS without observability is flying blind. These are the metrics and logs that pay for themselves the first time something goes wrong:
- Per-request traces: log tenant ID, model, latency, token counts, cost, and a success/failure flag for every LLM call. This single table answers 90% of production support questions.
- Quality score trends: if you run automated quality evaluation (groundedness, relevance), track the daily average by tenant. A tenant whose scores drop suddenly may have changed their usage pattern or uploaded problematic documents.
- Cost anomaly detection: alert when a tenant's cost for the current hour exceeds 3x their average hourly cost over the previous 7 days. Runaway agent loops and prompt injection attacks both show up as cost spikes before they show up as complaints.
- Latency percentiles by tenant and model: p50, p95, p99 per day. Latency regressions are often model-specific and tenant-specific — aggregated averages hide them.
- Error rates by error type: distinguish rate limit errors (you are hitting upstream limits), validation errors (structured output failures), timeout errors, and application errors. Each has a different root cause and fix.
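The cost anomaly rule from the list above can be sketched as follows. This is one possible reading of the rule, not a definitive implementation: `is_cost_anomaly` and the `min_baseline` floor are assumptions added here, the latter so that a tenant with almost no history does not trigger an alert on their first small request.

```python
from statistics import mean
from typing import Sequence


def is_cost_anomaly(
    current_hour_cost: float,
    trailing_hourly_costs: Sequence[float],
    multiplier: float = 3.0,
    min_baseline: float = 0.01,
) -> bool:
    """Return True when the current hour costs more than multiplier x the baseline.

    trailing_hourly_costs holds one cost figure per hour for the lookback
    period, for example 168 values for 7 days.
    """
    if not trailing_hourly_costs:
        # No history yet: there is nothing to compare against.
        return False
    baseline = max(mean(trailing_hourly_costs), min_baseline)
    return current_hour_cost > multiplier * baseline
```

The hourly figures could be produced by summing cost_usd per tenant per hour from the usage_events table described earlier.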
The observability stack does not need to be complex. Structured JSON logs shipped to a searchable store (Datadog, Grafana Loki, or even Postgres with a Grafana dashboard) are sufficient. The key is logging the right fields from the start, because retrofitting observability onto a production system is painful.
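A per-request trace record with the fields listed above might look like the following. This is a minimal sketch using only the standard library: `build_trace_record`, `log_trace`, the field names, and the logger name `llm_trace` are assumptions, not a required schema.

```python
import json
import logging
import time
from typing import Optional

logger = logging.getLogger("llm_trace")


def build_trace_record(
    tenant_id: str,
    model: str,
    latency_ms: float,
    prompt_tokens: int,
    completion_tokens: int,
    cost_usd: float,
    error_type: Optional[str] = None,
) -> dict:
    """Build one flat, JSON-serializable record per LLM call."""
    return {
        "ts": time.time(),
        "tenant_id": tenant_id,
        "model": model,
        "latency_ms": latency_ms,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "cost_usd": cost_usd,
        "success": error_type is None,
        # For example "rate_limit", "validation", "timeout", or "application".
        "error_type": error_type,
    }


def log_trace(record: dict) -> None:
    # One JSON object per line is easy for log shippers to parse.
    logger.info(json.dumps(record, sort_keys=True))
```

Keeping the record flat, with one field per metric, makes it straightforward to group by tenant, model, or error type in whichever store the logs end up in.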