All posts
Infrastructure · MLOps
April 28, 202611 min read

Choosing the Right Infrastructure for Your LLM in Production: A Decision Framework

M

Moneeb Abbas

AI Systems Architect

The question I get asked most often after 'which model should I use' is 'where should I run it'. The answer depends on five variables: your monthly inference volume, your latency requirements, your compliance constraints, your team's operational maturity, and your risk tolerance for outages. This framework walks through all five.

The Three Deployment Paths

Every LLM production deployment falls into one of three categories, and the right choice is almost entirely determined by volume and compliance constraints:

  1. 1Cloud API (OpenAI, Anthropic, Google): Pay-per-token, zero infrastructure management, fastest time to production. Right for low-to-medium volume and no data residency requirements.
  2. 2Managed inference (Replicate, Together AI, Groq, AWS Bedrock): You choose the model, a third party hosts it. Better economics than cloud APIs at scale, some data residency options available, still limited operational overhead.
  3. 3Self-hosted (your hardware, your cloud account): Full control, best unit economics at volume, maximum compliance flexibility. Requires GPU infrastructure management and a team capable of running it.
Note:The break-even between cloud API and self-hosted inference is typically $3,000–$5,000/month of API spend. Below that threshold, infrastructure costs exceed API costs. Above it, the math reverses quickly.

Decision Variable 1 — Monthly Inference Volume

Volume is the primary driver. Run the following calculation before making any infrastructure decision:

  • Estimate your monthly token volume: (requests/day) × (average tokens/request) × 30
  • Price that at your current or target API cost (e.g., $15/1M tokens for GPT-4o)
  • Price the equivalent self-hosted compute: reserved GPU instance cost + ops overhead
  • Crossover is where self-hosted monthly cost < API monthly cost

A concrete example: 10 million tokens/day = 300M tokens/month. At $15/1M tokens, that is $4,500/month on GPT-4o. A single A100 80GB reserved instance on AWS runs roughly $1,800–2,200/month and can handle that volume comfortably with vLLM and a well-sized model. The math favors self-hosting — but only if you have someone who can run it.

Decision Variable 2 — Latency Requirements

Not all use cases have the same latency tolerance. Batch processing jobs (nightly report generation, document classification pipelines) can tolerate 5–30 seconds per request. Interactive applications (chatbots, voice AI, copilots) need sub-1-second first-token latency.

  • Groq: Fastest managed inference available — 500+ tokens/second on Llama 3 models. Best for latency-critical applications that can use open-weight models.
  • vLLM on A100/H100: Sub-400ms TTFT for 70B models with continuous batching. Best self-hosted option for latency-sensitive workloads.
  • Ollama: Simple to deploy, good for development and low-concurrency production. Not suitable for high-throughput applications.
  • Cloud APIs: Latency is variable and outside your control. GPT-4o averages 400–800ms TTFT but can spike significantly under load.

Decision Variable 3 — Compliance Constraints

If your data is subject to HIPAA, GDPR data residency requirements, or SOC 2 controls, your infrastructure choices narrow significantly:

  • HIPAA: Cloud APIs are viable only with a signed BAA — available from Azure OpenAI and AWS Bedrock, not from OpenAI's standard API. Air-gapped self-hosted eliminates the risk entirely.
  • GDPR data residency: Requires inference to occur within EU borders. AWS Bedrock EU regions and self-hosted in EU data centers are the cleanest options.
  • SOC 2: Cloud API providers with SOC 2 certification (most major ones) are generally acceptable. Document your vendor risk assessment.
  • Air-gapped (highest security): Only self-hosted on your own hardware qualifies. No managed inference option provides true air-gapping.

GPU Selection Guide

If you are going self-hosted, GPU selection determines your throughput ceiling and cost floor. The options that matter in 2026:

  • NVIDIA H100 80GB SXM: Best throughput for large models (70B+). NVLink for multi-GPU tensor parallelism. Expensive — justified at high volume.
  • NVIDIA A100 80GB: Slightly lower throughput than H100 but significantly cheaper. The current sweet spot for most production deployments.
  • NVIDIA A10G 24GB: Good for smaller models (7B–13B) and quantized 34B models. Available on AWS g5 instances. Cost-effective for medium-volume workloads.
  • NVIDIA RTX 4090 24GB: Consumer card, surprisingly capable for self-hosted deployments with quantized models. Not available in cloud — for on-premises hardware only.
  • AMD MI300X 192GB: Large memory footprint enables very large models without tensor parallelism. ROCm ecosystem is maturing; worth evaluating for new deployments.
Tip:For most businesses starting a self-hosted deployment, 1–2 A100 80GB instances on AWS (p4d family) with reserved pricing is the right starting point. You can run a 70B model with room to spare, and reserved pricing drops the effective cost 40–60% vs on-demand.

Serving Framework Comparison

  • vLLM: Best throughput for production via PagedAttention and continuous batching. OpenAI-compatible API. The default choice for serious production deployments.
  • Text Generation Inference (TGI): Hugging Face's serving framework. Good model compatibility, slightly behind vLLM on raw throughput benchmarks.
  • Ollama: Simplest setup, good for development and single-user production. No native batching — not suitable for concurrent user workloads.
  • LiteLLM proxy: Not an inference engine, but a unified API gateway that routes to any backend. Excellent for teams that want to switch between providers without changing application code.

The Decision Matrix

  • Under $3K/month API spend + no compliance constraints → Cloud API (OpenAI, Anthropic)
  • Under $3K/month API spend + latency critical → Groq or managed inference on open-weight models
  • Over $3K/month + team can run infra + no air-gap requirement → Self-hosted on cloud GPUs
  • HIPAA or air-gap required → Self-hosted on-premises only
  • Rapidly changing volume or uncertain trajectory → Start cloud API, instrument costs, migrate when crossover is clear

The most expensive mistake I see teams make is premature optimization: building self-hosted infrastructure before they have the volume to justify it. The second most expensive is the opposite: staying on cloud APIs at $20K/month of spend because migration feels complex. The decision framework above prevents both.

Working on something similar?

I take on 1–2 new projects per month. If you have a use case that needs this kind of engineering, tell me about it.

Get in touch