Cost Optimization
April 15, 2026 · 9 min read

How I Cut $18K/Month in OpenAI API Costs with a Self-Hosted LLM

Moneeb Abbas

AI Systems Architect

A client came to me spending $18,000 per month on OpenAI API calls — and growing. By the time I finished the engagement, their monthly inference cost was $400. The project paid for itself in 47 days. Here is exactly how we did it.

Why OpenAI API Costs Spiral in Production

It usually starts the same way: a prototype that uses GPT-4 because it is the easiest model to get working. You ship it, users love it, and then you look at your invoice three months later and realize you are spending more on tokens than on your entire engineering team.

In this client's case, they had three product surfaces all hitting the OpenAI API independently — a document summarizer, a Q&A chatbot, and a classification pipeline. None of them had been optimized for token efficiency. The summarizer was sending full 40-page documents in every prompt. The classifier was using GPT-4 Turbo for decisions that a much smaller model could handle.

Note: The first step in any cost-cutting project is a token audit, not model replacement. Understand what you are actually spending tokens on before you change anything.

Step 1 — Token Audit Across All Surfaces

Before touching a single line of code, I instrumented all three surfaces to log prompt tokens, completion tokens, model used, and endpoint. Two weeks of production data revealed the breakdown:

  • Document summarizer: 61% of total cost — sending raw PDFs without preprocessing
  • Q&A chatbot: 28% of total cost — context window bloat from poor retrieval
  • Classifier: 11% of total cost — GPT-4 doing the work of a small fine-tuned model
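
To make the audit concrete, here is a minimal sketch of how logged calls can be rolled up into a cost share per surface. It is an illustration, not the client's actual tooling: the `CallRecord` fields mirror the values logged above, while the model names and per-1K-token prices in `PRICES_PER_1K` are hypothetical placeholders that you would replace with your provider's current rates.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable

# Hypothetical prices in USD per 1K tokens; replace with real, current rates.
PRICES_PER_1K = {
    "model-large": {"prompt": 0.01, "completion": 0.03},
    "model-small": {"prompt": 0.0005, "completion": 0.0015},
}


@dataclass
class CallRecord:
    surface: str            # e.g. "summarizer", "chatbot", "classifier"
    model: str              # key into PRICES_PER_1K
    prompt_tokens: int
    completion_tokens: int


def call_cost(rec: CallRecord) -> float:
    """Cost in USD of a single logged call."""
    price = PRICES_PER_1K[rec.model]
    return (rec.prompt_tokens * price["prompt"]
            + rec.completion_tokens * price["completion"]) / 1000


def cost_share_by_surface(records: Iterable[CallRecord]) -> Dict[str, float]:
    """Fraction of total spend attributable to each surface."""
    totals: Dict[str, float] = defaultdict(float)
    for rec in records:
        totals[rec.surface] += call_cost(rec)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {surface: cost / grand_total for surface, cost in totals.items()}
```

Feeding two weeks of records through `cost_share_by_surface` gives the kind of percentage breakdown listed above, and grouping by `model` instead of `surface` shows where an expensive model is doing cheap work.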

Each surface had a different problem and needed a different solution. The summarizer needed better preprocessing before any model swap. The chatbot needed a proper RAG pipeline. The classifier was the easiest — it just needed the right model.

Step 2 — Choosing the Right Open-Weight Model

Not all open-weight models are equal for all tasks. The mistake most teams make is benchmarking against generic evals. What matters is how the model performs on your specific data distribution.

For this client's use case, legal document summarization and Q&A, I evaluated four models: Llama 3.1 70B, Mistral Large, Qwen 2.5 72B, and DeepSeek-V2. I ran them against 500 real production examples from their anonymized logs, scored by a combination of automated metrics and a small human review sample.

  • Llama 3.1 70B: Strong on summarization, slightly weaker on instruction-following for complex queries
  • Qwen 2.5 72B: Best overall for this domain, particularly strong at structured extraction
  • Mistral Large: Good baseline, but context window handling was inconsistent above 32K tokens
  • DeepSeek-V2: Excellent, but licensing restrictions ruled it out for this client
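
A rough sketch of what such a bake-off harness can look like follows. The post does not say which automated metrics were used, so token-overlap F1 stands in here purely as an assumption. Each candidate is modeled as a plain callable from prompt to text, which in practice would wrap a call to whichever serving endpoint you are testing.

```python
from collections import Counter
from statistics import mean
from typing import Callable, Dict, List, Tuple


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a model output and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def rank_models(
    models: Dict[str, Callable[[str], str]],
    examples: List[Tuple[str, str]],
) -> List[Tuple[str, float]]:
    """Average the metric per model over (prompt, reference) pairs, best first."""
    scores = {
        name: mean(token_f1(generate(prompt), reference)
                   for prompt, reference in examples)
        for name, generate in models.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

An automated score like this is only a first filter. The human review sample is what catches outputs that overlap well with the reference but are wrong in ways that matter for legal text.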

We went with Qwen 2.5 72B. For the classifier, we fine-tuned a much smaller model — Llama 3.2 3B — on the client's labeled classification data. The fine-tuned small model outperformed GPT-4 on precision for their specific categories.
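
For the classifier comparison, per-category precision is simple to compute from a labeled holdout set. The sketch below assumes you have gold labels plus predictions from both the baseline and the fine-tuned candidate. The function names are illustrative and not taken from the project.

```python
from collections import Counter
from typing import Dict, List, Sequence


def precision_per_class(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Precision per label: correct predictions of a label / all predictions of it."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    predicted = Counter(y_pred)
    correct = Counter(p for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / count for label, count in predicted.items()}


def categories_where_candidate_wins(
    y_true: Sequence[str],
    baseline_pred: Sequence[str],
    candidate_pred: Sequence[str],
) -> List[str]:
    """Labels on which the candidate's precision strictly beats the baseline's."""
    base = precision_per_class(y_true, baseline_pred)
    cand = precision_per_class(y_true, candidate_pred)
    return sorted(label for label in cand if cand[label] > base.get(label, 0.0))
```

Looking at precision per category rather than one aggregate number matters here, because a small fine-tuned model can win overall while still losing on a rare category that the business cares about.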

Step 3 — The Deployment Architecture

The technology stack we used to serve these models in production:

  • vLLM: Serving engine for the 72B model — PagedAttention and continuous batching made the throughput economics work
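  • 2x NVIDIA A100 80GB: Tensor-parallel inference for the 72B in BF16. The weights take roughly 144 GB of the 160 GB available, so the model fits, but headroom for the KV cache is limited and context length has to be budgeted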
  • AWS EC2 p4d.24xlarge: Reserved instance pricing dropped the effective GPU cost significantly
  • Quantized 3B classifier on a single A10G: The classification workload did not need a premium GPU
  • Redis queue: Decoupled request ingestion from model inference, which was critical for absorbing traffic spikes without having to scale GPUs instantly

Tip: vLLM's continuous batching is the single biggest lever for throughput efficiency in production. With the right batch settings, you can serve 5–10x more requests per GPU-hour than naive single-request inference.
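
Before committing to a GPU count, it helps to do a back-of-the-envelope memory budget. The sketch below estimates weight memory and how many tokens of KV cache fit in what is left. The architecture numbers used in the example (80 layers, 8 KV heads, head dimension 128) are my assumption about the 72B model's published configuration and should be checked against its `config.json`. `usable_fraction` is a hypothetical stand-in for whatever share of GPU memory your serving engine is allowed to claim. Activations and runtime overhead are ignored.

```python
def weight_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Weight memory in GB (10^9 bytes). BF16 stores 2 bytes per parameter."""
    return params_billion * bytes_per_param


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """KV-cache bytes needed per token; the leading 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_token_budget(total_gpu_gb: float, weights_gb: float,
                    usable_fraction: float, per_token_bytes: int) -> int:
    """Tokens of KV cache that fit after weights, summed across all GPUs."""
    free_bytes = (total_gpu_gb * usable_fraction - weights_gb) * 1e9
    return max(0, int(free_bytes // per_token_bytes))


# Example under the assumptions stated above: 72B parameters in BF16 on 2 x 80 GB.
weights = weight_gb(72)                          # GB of weights
per_token = kv_bytes_per_token(80, 8, 128)       # bytes of KV cache per token
budget = kv_token_budget(160, weights, 0.95, per_token)
print(f"weights: {weights:.0f} GB, KV budget: {budget} tokens")
```

Under these assumptions the token budget is very sensitive to the usable-memory fraction, so treat the result as a sanity check and measure on real hardware before fixing a maximum context length.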

Step 4 — Zero-Downtime Cutover

The migration used a shadow traffic pattern. For three weeks, every API request was sent to both the OpenAI endpoint and our new endpoint in parallel. Responses from the self-hosted model were logged but not returned to users. We monitored for divergence in output quality and latency.

When we were satisfied with parity, we flipped the router in stages: first 5% of traffic, then 25%, then 100%. The existing OpenAI API contracts were preserved: the new endpoint accepted the same request format and returned the same response structure, so no changes were required in the product code.
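
The shadow phase and the staged rollout can be sketched as one small router. This is a simplified illustration rather than the production implementation: `MigrationRouter`, `rollout_bucket`, and the callable backends are hypothetical names, and the shadow call runs inline here for readability, whereas a real system would run it in parallel, off the request path.

```python
import hashlib
from typing import Callable

Backend = Callable[[dict], dict]


def rollout_bucket(request_id: str) -> int:
    """Map a request ID to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


class MigrationRouter:
    def __init__(self, primary: Backend, candidate: Backend,
                 shadow_log: Callable[[str, dict, dict], None],
                 candidate_percent: int = 0, shadow: bool = True):
        self.primary = primary              # existing hosted API
        self.candidate = candidate          # new self-hosted endpoint
        self.shadow_log = shadow_log        # sink for side-by-side comparison
        self.candidate_percent = candidate_percent
        self.shadow = shadow

    def handle(self, request_id: str, request: dict) -> dict:
        # Rollout: a stable slice of traffic is served by the candidate.
        if rollout_bucket(request_id) < self.candidate_percent:
            return self.candidate(request)

        response = self.primary(request)
        if self.shadow:
            # Shadow: call the candidate too, log both, return only the primary.
            try:
                self.shadow_log(request_id, response, self.candidate(request))
            except Exception as exc:  # a shadow failure must never reach users
                self.shadow_log(request_id, response, {"error": repr(exc)})
        return response
```

Hashing the request ID (or a user ID) keeps each caller on the same side of the split as `candidate_percent` moves from 0 to 5, 25, and 100, which makes quality comparisons and rollbacks easier to reason about.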

The Cost Math

  • Before: $18,000/month OpenAI API (GPT-4 Turbo and GPT-4o)
  • After: ~$400/month in AWS compute (reserved instances, amortized)
  • Setup cost (engineering time plus hardware setup): recovered in 47 days
  • Annual saving: ~$211,200

The 47-day payback period was the best outcome I have delivered on a cost project. It is not always this fast — it depends heavily on volume. But at $18K/month, even a conservative estimate puts breakeven well inside three months.
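
The arithmetic behind these figures is easy to reproduce for your own numbers. The sketch below is a simple model that assumes a 30-day month and computes payback against net monthly savings. The setup cost is not stated in this post, so `implied_setup_cost` only back-solves what a given payback period would imply under those assumptions.

```python
def monthly_saving(before: float, after: float) -> float:
    """Net monthly saving in USD after the migration."""
    return before - after


def payback_days(setup_cost: float, before: float, after: float,
                 days_per_month: float = 30.0) -> float:
    """Days until cumulative savings cover the one-off setup cost."""
    return setup_cost / monthly_saving(before, after) * days_per_month


def implied_setup_cost(observed_payback_days: float, before: float, after: float,
                       days_per_month: float = 30.0) -> float:
    """Setup cost implied by an observed payback period (30-day month assumed)."""
    return monthly_saving(before, after) * observed_payback_days / days_per_month


# 12 months of net savings at the before/after figures quoted above.
annual_saving = 12 * monthly_saving(18_000, 400)
print(f"annual saving: ${annual_saving:,.0f}")
```

Plugging in a lower starting spend shows why volume matters so much: the setup cost stays roughly fixed while the monthly saving shrinks, so the payback period stretches accordingly.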

What to Watch Out For

  • Inference latency: In this deployment, the 72B model served with vLLM had higher time-to-first-token latency than GPT-4. We mitigated this with streaming and a client-side skeleton UI.
  • Model updates: OpenAI improves silently; you own the update cycle with self-hosted. Plan a quarterly review cadence.
  • Ops overhead: You now own the serving infrastructure. If your team cannot run GPU infrastructure, factor in DevOps cost.
  • Quality regression edge cases: The 500-example benchmark caught most issues, but monitor production quality for the first 30 days post-migration.

Self-hosting is not right for every team. If your inference volume is low, the fixed infrastructure cost will exceed API costs. The crossover point is roughly $3,000–$5,000/month of API spend — below that, OpenAI is almost always cheaper. Above that, the math increasingly favors self-hosted.
