Reliability · LLMs
May 10, 2026 · 10 min read

Why LLMs Hallucinate in Production and What You Can Actually Do About It

Moneeb Abbas

AI Systems Architect

Hallucination is not a bug you can patch. It is a fundamental property of how language models work. What you can do is design systems that detect it, constrain it, and ensure that wrong answers are caught before they reach users or downstream processes. This post covers the techniques that actually work in production — not just the demos.

Why Models Hallucinate

A language model generates text by predicting the next token based on patterns in training data and the current context. It has no built-in concept of truth or a mechanism to verify claims against an authoritative source. When asked about something it has weak signal on — a specific date, a niche legal clause, a recent event — it extrapolates from adjacent patterns and produces a plausible-sounding answer that may be entirely fabricated.

There are three distinct causes of hallucination that require different mitigations:

  1. Knowledge gaps: The model simply does not have reliable information about the topic, because it was not in the training data, because the training data was sparse, or because the information has changed since the cutoff date.
  2. Context neglect: The model is given the correct information in the prompt but ignores or contradicts it in the response. This is a particularly dangerous failure mode in RAG systems.
  3. Overconfident generation: The model produces confident-sounding text even when its internal 'uncertainty' is high. High-temperature sampling makes this worse, because the model makes riskier, less grounded token choices.
Warning: Context neglect is more dangerous than knowledge gaps in grounded systems. You assume your RAG pipeline means the model has the right information, but a sloppy prompt or a long context window can cause the model to override the retrieved context with its parametric knowledge.

Detection: You Cannot Fix What You Cannot See

Before reducing hallucination, instrument your system to measure it. The best production approach uses a combination of automated checks and a human-reviewed sample.

  • Groundedness scoring: For RAG systems, use a second LLM call, a dedicated evaluation model such as HHEM-2.1, or an evaluation framework such as TruLens to score whether the response is supported by the retrieved context. Flag responses below a threshold for review.
  • Citation verification: If your system produces citations, verify each one programmatically — fetch the source document and confirm the claim is actually present.
  • Factual consistency check: For structured outputs (dates, names, numbers), validate against the source documents using regex or structured extraction before returning the response.
  • Human review sample: Route 1–5% of production queries to a human reviewer. This is the only way to catch failure modes that automated checks miss.
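python
# Simple groundedness check using an LLM judge.
# `llm` stands in for your own client wrapper; `llm.complete` is assumed to
# return the model's raw text output.
import json

def check_groundedness(context: str, response: str, model: str) -> float:
    """Return a 0-1 score for how well the response is supported by the context."""
    prompt = f"""
Rate whether the following response is fully supported by the context.
Context: {context}
Response: {response}

Return a JSON object: {{"score": <0-1>, "reason": "<brief explanation>"}}
Respond with JSON only.
"""
    result = llm.complete(prompt, model=model, temperature=0)
    try:
        score = float(json.loads(result)["score"])
    except (ValueError, KeyError, TypeError):
        return 0.0  # Unparseable judge output: treat as ungrounded so it gets reviewed
    return min(max(score, 0.0), 1.0)  # Flag if score < 0.7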
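Two of the other checks in the list above, the factual consistency check and the human review sample, need no model call at all. The sketch below is one possible shape for them, not a complete implementation: the regex only covers ISO-style dates and plain numbers, the function names are hypothetical, and the 2% default is just one point in the 1–5% range mentioned above.

```python
import hashlib
import re

# Matches ISO dates (2024-03-01) and plain numbers (1,200 or 3.5).
# Assumption: extend this pattern for the formats your documents actually use.
_FACT_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}|\d+(?:,\d{3})*(?:\.\d+)?")


def unsupported_facts(response: str, source: str) -> list[str]:
    """Return dates and numbers in the response that never appear in the source."""
    source_facts = set(_FACT_PATTERN.findall(source))
    return [fact for fact in _FACT_PATTERN.findall(response) if fact not in source_facts]


def needs_human_review(query_id: str, sample_rate: float = 0.02) -> bool:
    """Deterministically select roughly `sample_rate` of queries for review."""
    digest = hashlib.sha256(query_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate
```

Hashing the query ID rather than calling a random number generator means the same query is always in or out of the sample, which makes review decisions reproducible across retries and services.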

Mitigation Strategy 1 — Grounding with Retrieval

The most effective single intervention against hallucination is giving the model the correct information in the prompt and instructing it to use only that information. This is the foundation of RAG, but the details matter.

  • Explicit constraint instruction: Tell the model in the system prompt that it must answer only from the provided context and must say 'I cannot find this in the provided documents' when the answer is not there.
  • Context position matters: Place retrieved context close to the question, not at the top of a long system prompt. Models pay more attention to content near the query.
  • Limit context to what is relevant: More context is not always better. Irrelevant chunks give the model more material to misuse. A well-ranked top-3 beats a top-20 with noise.
  • Verify retrieval before trusting it: If retrieval fails silently (no relevant chunks returned), the model will still attempt an answer from parametric memory. Detect empty or low-scoring retrieval and return a 'no information found' response instead.
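The points above can be combined into a single prompt-building step. The following is a minimal sketch under stated assumptions: `Chunk`, `build_grounded_prompt`, and the `min_score` cutoff of 0.5 are hypothetical, and the score is assumed to be a similarity value where higher means more relevant. Tune the cutoff against your own retriever.

```python
from dataclasses import dataclass
from typing import Optional

NO_ANSWER = "I cannot find this in the provided documents."


@dataclass
class Chunk:
    text: str
    score: float  # retriever similarity score; higher means more relevant


def build_grounded_prompt(
    question: str,
    chunks: list[Chunk],
    min_score: float = 0.5,
    top_k: int = 3,
) -> Optional[str]:
    """Return a grounded prompt, or None when retrieval found nothing usable."""
    # Drop low-scoring chunks, then keep only the best few.
    relevant = sorted(
        (c for c in chunks if c.score >= min_score),
        key=lambda c: c.score,
        reverse=True,
    )[:top_k]
    if not relevant:
        return None  # caller should return NO_ANSWER without calling the model

    context = "\n\n".join(f"[{i}] {c.text}" for i, c in enumerate(relevant, start=1))
    # The context sits directly above the question rather than at the top
    # of a long system prompt.
    return (
        "Answer using only the context below. "
        f"If the answer is not there, reply exactly: {NO_ANSWER}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

When the function returns `None`, the calling code can respond with `NO_ANSWER` directly, so a silent retrieval failure never reaches the model.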

Mitigation Strategy 2 — Constrained Output Formats

Unstructured free-text outputs give models the most latitude to hallucinate. Structured outputs — JSON schemas, defined fields, explicit 'not found' values — force models into a narrower generation space where hallucination is easier to detect and less likely to pass unnoticed.

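python
from typing import Optional

from openai import OpenAI
from pydantic import BaseModel

class ContractSummary(BaseModel):
    effective_date: Optional[str]  # None = not found in document
    parties: list[str]
    termination_clause: Optional[str]
    governing_law: Optional[str]
    confidence: float  # Model-reported confidence 0-1

client = OpenAI()  # Reads OPENAI_API_KEY from the environment

# `prompt` is your extraction prompt, including the contract text.
# Passing a Pydantic class as response_format requires the SDK's parse() helper;
# plain create() expects a JSON schema dict instead. Newer SDK releases also
# expose the helper as client.chat.completions.parse. With the instructor
# library, pass response_model=ContractSummary to its patched create().
response = client.beta.chat.completions.parse(
    model="gpt-4o",
    response_format=ContractSummary,
    messages=[{"role": "user", "content": prompt}],
)
summary = response.choices[0].message.parsed  # A ContractSummary instance
# None fields are explicit: the model said "not found"
# vs hallucinating a plausible-but-wrong value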

Mitigation Strategy 3 — Temperature and Sampling

Temperature controls how much randomness is introduced into token selection. High temperature (above 0.7) makes the model more creative and more likely to hallucinate. For factual, grounded tasks, use temperature 0 or near 0.

For tasks requiring some creativity — summarization, drafting — a temperature of 0.3–0.5 is a reasonable middle ground. The key is to treat temperature as a parameter you tune empirically on your specific task, not a default you leave at 1.0.
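To see why lower temperatures favor the most likely token, it helps to look at what temperature does to the next-token distribution. The sketch below uses made-up logits for three candidate tokens and a hypothetical helper name; it illustrates the standard temperature-scaled softmax, not any particular provider's sampler.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        # Temperature 0 is typically handled as greedy (argmax) decoding,
        # not as a division by zero.
        raise ValueError("temperature must be positive")
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative logits: one well-supported token and two weaker alternatives.
logits = [4.0, 2.0, 1.0]
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature rises, probability mass moves from the top token to the weaker alternatives, which is exactly where less grounded completions come from.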

Mitigation Strategy 4 — Uncertainty Acknowledgment

Prompt the model to express uncertainty explicitly rather than guessing confidently. A model that says 'I am not certain about this' is more useful than one that confidently states a wrong fact.

  • Add to your system prompt: 'If you are not certain of a fact, say so explicitly rather than guessing.'
  • For high-stakes outputs, require a self-consistency check: sample the response two or more times at a nonzero temperature (at temperature 0 the samples will be nearly identical) and flag the output if the samples contradict each other.
  • Use model-reported logprobs (where available) to detect low-confidence token sequences — these correlate with hallucination risk.
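The last two checks can be approximated with a few lines of post-processing. This is a rough sketch with hypothetical function names: the agreement test only works for short factual answers (longer responses need an LLM judge or an entailment model to compare them), the samples are assumed to come from separate calls at a nonzero temperature, and the 0.5 probability cutoff is an assumption to tune on your own data.

```python
import math
import re


def _normalize(answer: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", "", answer.lower())
    return " ".join(cleaned.split())


def answers_agree(samples: list[str]) -> bool:
    """True when every sampled answer is the same after normalization."""
    return len({_normalize(s) for s in samples}) == 1


def low_confidence_tokens(
    tokens: list[str], logprobs: list[float], min_prob: float = 0.5
) -> list[str]:
    """Return tokens whose probability, exp(logprob), falls below `min_prob`."""
    return [tok for tok, lp in zip(tokens, logprobs) if math.exp(lp) < min_prob]
```

A low-probability token inside a date, name, or number is a stronger warning sign than one inside filler text, so it is worth looking at which tokens are flagged rather than only counting them.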

What Does Not Work

  • Prompting the model to 'never hallucinate': Hallucination is not disobedience. The model cannot introspect whether it is hallucinating.
  • Switching to a bigger model: Larger models hallucinate differently, not necessarily less. A model like GPT-4 can produce more plausible-sounding hallucinations than smaller models, and those can be harder to catch.
  • One-time testing: Hallucination patterns change with context, query distribution, and model updates. Treat it as ongoing monitoring, not a pre-launch checklist.
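Ongoing monitoring can start as something as small as a rolling window over groundedness scores. The class below is a minimal sketch, assuming scores in the 0–1 range like those from the judge shown earlier; the class name, window size, and alert rate are hypothetical defaults, and the 0.7 threshold mirrors the one used in the detection example.

```python
from collections import deque


class GroundednessMonitor:
    """Track the share of recent responses that fall below a score threshold."""

    def __init__(self, window: int = 500, threshold: float = 0.7,
                 max_flag_rate: float = 0.05) -> None:
        self.threshold = threshold
        self.max_flag_rate = max_flag_rate
        self._flags: deque = deque(maxlen=window)  # oldest entries drop off

    def record(self, score: float) -> None:
        """Store whether this response was flagged as poorly grounded."""
        self._flags.append(score < self.threshold)

    @property
    def flag_rate(self) -> float:
        return sum(self._flags) / len(self._flags) if self._flags else 0.0

    def should_alert(self) -> bool:
        """Alert only once the window is full, to avoid noisy early readings."""
        window_full = len(self._flags) == self._flags.maxlen
        return window_full and self.flag_rate > self.max_flag_rate
```

Comparing the flag rate before and after a model update or a prompt change gives an early signal that hallucination patterns have shifted.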
Note: The goal is not zero hallucination, which is not achievable with current models. The goal is a system where hallucinations are detected before they cause harm, and where users have appropriate signals about output reliability.
