Why Unit Tests Are Not Enough
Traditional software testing checks that code produces a specific output given a specific input. LLM outputs are non-deterministic and cannot be compared against a fixed expected string. A response that is factually correct, helpfully phrased, and format-compliant is a good response — even if it is worded differently every time it is generated.
LLM evaluation requires a different mental model: instead of checking exact outputs, you check properties of outputs. Is the response grounded in the context? Does it answer the question that was asked? Is the format correct? Does it avoid forbidden content? Each of these is a separate evaluator.
Step 1 — Build a Golden Dataset
A golden dataset is a curated set of inputs with human-verified expected outputs or quality criteria. It is the foundation of every other evaluation step. Without it, you are flying blind.
- Size: 50–200 examples is sufficient for most applications. More is better, but 50 well-chosen examples beats 500 random ones.
- Composition: Include easy cases (the happy path), edge cases (unusual queries, ambiguous questions), and adversarial cases (attempts to jailbreak, off-topic requests, inputs that previously caused failures).
- Labels: Each example should have either a reference answer (for comparison) or a set of quality criteria checklist (for rubric-based evaluation). Both are useful; pick based on your use case.
- Source: Pull from real user queries if you have them. If launching fresh, write examples that reflect what you expect users to actually ask.
Step 2 — Automated Evaluators
Run your golden dataset through the system and score each output automatically. The evaluators that matter most depend on your use case, but these cover the majority of production LLM applications:
- Groundedness: Is every claim in the response supported by the provided context? Use a dedicated evaluator model or LLM-as-judge prompt. Critical for RAG systems.
- Answer relevance: Does the response actually answer the question asked? A response can be factually correct but fail to address the query.
- Format compliance: Does the output match the required structure — JSON schema, specific fields, length constraints? This is fully automatable with schema validation.
- Faithfulness to instructions: Does the response respect the constraints in the system prompt — persona, forbidden topics, required disclaimers?
- Toxicity and safety: For consumer-facing applications, run outputs through a content classifier.
Step 3 — LLM-as-Judge
For quality dimensions that are hard to measure algorithmically — clarity, helpfulness, completeness — a second LLM can serve as a judge. The approach: prompt a capable model (GPT-4o or Claude Sonnet) to rate the response on a rubric, and use the numeric score as your metric.
JUDGE_PROMPT = """
You are evaluating the quality of an AI assistant response.
Question: {question}
Context provided to the assistant: {context}
Assistant response: {response}
Rate the response on the following dimensions (1-5 each):
1. Relevance: Does it answer the question asked?
2. Groundedness: Is every claim supported by the context?
3. Completeness: Does it address all parts of the question?
4. Clarity: Is it clearly written and easy to understand?
Return JSON: {{"relevance": int, "groundedness": int, "completeness": int, "clarity": int, "explanation": str}}
"""
def judge_response(question, context, response):
result = judge_llm.complete(
JUDGE_PROMPT.format(
question=question, context=context, response=response
),
response_format={"type": "json_object"},
temperature=0,
)
return json.loads(result)LLM-as-judge has a known bias toward verbose, confident-sounding responses. Mitigate this by including a counter-argument in the judge prompt ('A response should not receive a high score simply for being long') and by calibrating the judge against a set of human-scored examples.
Step 4 — Regression Testing on Model Updates
LLM providers update their models without warning. An OpenAI model update, a change in your prompt, or a new version of your embedding model can silently degrade quality. Regression testing catches this before users notice.
- Run the full golden dataset eval on every code change that touches prompts, retrieval logic, or model configuration.
- Track metric history over time — plot groundedness score, relevance score, and format compliance rate per deployment.
- Set a threshold: if any core metric drops more than X% from the previous run, block the deployment and investigate.
- A/B test model changes: run old and new configurations against the same golden dataset in parallel and compare scores before switching.
Step 5 — Production Monitoring
Pre-launch evaluation tells you the system works on your golden dataset. Production monitoring tells you it is still working on real queries after launch. The two are complementary.
- Sample and score: Route 1–5% of production queries through your evaluator pipeline. Log scores to a dashboard. Alert on threshold violations.
- User feedback signals: Thumbs up/down, regeneration requests, and conversation abandonment are weak signals of quality. Aggregate them and correlate with evaluator scores to calibrate your automated metrics.
- Failure queue: When a response scores below a threshold, log it to a review queue. A human reviews it weekly and adds genuine failures to the golden dataset.
- Tools: LangSmith, Braintrust, and Weights & Biases all support LLM evaluation pipelines. For simpler setups, a Postgres table and a Grafana dashboard is sufficient.