Evals: The Unglamorous Key to AI Quality
How to build evaluation suites that actually catch problems - metrics that matter, test case design, and making evals part of your development workflow.
Your AI works great in demos and fails in production.
You don't have a model problem. You have an eval problem.
Why Traditional Testing Fails
# This test is useless
def test_ai_response():
    response = ai.ask("What is 2+2?")
    assert response == "4"
AI doesn't give identical outputs. It gives distributions.
"4", "The answer is 4", "Four", "2+2=4" - all correct. Test fails on three of them.
What Evals Actually Are
Evals measure whether AI behavior meets your standards across a distribution of cases.
Not: "Is this exact output correct?" But: "Does this output achieve the goal?"
# This eval works
def eval_math_response(response, expected_value):
    # Extract the numeric answer from any format
    extracted = extract_number(response)
    return extracted == expected_value
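extract_number does the heavy lifting here. A minimal sketch of one, assuming a regex for digits plus a small number-word table (both illustrative, not a complete parser):

```python
import re

# Illustrative number-word table; extend as needed.
_WORDS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
          "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}

def extract_number(response):
    """Pull the last numeric value out of a free-form response."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", response)
    if matches:
        return float(matches[-1])  # take the final number: "2+2=4" -> 4.0
    for word, value in _WORDS.items():
        if word in response.lower():
            return float(value)
    return None  # no number found
```

This accepts "4", "The answer is 4", "Four", and "2+2=4" alike, which is exactly the point: grade the extracted answer, not the surface string.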
The Eval Pyramid
Level 1: Component Evals
Test each piece in isolation.
- Does retrieval return relevant docs?
- Does the prompt elicit the right format?
- Does parsing handle edge cases?
Easy to debug. Run on every commit.
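A component eval can be tiny. For retrieval, recall@k against labeled relevant doc IDs is a good start (a sketch; retrieve and the test cases are stand-ins for your own retriever and corpus):

```python
def eval_retrieval_recall(retrieve, test_cases, k=5):
    """Fraction of labeled relevant docs that appear in the top-k results.

    retrieve(query) -> list of doc IDs; test_cases is a list of
    (query, relevant_doc_ids) pairs. Both are stand-ins for your stack.
    """
    hits, total = 0, 0
    for query, relevant_ids in test_cases:
        top_k = set(retrieve(query)[:k])
        hits += len(top_k & set(relevant_ids))
        total += len(relevant_ids)
    return hits / total if total else 1.0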
Level 2: Integration Evals
Test the full pipeline.
- Given user input, is final output acceptable?
- Do components work together?
- Is latency within bounds?
Slower. Run daily.
Level 3: Behavioral Evals
Test high-level properties.
- Is the AI consistent across paraphrases?
- Does it handle adversarial inputs?
- Does it refuse harmful requests?
Expensive. Run weekly.
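Paraphrase consistency, for example, can be scored by asking the same question several ways and checking agreement (a sketch; ask and agree are stand-ins, where agree might be exact match, extracted-value comparison, or an LLM judge):

```python
def eval_consistency(ask, paraphrases, agree):
    """Fraction of paraphrases whose answer agrees with the first one.

    ask(prompt) -> response; agree(a, b) -> bool. Both are stand-ins
    for your model call and your equivalence check.
    """
    answers = [ask(p) for p in paraphrases]
    baseline = answers[0]
    consistent = sum(1 for a in answers[1:] if agree(baseline, a))
    return consistent / (len(answers) - 1) if len(answers) > 1 else 1.0
```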
Metrics That Matter
Task Completion Rate
Did the user achieve their goal?
Not "was the response good?" but "did the user's problem get solved?"
This is the metric. Everything else supports it.
Factual Accuracy
When claims are made, are they true?
def eval_factual(response, ground_truth):
    claims = extract_claims(response)
    if not claims:
        return 1.0  # no claims made, nothing to get wrong
    correct = sum(1 for c in claims if verify(c, ground_truth))
    return correct / len(claims)
Ground truth is expensive to create. Do it anyway.
Hallucination Rate
How often does the AI make things up?
def eval_hallucination(response, context):
    claims = extract_claims(response)
    if not claims:
        return 0.0  # no claims, no hallucinations
    unsupported = sum(1 for c in claims if not in_context(c, context))
    return unsupported / len(claims)
Target: < 5% for production systems.
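in_context is the hard part. A crude but workable first pass is token overlap between the claim and the context (a sketch under that assumption; production systems usually use an NLI model or embedding similarity instead):

```python
def in_context(claim, context, threshold=0.7):
    """Crude support check: fraction of claim tokens present in the context.

    Token overlap is a rough proxy, an assumption for this sketch;
    swap in an NLI model or embedding similarity for real use.
    """
    claim_tokens = set(claim.lower().split())
    context_tokens = set(context.lower().split())
    if not claim_tokens:
        return True
    overlap = len(claim_tokens & context_tokens) / len(claim_tokens)
    return overlap >= threshold
```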
Latency Distribution
Not average. Distribution.
P50 latency tells you the median experience. P99 latency tells you the worst experience. Max latency tells you if something's broken.
P50: 1.2s ✅
P95: 2.8s ⚠️
P99: 8.4s ❌
Fix the tail before optimizing the median.
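Getting the tail out of raw latency samples takes a few lines of standard library (statistics.quantiles with n=100 yields the 99 percentile cut points P1 through P99):

```python
import statistics

def latency_report(samples_ms):
    """Return P50/P95/P99/max from raw latency samples in milliseconds."""
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 cut points: P1..P99
    return {
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
        "max": max(samples_ms),
    }
```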
Building Good Test Sets
Diversity Matters
Cover the space of inputs users actually send.
- Common cases (80% of volume)
- Edge cases (where failure hurts most)
- Adversarial cases (where failure is dangerous)
- Random samples (where you're blind)
Size Matters
100 test cases give you roughly ±10% precision. 1,000 test cases give you ±3%. 10,000 test cases give you ±1%.
What precision do you need?
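Those ±N% figures come from the binomial standard error. You can compute the margin for your own suite size directly (a sketch using the normal approximation at 95% confidence):

```python
import math

def margin_of_error(n_cases, pass_rate=0.5, z=1.96):
    """95% confidence margin for a measured pass rate over n_cases.

    pass_rate=0.5 is the worst case (widest interval); z=1.96 is the
    normal-approximation critical value for 95% confidence.
    """
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n_cases)
```

At 100 cases the worst-case margin is about ±9.8%; at 10,000 it shrinks to about ±1%.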
Freshness Matters
User behavior changes. Test sets get stale.
Continuously add real failures to your test set.
The Eval Workflow
1. Code change
2. Fast evals (< 5 min) - block merge if fail
3. Merge to main
4. Full evals (< 1 hour) - alert if degradation
5. Production monitoring - trigger investigation if drift
Evals gate deployment. Not vibes. Not code review. Evals.
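The gate itself can be a small script in CI: compare each eval score to a threshold and exit nonzero to block the merge. A sketch, with illustrative eval names and thresholds (your runner supplies the real scores):

```python
import sys

def gate(results, thresholds):
    """Return the evals whose score fell below their threshold.

    results and thresholds map eval name -> score. The names here
    are illustrative, not a real suite.
    """
    return {name: score for name, score in results.items()
            if score < thresholds.get(name, 0.0)}

def main():
    # Illustrative scores; in CI these come from your eval runner.
    results = {"task_completion": 0.91, "hallucination_free": 0.93}
    thresholds = {"task_completion": 0.90, "hallucination_free": 0.95}
    failures = gate(results, thresholds)
    for name, score in failures.items():
        print(f"FAIL {name}: score {score:.2f} below {thresholds[name]:.2f}")
    sys.exit(1 if failures else 0)  # nonzero exit blocks the merge
```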
When Evals Lie
Teaching to the Test
Model scores well on evals but fails in production.
Your test set doesn't match production distribution.
Fix: Sample from production. Regularly.
Metric Gaming
Numbers go up but quality goes down.
You're measuring the wrong thing.
Fix: Multiple metrics. Human review. Task completion.
Overfitting
You've tuned the prompt to ace specific tests.
New inputs still fail.
Fix: Held-out test sets. Never optimize on the final eval set.
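A simple discipline makes this mechanical: split once with a fixed seed, iterate only on the dev slice, and score the held-out slice only when you are done (a sketch):

```python
import random

def split_eval_set(cases, holdout_fraction=0.2, seed=42):
    """Deterministically split cases into (dev, holdout).

    Tune prompts against dev; touch holdout only for final scoring.
    The fixed seed keeps the split stable across runs.
    """
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]
```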
Automation vs Human Judgment
Some things humans do better:
- "Is this response helpful?"
- "Does this feel right?"
- "Would a user trust this?"
Some things automation does better:
- "Is this factually correct?"
- "Does this contain PII?"
- "Is this fast enough?"
Use both. Weight by what matters.
Our Eval Stack
At Kingly:
- Unit evals: Every PR, blocking
- Integration evals: Nightly, alerting
- Human review: Weekly sample
- Production monitoring: Real-time
Takes work to build. Saves more work in debugging.
Shipping AI without evals is like deploying code without tests. It works until it doesn't. And when it doesn't, you have no idea why.
Further Reading
Related Posts
- RAG Reality Check - Debugging retrieval
- Building AI Products That Ship - Quality at speed