
Evals: The Unglamorous Key to AI Quality

January 26, 2026
#evaluation #testing #quality-assurance #metrics #mlops

How to build evaluation suites that actually catch problems - metrics that matter, test case design, and making evals part of your development workflow.

Your AI works great in demos and fails in production.

You don't have a model problem. You have an eval problem.

Why Traditional Testing Fails#

# This test is useless
def test_ai_response():
    response = ai.ask("What is 2+2?")
    assert response == "4"

AI doesn't give identical outputs. It gives distributions.

"4", "The answer is 4", "Four", "2+2=4" - all correct. Test fails on three of them.

What Evals Actually Are#

Evals measure whether AI behavior meets your standards across a distribution of cases.

Not: "Is this exact output correct?" But: "Does this output achieve the goal?"

# This eval works
def eval_math_response(response, expected_value):
    # Extract numeric answer from any format
    extracted = extract_number(response)
    return extracted == expected_value
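A minimal `extract_number` can be a regex sketch. This helper is an assumption, not a fixed API: it takes the last number in the string (so "2+2=4" resolves to 4, not 2) and falls back to a small word map for spelled-out answers.

```python
import re

# Hypothetical helper for eval_math_response: pull the numeric answer out of
# free-form text. The word map covers only small numbers; extend as needed.
_WORDS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
          "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}

def extract_number(response):
    # Take the LAST number so "2+2=4" resolves to 4, not 2.
    matches = re.findall(r"-?\d+(?:\.\d+)?", response)
    if matches:
        num = float(matches[-1])
        return int(num) if num.is_integer() else num
    for word, value in _WORDS.items():
        if word in response.lower():
            return value
    return None  # no numeric answer found
```

All four phrasings from above now pass: "4", "The answer is 4", "Four", and "2+2=4" each extract to 4.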

The Eval Pyramid#

Level 1: Component Evals#

Test each piece in isolation.

  • Does retrieval return relevant docs?
  • Does the prompt elicit the right format?
  • Does parsing handle edge cases?

Easy to debug. Run on every commit.
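A retrieval component eval, for instance, might check whether a labeled relevant document shows up in the top k results. Everything here is a placeholder sketch: `retrieve` stands in for your retriever (returning ranked doc ids), and `labeled_cases` pairs each query with a human-labeled relevant id.

```python
# Hypothetical component eval: does retrieval surface the labeled relevant
# doc in the top k? `retrieve` and the labeled cases are placeholders for
# your own system.
def eval_retrieval_recall(retrieve, labeled_cases, k=5):
    hits = 0
    for query, relevant_doc_id in labeled_cases:
        if relevant_doc_id in retrieve(query, k=k):
            hits += 1
    return hits / len(labeled_cases)
```

A score like this runs in seconds, which is what makes it viable on every commit.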

Level 2: Integration Evals#

Test the full pipeline.

  • Given user input, is final output acceptable?
  • Do components work together?
  • Is latency within bounds?

Slower. Run daily.

Level 3: Behavioral Evals#

Test high-level properties.

  • Is the AI consistent across paraphrases?
  • Does it handle adversarial inputs?
  • Does it refuse harmful requests?

Expensive. Run weekly.

Metrics That Matter#

Task Completion Rate#

Did the user achieve their goal?

Not "was the response good?" but "did the user's problem get solved?"

This is the metric. Everything else supports it.

Factual Accuracy#

When claims are made, are they true?

def eval_factual(response, ground_truth):
    # extract_claims and verify are your own pipeline steps
    # (often another model call); they're placeholders here.
    claims = extract_claims(response)
    if not claims:
        return 1.0  # no claims made, nothing to get wrong
    correct = sum(1 for c in claims if verify(c, ground_truth))
    return correct / len(claims)

Ground truth is expensive to create. Do it anyway.

Hallucination Rate#

How often does the AI make things up?

def eval_hallucination(response, context):
    # A claim counts as hallucinated if the retrieved context doesn't support it.
    claims = extract_claims(response)
    if not claims:
        return 0.0  # nothing asserted, nothing hallucinated
    unsupported = sum(1 for c in claims if not in_context(c, context))
    return unsupported / len(claims)

Target: < 5% for production systems.

Latency Distribution#

Not average. Distribution.

P50 latency tells you the median experience. P99 latency tells you the worst experience. Max latency tells you if something's broken.

P50: 1.2s ✅
P95: 2.8s ⚠️  
P99: 8.4s ❌

Fix the tail before optimizing the median.
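Computing these percentiles from raw request timings takes a few lines with the standard library. The latency samples below are made up; in practice you'd pull them from your request logs.

```python
import statistics

# Made-up latency samples in seconds; in practice, pull from request logs.
latencies = [0.9, 1.1, 1.2, 1.3, 1.5, 2.0, 2.6, 2.8, 3.1, 8.4]

# quantiles with n=100 yields 99 cut points; index 49 is P50, 94 is P95,
# 98 is P99 (default 'exclusive' method interpolates between samples).
q = statistics.quantiles(latencies, n=100)
p50, p95, p99 = q[49], q[94], q[98]
print(f"P50={p50:.1f}s  P95={p95:.1f}s  P99={p99:.1f}s  max={max(latencies):.1f}s")
```

Note how the single 8.4s outlier drags P99 far from the median; that one sample is where the tail-fixing work lives.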


Building Good Test Sets#

Diversity Matters#

Cover the space of inputs users actually send.

  • Common cases (80% of volume)
  • Edge cases (where failure hurts most)
  • Adversarial cases (where failure is dangerous)
  • Random samples (where you're blind)

Size Matters#

100 test cases give you ±10% confidence. 1000 test cases give you ±3% confidence. 10000 test cases give you ±1% confidence.

What precision do you need?
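Those margins follow from the binomial standard error: for a pass rate near 50% (the worst case), a 95% interval is roughly ±1/√n. A quick sketch:

```python
import math

# 95% margin of error for a measured pass rate p on n test cases
# (normal approximation to the binomial; worst case is p = 0.5).
def margin_of_error(n, p=0.5, z=1.96):
    return z * math.sqrt(p * (1 - p) / n)

for n in (100, 1000, 10000):
    print(f"n={n:>5}: ±{margin_of_error(n):.1%}")
```

That's where ±10%, ±3%, and ±1% come from: each 10x in test cases buys you roughly 3x tighter confidence.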

Freshness Matters#

User behavior changes. Test sets get stale.

Continuously add real failures to your test set.
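One lightweight way to do that is to append each production failure to a JSONL eval file as it's triaged. The file path and record shape here are illustrative, not a standard format.

```python
import json
from datetime import date

# Append a production failure to the eval set as a JSONL record.
# Path and field names are illustrative assumptions, not a standard.
def log_failure(case_file, user_input, bad_output, note=""):
    record = {
        "input": user_input,
        "bad_output": bad_output,
        "note": note,
        "added": date.today().isoformat(),
    }
    with open(case_file, "a") as f:
        f.write(json.dumps(record) + "\n")
```

The date field matters: it lets you age out stale cases later and see how fast your failure distribution shifts.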

The Eval Workflow#

1. Code change
2. Fast evals (< 5 min): block merge on failure
3. Merge to main
4. Full evals (< 1 hour): alert on degradation
5. Production monitoring: trigger investigation on drift

Evals gate deployment. Not vibes. Not code review. Evals.
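That gate can be sketched as a small CI script. The threshold and the eval runner below are placeholders; in a real pipeline `run_fast_evals` would invoke your component eval suite.

```python
import sys

# Hypothetical CI gate: run the fast eval suite, fail the build on regression.
FAST_EVAL_THRESHOLD = 0.95  # placeholder pass-rate bar

def run_fast_evals():
    # Placeholder: run your component evals and return a pass rate.
    results = [True, True, True, False]  # stand-in results
    return sum(results) / len(results)

def main():
    pass_rate = run_fast_evals()
    if pass_rate < FAST_EVAL_THRESHOLD:
        print(f"Fast evals failed: {pass_rate:.0%} < {FAST_EVAL_THRESHOLD:.0%}")
        sys.exit(1)  # non-zero exit blocks the merge
    print(f"Fast evals passed: {pass_rate:.0%}")

if __name__ == "__main__":
    main()
```

The only contract that matters is the exit code: CI systems block on non-zero, which is what turns an eval score into a deployment gate.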

When Evals Lie#

Teaching to the Test#

Model scores well on evals but fails in production.

Your test set doesn't match production distribution.

Fix: Sample from production. Regularly.

Metric Gaming#

Numbers go up but quality goes down.

You're measuring the wrong thing.

Fix: Multiple metrics. Human review. Task completion.

Overfitting#

You've tuned the prompt to ace specific tests.

New inputs still fail.

Fix: Held-out test sets. Never optimize on the final eval set.
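Keeping a held-out set can be as simple as a seeded split: iterate on the dev portion, score the held-out portion, and never look at its individual cases. The function below is a sketch under that assumption.

```python
import random

# Split eval cases into a dev set you iterate on and a held-out set you
# only score, never tune against. A fixed seed keeps the split stable
# across runs so the held-out set stays held out.
def split_cases(cases, holdout_frac=0.2, seed=42):
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_frac))
    return shuffled[:cut], shuffled[cut:]
```

Using a private `random.Random(seed)` instance (rather than reseeding the global generator) keeps the split reproducible without side effects on other code.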

Automation vs Human Judgment#

Some things humans do better:

  • "Is this response helpful?"
  • "Does this feel right?"
  • "Would a user trust this?"

Some things automation does better:

  • "Is this factually correct?"
  • "Does this contain PII?"
  • "Is this fast enough?"

Use both. Weight by what matters.

Our Eval Stack#

At Kingly:

  • Unit evals: Every PR, blocking
  • Integration evals: Nightly, alerting
  • Human review: Weekly sample
  • Production monitoring: Real-time

Takes work to build. Saves more work in debugging.



Shipping AI without evals is like deploying code without tests. It works until it doesn't. And when it doesn't, you have no idea why.
