
LLM Evaluation Cheat Sheet

Testing LLM applications differs from traditional software testing. Outputs are non-deterministic, quality is subjective, and no single “correct” answer exists for most tasks. Evaluation bridges this gap with grading strategies designed for probabilistic systems.

| I need to… | Approach | Tool / Method |
| --- | --- | --- |
| Check exact output format | Deterministic assertion | String match, regex, JSON schema |
| Grade subjective quality | LLM-as-judge | Rubric prompt with stronger model |
| Measure factual accuracy | Golden set + scoring | Exact match, F1, ROUGE-L |
| Detect regressions from prompt edits | Regression eval suite | promptfoo / Braintrust experiment diff |
| Compare two models or prompts | Side-by-side eval | promptfoo matrix view, Braintrust experiments |
| Measure consistency | Cosine similarity | Sentence embeddings across paraphrased inputs |
| Test safety and toxicity | Red-teaming + classification | promptfoo red-team, LLM binary classifier |
| Monitor production quality | Online eval | Braintrust online scoring, sampled human review |
| Run evals in CI | GitHub Actions integration | promptfoo `--fail-on-error`, threshold scripts |
| Control eval costs | Sampling + caching | Subset runs, prompt caching, smaller judge models |

Traditional tests assert deterministic outcomes: `f(x) == y`. LLM evaluation handles three problems traditional testing avoids:

| Problem | Traditional test | LLM eval |
| --- | --- | --- |
| Non-determinism | Same input, same output | Same input, different outputs each run |
| Subjective quality | Binary pass/fail | Graded on rubrics, similarity, human judgement |
| No ground truth | Known correct answer | Multiple valid answers; “best” is contextual |
| Prompt sensitivity | Code is the specification | Small wording changes cause large output swings |

This means evaluation must combine deterministic checks (where possible), statistical scoring (for quality), and human review (for edge cases).
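That layering can be sketched as a single entry point: run the cheap deterministic check first, and fall back to statistical scoring only when code cannot decide. This is a minimal sketch; `llm_judge_score` is a stubbed placeholder for a real LLM-as-judge call, and the 0.7 threshold is illustrative.

```python
import json
from typing import Optional

def deterministic_check(output: str) -> Optional[bool]:
    """Return True/False when code can decide, None when quality needs a judge."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False  # malformed output is an automatic fail
    if "answer" not in data:
        return False  # missing required field
    return None  # structurally valid; subjective quality still unknown

def llm_judge_score(output: str) -> float:
    """Stand-in for an LLM-as-judge call (stubbed for illustration)."""
    return 0.8

def evaluate(output: str, threshold: float = 0.7) -> bool:
    verdict = deterministic_check(output)  # cheap deterministic layer first
    if verdict is not None:
        return verdict
    return llm_judge_score(output) >= threshold  # statistical layer second
```

Human review then covers only the cases where neither layer is trustworthy.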

Use when output format is constrained and verifiable by code. Fastest and most reliable.

```python
import json

# Exact match
def eval_exact(output: str, expected: str) -> bool:
    return output.strip().lower() == expected.strip().lower()

# Contains required content
def eval_contains(output: str, required: list[str]) -> bool:
    return all(phrase in output for phrase in required)

# JSON structure validation
def eval_json_schema(output: str) -> bool:
    try:
        data = json.loads(output)
        return "answer" in data and "confidence" in data
    except json.JSONDecodeError:
        return False
```

Promptfoo YAML equivalents:

```yaml
assert:
  - type: equals
    value: "expected output"
  - type: contains-all
    value: ["required phrase", "another phrase"]
  - type: is-json
  - type: regex
    value: '^\d{4}-\d{2}-\d{2}$'
  - type: cost
    threshold: 0.01
  - type: latency
    threshold: 500
```

Use when quality is subjective or multi-dimensional. A stronger model grades the output of the model under test.

```python
import anthropic

client = anthropic.Anthropic()

def llm_judge(output: str, rubric: str) -> dict:
    """Grade output against a rubric using a stronger model."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"""Grade this output against the rubric.
<rubric>{rubric}</rubric>
<output>{output}</output>
Think step by step in <thinking> tags.
Then output a JSON object with "score" (1-5) and "reason" (one sentence).""",
        }],
    )
    # parse_json: helper (not shown) that extracts the JSON object from the response text
    return parse_json(response.content[0].text)
```

Promptfoo YAML:

```yaml
assert:
  - type: llm-rubric
    value: "Response mentions the return policy and provides a timeframe"
    provider: openai:gpt-4o
  - type: similar
    value: "reference answer text"
    threshold: 0.8
```

Rubric design principles:

- Be specific: “mentions Acme Inc. in the first sentence” beats “good response”
- Use scales: Likert 1-5, binary correct/incorrect, or ordinal categories
- Require reasoning before scoring to improve judge accuracy
- Use a different (ideally stronger) model than the one being evaluated

Use for final validation, calibrating automated judges, and edge cases where automated grading fails. Expensive and slow — reserve for high-stakes decisions.

| Human eval method | When to use | Scale |
| --- | --- | --- |
| Side-by-side ranking | Comparing two prompt variants | Low |
| Likert rating | Measuring tone, helpfulness, coherence | Medium |
| Correctness labeling | Building golden datasets | Low |
| Error categorization | Understanding failure modes | Low |
| Spot-check production | Validating automated scoring calibration | Medium |

Calibration loop: Run human evals on a sample, then train LLM-as-judge rubrics to match human scores. Periodically re-calibrate.
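One simple way to quantify that calibration is to compare judge scores against human scores on the same sample. A minimal sketch (the example scores are illustrative; real calibration often also uses correlation or Cohen's kappa):

```python
def judge_calibration(human: list[int], judge: list[int]) -> dict:
    """Compare LLM-judge scores to human scores on the same eval sample."""
    assert len(human) == len(judge), "scores must cover the same cases"
    n = len(human)
    exact = sum(h == j for h, j in zip(human, judge)) / n  # exact agreement rate
    mae = sum(abs(h - j) for h, j in zip(human, judge)) / n  # mean absolute error
    return {"exact_agreement": exact, "mean_abs_error": mae}

# Four cases scored by both humans and the judge on a 1-5 scale
report = judge_calibration(human=[5, 4, 3, 2], judge=[5, 4, 2, 2])
```

If agreement drops after a judge-prompt or model change, re-tune the rubric before trusting new scores.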

Curated input-output pairs with verified correct answers. The foundation of regression testing.

```python
# golden_set.jsonl — one case per line:
# {"input": "What is the capital of France?", "expected": "Paris"}
# {"input": "Summarize photosynthesis in one sentence.", "expected": "..."}
import json

def load_golden_set(path: str) -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f]
```

Building golden sets:

  1. Start with 20-50 hand-crafted cases covering core behaviors and edge cases
  2. Add cases from production failures (every bug becomes a test case)
  3. Include adversarial inputs: jailbreaks, prompt injections, ambiguous queries
  4. Label expected output, acceptable variations, and unacceptable outputs
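Point 4 can be encoded directly on each case. A sketch of one labeled case and its grader; the `acceptable`/`unacceptable` field names are hypothetical, not a standard schema:

```python
def grade_case(output: str, case: dict) -> bool:
    """Grade against the expected output plus labeled variations."""
    out = output.strip().lower()
    # Any unacceptable phrase is an automatic fail
    if any(bad.lower() in out for bad in case.get("unacceptable", [])):
        return False
    # Expected answer or any labeled acceptable variation passes
    candidates = [case["expected"], *case.get("acceptable", [])]
    return any(out == c.strip().lower() for c in candidates)

case = {
    "input": "What is the capital of France?",
    "expected": "Paris",
    "acceptable": ["The capital of France is Paris"],
    "unacceptable": ["I don't know"],
}
```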

Use an LLM to expand a small seed set into a larger eval dataset.

```python
def generate_test_cases(seed_examples: list[dict], n: int = 100) -> list[dict]:
    """Generate synthetic eval cases from seed examples."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": f"""Generate {n} test cases similar to these examples.
Include edge cases, adversarial inputs, and boundary conditions.
Output as a JSON array with "input" and "expected" fields.
Examples:
{json.dumps(seed_examples, indent=2)}""",
        }],
    )
    return json.loads(response.content[0].text)
```

Caution: Verify synthetic cases. LLM-generated test data can encode the model’s own biases, creating a self-reinforcing loop.

Sample real user interactions to build eval sets that reflect actual usage patterns, not imagined ones.

```python
import random

# Sampling strategy: log interactions, then filter
def sample_production_data(logs: list[dict], n: int = 200) -> list[dict]:
    """Stratified sample: common cases + edge cases + failures."""
    failures = [l for l in logs if l["user_rating"] < 3]
    normal = [l for l in logs if l["user_rating"] >= 3]
    # Over-sample failures to catch regressions
    return (
        random.sample(failures, min(n // 4, len(failures)))
        + random.sample(normal, min(3 * n // 4, len(normal)))
    )
```

| Metric | What it measures | Method |
| --- | --- | --- |
| Accuracy | Correct answers / total answers | Exact match against golden set |
| F1 score | Precision-recall balance for classification | Compare predicted vs. true labels |
| ROUGE-L | Summary quality (longest common subsequence) | Compare against reference summary |
| Cosine similarity | Semantic consistency across paraphrased inputs | Sentence embedding comparison |
| Faithfulness | Output grounded in provided context (no hallucination) | LLM-as-judge with source docs |
| Relevance | Output addresses the question asked | LLM-as-judge with rubric |
| Toxicity | Harmful or inappropriate content | Classifier or LLM binary judge |

| Metric | What it measures | Target example |
| --- | --- | --- |
| Latency (p50) | Median response time | < 500ms |
| Latency (p99) | Tail response time | < 2000ms |
| Cost per call | API spend per invocation | < $0.01 |
| Token usage | Input + output tokens | < 4096 total |
| Error rate | Failed API calls | < 0.1% |

Track both. A model that scores well on quality but costs 10x more or adds 5 seconds of latency may lose to a “worse” model in production.
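As a concrete example of a quality metric from the table above, ROUGE-L can be computed from scratch with a longest-common-subsequence DP. This is a minimal sketch over whitespace tokens; production scoring would use proper tokenization and stemming:

```python
def rouge_l(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over word tokens via longest common subsequence."""
    a, b = candidate.split(), reference.split()
    # LCS length by dynamic programming
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, wa in enumerate(a):
        for j, wb in enumerate(b):
            dp[i + 1][j + 1] = (
                dp[i][j] + 1 if wa == wb else max(dp[i][j + 1], dp[i + 1][j])
            )
    lcs = dp[len(a)][len(b)]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(a), lcs / len(b)
    return 2 * precision * recall / (precision + recall)
```

The same function works as a promptfoo custom assertion or a grader in a hand-rolled harness.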

The most common pattern. Use a frontier model to evaluate a smaller, cheaper model’s output.

```python
def judge_with_rubric(question: str, output: str, rubric: str) -> int:
    """Returns a 1-5 score."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # Judge: stronger model
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"""You are an evaluation judge. Score this output 1-5.
<question>{question}</question>
<output>{output}</output>
<rubric>{rubric}</rubric>
Think step by step, then output only the integer score on the last line.""",
        }],
    )
    # Parse the last line as the score
    return int(response.content[0].text.strip().split("\n")[-1])
```

Run the same prompt multiple times. If the model gives conflicting answers, the output is unreliable.

```python
from collections import Counter

def self_consistency(prompt: str, n: int = 5, threshold: float = 0.7) -> dict:
    """Run n times and check agreement."""
    responses = [get_completion(prompt) for _ in range(n)]
    # For classification tasks: majority vote
    votes = Counter(r.strip().lower() for r in responses)
    most_common, count = votes.most_common(1)[0]
    return {
        "answer": most_common,
        "agreement": count / n,
        "reliable": count / n >= threshold,
    }
```

Ask the judge to choose the better output from two candidates. More reliable than absolute scoring for comparing variants.

```python
def pairwise_judge(question: str, output_a: str, output_b: str) -> str:
    """Returns 'A', 'B', or 'tie'."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"""Which response better answers the question?
<question>{question}</question>
<response_a>{output_a}</response_a>
<response_b>{output_b}</response_b>
Think step by step, then output exactly one of: A, B, tie""",
        }],
    )
    result = response.content[0].text.strip().split("\n")[-1].upper()
    if result == "A":
        return "A"
    if result == "B":
        return "B"
    return "tie"
```

Position bias: LLM judges favor the first response shown. Mitigate by running each comparison twice with swapped order.
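A sketch of that double-run mitigation, with the judge injected as a callable (in practice it would wrap something like `pairwise_judge` above); only a winner that survives the order swap counts:

```python
def debiased_pairwise(question: str, a: str, b: str, judge) -> str:
    """Run the comparison twice with presentation order swapped.
    `judge(question, first, second)` returns 'A', 'B', or 'tie'
    for the outputs in the order shown."""
    first_pass = judge(question, a, b)    # a shown first
    second_pass = judge(question, b, a)   # b shown first
    # Map the swapped verdict back onto the original labels
    second_pass = {"A": "B", "B": "A"}.get(second_pass, "tie")
    return first_pass if first_pass == second_pass else "tie"

# A judge that always picks the first response gets neutralized to a tie
position_biased = lambda q, x, y: "A"
```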

Prompt changes break existing behavior more often than they improve it. Treat prompt edits like code changes: test before deploying.

1. Baseline — Run eval suite on current prompt, save scores
2. Edit — Change the prompt
3. Re-evaluate — Run same eval suite on new prompt
4. Compare — Diff scores against baseline
5. Ship/revert — Accept if no regressions, or fix and re-test
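The compare step reduces to a per-case diff against the saved baseline. A minimal sketch, assuming scores are keyed by case ID (the IDs and tolerance here are illustrative):

```python
def diff_against_baseline(baseline: dict[str, float], current: dict[str, float],
                          tolerance: float = 0.0) -> list[str]:
    """Return the case IDs whose score regressed past the tolerance."""
    return [
        case_id
        for case_id, base_score in baseline.items()
        if current.get(case_id, 0.0) < base_score - tolerance
    ]

regressions = diff_against_baseline(
    {"refund-policy": 0.9, "greeting-tone": 0.8},
    {"refund-policy": 0.9, "greeting-tone": 0.6},
)
```

An empty list means ship; anything else means fix and re-test.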

Promptfoo Configuration for Regression Testing

promptfooconfig.yaml:

```yaml
description: "Summarization prompt regression test"
prompts:
  - file://prompts/summarize_v1.txt
  - file://prompts/summarize_v2.txt
providers:
  - anthropic:messages:claude-sonnet-4-20250514
defaultTest:
  assert:
    - type: llm-rubric
      value: "Summary captures the main point in 1-2 sentences"
    - type: latency
      threshold: 3000
tests:
  - vars:
      article: "In a groundbreaking study, researchers at MIT..."
    assert:
      - type: contains
        value: "MIT"
      - type: similar
        value: "MIT researchers discovered a new antibiotic compound"
        threshold: 0.7
  - vars:
      article: "The quarterly earnings report showed..."
    assert:
      - type: not-contains
        value: "I don't know"
```
```sh
# Run the eval and view the matrix comparison
npx promptfoo eval
npx promptfoo view
```
```python
from braintrust import Eval
from autoevals import Factuality, ClosedQA

Eval(
    "Summarizer",
    data=lambda: load_golden_set("golden_set.jsonl"),
    task=lambda input: summarize(input, prompt_version="v2"),
    scores=[Factuality, ClosedQA],
)
# The Braintrust UI shows a diff against the previous experiment run
```

Split production traffic between prompt variants and measure user-facing metrics.

```python
import hashlib

def route_prompt(user_id: str, variants: dict, split: float = 0.5) -> str:
    """Deterministic routing based on user ID."""
    # Use a stable hash: Python's built-in hash() varies across processes
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    variant = "B" if bucket < split * 100 else "A"
    return variants[variant]
```

Measure: Task completion rate, user satisfaction, error escalations — not just automated scores.

Deploy the new prompt to a small slice of traffic (1-5%). Monitor automated scores and error rates before full rollout.

1. Deploy new prompt to 2% of traffic
2. Run automated evals on canary outputs for 24-48 hours
3. Compare canary scores against baseline (control group)
4. If canary scores match or exceed baseline → increase to 100%
5. If canary shows regressions → roll back immediately
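Steps 3-5 can be sketched as a single comparison of mean automated scores between the canary and the control group (the `max_drop` tolerance is illustrative, not a recommended value):

```python
def canary_decision(control_scores: list[float], canary_scores: list[float],
                    max_drop: float = 0.02) -> str:
    """Compare mean automated scores; roll back on any meaningful drop."""
    control = sum(control_scores) / len(control_scores)
    canary = sum(canary_scores) / len(canary_scores)
    return "rollback" if canary < control - max_drop else "increase_traffic"
```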
.github/workflows/llm-eval.yml:

```yaml
name: LLM Eval
on:
  pull_request:
    paths:
      - "prompts/**"
      - "promptfooconfig.yaml"
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - name: Cache promptfoo
        uses: actions/cache@v4
        with:
          path: ~/.cache/promptfoo
          key: promptfoo-${{ hashFiles('promptfooconfig.yaml') }}
      - name: Run evals
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          PROMPTFOO_CACHE_PATH: ~/.cache/promptfoo
        run: |
          npx promptfoo eval --output results.json
      - name: Check pass rate
        run: |
          PASS_RATE=$(jq '.results.stats.successes / .results.stats.total' results.json)
          echo "Pass rate: $PASS_RATE"
          if (( $(echo "$PASS_RATE < 0.95" | bc -l) )); then
            echo "::error::Eval pass rate $PASS_RATE below 95% threshold"
            exit 1
          fi
      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: results.json
```

| Strategy | When to use | Example |
| --- | --- | --- |
| Zero failures | Safety-critical assertions | `--fail-on-error` |
| Pass rate | Probabilistic quality | Fail if < 95% of cases pass |
| Score regression | Comparing against baseline | Fail if avg score drops > 5% |
| Per-metric floors | Multi-dimensional quality | Accuracy > 90% AND toxicity < 1% |
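The per-metric-floors strategy can be sketched as a small gate function (the metric names and thresholds are illustrative):

```python
def gate(metrics: dict[str, float], floors: dict[str, float],
         ceilings: dict[str, float]) -> bool:
    """Pass only if every floor is met and every ceiling is respected."""
    floors_ok = all(metrics[name] >= floor for name, floor in floors.items())
    ceilings_ok = all(metrics[name] <= cap for name, cap in ceilings.items())
    return floors_ok and ceilings_ok

# e.g. accuracy must exceed 90% while toxicity stays under 1%
passed = gate(
    {"accuracy": 0.93, "toxicity": 0.004},
    floors={"accuracy": 0.90},
    ceilings={"toxicity": 0.01},
)
```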

LLM evals call APIs for every test case and (for LLM-as-judge) again for every grading step. Costs compound fast.

| Strategy | Savings | Tradeoff |
| --- | --- | --- |
| Cache responses | 50-90% | Stale on model updates |
| Sample eval set (20-30%) | 70-80% | Lower statistical confidence |
| Smaller judge model | 60-80% | Less nuanced grading |
| Deterministic checks first | Variable | Only works for structured output |
| Batch API | 50% | Higher latency (hours, not seconds) |
| Run full suite nightly | N/A | CI gets fast subset; nightly gets full coverage |
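The caching row can be implemented as a thin memoizing wrapper around whatever function calls your model. A sketch; real setups would also key on model name and sampling parameters, and purge entries when the model or prompt template changes:

```python
import hashlib

class CachedCompleter:
    """Memoize completions so repeated eval runs skip the API call."""

    def __init__(self, complete_fn):
        self.complete_fn = complete_fn  # whatever calls your model
        self.cache: dict[str, str] = {}
        self.hits = 0

    def __call__(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.cache[key] = self.complete_fn(prompt)
        return self.cache[key]
```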

- PR-level (every push): 50 critical cases, deterministic assertions only. Cost: ~$0.50 per run.
- Merge-level (on merge to main): 200 cases, deterministic + LLM-as-judge. Cost: ~$5 per run.
- Nightly (scheduled): full golden set (1000+ cases), all metrics. Cost: ~$25 per run.

Write evals first, then iterate on prompts. Analogous to TDD for non-deterministic systems.

1. Red — Write eval cases that define desired behavior
2. Red — Run against initial prompt (expect many failures)
3. Green — Iterate on prompt until eval suite passes
4. Refine — Add edge cases to eval set, tighten thresholds
5. Ship — Baseline scores become the regression floor

The key insight: Prompts without evals drift silently. The eval suite is the specification. Without it, you are guessing.

| Tool | Strengths | Best for |
| --- | --- | --- |
| promptfoo | Open-source, YAML config, matrix comparison, red-team | Prompt comparison, CI integration |
| Braintrust | Experiment tracking, online scoring, dataset mgmt | Production monitoring, team workflows |
| Anthropic Evals | Native Claude integration, grading examples | Claude-specific applications |
| Custom harness | Full control, no vendor lock-in | Unique eval logic, internal tooling |
| autoevals (lib) | Pre-built scorers (Factuality, ClosedQA, Relevance) | Quick scoring without writing rubrics |
```sh
# Install and initialize
npx promptfoo init
# Edit promptfooconfig.yaml, then run
npx promptfoo eval
# View results in the browser
npx promptfoo view
# Compare two prompts side by side and save results
npx promptfoo eval --output results.json
```
```sh
# Install
pip install braintrust autoevals
```

```python
from braintrust import Eval
from autoevals import Factuality

Eval(
    "My App",
    data=lambda: [
        {"input": "What is 2+2?", "expected": "4"},
        {"input": "Capital of France?", "expected": "Paris"},
    ],
    task=lambda input: get_completion(input),
    scores=[Factuality],
)
```
```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    case_id: str
    passed: bool
    score: float
    reason: str

def run_eval_suite(
    cases: list[dict],
    task_fn,
    graders: list,
    threshold: float = 0.9,
) -> dict:
    """Run the eval suite and return a summary."""
    results = []
    for case in cases:
        output = task_fn(case["input"])
        scores = [g(output, case.get("expected", "")) for g in graders]
        avg_score = sum(scores) / len(scores)
        results.append(EvalResult(
            case_id=case.get("id", ""),
            passed=avg_score >= threshold,
            score=avg_score,
            reason=f"Avg score {avg_score:.2f}",
        ))
    passed = sum(1 for r in results if r.passed)
    return {
        "total": len(results),
        "passed": passed,
        "pass_rate": passed / len(results),
        "results": results,
    }
```
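The graders passed to `run_eval_suite` are plain callables returning a score between 0 and 1. Two minimal examples, averaged the same way the harness does (the sample output and expected value are illustrative):

```python
def exact_grader(output: str, expected: str) -> float:
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

def contains_grader(output: str, expected: str) -> float:
    return 1.0 if expected.lower() in output.lower() else 0.0

# Average the graders the same way the harness does
output, expected = "The answer is Paris.", "Paris"
scores = [g(output, expected) for g in (exact_grader, contains_grader)]
avg = sum(scores) / len(scores)  # contains passes, exact match fails
```

With the default threshold of 0.9, a case like this one fails: partial credit from a single grader is not enough.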

| Anti-pattern | Problem | Fix |
| --- | --- | --- |
| Vibes-based evaluation | “It looks good” catches nothing at scale | Define rubrics, measure scores, track over time |
| Testing on training data | Model memorizes answers; eval scores inflate artificially | Use held-out test sets; rotate eval data |
| Single-metric optimization | Optimizing accuracy alone tanks safety, tone, or latency | Track multiple metrics; set floors for each |
| Eval gaming | Tuning prompts to pass specific test cases rather than general behavior | Use large, diverse eval sets; add new cases regularly |
| No baseline | Cannot tell if changes improved or regressed quality | Save baseline scores; diff every prompt change |
| Judge model = tested model | Model grades itself favorably; biases go undetected | Use a different (stronger) model as judge |
| Static eval set | Production inputs drift; eval set becomes unrepresentative | Refresh eval data from production samples quarterly |
| Asserting on exact LLM output | Non-deterministic outputs make exact-match tests flaky | Use semantic similarity, contains checks, or LLM-as-judge |
| Skipping operational metrics | Model quality is high but latency or cost makes it unusable | Always measure latency, cost, and token usage alongside quality |
- Testing — Test runner commands and patterns for deterministic software
- Testing Principles — Strategy, pyramid, TDD — the foundations eval-driven development builds on
- Performance Profiling — Benchmarking and latency measurement, analogous to operational eval metrics
- Prompt Engineering — The prompts these evals validate