
---
id: evals-retrieval-and-answer
title: "Retrieval and Answer Evals"
description: "Build automated graders for the first two eval families: did retrieval find the right evidence, and was the answer correct and grounded?"
type: lesson
path: ai-engineering
module: observability-and-evals
module_order: 6
order: 4
prerequisites:
  - observability-building-your-harness
estimated_minutes: 90
content_status: final
tags:
  - evals
  - retrieval
  - grounding
  - llm-as-judge
  - ragas
  - grading
---

Retrieval and Answer Evals

You can now run the harness with one command and get a traced, costed run log. But grading still happens by hand: you open the file, read each answer, compare it to the gold answer, and assign a grade. That's fine for 15 questions, but it doesn't work for 300, and it doesn't work for checking every pull request.

This lesson builds automated graders for the first two eval families: retrieval evals (did the system find the right evidence?) and answer evals (was the answer correct and grounded?). We'll teach two grading approaches (rule-based checks and LLM-as-judge) and you'll build runnable graders that the harness can call directly. By the end, your harness will produce auto-graded run logs that you can compare without opening a single file.

What you'll learn

  • Distinguish the four eval families and understand why we're starting with retrieval and answer evals
  • Build rule-based retrieval evals that check whether the right files and symbols appeared in the evidence
  • Understand LLM-as-judge as a grading pattern: when it's useful, when it's risky, and how to make it consistent
  • Build an LLM-as-judge answer grader with a structured rubric
  • Use RAGAS metrics (faithfulness, relevance, context precision) as a framework for retrieval eval
  • Connect graders to the harness so runs can be auto-graded

Concepts

The four eval families — AI systems need evaluation at four levels, and each catches different failure modes:

  1. Retrieval evals — did the right files, symbols, or passages appear in the evidence? Catches retrieval failures before they become answer failures.
  2. Answer evals — was the final answer correct, grounded in evidence, and appropriately scoped? Catches generation failures, hallucinations, and grounding breakdowns.
  3. Tool-use evals — did the agent call the right tools with the right arguments? Catches unnecessary tool calls, missing tool calls, and argument errors.
  4. Trace evals — did the overall execution path make sense? Catches inefficient routes, wasteful patterns, and architectural failures.

This lesson covers families 1 and 2. The next lesson covers families 3 and 4.

Retrieval eval metrics — three metrics capture different aspects of retrieval quality:

  • Context precision — of the chunks retrieved, how many were actually relevant? Low precision means the retrieval returned noise alongside signal. High precision means every retrieved chunk was useful.
  • Context recall — of the chunks that should have been retrieved, how many were? Low recall means the system missed important evidence. High recall means the system found everything it needed.
  • Faithfulness — does the answer actually use the retrieved evidence, or does it ignore the evidence and hallucinate? Faithfulness connects retrieval quality to answer quality and catches the case where retrieval worked fine but the model ignored the evidence.

These metrics come from the RAGAS framework, but the concepts are portable. Any retrieval eval system needs to answer the same three questions: did we find the right things, did we avoid finding the wrong things, and did the model actually use what we found?
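As a worked example of precision and recall over file sets (the file names here are made up for illustration):

```python
# Worked example of the first two questions, with made-up file sets.
expected = {"src/utils/path_validator.py", "docs/architecture.md"}     # what should be retrieved
retrieved = {"src/utils/path_validator.py", "README.md", "src/cli.py"}  # what actually came back

hits = expected & retrieved
precision = len(hits) / len(retrieved)  # how much of what we fetched was relevant
recall = len(hits) / len(expected)      # how much of the required evidence we found

print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.33 recall=0.50
```

Notice the two metrics can disagree: this retrieval found one of the two required files (recall 0.50) while padding the evidence with two irrelevant ones (precision 0.33). Faithfulness is the third question and needs a model to answer, which is where the graders below come in.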

LLM-as-judge — using a language model to evaluate or grade the output of another language model (or the same model). It's the go-to pattern for evals that can't be reduced to exact-match checks. Questions like "is this answer correct and well-grounded?" require judgment that rules alone can't provide.

LLM-as-judge is useful when:

  • The correct answer can be phrased many different ways (open-ended questions)
  • You need to evaluate qualities like "grounded," "complete," or "well-reasoned"
  • You have a clear rubric that a model can follow consistently
  • You're willing to spot-check judge outputs for quality

LLM-as-judge is risky when:

  • The evaluation requires domain expertise the judge model doesn't have
  • You need deterministic, reproducible grades (LLM judges have variance)
  • The judge model is the same model that generated the output (self-evaluation bias)
  • You haven't validated the judge against human grades

The key to reliable LLM-as-judge is rubric quality. A vague rubric ("is this answer good?") produces inconsistent grades. A structured rubric with specific criteria and examples produces consistent grades. We'll build a structured rubric in this lesson.

RAGAS — Retrieval Augmented Generation Assessment, an open-source framework that implements retrieval and answer eval metrics. We'll use RAGAS concepts (faithfulness, relevance, context precision) as our metric framework, and show how to compute them both manually and with the RAGAS library. The metrics matter more than the tool. If you understand what faithfulness measures, you can implement it with any library.

Problem-to-Tool Map

| Problem class | Symptom | Cheapest thing to try first | Tool or approach |
| --- | --- | --- | --- |
| Retrieval failures hidden in good answers | The model compensates for bad retrieval with general knowledge | Check retrieval separately from the answer | Retrieval evals with expected-file labels |
| Inconsistent manual grading | Two graders give different grades to the same answer | Write a detailed rubric | Structured rubric with LLM-as-judge |
| Hallucination detection | Answer sounds correct but includes unsupported claims | Read each answer manually | Faithfulness eval comparing answer claims to evidence |
| Answer quality regression | A pipeline change made answers worse but it's hard to tell how | Re-grade everything by hand | Automated answer grader running through the harness |
| Noisy retrieval | Evidence contains irrelevant chunks that dilute useful context | Increase top-k and hope | Context precision metric identifying low-relevance chunks |

Walkthrough

Rule-based retrieval evals

The simplest retrieval eval is a file-match check: did the expected files appear in the evidence? In Module 2, you added gold answers to your benchmark questions. Now we'll add expected evidence files:

{"id": "q001", "question": "What does validate_path return?", "category": "symbol_lookup", "gold_answer": "Returns a Path object...", "expected_files": ["src/utils/path_validator.py"], "expected_symbols": ["validate_path"]}
{"id": "q002", "question": "What is the project architecture?", "category": "architecture", "gold_answer": "The project uses...", "expected_files": ["README.md", "docs/architecture.md"], "expected_symbols": []}
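Before running the grader, it's useful to know how much of the benchmark actually carries evidence labels, since unlabeled questions are skipped. A small sketch (the helper name is ours, not part of the harness) that lists the unlabeled ones:

```python
import json

def questions_missing_evidence_labels(path: str) -> list[str]:
    """Return ids of benchmark questions with no expected_files label.

    Questions without expected_files are skipped by retrieval grading,
    so this shows how much labeling work remains.
    """
    missing = []
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            q = json.loads(line)
            if not q.get("expected_files"):
                missing.append(q["id"])
    return missing
```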

Then build a grader that checks whether retrieval found what it should have:

# harness/graders/retrieval_grader.py
"""Rule-based retrieval evaluation.

Checks whether the expected files and symbols appeared in the
evidence bundle for each benchmark question.
"""
import json
import sys
from pathlib import Path


def grade_retrieval(entry: dict, benchmark_question: dict) -> dict:
    """Grade a single entry's retrieval quality.

    Returns a dict with retrieval-specific metrics.
    """
    expected_files = set(benchmark_question.get("expected_files", []))
    expected_symbols = set(benchmark_question.get("expected_symbols", []))

    # What the system actually retrieved
    retrieved_files = set(entry.get("evidence_files", []))

    # File-level metrics
    if expected_files:
        file_hits = expected_files & retrieved_files
        file_precision = len(file_hits) / len(retrieved_files) if retrieved_files else 0
        file_recall = len(file_hits) / len(expected_files)
    else:
        file_precision = None
        file_recall = None

    # Symbol-level metrics (if evidence includes symbol names)
    # This will be more useful once we add symbol tracking to the run log
    symbol_recall = None
    if expected_symbols:
        # For now, check if expected symbols appear anywhere in the answer
        answer_lower = entry.get("answer", "").lower()
        found_symbols = {
            s for s in expected_symbols if s.lower() in answer_lower
        }
        symbol_recall = len(found_symbols) / len(expected_symbols)

    return {
        "retrieval_file_precision": round(file_precision, 3) if file_precision is not None else None,
        "retrieval_file_recall": round(file_recall, 3) if file_recall is not None else None,
        "retrieval_symbol_recall": round(symbol_recall, 3) if symbol_recall is not None else None,
        "expected_files": list(expected_files),
        "retrieved_files": list(retrieved_files),
        "missing_files": list(expected_files - retrieved_files),
        "extra_files": list(retrieved_files - expected_files),
    }


def grade_run_retrieval(
    run_file: str,
    benchmark_file: str = "benchmark-questions.jsonl",
) -> list[dict]:
    """Grade retrieval quality for an entire run."""
    # Load benchmark questions with expected files
    benchmark = {}
    with open(benchmark_file) as f:
        for line in f:
            if line.strip():
                q = json.loads(line)
                benchmark[q["id"]] = q

    # Load run entries
    entries = []
    with open(run_file) as f:
        for line in f:
            if line.strip():
                entries.append(json.loads(line))

    results = []
    for entry in entries:
        q_id = entry["question_id"]
        if q_id in benchmark:
            bq = benchmark[q_id]
            if bq.get("expected_files"):
                grade = grade_retrieval(entry, bq)
                grade["question_id"] = q_id
                grade["question"] = entry["question"][:60]
                results.append(grade)

    return results


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python harness/graders/retrieval_grader.py <run-file.jsonl>")
        sys.exit(1)

    results = grade_run_retrieval(sys.argv[1])

    if not results:
        print("No questions with expected_files found in benchmark.")
        print("Add 'expected_files' to your benchmark questions to enable retrieval evals.")
        sys.exit(0)

    # Summary
    recalls = [r["retrieval_file_recall"] for r in results if r["retrieval_file_recall"] is not None]
    precisions = [r["retrieval_file_precision"] for r in results if r["retrieval_file_precision"] is not None]

    print(f"Retrieval eval: {len(results)} questions with expected files\n")
    if recalls:
        avg_recall = sum(recalls) / len(recalls)
        print(f"  Average file recall:    {avg_recall:.1%}")
    if precisions:
        avg_precision = sum(precisions) / len(precisions)
        print(f"  Average file precision: {avg_precision:.1%}")

    # Show missed files
    print(f"\nMissed files by question:")
    for r in results:
        if r["missing_files"]:
            print(f"  {r['question']}")
            for f in r["missing_files"]:
                print(f"    MISSING: {f}")
Create the graders directory, then run the grader against a run log:

mkdir -p harness/graders
python harness/graders/retrieval_grader.py harness/runs/harness-2026-03-25-142233.jsonl
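Because these grades are deterministic, they can also gate a pull request. A minimal sketch of a threshold check over the grader's output schema (the threshold value and function name are ours, chosen for illustration):

```python
MIN_RECALL = 0.8  # example threshold — tune to your benchmark

def recall_gate(results: list[dict]) -> bool:
    """True if average file recall across graded questions clears the bar.

    `results` follows the schema emitted by grade_run_retrieval:
    entries may have retrieval_file_recall set to None when no
    expected files were labeled.
    """
    recalls = [
        r["retrieval_file_recall"]
        for r in results
        if r.get("retrieval_file_recall") is not None
    ]
    return bool(recalls) and sum(recalls) / len(recalls) >= MIN_RECALL
```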

Rule-based retrieval evals are fast, deterministic, and free (no API calls). They catch the most common retrieval failure: the system simply didn't find the right file. They can't evaluate how well the evidence was used, and that's where answer evals come in.

LLM-as-judge: building a reliable answer grader

For answer evaluation, we need judgment that rules can't provide. "Is this answer correct and grounded?" requires understanding the question, the gold answer, the system's answer, and the evidence. This is where LLM-as-judge earns its place.

The key to making LLM-as-judge reliable is the rubric. Here's what a good rubric looks like:

# harness/graders/answer_grader.py
"""LLM-as-judge answer evaluation.

Uses a structured rubric to grade answers for correctness,
grounding, and completeness. The rubric is the most important
part — a vague rubric produces inconsistent grades.
"""
import json
import sys
from openai import OpenAI

client = OpenAI()

JUDGE_MODEL = "gpt-4o"  # Use a capable model for judging

GRADING_RUBRIC = """\
You are an expert grader evaluating answers from a code assistant.
You will receive:
- The QUESTION asked
- The GOLD ANSWER (the known-correct reference)
- The SYSTEM ANSWER (what the code assistant produced)
- The EVIDENCE (what the system retrieved, if anything)

Grade the system answer using this rubric:

## Grades

**fully_correct** — The system answer contains the same key facts as the gold
answer. Minor wording differences are fine. The answer doesn't need to be
identical — it needs to convey the same information accurately.

**partially_correct** — The system answer contains some correct information
but is missing key facts from the gold answer, or includes both correct and
incorrect information.

**unsupported** — The system answer doesn't contain enough information to
evaluate correctness. This includes "I don't know" responses, vague answers
that avoid committing to specifics, and answers that are technically true
but don't address the question.

**wrong** — The system answer contains incorrect information that contradicts
the gold answer, or confidently states something that isn't true about the
codebase.

## Failure labels (for non-fully_correct grades only)

- **hallucination** — The answer states something as fact that isn't supported
  by the evidence or the gold answer
- **missing_evidence** — The answer would need evidence that wasn't retrieved
- **retrieval_miss** — Evidence was retrieved but didn't contain what was needed
- **wrong_chunk** — The system found related but wrong evidence
- **reasoning_error** — The evidence was correct but the system drew the wrong
  conclusion
- **scope_confusion** — The answer addresses a different question or scope than
  what was asked

## Output format

Respond with ONLY a JSON object:
{
    "grade": "fully_correct | partially_correct | unsupported | wrong",
    "failure_label": "label or null if fully_correct",
    "confidence": 0.0 to 1.0,
    "reasoning": "1-2 sentences explaining the grade"
}
"""

FAILURE_LABELS = [
    "hallucination",
    "missing_evidence",
    "retrieval_miss",
    "wrong_chunk",
    "reasoning_error",
    "scope_confusion",
]


def grade_answer(
    question: str,
    gold_answer: str,
    system_answer: str,
    evidence: str = "",
    model: str = JUDGE_MODEL,
) -> dict:
    """Grade a single answer using LLM-as-judge.

    Returns a dict with grade, failure_label, confidence, and reasoning.
    """
    user_prompt = f"""QUESTION: {question}

GOLD ANSWER: {gold_answer}

SYSTEM ANSWER: {system_answer}

EVIDENCE: {evidence if evidence else "(no evidence retrieved)"}
"""

    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": GRADING_RUBRIC},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0,
        response_format={"type": "json_object"},
    )

    try:
        result = json.loads(response.choices[0].message.content)
        # Validate expected fields
        assert result.get("grade") in [
            "fully_correct", "partially_correct", "unsupported", "wrong"
        ]
        # Some models keep the grade stable but drift on failure_label.
        if result["grade"] == "fully_correct":
            result["failure_label"] = None
        elif result.get("failure_label") not in FAILURE_LABELS:
            result["failure_label"] = "grading_error"
        return result
    except (json.JSONDecodeError, AssertionError, KeyError):
        return {
            "grade": "unsupported",
            "failure_label": "grading_error",
            "confidence": 0.0,
            "reasoning": "Judge output could not be parsed",
        }


def grade_run_answers(
    run_file: str,
    benchmark_file: str = "benchmark-questions.jsonl",
    model: str = JUDGE_MODEL,
) -> str:
    """Auto-grade all answers in a run file.

    Writes a new *-graded.jsonl file with grades filled in.
    Returns the path to the graded file.
    """
    # Load benchmark questions with gold answers
    benchmark = {}
    with open(benchmark_file) as f:
        for line in f:
            if line.strip():
                q = json.loads(line)
                benchmark[q["id"]] = q

    # Load run entries
    entries = []
    with open(run_file) as f:
        for line in f:
            if line.strip():
                entries.append(json.loads(line))

    print(f"Auto-grading {len(entries)} answers with {model}...\n")

    for i, entry in enumerate(entries):
        q_id = entry["question_id"]
        if q_id not in benchmark:
            print(f"  [{i+1}] {q_id}: no benchmark question found, skipping")
            continue

        bq = benchmark[q_id]
        gold = bq.get("gold_answer", "")
        if not gold:
            print(f"  [{i+1}] {q_id}: no gold answer, skipping")
            continue

        print(f"  [{i+1}/{len(entries)}] {entry['category']}: {entry['question'][:50]}...")

        result = grade_answer(
            question=entry["question"],
            gold_answer=gold,
            system_answer=entry["answer"],
            model=model,
        )

        entry["grade"] = result["grade"]
        entry["failure_label"] = result.get("failure_label")
        entry["grading_notes"] = result.get("reasoning", "")
        entry["judge_confidence"] = result.get("confidence", 0)
        entry["judge_model"] = model

    # Save graded version
    output = run_file.replace(".jsonl", "-graded.jsonl")
    with open(output, "w") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")

    print(f"\nGraded results saved to {output}")
    return output


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python harness/graders/answer_grader.py <run-file.jsonl>")
        sys.exit(1)

    output = grade_run_answers(sys.argv[1])
    print("\nSummarize the graded run with:")
    print(f"  python harness/summarize_run.py {output}")
Run the grader against a run log:

python harness/graders/answer_grader.py harness/runs/harness-2026-03-25-142233.jsonl

Expected behavior: the script prints one progress line per graded answer, writes a new -graded.jsonl file, and then points you at harness/summarize_run.py for the next step.

Making LLM-as-judge reliable

The rubric above works, but LLM judges have known failure modes. Here's how to catch them:

Spot-checking. Grade 10-15 answers by hand, then compare your grades to the judge's grades. If agreement is below 80%, the rubric needs work. Common rubric problems:

  • Grade boundaries are ambiguous ("partially correct" vs. "unsupported")
  • The rubric doesn't handle edge cases (abstentions, off-topic answers)
  • Examples are missing. Adding 2-3 examples per grade level significantly improves consistency
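The agreement check itself is a few lines once both grade sets are keyed by question id (the helper name is ours):

```python
def agreement_rate(human: dict[str, str], judge: dict[str, str]) -> float:
    """Fraction of shared question ids where human and judge grades match.

    Compare this against the 80% bar: below it, rework the rubric
    before trusting the judge at scale.
    """
    shared = human.keys() & judge.keys()
    if not shared:
        return 0.0
    return sum(human[q] == judge[q] for q in shared) / len(shared)
```

It's also worth looking at *which* grades disagree, not just the rate: systematic human="partially_correct" vs. judge="fully_correct" disagreements point straight at grade inflation.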

Judge consistency. Run the same grading twice and compare. LLM judges have variance even at temperature 0. If the same answer gets different grades across runs, that's a rubric clarity problem, not a model problem. Tighten the rubric criteria or add examples.

Self-evaluation bias. If the judge model is the same model that generated the answers, it may be biased toward grading them favorably. I've found it's worth using a different model for judging than for generation when possible, or at least validating against human grades to calibrate.

Grade inflation. LLM judges tend to be generous. If your auto-graded accuracy is significantly higher than your hand-graded accuracy, add stricter criteria to the rubric. Phrases like "the answer must contain the specific function name, not just describe what it does" help the judge apply the right bar.

RAGAS metrics: retrieval quality as numbers

RAGAS provides three metrics that quantify retrieval quality. We'll implement simplified versions first, then show how to use the RAGAS library:

# harness/graders/ragas_metrics.py
"""Simplified RAGAS-style retrieval metrics.

Implements faithfulness, relevance, and context precision
as concepts. For production use, install the ragas library.
"""
import json
import sys
from openai import OpenAI

client = OpenAI()


def score_faithfulness(
    answer: str,
    evidence: str,
    model: str = "gpt-4o-mini",
) -> float:
    """Score how faithfully the answer uses the provided evidence.

    Returns a score from 0.0 (ignores evidence entirely) to 1.0
    (every claim is supported by evidence).

    Faithfulness catches the case where retrieval worked but the
    model hallucinated anyway.
    """
    prompt = f"""Given an ANSWER and the EVIDENCE it was supposed to be based on,
score how faithfully the answer uses the evidence.

Score 1.0: Every factual claim in the answer is directly supported by the evidence.
Score 0.5: Some claims are supported, but the answer also includes unsupported claims.
Score 0.0: The answer ignores the evidence and makes claims not found there.

ANSWER: {answer}

EVIDENCE: {evidence}

Respond with ONLY a JSON object: {{"score": 0.0_to_1.0, "reasoning": "brief explanation"}}"""

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"},
    )

    try:
        result = json.loads(response.choices[0].message.content)
        return float(result.get("score", 0))
    except (json.JSONDecodeError, ValueError):
        return 0.0


def score_answer_relevance(
    question: str,
    answer: str,
    model: str = "gpt-4o-mini",
) -> float:
    """Score how relevant the answer is to the question asked.

    Returns 0.0 (completely off-topic) to 1.0 (directly addresses
    the question).

    Catches scope confusion — the answer is about something else.
    """
    prompt = f"""Given a QUESTION and an ANSWER, score how relevant the answer
is to the question.

Score 1.0: The answer directly addresses exactly what was asked.
Score 0.5: The answer is related but doesn't fully address the question.
Score 0.0: The answer is about something else entirely.

QUESTION: {question}
ANSWER: {answer}

Respond with ONLY a JSON object: {{"score": 0.0_to_1.0, "reasoning": "brief explanation"}}"""

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"},
    )

    try:
        result = json.loads(response.choices[0].message.content)
        return float(result.get("score", 0))
    except (json.JSONDecodeError, ValueError):
        return 0.0


if __name__ == "__main__":
    # Demo with a sample question
    question = "What does validate_path return?"
    answer = "The validate_path function returns a Path object if the path is valid and within the allowed directory tree. It raises a ValueError for paths outside the allowed root."
    evidence = "def validate_path(p: str) -> Path:\n    resolved = Path(p).resolve()\n    if not resolved.is_relative_to(ALLOWED_ROOT):\n        raise ValueError(f'Path {p} outside allowed root')\n    return resolved"

    print("RAGAS-style metrics demo\n")

    faith = score_faithfulness(answer, evidence)
    print(f"Faithfulness: {faith:.2f}")

    relevance = score_answer_relevance(question, answer)
    print(f"Answer relevance: {relevance:.2f}")

    # For contrast, test with a hallucinated answer
    bad_answer = "The validate_path function returns True if the path exists and False otherwise. It also logs the access to a database."
    bad_faith = score_faithfulness(bad_answer, evidence)
    print(f"\nHallucinated answer faithfulness: {bad_faith:.2f}")
Run the demo:

python harness/graders/ragas_metrics.py

Expected output:

RAGAS-style metrics demo

Faithfulness: 0.90
Answer relevance: 0.95

Hallucinated answer faithfulness: 0.20

The faithfulness score drops sharply for the hallucinated answer because the claim about logging to a database isn't in the evidence. This is exactly the kind of failure that's hard to catch with rule-based checks but easy for a faithfulness eval.

Using the RAGAS library

If you want the full RAGAS metric suite, install the library:

pip install ragas

RAGAS provides pre-built metrics that handle the LLM calls and scoring internally. The concepts are the same as what we implemented manually. RAGAS just packages them with more sophisticated prompting and scoring:

# harness/graders/ragas_eval.py
"""RAGAS library integration for retrieval and answer metrics.

Uses the ragas library for production-grade metric computation.
The manual implementations in ragas_metrics.py teach the concepts;
this file shows the library approach.
"""
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from datasets import Dataset


def run_ragas_eval(questions, answers, contexts, ground_truths):
    """Run RAGAS evaluation on a batch of questions.

    Args:
        questions: list of question strings
        answers: list of answer strings
        contexts: list of lists of context strings
        ground_truths: list of gold answer strings

    Returns a dict of metric scores.
    """
    dataset = Dataset.from_dict({
        "question": questions,
        "answer": answers,
        "contexts": contexts,
        "ground_truth": ground_truths,
    })

    results = evaluate(
        dataset,
        metrics=[faithfulness, answer_relevancy, context_precision],
    )

    return results


if __name__ == "__main__":
    # Demo with sample data
    results = run_ragas_eval(
        questions=["What does validate_path return?"],
        answers=["Returns a Path object if valid, raises ValueError otherwise."],
        contexts=[["def validate_path(p: str) -> Path: ..."]],
        ground_truths=["Returns a Path object for valid paths within the allowed directory tree. Raises ValueError for paths outside the allowed root."],
    )
    print("RAGAS results:")
    print(results)

Use whichever approach fits your needs. The manual implementations are easier to customize and debug. The RAGAS library is more thoroughly tested and handles edge cases. I'd recommend starting with the manual implementations to understand what the metrics measure, then switching to the library if you need production-grade scoring.

Connecting graders to the harness

The graders plug into the harness flow like this:

# Run the harness
python harness/run_harness.py

# Auto-grade answers
python harness/graders/answer_grader.py harness/runs/harness-2026-03-25-*.jsonl

# Check retrieval quality
python harness/graders/retrieval_grader.py harness/runs/harness-2026-03-25-*.jsonl

# Summarize and compare
python harness/summarize_run.py harness/runs/harness-2026-03-25-*-graded.jsonl
python harness/compare_runs.py harness/runs/baseline-graded.jsonl harness/runs/harness-2026-03-25-*-graded.jsonl

The grading step adds grade, failure_label, and grading_notes to each entry in the run log. The summarize and compare scripts work the same regardless of whether grades came from human review or LLM-as-judge, since both produce the same fields in the same schema.
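For a quick look between grading and the full summary, you can tally the grade field directly. A small sketch (the helper name is ours; the fields are the ones the grader writes):

```python
import json
from collections import Counter

def grade_distribution(graded_file: str) -> Counter:
    """Count grades in a *-graded.jsonl run log.

    A fast sanity check before running the summarize and compare
    scripts: entries the grader skipped show up as "ungraded".
    """
    counts = Counter()
    with open(graded_file) as f:
        for line in f:
            if line.strip():
                counts[json.loads(line).get("grade", "ungraded")] += 1
    return counts
```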

Exercises

  1. Add expected_files and expected_symbols to at least 10 of your benchmark questions. Run the retrieval grader and check file recall. Which questions have the worst retrieval?
  2. Run the answer grader on your latest harness run. Compare the auto-grades to at least 5 grades you assigned by hand. What's the agreement rate? Where does the judge disagree with you?
  3. Run the RAGAS-style faithfulness scorer on 5 answers: 3 that you know are well-grounded and 2 that you know contain hallucinations. Does the score distinguish between them?
  4. Modify the grading rubric to add examples for each grade level (one example per grade). Re-run the grader on the same run and check whether consistency improved.
  5. Calculate how much auto-grading costs for a full benchmark pass. Compare the grading cost to the pipeline cost. Is automated grading a significant fraction of the total?
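For exercise 5, a back-of-envelope estimate is enough. The per-token prices and token counts below are placeholder assumptions; substitute your judge model's actual rates and your measured prompt sizes:

```python
# Assumed prices, NOT real rates — replace with your judge model's pricing.
PRICE_IN = 2.50 / 1_000_000    # USD per input token (assumption)
PRICE_OUT = 10.00 / 1_000_000  # USD per output token (assumption)

def grading_cost(n_questions: int, in_tokens: int = 800, out_tokens: int = 80) -> float:
    """Estimated cost of one full auto-grading pass.

    in_tokens covers rubric + question + gold answer + system answer;
    out_tokens covers the judge's small JSON verdict.
    """
    return n_questions * (in_tokens * PRICE_IN + out_tokens * PRICE_OUT)

print(f"${grading_cost(300):.2f}")  # prints: $0.84
```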

Completion checkpoint

You have:

  • A rule-based retrieval grader that checks file recall and precision against expected evidence
  • An LLM-as-judge answer grader with a structured rubric that produces grades compatible with the run-log schema
  • At least one auto-graded run log that you've spot-checked against manual grades (target: 80%+ agreement)
  • An understanding of RAGAS metrics (faithfulness, relevance, context precision) and how they quantify retrieval quality
  • Graders connected to the harness flow so you can run, grade, summarize, and compare with a sequence of commands

Reflection prompts

  • Where did the LLM judge disagree with your manual grades? Was the judge too lenient, too strict, or just interpreting the rubric differently?
  • How much did adding expected_files to your benchmark questions change your understanding of retrieval quality? Were there retrieval failures you hadn't noticed from answer quality alone?
  • If you could only have one eval (retrieval or answer), which would you keep? What would you miss from the one you dropped?

Connecting to the project

We can now automatically evaluate whether retrieval found the right evidence and whether the answer was correct and grounded. These two eval families catch the most common failures in a RAG system: bad retrieval and bad generation.

But there are failure modes these evals can't catch. The system might call unnecessary tools, take a wasteful execution path, or route to the wrong retrieval mode. These are architectural failures where the answer might be fine, but the process was inefficient or fragile. The next lesson adds tool-use evals and trace evals that evaluate the system's behavior, not just its output.

What's next

Tool-Use and Trace Evals. This lesson graded outputs; the next one grades behavior, looking at tool choice and execution path quality.

References

Start here

  • RAGAS documentation — the RAGAS framework for retrieval and answer evaluation metrics, including faithfulness, relevance, and context precision



Glossary

Foundational terms

API (Application Programming Interface)
A structured way for programs to communicate. In this context, usually an HTTP endpoint you call to interact with an LLM.
AST (Abstract Syntax Tree)
A tree representation of source code structure. Used by parsers like Tree-sitter to understand code as a hierarchy of functions, classes, and statements. You'll encounter this more deeply in the Code Retrieval module, but the concept appears briefly in retrieval fundamentals.
BM25 (Best Match 25)
A classical ranking function for keyword search. Scores documents by term frequency and inverse document frequency. Often competitive with or complementary to vector search.
Chunking
Splitting a document into smaller pieces for indexing and retrieval. Chunk boundaries significantly affect retrieval quality. Split at the wrong place and your retrieval will return half a function or the end of one paragraph glued to the start of another.
Context engineering
The discipline of selecting, packaging, and budgeting the information a model sees at inference time. Prompts, retrieved evidence, tool results, memory, and state are all parts of context. Context engineering is arguably the core skill of AI engineering. Bigger context windows are not a substitute for better context selection.
Context rot
Degradation of output quality caused by stale, noisy, or accumulated context. Symptoms include stale memory facts, conflicting retrieved evidence, bloated prompt history, and accumulated instructions that contradict each other. A form of technical debt in AI systems.
Context window
The maximum number of tokens an LLM can process in a single request (input + output combined).
Embedding
A fixed-length numeric vector representing a piece of text. Used for similarity search: texts with similar meanings have nearby embeddings.
Endpoint
A specific URL path that accepts requests and returns responses (e.g., POST /v1/chat/completions).
GGUF
A file format for quantized models used by llama.cpp and Ollama. When you see a model name like qwen2.5:7b-q4_K_M, the suffix indicates the quantization scheme. GGUF supports mixed quantization (different precision for different layers) and is the most common format for local inference.
Hallucination
When a model generates content that sounds confident but isn't supported by the evidence it was given, or fabricates details that don't exist. Not the same as "any wrong answer"; a model that misinterprets ambiguous instructions gave a bad answer but didn't hallucinate. Common causes: weak prompt, missing context, context rot, model limitation, or retrieval failure.
Inference
Running a trained model to generate output from input. What happens when you call an API. Most AI engineering work is inference-time work: building systems around models, not training them. Use "inference," not "inferencing."
JSON (JavaScript Object Notation)
A lightweight text format for structured data. The lingua franca of API communication.
Lexical search
Finding items by matching keywords or terms. Includes BM25, TF-IDF (Term Frequency–Inverse Document Frequency), and simple keyword matching. Returns exact term matches, not semantic similarity.
LLM (Large Language Model)
A neural network trained on large text corpora that generates text by predicting the next token. The core technology behind AI engineering; every tool, pattern, and pipeline in this curriculum runs on top of one.
Metadata
Structured information about a document or chunk (file path, language, author, date, symbol type). Used for filtering retrieval results.
Neural network
A computing system loosely inspired by biological neurons, built from layers of mathematical functions that transform inputs into outputs. LLMs are a specific type of neural network (transformers) trained on text. You don't need to understand neural network internals to do AI engineering, but knowing the term helps when reading external resources.
Reasoning model
A model optimized for complex multi-step planning, math, and logic (e.g., o3, o4-mini). Slower and more expensive but better on hard problems. Sometimes called "LRM" (large reasoning model), but "reasoning model" is the more consistent term across provider docs.
Reranking
A second-pass scoring step that re-orders retrieved results using a more expensive model. Improves precision after an initial broad retrieval.
Schema
A formal description of the shape and types of a data structure. Used to validate inputs and outputs.
SLM (small language model)
A compact model (typically 1-7B parameters) that runs on consumer hardware with lower cost and latency and better privacy (e.g., Phi, small Llama variants, Gemma). The right choice when privacy, offline operation, predictable cost, or low latency matter more than peak capability.
System prompt
A special message that sets the model's behavior, role, and constraints for a conversation.
Temperature
A parameter controlling output randomness. Lower values produce more deterministic output; higher values produce more varied output. Does not affect the model's intelligence.
Token
The basic unit an LLM processes. Not a word. Tokens are sub-word fragments. "unhappiness" might be three tokens: "un", "happi", "ness". Token count determines cost and context window usage.
Top-k
The number of results returned from a retrieval query. "Top-5" means the five highest-scoring results.
Top-p (nucleus sampling)
An alternative to temperature for controlling output diversity. Selects from the smallest set of tokens whose cumulative probability exceeds p.
Vector search
Finding items by proximity in embedding space (nearest neighbors). Returns "similar" results, not "exact match" results.
vLLM (virtual LLM)
An inference serving engine (not a model) that hosts open-weight models behind an OpenAI-compatible HTTP endpoint. Infrastructure layer, not model layer. Relevant when moving from hosted APIs to self-hosting.
Weights
The learned parameters inside a model. Changed during training, fixed during inference.
Workhorse model
A general-purpose LLM optimized for speed and broad capability (e.g., GPT-4o-mini, Claude Haiku, Gemini Flash). The default for most tasks. When someone says "LLM" without qualification, they usually mean this.
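Several of the terms above (embedding, vector search, top-k) compose into one small operation: score indexed chunks by cosine similarity to a query vector and keep the k best. A minimal sketch with hand-made three-dimensional vectors; real embedding models produce vectors with hundreds of dimensions, and the chunk names here are illustrative.

```python
import math

def cosine_similarity(a, b):
    # Similarity of two embedding vectors: ~1.0 = same direction, 0.0 = orthogonal.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, indexed, k=2):
    # Vector search: rank indexed chunks by proximity to the query embedding.
    scored = [(cosine_similarity(query_vec, vec), chunk) for chunk, vec in indexed.items()]
    return sorted(scored, reverse=True)[:k]

# Toy "embeddings" standing in for a real embedding model's output.
index = {
    "def parse_config": [0.9, 0.1, 0.0],
    "def load_model":   [0.2, 0.8, 0.1],
    "README intro":     [0.1, 0.2, 0.9],
}
results = top_k([0.85, 0.15, 0.05], index, k=2)
print(results[0][1])  # nearest chunk: "def parse_config"
```

The "similar, not exact match" behavior of vector search falls out of the geometry: the query never contains the string `parse_config`, but its vector points the same way.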

Benchmark and Harness terms

Baseline
The first measured performance of your system on a benchmark. Everything else is compared against this. Without a baseline, you can't tell whether a change helped.
Benchmark
A fixed set of questions or tasks with known-good answers, used to measure system performance over time.
Run log
A structured record (typically JSONL) of every system run: what input was given, what output was produced, what tools were called, how long it took, and what it cost. The raw data that evals, telemetry, and cost analysis are built from.
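A run log in the sense above is just one JSON object per line, appended per benchmark question. The field names below are hypothetical; the shape (flat, structured, one record per line) is what matters, because graders and cost analysis parse it later.

```python
import json
import time

# One record per benchmark question. Field names are illustrative,
# not a prescribed schema.
record = {
    "question_id": "q-017",
    "model": "gpt-4o-mini",
    "answer": "The retry limit is set in config.py.",
    "latency_ms": 1840,
    "cost_usd": 0.0009,
    "tool_calls": ["search_code", "read_file"],
    "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
}

# JSONL: append one line per run, never rewrite the file.
with open("run_log.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```

Because every line is independently parseable, a grader can stream the file without loading the whole run into memory.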

Agent and Tool Building terms

A2A (Agent-to-Agent protocol)
An open protocol for peer-to-peer agent collaboration. Agents discover each other's capabilities and delegate or negotiate tasks as equals. Different from MCP (which connects agents to tools, not to other agents) and from handoffs (which transfer control within one system).
Agent
A system where an LLM decides which tools to call, observes results, and iterates until a task is complete. Agent = model + tools + control loop.
Control loop
The code that manages the agent's cycle: send prompt, check for tool calls, execute tools, append results, repeat or finish.
Handoff
Passing control from one agent or specialist to another within an orchestrated system.
MCP (Model Context Protocol)
An open protocol for exposing tools, resources, and prompts to AI applications in a standardized way. Connects agents to capabilities (tools and data), not to other agents.
Tool calling / function calling
The model's ability to request execution of a specific function with structured arguments, rather than just generating text.
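The "agent = model + tools + control loop" definition maps onto a short loop. A sketch with a stubbed model so it runs offline; the message shapes loosely follow OpenAI-style tool calling, and every name here is hypothetical.

```python
def control_loop(model, tools, user_msg, max_steps=5):
    # The cycle from the definition: send prompt, check for a tool call,
    # execute the tool, append the result, repeat or finish.
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):
        reply = model(messages)
        if "tool_call" not in reply:
            return reply["content"]        # no tool requested: final answer
        name, args = reply["tool_call"]
        result = tools[name](**args)       # execute the requested tool
        messages.append({"role": "tool", "name": name, "content": str(result)})
    return "max steps exceeded"

# Stub model: requests one tool call, then answers using its result.
def stub_model(messages):
    if messages[-1]["role"] == "tool":
        return {"content": f"Found it: {messages[-1]['content']}"}
    return {"tool_call": ("search_code", {"query": "retry limit"})}

tools = {"search_code": lambda query: f"config.py defines {query}"}
print(control_loop(stub_model, tools, "Where is the retry limit set?"))
# prints "Found it: config.py defines retry limit"
```

The `max_steps` cap is the part beginners forget: without it, a model that keeps requesting tools loops forever.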

Code Retrieval terms

Context compilation / context packing
The process of selecting and assembling the smallest useful set of evidence for a specific task. Not "dump everything retrieved into the prompt."
Grounding
Tying model assertions to specific evidence. A grounded answer cites what it found; an ungrounded answer asserts without evidence.
Hybrid retrieval
Combining multiple retrieval methods (e.g., vector search + keyword search + metadata filters) and merging or reranking the results.
Knowledge graph
A data structure that stores entities and their relationships explicitly (e.g., "function A calls function B," "module X imports module Y"). Useful for traversal and dependency reasoning. One retrieval strategy among several, often overused when simpler metadata or adjacency tables would suffice.
RAG (Retrieval-Augmented Generation)
A pattern where the model's response is grounded in retrieved external evidence rather than relying solely on its training data.
Symbol table
A mapping of code identifiers (functions, classes, variables) to their locations and metadata.
Tree-sitter
An incremental parsing library that builds ASTs for source code. Used in this curriculum for code-aware chunking and symbol extraction.

RAG and Grounded Answers terms

Context pack
A structured bundle of evidence assembled for a specific task, with metadata about provenance, relevance, and token budget.
Evidence bundle
A collection of retrieved items grouped for a specific sub-task, with enough metadata to evaluate whether the evidence is relevant and sufficient.
Retrieval routing
Deciding which retrieval strategy or method to use for a given query. Different questions need different retrieval methods.

Observability and Evals terms

Eval
A structured test that measures system quality. Not the same as training. Evals measure, they don't change the model.
Harness (AI harness / eval harness)
The experiment and evaluation framework around your model or agent. It runs benchmark tasks, captures outputs, logs traces, grades results, and compares system versions. It turns ad hoc "try it and see" into repeatable, comparable experiments. Typically includes: input dataset, prompt and tool configuration, model/provider selection, execution loop, logging, grading, and artifact capture.
LLM-as-judge
Using a language model to evaluate or grade the output of another model or system. Useful for scaling evaluation beyond manual review, but requires rubric quality, judge consistency checks, and human spot-checking. Not a replacement for exact-match checks where they apply.
OpenTelemetry (OTel)
An open standard for collecting and exporting telemetry data (traces, metrics, logs). Vendor-agnostic.
RAGAS
A specific eval framework for retrieval-augmented generation. Measures metrics like faithfulness, relevance, and context precision. One tool example, not a foundational concept. Learn the metrics first, then the tool.
Span
A single operation within a trace (e.g., one tool call, one retrieval query). Traces are made of spans.
Telemetry
Structured data about system behavior: what happened, when, how long it took, what it cost. Includes traces, metrics, and events.
Trace
A structured record of one complete run through the system, including all steps, tool calls, and decisions.
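The LLM-as-judge pattern described above reduces to: send the question, gold answer, and candidate answer to a grading model with a rubric, then parse a structured verdict. A minimal sketch; `call_model` is a stand-in for whatever client you actually use, and the rubric fields are illustrative, not a prescribed schema.

```python
import json

RUBRIC_PROMPT = """You are grading an answer against a gold answer.
Return only JSON: {{"correct": true/false, "grounded": true/false, "reason": "..."}}

Question: {question}
Gold answer: {gold}
Candidate answer: {candidate}
"""

def judge(question, gold, candidate, call_model):
    # call_model: any function taking a prompt string and returning the
    # model's text response (OpenAI, Anthropic, a local server, ...).
    raw = call_model(RUBRIC_PROMPT.format(
        question=question, gold=gold, candidate=candidate))
    verdict = json.loads(raw)
    # Coerce to booleans so downstream aggregation never sees free text.
    return {
        "correct": bool(verdict.get("correct")),
        "grounded": bool(verdict.get("grounded")),
        "reason": verdict.get("reason", ""),
    }

# Stubbed judge for demonstration; a real run would hit an LLM endpoint.
fake_model = lambda p: '{"correct": true, "grounded": false, "reason": "Right fact, no citation."}'
result = judge("Where is the retry limit set?", "config.py", "In config.py", fake_model)
print(result)
```

Note the separation of "correct" from "grounded": an answer can state the right fact without citing evidence, and a judge that collapses those into one score hides exactly the failure mode this lesson cares about.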

Orchestration and Memory terms

Long-term memory
Persistent facts that survive across conversations. Requires write policies to manage what gets stored, updated, or deleted.
Orchestration
Explicit control over how tasks are routed, delegated, and synthesized across multiple agents or specialists.
Router
A component that decides which specialist or workflow path to use for a given query.
Specialist
An agent or workflow tuned for a narrow task (e.g., "code search," "documentation lookup," "test generation"). Specialists are composed by an orchestrator.
Thread memory
Conversation state that persists within a single session or thread.
Workflow memory
Intermediate state that persists within a multi-step task but doesn't survive beyond the workflow's completion.

Optimization terms

Catastrophic forgetting
When fine-tuning causes a model to lose capabilities it had before training. The model gets better at the fine-tuned task but worse at tasks it previously handled. PEFT methods like LoRA reduce this risk by freezing original weights.
Distillation
Training a smaller (student) model to reproduce the behavior of a larger (teacher) model on a specific task.
DPO (Direct Preference Optimization)
A method for preference-based model optimization that's simpler than RLHF, training the model directly on preference pairs without a separate reward model.
Fine-tuning
Updating a model's weights on task-specific data to change its behavior permanently. An umbrella term that includes SFT, instruction tuning, RLHF, DPO, and other techniques. See the fine-tuning landscape table in Lesson 8.3 for how these relate.
Full fine-tuning
Updating all of a model's parameters during training, as opposed to PEFT methods that update only a small subset. Requires significantly more GPU memory and compute. Produces the most thorough adaptation but carries higher risk of catastrophic forgetting.
Inference server
Software (like vLLM or Ollama) that hosts a model and serves inference requests.
Instruction tuning
A specific application of SFT where the training data consists of instruction-response pairs. This is how base models become chat models: the technique is SFT, the data format is instructions. Not a separate technique from SFT.
LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning method that trains small adapter matrices instead of updating all model weights. Dramatically reduces GPU memory and compute requirements.
Overfitting
When a model memorizes training examples instead of learning generalizable patterns. The model performs well on training data but poorly on new inputs. Detected by monitoring validation loss alongside training loss.
Parameter count
The number of learned weights in a model, commonly expressed in billions (e.g., "7B" = 7 billion parameters). Determines memory requirements (roughly 2 bytes per parameter at FP16) and broadly correlates with capability, though training quality and architecture matter as much as size. See Model Selection and Serving for sizing guidance.
PEFT (Parameter-Efficient Fine-Tuning)
A family of methods (including LoRA) that fine-tune a small subset of parameters instead of the full model.
Preference optimization
Training methods (RLHF, DPO) that use human or automated preference signals to improve model behavior. "This output is better than that output" rather than "this is the correct output."
QLoRA (Quantized LoRA)
LoRA applied to a quantized (compressed) base model. Further reduces memory requirements, enabling fine-tuning on consumer hardware.
Quantization
Reducing the precision of model weights (e.g., FP16 → INT4) to shrink memory usage and increase inference speed at some quality cost. A 7B model at FP16 needs ~14 GB VRAM; quantized to 4-bit, it fits in ~4 GB. Common formats include GGUF (llama.cpp/Ollama), GPTQ and AWQ (vLLM/HuggingFace). See Model Selection and Serving for format details and tradeoffs.
RLHF (Reinforcement Learning from Human Feedback)
A training method that uses human preference signals to improve model behavior through a reward model. More complex than DPO (requires training a separate reward model) but offers more control over the optimization objective.
SFT (Supervised Fine-Tuning)
Fine-tuning using input-output pairs where the desired output is known. The most common fine-tuning approach.
TRL (Transformer Reinforcement Learning)
A Hugging Face library for training language models with reinforcement learning, SFT, and other optimization methods.
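The memory arithmetic behind the parameter-count and quantization entries is simple enough to compute directly. A sketch of the rough rule of thumb (bytes per weight × parameter count), which ignores activation memory and KV-cache overhead, so real VRAM needs run somewhat higher.

```python
def model_memory_gb(params_billion, bits_per_weight):
    # bytes per weight = bits / 8; using 1 GB = 1e9 bytes for rough sizing.
    return params_billion * 1e9 * (bits_per_weight / 8) / 1e9

# A 7B model at FP16 vs 4-bit quantization:
print(model_memory_gb(7, 16))  # 14.0 -> the "~2 bytes per parameter" rule
print(model_memory_gb(7, 4))   # 3.5  -> close to the "~4 GB" figure cited above
```

The 4-bit figure lands below the quoted ~4 GB because formats like GGUF store extra scale factors per block and keep some layers at higher precision, which the naive formula doesn't count.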

Cross-cutting terms

Consumer chat app
The browser or desktop product meant for human conversation (ChatGPT, Claude, HuggingChat). Useful for experimentation, but not the same as API access.
Developer platform
The provider's API, billing, API-key, and developer-docs surface. This is what you need for this learning path.
Hosted API
The provider runs the model for you and you call it over HTTP.
Local inference
You run the model on your own machine.
Provider
The company or service that hosts a model API you call from code.
Prompt caching
Reusing computation from repeated prompt prefixes to reduce latency and cost on subsequent requests with the same prefix.
Rate limiting
Constraints on how many API requests you can make per unit of time. An operational concern that affects system design and cost.
Token budget
The maximum number of tokens you allocate for a specific part of the context (e.g., "retrieval evidence gets at most 4K tokens"). A context engineering tool for preventing any single component from dominating the context window.
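A token budget like the one defined above can be enforced mechanically: cap each context component and truncate whatever exceeds its allocation. A sketch using a naive whitespace token count; real code should use the model's own tokenizer (e.g., tiktoken for OpenAI models), and the component names and caps here are illustrative.

```python
# Illustrative per-component caps, not recommended values.
BUDGETS = {"system": 500, "evidence": 4000, "history": 1500}

def count_tokens(text):
    # Naive stand-in: whitespace words, not real tokens.
    return len(text.split())

def enforce_budget(component, text):
    words = text.split()
    cap = BUDGETS[component]
    if len(words) <= cap:
        return text
    # Which end to truncate from is a real design decision (keep the newest
    # history, keep the highest-ranked evidence); here we keep the front.
    return " ".join(words[:cap])

evidence = "chunk " * 5000          # 5000 "tokens" of retrieved evidence
packed = enforce_budget("evidence", evidence)
print(count_tokens(packed))         # prints 4000
```

The point of the hard cap is architectural, not cosmetic: no single component (usually retrieval evidence) can crowd the others out of the context window.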