
---
id: evals-retrieval-and-answer
title: "Retrieval and Answer Evals"
description: "Build automated graders for the first two eval families: did retrieval find the right evidence, and was the answer correct and grounded?"
type: lesson
path: ai-engineering
module: observability-and-evals
module_order: 6
order: 4
prerequisites:
  - observability-building-your-harness
estimated_minutes: 90
content_status: final
tags:
  - evals
  - retrieval
  - grounding
  - llm-as-judge
  - ragas
  - grading
---

Retrieval and Answer Evals

You can now run the harness with one command and get a traced, costed run log. But grading still happens by hand: you open the file, read each answer, compare it to the gold answer, and assign a grade. That's fine for 15 questions, but it doesn't work for 300, and it doesn't work for checking every pull request.

This lesson builds automated graders for the first two eval families: retrieval evals (did the system find the right evidence?) and answer evals (was the answer correct and grounded?). We'll teach two grading approaches (rule-based checks and LLM-as-judge) and you'll build runnable graders that the harness can call directly. By the end, your harness will produce auto-graded run logs that you can compare without opening a single file.

What you'll learn

  • Distinguish the four eval families and understand why we're starting with retrieval and answer evals
  • Build rule-based retrieval evals that check whether the right files and symbols appeared in the evidence
  • Understand LLM-as-judge as a grading pattern: when it's useful, when it's risky, and how to make it consistent
  • Build an LLM-as-judge answer grader with a structured rubric
  • Use RAGAS metrics (faithfulness, relevance, context precision) as a framework for retrieval eval
  • Connect graders to the harness so runs can be auto-graded

Concepts

The four eval families — AI systems need evaluation at four levels, and each catches different failure modes:

  1. Retrieval evals — did the right files, symbols, or passages appear in the evidence? Catches retrieval failures before they become answer failures.
  2. Answer evals — was the final answer correct, grounded in evidence, and appropriately scoped? Catches generation failures, hallucinations, and grounding breakdowns.
  3. Tool-use evals — did the agent call the right tools with the right arguments? Catches unnecessary tool calls, missing tool calls, and argument errors.
  4. Trace evals — did the overall execution path make sense? Catches inefficient routes, wasteful patterns, and architectural failures.

This lesson covers families 1 and 2. The next lesson covers families 3 and 4.

Retrieval eval metrics — three metrics capture different aspects of retrieval quality:

  • Context precision — of the chunks retrieved, how many were actually relevant? Low precision means the retrieval returned noise alongside signal. High precision means every retrieved chunk was useful.
  • Context recall — of the chunks that should have been retrieved, how many were? Low recall means the system missed important evidence. High recall means the system found everything it needed.
  • Faithfulness — does the answer actually use the retrieved evidence, or does it ignore the evidence and hallucinate? Faithfulness connects retrieval quality to answer quality and catches the case where retrieval worked fine but the model ignored the evidence.

These metrics come from the RAGAS framework, but the concepts are portable. Any retrieval eval system needs to answer the same three questions: did we find the right things, did we avoid finding the wrong things, and did the model actually use what we found?
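As a worked example of precision and recall over file sets (the file names here are made up for illustration):

```python
# Worked example of the first two questions, with made-up file sets.
expected = {"src/utils/path_validator.py", "docs/architecture.md"}     # what should be retrieved
retrieved = {"src/utils/path_validator.py", "README.md", "src/cli.py"}  # what actually came back

hits = expected & retrieved
precision = len(hits) / len(retrieved)  # how much of what we fetched was relevant
recall = len(hits) / len(expected)      # how much of the required evidence we found

print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.33 recall=0.50
```

Notice the two metrics can disagree: this retrieval found one of the two required files (recall 0.50) while padding the evidence with two irrelevant ones (precision 0.33). Faithfulness is the third question and needs a model to answer, which is where the graders below come in.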

LLM-as-judge — using a language model to evaluate or grade the output of another language model (or the same model). It's the go-to pattern for evals that can't be reduced to exact-match checks. Questions like "is this answer correct and well-grounded?" require judgment that rules alone can't provide.

LLM-as-judge is useful when:

  • The correct answer can be phrased many different ways (open-ended questions)
  • You need to evaluate qualities like "grounded," "complete," or "well-reasoned"
  • You have a clear rubric that a model can follow consistently
  • You're willing to spot-check judge outputs for quality

LLM-as-judge is risky when:

  • The evaluation requires domain expertise the judge model doesn't have
  • You need deterministic, reproducible grades (LLM judges have variance)
  • The judge model is the same model that generated the output (self-evaluation bias)
  • You haven't validated the judge against human grades

The key to reliable LLM-as-judge is rubric quality. A vague rubric ("is this answer good?") produces inconsistent grades. A structured rubric with specific criteria and examples produces consistent grades. We'll build a structured rubric in this lesson.

RAGAS — Retrieval Augmented Generation Assessment, an open-source framework that implements retrieval and answer eval metrics. We'll use RAGAS concepts (faithfulness, relevance, context precision) as our metric framework, and show how to compute them both manually and with the RAGAS library. The metrics matter more than the tool. If you understand what faithfulness measures, you can implement it with any library.

Problem-to-Tool Map

| Problem class | Symptom | Cheapest thing to try first | Tool or approach |
| --- | --- | --- | --- |
| Retrieval failures hidden in good answers | The model compensates for bad retrieval with general knowledge | Check retrieval separately from the answer | Retrieval evals with expected-file labels |
| Inconsistent manual grading | Two graders give different grades to the same answer | Write a detailed rubric | Structured rubric with LLM-as-judge |
| Hallucination detection | Answer sounds correct but includes unsupported claims | Read each answer manually | Faithfulness eval comparing answer claims to evidence |
| Answer quality regression | A pipeline change made answers worse but it's hard to tell how | Re-grade everything by hand | Automated answer grader running through the harness |
| Noisy retrieval | Evidence contains irrelevant chunks that dilute useful context | Increase top-k and hope | Context precision metric identifying low-relevance chunks |

Walkthrough

Rule-based retrieval evals

The simplest retrieval eval is a file-match check: did the expected files appear in the evidence? In Module 2, you added gold answers to your benchmark questions. Now we'll add expected evidence files:

{"id": "q001", "question": "What does validate_path return?", "category": "symbol_lookup", "gold_answer": "Returns a Path object...", "expected_files": ["src/utils/path_validator.py"], "expected_symbols": ["validate_path"]}
{"id": "q002", "question": "What is the project architecture?", "category": "architecture", "gold_answer": "The project uses...", "expected_files": ["README.md", "docs/architecture.md"], "expected_symbols": []}
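Before running the grader, it's useful to know how much of the benchmark actually carries evidence labels, since unlabeled questions are skipped. A small sketch (the helper name is ours, not part of the harness) that lists the unlabeled ones:

```python
import json

def questions_missing_evidence_labels(path: str) -> list[str]:
    """Return ids of benchmark questions with no expected_files label.

    Questions without expected_files are skipped by retrieval grading,
    so this shows how much labeling work remains.
    """
    missing = []
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            q = json.loads(line)
            if not q.get("expected_files"):
                missing.append(q["id"])
    return missing
```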

Then build a grader that checks whether retrieval found what it should have:

# harness/graders/retrieval_grader.py
"""Rule-based retrieval evaluation.

Checks whether the expected files and symbols appeared in the
evidence bundle for each benchmark question.
"""
import json
import sys
from pathlib import Path


def grade_retrieval(entry: dict, benchmark_question: dict) -> dict:
    """Grade a single entry's retrieval quality.

    Returns a dict with retrieval-specific metrics.
    """
    expected_files = set(benchmark_question.get("expected_files", []))
    expected_symbols = set(benchmark_question.get("expected_symbols", []))

    # What the system actually retrieved
    retrieved_files = set(entry.get("evidence_files", []))

    # File-level metrics
    if expected_files:
        file_hits = expected_files & retrieved_files
        file_precision = len(file_hits) / len(retrieved_files) if retrieved_files else 0
        file_recall = len(file_hits) / len(expected_files)
    else:
        file_precision = None
        file_recall = None

    # Symbol-level metrics (if evidence includes symbol names)
    # This will be more useful once we add symbol tracking to the run log
    symbol_recall = None
    if expected_symbols:
        # For now, check if expected symbols appear anywhere in the answer
        answer_lower = entry.get("answer", "").lower()
        found_symbols = {
            s for s in expected_symbols if s.lower() in answer_lower
        }
        symbol_recall = len(found_symbols) / len(expected_symbols)

    return {
        "retrieval_file_precision": round(file_precision, 3) if file_precision is not None else None,
        "retrieval_file_recall": round(file_recall, 3) if file_recall is not None else None,
        "retrieval_symbol_recall": round(symbol_recall, 3) if symbol_recall is not None else None,
        "expected_files": list(expected_files),
        "retrieved_files": list(retrieved_files),
        "missing_files": list(expected_files - retrieved_files),
        "extra_files": list(retrieved_files - expected_files),
    }


def grade_run_retrieval(
    run_file: str,
    benchmark_file: str = "benchmark-questions.jsonl",
) -> list[dict]:
    """Grade retrieval quality for an entire run."""
    # Load benchmark questions with expected files
    benchmark = {}
    with open(benchmark_file) as f:
        for line in f:
            if line.strip():
                q = json.loads(line)
                benchmark[q["id"]] = q

    # Load run entries
    entries = []
    with open(run_file) as f:
        for line in f:
            if line.strip():
                entries.append(json.loads(line))

    results = []
    for entry in entries:
        q_id = entry["question_id"]
        if q_id in benchmark:
            bq = benchmark[q_id]
            if bq.get("expected_files"):
                grade = grade_retrieval(entry, bq)
                grade["question_id"] = q_id
                grade["question"] = entry["question"][:60]
                results.append(grade)

    return results


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python harness/graders/retrieval_grader.py <run-file.jsonl>")
        sys.exit(1)

    results = grade_run_retrieval(sys.argv[1])

    if not results:
        print("No questions with expected_files found in benchmark.")
        print("Add 'expected_files' to your benchmark questions to enable retrieval evals.")
        sys.exit(0)

    # Summary
    recalls = [r["retrieval_file_recall"] for r in results if r["retrieval_file_recall"] is not None]
    precisions = [r["retrieval_file_precision"] for r in results if r["retrieval_file_precision"] is not None]

    print(f"Retrieval eval: {len(results)} questions with expected files\n")
    if recalls:
        avg_recall = sum(recalls) / len(recalls)
        print(f"  Average file recall:    {avg_recall:.1%}")
    if precisions:
        avg_precision = sum(precisions) / len(precisions)
        print(f"  Average file precision: {avg_precision:.1%}")

    # Show missed files
    print(f"\nMissed files by question:")
    for r in results:
        if r["missing_files"]:
            print(f"  {r['question']}")
            for f in r["missing_files"]:
                print(f"    MISSING: {f}")
Create the graders directory, then run the grader against a run log:

mkdir -p harness/graders
python harness/graders/retrieval_grader.py harness/runs/harness-2026-03-25-142233.jsonl
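Because these grades are deterministic, they can also gate a pull request. A minimal sketch of a threshold check over the grader's output schema (the threshold value and function name are ours, chosen for illustration):

```python
MIN_RECALL = 0.8  # example threshold — tune to your benchmark

def recall_gate(results: list[dict]) -> bool:
    """True if average file recall across graded questions clears the bar.

    `results` follows the schema emitted by grade_run_retrieval:
    entries may have retrieval_file_recall set to None when no
    expected files were labeled.
    """
    recalls = [
        r["retrieval_file_recall"]
        for r in results
        if r.get("retrieval_file_recall") is not None
    ]
    return bool(recalls) and sum(recalls) / len(recalls) >= MIN_RECALL
```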

Rule-based retrieval evals are fast, deterministic, and free (no API calls). They catch the most common retrieval failure: the system simply didn't find the right file. They can't evaluate how well the evidence was used, and that's where answer evals come in.

LLM-as-judge: building a reliable answer grader

For answer evaluation, we need judgment that rules can't provide. "Is this answer correct and grounded?" requires understanding the question, the gold answer, the system's answer, and the evidence. This is where LLM-as-judge earns its place.

The key to making LLM-as-judge reliable is the rubric. Here's what a good rubric looks like:

# harness/graders/answer_grader.py
"""LLM-as-judge answer evaluation.

Uses a structured rubric to grade answers for correctness,
grounding, and completeness. The rubric is the most important
part — a vague rubric produces inconsistent grades.
"""
import json
import sys
from openai import OpenAI

client = OpenAI()

JUDGE_MODEL = "gpt-4o"  # Use a capable model for judging

GRADING_RUBRIC = """\
You are an expert grader evaluating answers from a code assistant.
You will receive:
- The QUESTION asked
- The GOLD ANSWER (the known-correct reference)
- The SYSTEM ANSWER (what the code assistant produced)
- The EVIDENCE (what the system retrieved, if anything)

Grade the system answer using this rubric:

## Grades

**fully_correct** — The system answer contains the same key facts as the gold
answer. Minor wording differences are fine. The answer doesn't need to be
identical — it needs to convey the same information accurately.

**partially_correct** — The system answer contains some correct information
but is missing key facts from the gold answer, or includes both correct and
incorrect information.

**unsupported** — The system answer doesn't contain enough information to
evaluate correctness. This includes "I don't know" responses, vague answers
that avoid committing to specifics, and answers that are technically true
but don't address the question.

**wrong** — The system answer contains incorrect information that contradicts
the gold answer, or confidently states something that isn't true about the
codebase.

## Failure labels (for non-fully_correct grades only)

- **hallucination** — The answer states something as fact that isn't supported
  by the evidence or the gold answer
- **missing_evidence** — The answer would need evidence that wasn't retrieved
- **retrieval_miss** — Evidence was retrieved but didn't contain what was needed
- **wrong_chunk** — The system found related but wrong evidence
- **reasoning_error** — The evidence was correct but the system drew the wrong
  conclusion
- **scope_confusion** — The answer addresses a different question or scope than
  what was asked

## Output format

Respond with ONLY a JSON object:
{
    "grade": "fully_correct | partially_correct | unsupported | wrong",
    "failure_label": "label or null if fully_correct",
    "confidence": 0.0 to 1.0,
    "reasoning": "1-2 sentences explaining the grade"
}
"""

FAILURE_LABELS = [
    "hallucination",
    "missing_evidence",
    "retrieval_miss",
    "wrong_chunk",
    "reasoning_error",
    "scope_confusion",
]


def grade_answer(
    question: str,
    gold_answer: str,
    system_answer: str,
    evidence: str = "",
    model: str = JUDGE_MODEL,
) -> dict:
    """Grade a single answer using LLM-as-judge.

    Returns a dict with grade, failure_label, confidence, and reasoning.
    """
    user_prompt = f"""QUESTION: {question}

GOLD ANSWER: {gold_answer}

SYSTEM ANSWER: {system_answer}

EVIDENCE: {evidence if evidence else "(no evidence retrieved)"}
"""

    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": GRADING_RUBRIC},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0,
        response_format={"type": "json_object"},
    )

    try:
        result = json.loads(response.choices[0].message.content)
        # Validate expected fields
        assert result.get("grade") in [
            "fully_correct", "partially_correct", "unsupported", "wrong"
        ]
        # Some models keep the grade stable but drift on failure_label.
        if result["grade"] == "fully_correct":
            result["failure_label"] = None
        elif result.get("failure_label") not in FAILURE_LABELS:
            result["failure_label"] = "grading_error"
        return result
    except (json.JSONDecodeError, AssertionError, KeyError):
        return {
            "grade": "unsupported",
            "failure_label": "grading_error",
            "confidence": 0.0,
            "reasoning": "Judge output could not be parsed",
        }


def grade_run_answers(
    run_file: str,
    benchmark_file: str = "benchmark-questions.jsonl",
    model: str = JUDGE_MODEL,
) -> str:
    """Auto-grade all answers in a run file.

    Writes a new *-graded.jsonl file with grades filled in.
    Returns the path to the graded file.
    """
    # Load benchmark questions with gold answers
    benchmark = {}
    with open(benchmark_file) as f:
        for line in f:
            if line.strip():
                q = json.loads(line)
                benchmark[q["id"]] = q

    # Load run entries
    entries = []
    with open(run_file) as f:
        for line in f:
            if line.strip():
                entries.append(json.loads(line))

    print(f"Auto-grading {len(entries)} answers with {model}...\n")

    for i, entry in enumerate(entries):
        q_id = entry["question_id"]
        if q_id not in benchmark:
            print(f"  [{i+1}] {q_id}: no benchmark question found, skipping")
            continue

        bq = benchmark[q_id]
        gold = bq.get("gold_answer", "")
        if not gold:
            print(f"  [{i+1}] {q_id}: no gold answer, skipping")
            continue

        print(f"  [{i+1}/{len(entries)}] {entry['category']}: {entry['question'][:50]}...")

        result = grade_answer(
            question=entry["question"],
            gold_answer=gold,
            system_answer=entry["answer"],
            model=model,
        )

        entry["grade"] = result["grade"]
        entry["failure_label"] = result.get("failure_label")
        entry["grading_notes"] = result.get("reasoning", "")
        entry["judge_confidence"] = result.get("confidence", 0)
        entry["judge_model"] = model

    # Save graded version
    output = run_file.replace(".jsonl", "-graded.jsonl")
    with open(output, "w") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")

    print(f"\nGraded results saved to {output}")
    return output


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python harness/graders/answer_grader.py <run-file.jsonl>")
        sys.exit(1)

    output = grade_run_answers(sys.argv[1])
    print("\nSummarize the graded run with:")
    print(f"  python harness/summarize_run.py {output}")
Run the grader against a run log:

python harness/graders/answer_grader.py harness/runs/harness-2026-03-25-142233.jsonl

Expected behavior: the script prints one progress line per graded answer, writes a new -graded.jsonl file, and then points you at harness/summarize_run.py for the next step.

Making LLM-as-judge reliable

The rubric above works, but LLM judges have known failure modes. Here's how to catch them:

Spot-checking. Grade 10-15 answers by hand, then compare your grades to the judge's grades. If agreement is below 80%, the rubric needs work. Common rubric problems:

  • Grade boundaries are ambiguous ("partially correct" vs. "unsupported")
  • The rubric doesn't handle edge cases (abstentions, off-topic answers)
  • Examples are missing. Adding 2-3 examples per grade level significantly improves consistency
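The agreement check itself is a few lines once both grade sets are keyed by question id (the helper name is ours):

```python
def agreement_rate(human: dict[str, str], judge: dict[str, str]) -> float:
    """Fraction of shared question ids where human and judge grades match.

    Compare this against the 80% bar: below it, rework the rubric
    before trusting the judge at scale.
    """
    shared = human.keys() & judge.keys()
    if not shared:
        return 0.0
    return sum(human[q] == judge[q] for q in shared) / len(shared)
```

It's also worth looking at *which* grades disagree, not just the rate: systematic human="partially_correct" vs. judge="fully_correct" disagreements point straight at grade inflation.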

Judge consistency. Run the same grading twice and compare. LLM judges have variance even at temperature 0. If the same answer gets different grades across runs, that's a rubric clarity problem, not a model problem. Tighten the rubric criteria or add examples.

Self-evaluation bias. If the judge model is the same model that generated the answers, it may be biased toward grading them favorably. I've found it's worth using a different model for judging than for generation when possible, or at least validating against human grades to calibrate.

Grade inflation. LLM judges tend to be generous. If your auto-graded accuracy is significantly higher than your hand-graded accuracy, add stricter criteria to the rubric. Phrases like "the answer must contain the specific function name, not just describe what it does" help the judge apply the right bar.

RAGAS metrics: retrieval quality as numbers

RAGAS provides three metrics that quantify retrieval quality. We'll implement simplified versions first, then show how to use the RAGAS library:

# harness/graders/ragas_metrics.py
"""Simplified RAGAS-style retrieval metrics.

Implements faithfulness, relevance, and context precision
as concepts. For production use, install the ragas library.
"""
import json
import sys
from openai import OpenAI

client = OpenAI()


def score_faithfulness(
    answer: str,
    evidence: str,
    model: str = "gpt-4o-mini",
) -> float:
    """Score how faithfully the answer uses the provided evidence.

    Returns a score from 0.0 (ignores evidence entirely) to 1.0
    (every claim is supported by evidence).

    Faithfulness catches the case where retrieval worked but the
    model hallucinated anyway.
    """
    prompt = f"""Given an ANSWER and the EVIDENCE it was supposed to be based on,
score how faithfully the answer uses the evidence.

Score 1.0: Every factual claim in the answer is directly supported by the evidence.
Score 0.5: Some claims are supported, but the answer also includes unsupported claims.
Score 0.0: The answer ignores the evidence and makes claims not found there.

ANSWER: {answer}

EVIDENCE: {evidence}

Respond with ONLY a JSON object: {{"score": 0.0_to_1.0, "reasoning": "brief explanation"}}"""

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"},
    )

    try:
        result = json.loads(response.choices[0].message.content)
        return float(result.get("score", 0))
    except (json.JSONDecodeError, ValueError):
        return 0.0


def score_answer_relevance(
    question: str,
    answer: str,
    model: str = "gpt-4o-mini",
) -> float:
    """Score how relevant the answer is to the question asked.

    Returns 0.0 (completely off-topic) to 1.0 (directly addresses
    the question).

    Catches scope confusion — the answer is about something else.
    """
    prompt = f"""Given a QUESTION and an ANSWER, score how relevant the answer
is to the question.

Score 1.0: The answer directly addresses exactly what was asked.
Score 0.5: The answer is related but doesn't fully address the question.
Score 0.0: The answer is about something else entirely.

QUESTION: {question}
ANSWER: {answer}

Respond with ONLY a JSON object: {{"score": 0.0_to_1.0, "reasoning": "brief explanation"}}"""

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"},
    )

    try:
        result = json.loads(response.choices[0].message.content)
        return float(result.get("score", 0))
    except (json.JSONDecodeError, ValueError):
        return 0.0


if __name__ == "__main__":
    # Demo with a sample question
    question = "What does validate_path return?"
    answer = "The validate_path function returns a Path object if the path is valid and within the allowed directory tree. It raises a ValueError for paths outside the allowed root."
    evidence = "def validate_path(p: str) -> Path:\n    resolved = Path(p).resolve()\n    if not resolved.is_relative_to(ALLOWED_ROOT):\n        raise ValueError(f'Path {p} outside allowed root')\n    return resolved"

    print("RAGAS-style metrics demo\n")

    faith = score_faithfulness(answer, evidence)
    print(f"Faithfulness: {faith:.2f}")

    relevance = score_answer_relevance(question, answer)
    print(f"Answer relevance: {relevance:.2f}")

    # For contrast, test with a hallucinated answer
    bad_answer = "The validate_path function returns True if the path exists and False otherwise. It also logs the access to a database."
    bad_faith = score_faithfulness(bad_answer, evidence)
    print(f"\nHallucinated answer faithfulness: {bad_faith:.2f}")
Run the demo:

python harness/graders/ragas_metrics.py

Expected output:

RAGAS-style metrics demo

Faithfulness: 0.90
Answer relevance: 0.95

Hallucinated answer faithfulness: 0.20

The faithfulness score drops sharply for the hallucinated answer because the claim about logging to a database isn't in the evidence. This is exactly the kind of failure that's hard to catch with rule-based checks but easy for a faithfulness eval.

Using the RAGAS library

If you want the full RAGAS metric suite, install the library:

pip install ragas

RAGAS provides pre-built metrics that handle the LLM calls and scoring internally. The concepts are the same as what we implemented manually. RAGAS just packages them with more sophisticated prompting and scoring:

# harness/graders/ragas_eval.py
"""RAGAS library integration for retrieval and answer metrics.

Uses the ragas library for production-grade metric computation.
The manual implementations in ragas_metrics.py teach the concepts;
this file shows the library approach.
"""
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from datasets import Dataset


def run_ragas_eval(questions, answers, contexts, ground_truths):
    """Run RAGAS evaluation on a batch of questions.

    Args:
        questions: list of question strings
        answers: list of answer strings
        contexts: list of lists of context strings
        ground_truths: list of gold answer strings

    Returns a dict of metric scores.
    """
    dataset = Dataset.from_dict({
        "question": questions,
        "answer": answers,
        "contexts": contexts,
        "ground_truth": ground_truths,
    })

    results = evaluate(
        dataset,
        metrics=[faithfulness, answer_relevancy, context_precision],
    )

    return results


if __name__ == "__main__":
    # Demo with sample data
    results = run_ragas_eval(
        questions=["What does validate_path return?"],
        answers=["Returns a Path object if valid, raises ValueError otherwise."],
        contexts=[["def validate_path(p: str) -> Path: ..."]],
        ground_truths=["Returns a Path object for valid paths within the allowed directory tree. Raises ValueError for paths outside the allowed root."],
    )
    print("RAGAS results:")
    print(results)

Use whichever approach fits your needs. The manual implementations are easier to customize and debug. The RAGAS library is more thoroughly tested and handles edge cases. I'd recommend starting with the manual implementations to understand what the metrics measure, then switching to the library if you need production-grade scoring.

Connecting graders to the harness

The graders plug into the harness flow like this:

# Run the harness
python harness/run_harness.py

# Auto-grade answers
python harness/graders/answer_grader.py harness/runs/harness-2026-03-25-*.jsonl

# Check retrieval quality
python harness/graders/retrieval_grader.py harness/runs/harness-2026-03-25-*.jsonl

# Summarize and compare
python harness/summarize_run.py harness/runs/harness-2026-03-25-*-graded.jsonl
python harness/compare_runs.py harness/runs/baseline-graded.jsonl harness/runs/harness-2026-03-25-*-graded.jsonl

The grading step adds grade, failure_label, and grading_notes to each entry in the run log. The summarize and compare scripts work the same regardless of whether grades came from human review or LLM-as-judge, since both produce the same fields in the same schema.
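For a quick look between grading and the full summary, you can tally the grade field directly. A small sketch (the helper name is ours; the fields are the ones the grader writes):

```python
import json
from collections import Counter

def grade_distribution(graded_file: str) -> Counter:
    """Count grades in a *-graded.jsonl run log.

    A fast sanity check before running the summarize and compare
    scripts: entries the grader skipped show up as "ungraded".
    """
    counts = Counter()
    with open(graded_file) as f:
        for line in f:
            if line.strip():
                counts[json.loads(line).get("grade", "ungraded")] += 1
    return counts
```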

Exercises

  1. Add expected_files and expected_symbols to at least 10 of your benchmark questions. Run the retrieval grader and check file recall. Which questions have the worst retrieval?
  2. Run the answer grader on your latest harness run. Compare the auto-grades to at least 5 grades you assigned by hand. What's the agreement rate? Where does the judge disagree with you?
  3. Run the RAGAS-style faithfulness scorer on 5 answers: 3 that you know are well-grounded and 2 that you know contain hallucinations. Does the score distinguish between them?
  4. Modify the grading rubric to add examples for each grade level (one example per grade). Re-run the grader on the same run and check whether consistency improved.
  5. Calculate how much auto-grading costs for a full benchmark pass. Compare the grading cost to the pipeline cost. Is automated grading a significant fraction of the total?
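For exercise 5, a back-of-envelope estimate is enough. The per-token prices and token counts below are placeholder assumptions; substitute your judge model's actual rates and your measured prompt sizes:

```python
# Assumed prices, NOT real rates — replace with your judge model's pricing.
PRICE_IN = 2.50 / 1_000_000    # USD per input token (assumption)
PRICE_OUT = 10.00 / 1_000_000  # USD per output token (assumption)

def grading_cost(n_questions: int, in_tokens: int = 800, out_tokens: int = 80) -> float:
    """Estimated cost of one full auto-grading pass.

    in_tokens covers rubric + question + gold answer + system answer;
    out_tokens covers the judge's small JSON verdict.
    """
    return n_questions * (in_tokens * PRICE_IN + out_tokens * PRICE_OUT)

print(f"${grading_cost(300):.2f}")  # prints: $0.84
```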

Completion checkpoint

You have:

  • A rule-based retrieval grader that checks file recall and precision against expected evidence
  • An LLM-as-judge answer grader with a structured rubric that produces grades compatible with the run-log schema
  • At least one auto-graded run log that you've spot-checked against manual grades (target: 80%+ agreement)
  • An understanding of RAGAS metrics (faithfulness, relevance, context precision) and how they quantify retrieval quality
  • Graders connected to the harness flow so you can run, grade, summarize, and compare with a sequence of commands

Reflection prompts

  • Where did the LLM judge disagree with your manual grades? Was the judge too lenient, too strict, or just interpreting the rubric differently?
  • How much did adding expected_files to your benchmark questions change your understanding of retrieval quality? Were there retrieval failures you hadn't noticed from answer quality alone?
  • If you could only have one eval (retrieval or answer), which would you keep? What would you miss from the one you dropped?

Connecting to the project

We can now automatically evaluate whether retrieval found the right evidence and whether the answer was correct and grounded. These two eval families catch the most common failures in a RAG system: bad retrieval and bad generation.

But there are failure modes these evals can't catch. The system might call unnecessary tools, take a wasteful execution path, or route to the wrong retrieval mode. These are architectural failures where the answer might be fine, but the process was inefficient or fragile. The next lesson adds tool-use evals and trace evals that evaluate the system's behavior, not just its output.

What's next

Tool-Use and Trace Evals. This lesson graded outputs; the next one grades behavior, looking at tool choice and execution path quality.

References

Start here

  • RAGAS documentation — the RAGAS framework for retrieval and answer evaluation metrics, including faithfulness, relevance, and context precision



Glossary

Foundational terms

API (Application Programming Interface)
A structured way for programs to communicate. In this context, usually an HTTP endpoint you call to interact with an LLM.
AST (Abstract Syntax Tree)
A tree representation of source code structure. Used by parsers like Tree-sitter to understand code as a hierarchy of functions, classes, and statements. You'll encounter this more deeply in the Code Retrieval module, but the concept appears briefly in retrieval fundamentals.
BM25 (Best Match 25)
A classical ranking function for keyword search. Scores documents by term frequency and inverse document frequency. Often competitive with or complementary to vector search.
Chunking
Splitting a document into smaller pieces for indexing and retrieval. Chunk boundaries significantly affect retrieval quality. Split at the wrong place and your retrieval will return half a function or the end of one paragraph glued to the start of another.
Context engineering
The discipline of selecting, packaging, and budgeting the information a model sees at inference time. Prompts, retrieved evidence, tool results, memory, and state are all parts of context. Context engineering is arguably the core skill of AI engineering. Bigger context windows are not a substitute for better context selection.
Context rot
Degradation of output quality caused by stale, noisy, or accumulated context. Symptoms include stale memory facts, conflicting retrieved evidence, bloated prompt history, and accumulated instructions that contradict each other. A form of technical debt in AI systems.
Context window
The maximum number of tokens an LLM can process in a single request (input + output combined).
Embedding
A fixed-length numeric vector representing a piece of text. Used for similarity search: texts with similar meanings have nearby embeddings.
Endpoint
A specific URL path that accepts requests and returns responses (e.g., POST /v1/chat/completions).
GGUF
A file format for quantized models used by llama.cpp and Ollama. When you see a model name like qwen2.5:7b-q4_K_M, the suffix indicates the quantization scheme. GGUF supports mixed quantization (different precision for different layers) and is the most common format for local inference.
Hallucination
When a model generates content that sounds confident but isn't supported by the evidence it was given, or fabricates details that don't exist. Not the same as "any wrong answer"; a model that misinterprets ambiguous instructions gave a bad answer but didn't hallucinate. Common causes: weak prompt, missing context, context rot, model limitation, or retrieval failure.
Inference
Running a trained model to generate output from input. What happens when you call an API. Most AI engineering work is inference-time work: building systems around models, not training them. Use "inference," not "inferencing."
JSON (JavaScript Object Notation)
A lightweight text format for structured data. The lingua franca of API communication.
Lexical search
Finding items by matching keywords or terms. Includes BM25, TF-IDF (Term Frequency–Inverse Document Frequency), and simple keyword matching. Returns exact term matches, not semantic similarity.
LLM (Large Language Model)
A neural network trained on large text corpora that generates text by predicting the next token. The core technology behind AI engineering; every tool, pattern, and pipeline in this curriculum runs on top of one.
Metadata
Structured information about a document or chunk (file path, language, author, date, symbol type). Used for filtering retrieval results.
Neural network
A computing system loosely inspired by biological neurons, built from layers of mathematical functions that transform inputs into outputs. LLMs are a specific type of neural network (transformers) trained on text. You don't need to understand neural network internals to do AI engineering, but knowing the term helps when reading external resources.
Reasoning model
A model optimized for complex multi-step planning, math, and logic (e.g., o3, o4-mini). Slower and more expensive but better on hard problems. Sometimes called "LRM" (large reasoning model), but "reasoning model" is the more consistent term across provider docs.
Reranking
A second-pass scoring step that re-orders retrieved results using a more expensive model. Improves precision after an initial broad retrieval.
Schema
A formal description of the shape and types of a data structure. Used to validate inputs and outputs.
SLM (small language model)
A compact model (typically 1-7B parameters) that runs on consumer hardware with lower cost and latency and better privacy (e.g., Phi, small Llama variants, Gemma). The right choice when privacy, offline operation, predictable cost, or low latency matter more than peak capability.
System prompt
A special message that sets the model's behavior, role, and constraints for a conversation.
Temperature
A parameter controlling output randomness. Lower values produce more deterministic output; higher values produce more varied output. Does not affect the model's intelligence.
Token
The basic unit an LLM processes. Not a word. Tokens are sub-word fragments. "unhappiness" might be three tokens: "un", "happi", "ness". Token count determines cost and context window usage.
Top-k
The number of results returned from a retrieval query. "Top-5" means the five highest-scoring results.
Top-p (nucleus sampling)
An alternative to temperature for controlling output diversity. Selects from the smallest set of tokens whose cumulative probability exceeds p.
Vector search
Finding items by proximity in embedding space (nearest neighbors). Returns "similar" results, not "exact match" results.
vLLM (virtual LLM)
An inference serving engine (not a model) that hosts open-weight models behind an OpenAI-compatible HTTP endpoint. Infrastructure layer, not model layer. Relevant when moving from hosted APIs to self-hosting.
Weights
The learned parameters inside a model. Changed during training, fixed during inference.
Workhorse model
A general-purpose LLM optimized for speed and broad capability (e.g., GPT-4o-mini, Claude Haiku, Gemini Flash). The default for most tasks. When someone says "LLM" without qualification, they usually mean this.
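Several of the terms above (embedding, vector search, top-k) compose into one small operation: score indexed chunks by cosine similarity to a query vector and keep the k best. A minimal sketch with hand-made three-dimensional vectors; real embedding models produce vectors with hundreds of dimensions, and the chunk names here are illustrative.

```python
import math

def cosine_similarity(a, b):
    # Similarity of two embedding vectors: ~1.0 = same direction, 0.0 = orthogonal.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, indexed, k=2):
    # Vector search: rank indexed chunks by proximity to the query embedding.
    scored = [(cosine_similarity(query_vec, vec), chunk) for chunk, vec in indexed.items()]
    return sorted(scored, reverse=True)[:k]

# Toy "embeddings" standing in for a real embedding model's output.
index = {
    "def parse_config": [0.9, 0.1, 0.0],
    "def load_model":   [0.2, 0.8, 0.1],
    "README intro":     [0.1, 0.2, 0.9],
}
results = top_k([0.85, 0.15, 0.05], index, k=2)
print(results[0][1])  # nearest chunk: "def parse_config"
```

The "similar, not exact match" behavior of vector search falls out of the geometry: the query never contains the string `parse_config`, but its vector points the same way.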

Benchmark and Harness terms

Baseline
The first measured performance of your system on a benchmark. Everything else is compared against this. Without a baseline, you can't tell whether a change helped.
Benchmark
A fixed set of questions or tasks with known-good answers, used to measure system performance over time.
Run log
A structured record (typically JSONL) of every system run: what input was given, what output was produced, what tools were called, how long it took, and what it cost. The raw data that evals, telemetry, and cost analysis are built from.
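A run log in the sense above is just one JSON object per line, appended per benchmark question. The field names below are hypothetical; the shape (flat, structured, one record per line) is what matters, because graders and cost analysis parse it later.

```python
import json
import time

# One record per benchmark question. Field names are illustrative,
# not a prescribed schema.
record = {
    "question_id": "q-017",
    "model": "gpt-4o-mini",
    "answer": "The retry limit is set in config.py.",
    "latency_ms": 1840,
    "cost_usd": 0.0009,
    "tool_calls": ["search_code", "read_file"],
    "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
}

# JSONL: append one line per run, never rewrite the file.
with open("run_log.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```

Because every line is independently parseable, a grader can stream the file without loading the whole run into memory.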

Agent and Tool Building terms

A2A (Agent-to-Agent protocol)
An open protocol for peer-to-peer agent collaboration. Agents discover each other's capabilities and delegate or negotiate tasks as equals. Different from MCP (which connects agents to tools, not to other agents) and from handoffs (which transfer control within one system).
Agent
A system where an LLM decides which tools to call, observes results, and iterates until a task is complete. Agent = model + tools + control loop.
Control loop
The code that manages the agent's cycle: send prompt, check for tool calls, execute tools, append results, repeat or finish.
Handoff
Passing control from one agent or specialist to another within an orchestrated system.
MCP (Model Context Protocol)
An open protocol for exposing tools, resources, and prompts to AI applications in a standardized way. Connects agents to capabilities (tools and data), not to other agents.
Tool calling / function calling
The model's ability to request execution of a specific function with structured arguments, rather than just generating text.
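The "agent = model + tools + control loop" definition maps onto a short loop. A sketch with a stubbed model so it runs offline; the message shapes loosely follow OpenAI-style tool calling, and every name here is hypothetical.

```python
def control_loop(model, tools, user_msg, max_steps=5):
    # The cycle from the definition: send prompt, check for a tool call,
    # execute the tool, append the result, repeat or finish.
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):
        reply = model(messages)
        if "tool_call" not in reply:
            return reply["content"]        # no tool requested: final answer
        name, args = reply["tool_call"]
        result = tools[name](**args)       # execute the requested tool
        messages.append({"role": "tool", "name": name, "content": str(result)})
    return "max steps exceeded"

# Stub model: requests one tool call, then answers using its result.
def stub_model(messages):
    if messages[-1]["role"] == "tool":
        return {"content": f"Found it: {messages[-1]['content']}"}
    return {"tool_call": ("search_code", {"query": "retry limit"})}

tools = {"search_code": lambda query: f"config.py defines {query}"}
print(control_loop(stub_model, tools, "Where is the retry limit set?"))
# prints "Found it: config.py defines retry limit"
```

The `max_steps` cap is the part beginners forget: without it, a model that keeps requesting tools loops forever.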

Code Retrieval terms

Context compilation / context packing
The process of selecting and assembling the smallest useful set of evidence for a specific task. Not "dump everything retrieved into the prompt."
Grounding
Tying model assertions to specific evidence. A grounded answer cites what it found; an ungrounded answer asserts without evidence.
Hybrid retrieval
Combining multiple retrieval methods (e.g., vector search + keyword search + metadata filters) and merging or reranking the results.
Knowledge graph
A data structure that stores entities and their relationships explicitly (e.g., "function A calls function B," "module X imports module Y"). Useful for traversal and dependency reasoning. One retrieval strategy among several, often overused when simpler metadata or adjacency tables would suffice.
RAG (Retrieval-Augmented Generation)
A pattern where the model's response is grounded in retrieved external evidence rather than relying solely on its training data.
Symbol table
A mapping of code identifiers (functions, classes, variables) to their locations and metadata.
Tree-sitter
An incremental parsing library that builds ASTs for source code. Used in this curriculum for code-aware chunking and symbol extraction.

RAG and Grounded Answers terms

Context pack
A structured bundle of evidence assembled for a specific task, with metadata about provenance, relevance, and token budget.
Evidence bundle
A collection of retrieved items grouped for a specific sub-task, with enough metadata to evaluate whether the evidence is relevant and sufficient.
Retrieval routing
Deciding which retrieval strategy or method to use for a given query. Different questions need different retrieval methods.

Observability and Evals terms

Eval
A structured test that measures system quality. Not the same as training. Evals measure, they don't change the model.
Harness (AI harness / eval harness)
The experiment and evaluation framework around your model or agent. It runs benchmark tasks, captures outputs, logs traces, grades results, and compares system versions. It turns ad hoc "try it and see" into repeatable, comparable experiments. Typically includes: input dataset, prompt and tool configuration, model/provider selection, execution loop, logging, grading, and artifact capture.
LLM-as-judge
Using a language model to evaluate or grade the output of another model or system. Useful for scaling evaluation beyond manual review, but requires rubric quality, judge consistency checks, and human spot-checking. Not a replacement for exact-match checks where they apply.
OpenTelemetry (OTel)
An open standard for collecting and exporting telemetry data (traces, metrics, logs). Vendor-agnostic.
RAGAS
A specific eval framework for retrieval-augmented generation. Measures metrics like faithfulness, relevance, and context precision. One tool example, not a foundational concept. Learn the metrics first, then the tool.
Span
A single operation within a trace (e.g., one tool call, one retrieval query). Traces are made of spans.
Telemetry
Structured data about system behavior: what happened, when, how long it took, what it cost. Includes traces, metrics, and events.
Trace
A structured record of one complete run through the system, including all steps, tool calls, and decisions.
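The LLM-as-judge pattern described above reduces to: send the question, gold answer, and candidate answer to a grading model with a rubric, then parse a structured verdict. A minimal sketch; `call_model` is a stand-in for whatever client you actually use, and the rubric fields are illustrative, not a prescribed schema.

```python
import json

RUBRIC_PROMPT = """You are grading an answer against a gold answer.
Return only JSON: {{"correct": true/false, "grounded": true/false, "reason": "..."}}

Question: {question}
Gold answer: {gold}
Candidate answer: {candidate}
"""

def judge(question, gold, candidate, call_model):
    # call_model: any function taking a prompt string and returning the
    # model's text response (OpenAI, Anthropic, a local server, ...).
    raw = call_model(RUBRIC_PROMPT.format(
        question=question, gold=gold, candidate=candidate))
    verdict = json.loads(raw)
    # Coerce to booleans so downstream aggregation never sees free text.
    return {
        "correct": bool(verdict.get("correct")),
        "grounded": bool(verdict.get("grounded")),
        "reason": verdict.get("reason", ""),
    }

# Stubbed judge for demonstration; a real run would hit an LLM endpoint.
fake_model = lambda p: '{"correct": true, "grounded": false, "reason": "Right fact, no citation."}'
result = judge("Where is the retry limit set?", "config.py", "In config.py", fake_model)
print(result)
```

Note the separation of "correct" from "grounded": an answer can state the right fact without citing evidence, and a judge that collapses those into one score hides exactly the failure mode this lesson cares about.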

Orchestration and Memory terms

Long-term memory
Persistent facts that survive across conversations. Requires write policies to manage what gets stored, updated, or deleted.
Orchestration
Explicit control over how tasks are routed, delegated, and synthesized across multiple agents or specialists.
Router
A component that decides which specialist or workflow path to use for a given query.
Specialist
An agent or workflow tuned for a narrow task (e.g., "code search," "documentation lookup," "test generation"). Specialists are composed by an orchestrator.
Thread memory
Conversation state that persists within a single session or thread.
Workflow memory
Intermediate state that persists within a multi-step task but doesn't survive beyond the workflow's completion.

Optimization terms

Catastrophic forgetting
When fine-tuning causes a model to lose capabilities it had before training. The model gets better at the fine-tuned task but worse at tasks it previously handled. PEFT methods like LoRA reduce this risk by freezing original weights.
Distillation
Training a smaller (student) model to reproduce the behavior of a larger (teacher) model on a specific task.
DPO (Direct Preference Optimization)
A method for preference-based model optimization that's simpler than RLHF, training the model directly on preference pairs without a separate reward model.
Fine-tuning
Updating a model's weights on task-specific data to change its behavior permanently. An umbrella term that includes SFT, instruction tuning, RLHF, DPO, and other techniques. See the fine-tuning landscape table in Lesson 8.3 for how these relate.
Full fine-tuning
Updating all of a model's parameters during training, as opposed to PEFT methods that update only a small subset. Requires significantly more GPU memory and compute. Produces the most thorough adaptation but carries higher risk of catastrophic forgetting.
Inference server
Software (like vLLM or Ollama) that hosts a model and serves inference requests.
Instruction tuning
A specific application of SFT where the training data consists of instruction-response pairs. This is how base models become chat models: the technique is SFT, the data format is instructions. Not a separate technique from SFT.
LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning method that trains small adapter matrices instead of updating all model weights. Dramatically reduces GPU memory and compute requirements.
Overfitting
When a model memorizes training examples instead of learning generalizable patterns. The model performs well on training data but poorly on new inputs. Detected by monitoring validation loss alongside training loss.
Parameter count
The number of learned weights in a model, commonly expressed in billions (e.g., "7B" = 7 billion parameters). Determines memory requirements (roughly 2 bytes per parameter at FP16) and broadly correlates with capability, though training quality and architecture matter as much as size. See Model Selection and Serving for sizing guidance.
PEFT (Parameter-Efficient Fine-Tuning)
A family of methods (including LoRA) that fine-tune a small subset of parameters instead of the full model.
Preference optimization
Training methods (RLHF, DPO) that use human or automated preference signals to improve model behavior. "This output is better than that output" rather than "this is the correct output."
QLoRA (Quantized LoRA)
LoRA applied to a quantized (compressed) base model. Further reduces memory requirements, enabling fine-tuning on consumer hardware.
Quantization
Reducing the precision of model weights (e.g., FP16 → INT4) to shrink memory usage and increase inference speed at some quality cost. A 7B model at FP16 needs ~14 GB VRAM; quantized to 4-bit, it fits in ~4 GB. Common formats include GGUF (llama.cpp/Ollama), GPTQ and AWQ (vLLM/HuggingFace). See Model Selection and Serving for format details and tradeoffs.
RLHF (Reinforcement Learning from Human Feedback)
A training method that uses human preference signals to improve model behavior through a reward model. More complex than DPO (requires training a separate reward model) but offers more control over the optimization objective.
SFT (Supervised Fine-Tuning)
Fine-tuning using input-output pairs where the desired output is known. The most common fine-tuning approach.
TRL (Transformer Reinforcement Learning)
A Hugging Face library for training language models with reinforcement learning, SFT, and other optimization methods.
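The memory arithmetic behind the parameter-count and quantization entries is simple enough to compute directly. A sketch of the rough rule of thumb (bytes per weight × parameter count), which ignores activation memory and KV-cache overhead, so real VRAM needs run somewhat higher.

```python
def model_memory_gb(params_billion, bits_per_weight):
    # bytes per weight = bits / 8; using 1 GB = 1e9 bytes for rough sizing.
    return params_billion * 1e9 * (bits_per_weight / 8) / 1e9

# A 7B model at FP16 vs 4-bit quantization:
print(model_memory_gb(7, 16))  # 14.0 -> the "~2 bytes per parameter" rule
print(model_memory_gb(7, 4))   # 3.5  -> close to the "~4 GB" figure cited above
```

The 4-bit figure lands below the quoted ~4 GB because formats like GGUF store extra scale factors per block and keep some layers at higher precision, which the naive formula doesn't count.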

Cross-cutting terms

Consumer chat app
The browser or desktop product meant for human conversation (ChatGPT, Claude, HuggingChat). Useful for experimentation, but not the same as API access.
Developer platform
The provider's API, billing, API-key, and developer-docs surface. This is what you need for this learning path.
Hosted API
The provider runs the model for you and you call it over HTTP.
Local inference
You run the model on your own machine.
Provider
The company or service that hosts a model API you call from code.
Prompt caching
Reusing computation from repeated prompt prefixes to reduce latency and cost on subsequent requests with the same prefix.
Rate limiting
Constraints on how many API requests you can make per unit of time. An operational concern that affects system design and cost.
Token budget
The maximum number of tokens you allocate for a specific part of the context (e.g., "retrieval evidence gets at most 4K tokens"). A context engineering tool for preventing any single component from dominating the context window.
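A token budget like the one defined above can be enforced mechanically: cap each context component and truncate whatever exceeds its allocation. A sketch using a naive whitespace token count; real code should use the model's own tokenizer (e.g., tiktoken for OpenAI models), and the component names and caps here are illustrative.

```python
# Illustrative per-component caps, not recommended values.
BUDGETS = {"system": 500, "evidence": 4000, "history": 1500}

def count_tokens(text):
    # Naive stand-in: whitespace words, not real tokens.
    return len(text.split())

def enforce_budget(component, text):
    words = text.split()
    cap = BUDGETS[component]
    if len(words) <= cap:
        return text
    # Which end to truncate from is a real design decision (keep the newest
    # history, keep the highest-ranked evidence); here we keep the front.
    return " ".join(words[:cap])

evidence = "chunk " * 5000          # 5000 "tokens" of retrieved evidence
packed = enforce_budget("evidence", evidence)
print(count_tokens(packed))         # prints 4000
```

The point of the hard cap is architectural, not cosmetic: no single component (usually retrieval evidence) can crowd the others out of the context window.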