Module 6: Observability and Evals

Tool-Use and Trace Evals

The previous lesson evaluated outputs: did retrieval find the right evidence, and was the answer correct? This lesson evaluates behavior: did the system call the right tools, and did the overall execution path make sense?

These eval families catch a different class of failure. The answer might be correct, but the system got there by calling three unnecessary tools, retrieving from the wrong index, and burning 5x the normal token budget. Or the answer might be wrong, and the retrieval eval says "evidence was fine," which means the failure happened somewhere in the execution path between retrieval and generation. Tool-use evals and trace evals find these behavioral failures that output evals can't see.

What you'll learn

  • Build tool-use evals that check whether the right tools were called with the right arguments
  • Apply a trace labeling taxonomy to classify execution paths as optimal, wasteful, or broken
  • Label traces with specific failure modes: wrong route, missed retrieval, unnecessary tool call, and others
  • Build CI-friendly eval patterns that run automatically and flag regressions
  • Connect all four eval families into the harness for a complete evaluation pass

Concepts

Tool-use eval — an evaluation of whether the system called the right tools with the right arguments and avoided calling tools it didn't need. Tool-use evals sit between retrieval evals and trace evals in the evaluation hierarchy. A retrieval eval asks "did we find the right evidence?" A tool-use eval asks "did we call the right functions to find that evidence?" The distinction matters because a correct answer can come from an incorrect tool path — the system might have called search_code when read_file would have been cheaper, or called both when only one was needed.

Trace eval — an evaluation of the full execution path for a single request, from routing decision through tool calls through generation. Trace evals are the most holistic eval family because they evaluate the system's strategy, not just its output or individual steps. A trace eval might flag a request where the system used hybrid retrieval (expensive) for a question that should have been code-only (cheap), even though the final answer was correct.

Trace labeling taxonomy — a structured set of labels for classifying trace-level failures. Where the answer grading rubric has four grades (fully correct, partially correct, unsupported, wrong), the trace taxonomy has labels for how the system behaved:

  • wrong_route — the router picked the wrong retrieval mode (e.g., docs mode for a code question)
  • missed_retrieval — the system should have retrieved but didn't, or retrieved from the wrong source
  • bad_citation — the answer claims to cite evidence but the citation is incorrect or fabricated
  • unnecessary_tool_call — the system called a tool that wasn't needed for this question
  • correct_but_wasteful — the answer is right, but the execution path used more resources than necessary
  • correct_and_optimal — the answer is right and the path was efficient

This taxonomy turns "the system feels slow" into "23% of traces are correct_but_wasteful, and the waste comes from unnecessary hybrid retrieval." That's actionable.
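A statement like that is just a count over labels. A minimal sketch of the arithmetic (the label list here is illustrative, not from a real run):

```python
from collections import Counter

# Hypothetical primary labels pulled from a labeled run
labels = [
    "correct_and_optimal", "correct_but_wasteful", "correct_but_wasteful",
    "wrong_route", "correct_and_optimal",
]

counts = Counter(labels)
total = len(labels)

# Report each label's share of the run, most common first
for label, count in counts.most_common():
    print(f"{label}: {count}/{total} ({count / total:.0%})")
```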

CI-friendly eval — an evaluation that can run in a continuous integration pipeline and produce a pass/fail result. CI-friendly evals are fast (minutes, not hours), deterministic enough to avoid flaky failures, and produce structured output that CI systems can parse. They're the eval equivalent of unit tests: not comprehensive, but fast enough to run on every commit.

Problem-to-Tool Map

| Problem class | Symptom | Cheapest thing to try first | Tool or approach |
|---|---|---|---|
| Unnecessary tool calls | System calls tools for questions it could answer directly | Review traces manually | Tool-call count eval with expected tool lists |
| Wrong tool arguments | Tools are called correctly but with suboptimal arguments | Read tool call logs | Argument validation checks |
| Wasteful execution paths | Correct answers that cost 5x the average | Sort by cost and inspect | Trace labeling with the wasteful taxonomy |
| Routing regressions | Pipeline change broke routing for a class of questions | Re-run benchmark | CI-friendly routing accuracy check |
| Invisible path failures | The answer is fine but the execution path was fragile | No trace-level evals | Trace labeling pass on benchmark runs |

Walkthrough

Tool-use evals

Tool-use evals check three things:

  1. Were the right tools called? For a code lookup question, we expect search_code or read_file, not search_docs.
  2. Were the arguments correct? If search_code was called, was the query reasonable?
  3. Were unnecessary tools avoided? For a general knowledge question (skip mode), no tools should be called at all.
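Checks 1 and 3 reduce to set arithmetic over expected versus actual tool lists. A toy example of the precision and recall numbers the grader reports (tool names taken from the examples in this lesson):

```python
expected = {"search_code"}
actual = {"search_code", "search_docs"}  # one unnecessary call

correct = expected & actual
precision = len(correct) / len(actual)    # share of calls that were needed
recall = len(correct) / len(expected)     # share of needed calls that happened

print(precision, recall)  # 0.5 1.0
```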

To make this work, we'll add expected tool information to benchmark questions:

{"id": "q001", "question": "What does validate_path return?", "expected_tools": ["search_code"], "expected_route": "code"}
{"id": "q003", "question": "What is a Python list?", "expected_tools": [], "expected_route": "skip"}
{"id": "q004", "question": "What calls read_file and why?", "expected_tools": ["search_code", "search_docs"], "expected_route": "hybrid"}
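Before running the grader, it's worth sanity-checking that the new fields are internally consistent. A small sketch, assuming the JSONL format shown above (the consistency rules here are suggestions, not part of the grader):

```python
import json

# Lines mirroring the benchmark format above
lines = [
    '{"id": "q001", "question": "What does validate_path return?", "expected_tools": ["search_code"], "expected_route": "code"}',
    '{"id": "q003", "question": "What is a Python list?", "expected_tools": [], "expected_route": "skip"}',
]

problems = []
for line in lines:
    q = json.loads(line)
    if "expected_tools" not in q:
        problems.append((q["id"], "missing expected_tools"))
    # Skip-mode questions should not expect any tool calls
    if q.get("expected_route") == "skip" and q.get("expected_tools"):
        problems.append((q["id"], "skip route but tools expected"))

print(problems)  # [] if the file is consistent
```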

Then build the grader:

# harness/graders/tool_grader.py
"""Tool-use evaluation.

Checks whether the system called the expected tools and avoided
unnecessary ones. Works with the tools_called field in run logs.
"""
import json
import sys


def grade_tool_use(entry: dict, benchmark_question: dict) -> dict:
    """Grade tool use for a single entry.

    Returns metrics for tool precision, recall, and waste.
    """
    expected_tools = set(benchmark_question.get("expected_tools", []))
    expected_route = benchmark_question.get("expected_route")
    actual_tools = set(entry.get("tools_called", []))
    actual_route = entry.get("retrieval_method", "unknown")

    # Tool-level metrics
    if expected_tools or actual_tools:
        correct_tools = expected_tools & actual_tools
        unnecessary = actual_tools - expected_tools
        missing = expected_tools - actual_tools

        tool_precision = (
            len(correct_tools) / len(actual_tools)
            if actual_tools else (1.0 if not expected_tools else 0.0)
        )
        tool_recall = (
            len(correct_tools) / len(expected_tools)
            if expected_tools else (1.0 if not actual_tools else 0.0)
        )
    else:
        # No tools expected, no tools called — correct
        tool_precision = 1.0
        tool_recall = 1.0
        unnecessary = set()
        missing = set()

    # Route accuracy
    route_correct = None
    if expected_route:
        route_correct = expected_route == actual_route

    return {
        "tool_precision": round(tool_precision, 3),
        "tool_recall": round(tool_recall, 3),
        "unnecessary_tools": list(unnecessary),
        "missing_tools": list(missing),
        "route_correct": route_correct,
        "expected_route": expected_route,
        "actual_route": actual_route,
    }


def grade_run_tools(
    run_file: str,
    benchmark_file: str = "benchmark-questions.jsonl",
) -> list[dict]:
    """Grade tool use for an entire run."""
    benchmark = {}
    with open(benchmark_file) as f:
        for line in f:
            if line.strip():
                q = json.loads(line)
                benchmark[q["id"]] = q

    entries = []
    with open(run_file) as f:
        for line in f:
            if line.strip():
                entries.append(json.loads(line))

    results = []
    for entry in entries:
        q_id = entry["question_id"]
        if q_id in benchmark:
            bq = benchmark[q_id]
            if bq.get("expected_tools") is not None or bq.get("expected_route"):
                grade = grade_tool_use(entry, bq)
                grade["question_id"] = q_id
                grade["question"] = entry["question"][:60]
                results.append(grade)

    return results


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python harness/graders/tool_grader.py <run-file.jsonl>")
        sys.exit(1)

    results = grade_run_tools(sys.argv[1])

    if not results:
        print("No questions with expected_tools or expected_route found.")
        print("Add these fields to your benchmark questions to enable tool-use evals.")
        sys.exit(0)

    # Summary
    precisions = [r["tool_precision"] for r in results]
    recalls = [r["tool_recall"] for r in results]
    route_checks = [r for r in results if r["route_correct"] is not None]

    print(f"Tool-use eval: {len(results)} questions\n")
    print(f"  Avg tool precision: {sum(precisions)/len(precisions):.1%}")
    print(f"  Avg tool recall:    {sum(recalls)/len(recalls):.1%}")

    if route_checks:
        route_accuracy = sum(1 for r in route_checks if r["route_correct"]) / len(route_checks)
        print(f"  Route accuracy:     {route_accuracy:.1%}")

    # Show unnecessary tool calls
    wasteful = [r for r in results if r["unnecessary_tools"]]
    if wasteful:
        print(f"\nUnnecessary tool calls ({len(wasteful)} questions):")
        for r in wasteful:
            print(f"  {r['question']}")
            for t in r["unnecessary_tools"]:
                print(f"    UNNECESSARY: {t}")

    # Show missing tools
    missing = [r for r in results if r["missing_tools"]]
    if missing:
        print(f"\nMissing tool calls ({len(missing)} questions):")
        for r in missing:
            print(f"  {r['question']}")
            for t in r["missing_tools"]:
                print(f"    MISSING: {t}")

python harness/graders/tool_grader.py harness/runs/harness-2026-03-25-142233.jsonl

Trace labeling

Trace labeling applies the taxonomy to each trace in a run. Where the tool grader checks one dimension in isolation, the trace labeler considers the full execution path: routing, retrieval, tool calls, and generation together:

# harness/graders/trace_labeler.py
"""Trace-level evaluation using the trace labeling taxonomy.

Labels each trace as:
- correct_and_optimal: right answer, efficient path
- correct_but_wasteful: right answer, inefficient path
- wrong_route: routing error that affected the outcome
- missed_retrieval: retrieval should have happened but didn't
- bad_citation: answer cites evidence that doesn't exist or is wrong
- unnecessary_tool_call: tools called that weren't needed
"""
import json
import sys


# Trace labeling taxonomy
TRACE_LABELS = [
    "correct_and_optimal",
    "correct_but_wasteful",
    "wrong_route",
    "missed_retrieval",
    "bad_citation",
    "unnecessary_tool_call",
]


def label_trace(entry: dict, benchmark_question: dict) -> dict:
    """Apply trace labels to a single entry.

    Uses the grade, tool use, and route information to classify
    the overall execution path.
    """
    grade = entry.get("grade", "unknown")
    failure_label = entry.get("failure_label")
    expected_route = benchmark_question.get("expected_route")
    actual_route = entry.get("retrieval_method", "unknown")
    expected_tools = set(benchmark_question.get("expected_tools", []))
    actual_tools = set(entry.get("tools_called", []))
    duration = entry.get("duration_seconds", 0)
    context_tokens = entry.get("context_tokens", 0)

    labels = []
    reasoning = []

    # Check routing
    if expected_route and expected_route != actual_route:
        labels.append("wrong_route")
        reasoning.append(
            f"Expected route '{expected_route}', got '{actual_route}'"
        )

    # Check for unnecessary tools
    unnecessary = actual_tools - expected_tools
    if unnecessary:
        labels.append("unnecessary_tool_call")
        reasoning.append(f"Unnecessary tools: {unnecessary}")

    # Check for missed retrieval
    if failure_label in ("missing_evidence", "retrieval_miss"):
        labels.append("missed_retrieval")
        reasoning.append(f"Failure label indicates retrieval problem: {failure_label}")

    # Check for citation issues
    if failure_label == "hallucination" and context_tokens > 0:
        labels.append("bad_citation")
        reasoning.append("Evidence was retrieved but answer contained hallucinations")

    # Determine overall path quality
    if grade == "fully_correct":
        # Check if the path was wasteful
        # Heuristic: if context tokens are >2x the median, it's wasteful
        # (in practice, you'd compute the median from the full run)
        if context_tokens > 6000 or duration > 5.0:
            labels.append("correct_but_wasteful")
            reasoning.append(
                f"Correct but used {context_tokens} context tokens / {duration}s"
            )
        elif not labels:  # no other issues found
            labels.append("correct_and_optimal")
            reasoning.append("Correct answer via efficient path")

    # Fallback: guarantee every trace gets a label. Partially correct
    # answers with no other findings default to correct_but_wasteful;
    # outright failures default to wrong_route. Crude, so hand-review a
    # sample and tighten these defaults as needed.
    if not labels:
        labels.append("correct_but_wasteful" if grade in ("fully_correct", "partially_correct") else "wrong_route")

    return {
        "trace_labels": labels,
        "primary_label": labels[0],
        "reasoning": "; ".join(reasoning),
    }


def label_run_traces(
    run_file: str,
    benchmark_file: str = "benchmark-questions.jsonl",
) -> str:
    """Apply trace labels to an entire graded run.

    Writes a new file with trace labels added to each entry.
    """
    benchmark = {}
    with open(benchmark_file) as f:
        for line in f:
            if line.strip():
                q = json.loads(line)
                benchmark[q["id"]] = q

    entries = []
    with open(run_file) as f:
        for line in f:
            if line.strip():
                entries.append(json.loads(line))

    label_counts = {}
    for entry in entries:
        q_id = entry["question_id"]
        bq = benchmark.get(q_id, {})

        result = label_trace(entry, bq)
        entry["trace_labels"] = result["trace_labels"]
        entry["primary_trace_label"] = result["primary_label"]
        entry["trace_reasoning"] = result["reasoning"]

        for label in result["trace_labels"]:
            label_counts[label] = label_counts.get(label, 0) + 1

    # Save labeled version
    output = run_file.replace(".jsonl", "-traced.jsonl")
    with open(output, "w") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")

    # Print summary
    total = len(entries)
    print(f"Trace labeling: {total} entries\n")
    print("Label distribution:")
    for label in TRACE_LABELS:
        count = label_counts.get(label, 0)
        pct = count / total * 100 if total > 0 else 0
        bar = "#" * int(pct / 2)
        print(f"  {label:30s}: {count:3d} ({pct:5.1f}%) {bar}")

    print(f"\nLabeled results saved to {output}")
    return output


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python harness/graders/trace_labeler.py <graded-run-file.jsonl>")
        print("Note: run this on a GRADED file (after answer_grader.py)")
        sys.exit(1)

    label_run_traces(sys.argv[1])

python harness/graders/trace_labeler.py harness/runs/harness-2026-03-25-142233-graded.jsonl

Expected output:

Trace labeling: 30 entries

Label distribution:
  correct_and_optimal           :  12 ( 40.0%) ####################
  correct_but_wasteful          :   6 ( 20.0%) ##########
  wrong_route                   :   4 ( 13.3%) ######
  missed_retrieval              :   5 ( 16.7%) ########
  bad_citation                  :   2 (  6.7%) ###
  unnecessary_tool_call         :   1 (  3.3%) #

Labeled results saved to harness/runs/harness-2026-03-25-142233-graded-traced.jsonl

This distribution is what turns vague complaints into engineering priorities. If 20% of traces are correct_but_wasteful, you know cost optimization will help. If 13% have wrong_route, routing accuracy is the bottleneck. The labels tell you where to focus.
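Once a run is labeled, the traced file becomes queryable. A small helper for pulling out the traces behind any one label (field names match the labeler's output; the sample lines are illustrative):

```python
import json

def traces_with_label(lines, label):
    """Filter traced-run JSONL lines down to entries carrying a given label."""
    hits = []
    for line in lines:
        if line.strip():
            entry = json.loads(line)
            if label in entry.get("trace_labels", []):
                hits.append(entry)
    return hits

# Against a real file: traces_with_label(open("...-traced.jsonl"), "wrong_route")
sample = [
    '{"question_id": "q001", "trace_labels": ["wrong_route"]}',
    '{"question_id": "q002", "trace_labels": ["correct_and_optimal"]}',
]
print([e["question_id"] for e in traces_with_label(sample, "wrong_route")])  # ['q001']
```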

CI-friendly eval patterns

For CI pipelines, you need evals that run fast and produce pass/fail results. Here's a pattern that runs a small eval suite and fails if quality drops below a threshold:

# harness/ci_eval.py
"""CI-friendly eval runner.

Runs a small benchmark subset, auto-grades, and exits with
a non-zero code if quality is below threshold.

Usage:
    python harness/ci_eval.py [--threshold 0.6]
"""
import argparse
import json
import os
import sys
import time

sys.path.insert(0, ".")

# CI eval uses a smaller benchmark subset for speed
CI_BENCHMARK = "benchmark-questions-ci.jsonl"
THRESHOLD_DEFAULT = 0.6  # 60% fully_correct minimum


def run_ci_eval(threshold: float = THRESHOLD_DEFAULT) -> bool:
    """Run CI eval and return True if quality meets threshold."""
    from datetime import datetime, timezone
    from observability.traced_pipeline import traced_rag_pipeline, langfuse
    from retrieval.hybrid_retrieve import hybrid_retrieve
    from harness.graders.answer_grader import grade_answer

    # Load CI benchmark (smaller subset)
    benchmark_file = CI_BENCHMARK
    if not os.path.exists(benchmark_file):
        benchmark_file = "benchmark-questions.jsonl"
        print("CI benchmark not found, using full benchmark (slower)")

    questions = []
    benchmark = {}
    with open(benchmark_file) as f:
        for line in f:
            if line.strip():
                q = json.loads(line)
                questions.append(q)
                benchmark[q["id"]] = q

    # Limit to 10 questions for CI speed
    questions = questions[:10]

    run_id = f"ci-{datetime.now(timezone.utc).strftime('%Y%m%d-%H%M%S')}"
    print(f"CI eval: {len(questions)} questions, threshold: {threshold:.0%}\n")

    # Run and grade
    results = []
    start = time.perf_counter()

    for i, q in enumerate(questions):
        print(f"  [{i+1}/{len(questions)}] {q['question'][:50]}...")

        answer = traced_rag_pipeline(
            question=q["question"],
            hybrid_retrieve_fn=hybrid_retrieve,
            run_id=run_id,
        )

        # Auto-grade if gold answer exists
        gold = q.get("gold_answer", "")
        if gold:
            grade_result = grade_answer(
                question=q["question"],
                gold_answer=gold,
                system_answer=answer.answer,
            )
            results.append(grade_result)

    langfuse.flush()
    duration = time.perf_counter() - start

    # Calculate pass/fail
    if not results:
        print("\nNo gradable results. Add gold_answer to benchmark questions.")
        return False

    correct = sum(1 for r in results if r["grade"] == "fully_correct")
    accuracy = correct / len(results)

    print(f"\n{'='*40}")
    print(f"CI eval complete in {duration:.1f}s")
    print(f"  Accuracy: {accuracy:.0%} ({correct}/{len(results)})")
    print(f"  Threshold: {threshold:.0%}")

    if accuracy >= threshold:
        print("  Result: PASS")
        return True
    else:
        print("  Result: FAIL")
        # Show failures for debugging
        for r in results:
            if r["grade"] != "fully_correct":
                print(f"    {r['grade']}: {r.get('reasoning', '')[:80]}")
        return False


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="CI eval runner")
    parser.add_argument(
        "--threshold", type=float, default=THRESHOLD_DEFAULT,
        help=f"Minimum fully_correct rate (default: {THRESHOLD_DEFAULT})",
    )
    args = parser.parse_args()

    passed = run_ci_eval(args.threshold)
    sys.exit(0 if passed else 1)

# Run CI eval with default threshold
python harness/ci_eval.py

# Run with a custom threshold
python harness/ci_eval.py --threshold 0.5

To integrate with your CI system, add this to your pipeline configuration. In GitHub Actions:

# .github/workflows/eval.yml
name: AI Eval
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: python harness/ci_eval.py --threshold 0.6
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          LANGFUSE_PUBLIC_KEY: ${{ secrets.LANGFUSE_PUBLIC_KEY }}
          LANGFUSE_SECRET_KEY: ${{ secrets.LANGFUSE_SECRET_KEY }}

A few practical notes on CI evals:

  • Keep the CI benchmark small (10-15 questions) for speed. Use the full benchmark for milestone evaluations.
  • Set the threshold conservatively at first. A threshold that's too high will cause flaky failures from LLM judge variance.
  • Cache the baseline grade so you can detect regressions. If accuracy drops more than 10% from the baseline, that's worth investigating even if it's still above the absolute threshold.
  • Log the full results, not just pass/fail. When CI fails, you'll need the per-question details to debug.
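The baseline-caching idea from the third bullet can be sketched in a few lines; `ci_baseline.json` and the 10-point drop window are assumptions to adjust for your pipeline:

```python
import json
import os

BASELINE_FILE = "ci_baseline.json"   # hypothetical cache location
MAX_DROP = 0.10                      # flag drops of more than 10 points

def check_against_baseline(accuracy: float) -> bool:
    """Return True if accuracy hasn't regressed past the allowed drop.

    Stores the best accuracy seen so far so future runs compare against it.
    """
    baseline = None
    if os.path.exists(BASELINE_FILE):
        with open(BASELINE_FILE) as f:
            baseline = json.load(f).get("accuracy")

    ok = baseline is None or accuracy >= baseline - MAX_DROP

    # Ratchet the baseline upward when a run improves on it
    if baseline is None or accuracy > baseline:
        with open(BASELINE_FILE, "w") as f:
            json.dump({"accuracy": accuracy}, f)
    return ok
```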

The complete eval flow

Here's the full evaluation pipeline, from harness run to CI check:

# Full evaluation pass (milestone)
python harness/run_harness.py                                    # Run benchmark
python harness/graders/answer_grader.py harness/runs/latest.jsonl    # Grade answers
python harness/graders/retrieval_grader.py harness/runs/latest.jsonl # Check retrieval
python harness/graders/tool_grader.py harness/runs/latest.jsonl      # Check tool use
python harness/graders/trace_labeler.py harness/runs/latest-graded.jsonl  # Label traces
python harness/summarize_run.py harness/runs/latest-graded-traced.jsonl   # Summarize

# Quick CI check (every PR)
python harness/ci_eval.py --threshold 0.6

The milestone pass gives you a complete picture: answer quality, retrieval quality, tool use efficiency, and trace-level behavior. The CI check gives you a fast regression gate. Together, they're the evaluation layer of the harness we've been building across this module.
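The milestone commands above can also be wrapped in one script so a failed step halts the pass early; a sketch, with paths taken from the commands above (`latest.jsonl` remains a placeholder, as before):

```python
import subprocess
import sys

# Milestone eval steps, in dependency order (paths from the commands above)
STEPS = [
    ["python", "harness/run_harness.py"],
    ["python", "harness/graders/answer_grader.py", "harness/runs/latest.jsonl"],
    ["python", "harness/graders/retrieval_grader.py", "harness/runs/latest.jsonl"],
    ["python", "harness/graders/tool_grader.py", "harness/runs/latest.jsonl"],
    ["python", "harness/graders/trace_labeler.py", "harness/runs/latest-graded.jsonl"],
    ["python", "harness/summarize_run.py", "harness/runs/latest-graded-traced.jsonl"],
]

def run_steps(steps, runner=subprocess.run):
    """Run each step in order; return the first failing step, or None."""
    for step in steps:
        if runner(step).returncode != 0:
            return step
    return None

if __name__ == "__main__":
    failed = run_steps(STEPS)
    if failed:
        print(f"Eval pass stopped at: {' '.join(failed)}")
    sys.exit(1 if failed else 0)
```

The `runner` parameter exists so the sequencing can be tested without actually invoking the harness.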

Exercises

  1. Add expected_tools and expected_route to at least 10 benchmark questions. Run the tool grader and identify the most common tool-use issues.
  2. Run the trace labeler on your latest graded run. What percentage of traces are correct_and_optimal? What's the most common non-optimal label?
  3. Create a CI benchmark file (benchmark-questions-ci.jsonl) with 10 representative questions (at least 2 from each category). Run harness/ci_eval.py and verify it completes in under 2 minutes.
  4. Make a deliberate change to your pipeline (e.g., reduce the token budget, change the routing thresholds) and run the CI eval. Does it catch the regression?
  5. Review 5 trace labels by hand. Does the automatic labeling match your assessment? If not, adjust the thresholds in trace_labeler.py.

Completion checkpoint

You have:

  • A tool-use grader that checks tool precision, recall, and route accuracy against expected values
  • A trace labeler that classifies execution paths using the six-label taxonomy
  • A CI-friendly eval runner that produces pass/fail results in under 2 minutes
  • All four eval families (retrieval, answer, tool-use, trace) connected to the harness
  • At least one run that's been fully evaluated: answer-graded, retrieval-checked, tool-graded, and trace-labeled

Reflection prompts

  • Looking at your trace label distribution, what's the biggest source of waste in your pipeline? What would you change to reduce it?
  • How much confidence do you have in the automated grades vs. your manual grades? What would increase your confidence?
  • The CI eval uses a small benchmark subset. What risks does that create? How would you choose which questions to include in the CI set?

Connecting to the project

We can now measure the single-agent system across all four eval families. We know how well retrieval works, whether answers are correct and grounded, whether tools are used efficiently, and whether execution paths are optimal. The harness produces these measurements automatically and can run in CI to catch regressions.

This is the foundation that makes everything from here forward accountable. Only now are we ready to add multi-agent coordination, because when we split work across specialists, we'll need to measure whether the coordination actually helps or just adds complexity. Without the eval framework we built in this module, adding orchestration would be flying blind. Module 7 will add the orchestration layer, and every architectural decision will be measured against the baseline we've established here.

What's next

Orchestration. Only now do you have the measurement foundation to tell whether adding more agents helps or just adds motion; the next lesson uses that foundation to decide when a split is justified.



Glossary
API (Application Programming Interface)Foundational terms
A structured way for programs to communicate. In this context, usually an HTTP endpoint you call to interact with an LLM.
AST (Abstract Syntax Tree)Foundational terms
A tree representation of source code structure. Used by parsers like Tree-sitter to understand code as a hierarchy of functions, classes, and statements. You'll encounter this more deeply in the Code Retrieval module, but the concept appears briefly in retrieval fundamentals.
BM25 (Best Match 25)Foundational terms
A classical ranking function for keyword search. Scores documents by term frequency and inverse document frequency. Often competitive with or complementary to vector search.
ChunkingFoundational terms
Splitting a document into smaller pieces for indexing and retrieval. Chunk boundaries significantly affect retrieval quality. Split at the wrong place and your retrieval will return half a function or the end of one paragraph glued to the start of another.
Context engineeringFoundational terms
The discipline of selecting, packaging, and budgeting the information a model sees at inference time. Prompts, retrieved evidence, tool results, memory, and state are all parts of context. Context engineering is arguably the core skill of AI engineering. Bigger context windows are not a substitute for better context selection.
Context rotFoundational terms
Degradation of output quality caused by stale, noisy, or accumulated context. Symptoms include stale memory facts, conflicting retrieved evidence, bloated prompt history, and accumulated instructions that contradict each other. A form of technical debt in AI systems.
Context windowFoundational terms
The maximum number of tokens an LLM can process in a single request (input + output combined).
EmbeddingFoundational terms
A fixed-length numeric vector representing a piece of text. Used for similarity search: texts with similar meanings have nearby embeddings.
EndpointFoundational terms
A specific URL path that accepts requests and returns responses (e.g., POST /v1/chat/completions).
GGUFFoundational terms
A file format for quantized models used by llama.cpp and Ollama. When you see a model name like qwen2.5:7b-q4_K_M, the suffix indicates the quantization scheme. GGUF supports mixed quantization (different precision for different layers) and is the most common format for local inference.
HallucinationFoundational terms
When a model generates content that sounds confident but isn't supported by the evidence it was given, or fabricates details that don't exist. Not the same as "any wrong answer"; a model that misinterprets ambiguous instructions gave a bad answer but didn't hallucinate. Common causes: weak prompt, missing context, context rot, model limitation, or retrieval failure.
InferenceFoundational terms
Running a trained model to generate output from input. What happens when you call an API. Most AI engineering work is inference-time work: building systems around models, not training them. Use "inference," not "inferencing."
JSON (JavaScript Object Notation)Foundational terms
A lightweight text format for structured data. The lingua franca of API communication.
Lexical searchFoundational terms
Finding items by matching keywords or terms. Includes BM25, TF-IDF (Term Frequency–Inverse Document Frequency), and simple keyword matching. Returns exact term matches, not semantic similarity.
LLM (Large Language Model)Foundational terms
A neural network trained on large text corpora that generates text by predicting the next token. The core technology behind AI engineering; every tool, pattern, and pipeline in this curriculum runs on top of one.
MetadataFoundational terms
Structured information about a document or chunk (file path, language, author, date, symbol type). Used for filtering retrieval results.
Neural networkFoundational terms
A computing system loosely inspired by biological neurons, built from layers of mathematical functions that transform inputs into outputs. LLMs are a specific type of neural network (transformers) trained on text. You don't need to understand neural network internals to do AI engineering, but knowing the term helps when reading external resources.
Reasoning modelFoundational terms
A model optimized for complex multi-step planning, math, and logic (e.g., o3, o4-mini). Slower and more expensive but better on hard problems. Sometimes called "LRM" (large reasoning model), but "reasoning model" is the more consistent term across provider docs.
RerankingFoundational terms
A second-pass scoring step that re-orders retrieved results using a more expensive model. Improves precision after an initial broad retrieval.
SchemaFoundational terms
A formal description of the shape and types of a data structure. Used to validate inputs and outputs.
SLM (small language model)Foundational terms
A compact model (typically 1-7B parameters) that runs on consumer hardware with lower cost, latency, and better privacy (e.g., Phi, small Llama variants, Gemma). The right choice when privacy, offline operation, predictable cost, or low latency matter more than peak capability.
System promptFoundational terms
A special message that sets the model's behavior, role, and constraints for a conversation.
TemperatureFoundational terms
A parameter controlling output randomness. Lower values produce more deterministic output; higher values produce more varied output. Does not affect the model's intelligence.
TokenFoundational terms
The basic unit an LLM processes. Not a word. Tokens are sub-word fragments. "unhappiness" might be three tokens: "un", "happi", "ness". Token count determines cost and context window usage.
Top-kFoundational terms
The number of results returned from a retrieval query. "Top-5" means the five highest-scoring results.
Top-p (nucleus sampling)Foundational terms
An alternative to temperature for controlling output diversity. Selects from the smallest set of tokens whose cumulative probability exceeds p.
Vector searchFoundational terms
Finding items by proximity in embedding space (nearest neighbors). Returns "similar" results, not "exact match" results.
vLLM (virtual LLM)Foundational terms
An inference serving engine (not a model) that hosts open-weight models behind an OpenAI-compatible HTTP endpoint. Infrastructure layer, not model layer. Relevant when moving from hosted APIs to self-hosting.
WeightsFoundational terms
The learned parameters inside a model. Changed during training, fixed during inference.
Workhorse modelFoundational terms
A general-purpose LLM optimized for speed and broad capability (e.g., GPT-4o-mini, Claude Haiku, Gemini Flash). The default for most tasks. When someone says "LLM" without qualification, they usually mean this.
BaselineBenchmark and Harness terms
The first measured performance of your system on a benchmark. Everything else is compared against this. Without a baseline, you can't tell whether a change helped.
BenchmarkBenchmark and Harness terms
A fixed set of questions or tasks with known-good answers, used to measure system performance over time.
Run logBenchmark and Harness terms
A structured record (typically JSONL) of every system run: what input was given, what output was produced, what tools were called, how long it took, and what it cost. The raw data that evals, telemetry, and cost analysis are built from.
Agent and Tool Building terms

A2A (Agent-to-Agent protocol)
An open protocol for peer-to-peer agent collaboration. Agents discover each other's capabilities and delegate or negotiate tasks as equals. Different from MCP (which connects agents to tools, not to other agents) and from handoffs (which transfer control within one system).
Agent
A system where an LLM decides which tools to call, observes results, and iterates until a task is complete. Agent = model + tools + control loop.
Control loop
The code that manages the agent's cycle: send prompt, check for tool calls, execute tools, append results, repeat or finish.
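The cycle above can be sketched in a few lines. `call_model` is a hypothetical stand-in for a real provider client, and `tools` maps tool names to plain Python callables — both are assumptions, not a real API:

```python
# Minimal agent control loop: send prompt, check for tool calls,
# execute tools, append results, repeat or finish.
def run_agent(call_model, tools, messages, max_steps=10):
    for _ in range(max_steps):
        reply = call_model(messages)               # send prompt
        if not reply.get("tool_calls"):            # no tool calls -> final answer
            return reply["content"]
        for call in reply["tool_calls"]:           # execute each requested tool
            result = tools[call["name"]](**call["arguments"])
            messages.append({"role": "tool", "name": call["name"],
                             "content": str(result)})  # append result, repeat
    raise RuntimeError("agent did not finish within max_steps")

def fake_model(msgs):
    # Pretend model: requests one tool call, then answers from its result.
    if not any(m.get("role") == "tool" for m in msgs):
        return {"tool_calls": [{"name": "add", "arguments": {"a": 1, "b": 2}}]}
    return {"content": msgs[-1]["content"], "tool_calls": None}

answer = run_agent(fake_model, {"add": lambda a, b: a + b},
                   [{"role": "user", "content": "1+2?"}])
```

The `max_steps` cap is the part beginners forget: without it, a confused model can loop forever.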
Handoff
Passing control from one agent or specialist to another within an orchestrated system.
MCP (Model Context Protocol)
An open protocol for exposing tools, resources, and prompts to AI applications in a standardized way. Connects agents to capabilities (tools and data), not to other agents.
Tool calling / function calling
The model's ability to request execution of a specific function with structured arguments, rather than just generating text.
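A sketch of the two halves of the pattern: a tool definition you send to the model, and the structured call the model emits back. The shapes below follow the common JSON-Schema-based convention used by major providers, but the exact field names vary by API — treat this as illustrative:

```python
import json

# What you declare to the model: name, description, and a JSON Schema
# describing the arguments (shape is illustrative of the common pattern).
tool_def = {
    "name": "read_file",
    "description": "Read a file from the repository",
    "parameters": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
}

# What a tool-calling model emits instead of prose: a function name plus
# structured arguments for YOUR code to execute.
tool_call = {"name": "read_file",
             "arguments": json.dumps({"path": "src/http/client.py"})}
args = json.loads(tool_call["arguments"])  # parse before executing
```

The model never executes anything itself; it only requests. Your control loop parses the arguments, runs the function, and feeds the result back.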
Code Retrieval terms

Context compilation / context packing
The process of selecting and assembling the smallest useful set of evidence for a specific task. Not "dump everything retrieved into the prompt."
Grounding
Tying model assertions to specific evidence. A grounded answer cites what it found; an ungrounded answer asserts without evidence.
Hybrid retrieval
Combining multiple retrieval methods (e.g., vector search + keyword search + metadata filters) and merging or reranking the results.
Knowledge graph
A data structure that stores entities and their relationships explicitly (e.g., "function A calls function B," "module X imports module Y"). Useful for traversal and dependency reasoning. One retrieval strategy among several, often overused when simpler metadata or adjacency tables would suffice.
RAG (Retrieval-Augmented Generation)
A pattern where the model's response is grounded in retrieved external evidence rather than relying solely on its training data.
Symbol table
A mapping of code identifiers (functions, classes, variables) to their locations and metadata.
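At its simplest, that mapping is just a dictionary. The identifiers, paths, and line numbers below are invented for illustration; in practice the table is built by a parser such as Tree-sitter:

```python
# Minimal symbol table: identifier -> location and metadata.
symbol_table = {
    "HttpClient.request": {"file": "src/http/client.py", "line": 42,
                           "kind": "method"},
    "retry_with_backoff": {"file": "src/http/retry.py", "line": 10,
                           "kind": "function"},
}

def lookup(name):
    # Exact-match lookup: the complement to fuzzy vector search.
    return symbol_table.get(name)

loc = lookup("retry_with_backoff")
location = f'{loc["file"]}:{loc["line"]}'
```

This is why symbol lookup is cheap and exact where vector search is approximate: a known identifier needs a dictionary hit, not a similarity ranking.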
Tree-sitter
An incremental parsing library that builds ASTs for source code. Used in this curriculum for code-aware chunking and symbol extraction.
RAG and Grounded Answers terms

Context pack
A structured bundle of evidence assembled for a specific task, with metadata about provenance, relevance, and token budget.
Evidence bundle
A collection of retrieved items grouped for a specific sub-task, with enough metadata to evaluate whether the evidence is relevant and sufficient.
Retrieval routing
Deciding which retrieval strategy or method to use for a given query. Different questions need different retrieval methods.
Observability and Evals terms

Eval
A structured test that measures system quality. Not the same as training. Evals measure, they don't change the model.
Harness (AI harness / eval harness)
The experiment and evaluation framework around your model or agent. It runs benchmark tasks, captures outputs, logs traces, grades results, and compares system versions. It turns ad hoc "try it and see" into repeatable, comparable experiments. Typically includes: input dataset, prompt and tool configuration, model/provider selection, execution loop, logging, grading, and artifact capture.
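A stripped-down sketch of that loop. `system` and `grade` are hypothetical callables you supply (the system under test and a grading function); the benchmark format and log path are likewise illustrative:

```python
import json

# Minimal harness: run each benchmark case, grade it, log a comparable record.
def run_harness(system, grade, benchmark, log_path="eval_run.jsonl"):
    records = []
    with open(log_path, "w") as log:
        for case in benchmark:
            output = system(case["input"])            # execute system under test
            score = grade(output, case["expected"])   # grade vs known-good answer
            record = {"id": case["id"], "output": output, "score": score}
            log.write(json.dumps(record) + "\n")      # artifact for comparison
            records.append(record)
    return sum(r["score"] for r in records) / len(records)  # aggregate score
```

Because every run produces the same record shape, two system versions can be compared row by row — that is what turns "try it and see" into an experiment.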
LLM-as-judge
Using a language model to evaluate or grade the output of another model or system. Useful for scaling evaluation beyond manual review, but requires rubric quality, judge consistency checks, and human spot-checking. Not a replacement for exact-match checks where they apply.
OpenTelemetry (OTel)
An open standard for collecting and exporting telemetry data (traces, metrics, logs). Vendor-agnostic.
RAGAS
A specific eval framework for retrieval-augmented generation. Measures metrics like faithfulness, relevance, and context precision. One tool example, not a foundational concept. Learn the metrics first, then the tool.
Span
A single operation within a trace (e.g., one tool call, one retrieval query). Traces are made of spans.
Telemetry
Structured data about system behavior: what happened, when, how long it took, what it cost. Includes traces, metrics, and events.
Trace
A structured record of one complete run through the system, including all steps, tool calls, and decisions.
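The trace/span relationship as a data shape — one trace, a list of spans. The field names and timings are invented for illustration; a real system would emit this through an OpenTelemetry SDK rather than build dicts by hand:

```python
# One trace = one request; each span = one operation inside it.
trace = {
    "trace_id": "req-001",
    "spans": [
        {"name": "route", "duration_ms": 12},
        {"name": "tool:search_code", "duration_ms": 240},
        {"name": "generate", "duration_ms": 1100},
    ],
}

# Trace evals walk this structure: was the route right? were the tool
# calls necessary? where did the time go?
total_ms = sum(s["duration_ms"] for s in trace["spans"])
print(total_ms)  # 1352
```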
Orchestration and Memory terms

Long-term memory
Persistent facts that survive across conversations. Requires write policies to manage what gets stored, updated, or deleted.
Orchestration
Explicit control over how tasks are routed, delegated, and synthesized across multiple agents or specialists.
Router
A component that decides which specialist or workflow path to use for a given query.
Specialist
An agent or workflow tuned for a narrow task (e.g., "code search," "documentation lookup," "test generation"). Specialists are composed by an orchestrator.
Thread memory
Conversation state that persists within a single session or thread.
Workflow memory
Intermediate state that persists within a multi-step task but doesn't survive beyond the workflow's completion.
Optimization terms

Catastrophic forgetting
When fine-tuning causes a model to lose capabilities it had before training. The model gets better at the fine-tuned task but worse at tasks it previously handled. PEFT methods like LoRA reduce this risk by freezing original weights.
Distillation
Training a smaller (student) model to reproduce the behavior of a larger (teacher) model on a specific task.
DPO (Direct Preference Optimization)
A method for preference-based model optimization that's simpler than RLHF, training the model directly on preference pairs without a separate reward model.
Fine-tuning
Updating a model's weights on task-specific data to change its behavior permanently. An umbrella term that includes SFT, instruction tuning, RLHF, DPO, and other techniques. See the fine-tuning landscape table in Lesson 8.3 for how these relate.
Full fine-tuning
Updating all of a model's parameters during training, as opposed to PEFT methods that update only a small subset. Requires significantly more GPU memory and compute. Produces the most thorough adaptation but carries higher risk of catastrophic forgetting.
Inference server
Software (like vLLM or Ollama) that hosts a model and serves inference requests.
Instruction tuning
A specific application of SFT where the training data consists of instruction-response pairs. This is how base models become chat models: the technique is SFT, the data format is instructions. Not a separate technique from SFT.
LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning method that trains small adapter matrices instead of updating all model weights. Dramatically reduces GPU memory and compute requirements.
Parameter count
The number of learned weights in a model, commonly expressed in billions (e.g., "7B" = 7 billion parameters). Determines memory requirements (roughly 2 bytes per parameter at FP16) and broadly correlates with capability, though training quality and architecture matter as much as size. See Model Selection and Serving for sizing guidance.
PEFT (Parameter-Efficient Fine-Tuning)
A family of methods (including LoRA) that fine-tune a small subset of parameters instead of the full model.
Preference optimization
Training methods (RLHF, DPO) that use human or automated preference signals to improve model behavior. "This output is better than that output" rather than "this is the correct output."
QLoRA (Quantized LoRA)
LoRA applied to a quantized (compressed) base model. Further reduces memory requirements, enabling fine-tuning on consumer hardware.
Quantization
Reducing the precision of model weights (e.g., FP16 → INT4) to shrink memory usage and increase inference speed at some quality cost. A 7B model at FP16 needs ~14 GB VRAM; quantized to 4-bit, it fits in ~4 GB. Common formats include GGUF (llama.cpp/Ollama), GPTQ and AWQ (vLLM/HuggingFace). See Model Selection and Serving for format details and tradeoffs.
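The memory figures in that definition come straight from the arithmetic: parameters times bytes per parameter. A quick sanity-check function (decimal GB, weights only — a running server needs extra memory for activations and KV cache, which is why the text rounds 3.5 GB up to ~4 GB):

```python
def model_memory_gb(params_billion, bits_per_weight):
    # Weight memory = parameter count x bytes per parameter.
    bytes_total = params_billion * 1e9 * (bits_per_weight / 8)
    return bytes_total / 1e9  # decimal gigabytes

fp16 = model_memory_gb(7, 16)   # 7B at FP16 (2 bytes/param) -> 14.0 GB
int4 = model_memory_gb(7, 4)    # 7B at 4-bit (0.5 bytes/param) -> 3.5 GB
```

The same formula scales linearly: a 70B model at FP16 needs roughly 140 GB of weight memory, which is why quantization is usually the first lever for self-hosting.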
Overfitting
When a model memorizes training examples instead of learning generalizable patterns. The model performs well on training data but poorly on new inputs. Detected by monitoring validation loss alongside training loss.
RLHF (Reinforcement Learning from Human Feedback)
A training method that uses human preference signals to improve model behavior through a reward model. More complex than DPO (requires training a separate reward model) but offers more control over the optimization objective.
SFT (Supervised Fine-Tuning)
Fine-tuning using input-output pairs where the desired output is known. The most common fine-tuning approach.
TRL (Transformer Reinforcement Learning)
A Hugging Face library for training language models with reinforcement learning, SFT, and other optimization methods.
Cross-cutting terms

Consumer chat app
The browser or desktop product meant for human conversation (ChatGPT, Claude, HuggingChat). Useful for experimentation, but not the same as API access.
Developer platform
The provider's API, billing, API-key, and developer-docs surface. This is what you need for this learning path.
Hosted API
The provider runs the model for you and you call it over HTTP.
Local inference
You run the model on your own machine.
Provider
The company or service that hosts a model API you call from code.
Prompt caching
Reusing computation from repeated prompt prefixes to reduce latency and cost on subsequent requests with the same prefix.
Rate limiting
Constraints on how many API requests you can make per unit of time. An operational concern that affects system design and cost.
Token budget
The maximum number of tokens you allocate for a specific part of the context (e.g., "retrieval evidence gets at most 4K tokens"). A context engineering tool for preventing any single component from dominating the context window.
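A sketch of budget enforcement. The budget numbers and component names are made up, and the truncation strategy is deliberately naive — real systems rerank or summarize rather than cut from the tail:

```python
# Illustrative per-component token budgets (numbers are invented).
budgets = {"system_prompt": 1000, "retrieval_evidence": 4000, "history": 2000}

def trim_to_budget(tokens, component):
    # Keep a component within its allocated share of the context window.
    limit = budgets[component]
    return tokens[:limit]  # naive truncation; rerank/summarize in practice

evidence = list(range(6000))  # pretend these are 6,000 retrieved tokens
kept = trim_to_budget(evidence, "retrieval_evidence")
```

The payoff is predictability: no matter how much the retriever returns, evidence can never crowd out the system prompt or conversation history.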