Module 6: Observability and Evals

Building Your AI Harness

Over the last several modules, you've built the pieces of an experiment system without calling it one. In Module 2, you created benchmark questions and a run-log schema. In Module 5, you built the RAG pipeline that actually answers questions. In the last two lessons, you added telemetry and cost tracking. Each piece works, but they're separate scripts with separate entry points: running an experiment means executing multiple commands, copying file paths between scripts, and manually connecting the results.

This lesson pulls everything together into a single harness. One command to run the benchmark, trace every question, calculate costs, and produce a structured run log ready for grading. The harness isn't a new idea; it's the name for what you've been building all along. After this lesson, every evaluation in the rest of the module will run through this harness, and you'll be able to compare any two versions of your system with a single command.

What you'll learn

  • Understand the AI harness concept: what it is, what it contains, and why it matters
  • Connect the existing artifacts (benchmark questions, run-log schema, telemetry, cost tracker) into a unified experiment runner
  • Run a complete benchmark pass with one command that produces traced, costed, gradable results
  • Compare two runs side by side using the summarize script
  • Set up the harness as the foundation for the eval lessons that follow

Concepts

AI harness — the experiment framework that makes iteration on an AI system repeatable and comparable. A harness has four components:

  1. Benchmark set — the questions and gold answers that define what "correct" means (you built this in Module 2's benchmark-design lesson)
  2. Runner — the code that sends benchmark questions through the system and captures outputs (you built a basic version in run-log-and-baseline)
  3. Telemetry — the traces and cost data that show what happened inside each request (you built this in the telemetry lesson and cost lesson)
  4. Graders — the evaluation logic that assigns grades and failure labels to each output (you've been doing this manually; we'll automate it in the next two lessons)

The harness is what turns "try it and see" into "run the experiment and compare." Without it, every change to the system requires manual testing and subjective judgment. With it, you can make a change, run the harness, and see exactly what improved, what regressed, and what stayed the same.

Experiment run — one complete pass of the benchmark through the system. Each run produces a JSONL log, a set of traces, and cost data. Runs are identified by a unique run ID and tagged with the repo SHA, model version, and harness version so you can reproduce them.
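
Concretely, the identifying fields on every run-log entry look something like this (values illustrative; the full entry appears in the runner below):

{
  "run_id": "harness-2026-03-25-142233",
  "repo_sha": "a1b2c3d",
  "model": "gpt-4o-mini",
  "harness_version": "v0.3"
}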

Run comparison — a side-by-side analysis of two experiment runs. The simplest comparison is overall accuracy, but the more useful comparison is at the category and failure-label level. If Run B has higher accuracy than Run A, the comparison tells you where it improved (which categories?) and how (which failure modes disappeared?).

Walkthrough

What you've already built

Let's take inventory. Here's what exists in your project from previous lessons:

anchor-repo/
├── benchmark-questions.jsonl          # Module 2: 30+ questions with gold answers
├── harness/
│   ├── schema.py                      # Module 2: run-log schema definition
│   ├── run_baseline.py                # Module 2: basic benchmark runner
│   ├── grade_baseline.py              # Module 2: interactive grading tool
│   ├── summarize_run.py               # Module 2: run summary metrics
│   └── runs/                          # Module 2: saved run logs
│       ├── baseline-2026-03-24-*.jsonl
│       └── traced-2026-03-25-*.jsonl  # Module 6 L1: traced run
├── observability/
│   ├── traced_pipeline.py             # Module 6 L1: instrumented RAG pipeline
│   ├── traced_benchmark.py            # Module 6 L1: traced benchmark runner
│   ├── cost_tracker.py                # Module 6 L2: cost estimation
│   ├── cache_metrics.py               # Module 6 L2: cache analysis
│   ├── model_router.py                # Module 6 L2: model routing
│   ├── token_budget.py                # Module 6 L2: budget enforcement
│   ├── rate_limit_telemetry.py        # Module 6 L2: rate-limit tracking
│   └── success_cost.py                # Module 6 L2: cost-per-success metric
└── rag/
    ├── retrieval_service.py           # Module 5: routed retrieval
    ├── rag_with_routing.py            # Module 5: full RAG pipeline
    ├── grounded_answer.py             # Module 5: grounded answer generation
    └── evidence_bundle.py             # Module 5: evidence bundle format

The harness we're building now will unify the runner, telemetry, and cost tracking into a single entry point. The graders (which we'll build in the next two lessons) will plug into this same harness.

The unified harness runner

This script replaces harness/run_baseline.py and observability/traced_benchmark.py with a single runner that does everything:

# harness/run_harness.py
"""Unified AI harness runner.

One command to:
1. Load benchmark questions
2. Run each through the traced, costed RAG pipeline
3. Save a structured JSONL run log
4. Print a summary

Usage:
    python harness/run_harness.py [--run-id RUN_ID] [--limit N] [--model MODEL]
"""
import argparse
import json
import os
import sys
import time
from datetime import datetime, timezone

sys.path.insert(0, ".")

from langfuse import Langfuse
from observability.traced_pipeline import traced_rag_pipeline
from observability.cost_tracker import estimate_cost
from retrieval.hybrid_retrieve import hybrid_retrieve

# --- Argument parsing ---
parser = argparse.ArgumentParser(description="Run the AI harness benchmark")
parser.add_argument("--run-id", default=None, help="Custom run ID")
parser.add_argument("--limit", type=int, default=None, help="Limit to N questions")
parser.add_argument(
    "--benchmark", default="benchmark-questions.jsonl",
    help="Path to benchmark questions file",
)
args = parser.parse_args()

# --- Configuration ---
RUN_ID = args.run_id or f"harness-{datetime.now(timezone.utc).strftime('%Y-%m-%d-%H%M%S')}"
MODEL = "gpt-4o-mini"
PROVIDER = "openai"
BENCHMARK_FILE = args.benchmark
OUTPUT_FILE = f"harness/runs/{RUN_ID}.jsonl"
REPO_SHA = os.popen("git rev-parse --short HEAD").read().strip()

langfuse = Langfuse()

# --- Load benchmark questions ---
questions = []
with open(BENCHMARK_FILE) as f:
    for line in f:
        if line.strip():
            questions.append(json.loads(line))

if args.limit:
    questions = questions[:args.limit]

print(f"AI Harness Run")
print(f"  Run ID:    {RUN_ID}")
print(f"  Provider:  {PROVIDER}")
print(f"  Model:     {MODEL}")
print(f"  Repo SHA:  {REPO_SHA}")
print(f"  Questions: {len(questions)}")
print()

# --- Run each question ---
os.makedirs("harness/runs", exist_ok=True)
results = []
total_cost = 0.0
run_start = time.perf_counter()

for i, q in enumerate(questions):
    q_start = time.perf_counter()
    print(f"[{i+1}/{len(questions)}] {q['category']}: {q['question'][:55]}...")

    try:
        answer = traced_rag_pipeline(
            question=q["question"],
            hybrid_retrieve_fn=hybrid_retrieve,
            model=MODEL,
            run_id=RUN_ID,
        )

        q_duration = time.perf_counter() - q_start

        # Estimate cost using rough token counts. Langfuse will have
        # the exact usage once the traces are flushed.
        cost = estimate_cost(
            model=MODEL,
            input_tokens=answer.context_tokens or 3500,
            output_tokens=400,
        )
        total_cost += cost["total_cost"]

        entry = {
            "run_id": RUN_ID,
            "question_id": q["id"],
            "question": q["question"],
            "category": q["category"],
            "answer": answer.answer,
            "abstained": answer.abstained,
            "abstention_reason": answer.abstention_reason if answer.abstained else None,
            "model": MODEL,
            "provider": PROVIDER,
            "evidence_files": list(set(
                c.get("file_path", "") for c in answer.citations
            )) if answer.citations else [],
            "tools_called": [],  # Populated if using the agent loop
            "retrieval_method": getattr(answer, "route", "routed"),
            "citation_count": len(answer.citations) if answer.citations else 0,
            "context_tokens": answer.context_tokens or 0,
            "estimated_cost": cost["total_cost"],
            "duration_seconds": round(q_duration, 2),
            "grade": None,
            "failure_label": None,
            "grading_notes": "",
            "repo_sha": REPO_SHA,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "harness_version": "v0.3",
        }
        results.append(entry)

    except Exception as e:
        print(f"  ERROR: {e}")
        entry = {
            "run_id": RUN_ID,
            "question_id": q["id"],
            "question": q["question"],
            "category": q["category"],
            "answer": f"ERROR: {e}",
            "abstained": True,
            "abstention_reason": f"Pipeline error: {e}",
            "model": MODEL,
            "provider": PROVIDER,
            "evidence_files": [],
            "tools_called": [],
            "retrieval_method": "error",
            "citation_count": 0,
            "context_tokens": 0,
            "estimated_cost": 0,
            "duration_seconds": round(time.perf_counter() - q_start, 2),
            "grade": "wrong",
            "failure_label": "pipeline_error",
            "grading_notes": str(e),
            "repo_sha": REPO_SHA,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "harness_version": "v0.3",
        }
        results.append(entry)

total_duration = time.perf_counter() - run_start

# --- Save results ---
with open(OUTPUT_FILE, "w") as f:
    for entry in results:
        f.write(json.dumps(entry) + "\n")

langfuse.flush()

# --- Print summary ---
print(f"\n{'='*50}")
print(f"Run complete: {RUN_ID}")
print(f"  Duration:  {total_duration:.1f}s")
print(f"  Questions: {len(results)}")
print(f"  Abstained: {sum(1 for r in results if r['abstained'])}")
print(f"  Est. cost: ${total_cost:.4f}")
print(f"  Saved to:  {OUTPUT_FILE}")
print(f"  Traces:    Langfuse (filter by run_id: {RUN_ID})")
print(f"\nNext steps:")
print(f"  Grade:     python harness/grade_baseline.py {OUTPUT_FILE}")
print(f"  Summarize: python harness/summarize_run.py {OUTPUT_FILE}")

Run it with a small limit first:

python harness/run_harness.py --limit 5

Expected output looks like this (the provider and model lines will match your configuration):

AI Harness Run
  Run ID:    harness-2026-03-25-142233
  Provider:  openai
  Model:     gpt-4o-mini
  Repo SHA:  a1b2c3d
  Questions: 5

[1/5] symbol_lookup: What does validate_path return?...
[2/5] architecture: What is the architecture of the retrieval system?...
[3/5] debugging: Why does the auth middleware reject valid tokens?...
[4/5] onboarding: How do I set up the development environment?...
[5/5] change_impact: What changed in the caching module last week?...

==================================================
Run complete: harness-2026-03-25-142233
  Duration:  12.3s
  Questions: 5
  Abstained: 1
  Est. cost: $0.0035
  Saved to:  harness/runs/harness-2026-03-25-142233.jsonl
  Traces:    Langfuse (filter by run_id: harness-2026-03-25-142233)

Next steps:
  Grade:     python harness/grade_baseline.py harness/runs/harness-2026-03-25-142233.jsonl
  Summarize: python harness/summarize_run.py harness/runs/harness-2026-03-25-142233.jsonl

That's the whole flow: one command, and you get a traced, costed, gradable run log. The --limit flag is useful during development so you can run 5 questions to verify things work, then do a full pass when you're ready.
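
In practice that means two commands, one cheap and one complete (the run ID below is just an example):

# Smoke test: five questions, autogenerated run ID
python harness/run_harness.py --limit 5

# Full pass, with a run ID you'll recognize when comparing later
python harness/run_harness.py --run-id before-reranker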

Comparing two runs

The summarize script from Module 2 works on any run log. To compare two runs, you'll run it on both and look at the differences:

# harness/compare_runs.py
"""Compare two graded run logs side by side.

Shows where accuracy improved, regressed, or stayed the same.
"""
import json
import sys
from collections import Counter


def load_run(path: str) -> list[dict]:
    entries = []
    with open(path) as f:
        for line in f:
            if line.strip():
                entries.append(json.loads(line))
    return entries


def summarize(entries: list[dict]) -> dict:
    graded = [e for e in entries if e.get("grade") is not None]
    if not graded:
        return {"total": 0, "grades": {}, "failures": {}, "by_category": {}}

    grades = Counter(e["grade"] for e in graded)
    failures = Counter(
        e["failure_label"] for e in graded
        if e.get("failure_label") and e["failure_label"] != "none"
    )

    by_cat = {}
    for cat in sorted(set(e["category"] for e in graded)):
        cat_entries = [e for e in graded if e["category"] == cat]
        correct = sum(1 for e in cat_entries if e["grade"] == "fully_correct")
        by_cat[cat] = {"correct": correct, "total": len(cat_entries)}

    return {
        "total": len(graded),
        "grades": dict(grades),
        "failures": dict(failures),
        "by_category": by_cat,
    }


if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Usage: python harness/compare_runs.py <run-a.jsonl> <run-b.jsonl>")
        sys.exit(1)

    run_a = load_run(sys.argv[1])
    run_b = load_run(sys.argv[2])

    summary_a = summarize(run_a)
    summary_b = summarize(run_b)

    name_a = run_a[0]["run_id"] if run_a else "Run A"
    name_b = run_b[0]["run_id"] if run_b else "Run B"

    print(f"Comparing: {name_a} vs {name_b}\n")

    # Overall accuracy
    for grade in ["fully_correct", "partially_correct", "unsupported", "wrong"]:
        a_count = summary_a["grades"].get(grade, 0)
        b_count = summary_b["grades"].get(grade, 0)
        a_pct = a_count / summary_a["total"] * 100 if summary_a["total"] else 0
        b_pct = b_count / summary_b["total"] * 100 if summary_b["total"] else 0
        delta = b_pct - a_pct
        arrow = "+" if delta > 0 else ""
        print(f"  {grade:20s}: {a_pct:5.1f}% -> {b_pct:5.1f}% ({arrow}{delta:.1f}%)")

    # Per-category comparison
    all_cats = sorted(
        set(list(summary_a["by_category"].keys()) + list(summary_b["by_category"].keys()))
    )
    print(f"\nPer-category accuracy:")
    for cat in all_cats:
        a = summary_a["by_category"].get(cat, {"correct": 0, "total": 0})
        b = summary_b["by_category"].get(cat, {"correct": 0, "total": 0})
        print(f"  {cat:20s}: {a['correct']}/{a['total']} -> {b['correct']}/{b['total']}")

    # Failure label changes
    all_labels = sorted(
        set(list(summary_a["failures"].keys()) + list(summary_b["failures"].keys()))
    )
    if all_labels:
        print(f"\nFailure label changes:")
        for label in all_labels:
            a_count = summary_a["failures"].get(label, 0)
            b_count = summary_b["failures"].get(label, 0)
            delta = b_count - a_count
            arrow = "+" if delta > 0 else ""
            print(f"  {label:20s}: {a_count} -> {b_count} ({arrow}{delta})")

To compare your Module 2 baseline against a graded harness run, point the script at both files:

python harness/compare_runs.py \
  harness/runs/baseline-2026-03-24-graded.jsonl \
  harness/runs/harness-2026-03-25-graded.jsonl
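
Output looks something like this (numbers and failure labels illustrative):

Comparing: baseline-2026-03-24 vs harness-2026-03-25

  fully_correct       :  40.0% ->  63.3% (+23.3%)
  partially_correct   :  23.3% ->  20.0% (-3.3%)
  unsupported         :  16.7% ->   6.7% (-10.0%)
  wrong               :  20.0% ->  10.0% (-10.0%)

Per-category accuracy:
  architecture        : 2/6 -> 4/6
  debugging           : 2/5 -> 3/5
  symbol_lookup       : 5/8 -> 7/8

Failure label changes:
  missing_evidence    : 5 -> 2 (-3)
  wrong_file          : 3 -> 1 (-2)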

The harness as an experiment framework

Here's the mental model for how the harness works going forward:

Harness experiment flow: benchmark-questions.jsonl -> run_harness.py -> run log (JSONL) + Langfuse traces -> graders -> graded run log -> summarize / compare

Every experiment follows the same flow: run the harness, grade the results, summarize, and compare. When you change the pipeline (better retrieval, different model, new prompt), you run the harness again and compare the new run to the previous one. The graders (which we'll build in the next two lessons) plug into this same flow.

This is what the harness concept means in practice: a repeatable experiment framework that separates the system under test from the evaluation logic. You can change the pipeline without changing the harness, and you can change the graders without re-running the pipeline. That separation makes iteration fast and comparisons trustworthy.
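
To make that separation concrete, here's a sketch of a re-grading pass. grade_fn is a placeholder for the automated graders you'll build in the next two lessons:

# harness/regrade.py -- sketch; grade_fn is a placeholder
import json

def regrade(in_path: str, out_path: str, grade_fn) -> None:
    """Apply a grader to an existing run log without re-running the pipeline."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if not line.strip():
                continue
            entry = json.loads(line)
            # The grader sees only the recorded answer and evidence,
            # so no new pipeline or API calls are needed.
            entry["grade"], entry["failure_label"] = grade_fn(entry)
            dst.write(json.dumps(entry) + "\n")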

Harness versioning

Notice the harness_version field in the run log. We've been incrementing it:

  • v0.1 — Module 2: manual prompting, no retrieval, hand-graded
  • v0.2 — Module 6 L1: traced pipeline with retrieval routing
  • v0.3 — Module 6 L3 (this lesson): unified runner with cost tracking

When you compare two runs, the harness version tells you whether the runs were produced by the same infrastructure. If they were, differences in results are due to pipeline changes. If they weren't, you'll need to account for harness improvements too.
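
If you want compare_runs.py to catch this for you, a small guard works (a sketch; the field names match the run log above):

# In compare_runs.py, after loading both runs
versions_a = {e.get("harness_version") for e in run_a}
versions_b = {e.get("harness_version") for e in run_b}
if versions_a != versions_b:
    print(f"WARNING: harness versions differ ({versions_a} vs {versions_b}) -- "
          f"deltas may reflect harness changes, not pipeline changes.\n")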

Exercises

  1. Run python harness/run_harness.py against your full benchmark set. Verify that the output file contains all expected fields from the schema, including the new estimated_cost and duration_seconds fields.
  2. Grade the run using harness/grade_baseline.py and then compare it to your Module 2 baseline using harness/compare_runs.py. Which categories improved the most from adding retrieval?
  3. Run the harness twice with different models (e.g., --model gpt-4o-mini and --model gpt-4o). Compare costs and accuracy. Is the more expensive model worth it?
  4. Run the harness with --limit 5 and inspect the traces in Langfuse. Verify that each trace has the correct run ID and that you can filter by run ID to see all five traces together.
  5. Add a --dry-run flag to run_harness.py that prints what would happen without making any API calls. This is useful for verifying configuration before spending money.

Completion checkpoint

You have:

  • A unified harness runner (harness/run_harness.py) that produces traced, costed, gradable run logs with one command
  • A run comparison script (harness/compare_runs.py) that shows accuracy changes at the category and failure-label level
  • At least one complete harness run against your full benchmark
  • A comparison between your Module 2 baseline and your current system showing where retrieval improved results
  • An understanding of the four-component harness structure (benchmark, runner, telemetry, graders) and how the next two lessons will complete it

Reflection prompts

  • How does having a one-command harness change the way you think about making changes to the pipeline? Would you be more or less likely to experiment?
  • Looking at your run comparison, are the improvements from retrieval distributed evenly across categories, or concentrated in specific ones? What does that tell you about where to focus next?
  • The harness currently uses rough token estimates for cost. What would you need to capture exact costs, and is the precision worth the implementation effort?

Connecting to the project

The harness is now a single-command experiment framework. You can run the benchmark, trace every question, estimate costs, grade results, and compare runs. What's missing is automated grading. Right now, you still grade by hand, which is slow and doesn't scale.

The next two lessons add automated graders for the four eval families: retrieval evals, answer evals, tool-use evals, and trace evals. These graders will plug directly into the harness, so by the end of the module you'll be able to run a fully automated experiment pass: harness run, auto-grade, summarize, and compare without any manual intervention.
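
Once those graders exist, a full pass could look like this (auto_grade.py is a placeholder name; the next lessons define the real scripts):

python harness/run_harness.py --run-id experiment-b
python harness/auto_grade.py harness/runs/experiment-b.jsonl
python harness/compare_runs.py \
  harness/runs/experiment-a-graded.jsonl \
  harness/runs/experiment-b-graded.jsonl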

What's next

Retrieval and Answer Evals. The harness makes runs repeatable; the next lesson turns grading into something repeatable too, starting with evidence quality and answer quality.

References

Build with this

  • JSONL specification — the format underlying all run logs; simple, appendable, and diffable
  • Langfuse: Sessions — how to group traces by session (run ID) for viewing all traces from a single harness run
