Module 6: Observability and Evals Telemetry

Telemetry: Making Your System Visible

Your RAG pipeline should now be working. Questions come in, the router picks a retrieval strategy, evidence gets compiled, and a grounded answer comes back with citations. But right now, you can only see the inputs and outputs. What happened in between (which route was chosen, how long retrieval took, how many tokens the evidence consumed, whether the model used the evidence or ignored it) is invisible unless you go digging through print statements.

In this lesson we'll add structured telemetry to the pipeline you built in Module 5. By the end, you'll be able to trace any request from start to finish and see exactly what the system did, how long each step took, and what it cost. That visibility will become the foundation for every evaluation and optimization we build in the rest of this module.

What you'll learn

  • Distinguish between three layers of AI telemetry: agent traces, application observability, and product outcome events
  • Instrument five key events in the RAG pipeline: run start, route chosen, tool call, retrieval return, and response completion
  • Set up Langfuse as a trace backend and understand OpenTelemetry as the portable concept underneath
  • Trace a full benchmark run end to end and read the resulting trace
  • Identify where time and tokens are spent within a single request

Concepts

Agent trace: a structured record of everything that happened during one request to your AI system. A trace captures the sequence of steps (route decision, retrieval, generation), their timing, their inputs and outputs, and any metadata you attach. Traces are hierarchical: a top-level trace contains spans, and spans can contain child spans. Think of a trace as a timeline for a single request, with enough detail to reconstruct what happened and why.

Span: one step within a trace. In our pipeline, each span represents a discrete operation: classifying the question, retrieving evidence, calling the model, or checking grounding. Spans have a start time, an end time, inputs, outputs, and metadata. The nesting of spans within a trace shows the causal structure of the request: which steps triggered which other steps.

The three layers of telemetry: AI systems need telemetry at three distinct levels, and confusing them leads to blind spots:

  1. Agent traces: what the AI system did for a single request. Route chosen, tools called, evidence retrieved, model response. This is what we'll instrument in this lesson.
  2. Application observability: how the service is performing across many requests. Latency percentiles, error rates, throughput, resource utilization. This is traditional software observability applied to AI services.
  3. Product outcome events: whether the system is actually helping users. Task completion, answer acceptance, escalation to a human, user feedback. These events connect technical performance to business value.

Many teams start with layer 1 (traces), bolt on layer 2 (application metrics), and only add layer 3 (outcomes) when they realize they can't tell whether "fast and cheap" actually means "useful." We'll build all three across this lesson and the next.

Generation: in Langfuse's model, a generation is a special type of span that represents an LLM call. It captures the model name, prompt, completion, token counts, and cost. Separating generations from other spans makes it easy to aggregate model usage across traces.
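The trace/span/generation hierarchy can be sketched as a tiny data model. These are hypothetical dataclasses for illustration, not Langfuse's actual SDK classes, but every trace backend stores something shaped like this:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """One step in a trace: timed, with inputs, outputs, and metadata."""
    name: str
    start_ms: float
    end_ms: float
    input: dict = field(default_factory=dict)
    output: dict = field(default_factory=dict)
    metadata: dict = field(default_factory=dict)
    children: list["Span"] = field(default_factory=list)

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms

@dataclass
class Generation(Span):
    """A generation is just a span with model-specific fields attached."""
    model: str = ""
    input_tokens: int = 0
    output_tokens: int = 0

@dataclass
class Trace:
    """A timeline for one request, containing spans."""
    name: str
    spans: list[Span] = field(default_factory=list)

# One request: routing, then retrieval, then a model call (times are made up).
trace = Trace(name="rag-pipeline", spans=[
    Span(name="routing", start_ms=0, end_ms=12),
    Span(name="retrieval", start_ms=12, end_ms=350),
    Generation(name="generate", start_ms=350, end_ms=1500,
               model="gpt-4o-mini", input_tokens=1800, output_tokens=240),
])

total_ms = max(s.end_ms for s in trace.spans)
print(total_ms)   # 1500
llm_tokens = sum(g.input_tokens + g.output_tokens
                 for g in trace.spans if isinstance(g, Generation))
print(llm_tokens) # 2040
```

Because generations are spans, aggregating model usage across a trace is just filtering spans by type, which is exactly why Langfuse separates the two.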

Problem-to-Tool Map

Problem class            | Symptom                                                  | Cheapest thing to try first | Tool or approach
Opaque agent behavior    | You can't explain why the system chose a route or tool   | Print statements            | Structured traces with spans
No cost visibility       | Quality improves but you don't know what it costs        | Manual token estimates      | Per-generation usage telemetry
Can't reproduce failures | A request failed but you can't recreate the conditions   | Re-run and hope             | Trace replay with captured inputs
Cross-step debugging     | The answer is wrong but you don't know which step failed | Read the full output        | Span-level inspection of route, retrieval, generation
No product signal        | System is fast but you don't know if answers are useful  | Ask users informally        | Product outcome events attached to traces

Default: Langfuse

Why this is our default: Langfuse is open-source, has a generous free tier for development, provides a trace UI out of the box, and has a Python SDK that integrates cleanly with our existing code. It gives you trace visualization, cost tracking, and eval hooks without requiring infrastructure setup.

Portable concept underneath: Every trace should be reconstructable as a sequence of: request received, route chosen, tool calls made, retrieval results returned, model response generated, outcome recorded. This structure is the same whether you use Langfuse, LangSmith, Arize Phoenix, or raw OpenTelemetry. The concepts (traces, spans, generations, metadata) are portable. The SDK calls are not.
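That six-step backbone can be written down as plain data, independent of any SDK. The field values below are illustrative, not from a real run:

```python
# One request as an ordered list of events: the portable backbone that any
# trace backend (Langfuse, LangSmith, Phoenix, raw OTel) renders a version of.
EVENTS = [
    {"event": "request_received", "question": "What does validate_path return?"},
    {"event": "route_chosen", "mode": "hybrid", "confidence": 0.92},
    {"event": "tool_call", "tool": "hybrid_retrieve"},
    {"event": "retrieval_returned", "snippet_count": 6, "tokens": 1800},
    {"event": "response_generated", "model": "gpt-4o-mini", "latency_ms": 1240},
    {"event": "outcome_recorded", "accepted": True},
]

order = [e["event"] for e in EVENTS]
print(order)
```

If you can serialize this list, you can reconstruct the trace in any backend; only the SDK calls that produce it change.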

Closest alternatives and when to switch:

  • LangSmith: use if LangChain or LangGraph is already your orchestration layer and you want tighter first-party integration
  • Arize Phoenix: use if your organization already standardizes on it, or if you want a self-hosted open-source option with strong eval integration
  • Pure OpenTelemetry + Jaeger/Grafana: use when you want maximal control over your telemetry pipeline and can tolerate more infrastructure setup. This is the most portable option but requires the most work. I would personally choose this in production, but it's a bit distracting for this course.

Walkthrough

Setting up Langfuse

First, install the SDK and set up credentials:

pip install langfuse

You'll need a Langfuse account for this. In development, the cloud tier at langfuse.com should be sufficient. After creating a project, you'll get a public key and a secret key.

export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."
export LANGFUSE_HOST="https://cloud.langfuse.com"  # or your self-hosted URL
Self-hosting option

Langfuse can run locally via Docker if you prefer not to send traces to the cloud. See the Langfuse self-hosting docs for setup. The code in this lesson works the same either way.
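Before running any traced code, it saves a debugging cycle to confirm the environment variables are actually set. Here is a small convenience helper of our own (not part of the Langfuse SDK):

```python
# check_langfuse_env.py -- sanity-check Langfuse credentials before tracing.
# This is a lesson convenience script, not part of the Langfuse SDK.
import os

REQUIRED = ["LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY", "LANGFUSE_HOST"]

def missing_langfuse_vars(env=os.environ) -> list[str]:
    """Return the names of required Langfuse variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]

if __name__ == "__main__":
    missing = missing_langfuse_vars()
    if missing:
        print(f"Missing: {', '.join(missing)}")
    else:
        print("Langfuse credentials look set.")
```

The Langfuse client also exposes an auth_check() method that verifies the keys against the server; see the Langfuse docs for the exact call in your SDK version.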

The five events we'll instrument

We'll add telemetry to five key points in our RAG pipeline. Each one then becomes a span in the trace:

  1. Run start: a request arrives, we create a trace and record the question and metadata
  2. Route chosen: the retrieval router classifies the question and picks a mode
  3. Tool call: any tool invocation (retrieval, graph traversal, etc.)
  4. Retrieval return: the evidence bundle comes back with chunk count, token count, and scores
  5. Response completion: the model generates an answer, and we record tokens, latency, and the response

The code below is the instrumented pipeline, which builds directly on top of rag/rag_with_routing.py from Module 5's Retrieval Modes and Routing lesson:

# observability/traced_pipeline.py
"""RAG pipeline with Langfuse tracing."""
import os
import sys
import time
from datetime import datetime, timezone

sys.path.insert(0, ".")

from langfuse import Langfuse
from rag.retrieval_service import (
    retrieve, RetrievalPolicy, RetrievalMode, RetrievalResult,
)
from rag.grounded_answer import (
    GroundedAnswer, check_grounding, generate_answer,
    GROUNDING_SYSTEM_PROMPT,
)
from retrieval.context_compiler import ContextPack, EvidenceChunk
from openai import OpenAI

langfuse = Langfuse()
client = OpenAI()
MODEL = "gpt-4o-mini"
REPO_SHA = os.popen("git rev-parse --short HEAD").read().strip()


def traced_rag_pipeline(
    question: str,
    hybrid_retrieve_fn,
    graph_traverse_fn=None,
    policy: RetrievalPolicy | None = None,
    mode: RetrievalMode | None = None,
    model: str = MODEL,
    run_id: str | None = None,
    user_id: str | None = None,
) -> GroundedAnswer:
    """RAG pipeline with full Langfuse tracing."""
    # --- 1. Run start: create the trace ---
    trace = langfuse.trace(
        name="rag-pipeline",
        input={"question": question, "mode": mode.value if mode else "auto"},
        metadata={
            "run_id": run_id or "interactive",
            "repo_sha": REPO_SHA,
            "model": model,
        },
        user_id=user_id,
    )
    pipeline_start = time.perf_counter()

    # --- 2. Route chosen ---
    # Note: retrieve() both classifies the question AND performs retrieval,
    # so this span's duration covers both steps.
    routing_span = trace.span(name="routing", input={"question": question})
    routing_start = time.perf_counter()

    result: RetrievalResult = retrieve(
        question, hybrid_retrieve_fn,
        graph_traverse_fn=graph_traverse_fn, policy=policy, mode=mode,
    )

    routing_ms = (time.perf_counter() - routing_start) * 1000
    routing_span.end(
        output={
            "mode": result.mode_used.value,
            "confidence": result.classification.confidence,
            "reasoning": result.classification.reasoning,
            "skipped": result.skipped,
        },
        metadata={"duration_ms": round(routing_ms, 1)},
    )

    # --- Handle skip (no retrieval needed) ---
    if result.skipped:
        gen_span = trace.generation(
            name="generate-no-retrieval", model=model,
            input=[
                {"role": "system", "content": "Answer from general knowledge."},
                {"role": "user", "content": question},
            ],
        )
        gen_start = time.perf_counter()

        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": (
                    "You are a helpful assistant. Answer the question from "
                    "your general knowledge. Be concise and accurate."
                )},
                {"role": "user", "content": question},
            ],
            temperature=0,
        )

        answer_text = response.choices[0].message.content
        usage = response.usage
        gen_ms = (time.perf_counter() - gen_start) * 1000

        gen_span.end(
            output=answer_text,
            usage={
                "input": usage.prompt_tokens,
                "output": usage.completion_tokens,
                "total": usage.total_tokens,
            },
            metadata={"duration_ms": round(gen_ms, 1)},
        )

        answer = GroundedAnswer(
            question=question, answer=answer_text,
            abstained=False, model=model,
        )
        total_ms = (time.perf_counter() - pipeline_start) * 1000
        trace.update(output={
            "answer": answer_text[:200],
            "skipped_retrieval": True,
            "total_ms": round(total_ms, 1),
        })
        langfuse.flush()
        return answer

    # --- 3. Tool call / 4. Retrieval return ---
    # The retrieval work already happened inside retrieve() above; this span
    # records what came back rather than timing the call itself.
    retrieval_span = trace.span(
        name="retrieval",
        input={"question": question, "mode": result.mode_used.value},
    )
    snippet_count = len(result.bundle.snippets) if result.bundle else 0
    token_count = result.bundle.total_tokens if result.bundle else 0
    retrieval_span.end(output={
        "snippet_count": snippet_count, "total_tokens": token_count,
        "files": [s.file_path for s in (result.bundle.snippets if result.bundle else [])][:10],
    })

    # --- Grounding check ---
    grounding_span = trace.span(
        name="grounding-check",
        input={"snippet_count": snippet_count, "total_tokens": token_count},
    )
    pack = ContextPack(
        question=question,
        chunks=[
            EvidenceChunk(
                chunk_id=s.chunk_id, file_path=s.file_path,
                symbol_name=s.symbol_name or "", text=s.text,
                start_line=s.start_line, end_line=s.end_line,
                retrieval_method=s.retrieval_method,
                retrieval_score=s.relevance_score,
            )
            for s in result.bundle.snippets
        ],
        total_tokens=result.bundle.total_tokens,
        token_budget=result.bundle.token_budget,
    )
    sufficient, reason = check_grounding(pack)
    grounding_span.end(output={"sufficient": sufficient, "reason": reason})

    if not sufficient:
        answer = GroundedAnswer(
            question=question,
            answer=f"I don't have enough evidence to answer this question reliably. {reason}",
            abstained=True, abstention_reason=reason,
            model=model, context_tokens=pack.total_tokens,
        )
        total_ms = (time.perf_counter() - pipeline_start) * 1000
        trace.update(output={"answer": answer.answer[:200], "abstained": True, "total_ms": round(total_ms, 1)})
        langfuse.flush()
        return answer

    # --- 5. Response completion ---
    gen_span = trace.generation(
        name="generate-grounded-answer", model=model,
        input=[
            {"role": "system", "content": GROUNDING_SYSTEM_PROMPT[:200] + "..."},
            {"role": "user", "content": f"[evidence: {snippet_count} snippets] {question}"},
        ],
    )
    gen_start = time.perf_counter()
    answer = generate_answer(question, pack, model=model)
    gen_ms = (time.perf_counter() - gen_start) * 1000

    gen_span.end(
        output=answer.answer[:500],
        usage={
            "input": answer.context_tokens or 0,
            "output": 0,   # Approximate: Langfuse captures exact counts when
                           # using its OpenAI wrapper; this is a fallback.
            "total": answer.context_tokens or 0,
        },
        metadata={
            "duration_ms": round(gen_ms, 1),
            "citation_count": len(answer.citations) if answer.citations else 0,
        },
    )

    total_ms = (time.perf_counter() - pipeline_start) * 1000
    trace.update(output={
        "answer": answer.answer[:200], "abstained": answer.abstained,
        "citations": len(answer.citations) if answer.citations else 0,
        "total_ms": round(total_ms, 1), "context_tokens": answer.context_tokens,
    })
    langfuse.flush()
    return answer


if __name__ == "__main__":
    from retrieval.hybrid_retrieve import hybrid_retrieve

    test_questions = [
        "What does validate_path return?",
        "What is the architecture of the retrieval system?",
        "What is a Python decorator?",
    ]
    print("Traced RAG pipeline demo\n")
    for q in test_questions:
        print(f"Q: {q}")
        ans = traced_rag_pipeline(q, hybrid_retrieve, run_id="telemetry-demo")
        if ans.abstained:
            print(f"  ABSTAINED: {ans.abstention_reason}")
        else:
            print(f"  Answer: {ans.answer[:120]}...")
        print()
    print("Traces flushed to Langfuse. Open the Langfuse UI to inspect them.")
Create the directory, save the file, and run it:

mkdir -p observability
python observability/traced_pipeline.py

Expected output:

Traced RAG pipeline demo

Q: What does validate_path return?
  Answer: The `validate_path` function returns a `Path` object if the path is valid and within...

Q: What is the architecture of the retrieval system?
  Answer: The retrieval system uses a hybrid architecture combining vector search and lexical...

Q: What is a Python decorator?
  Answer: A Python decorator is a function that takes another function as an argument and extends...

Traces flushed to Langfuse. Open the Langfuse UI to inspect them.

After running this, open the Langfuse UI. You'll see three traces, each with nested spans showing the routing decision, retrieval results, grounding check, and generation. The "What is a Python decorator?" trace will show the skip path: routing decided no retrieval was needed, so there's a single generation span with no retrieval span.

Reading a trace

When you open a trace in the Langfuse UI, you'll see a waterfall view showing:

  • Trace metadata: the question, run ID, repo SHA, and total duration
  • Span timeline: each span as a horizontal bar, showing when it started, how long it ran, and what it produced
  • Generation details: for LLM calls, the model, input/output tokens, and the full prompt and completion
  • Nesting: child spans appear indented under their parent, showing the causal structure

The most useful thing to look for in early traces is where time is spent. In a typical RAG request, retrieval and generation dominate. If a purely heuristic router takes more than a few milliseconds, something is wrong; an LLM-based router, by contrast, costs a full model call, so budget for it. If retrieval takes longer than generation, your index may need optimization.
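As a sketch of that arithmetic, suppose you have pulled span durations out of one trace (the numbers below are hypothetical). The percentage breakdown is the same view the waterfall renders:

```python
# Hypothetical helper: given (span name, duration_ms) pairs from one trace,
# show where the time went as a percentage breakdown.
def time_breakdown(spans: list[tuple[str, float]]) -> dict[str, float]:
    total = sum(ms for _, ms in spans)
    return {name: round(100 * ms / total, 1) for name, ms in spans}

breakdown = time_breakdown([
    ("routing", 8.0),
    ("retrieval", 340.0),
    ("grounding-check", 52.0),
    ("generation", 1200.0),
])
print(breakdown)
# Generation dominates at ~75%, matching the "retrieval and generation
# dominate" expectation above.
```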

Understanding OpenTelemetry as the portable concept

Langfuse uses its own SDK, but the concepts map directly to OpenTelemetry (OTel):

Langfuse concept | OTel equivalent               | What it represents
Trace            | Trace                         | One end-to-end request
Span             | Span                          | One step in the request
Generation       | Span with gen_ai.* attributes | An LLM call specifically
Metadata         | Span attributes               | Key-value data attached to a span
Score            | none (custom)                 | Eval result attached to a trace

If you later migrate to a pure OTel setup, you'll carry the same mental model. Traces contain spans, spans have timing and metadata, and generations are spans with model-specific attributes. The OpenTelemetry Semantic Conventions for GenAI define a standard attribute set for LLM calls that most observability tools understand.
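The mapping is mechanical enough to write down. A sketch of translating a Langfuse-style generation record into OTel span attributes; the gen_ai.* names follow the GenAI semantic conventions at the time of writing, and they have been renamed before, so check the current spec:

```python
# Sketch: map a Langfuse-style generation record onto OTel gen_ai.* span
# attributes. Attribute names are from the OTel GenAI semantic conventions;
# verify against the current spec before relying on them.
def generation_to_otel_attributes(gen: dict) -> dict:
    return {
        "gen_ai.system": gen.get("provider", "openai"),
        "gen_ai.request.model": gen["model"],
        "gen_ai.usage.input_tokens": gen["usage"]["input"],
        "gen_ai.usage.output_tokens": gen["usage"]["output"],
    }

attrs = generation_to_otel_attributes({
    "model": "gpt-4o-mini",
    "usage": {"input": 1800, "output": 240, "total": 2040},
})
print(attrs["gen_ai.request.model"])  # gpt-4o-mini
```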

Tracing a benchmark run

Now we'll connect telemetry to the benchmark harness from Module 2. This lets you trace an entire benchmark pass and inspect any individual question's trace:

# observability/traced_benchmark.py
"""Run the benchmark with tracing enabled."""
import json
import os
import sys
from datetime import datetime, timezone

sys.path.insert(0, ".")

from observability.traced_pipeline import traced_rag_pipeline, langfuse
from retrieval.hybrid_retrieve import hybrid_retrieve

RUN_ID = "traced-" + datetime.now(timezone.utc).strftime("%Y-%m-%d-%H%M%S")
MODEL = "gpt-4o-mini"
PROVIDER = "openai"
BENCHMARK_FILE = "benchmark-questions.jsonl"
OUTPUT_FILE = f"harness/runs/{RUN_ID}.jsonl"
REPO_SHA = os.popen("git rev-parse --short HEAD").read().strip()

questions = []
with open(BENCHMARK_FILE) as f:
    for line in f:
        if line.strip():
            questions.append(json.loads(line))

print(f"Running {len(questions)} benchmark questions with tracing")
print(f"Run ID: {RUN_ID}\n")

os.makedirs("harness/runs", exist_ok=True)
results = []

for i, q in enumerate(questions):
    print(f"[{i+1}/{len(questions)}] {q['category']}: {q['question'][:60]}...")
    answer = traced_rag_pipeline(
        question=q["question"], hybrid_retrieve_fn=hybrid_retrieve,
        model=MODEL, run_id=RUN_ID,
    )
    entry = {
        "run_id": RUN_ID, "question_id": q["id"],
        "question": q["question"], "category": q["category"],
        "answer": answer.answer, "model": MODEL, "provider": PROVIDER,
        "evidence_files": list(set(
            c.get("file_path", "") for c in answer.citations
        )) if answer.citations else [],
        "tools_called": [],
        "retrieval_method": "routed",
        "grade": None, "failure_label": None, "grading_notes": "",
        "repo_sha": REPO_SHA,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "harness_version": "v0.2",
    }
    results.append(entry)

with open(OUTPUT_FILE, "w") as f:
    for entry in results:
        f.write(json.dumps(entry) + "\n")

langfuse.flush()
print(f"\nDone. {len(results)} results saved to {OUTPUT_FILE}")
print(f"Traces available in Langfuse. Filter by run_id: {RUN_ID}")
Run it:

python observability/traced_benchmark.py

After this runs, you'll have two things: a JSONL run log (same format as Module 2) and a full set of traces in Langfuse. You can filter traces by the run ID and inspect any individual question to see exactly what the pipeline did.
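Once the run log exists, quick summaries are a few lines of stdlib code. The sketch below operates on in-memory JSONL lines so it runs standalone; in practice you would point it at harness/runs/<RUN_ID>.jsonl:

```python
import json

# Hypothetical post-run summary: count run-log entries (Module 2 JSONL
# format) per question category.
def summarize_run(lines: list[str]) -> dict[str, int]:
    by_category: dict[str, int] = {}
    for line in lines:
        if not line.strip():
            continue
        entry = json.loads(line)
        by_category[entry["category"]] = by_category.get(entry["category"], 0) + 1
    return by_category

sample = [
    json.dumps({"question_id": "q1", "category": "lookup"}),
    json.dumps({"question_id": "q2", "category": "architecture"}),
    json.dumps({"question_id": "q3", "category": "lookup"}),
]
print(summarize_run(sample))  # {'lookup': 2, 'architecture': 1}
```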

What traces reveal that logs don't

Print-based logging tells you what happened. Traces tell you how it happened:

  • Timing relationships: you can see that retrieval took 340ms but generation took 1200ms, so generation is the bottleneck
  • Causal structure: the routing span shows why hybrid mode was chosen, and the retrieval span shows what evidence came back
  • Token accounting: the generation span shows exactly how many input tokens were evidence vs. system prompt
  • Failure isolation: when an answer is wrong, you can trace backward from the generation to see if the evidence was bad (retrieval problem) or the evidence was good but the model ignored it (generation problem)
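The trace-backward workflow in the last bullet can be expressed as a first-pass triage rule. This is a hypothetical helper operating on plain dicts whose field names mirror the span outputs instrumented in this lesson:

```python
# Hypothetical triage helper: given the retrieval span's output and the final
# answer record, guess which pipeline stage to inspect first.
def localize_failure(retrieval_output: dict, answer: dict) -> str:
    if retrieval_output.get("snippet_count", 0) == 0:
        return "retrieval: no evidence came back"
    if answer.get("abstained"):
        return "grounding: evidence judged insufficient"
    if answer.get("citations", 0) == 0:
        return "generation: evidence present but none cited"
    return "inspect generation: evidence and citations both present"

print(localize_failure({"snippet_count": 0}, {}))
# retrieval: no evidence came back
print(localize_failure({"snippet_count": 6}, {"abstained": False}))
# generation: evidence present but none cited
```

A rule like this won't replace reading the trace, but it is a useful first sort when grading a batch of failed questions.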

We'll build on this visibility in the next lesson, where we'll add cost tracking, latency budgets, and product outcome events to the traces.

Exercises

  1. Set up Langfuse (cloud or self-hosted) and run observability/traced_pipeline.py with the three test questions. Open the Langfuse UI and inspect each trace. Identify which spans took the longest.
  2. Run observability/traced_benchmark.py against your full benchmark set. Filter traces by run ID in Langfuse and find the trace with the highest total duration. What made it slow?
  3. Find a trace where the answer was wrong (you can cross-reference with your graded run logs from Module 2). Walk through the trace spans and identify where the failure originated: routing, retrieval, or generation.
  4. Add a custom span to the pipeline for a step that isn't currently instrumented. For example, add a span around the context_pack_to_bundle call to see how long evidence bundle assembly takes.
  5. Write a short note (3-4 sentences) explaining what one trace taught you that print-statement logs alone wouldn't have shown.

Completion checkpoint

You should now have:

  • Langfuse set up and receiving traces from your pipeline
  • An instrumented RAG pipeline (observability/traced_pipeline.py) that produces traces with spans for routing, retrieval, grounding, and generation
  • At least one full benchmark run traced end to end, with results saved to both JSONL and Langfuse
  • The ability to open any trace in the UI and identify where time and tokens were spent
  • A written observation about what traces revealed that logs alone wouldn't have

Reflection prompts

  • Which step in the pipeline consistently takes the most time? Is that the step you expected?
  • For a question that was answered incorrectly, could you identify the failure point from the trace alone? What additional information would have helped?
  • How would you explain the value of structured traces to someone who says "print statements are fine"?

Connecting to the project

We can now see inside the pipeline as it runs. Every request produces a trace showing the route chosen, the evidence retrieved, the tokens consumed, and the answer generated. This visibility is the prerequisite for everything else in this module. You can't optimize what you can't see, and you can't evaluate what you can't measure.

But visibility alone doesn't answer the operational questions: how much does each answer cost? Where are we wasting tokens? How do we set budgets and catch runaway costs before they become a problem? The next lesson adds the cost, caching, and rate-limiting layers that turn raw telemetry into actionable operational metrics.

What's next

Cost, Caching and Rate Limits. Visibility is not enough if you cannot tell what the system costs or where tokens are being wasted; the next lesson makes the pipeline operationally legible.

References

Start here

  • Langfuse documentation — setup guides, SDK reference, and trace visualization for the observability layer we're using as our default



Glossary
API (Application Programming Interface)Foundational terms
A structured way for programs to communicate. In this context, usually an HTTP endpoint you call to interact with an LLM.
AST (Abstract Syntax Tree)Foundational terms
A tree representation of source code structure. Used by parsers like Tree-sitter to understand code as a hierarchy of functions, classes, and statements. You'll encounter this more deeply in the Code Retrieval module, but the concept appears briefly in retrieval fundamentals.
BM25 (Best Match 25)Foundational terms
A classical ranking function for keyword search. Scores documents by term frequency and inverse document frequency. Often competitive with or complementary to vector search.
ChunkingFoundational terms
Splitting a document into smaller pieces for indexing and retrieval. Chunk boundaries significantly affect retrieval quality. Split at the wrong place and your retrieval will return half a function or the end of one paragraph glued to the start of another.
Context engineeringFoundational terms
The discipline of selecting, packaging, and budgeting the information a model sees at inference time. Prompts, retrieved evidence, tool results, memory, and state are all parts of context. Context engineering is arguably the core skill of AI engineering. Bigger context windows are not a substitute for better context selection.
Context rotFoundational terms
Degradation of output quality caused by stale, noisy, or accumulated context. Symptoms include stale memory facts, conflicting retrieved evidence, bloated prompt history, and accumulated instructions that contradict each other. A form of technical debt in AI systems.
Context windowFoundational terms
The maximum number of tokens an LLM can process in a single request (input + output combined).
EmbeddingFoundational terms
A fixed-length numeric vector representing a piece of text. Used for similarity search: texts with similar meanings have nearby embeddings.
EndpointFoundational terms
A specific URL path that accepts requests and returns responses (e.g., POST /v1/chat/completions).
GGUFFoundational terms
A file format for quantized models used by llama.cpp and Ollama. When you see a model name like qwen2.5:7b-q4_K_M, the suffix indicates the quantization scheme. GGUF supports mixed quantization (different precision for different layers) and is the most common format for local inference.
HallucinationFoundational terms
When a model generates content that sounds confident but isn't supported by the evidence it was given, or fabricates details that don't exist. Not the same as "any wrong answer"; a model that misinterprets ambiguous instructions gave a bad answer but didn't hallucinate. Common causes: weak prompt, missing context, context rot, model limitation, or retrieval failure.
InferenceFoundational terms
Running a trained model to generate output from input. What happens when you call an API. Most AI engineering work is inference-time work: building systems around models, not training them. Use "inference," not "inferencing."
JSON (JavaScript Object Notation)Foundational terms
A lightweight text format for structured data. The lingua franca of API communication.
Lexical searchFoundational terms
Finding items by matching keywords or terms. Includes BM25, TF-IDF (Term Frequency–Inverse Document Frequency), and simple keyword matching. Returns exact term matches, not semantic similarity.
LLM (Large Language Model)Foundational terms
A neural network trained on large text corpora that generates text by predicting the next token. The core technology behind AI engineering; every tool, pattern, and pipeline in this curriculum runs on top of one.
MetadataFoundational terms
Structured information about a document or chunk (file path, language, author, date, symbol type). Used for filtering retrieval results.
Neural networkFoundational terms
A computing system loosely inspired by biological neurons, built from layers of mathematical functions that transform inputs into outputs. LLMs are a specific type of neural network (transformers) trained on text. You don't need to understand neural network internals to do AI engineering, but knowing the term helps when reading external resources.
Reasoning modelFoundational terms
A model optimized for complex multi-step planning, math, and logic (e.g., o3, o4-mini). Slower and more expensive but better on hard problems. Sometimes called "LRM" (large reasoning model), but "reasoning model" is the more consistent term across provider docs.
RerankingFoundational terms
A second-pass scoring step that re-orders retrieved results using a more expensive model. Improves precision after an initial broad retrieval.
SchemaFoundational terms
A formal description of the shape and types of a data structure. Used to validate inputs and outputs.
SLM (small language model)Foundational terms
A compact model (typically 1-7B parameters) that runs on consumer hardware with lower cost, latency, and better privacy (e.g., Phi, small Llama variants, Gemma). The right choice when privacy, offline operation, predictable cost, or low latency matter more than peak capability.
System promptFoundational terms
A special message that sets the model's behavior, role, and constraints for a conversation.
TemperatureFoundational terms
A parameter controlling output randomness. Lower values produce more deterministic output; higher values produce more varied output. Does not affect the model's intelligence.
TokenFoundational terms
The basic unit an LLM processes. Not a word. Tokens are sub-word fragments. "unhappiness" might be three tokens: "un", "happi", "ness". Token count determines cost and context window usage.
Top-kFoundational terms
The number of results returned from a retrieval query. "Top-5" means the five highest-scoring results.
Top-p (nucleus sampling)Foundational terms
An alternative to temperature for controlling output diversity. Selects from the smallest set of tokens whose cumulative probability exceeds p.
Vector searchFoundational terms
Finding items by proximity in embedding space (nearest neighbors). Returns "similar" results, not "exact match" results.
vLLM (virtual LLM)Foundational terms
An inference serving engine (not a model) that hosts open-weight models behind an OpenAI-compatible HTTP endpoint. Infrastructure layer, not model layer. Relevant when moving from hosted APIs to self-hosting.
WeightsFoundational terms
The learned parameters inside a model. Changed during training, fixed during inference.
Workhorse modelFoundational terms
A general-purpose LLM optimized for speed and broad capability (e.g., GPT-4o-mini, Claude Haiku, Gemini Flash). The default for most tasks. When someone says "LLM" without qualification, they usually mean this.
BaselineBenchmark and Harness terms
The first measured performance of your system on a benchmark. Everything else is compared against this. Without a baseline, you can't tell whether a change helped.
BenchmarkBenchmark and Harness terms
A fixed set of questions or tasks with known-good answers, used to measure system performance over time.
Run logBenchmark and Harness terms
A structured record (typically JSONL) of every system run: what input was given, what output was produced, what tools were called, how long it took, and what it cost. The raw data that evals, telemetry, and cost analysis are built from.
A2A (Agent-to-Agent protocol)Agent and Tool Building terms
An open protocol for peer-to-peer agent collaboration. Agents discover each other's capabilities and delegate or negotiate tasks as equals. Different from MCP (which connects agents to tools, not to other agents) and from handoffs (which transfer control within one system).
AgentAgent and Tool Building terms
A system where an LLM decides which tools to call, observes results, and iterates until a task is complete. Agent = model + tools + control loop.
Control loopAgent and Tool Building terms
The code that manages the agent's cycle: send prompt, check for tool calls, execute tools, append results, repeat or finish.
HandoffAgent and Tool Building terms
Passing control from one agent or specialist to another within an orchestrated system.
MCP (Model Context Protocol)Agent and Tool Building terms
An open protocol for exposing tools, resources, and prompts to AI applications in a standardized way. Connects agents to capabilities (tools and data), not to other agents.
Tool calling / function callingAgent and Tool Building terms
The model's ability to request execution of a specific function with structured arguments, rather than just generating text.
### Code Retrieval terms

**Context compilation / context packing**
The process of selecting and assembling the smallest useful set of evidence for a specific task. Not "dump everything retrieved into the prompt."
**Grounding**
Tying model assertions to specific evidence. A grounded answer cites what it found; an ungrounded answer asserts without evidence.
**Hybrid retrieval**
Combining multiple retrieval methods (e.g., vector search + keyword search + metadata filters) and merging or reranking the results.
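The merge step is often reciprocal rank fusion (RRF), which needs only each method's ranking, not comparable scores. A sketch (k=60 is the value commonly used in the RRF literature):

```python
def rrf_merge(rankings, k=60):
    """Reciprocal rank fusion: merge ranked lists from multiple
    retrievers. Each ranking is a list of doc ids, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            # Documents near the top of any list get the most credit.
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

merged = rrf_merge([
    ["doc_a", "doc_b", "doc_c"],   # e.g. vector search ranking
    ["doc_b", "doc_d", "doc_a"],   # e.g. keyword search ranking
])
```

`doc_b` wins because it ranks highly in both lists, even though neither retriever put it first with a dominant score.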
**Knowledge graph**
A data structure that stores entities and their relationships explicitly (e.g., "function A calls function B," "module X imports module Y"). Useful for traversal and dependency reasoning. One retrieval strategy among several, often overused when simpler metadata or adjacency tables would suffice.
**RAG (Retrieval-Augmented Generation)**
A pattern where the model's response is grounded in retrieved external evidence rather than relying solely on its training data.
**Symbol table**
A mapping of code identifiers (functions, classes, variables) to their locations and metadata.
**Tree-sitter**
An incremental parsing library that builds ASTs for source code. Used in this curriculum for code-aware chunking and symbol extraction.
### RAG and Grounded Answers terms

**Context pack**
A structured bundle of evidence assembled for a specific task, with metadata about provenance, relevance, and token budget.
**Evidence bundle**
A collection of retrieved items grouped for a specific sub-task, with enough metadata to evaluate whether the evidence is relevant and sufficient.
**Retrieval routing**
Deciding which retrieval strategy or method to use for a given query. Different questions need different retrieval methods.
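A router can be as simple as a rule over surface cues. A toy sketch (production routers are usually a small LLM classifier; the strategy names here are illustrative):

```python
def route(question):
    """Toy heuristic router: map a question to a retrieval strategy.
    Strategy names are illustrative, not from any specific library."""
    q = question.lower()
    if "error" in q or "traceback" in q:
        return "keyword_search"    # exact strings matter for error text
    if "depends on" in q or "calls" in q:
        return "graph_traversal"   # relationship/dependency question
    return "vector_search"         # default semantic lookup
```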
### Observability and Evals terms

**Eval**
A structured test that measures system quality. Not the same as training: evals measure the model, they don't change it.
**Harness (AI harness / eval harness)**
The experiment and evaluation framework around your model or agent. It runs benchmark tasks, captures outputs, logs traces, grades results, and compares system versions. It turns ad hoc "try it and see" into repeatable, comparable experiments. Typically includes: input dataset, prompt and tool configuration, model/provider selection, execution loop, logging, grading, and artifact capture.
**LLM-as-judge**
Using a language model to evaluate or grade the output of another model or system. Useful for scaling evaluation beyond manual review, but requires rubric quality, judge consistency checks, and human spot-checking. Not a replacement for exact-match checks where they apply.
**OpenTelemetry (OTel)**
An open standard for collecting and exporting telemetry data (traces, metrics, logs). Vendor-agnostic.
**RAGAS**
A specific eval framework for retrieval-augmented generation. Measures metrics like faithfulness, relevance, and context precision. One tool example, not a foundational concept. Learn the metrics first, then the tool.
**Span**
A single operation within a trace (e.g., one tool call, one retrieval query). Traces are made of spans.
**Telemetry**
Structured data about system behavior: what happened, when, how long it took, what it cost. Includes traces, metrics, and events.
**Trace**
A structured record of one complete run through the system, including all steps, tool calls, and decisions.
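The trace/span relationship can be sketched with a context manager. Backends like Langfuse or an OTel SDK do this bookkeeping for you, but the shape is the same (here a trace is just a plain list):

```python
import contextlib
import time
import uuid

@contextlib.contextmanager
def span(trace, name, **meta):
    """Record one timed step inside a trace. Simplified sketch:
    real backends also handle nesting, errors, and export."""
    entry = {"id": uuid.uuid4().hex, "name": name, "meta": meta,
             "start": time.time()}
    try:
        yield entry
    finally:
        entry["end"] = time.time()
        trace.append(entry)

trace = []
with span(trace, "retrieve", strategy="vector_search"):
    time.sleep(0.01)  # stand-in for the real retrieval call
with span(trace, "generate", model="gpt-4o-mini"):
    time.sleep(0.01)  # stand-in for the real model call
```

After a run, `trace` holds one timed, metadata-tagged entry per step — enough to reconstruct what the system did and how long each part took.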
### Orchestration and Memory terms

**Long-term memory**
Persistent facts that survive across conversations. Requires write policies to manage what gets stored, updated, or deleted.
**Orchestration**
Explicit control over how tasks are routed, delegated, and synthesized across multiple agents or specialists.
**Router**
A component that decides which specialist or workflow path to use for a given query.
**Specialist**
An agent or workflow tuned for a narrow task (e.g., "code search," "documentation lookup," "test generation"). Specialists are composed by an orchestrator.
**Thread memory**
Conversation state that persists within a single session or thread.
**Workflow memory**
Intermediate state that persists within a multi-step task but doesn't survive beyond the workflow's completion.
### Optimization terms

**Catastrophic forgetting**
When fine-tuning causes a model to lose capabilities it had before training. The model gets better at the fine-tuned task but worse at tasks it previously handled. PEFT methods like LoRA reduce this risk by freezing original weights.
**Distillation**
Training a smaller (student) model to reproduce the behavior of a larger (teacher) model on a specific task.
**DPO (Direct Preference Optimization)**
A method for preference-based model optimization that's simpler than RLHF, training the model directly on preference pairs without a separate reward model.
**Fine-tuning**
Updating a model's weights on task-specific data to change its behavior permanently. An umbrella term that includes SFT, instruction tuning, RLHF, DPO, and other techniques. See the fine-tuning landscape table in Lesson 8.3 for how these relate.
**Full fine-tuning**
Updating all of a model's parameters during training, as opposed to PEFT methods that update only a small subset. Requires significantly more GPU memory and compute. Produces the most thorough adaptation but carries higher risk of catastrophic forgetting.
**Inference server**
Software (like vLLM or Ollama) that hosts a model and serves inference requests.
**Instruction tuning**
A specific application of SFT where the training data consists of instruction-response pairs. This is how base models become chat models: the technique is SFT, the data format is instructions. Not a separate technique from SFT.
**LoRA (Low-Rank Adaptation)**
A parameter-efficient fine-tuning method that trains small adapter matrices instead of updating all model weights. Dramatically reduces GPU memory and compute requirements.
**Parameter count**
The number of learned weights in a model, commonly expressed in billions (e.g., "7B" = 7 billion parameters). Determines memory requirements (roughly 2 bytes per parameter at FP16) and broadly correlates with capability, though training quality and architecture matter as much as size. See Model Selection and Serving for sizing guidance.
**PEFT (Parameter-Efficient Fine-Tuning)**
A family of methods (including LoRA) that fine-tune a small subset of parameters instead of the full model.
**Preference optimization**
Training methods (RLHF, DPO) that use human or automated preference signals to improve model behavior. "This output is better than that output" rather than "this is the correct output."
**QLoRA (Quantized LoRA)**
LoRA applied to a quantized (compressed) base model. Further reduces memory requirements, enabling fine-tuning on consumer hardware.
**Quantization**
Reducing the precision of model weights (e.g., FP16 → INT4) to shrink memory usage and increase inference speed at some quality cost. A 7B model at FP16 needs ~14 GB VRAM; quantized to 4-bit, it fits in ~4 GB. Common formats include GGUF (llama.cpp/Ollama), GPTQ and AWQ (vLLM/HuggingFace). See Model Selection and Serving for format details and tradeoffs.
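The arithmetic behind those figures is simple: weight memory is parameters times bytes per parameter. A back-of-the-envelope helper (weights only; KV cache and activations add runtime overhead on top, which is why the 4-bit figure is quoted as ~4 GB rather than 3.5):

```python
def weight_memory_gb(params_billion, bits):
    """Memory for model weights alone: parameters x bytes per parameter.
    A planning heuristic, not a precise measurement."""
    return params_billion * bits / 8

fp16_gb = weight_memory_gb(7, 16)  # 14.0 -> the "~14 GB VRAM" figure
int4_gb = weight_memory_gb(7, 4)   # 3.5  -> "~4 GB" once overhead is added
```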
**Overfitting**
When a model memorizes training examples instead of learning generalizable patterns. The model performs well on training data but poorly on new inputs. Detected by monitoring validation loss alongside training loss.
**RLHF (Reinforcement Learning from Human Feedback)**
A training method that uses human preference signals to improve model behavior through a reward model. More complex than DPO (requires training a separate reward model) but offers more control over the optimization objective.
**SFT (Supervised Fine-Tuning)**
Fine-tuning using input-output pairs where the desired output is known. The most common fine-tuning approach.
**TRL (Transformer Reinforcement Learning)**
A Hugging Face library for training language models with reinforcement learning, SFT, and other optimization methods.
### Cross-cutting terms

**Consumer chat app**
The browser or desktop product meant for human conversation (ChatGPT, Claude, HuggingChat). Useful for experimentation, but not the same as API access.
**Developer platform**
The provider's API, billing, API-key, and developer-docs surface. This is what you need for this learning path.
**Hosted API**
The provider runs the model for you and you call it over HTTP.
**Local inference**
You run the model on your own machine.
**Provider**
The company or service that hosts a model API you call from code.
**Prompt caching**
Reusing computation from repeated prompt prefixes to reduce latency and cost on subsequent requests with the same prefix.
**Rate limiting**
Constraints on how many API requests you can make per unit of time. An operational concern that affects system design and cost.
**Token budget**
The maximum number of tokens you allocate for a specific part of the context (e.g., "retrieval evidence gets at most 4K tokens"). A context engineering tool for preventing any single component from dominating the context window.
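Enforcing a budget usually means greedy packing: take chunks in relevance order until the next one would overflow. A sketch (the `len // 4` token estimate is a rough heuristic; use your model's real tokenizer in practice):

```python
def pack_evidence(chunks, budget_tokens, count_tokens=lambda s: len(s) // 4):
    """Greedily pack chunks (assumed sorted by relevance, best first)
    under a token budget. len//4 is a crude token estimate."""
    packed, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            break          # next chunk would blow the budget -> stop
        packed.append(chunk)
        used += cost
    return packed, used
```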