Module 5: RAG and Grounded Answers / Retrieval Routing

Retrieval Modes and Routing

Up to this point, every question in our pipeline has followed the same retrieval path: hybrid search over code chunks. That was the right starting point. A single retrieval strategy is easier to debug and benchmark. But your anchor project indexes both code and documentation, and those need different retrieval approaches. A question about "how does the caching layer work?" benefits from README and design doc retrieval. A question about "what does validate_path return?" needs precise symbol lookup. And a question about "what changed in the auth module last week?" might not need traditional retrieval at all; it might need git log.

And some questions don't need retrieval at all. "What's the difference between a list and a tuple in Python?" is general knowledge. Retrieving from your codebase for that question wastes tokens, adds latency, and can actually hurt answer quality by injecting irrelevant code context.

This lesson builds a retrieve(query, mode) service that routes questions to the right retrieval strategy, or decides to skip retrieval entirely. This is the last piece of the RAG pipeline before we move to making the whole system visible and accountable in Module 6.

What you'll learn

  • Build a retrieve(query, mode) service with four modes: code, docs, hybrid, and auto
  • Implement auto-routing logic that classifies questions and selects the appropriate retrieval mode
  • Decide when not to retrieve, and handle those cases gracefully
  • Understand retrieval routing as both a quality optimization and a cost optimization
  • Test routing accuracy on your benchmark questions

Concepts

Retrieval mode: a named retrieval strategy optimized for a specific type of evidence. In our system, code mode uses AST-aware vector search, lexical matching, and graph traversal over code files. docs mode is designed for documentation files (README, design docs, comments). hybrid mode combines both. auto mode selects the best mode based on the question. In a production system, each mode would have its own index and chunking strategy. In our implementation, docs mode currently reuses the code retriever — building a separate documentation index is a natural extension once you have documentation worth indexing separately.

Retrieval router: the component that decides which retrieval mode to use for a given question. The router sits between the question and the retrieval pipeline. A simple router uses keyword heuristics; a more sophisticated one uses an LLM to classify the question. The router's accuracy directly affects answer quality — routing a code question to docs retrieval (or vice versa) degrades results even when the underlying retrieval is good.

Retrieval skipping: the deliberate decision not to retrieve for a given question. This is itself a routing decision: the router classifies the question as answerable from general knowledge and skips the retrieval pipeline entirely. Retrieval skipping is a cost optimization (no embedding call, no vector search, no context tokens) and sometimes a quality optimization (irrelevant evidence can confuse the model). We'll build explicit skip logic with a fallback path.

Retrieval policy: the set of rules that govern when, how, and what to retrieve. A retrieval policy includes the routing rules, mode configurations, skip conditions, and fallback behavior. Making the policy explicit and configurable (rather than hardcoding it in the pipeline) lets you tune retrieval behavior without changing code.
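To make "explicit and configurable" concrete, here is a minimal sketch of loading policy overrides from a JSON file instead of hardcoding them. The field names mirror the RetrievalPolicy dataclass we build in the walkthrough below; the simplified string mode and the config file name are placeholders for illustration.

```python
# Sketch: a retrieval policy loaded from config rather than hardcoded.
# Field names mirror the walkthrough's RetrievalPolicy; "policy.json"
# is a placeholder path.
import json
from dataclasses import dataclass, field


@dataclass
class RetrievalPolicy:
    default_mode: str = "auto"
    token_budget: int = 4000
    enable_skip: bool = True
    skip_patterns: list[str] = field(default_factory=list)


def load_policy(path: str) -> RetrievalPolicy:
    """Read policy overrides from JSON so retrieval behavior can be
    tuned without a code change or redeploy."""
    with open(path) as f:
        overrides = json.load(f)
    return RetrievalPolicy(**overrides)
```

Keys absent from the file fall back to the dataclass defaults, so a config containing only `{"token_budget": 2000}` still yields a complete policy.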

Problem-to-Tool Map

| Problem class | Symptom | Cheapest thing to try first | Tool or approach |
| --- | --- | --- | --- |
| Wrong retrieval mode | Code questions return documentation; doc questions return code | Single retrieval path | Mode-based routing with auto-classification |
| Always-on retrieval | System retrieves for every question, including general knowledge | No skip logic | Retrieval policy with explicit skip conditions |
| Mode mismatch noise | Model sees irrelevant evidence from the wrong corpus | Broader hybrid search | Targeted retrieval with mode-specific indexes |
| Routing errors | Auto-router picks the wrong mode and answer quality drops | Manual mode override | Routing accuracy benchmark with manual labels |
| Unnecessary cost | Every question incurs embedding + search costs | Reduce top-k | Skip retrieval for questions that don't need it |

Walkthrough

The retrieve(query, mode) service

This is the unified retrieval interface. All downstream code (the RAG pipeline, the evidence bundle builder, the answer generator) calls retrieve() and gets back a consistent evidence bundle regardless of which mode was used internally.

# rag/retrieval_service.py
"""Unified retrieval service with mode routing."""
import re
import sys
from dataclasses import dataclass
from enum import Enum
from typing import Callable

sys.path.insert(0, ".")
from retrieval.context_compiler import compile_context, ContextPack
from rag.pack_to_bundle import context_pack_to_bundle
from rag.evidence_bundle import EvidenceBundle


class RetrievalMode(Enum):
    CODE = "code"
    DOCS = "docs"
    HYBRID = "hybrid"
    AUTO = "auto"
    SKIP = "skip"


@dataclass
class RetrievalPolicy:
    """Configuration for the retrieval router."""
    default_mode: RetrievalMode = RetrievalMode.AUTO
    token_budget: int = 4000
    enable_skip: bool = True
    # Questions matching these patterns will skip retrieval
    skip_patterns: list[str] | None = None
    # Override: force a specific mode regardless of classification
    force_mode: RetrievalMode | None = None

    def __post_init__(self):
        if self.skip_patterns is None:
            self.skip_patterns = [
                r"^what is (a |an |the )?(python|javascript|variable|function|class|api|http|rest)\b",
                r"^(explain|describe) (what |how )?(a |an |the )?(list|dict|tuple|set|string|integer|float|boolean)",
                r"^how (do|does) (python|javascript|git|docker|linux)",
                r"^what('s| is) the difference between",
            ]


# ---------------------------------------------------------------------------
# Question classifier
# ---------------------------------------------------------------------------

@dataclass
class QuestionClassification:
    """Result of classifying a question for routing."""
    mode: RetrievalMode
    confidence: float
    reasoning: str


def classify_question(question: str, policy: RetrievalPolicy) -> QuestionClassification:
    """Classify a question and determine the best retrieval mode.

    This uses heuristics. In a production system, you might use an LLM
    for classification, but heuristics are faster, cheaper, and easier
    to debug. Start here and upgrade if routing accuracy is a bottleneck.
    """
    q_lower = question.lower().strip()

    # Check skip patterns first
    if policy.enable_skip:
        for pattern in policy.skip_patterns:
            if re.match(pattern, q_lower):
                return QuestionClassification(
                    mode=RetrievalMode.SKIP,
                    confidence=0.8,
                    reasoning="Matches skip pattern: general knowledge question",
                )

    # Code signals: identifiers, file paths, code-specific terms
    code_signals = 0
    # CamelCase or snake_case identifiers
    if re.search(r'\b[A-Z][a-z]+(?:[A-Z][a-z]+)+\b|\b[a-z_]+(?:_[a-z]+){2,}\b', question):
        code_signals += 2
    # File paths
    if re.search(r'[\w/]+\.(py|js|ts|go|rs|java|rb)\b', question):
        code_signals += 2
    # Code-specific verbs
    if any(w in q_lower for w in ["return", "import", "call", "implement", "function", "class", "method"]):
        code_signals += 1
    # Line numbers or stack traces
    if re.search(r'line \d+|traceback|error at', q_lower):
        code_signals += 1

    # Doc signals: design, architecture, explanation requests
    doc_signals = 0
    if any(w in q_lower for w in ["readme", "documentation", "design", "architecture", "overview"]):
        doc_signals += 2
    if any(w in q_lower for w in ["why", "decision", "tradeoff", "approach", "philosophy"]):
        doc_signals += 1
    if any(w in q_lower for w in ["how to use", "getting started", "setup", "install"]):
        doc_signals += 1

    # Route based on signal strength
    if code_signals >= 3 and doc_signals == 0:
        return QuestionClassification(
            mode=RetrievalMode.CODE,
            confidence=0.8,
            reasoning=f"Strong code signals ({code_signals}), no doc signals",
        )
    elif doc_signals >= 3 and code_signals == 0:
        return QuestionClassification(
            mode=RetrievalMode.DOCS,
            confidence=0.8,
            reasoning=f"Strong doc signals ({doc_signals}), no code signals",
        )
    elif code_signals > 0 and doc_signals > 0:
        return QuestionClassification(
            mode=RetrievalMode.HYBRID,
            confidence=0.6,
            reasoning=f"Mixed signals (code: {code_signals}, docs: {doc_signals})",
        )
    else:
        # Default to hybrid when we're not sure
        return QuestionClassification(
            mode=RetrievalMode.HYBRID,
            confidence=0.4,
            reasoning="No strong signals; defaulting to hybrid",
        )


# ---------------------------------------------------------------------------
# Mode-specific retrieval
# ---------------------------------------------------------------------------

def retrieve_code(
    question: str,
    hybrid_retrieve_fn: Callable,
    graph_traverse_fn: Callable | None,
    token_budget: int,
) -> ContextPack:
    """Retrieve from code indexes only."""
    # In a full implementation, this would use code-specific indexes
    # and scoring. For now, we use the context compiler which already
    # does code-focused retrieval.
    return compile_context(
        question,
        hybrid_retrieve_fn,
        graph_traverse_fn=graph_traverse_fn,
        token_budget=token_budget,
    )


def retrieve_docs(
    question: str,
    hybrid_retrieve_fn: Callable,
    token_budget: int,
) -> ContextPack:
    """Retrieve from documentation indexes only.

    In a production system, this would use a separate doc index with
    different chunking (e.g., section-level instead of AST-level).
    For now, we use the same hybrid retrieval but you'd swap in a
    doc-specific retriever when your doc index is ready.
    """
    return compile_context(
        question,
        hybrid_retrieve_fn,
        graph_traverse_fn=None,  # No graph traversal for docs
        token_budget=token_budget,
    )


def retrieve_hybrid(
    question: str,
    hybrid_retrieve_fn: Callable,
    graph_traverse_fn: Callable | None,
    token_budget: int,
) -> ContextPack:
    """Retrieve from both code and doc indexes."""
    # Split the budget between code and docs
    code_budget = int(token_budget * 0.6)
    doc_budget = token_budget - code_budget

    code_pack = retrieve_code(question, hybrid_retrieve_fn, graph_traverse_fn, code_budget)
    doc_pack = retrieve_docs(question, hybrid_retrieve_fn, doc_budget)

    # Merge the packs: combine chunks, re-sort, and re-budget
    all_chunks = code_pack.chunks + doc_pack.chunks

    # Deduplicate by content hash
    seen = set()
    unique = []
    for chunk in all_chunks:
        if chunk.content_hash not in seen:
            seen.add(chunk.content_hash)
            unique.append(chunk)

    # Sort by score and trim to budget
    unique.sort(key=lambda c: c.retrieval_score, reverse=True)

    # ContextPack is already imported at module level; we only need the
    # budgeting helper here.
    from retrieval.context_compiler import apply_token_budget
    merged = ContextPack(
        question=question,
        chunks=unique,
        total_tokens=sum(c.token_count for c in unique),
        token_budget=token_budget,
    )
    return apply_token_budget(merged, budget=token_budget)


# ---------------------------------------------------------------------------
# Unified retrieval service
# ---------------------------------------------------------------------------

@dataclass
class RetrievalResult:
    """Result from the retrieval service."""
    bundle: EvidenceBundle | None
    mode_used: RetrievalMode
    classification: QuestionClassification
    skipped: bool = False
    skip_reason: str = ""


def retrieve(
    question: str,
    hybrid_retrieve_fn: Callable,
    graph_traverse_fn: Callable | None = None,
    policy: RetrievalPolicy | None = None,
    mode: RetrievalMode | None = None,
) -> RetrievalResult:
    """Unified retrieval service.

    Call this with a question and optionally a mode. If mode is None or
    AUTO, the router will classify the question and pick the best mode.
    """
    if policy is None:
        policy = RetrievalPolicy()

    # Determine the mode to use
    if policy.force_mode:
        classification = QuestionClassification(
            mode=policy.force_mode,
            confidence=1.0,
            reasoning="Forced by policy",
        )
    elif mode and mode != RetrievalMode.AUTO:
        classification = QuestionClassification(
            mode=mode,
            confidence=1.0,
            reasoning=f"Explicitly requested: {mode.value}",
        )
    else:
        classification = classify_question(question, policy)

    # Handle skip
    if classification.mode == RetrievalMode.SKIP:
        return RetrievalResult(
            bundle=None,
            mode_used=RetrievalMode.SKIP,
            classification=classification,
            skipped=True,
            skip_reason=classification.reasoning,
        )

    # Route to the appropriate retrieval function
    if classification.mode == RetrievalMode.CODE:
        pack = retrieve_code(
            question, hybrid_retrieve_fn, graph_traverse_fn, policy.token_budget,
        )
    elif classification.mode == RetrievalMode.DOCS:
        pack = retrieve_docs(
            question, hybrid_retrieve_fn, policy.token_budget,
        )
    else:  # HYBRID or fallback
        pack = retrieve_hybrid(
            question, hybrid_retrieve_fn, graph_traverse_fn, policy.token_budget,
        )

    bundle = context_pack_to_bundle(pack)

    return RetrievalResult(
        bundle=bundle,
        mode_used=classification.mode,
        classification=classification,
    )


if __name__ == "__main__":
    from retrieval.hybrid_retrieve import hybrid_retrieve

    test_questions = [
        "What does validate_path return?",
        "What is the architecture of the retrieval system?",
        "What is a Python decorator?",
        "What functions call read_file and how does the caching work?",
    ]

    print("Retrieval routing demo\n")
    for q in test_questions:
        result = retrieve(q, hybrid_retrieve)
        print(f"Q: {q}")
        print(f"  Mode: {result.mode_used.value}")
        print(f"  Confidence: {result.classification.confidence:.1f}")
        print(f"  Reasoning: {result.classification.reasoning}")
        if result.skipped:
            print(f"  SKIPPED: {result.skip_reason}")
        else:
            print(f"  Snippets: {len(result.bundle.snippets)}, "
                  f"Tokens: {result.bundle.total_tokens}")
        print()
Run it:

python rag/retrieval_service.py

Expected output:

Retrieval routing demo

Q: What does validate_path return?
  Mode: code
  Confidence: 0.8
  Reasoning: Strong code signals (3), no doc signals
  Snippets: 4, Tokens: 1823

Q: What is the architecture of the retrieval system?
  Mode: docs
  Confidence: 0.8
  Reasoning: Strong doc signals (3), no code signals
  Snippets: 3, Tokens: 1450

Q: What is a Python decorator?
  Mode: skip
  Confidence: 0.8
  Reasoning: Matches skip pattern: general knowledge question
  SKIPPED: Matches skip pattern: general knowledge question

Q: What functions call read_file and how does the caching work?
  Mode: hybrid
  Confidence: 0.6
  Reasoning: Mixed signals (code: 3, docs: 1)
  Snippets: 5, Tokens: 2100

When NOT to retrieve

Retrieval skipping is the simplest cost optimization in the entire RAG pipeline, and it's often overlooked. Every retrieval call has a cost:

  • Latency: embedding the query, searching the index, compiling the context pack
  • Token cost: the retrieved evidence consumes input tokens in the generation call
  • Quality risk: irrelevant evidence can confuse the model into producing worse answers than it would with no evidence at all

The skip logic we built uses pattern matching, which catches the obvious cases (general knowledge questions). For more nuanced cases, you'll want to track skip accuracy in your benchmark: did skipping help or hurt for each question?

A useful heuristic: if you're not sure whether to retrieve, retrieve but track whether the evidence was actually cited in the answer. If the model consistently ignores the evidence for a class of questions, that's a signal to add those questions to the skip list. We'll build this tracking in Module 6.
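As a preview of that tracking, here is a minimal sketch of measuring whether retrieved evidence was actually used. It assumes answers cite sources with bracketed chunk ids like [src/auth.py:validate_path]; adapt the matching to however your answer generator formats citations.

```python
# Sketch: how much of the retrieved evidence did the answer actually cite?
# Assumes citations appear as bracketed chunk ids, e.g. "[chunk_id]".
def evidence_usage_rate(answer: str, chunk_ids: list[str]) -> float:
    """Return the fraction of retrieved chunks cited in the answer."""
    if not chunk_ids:
        return 0.0
    cited = sum(1 for cid in chunk_ids if f"[{cid}]" in answer)
    return cited / len(chunk_ids)


# If a question category's usage rate stays near zero across benchmark
# runs, that category is a candidate for the skip list.
```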

Auto-routing logic

The auto-router uses a signal-counting approach: count code signals, count doc signals, and route based on which is stronger. This is deliberately simple. Here's why:

  1. Debuggable: when routing goes wrong, you can inspect the signal counts and understand why. An LLM-based classifier is more accurate but harder to debug.
  2. Fast: no API call needed for classification. The router adds microseconds, not milliseconds.
  3. Cheap: no token cost for the classification step.

The tradeoff is accuracy. The heuristic router will misclassify some questions, particularly ambiguous ones like "how does the auth module handle token refresh?" (is that a code question or a design question?). If routing accuracy becomes a bottleneck, you can upgrade to an LLM classifier without changing the interface, as the retrieve() function's signature stays the same.
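If you do make that upgrade, the swap can be small. The sketch below is one way to wrap an LLM classifier behind the same mode vocabulary; the prompt wording and the gpt-4o-mini default are assumptions, and the parsing helper is the part worth unit-testing, since model replies are not guaranteed to be a clean single word.

```python
# Sketch: an LLM-backed classifier that returns the same mode strings as
# the heuristic router. Prompt text and model name are assumptions.
VALID_MODES = {"code", "docs", "hybrid", "skip"}

CLASSIFY_PROMPT = (
    "Classify this question for retrieval routing over a codebase. "
    "Reply with exactly one word: code, docs, hybrid, or skip.\n\n"
    "Question: {question}"
)


def parse_mode(raw: str) -> str:
    """Normalize a model reply to a valid mode, defaulting to hybrid."""
    word = raw.strip().lower().rstrip(".")
    return word if word in VALID_MODES else "hybrid"


def classify_with_llm(question: str, client, model: str = "gpt-4o-mini") -> str:
    """Classify a question using a chat-completions-style client."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": CLASSIFY_PROMPT.format(question=question)}],
        temperature=0,
    )
    return parse_mode(response.choices[0].message.content)
```

Defaulting unparseable replies to hybrid mirrors the heuristic router's fallback, so a flaky classification degrades gracefully instead of crashing the pipeline.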

Measuring routing accuracy

To know whether your router is working, you'll need ground truth labels. Add a routing_mode field to your benchmark questions:

{"id": "q001", "question": "What does validate_path do?", "category": "code_lookup", "routing_mode": "code"}
{"id": "q002", "question": "What is the project architecture?", "category": "explanation", "routing_mode": "docs"}
{"id": "q003", "question": "What is a Python list?", "category": "general", "routing_mode": "skip"}
{"id": "q004", "question": "What calls read_file and why was it designed that way?", "category": "relationship", "routing_mode": "hybrid"}

Then run the classifier on each question and compare:

# rag/test_routing.py
"""Test routing accuracy against labeled benchmark questions."""
import json
import sys

sys.path.insert(0, ".")
from rag.retrieval_service import (
    classify_question, RetrievalPolicy, RetrievalMode,
)
from pathlib import Path

BENCHMARK_FILE = Path("benchmark-questions.jsonl")


def test_routing():
    policy = RetrievalPolicy()

    questions = []
    with open(BENCHMARK_FILE) as f:
        for line in f:
            if line.strip():
                q = json.loads(line)
                if "routing_mode" in q:
                    questions.append(q)

    if not questions:
        print("No questions with routing_mode labels found.")
        print("Add 'routing_mode' to your benchmark questions to test routing.")
        return

    correct = 0
    total = len(questions)

    for q in questions:
        classification = classify_question(q["question"], policy)
        expected = q["routing_mode"]
        predicted = classification.mode.value
        match = expected == predicted

        if match:
            correct += 1
        else:
            print(f"MISMATCH: {q['question'][:50]}...")
            print(f"  Expected: {expected}, Got: {predicted}")
            print(f"  Reasoning: {classification.reasoning}")

    accuracy = correct / total * 100 if total > 0 else 0
    print(f"\nRouting accuracy: {correct}/{total} ({accuracy:.0f}%)")
    if accuracy < 80:
        print("Consider adding more signal patterns or switching to LLM classification.")


if __name__ == "__main__":
    test_routing()
Run it:

python rag/test_routing.py

Integrating the router into the RAG pipeline

Now we'll update our RAG pipeline from Lesson 1 to use the retrieval service instead of calling the context compiler directly:

# rag/rag_with_routing.py
"""RAG pipeline with retrieval routing: the complete Module 5 pipeline."""
import sys
sys.path.insert(0, ".")

from rag.retrieval_service import retrieve, RetrievalPolicy, RetrievalMode
from rag.grounded_answer import (
    GroundedAnswer, check_grounding, generate_answer,
)
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"


def rag_pipeline_with_routing(
    question: str,
    hybrid_retrieve_fn,
    graph_traverse_fn=None,
    policy: RetrievalPolicy | None = None,
    mode: RetrievalMode | None = None,
    model: str = MODEL,
) -> GroundedAnswer:
    result = retrieve(
        question, hybrid_retrieve_fn,
        graph_traverse_fn=graph_traverse_fn, policy=policy, mode=mode,
    )

    if result.skipped:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": (
                    "You are a helpful assistant. Answer the question from "
                    "your general knowledge. Be concise and accurate."
                )},
                {"role": "user", "content": question},
            ],
            temperature=0,
        )
        return GroundedAnswer(
            question=question,
            answer=response.choices[0].message.content,
            abstained=False, model=model,
        )

    from retrieval.context_compiler import ContextPack, EvidenceChunk
    pack = ContextPack(
        question=question,
        chunks=[
            EvidenceChunk(
                chunk_id=s.chunk_id, file_path=s.file_path,
                symbol_name=s.symbol_name or "", text=s.text,
                start_line=s.start_line, end_line=s.end_line,
                retrieval_method=s.retrieval_method,
                retrieval_score=s.relevance_score,
            )
            for s in result.bundle.snippets
        ],
        total_tokens=result.bundle.total_tokens,
        token_budget=result.bundle.token_budget,
    )

    sufficient, reason = check_grounding(pack)
    if not sufficient:
        return GroundedAnswer(
            question=question,
            answer=f"I don't have enough evidence to answer this question reliably. "
                   f"{reason} Rather than guessing, I'd recommend checking the "
                   f"codebase directly or refining the question.",
            abstained=True, abstention_reason=reason,
            model=model, context_tokens=pack.total_tokens,
        )

    return generate_answer(question, pack, model=model)


if __name__ == "__main__":
    from retrieval.hybrid_retrieve import hybrid_retrieve

    test_questions = [
        "What does validate_path return?",
        "What is a Python decorator?",
        "What is the architecture of the retrieval system?",
    ]
    for q in test_questions:
        print(f"Q: {q}")
        answer = rag_pipeline_with_routing(q, hybrid_retrieve)
        if answer.abstained:
            print(f"  ABSTAINED: {answer.abstention_reason}")
        print(f"  Answer: {answer.answer[:150]}...")
        print(f"  Citations: {len(answer.citations)}")
        print()
Run it:

python rag/rag_with_routing.py

Exercises

  1. Build the retrieval service (rag/retrieval_service.py). Test it with the four example questions and verify that each one routes to the expected mode.
  2. Add routing_mode labels to at least 15 of your benchmark questions. Run rag/test_routing.py and measure routing accuracy. If accuracy is below 80%, adjust the signal patterns.
  3. Ask the same question in all four modes (code, docs, hybrid, auto) and compare the answers. For which questions does the mode choice matter most? For which does it barely matter?
  4. Test retrieval skipping on five general-knowledge questions and five codebase-specific questions. Verify that skipping works correctly for general questions and doesn't trigger for codebase questions.
  5. Integrate the router into the full RAG pipeline (rag/rag_with_routing.py). Run your complete benchmark through the routed pipeline and compare answer quality to the non-routed version from Lesson 1. Does routing improve accuracy? For which question categories?

Completion checkpoint

You should now have:

  • A retrieve(query, mode) service with four modes: code, docs, hybrid, and auto
  • An auto-routing classifier that picks the best retrieval mode based on question signals
  • Retrieval skip logic for general knowledge questions that don't need codebase evidence
  • Routing accuracy measured against labeled benchmark questions (target: 80%+)
  • A complete RAG pipeline with routing integrated, tested on your full benchmark

Reflection prompts

  • Which questions did the router misclassify? What would you need to fix those: better heuristics, or an LLM-based classifier?
  • How much did retrieval skipping save in terms of latency and tokens? Was the answer quality for skipped questions better, worse, or the same as when evidence was retrieved?
  • Are there question categories where mode routing made a significant quality difference? Categories where it didn't matter?

Connecting to the project

We've built a complete answer pipeline. A question comes in, the router classifies it and selects a retrieval strategy (or skips retrieval entirely), the context compiler produces a structured evidence bundle, and the answer generator produces a grounded response with citations and abstention. Every piece of retrieved evidence has provenance. Every claim in the answer can be traced to a source.

Now we need to make it visible and accountable. The pipeline works, but we can't see inside it during operation. How long does each stage take? How much does each answer cost? When routing goes wrong, how do we detect it? When the model hallucinates despite grounding, how do we catch it? Module 6 will add the observability and evaluation layers that turn this from a working prototype into a system you can operate and improve with confidence.

What's next

Telemetry. You now have a full pipeline with branching behavior; the next lesson makes that behavior visible so you can see routes, tool calls, latency, and failures instead of guessing.



Glossary
API (Application Programming Interface)Foundational terms
A structured way for programs to communicate. In this context, usually an HTTP endpoint you call to interact with an LLM.
AST (Abstract Syntax Tree)Foundational terms
A tree representation of source code structure. Used by parsers like Tree-sitter to understand code as a hierarchy of functions, classes, and statements. You'll encounter this more deeply in the Code Retrieval module, but the concept appears briefly in retrieval fundamentals.
BM25 (Best Match 25)Foundational terms
A classical ranking function for keyword search. Scores documents by term frequency and inverse document frequency. Often competitive with or complementary to vector search.
ChunkingFoundational terms
Splitting a document into smaller pieces for indexing and retrieval. Chunk boundaries significantly affect retrieval quality. Split at the wrong place and your retrieval will return half a function or the end of one paragraph glued to the start of another.
Context engineeringFoundational terms
The discipline of selecting, packaging, and budgeting the information a model sees at inference time. Prompts, retrieved evidence, tool results, memory, and state are all parts of context. Context engineering is arguably the core skill of AI engineering. Bigger context windows are not a substitute for better context selection.
Context rotFoundational terms
Degradation of output quality caused by stale, noisy, or accumulated context. Symptoms include stale memory facts, conflicting retrieved evidence, bloated prompt history, and accumulated instructions that contradict each other. A form of technical debt in AI systems.
Context windowFoundational terms
The maximum number of tokens an LLM can process in a single request (input + output combined).
EmbeddingFoundational terms
A fixed-length numeric vector representing a piece of text. Used for similarity search: texts with similar meanings have nearby embeddings.
EndpointFoundational terms
A specific URL path that accepts requests and returns responses (e.g., POST /v1/chat/completions).
GGUFFoundational terms
A file format for quantized models used by llama.cpp and Ollama. When you see a model name like qwen2.5:7b-q4_K_M, the suffix indicates the quantization scheme. GGUF supports mixed quantization (different precision for different layers) and is the most common format for local inference.
HallucinationFoundational terms
When a model generates content that sounds confident but isn't supported by the evidence it was given, or fabricates details that don't exist. Not the same as "any wrong answer"; a model that misinterprets ambiguous instructions gave a bad answer but didn't hallucinate. Common causes: weak prompt, missing context, context rot, model limitation, or retrieval failure.
InferenceFoundational terms
Running a trained model to generate output from input. What happens when you call an API. Most AI engineering work is inference-time work: building systems around models, not training them. Use "inference," not "inferencing."
JSON (JavaScript Object Notation)Foundational terms
A lightweight text format for structured data. The lingua franca of API communication.
Lexical searchFoundational terms
Finding items by matching keywords or terms. Includes BM25, TF-IDF (Term Frequency–Inverse Document Frequency), and simple keyword matching. Returns exact term matches, not semantic similarity.
LLM (Large Language Model)Foundational terms
A neural network trained on large text corpora that generates text by predicting the next token. The core technology behind AI engineering; every tool, pattern, and pipeline in this curriculum runs on top of one.
MetadataFoundational terms
Structured information about a document or chunk (file path, language, author, date, symbol type). Used for filtering retrieval results.
Neural networkFoundational terms
A computing system loosely inspired by biological neurons, built from layers of mathematical functions that transform inputs into outputs. LLMs are a specific type of neural network (transformers) trained on text. You don't need to understand neural network internals to do AI engineering, but knowing the term helps when reading external resources.
Reasoning modelFoundational terms
A model optimized for complex multi-step planning, math, and logic (e.g., o3, o4-mini). Slower and more expensive but better on hard problems. Sometimes called "LRM" (large reasoning model), but "reasoning model" is the more consistent term across provider docs.
RerankingFoundational terms
A second-pass scoring step that re-orders retrieved results using a more expensive model. Improves precision after an initial broad retrieval.
SchemaFoundational terms
A formal description of the shape and types of a data structure. Used to validate inputs and outputs.
SLM (small language model)Foundational terms
A compact model (typically 1-7B parameters) that runs on consumer hardware with lower cost, latency, and better privacy (e.g., Phi, small Llama variants, Gemma). The right choice when privacy, offline operation, predictable cost, or low latency matter more than peak capability.
System promptFoundational terms
A special message that sets the model's behavior, role, and constraints for a conversation.
TemperatureFoundational terms
A parameter controlling output randomness. Lower values produce more deterministic output; higher values produce more varied output. Does not affect the model's intelligence.
TokenFoundational terms
The basic unit an LLM processes. Not a word. Tokens are sub-word fragments. "unhappiness" might be three tokens: "un", "happi", "ness". Token count determines cost and context window usage.
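For quick budget math, a common rule of thumb is that English text averages roughly four characters per token. The helper below is a hypothetical sketch built on that heuristic, not a real tokenizer; exact counts require the model's own tokenizer (e.g., tiktoken for OpenAI models):

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: English text averages ~4 characters per token.
    # Use the model's own tokenizer (e.g., tiktoken) for exact counts.
    return max(1, len(text) // 4)

print(estimate_tokens("unhappiness"))  # 11 chars // 4 → 2
```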
Top-k
The number of results returned from a retrieval query. "Top-5" means the five highest-scoring results.
Top-p (nucleus sampling)
An alternative to temperature for controlling output diversity. Selects from the smallest set of tokens whose cumulative probability exceeds p.
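The selection step can be sketched in a few lines. `nucleus_set` is an illustrative helper with made-up probabilities; a real sampler would renormalize over the kept set and then draw one token:

```python
def nucleus_set(probs: dict[str, float], p: float) -> list[str]:
    # Keep the highest-probability tokens until their cumulative
    # probability reaches p; everything else is excluded from sampling.
    total = 0.0
    kept = []
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append(token)
        total += prob
        if total >= p:
            break
    return kept

print(nucleus_set({"the": 0.5, "a": 0.3, "cat": 0.15, "dog": 0.05}, p=0.7))
# → ['the', 'a']
```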
Vector search
Finding items by proximity in embedding space (nearest neighbors). Returns "similar" results, not "exact match" results.
vLLM (virtual LLM)
An inference serving engine (not a model) that hosts open-weight models behind an OpenAI-compatible HTTP endpoint. Infrastructure layer, not model layer. Relevant when moving from hosted APIs to self-hosting.
Weights
The learned parameters inside a model. Changed during training, fixed during inference.
Workhorse model
A general-purpose LLM optimized for speed and broad capability (e.g., GPT-4o-mini, Claude Haiku, Gemini Flash). The default for most tasks. When someone says "LLM" without qualification, they usually mean this.
Benchmark and Harness terms

Baseline
The first measured performance of your system on a benchmark. Everything else is compared against this. Without a baseline, you can't tell whether a change helped.
Benchmark
A fixed set of questions or tasks with known-good answers, used to measure system performance over time.
Run log
A structured record (typically JSONL) of every system run: what input was given, what output was produced, what tools were called, how long it took, and what it cost. The raw data that evals, telemetry, and cost analysis are built from.
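A minimal JSONL logger sketch. The field names (`question`, `tools_called`, `latency_s`, `cost_usd`) are illustrative, not a required schema:

```python
import json
import time

def log_run(path: str, record: dict) -> None:
    # Append one JSON object per line -- the JSONL format described above.
    record.setdefault("ts", time.time())
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

log_run("runs.jsonl", {
    "question": "what does validate_path return?",
    "answer": "...",
    "tools_called": ["retrieve"],
    "latency_s": 1.8,
    "cost_usd": 0.0021,
})
```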
Agent and Tool Building terms

A2A (Agent-to-Agent protocol)
An open protocol for peer-to-peer agent collaboration. Agents discover each other's capabilities and delegate or negotiate tasks as equals. Different from MCP (which connects agents to tools, not to other agents) and from handoffs (which transfer control within one system).
Agent
A system where an LLM decides which tools to call, observes results, and iterates until a task is complete. Agent = model + tools + control loop.
Control loop
The code that manages the agent's cycle: send prompt, check for tool calls, execute tools, append results, repeat or finish.
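That cycle fits in a short loop. Everything below is a stand-in for the sketch: `model` is any callable playing the role of a chat-completions call, and the `{"tool": ...}` / `{"answer": ...}` reply shapes are invented here, not a provider API:

```python
def run_agent(model, tools: dict, user_message: str, max_steps: int = 10):
    # Minimal control loop: send messages, execute requested tools,
    # append results, repeat until the model returns a final answer.
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        reply = model(messages)
        if "tool" in reply:
            # Execute the requested tool; append the result for the model to observe.
            result = tools[reply["tool"]](**reply["args"])
            messages.append({"role": "tool", "content": str(result)})
        else:
            return reply["answer"]
    raise RuntimeError("agent did not finish within max_steps")

# Demo with a scripted stand-in model: request one tool call, then answer.
def scripted_model(messages):
    if any(m["role"] == "tool" for m in messages):
        return {"answer": "2 + 2 = " + messages[-1]["content"]}
    return {"tool": "add", "args": {"a": 2, "b": 2}}

print(run_agent(scripted_model, {"add": lambda a, b: a + b}, "what is 2+2?"))
# → 2 + 2 = 4
```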
Handoff
Passing control from one agent or specialist to another within an orchestrated system.
MCP (Model Context Protocol)
An open protocol for exposing tools, resources, and prompts to AI applications in a standardized way. Connects agents to capabilities (tools and data), not to other agents.
Tool calling / function calling
The model's ability to request execution of a specific function with structured arguments, rather than just generating text.
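A tool is typically declared to the model as a JSON Schema description of its parameters. The shape below follows the common OpenAI-style envelope, but the exact format varies by provider, and `search_code` is a made-up tool for illustration:

```python
# An illustrative tool declaration in the OpenAI-style format.
# The model never runs this function -- it only emits a structured
# request ({"name": "search_code", "arguments": {...}}) that your
# control loop executes.
search_tool = {
    "type": "function",
    "function": {
        "name": "search_code",
        "description": "Search the indexed codebase for a symbol or phrase.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search text."},
                "top_k": {"type": "integer", "description": "Results to return."},
            },
            "required": ["query"],
        },
    },
}
```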
Code Retrieval terms

Context compilation / context packing
The process of selecting and assembling the smallest useful set of evidence for a specific task. Not "dump everything retrieved into the prompt."
Grounding
Tying model assertions to specific evidence. A grounded answer cites what it found; an ungrounded answer asserts without evidence.
Hybrid retrieval
Combining multiple retrieval methods (e.g., vector search + keyword search + metadata filters) and merging or reranking the results.
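One common merging strategy is reciprocal rank fusion (RRF), sketched below with made-up chunk IDs; reranking with a model is the more expensive alternative:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    # Each document earns 1 / (k + rank) per list it appears in;
    # documents ranked well by multiple retrievers rise to the top.
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["chunk_a", "chunk_b", "chunk_c"]   # from vector search
keyword_hits = ["chunk_b", "chunk_d"]             # from keyword search
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
# → ['chunk_b', 'chunk_a', 'chunk_d', 'chunk_c']
```

chunk_b wins because it appears in both lists, even though neither retriever ranked it first.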
Knowledge graph
A data structure that stores entities and their relationships explicitly (e.g., "function A calls function B," "module X imports module Y"). Useful for traversal and dependency reasoning. One retrieval strategy among several, often overused when simpler metadata or adjacency tables would suffice.
RAG (Retrieval-Augmented Generation)
A pattern where the model's response is grounded in retrieved external evidence rather than relying solely on its training data.
Symbol table
A mapping of code identifiers (functions, classes, variables) to their locations and metadata.
Tree-sitter
An incremental parsing library that builds ASTs for source code. Used in this curriculum for code-aware chunking and symbol extraction.
RAG and Grounded Answers terms

Context pack
A structured bundle of evidence assembled for a specific task, with metadata about provenance, relevance, and token budget.
Evidence bundle
A collection of retrieved items grouped for a specific sub-task, with enough metadata to evaluate whether the evidence is relevant and sufficient.
Retrieval routing
Deciding which retrieval strategy or method to use for a given query. Different questions need different retrieval methods.
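A toy keyword router in the spirit of this module's auto mode. A production router would more likely use an LLM classifier; the keyword lists here are invented for illustration:

```python
def route(query: str) -> str:
    # Toy router: map a question to a retrieval mode, or skip retrieval.
    # The mode names match the lesson's retrieve(query, mode) service.
    q = query.lower()
    if any(w in q for w in ("in python", "difference between", "what is a")):
        return "none"      # general knowledge: skip retrieval entirely
    if any(w in q for w in ("readme", "design doc", "docs", "how does")):
        return "docs"      # conceptual questions: documentation retrieval
    if any(w in q for w in ("return", "def ", "class ", "function", "symbol")):
        return "code"      # precise symbol questions: code retrieval
    return "hybrid"        # no strong signal: combine both

print(route("what does validate_path return?"))  # → code
```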
Observability and Evals terms

Eval
A structured test that measures system quality. Not the same as training. Evals measure, they don't change the model.
Harness (AI harness / eval harness)
The experiment and evaluation framework around your model or agent. It runs benchmark tasks, captures outputs, logs traces, grades results, and compares system versions. It turns ad hoc "try it and see" into repeatable, comparable experiments. Typically includes: input dataset, prompt and tool configuration, model/provider selection, execution loop, logging, grading, and artifact capture.
LLM-as-judge
Using a language model to evaluate or grade the output of another model or system. Useful for scaling evaluation beyond manual review, but it requires a well-designed rubric, judge-consistency checks, and human spot-checking. Not a replacement for exact-match checks where they apply.
OpenTelemetry (OTel)
An open standard for collecting and exporting telemetry data (traces, metrics, logs). Vendor-agnostic.
RAGAS
A specific eval framework for retrieval-augmented generation. Measures metrics like faithfulness, relevance, and context precision. One tool example, not a foundational concept. Learn the metrics first, then the tool.
Span
A single operation within a trace (e.g., one tool call, one retrieval query). Traces are made of spans.
Telemetry
Structured data about system behavior: what happened, when, how long it took, what it cost. Includes traces, metrics, and events.
Trace
A structured record of one complete run through the system, including all steps, tool calls, and decisions.
Orchestration and Memory terms

Long-term memory
Persistent facts that survive across conversations. Requires write policies to manage what gets stored, updated, or deleted.
Orchestration
Explicit control over how tasks are routed, delegated, and synthesized across multiple agents or specialists.
Router
A component that decides which specialist or workflow path to use for a given query.
Specialist
An agent or workflow tuned for a narrow task (e.g., "code search," "documentation lookup," "test generation"). Specialists are composed by an orchestrator.
Thread memory
Conversation state that persists within a single session or thread.
Workflow memory
Intermediate state that persists within a multi-step task but doesn't survive beyond the workflow's completion.
Optimization terms

Catastrophic forgetting
When fine-tuning causes a model to lose capabilities it had before training. The model gets better at the fine-tuned task but worse at tasks it previously handled. PEFT methods like LoRA reduce this risk by freezing original weights.
Distillation
Training a smaller (student) model to reproduce the behavior of a larger (teacher) model on a specific task.
DPO (Direct Preference Optimization)
A method for preference-based model optimization that's simpler than RLHF, training the model directly on preference pairs without a separate reward model.
Fine-tuning
Updating a model's weights on task-specific data to change its behavior permanently. An umbrella term that includes SFT, instruction tuning, RLHF, DPO, and other techniques. See the fine-tuning landscape table in Lesson 8.3 for how these relate.
Full fine-tuning
Updating all of a model's parameters during training, as opposed to PEFT methods that update only a small subset. Requires significantly more GPU memory and compute. Produces the most thorough adaptation but carries higher risk of catastrophic forgetting.
Inference server
Software (like vLLM or Ollama) that hosts a model and serves inference requests.
Instruction tuning
A specific application of SFT where the training data consists of instruction-response pairs. This is how base models become chat models: the technique is SFT, the data format is instructions. Not a separate technique from SFT.
LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning method that trains small adapter matrices instead of updating all model weights. Dramatically reduces GPU memory and compute requirements.
Parameter count
The number of learned weights in a model, commonly expressed in billions (e.g., "7B" = 7 billion parameters). Determines memory requirements (roughly 2 bytes per parameter at FP16) and broadly correlates with capability, though training quality and architecture matter as much as size. See Model Selection and Serving for sizing guidance.
PEFT (Parameter-Efficient Fine-Tuning)
A family of methods (including LoRA) that fine-tune a small subset of parameters instead of the full model.
Preference optimization
Training methods (RLHF, DPO) that use human or automated preference signals to improve model behavior. "This output is better than that output" rather than "this is the correct output."
QLoRA (Quantized LoRA)
LoRA applied to a quantized (compressed) base model. Further reduces memory requirements, enabling fine-tuning on consumer hardware.
Quantization
Reducing the precision of model weights (e.g., FP16 → INT4) to shrink memory usage and increase inference speed at some quality cost. A 7B model at FP16 needs ~14 GB VRAM; quantized to 4-bit, it fits in ~4 GB. Common formats include GGUF (llama.cpp/Ollama), GPTQ and AWQ (vLLM/HuggingFace). See Model Selection and Serving for format details and tradeoffs.
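The memory figures in the definition fall out of simple arithmetic (weights only; KV cache and activations add overhead on top):

```python
def model_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    # Weight memory = parameter count x bytes per parameter.
    # FP16 = 16 bits = 2 bytes; 4-bit quantization = 0.5 bytes.
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

print(model_memory_gb(7, 16))  # FP16:  14.0 GB
print(model_memory_gb(7, 4))   # 4-bit:  3.5 GB
```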
Overfitting
When a model memorizes training examples instead of learning generalizable patterns. The model performs well on training data but poorly on new inputs. Detected by monitoring validation loss alongside training loss.
RLHF (Reinforcement Learning from Human Feedback)
A training method that uses human preference signals to improve model behavior through a reward model. More complex than DPO (requires training a separate reward model) but offers more control over the optimization objective.
SFT (Supervised Fine-Tuning)
Fine-tuning using input-output pairs where the desired output is known. The most common fine-tuning approach.
TRL (Transformer Reinforcement Learning)
A Hugging Face library for training language models with reinforcement learning, SFT, and other optimization methods.
Cross-cutting terms

Consumer chat app
The browser or desktop product meant for human conversation (ChatGPT, Claude, HuggingChat). Useful for experimentation, but not the same as API access.
Developer platform
The provider's API, billing, API-key, and developer-docs surface. This is what you need for this learning path.
Hosted API
The provider runs the model for you and you call it over HTTP.
Local inference
You run the model on your own machine.
Provider
The company or service that hosts a model API you call from code.
Prompt caching
Reusing computation from repeated prompt prefixes to reduce latency and cost on subsequent requests with the same prefix.
Rate limiting
Constraints on how many API requests you can make per unit of time. An operational concern that affects system design and cost.
Token budget
The maximum number of tokens you allocate for a specific part of the context (e.g., "retrieval evidence gets at most 4K tokens"). A context engineering tool for preventing any single component from dominating the context window.
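Enforcing a budget can be as simple as keeping ranked chunks until the allocation is spent. `trim_to_budget` is a sketch that uses the rough chars-per-token heuristic in place of a real tokenizer:

```python
def trim_to_budget(chunks: list[str], budget_tokens: int,
                   chars_per_token: int = 4) -> list[str]:
    # Keep whole chunks (assumed already ranked by relevance)
    # until the token budget is spent; drop the rest.
    kept, used = [], 0
    for chunk in chunks:
        cost = max(1, len(chunk) // chars_per_token)
        if used + cost > budget_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept
```

Keeping whole chunks (rather than cutting mid-chunk) trades a few unused tokens for evidence the model can actually read.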