Module 4: Code Retrieval
Choosing a Retrieval Strategy

When to Use Which Retrieval Method

By the end of Module 3, your agent can read files, search text, and answer questions about your codebase. That's useful, but you've probably noticed its limits: some questions need more than what grep can offer. At this point it may seem like the next step is to reach for a vector database, but that instinct leads people astray more often than it helps. So before we build any retrieval infrastructure, we'll spend this lesson developing the judgment to pick the right retrieval method for each problem class.

RAG is a pattern, not a database choice. The "R" in RAG (retrieval) can be a file path lookup, a grep command, a SQL query, a symbol table scan, a vector search, a graph traversal, or any combination. The retrieval method you choose should match the question you're answering, not the hype cycle you're in.

What you'll learn

  • Evaluate nine retrieval methods and identify which question types each one handles well
  • Recognize when simpler retrieval methods outperform vector search
  • Build a structured JSON metadata index for your anchor repo and query it
  • Compare structured retrieval against the grep-based tools from Module 3 on the same benchmark questions
  • Use the retrieval method chooser as an ongoing decision framework

Concepts

Retrieval method: the underlying mechanism you use to find relevant information. A vector database is one retrieval method. A grep command is another. A SQL query against a metadata table is a third. The method you choose determines what kinds of questions you can answer efficiently.

RAG (Retrieval-Augmented Generation): a pattern where you retrieve relevant information, insert it into the model's context, and let the model generate an answer grounded in that evidence. RAG doesn't require any specific database. It requires a retrieval step, a context assembly step, and a generation step. We'll build the full pipeline in Module 5; this module focuses on making the retrieval step excellent.
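Those three steps can be sketched in a few lines. This is a toy illustration, not the Module 5 pipeline: the corpus, the keyword-overlap retrieve step, and the generate() stub are all made-up stand-ins (a real system would swap in one of the retrieval methods below and an actual model call).

```python
import re

# Toy two-file "corpus"; a real pipeline retrieves from your indexed repo.
CORPUS = {
    "auth.py": "def verify_credentials(user, password): ...",
    "routes.py": "def register_routes(app): ...",
}

def retrieve(query: str) -> list[str]:
    """Step 1: retrieval. Here, naive keyword overlap; it could equally
    be grep, a SQL query, vector search, or a graph walk."""
    terms = set(re.findall(r"\w+", query.lower()))
    return [
        f"{path}: {text}"
        for path, text in CORPUS.items()
        if terms & set(re.findall(r"\w+", text.lower()))
    ]

def assemble_context(evidence: list[str], budget_chars: int = 2000) -> str:
    """Step 2: context assembly. Pack evidence under a size budget."""
    return "\n".join(evidence)[:budget_chars]

def generate(query: str, context: str) -> str:
    """Step 3: generation. A real system sends this prompt to an LLM."""
    return f"Question: {query}\nEvidence:\n{context}"

print(generate("where is verify_credentials defined?",
               assemble_context(retrieve("verify_credentials"))))
```

Swapping the body of retrieve() is all it takes to change retrieval method; the other two steps stay the same. That separation is the whole point of treating RAG as a pattern.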

Lexical search: finding documents by matching exact terms. Grep is the simplest form. BM25 is a more sophisticated version that accounts for term frequency and document length. Lexical search excels when the user knows the exact identifier, error message, or string they're looking for.

Semantic search: finding documents by meaning rather than exact terms. This is what vector databases do: they encode text as numerical vectors and find chunks whose vectors are close to the query's vector. Semantic search helps when the user describes what they want in different words than the code uses.
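The core operation underneath any vector database is nearest-neighbor lookup by similarity, usually cosine similarity. A sketch with tiny made-up 3-dimensional vectors (real embedding models produce hundreds of dimensions):

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product of the vectors over the product
    of their lengths. 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for two code chunks; the numbers are invented.
chunks = {
    "verify_credentials": [0.9, 0.1, 0.0],
    "render_template":    [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.3, 0.1]  # imagine this encodes "how does auth work?"

best = max(chunks, key=lambda name: cosine(query_vec, chunks[name]))
print(best)  # verify_credentials
```

Note that the query never mentions the string "verify_credentials"; proximity in embedding space is what bridges the vocabulary gap, which is exactly what lexical search cannot do.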

Hybrid search: combining lexical and semantic retrieval, then merging or reranking the results. This is often better than either approach alone, but it's also more complex to build and tune.
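One common merge strategy is reciprocal rank fusion (RRF), which you'll see again in the chooser table below. It needs only the rank positions from each method, not their (incomparable) scores. The two ranked lists here are invented examples:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked result lists: each doc scores sum(1 / (k + rank)).
    Docs ranked highly by multiple methods rise to the top; k damps
    the advantage of a single first-place finish."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

lexical = ["auth.py", "routes.py", "utils.py"]     # e.g., BM25 order
semantic = ["session.py", "auth.py", "tokens.py"]  # e.g., vector order
print(reciprocal_rank_fusion([lexical, semantic]))
```

auth.py wins because both methods ranked it, even though neither ranked it first in both lists; that "agreement bonus" is why hybrid search often beats either method alone.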

Reranker: a second-pass model that takes the initial retrieval results and re-scores them for relevance. A reranker sees both the query and each candidate together, which lets it make finer-grained relevance judgments than the initial retrieval pass.
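The reranker's interface is worth seeing even without a model: score each (query, candidate) pair together, then re-sort. In this sketch a crude token-overlap score stands in for the cross-encoder model a real reranker would use; the candidate strings are made up.

```python
def rerank(query: str, candidates: list[str], top_n: int = 3) -> list[str]:
    """Second-pass scoring: each candidate is scored together with the
    query. Token overlap is a stand-in for a cross-encoder's score."""
    q_tokens = set(query.lower().split())

    def score(candidate: str) -> float:
        c_tokens = set(candidate.lower().split())
        return len(q_tokens & c_tokens) / len(q_tokens)

    return sorted(candidates, key=score, reverse=True)[:top_n]

candidates = [  # imagine these came back from an initial retrieval pass
    "def load_config(path): parse the config file",
    "def verify_credentials(user): check the password hash",
    "class Router: dispatch requests to handlers",
]
print(rerank("how is the password checked", candidates, top_n=2))
```

The shape is the important part: a reranker cannot run without candidates, which is why the tables below list it as a layer on top of any retrieval method rather than a method of its own.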

Problem-to-Tool Map

| Problem class | Symptom | Cheapest thing to try first | Tool or approach |
| --- | --- | --- | --- |
| Know which file to look at | The answer is in a predictable location (README, config, specific module) | Hardcoded path rules | File tree + path metadata |
| Need an exact identifier | Looking for a function name, class name, error string, or config key | grep/regex in Module 3 tools | Grep / regex search |
| Keyword-heavy code search | Need documents containing specific terms but with ranking | grep with manual sorting | BM25 / lexical search |
| Structured metadata queries | "Which files were modified most recently?" or "List all Python files importing X" | Manual file inspection | JSON / SQL metadata index |
| Symbol lookup | "Where is UserService defined?" or "What methods does Router have?" | grep for the symbol name | AST / symbol index |
| Conceptual questions | "How does authentication work?" (user's words differ from code's terms) | Keyword search | Vector search |
| Mixed exact + conceptual | Some questions need exact matches, others need semantic similarity | Run both and eyeball | Hybrid lexical + vector |
| Relationship questions | "What calls this function?" or "What breaks if I change this file?" | grep for import/usage | Graph traversal |
| Too many results from any method | Top-k returns partly relevant, partly noise | Increase k and hope | Reranker on top of any of the above |

The retrieval method chooser

This table is your decision framework. Consult it before building anything; it will save you from over-engineering retrieval for problems that have simpler solutions.

| Method | Good for | Weak for | Cheapest implementation | Upgrade signal |
| --- | --- | --- | --- | --- |
| File tree + path metadata | Known locations, configuration files, READMEs, directory conventions | Anything requiring content understanding | os.listdir + path pattern matching | You need to search inside files, not just find them |
| Grep / regex | Exact identifiers, error strings, import statements, config keys | Semantic similarity, fuzzy matches, typo tolerance | Your existing search_text tool from Module 3 | Queries use different words than the code (e.g., "auth" vs. verify_credentials) |
| BM25 / lexical search | Ranked keyword search, documentation, comments, docstrings | Conceptual questions where vocabulary differs | rank_bm25 Python library over your chunked corpus | Relevant results rank below irrelevant ones because of vocabulary mismatch |
| JSON / SQL metadata index | File metadata, symbol lists, dependency tracking, structured queries | Free-text conceptual search | A JSON file mapping filenames to metadata (language, imports, exports, size) | You need to search content semantics, not just attributes |
| AST / symbol index | Symbol lookup, function signatures, class hierarchies, definition locations | Cross-file relationship reasoning, natural language questions | Tree-sitter parse + symbol table (we'll build this in the AST-aware lesson) | You need to answer "what calls this?" or "what depends on this?" |
| Vector search | Conceptual similarity, natural language questions, documentation search | Exact identifier lookup, structured queries, relationship traversal | Embedding model + Qdrant (we'll build this in the naive baseline lesson) | You need exact matches and semantic matches together |
| Hybrid lexical + vector | Mixed question types, production systems that serve varied queries | Simple use cases where one method is clearly sufficient | BM25 + vector search with reciprocal rank fusion | Your retrieval needs are narrow enough that one method works fine |
| Graph traversal | Import chains, call graphs, dependency impact, "what breaks if I change X?" | Similarity-based questions, conceptual search | NetworkX with import/call edges (we'll build this in the graph/hybrid lesson) | Your questions don't involve relationships between code entities |
| Reranker | Improving precision in the top results from any retrieval method | Being a standalone retrieval method (it needs candidates to rerank) | Cross-encoder model on top of initial results | Your initial retrieval already returns mostly relevant results |

For the full decision matrix with additional columns and edge cases, see the Retrieval Method Chooser reference page.
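As a rough first pass, the chooser's logic can even be expressed as a routing function. The keyword heuristics below are illustrative assumptions, not a production router; the identifier heuristic mirrors the CamelCase/snake_case check used later in compare_substrates.py.

```python
def looks_like_identifier(word: str) -> bool:
    """Heuristic: snake_case or mixed-case words are probably code symbols."""
    w = word.strip("?.,\"'")
    return "_" in w or (any(c.isupper() for c in w[1:]) and any(c.islower() for c in w))

def choose_method(question: str) -> str:
    """Route a question to a retrieval method, mirroring the chooser table.
    The phrase lists are illustrative; tune them to your own benchmark."""
    q = question.lower()
    has_identifier = any(looks_like_identifier(w) for w in question.split())
    if any(p in q for p in ("what calls", "what imports", "depends on", "what breaks")):
        return "graph traversal"
    if has_identifier and any(p in q for p in ("where is", "defined", "methods does")):
        return "AST / symbol index"
    if any(p in q for p in ("which files", "what files", "list all", "most recently")):
        return "metadata index"
    if has_identifier:
        return "grep / lexical"
    return "vector search"  # conceptual phrasing, no exact symbols to match

print(choose_method("Where is UserService defined?"))
print(choose_method("How does error handling work?"))
```

A routing function like this is also a useful artifact for the exercises below: every question it misroutes tells you which heuristic your repo actually needs.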

Walkthrough

Start cheap, upgrade on evidence

The most effective retrieval systems are built incrementally. Don't start with a vector database and graph store. Start with the simplest method that answers your questions, observe where it fails, and upgrade only the methods that need upgrading.

Your Module 3 agent already has grep-based retrieval. For many question types, like exact symbol lookup, error string search, and import tracing, grep is often good enough. The goal of this lesson is to build one more retrieval method (structured metadata) and see how far it gets before we need vectors.

Build a structured metadata index

We'll create a JSON index that stores metadata about every file in your anchor repo. This gives you a queryable data structure for questions like "which files define classes?" or "what are the entry points?". Grep can answer these, but not well and not efficiently.

cd anchor-repo
mkdir -p retrieval
# retrieval/build_metadata_index.py
"""Build a JSON metadata index for the anchor repository."""
import ast
import json
import os
from pathlib import Path

REPO_ROOT = Path(".").resolve()
EXCLUDED_DIRS = {".venv", ".git", "__pycache__", "node_modules", ".tox", ".mypy_cache"}
INDEX_PATH = Path("retrieval/metadata_index.json")


def is_excluded(path: Path) -> bool:
    return any(part in EXCLUDED_DIRS for part in path.parts)


def extract_python_metadata(file_path: Path) -> dict:
    """Extract metadata from a Python file using the ast module."""
    source = file_path.read_text(errors="replace")
    metadata = {
        "functions": [],
        "classes": [],
        "imports": [],
        "line_count": len(source.splitlines()),
    }
    try:
        tree = ast.parse(source)
    except SyntaxError:
        metadata["parse_error"] = True
        return metadata

    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            metadata["functions"].append({
                "name": node.name,
                "line": node.lineno,
                "args": [a.arg for a in node.args.args],
            })
        elif isinstance(node, ast.ClassDef):
            methods = [
                n.name for n in node.body
                if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))
            ]
            metadata["classes"].append({
                "name": node.name,
                "line": node.lineno,
                "methods": methods,
            })
        elif isinstance(node, ast.Import):
            for alias in node.names:
                metadata["imports"].append(alias.name)
        elif isinstance(node, ast.ImportFrom):
            if node.module:
                metadata["imports"].append(node.module)
    return metadata


def build_index() -> dict:
    """Walk the repo and build metadata for every file."""
    index = {}
    for root, dirs, files in os.walk(REPO_ROOT):
        # Skip excluded directories
        dirs[:] = [d for d in dirs if d not in EXCLUDED_DIRS]
        for fname in files:
            fpath = Path(root) / fname
            rel = str(fpath.relative_to(REPO_ROOT))
            entry = {
                "path": rel,
                "extension": fpath.suffix,
                "size_bytes": fpath.stat().st_size,
            }
            if fpath.suffix == ".py":
                entry.update(extract_python_metadata(fpath))
            index[rel] = entry
    return index


if __name__ == "__main__":
    index = build_index()
    INDEX_PATH.write_text(json.dumps(index, indent=2))
    py_files = [k for k, v in index.items() if v["extension"] == ".py"]
    total_functions = sum(len(v.get("functions", [])) for v in index.values())
    total_classes = sum(len(v.get("classes", [])) for v in index.values())
    print(f"Indexed {len(index)} files ({len(py_files)} Python)")
    print(f"Found {total_functions} functions, {total_classes} classes")
    print(f"Index saved to {INDEX_PATH}")
python retrieval/build_metadata_index.py

Expected output (these numbers will vary based on your anchor repo):

Indexed 47 files (23 Python)
Found 68 functions, 12 classes
Index saved to retrieval/metadata_index.json

Query the metadata index

Now build a query tool that can answer structured questions using this index:

# retrieval/query_metadata.py
"""Query the metadata index for structured code questions."""
import json
from pathlib import Path

INDEX_PATH = Path("retrieval/metadata_index.json")


def load_index() -> dict:
    return json.loads(INDEX_PATH.read_text())


def find_symbol(name: str, index: dict = None) -> list[dict]:
    """Find where a function or class is defined."""
    if index is None:
        index = load_index()
    results = []
    for path, meta in index.items():
        for fn in meta.get("functions", []):
            if name.lower() in fn["name"].lower():
                results.append({
                    "type": "function",
                    "name": fn["name"],
                    "file": path,
                    "line": fn["line"],
                    "args": fn["args"],
                })
        for cls in meta.get("classes", []):
            if name.lower() in cls["name"].lower():
                results.append({
                    "type": "class",
                    "name": cls["name"],
                    "file": path,
                    "line": cls["line"],
                    "methods": cls["methods"],
                })
    return results


def find_importers(module_name: str, index: dict = None) -> list[str]:
    """Find files that import a given module."""
    if index is None:
        index = load_index()
    return [
        path for path, meta in index.items()
        if module_name in meta.get("imports", [])
    ]


def list_entry_points(index: dict = None) -> list[dict]:
    """Find likely entry points: files with if __name__ == '__main__' or main()."""
    if index is None:
        index = load_index()
    results = []
    for path, meta in index.items():
        if meta.get("extension") != ".py":
            continue
        fn_names = [f["name"] for f in meta.get("functions", [])]
        if "main" in fn_names or path.endswith("__main__.py"):
            results.append({"file": path, "functions": fn_names})
    return results


if __name__ == "__main__":
    import sys
    index = load_index()
    if len(sys.argv) > 1:
        query = sys.argv[1]
        print(f"Searching for symbol: {query}")
        results = find_symbol(query, index)
        for r in results:
            print(f"  {r['type']} {r['name']} in {r['file']}:{r['line']}")
        if not results:
            print("  No matches found")
        print(f"\nFiles importing '{query}':")
        importers = find_importers(query, index)
        for f in importers:
            print(f"  {f}")
        if not importers:
            print("  None")
    else:
        print("Entry points:")
        for ep in list_entry_points(index):
            print(f"  {ep['file']}: {ep['functions']}")
# Search for a symbol in your repo
python retrieval/query_metadata.py "UserService"

# Or list entry points
python retrieval/query_metadata.py

Tier 0.5: Compare structured retrieval against your Module 3 tools

This is where the decision framework starts to pay off. We'll run a subset of your benchmark questions through three retrieval methods: grep from Module 3, the metadata index from this lesson, and, in the next lesson, vector search. Then we'll compare which questions each one handles well.

# retrieval/compare_substrates.py
"""Compare grep vs. metadata index on benchmark questions."""
import json
import subprocess
from pathlib import Path
from retrieval.query_metadata import load_index, find_symbol, find_importers

BENCHMARK_FILE = Path("benchmark-questions.jsonl")
REPO_ROOT = Path(".").resolve()
EXCLUDED_DIRS = {".venv", ".git", "__pycache__", "node_modules"}


def grep_search(query: str) -> list[str]:
    """Run grep and return matching file paths."""
    exclude_args = []
    for d in EXCLUDED_DIRS:
        exclude_args.extend(["--exclude-dir", d])
    cmd = ["grep", "-rl", "--include=*.py"] + exclude_args + [query, "."]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=10, cwd=REPO_ROOT)
        return [line.strip() for line in result.stdout.strip().split("\n") if line.strip()]
    except subprocess.TimeoutExpired:
        return []


def metadata_search(query: str, index: dict) -> list[str]:
    """Search the metadata index for symbols matching the query."""
    symbols = find_symbol(query, index)
    importers = find_importers(query, index)
    files = list(set([s["file"] for s in symbols] + importers))
    return files


def run_comparison():
    questions = []
    with open(BENCHMARK_FILE) as f:
        for line in f:
            if line.strip():
                questions.append(json.loads(line))

    index = load_index()

    print(f"{'Category':<20} {'Question (truncated)':<45} {'Grep':<8} {'Metadata':<8}")
    print("-" * 85)

    for q in questions[:15]:  # Compare first 15 questions
        # Extract a likely search term from the question
        # In practice, you'd use the model to extract terms; here we use a simple heuristic
        words = q["question"].split()
        # Look for CamelCase or snake_case terms as likely identifiers
        search_terms = [
            w.strip("?.,\"'") for w in words
            if ("_" in w or (any(c.isupper() for c in w[1:]) and any(c.islower() for c in w)))
        ]
        search_term = search_terms[0] if search_terms else words[-2] if len(words) > 1 else words[0]

        grep_results = grep_search(search_term)
        meta_results = metadata_search(search_term, index)

        print(f"{q['category']:<20} {q['question'][:43]:<45} {len(grep_results):<8} {len(meta_results):<8}")


if __name__ == "__main__":
    run_comparison()
python -m retrieval.compare_substrates

You should see a table showing how many files each method found for each question. Watch for these patterns:

  • Symbol lookup questions: the metadata index will often find the exact file and line, while grep returns more noise
  • "What imports X?" questions: the metadata index answers directly, grep gives partial matches
  • Conceptual questions ("How does authentication work?"): neither grep nor the metadata index handles these well. That's the signal that you'll need semantic search

These patterns are precisely what the retrieval method chooser predicts. The goal here isn't to build one retrieval method that handles everything, but rather to know which approach to reach for based on the question type.

When you don't need a vector database

I'll be blunt here, because a lot of content out there encourages teams to waste weeks building vector retrieval infrastructure for problems that grep solves in milliseconds. You don't need a vector database by default. Skip vector search when any of these apply:

  1. Your questions use the same vocabulary as your code. If the user asks "where is validate_path defined?" that's grep territory. Embeddings add latency and lose precision for exact matches.

  2. Your corpus is small enough to scan. If your repo has fewer than a few thousand files, grep over the full codebase runs in under a second. Vector search adds complexity without meaningful speed benefit at this scale.

  3. Your questions are structural, not semantic. "What files import datetime?" is a metadata query. "What methods does Router have?" is a symbol table lookup. Neither needs embeddings.

  4. You need exact recall. Vector search is approximate by nature. If you need to guarantee that a specific identifier appears in the results, lexical search is more reliable.

You do need vector search (or something beyond lexical) when:

  • The user's vocabulary differs from the code's vocabulary ("auth flow" vs. verify_credentials)
  • The question is conceptual ("how does error handling work in this codebase?")
  • You need to find code that's semantically similar to a description
  • Your corpus is large enough that scanning is too slow

We'll build that vector search baseline in the next lesson. But we'll build it knowing exactly which questions it needs to answer: the ones our simpler methods can't handle.

Exercises

  1. Build the metadata index (build_metadata_index.py) for your anchor repo. Inspect the JSON output and verify it captured your files, functions, and classes accurately.
  2. Run query_metadata.py against five symbol names from your repo. Compare the results against what you get from the search_text tool in agent/tools.py. Note which approach gives you more precise results for each query.
  3. Run compare_substrates.py against your benchmark questions. Categorize each question as "grep handles this," "metadata handles this," "neither handles this well."
  4. For the questions in the "neither handles this" category, write down what kind of retrieval you think would help. Don't look ahead. Form your own hypothesis first.
  5. Add a find_dependents function to query_metadata.py that answers "which files would be affected if I changed file X?" by tracing import relationships. Test it on a core module in your repo.

Completion checkpoint

You should now have:

  • A working metadata index covering all files in your anchor repo
  • Query functions that answer symbol lookup and import-tracing questions
  • A comparison showing which benchmark questions each retrieval method handles well
  • A categorized list of questions that need semantic retrieval (this will become your test set for the next lesson)
  • A clear understanding of when simpler retrieval outperforms vector search

Reflection prompts

  • Which of your benchmark questions were answered well by grep alone? What do those questions have in common?
  • Which questions did the metadata index handle better than grep? What structural information made the difference?
  • For the questions that neither method handled, what's missing: vocabulary mapping, conceptual understanding, or relationship awareness?
  • Looking at the retrieval method chooser table, which methods do you think your final system will need to combine? Why?

What's next

Naive Vector Baseline. Start with the obvious semantic-search baseline so you can see exactly what it helps with and what it still breaks.

References

Start here

  • Retrieval Method Chooser — the full decision matrix for all nine retrieval methods, with additional edge cases and upgrade signals

Build with this

  • rank_bm25 on PyPI — a lightweight BM25 implementation for when you need ranked lexical search beyond grep
  • Python ast module — the standard library module we used for extracting Python metadata



Glossary
Foundational terms

API (Application Programming Interface)
A structured way for programs to communicate. In this context, usually an HTTP endpoint you call to interact with an LLM.

AST (Abstract Syntax Tree)
A tree representation of source code structure. Used by parsers like Tree-sitter to understand code as a hierarchy of functions, classes, and statements. You'll encounter this more deeply in the Code Retrieval module, but the concept appears briefly in retrieval fundamentals.

BM25 (Best Match 25)
A classical ranking function for keyword search. Scores documents by term frequency and inverse document frequency. Often competitive with or complementary to vector search.

Chunking
Splitting a document into smaller pieces for indexing and retrieval. Chunk boundaries significantly affect retrieval quality. Split at the wrong place and your retrieval will return half a function or the end of one paragraph glued to the start of another.

Context engineering
The discipline of selecting, packaging, and budgeting the information a model sees at inference time. Prompts, retrieved evidence, tool results, memory, and state are all parts of context. Context engineering is arguably the core skill of AI engineering. Bigger context windows are not a substitute for better context selection.

Context rot
Degradation of output quality caused by stale, noisy, or accumulated context. Symptoms include stale memory facts, conflicting retrieved evidence, bloated prompt history, and accumulated instructions that contradict each other. A form of technical debt in AI systems.

Context window
The maximum number of tokens an LLM can process in a single request (input + output combined).

Embedding
A fixed-length numeric vector representing a piece of text. Used for similarity search: texts with similar meanings have nearby embeddings.

Endpoint
A specific URL path that accepts requests and returns responses (e.g., POST /v1/chat/completions).

GGUF
A file format for quantized models used by llama.cpp and Ollama. When you see a model name like qwen2.5:7b-q4_K_M, the suffix indicates the quantization scheme. GGUF supports mixed quantization (different precision for different layers) and is the most common format for local inference.

Hallucination
When a model generates content that sounds confident but isn't supported by the evidence it was given, or fabricates details that don't exist. Not the same as "any wrong answer"; a model that misinterprets ambiguous instructions gave a bad answer but didn't hallucinate. Common causes: weak prompt, missing context, context rot, model limitation, or retrieval failure.

Inference
Running a trained model to generate output from input. What happens when you call an API. Most AI engineering work is inference-time work: building systems around models, not training them. Use "inference," not "inferencing."

JSON (JavaScript Object Notation)
A lightweight text format for structured data. The lingua franca of API communication.

Lexical search
Finding items by matching keywords or terms. Includes BM25, TF-IDF (Term Frequency–Inverse Document Frequency), and simple keyword matching. Returns exact term matches, not semantic similarity.

LLM (Large Language Model)
A neural network trained on large text corpora that generates text by predicting the next token. The core technology behind AI engineering; every tool, pattern, and pipeline in this curriculum runs on top of one.

Metadata
Structured information about a document or chunk (file path, language, author, date, symbol type). Used for filtering retrieval results.

Neural network
A computing system loosely inspired by biological neurons, built from layers of mathematical functions that transform inputs into outputs. LLMs are a specific type of neural network (transformers) trained on text. You don't need to understand neural network internals to do AI engineering, but knowing the term helps when reading external resources.

Reasoning model
A model optimized for complex multi-step planning, math, and logic (e.g., o3, o4-mini). Slower and more expensive but better on hard problems. Sometimes called "LRM" (large reasoning model), but "reasoning model" is the more consistent term across provider docs.

Reranking
A second-pass scoring step that re-orders retrieved results using a more expensive model. Improves precision after an initial broad retrieval.

Schema
A formal description of the shape and types of a data structure. Used to validate inputs and outputs.

SLM (small language model)
A compact model (typically 1-7B parameters) that runs on consumer hardware with lower cost, latency, and better privacy (e.g., Phi, small Llama variants, Gemma). The right choice when privacy, offline operation, predictable cost, or low latency matter more than peak capability.

System prompt
A special message that sets the model's behavior, role, and constraints for a conversation.

Temperature
A parameter controlling output randomness. Lower values produce more deterministic output; higher values produce more varied output. Does not affect the model's intelligence.

Token
The basic unit an LLM processes. Not a word. Tokens are sub-word fragments. "unhappiness" might be three tokens: "un", "happi", "ness". Token count determines cost and context window usage.

Top-k
The number of results returned from a retrieval query. "Top-5" means the five highest-scoring results.

Top-p (nucleus sampling)
An alternative to temperature for controlling output diversity. Selects from the smallest set of tokens whose cumulative probability exceeds p.

Vector search
Finding items by proximity in embedding space (nearest neighbors). Returns "similar" results, not "exact match" results.

vLLM (virtual LLM)
An inference serving engine (not a model) that hosts open-weight models behind an OpenAI-compatible HTTP endpoint. Infrastructure layer, not model layer. Relevant when moving from hosted APIs to self-hosting.

Weights
The learned parameters inside a model. Changed during training, fixed during inference.

Workhorse model
A general-purpose LLM optimized for speed and broad capability (e.g., GPT-4o-mini, Claude Haiku, Gemini Flash). The default for most tasks. When someone says "LLM" without qualification, they usually mean this.
Benchmark and Harness terms

Baseline
The first measured performance of your system on a benchmark. Everything else is compared against this. Without a baseline, you can't tell whether a change helped.

Benchmark
A fixed set of questions or tasks with known-good answers, used to measure system performance over time.

Run log
A structured record (typically JSONL) of every system run: what input was given, what output was produced, what tools were called, how long it took, and what it cost. The raw data that evals, telemetry, and cost analysis are built from.
Agent and Tool Building terms

A2A (Agent-to-Agent protocol)
An open protocol for peer-to-peer agent collaboration. Agents discover each other's capabilities and delegate or negotiate tasks as equals. Different from MCP (which connects agents to tools, not to other agents) and from handoffs (which transfer control within one system).

Agent
A system where an LLM decides which tools to call, observes results, and iterates until a task is complete. Agent = model + tools + control loop.

Control loop
The code that manages the agent's cycle: send prompt, check for tool calls, execute tools, append results, repeat or finish.

Handoff
Passing control from one agent or specialist to another within an orchestrated system.

MCP (Model Context Protocol)
An open protocol for exposing tools, resources, and prompts to AI applications in a standardized way. Connects agents to capabilities (tools and data), not to other agents.

Tool calling / function calling
The model's ability to request execution of a specific function with structured arguments, rather than just generating text.
Code Retrieval terms

Context compilation / context packing
The process of selecting and assembling the smallest useful set of evidence for a specific task. Not "dump everything retrieved into the prompt."

Grounding
Tying model assertions to specific evidence. A grounded answer cites what it found; an ungrounded answer asserts without evidence.

Hybrid retrieval
Combining multiple retrieval methods (e.g., vector search + keyword search + metadata filters) and merging or reranking the results.

Knowledge graph
A data structure that stores entities and their relationships explicitly (e.g., "function A calls function B," "module X imports module Y"). Useful for traversal and dependency reasoning. One retrieval strategy among several, often overused when simpler metadata or adjacency tables would suffice.

RAG (Retrieval-Augmented Generation)
A pattern where the model's response is grounded in retrieved external evidence rather than relying solely on its training data.

Symbol table
A mapping of code identifiers (functions, classes, variables) to their locations and metadata.

Tree-sitter
An incremental parsing library that builds ASTs for source code. Used in this curriculum for code-aware chunking and symbol extraction.
### RAG and Grounded Answers terms

**Context pack**
A structured bundle of evidence assembled for a specific task, with metadata about provenance, relevance, and token budget.
**Evidence bundle**
A collection of retrieved items grouped for a specific sub-task, with enough metadata to evaluate whether the evidence is relevant and sufficient.
**Retrieval routing**
Deciding which retrieval strategy or method to use for a given query. Different questions need different retrieval methods.
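A router can start as a handful of heuristics over the query text. The patterns below are illustrative assumptions, not a production router:

```python
import re

def route(query: str) -> str:
    """Pick a retrieval method from surface features of the query.
    These patterns are illustrative heuristics only."""
    if re.search(r"\b(calls|imports|depends on)\b", query):
        return "graph_traversal"   # relationship question
    if re.search(r'"[^"]+"', query):
        return "grep"              # quoted literal: exact text search
    if re.search(r"[A-Za-z_]+\(\)|\bdef |\bclass ", query):
        return "symbol_table"      # mentions a specific code identifier
    return "vector_search"         # fuzzy conceptual question

print(route('what calls parse_config()?'))             # graph_traversal
print(route('where is "connection refused" logged?'))  # grep
print(route('how does authentication work?'))          # vector_search
```

Even a crude router like this captures the lesson's core claim: different question shapes map to different retrieval methods.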
### Observability and Evals terms

**Eval**
A structured test that measures system quality. Not the same as training: evals measure; they don't change the model.
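The skeleton of an eval is small. A toy exact-match sketch, where the "system" is just a dict lookup standing in for your agent:

```python
def run_eval(system, cases):
    """Minimal exact-match eval: score a system against known answers."""
    passed = sum(1 for question, expected in cases if system(question) == expected)
    return passed / len(cases)

# A stand-in "system" and a tiny benchmark, purely for illustration.
fake_agent = {"capital of France?": "Paris", "2+2?": "4"}.get
cases = [("capital of France?", "Paris"), ("2+2?", "5")]
print(run_eval(fake_agent, cases))  # 0.5
```

Everything else in an eval harness (datasets, grading, traces) is elaboration on this loop.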
**Harness (AI harness / eval harness)**
The experiment and evaluation framework around your model or agent. It runs benchmark tasks, captures outputs, logs traces, grades results, and compares system versions. It turns ad hoc "try it and see" into repeatable, comparable experiments. Typically includes: input dataset, prompt and tool configuration, model/provider selection, execution loop, logging, grading, and artifact capture.
**LLM-as-judge**
Using a language model to evaluate or grade the output of another model or system. Useful for scaling evaluation beyond manual review, but requires rubric quality, judge consistency checks, and human spot-checking. Not a replacement for exact-match checks where they apply.
**OpenTelemetry (OTel)**
An open standard for collecting and exporting telemetry data (traces, metrics, logs). Vendor-agnostic.
**RAGAS**
A specific eval framework for retrieval-augmented generation. Measures metrics like faithfulness, relevance, and context precision. One tool example, not a foundational concept. Learn the metrics first, then the tool.
**Span**
A single operation within a trace (e.g., one tool call, one retrieval query). Traces are made of spans.
**Telemetry**
Structured data about system behavior: what happened, when, how long it took, what it cost. Includes traces, metrics, and events.
**Trace**
A structured record of one complete run through the system, including all steps, tool calls, and decisions.
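The trace/span relationship can be sketched with nothing but the standard library; a real system would export these through OpenTelemetry rather than a module-level list:

```python
import time
from contextlib import contextmanager

spans = []  # a flat list standing in for a trace exporter

@contextmanager
def span(name):
    """Record one operation (a span) with its duration, in the OTel spirit."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({"name": name, "duration_s": time.perf_counter() - start})

with span("retrieval"):          # outer span: the whole operation
    with span("grep_query"):     # inner span: one step within it
        time.sleep(0.01)

print([s["name"] for s in spans])  # inner spans finish (and record) first
```

Nesting is what turns a pile of spans into a trace: each span knows its place inside a larger operation.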
### Orchestration and Memory terms

**Long-term memory**
Persistent facts that survive across conversations. Requires write policies to manage what gets stored, updated, or deleted.
**Orchestration**
Explicit control over how tasks are routed, delegated, and synthesized across multiple agents or specialists.
**Router**
A component that decides which specialist or workflow path to use for a given query.
**Specialist**
An agent or workflow tuned for a narrow task (e.g., "code search," "documentation lookup," "test generation"). Specialists are composed by an orchestrator.
**Thread memory**
Conversation state that persists within a single session or thread.
**Workflow memory**
Intermediate state that persists within a multi-step task but doesn't survive beyond the workflow's completion.
### Optimization terms

**Catastrophic forgetting**
When fine-tuning causes a model to lose capabilities it had before training. The model gets better at the fine-tuned task but worse at tasks it previously handled. PEFT methods like LoRA reduce this risk by freezing original weights.
**Distillation**
Training a smaller (student) model to reproduce the behavior of a larger (teacher) model on a specific task.
**DPO (Direct Preference Optimization)**
A method for preference-based model optimization that's simpler than RLHF, training the model directly on preference pairs without a separate reward model.
**Fine-tuning**
Updating a model's weights on task-specific data to change its behavior permanently. An umbrella term that includes SFT, instruction tuning, RLHF, DPO, and other techniques. See the fine-tuning landscape table in Lesson 8.3 for how these relate.
**Full fine-tuning**
Updating all of a model's parameters during training, as opposed to PEFT methods that update only a small subset. Requires significantly more GPU memory and compute. Produces the most thorough adaptation but carries higher risk of catastrophic forgetting.
**Inference server**
Software (like vLLM or Ollama) that hosts a model and serves inference requests.
**Instruction tuning**
A specific application of SFT where the training data consists of instruction-response pairs. This is how base models become chat models: the technique is SFT, the data format is instructions. Not a separate technique from SFT.
**LoRA (Low-Rank Adaptation)**
A parameter-efficient fine-tuning method that trains small adapter matrices instead of updating all model weights. Dramatically reduces GPU memory and compute requirements.
**Overfitting**
When a model memorizes training examples instead of learning generalizable patterns. The model performs well on training data but poorly on new inputs. Detected by monitoring validation loss alongside training loss.
**Parameter count**
The number of learned weights in a model, commonly expressed in billions (e.g., "7B" = 7 billion parameters). Determines memory requirements (roughly 2 bytes per parameter at FP16) and broadly correlates with capability, though training quality and architecture matter as much as size. See Model Selection and Serving for sizing guidance.
**PEFT (Parameter-Efficient Fine-Tuning)**
A family of methods (including LoRA) that fine-tune a small subset of parameters instead of the full model.
**Preference optimization**
Training methods (RLHF, DPO) that use human or automated preference signals to improve model behavior. "This output is better than that output" rather than "this is the correct output."
**QLoRA (Quantized LoRA)**
LoRA applied to a quantized (compressed) base model. Further reduces memory requirements, enabling fine-tuning on consumer hardware.
**Quantization**
Reducing the precision of model weights (e.g., FP16 → INT4) to shrink memory usage and increase inference speed at some quality cost. A 7B model at FP16 needs ~14 GB VRAM; quantized to 4-bit, it fits in ~4 GB. Common formats include GGUF (llama.cpp/Ollama), GPTQ, and AWQ (vLLM/Hugging Face). See Model Selection and Serving for format details and tradeoffs.
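The VRAM arithmetic in the quantization entry generalizes to a one-line formula; a minimal sketch that covers weights only (activations and KV cache add overhead on top):

```python
def vram_estimate_gb(params_billions: float, bits_per_weight: int) -> float:
    """Rough weight-memory estimate; ignores activations and KV cache."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

print(vram_estimate_gb(7, 16))  # 14.0 GB at FP16
print(vram_estimate_gb(7, 4))   # 3.5 GB at 4-bit (before overhead)
```

This is why the glossary's "~4 GB" figure for a 4-bit 7B model is slightly above the raw 3.5 GB of weights: the remainder is runtime overhead.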
**RLHF (Reinforcement Learning from Human Feedback)**
A training method that uses human preference signals to improve model behavior through a reward model. More complex than DPO (requires training a separate reward model) but offers more control over the optimization objective.
**SFT (Supervised Fine-Tuning)**
Fine-tuning using input-output pairs where the desired output is known. The most common fine-tuning approach.
**TRL (Transformer Reinforcement Learning)**
A Hugging Face library for training language models with reinforcement learning, SFT, and other optimization methods.
### Cross-cutting terms

**Consumer chat app**
The browser or desktop product meant for human conversation (ChatGPT, Claude, HuggingChat). Useful for experimentation, but not the same as API access.
**Developer platform**
The provider's API, billing, API-key, and developer-docs surface. This is what you need for this learning path.
**Hosted API**
The provider runs the model for you and you call it over HTTP.
**Local inference**
You run the model on your own machine.
**Provider**
The company or service that hosts a model API you call from code.
**Prompt caching**
Reusing computation from repeated prompt prefixes to reduce latency and cost on subsequent requests with the same prefix.
**Rate limiting**
Constraints on how many API requests you can make per unit of time. An operational concern that affects system design and cost.
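On the client side, the usual answer to a provider's cap is a token bucket. A minimal sketch, with an invented 2-requests-per-second limit:

```python
import time

class TokenBucket:
    """Client-side token bucket: stay under a provider's requests-per-second cap."""
    def __init__(self, rate_per_s: float, capacity: int):
        self.rate, self.capacity = rate_per_s, capacity
        self.tokens, self.last = float(capacity), time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, up to the bucket's capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate_per_s=2, capacity=2)
print([bucket.try_acquire() for _ in range(3)])  # third call is throttled
```

A denied acquire typically becomes a short sleep-and-retry rather than a dropped request.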
**Token budget**
The maximum number of tokens you allocate for a specific part of the context (e.g., "retrieval evidence gets at most 4K tokens"). A context engineering tool for preventing any single component from dominating the context window.
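Enforcing a budget is usually a greedy packing step. A minimal sketch; the 4-characters-per-token heuristic is a rough stand-in for a real tokenizer, and the evidence strings are invented:

```python
def pack_to_budget(snippets, budget_tokens, count_tokens=lambda s: len(s) // 4):
    """Greedily keep highest-priority snippets until the token budget is spent.
    `count_tokens` defaults to a crude 4-chars-per-token heuristic."""
    kept, used = [], 0
    for snippet in snippets:  # assumed pre-sorted, most relevant first
        cost = count_tokens(snippet)
        if used + cost <= budget_tokens:
            kept.append(snippet)
            used += cost
    return kept

evidence = ["def auth(): ..." * 10, "# README intro" * 50, "x = 1"]
print(len(pack_to_budget(evidence, budget_tokens=50)))  # 2: the middle snippet is too big
```

Swapping in a real tokenizer (e.g., your provider's counting endpoint) changes the cost function, not the packing logic.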