Module 4: Code Retrieval - AST and Symbol Retrieval

AST and Symbol-Aware Retrieval (Tier 2)

Your naive vector baseline showed you exactly where flat chunking breaks: functions split across chunk boundaries, classes severed from their methods, docstrings separated from the code they describe. You might be tempted to think of these as edge cases, but they're actually the normal behavior of character-based chunking on code. In this lesson, we'll fix those failures by parsing your code into its actual structure and chunking along the boundaries that the language itself defines.

The key insight is that code has grammar. Unlike prose, where paragraph breaks are soft suggestions, code has hard structural boundaries: functions, classes, methods, module-level blocks. A chunk that respects those boundaries is far more useful to a model than one that splits a function at character 800.

What you'll learn

  • Parse code files with Tree-sitter to extract the abstract syntax tree (AST)
  • Build a symbol table mapping every function, class, and method to its file and line range
  • Create code-aware chunks that respect structural boundaries
  • Re-index your anchor repo with AST-aware chunks and compare retrieval quality against the naive baseline
  • Measure the improvement on symbol lookup and architecture benchmark questions

Concepts

Abstract Syntax Tree (AST): a tree representation of the syntactic structure of source code. Each node represents a construct: a function definition, a class, an if-statement, an import. The AST captures what the code means structurally, independent of formatting. We'll use the AST to know where functions start and end, which methods belong to which classes, and what each file exports.
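To make that concrete, here is a minimal sketch using Python's built-in ast module (the walkthrough below uses Tree-sitter instead): parse a tiny module and list each definition with its line range.

```python
import ast

# A small module: one class with a method, plus a top-level function.
source = """\
class Greeter:
    def greet(self, name):
        return "hello " + name

def main():
    print(Greeter().greet("world"))
"""

tree = ast.parse(source)
# Walk the tree and record every class and function with its line range.
symbols = [
    (type(node).__name__, node.name, node.lineno, node.end_lineno)
    for node in ast.walk(tree)
    if isinstance(node, (ast.FunctionDef, ast.ClassDef))
]
for kind, name, start, end in symbols:
    print(f"{kind} {name}: lines {start}-{end}")
```

Notice the tree already knows that greet lives inside Greeter and exactly where each definition starts and ends. That is the information flat chunking throws away.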

Tree-sitter: an incremental parsing library that builds ASTs for source code. It supports many languages through grammar files, parses quickly enough for editor-scale use, and produces concrete syntax trees that include every token. We use it because it works across languages and doesn't require the code to be valid (it can parse partial or broken files).

Symbol table: a data structure that maps symbol names (functions, classes, variables) to their locations (file, start line, end line) and metadata (arguments, return types, parent class). In compilers, symbol tables are used for name resolution. In retrieval, we'll use them for precise symbol lookup: "where is X defined?" becomes a table lookup instead of a search query.
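In miniature, the idea looks like this (the file names and line numbers here are made up for illustration):

```python
# A toy symbol table: name -> list of definition records.
# The real table we build below carries more metadata per record.
symbol_table = {
    "connect": [
        {"file": "db/client.py", "start_line": 12, "end_line": 30},
    ],
    "Retry": [
        {"file": "net/retry.py", "start_line": 5, "end_line": 48},
    ],
}

def where_is(name: str) -> list[dict]:
    """Answer 'where is X defined?' with a dict lookup, not a search."""
    return symbol_table.get(name, [])

print(where_is("connect"))  # exact hit: one record in db/client.py
print(where_is("conect"))   # no fuzzy matching: a typo returns []
```

The trade-off is exactness: a lookup never returns "something similar," which is a feature for definitions and a limitation for everything else.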

Structural chunking: splitting code into chunks that follow the code's own structure. Instead of splitting at every N characters, you split at function boundaries, class boundaries, and module-level blocks. Each chunk is a complete semantic unit that a model can understand without needing context from adjacent chunks.
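A quick sketch of the contrast, again using the stdlib ast module for illustration: fixed-size windows cut mid-definition, while boundary-aware splitting yields one complete def per chunk.

```python
import ast

source = """\
def load(path):
    with open(path) as f:
        return f.read()

def save(path, text):
    with open(path, "w") as f:
        f.write(text)
"""

# Character-based chunking: fixed 60-char windows, blind to structure.
naive = [source[i:i + 60] for i in range(0, len(source), 60)]

# Structural chunking: one chunk per top-level definition.
lines = source.splitlines(keepends=True)
structural = [
    "".join(lines[node.lineno - 1:node.end_lineno])
    for node in ast.parse(source).body
    if isinstance(node, (ast.FunctionDef, ast.ClassDef))
]

print(f"{len(naive)} naive chunks vs {len(structural)} structural chunks")
print("first naive chunk ends with:", repr(naive[0][-20:]))
```

Every structural chunk starts at a def and ends at the end of that def; the naive chunks cut wherever the character count says to.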

Problem-to-Tool Map

| Problem class | Symptom | Cheapest thing to try first | Tool or approach |
| --- | --- | --- | --- |
| Broken function boundaries | Retrieved chunks contain half a function | Increase chunk size | Parse with Tree-sitter, chunk by function/class boundary |
| Exact symbol lookup misses | Vector search returns related but wrong symbols | Grep for the symbol name | Symbol table with direct lookup |
| Missing structural context | Model doesn't know which class a method belongs to | Add file path to chunk metadata | Include parent class/module context in chunk |
| Language-specific failures | Python parser doesn't handle TypeScript files | Single-language indexing | Multi-language Tree-sitter grammars |
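The "grep for the symbol name" row deserves emphasis: before building anything, a naive scan often answers "where is X defined?". A rough sketch in pure Python so it works everywhere (grep_def is a hypothetical helper, not one of this lesson's scripts):

```python
from pathlib import Path

def grep_def(symbol: str, root: str = ".") -> list[tuple[str, int, str]]:
    """Cheapest first try: scan .py files for likely definition lines."""
    needles = (f"def {symbol}(", f"class {symbol}(", f"class {symbol}:")
    hits = []
    for path in Path(root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text(errors="replace").splitlines(), 1):
            if any(n in line for n in needles):
                hits.append((str(path), lineno, line.strip()))
    return hits
```

When this stops being enough (aliases, re-exports, same name in many files), that is the signal to build the symbol table.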

Default: Tree-sitter

Why this is the default: Tree-sitter parses many languages with a single API, handles broken/partial files gracefully, and runs fast enough to re-index on every commit. It gives us a consistent structural representation regardless of language.

Portable concept underneath: parse code into meaningful structural units instead of treating it as plain text. The specific parser matters less than the principle: code structure should inform chunking.

Closest alternatives and when to switch:

  • Python ast module: use when your codebase is pure Python and you don't need multi-language support (we used this approach in the metadata index lesson; it works well for Python-only analysis)
  • LSP-based symbol extraction: use when you need type information, cross-file resolution, or refactoring-grade accuracy
  • ctags / universal-ctags: use when you only need symbol definitions and don't need full AST traversal
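For the pure-Python route, here is a sketch of what ast-based symbol extraction might look like, including qualified names for methods. It mirrors, in miniature, what the Tree-sitter walkthrough below does:

```python
import ast

class SymbolVisitor(ast.NodeVisitor):
    """Python-only alternative to Tree-sitter: symbols via the stdlib ast."""
    def __init__(self):
        self.symbols = []
        self._class_stack = []  # tracks enclosing classes for qualified names

    def visit_ClassDef(self, node):
        self.symbols.append(("class", node.name))
        self._class_stack.append(node.name)
        self.generic_visit(node)  # visit methods with class context
        self._class_stack.pop()

    def visit_FunctionDef(self, node):
        prefix = ".".join(self._class_stack)
        qualified = f"{prefix}.{node.name}" if prefix else node.name
        self.symbols.append(("function", qualified))
        self.generic_visit(node)

source = "class A:\n    def run(self): pass\n\ndef main(): pass\n"
v = SymbolVisitor()
v.visit(ast.parse(source))
print(v.symbols)  # [('class', 'A'), ('function', 'A.run'), ('function', 'main')]
```

The catch, and the reason we default to Tree-sitter: ast.parse raises on syntax errors and only speaks Python.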

Walkthrough

Install Tree-sitter

cd anchor-repo
pip install tree-sitter tree-sitter-python

If your anchor repo includes other languages, install those grammars too:

# Only install what you need
pip install tree-sitter-javascript  # for JS/JSX projects
pip install tree-sitter-typescript  # for TypeScript/TSX projects (separate package)
pip install tree-sitter-go          # for Go projects

Note on tree-sitter versions

The tree-sitter Python package version 0.23+ uses a new API. The code below targets that API. If you're using an older version, the Language import path and parser setup differ. Check the tree-sitter Python bindings documentation if you're unsure which version you have.

Parse files and extract symbols

# retrieval/parse_ast.py
"""Parse code files with Tree-sitter and extract structural metadata."""
import json
from pathlib import Path
import tree_sitter_python as tspython
from tree_sitter import Language, Parser

REPO_ROOT = Path(".").resolve()
EXCLUDED_DIRS = {".venv", ".git", "__pycache__", "node_modules", ".tox", ".mypy_cache"}
SYMBOL_TABLE_PATH = Path("retrieval/symbol_table.json")

# Initialize the Python parser
PY_LANGUAGE = Language(tspython.language())
parser = Parser(PY_LANGUAGE)


def is_excluded(path: Path) -> bool:
    """Check whether a path should be skipped during repository traversal.

    Args:
        path: Repository-relative path to evaluate.

    Returns:
        ``True`` when the path lives under an excluded directory, otherwise ``False``.
    """
    return any(part in EXCLUDED_DIRS for part in path.parts)


def extract_symbols(file_path: Path) -> list[dict]:
    """Extract function and class symbols from one Python file.

    Args:
        file_path: Absolute path to the Python file to parse.

    Returns:
        A list of symbol dictionaries describing discovered classes and functions.
    """
    source = file_path.read_bytes()
    tree = parser.parse(source)
    rel_path = str(file_path.relative_to(REPO_ROOT))
    symbols = []

    def visit(node, parent_class=None):
        if node.type == "function_definition":
            name_node = node.child_by_field_name("name")
            params_node = node.child_by_field_name("parameters")
            name = name_node.text.decode() if name_node else "<anonymous>"
            params = params_node.text.decode() if params_node else "()"

            # Extract docstring if present
            body = node.child_by_field_name("body")
            docstring = None
            if body and body.children:
                first_stmt = body.children[0]
                if first_stmt.type == "expression_statement":
                    expr = first_stmt.children[0] if first_stmt.children else None
                    if expr and expr.type == "string":
                        docstring = expr.text.decode().strip("\"'")

            symbols.append({
                "type": "function",
                "name": name,
                "qualified_name": f"{parent_class}.{name}" if parent_class else name,
                "file": rel_path,
                "start_line": node.start_point[0] + 1,
                "end_line": node.end_point[0] + 1,
                "start_byte": node.start_byte,
                "end_byte": node.end_byte,
                "params": params,
                "docstring": docstring,
                "parent_class": parent_class,
            })

        elif node.type == "class_definition":
            name_node = node.child_by_field_name("name")
            name = name_node.text.decode() if name_node else "<anonymous>"
            body = node.child_by_field_name("body")

            # Extract class docstring
            docstring = None
            if body and body.children:
                first_stmt = body.children[0]
                if first_stmt.type == "expression_statement":
                    expr = first_stmt.children[0] if first_stmt.children else None
                    if expr and expr.type == "string":
                        docstring = expr.text.decode().strip("\"'")

            symbols.append({
                "type": "class",
                "name": name,
                "qualified_name": name,
                "file": rel_path,
                "start_line": node.start_point[0] + 1,
                "end_line": node.end_point[0] + 1,
                "start_byte": node.start_byte,
                "end_byte": node.end_byte,
                "docstring": docstring,
            })

            # Visit class body with parent context
            if body:
                for child in body.children:
                    visit(child, parent_class=name)
            return  # Don't recurse into children again

        for child in node.children:
            visit(child, parent_class=parent_class)

    visit(tree.root_node)
    return symbols


def build_symbol_table() -> dict:
    """Build a repository-wide symbol table from parsed Python files.

    Args:
        None.

    Returns:
        A symbol table with full symbol records and lookup indexes.
    """
    all_symbols = []
    files_parsed = 0

    for path in sorted(REPO_ROOT.rglob("*.py")):
        if is_excluded(path.relative_to(REPO_ROOT)):
            continue
        try:
            symbols = extract_symbols(path)
            all_symbols.extend(symbols)
            files_parsed += 1
        except Exception as e:
            print(f"  Warning: failed to parse {path}: {e}")

    # Build lookup indexes
    by_name = {}
    for sym in all_symbols:
        name = sym["name"]
        if name not in by_name:
            by_name[name] = []
        by_name[name].append(sym)

    table = {
        "symbols": all_symbols,
        "by_name": by_name,
        "files_parsed": files_parsed,
        "total_symbols": len(all_symbols),
    }

    SYMBOL_TABLE_PATH.write_text(json.dumps(table, indent=2))
    functions = [s for s in all_symbols if s["type"] == "function"]
    classes = [s for s in all_symbols if s["type"] == "class"]
    print(f"Parsed {files_parsed} files")
    print(f"Found {len(functions)} functions, {len(classes)} classes")
    print(f"Symbol table saved to {SYMBOL_TABLE_PATH}")
    return table


if __name__ == "__main__":
    build_symbol_table()

Run it:

python retrieval/parse_ast.py

Expected output:

Parsed 23 files
Found 68 functions, 12 classes
Symbol table saved to retrieval/symbol_table.json
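Once the table exists, "where is X defined?" is a dictionary lookup. A sketch that reads the JSON written above, falling back to an empty table if you haven't run the script yet:

```python
import json
from pathlib import Path

def load_symbol_table(path: str = "retrieval/symbol_table.json") -> dict:
    """Load the symbol table written by parse_ast.py, or an empty table."""
    p = Path(path)
    if not p.exists():
        return {"by_name": {}, "symbols": []}
    return json.loads(p.read_text())

table = load_symbol_table()
# Direct lookup: no embedding, no ranking, just the answer.
for sym in table["by_name"].get("extract_symbols", []):
    print(f"{sym['file']}:{sym['start_line']}-{sym['end_line']}")
```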

Create code-aware chunks

Now let's build chunks that follow the code's structure. Each function becomes its own chunk. Each class becomes a chunk (with its methods). Module-level code becomes a chunk. Nothing gets split mid-definition.

# retrieval/chunk_ast.py
"""Create code-aware chunks using Tree-sitter AST boundaries."""
import json
from pathlib import Path
import tree_sitter_python as tspython
from tree_sitter import Language, Parser

REPO_ROOT = Path(".").resolve()
EXCLUDED_DIRS = {".venv", ".git", "__pycache__", "node_modules", ".tox", ".mypy_cache"}
CODE_EXTENSIONS = {".py"}  # Start with Python; extend for other languages
OUTPUT_PATH = Path("retrieval/chunks_ast.jsonl")
MAX_CHUNK_CHARS = 2000  # If a single function/class exceeds this, we'll include it whole but flag it

PY_LANGUAGE = Language(tspython.language())
parser = Parser(PY_LANGUAGE)


def is_excluded(path: Path) -> bool:
    """Check whether a path should be skipped during repository traversal.

    Args:
        path: Repository-relative path to evaluate.

    Returns:
        ``True`` when the path lives under an excluded directory, otherwise ``False``.
    """
    return any(part in EXCLUDED_DIRS for part in path.parts)


def structural_chunks(file_path: Path) -> list[dict]:
    """Split one source file into AST-aligned structural chunks.

    Args:
        file_path: Absolute path to the source file to chunk.

    Returns:
        A list of chunk dictionaries aligned to module, class, and function boundaries.
    """
    source_bytes = file_path.read_bytes()
    source_text = source_bytes.decode(errors="replace")
    tree = parser.parse(source_bytes)
    rel_path = str(file_path.relative_to(REPO_ROOT))
    chunks = []

    # Collect top-level nodes
    root = tree.root_node
    module_header_lines = []  # imports, module docstring, etc.
    current_header_end = 0

    for child in root.children:
        if child.type in ("function_definition", "class_definition", "decorated_definition"):
            # If there's module-level code above this definition, capture it
            if current_header_end < child.start_byte:
                header_text = source_bytes[current_header_end:child.start_byte].decode(errors="replace").strip()
                if header_text:
                    module_header_lines.append(header_text)

            # Extract the full definition as a chunk
            node_text = source_bytes[child.start_byte:child.end_byte].decode(errors="replace")
            start_line = child.start_point[0] + 1
            end_line = child.end_point[0] + 1

            # Determine the symbol name
            actual_node = child
            if child.type == "decorated_definition":
                for sub in child.children:
                    if sub.type in ("function_definition", "class_definition"):
                        actual_node = sub
                        break
            name_node = actual_node.child_by_field_name("name")
            symbol_name = name_node.text.decode() if name_node else "<anonymous>"
            symbol_type = "class" if actual_node.type == "class_definition" else "function"

            chunks.append({
                "file_path": rel_path,
                "symbol_name": symbol_name,
                "symbol_type": symbol_type,
                "start_line": start_line,
                "end_line": end_line,
                "text": node_text,
                "char_count": len(node_text),
                "is_oversized": len(node_text) > MAX_CHUNK_CHARS,
            })
            current_header_end = child.end_byte

        else:
            # Module-level code (imports, assignments, etc.) is captured when
            # we reach the next definition or the end of the file, so don't
            # advance current_header_end past it here.
            continue

    # Capture trailing module-level code
    trailing = source_bytes[current_header_end:].decode(errors="replace").strip()
    if trailing:
        module_header_lines.append(trailing)

    # Add module-level code as a single chunk
    if module_header_lines:
        header_text = "\n".join(module_header_lines)
        chunks.insert(0, {
            "file_path": rel_path,
            "symbol_name": "__module__",
            "symbol_type": "module",
            "start_line": 1,
            "end_line": None,
            "text": header_text,
            "char_count": len(header_text),
            "is_oversized": len(header_text) > MAX_CHUNK_CHARS,
        })

    return chunks


def build_ast_chunks():
    """Build AST-aware chunks for all eligible source files in the repository.

    Args:
        None.

    Returns:
        None. Chunk records are written to the AST chunk JSONL file.
    """
    all_chunks = []
    chunk_id = 0

    for path in sorted(REPO_ROOT.rglob("*")):
        if not path.is_file():
            continue
        if is_excluded(path.relative_to(REPO_ROOT)):
            continue
        if path.suffix not in CODE_EXTENSIONS:
            continue

        try:
            file_chunks = structural_chunks(path)
            for chunk in file_chunks:
                chunk["chunk_id"] = f"ast-{chunk_id:05d}"
                all_chunks.append(chunk)
                chunk_id += 1
        except Exception as e:
            print(f"  Warning: failed to parse {path}: {e}")

    with open(OUTPUT_PATH, "w") as f:
        for chunk in all_chunks:
            f.write(json.dumps(chunk) + "\n")

    oversized = [c for c in all_chunks if c.get("is_oversized")]
    print(f"Created {len(all_chunks)} AST-aware chunks from {len(set(c['file_path'] for c in all_chunks))} files")
    print(f"  Functions: {len([c for c in all_chunks if c['symbol_type'] == 'function'])}")
    print(f"  Classes: {len([c for c in all_chunks if c['symbol_type'] == 'class'])}")
    print(f"  Module-level: {len([c for c in all_chunks if c['symbol_type'] == 'module'])}")
    if oversized:
        print(f"  Oversized chunks (>{MAX_CHUNK_CHARS} chars): {len(oversized)}")
    print(f"Chunks saved to {OUTPUT_PATH}")


if __name__ == "__main__":
    build_ast_chunks()

Run it:

python retrieval/chunk_ast.py

Expected output:

Created 103 AST-aware chunks from 23 files
  Functions: 68
  Classes: 12
  Module-level: 23
  Oversized chunks (>2000 chars): 3
Chunks saved to retrieval/chunks_ast.jsonl

Notice the difference: naive chunking produced 142 arbitrary chunks. AST-aware chunking produces 103 chunks that align with code structure. Each function is whole. Each class is whole. No split definitions.
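To quantify the comparison beyond counts, here is a small helper you might run against both JSONL files. It assumes each line carries a "text" field, which the AST chunks above do; adjust the key if your naive chunk file stores it differently.

```python
import json
from pathlib import Path
from statistics import mean

def chunk_stats(jsonl_path: str) -> dict:
    """Summarize a chunk JSONL file: count, mean size, largest chunk."""
    chunks = []
    with open(jsonl_path) as f:
        for line in f:
            if line.strip():
                chunks.append(json.loads(line))
    sizes = [len(c["text"]) for c in chunks]
    return {
        "count": len(chunks),
        "mean_chars": round(mean(sizes)) if sizes else 0,
        "max_chars": max(sizes) if sizes else 0,
    }

for name in ("retrieval/chunks_ast.jsonl",):
    if Path(name).exists():
        print(name, chunk_stats(name))
```

Expect the AST chunks to have a wider size spread than the naive ones: whole classes are big, small helpers are tiny, and that's fine.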

Re-index with AST-aware chunks

Pick your provider for the embedding script. The AST chunking and Qdrant storage are identical across providers; only the embedding call differs.

# retrieval/embed_ast_chunks.py
"""Embed AST-aware chunks and store in a separate Qdrant collection."""
import json
from pathlib import Path
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

CHUNKS_PATH = Path("retrieval/chunks_ast.jsonl")
COLLECTION_NAME = "anchor-repo-ast"
EMBEDDING_MODEL = "text-embedding-3-small"
EMBEDDING_DIM = 1536
BATCH_SIZE = 50

client = OpenAI()
qdrant = QdrantClient(path="retrieval/qdrant_data")


def load_chunks():
    """Load AST-aware chunks from the JSONL chunk store.

    Args:
        None.

    Returns:
        A list of chunk dictionaries from the chunk store.
    """
    chunks = []
    with open(CHUNKS_PATH) as f:
        for line in f:
            if line.strip():
                chunks.append(json.loads(line))
    return chunks


def embed_texts(texts):
    """Generate embeddings for a batch of AST-aware chunk texts.

    Args:
        texts: Chunk texts to embed.

    Returns:
        A list of embedding vectors in the same order as the input texts.
    """
    response = client.embeddings.create(model=EMBEDDING_MODEL, input=texts)
    return [item.embedding for item in response.data]


def create_and_store():
    """Embed AST-aware chunks and store them in the Qdrant collection.

    Args:
        None.

    Returns:
        None. The vector collection is recreated and populated in place.
    """
    chunks = load_chunks()
    collections = [c.name for c in qdrant.get_collections().collections]
    if COLLECTION_NAME in collections:
        qdrant.delete_collection(COLLECTION_NAME)
    qdrant.create_collection(
        collection_name=COLLECTION_NAME,
        vectors_config=VectorParams(size=EMBEDDING_DIM, distance=Distance.COSINE),
    )
    print(f"Created collection '{COLLECTION_NAME}'")

    for batch_start in range(0, len(chunks), BATCH_SIZE):
        batch = chunks[batch_start:batch_start + BATCH_SIZE]
        texts = [f"{c['symbol_type']} {c['symbol_name']} in {c['file_path']}\n\n{c['text']}" for c in batch]
        embeddings = embed_texts(texts)
        points = [
            PointStruct(id=batch_start + i, vector=emb, payload={
                "chunk_id": chunk["chunk_id"], "file_path": chunk["file_path"],
                "symbol_name": chunk["symbol_name"], "symbol_type": chunk["symbol_type"],
                "start_line": chunk["start_line"], "end_line": chunk["end_line"],
                "text": chunk["text"],
            })
            for i, (chunk, emb) in enumerate(zip(batch, embeddings))
        ]
        qdrant.upsert(collection_name=COLLECTION_NAME, points=points)
        print(f"  Stored {batch_start + len(batch)}/{len(chunks)} chunks")
    print(f"\nDone. {len(chunks)} AST-aware chunks stored in '{COLLECTION_NAME}'")


if __name__ == "__main__":
    create_and_store()

Run it:

python retrieval/embed_ast_chunks.py

Compare naive vs. AST-aware retrieval

Use the same embedding provider you used for indexing:

# retrieval/compare_tiers.py
"""Compare naive vs. AST-aware retrieval on benchmark questions."""
import json
from pathlib import Path
from openai import OpenAI
from qdrant_client import QdrantClient

BENCHMARK_FILE = Path("benchmark-questions.jsonl")
EMBEDDING_MODEL = "text-embedding-3-small"
TOP_K = 5

client = OpenAI()
qdrant = QdrantClient(path="retrieval/qdrant_data")


def retrieve_from(collection, query, top_k=TOP_K):
    """Query one vector collection and return the top AST-aware matches.

    Args:
        collection: Qdrant collection name to query.
        query: User question or lookup string to embed.
        top_k: Number of matches to return.

    Returns:
        A list of ranked retrieval hits with file, symbol, and preview metadata.
    """
    response = client.embeddings.create(model=EMBEDDING_MODEL, input=[query])
    query_vector = response.data[0].embedding
    results = qdrant.query_points(collection_name=collection, query=query_vector, limit=top_k)
    return [{"file_path": hit.payload["file_path"], "score": round(hit.score, 4),
             "text_preview": hit.payload["text"][:120],
             "symbol_name": hit.payload.get("symbol_name", "n/a")}
            for hit in results.points]


def compare():
    """Compare naive retrieval against AST-aware retrieval on benchmark questions.

    Args:
        None.

    Returns:
        None. Results are printed for manual inspection.
    """
    questions = []
    with open(BENCHMARK_FILE) as f:
        for line in f:
            if line.strip():
                questions.append(json.loads(line))

    print(f"Comparing naive vs. AST-aware retrieval on {min(len(questions), 15)} questions\n")
    for q in questions[:15]:
        print(f"[{q['category']}] {q['question'][:70]}")
        naive = retrieve_from("anchor-repo-naive", q["question"])
        ast_aware = retrieve_from("anchor-repo-ast", q["question"])
        print(f"  Naive:     {', '.join(set(r['file_path'] for r in naive))}")
        print(f"  AST-aware: {', '.join(set(r['file_path'] for r in ast_aware))}")
        for r in ast_aware[:3]:
            print(f"    [{r['score']}] {r['symbol_name']} in {r['file_path']}")
        print()


if __name__ == "__main__":
    compare()

Run it:

python retrieval/compare_tiers.py

You should see improvements in two areas:

  1. Symbol lookup questions: AST-aware retrieval returns the complete function or class, not a fragment. The model gets a whole definition to work with.

  2. Architecture questions: because each chunk is a named symbol with metadata, the retrieval results are more meaningful. Instead of "some text from main.py," you get "function handle_request in routes/api.py."

The questions where you won't see much improvement yet are relationship questions ("what calls this function?") and questions requiring cross-file reasoning. Those are what graph and hybrid retrieval will address.

Exercises

  1. Build the Tree-sitter parser and symbol table (parse_ast.py). Verify the symbol table includes every function and class in your repo.
  2. Build AST-aware chunks (chunk_ast.py). Compare the chunk count and average chunk size against the naive baseline. Open both JSONL files and compare three chunks from the same file.
  3. Embed and store AST-aware chunks (embed_ast_chunks.py). Run the tier comparison script and note which question categories improved.
  4. Run a full benchmark through AST-aware retrieval (modify run_naive_benchmark.py to use the AST collection). Grade at least 15 answers and compare against your naive baseline grades.
  5. Find a question where AST-aware retrieval finds the right file but the wrong symbol. What metadata would help the retrieval rank the correct symbol higher?

Completion checkpoint

You now have:

  • A working Tree-sitter parser that extracts symbols from your anchor repo
  • A symbol table with file, line range, and parent-class metadata for every function and class
  • AST-aware chunks stored in a separate Qdrant collection
  • A side-by-side comparison showing AST-aware retrieval's improvement over the naive baseline on symbol lookup and architecture questions
  • Benchmark grades showing the overall improvement and the remaining failure categories

Reflection prompts

  • How much did AST-aware chunking improve your benchmark scores? Was the improvement concentrated in specific question categories?
  • Did the symbol table metadata (symbol name, parent class, line range) change which chunks the retrieval returned, or just make the returned chunks more useful?
  • Which failure classes from the naive baseline are now resolved? Which ones remain?
  • When you look at the remaining failures, do they involve relationships between code entities (imports, calls, dependencies)? That's the pattern we'll address next.

What's next

Graph/Hybrid Retrieval. Structure fixes symbol lookup, but relationship questions still need traversal and exact-match signals. That is the gap the next tier closes.

References

Deep dive

  • Tree-sitter GitHub — the core library with links to all available grammars
  • Aider: Repository maps — how Aider uses Tree-sitter to build repository maps for LLM context; a practical example of AST-informed retrieval at scale
Glossary
API (Application Programming Interface)Foundational terms
A structured way for programs to communicate. In this context, usually an HTTP endpoint you call to interact with an LLM.
AST (Abstract Syntax Tree)Foundational terms
A tree representation of source code structure. Used by parsers like Tree-sitter to understand code as a hierarchy of functions, classes, and statements. You'll encounter this more deeply in the Code Retrieval module, but the concept appears briefly in retrieval fundamentals.
BM25 (Best Match 25)Foundational terms
A classical ranking function for keyword search. Scores documents by term frequency and inverse document frequency. Often competitive with or complementary to vector search.
ChunkingFoundational terms
Splitting a document into smaller pieces for indexing and retrieval. Chunk boundaries significantly affect retrieval quality. Split at the wrong place and your retrieval will return half a function or the end of one paragraph glued to the start of another.
Context engineeringFoundational terms
The discipline of selecting, packaging, and budgeting the information a model sees at inference time. Prompts, retrieved evidence, tool results, memory, and state are all parts of context. Context engineering is arguably the core skill of AI engineering. Bigger context windows are not a substitute for better context selection.
Context rotFoundational terms
Degradation of output quality caused by stale, noisy, or accumulated context. Symptoms include stale memory facts, conflicting retrieved evidence, bloated prompt history, and accumulated instructions that contradict each other. A form of technical debt in AI systems.
Context windowFoundational terms
The maximum number of tokens an LLM can process in a single request (input + output combined).
EmbeddingFoundational terms
A fixed-length numeric vector representing a piece of text. Used for similarity search: texts with similar meanings have nearby embeddings.
EndpointFoundational terms
A specific URL path that accepts requests and returns responses (e.g., POST /v1/chat/completions).
GGUFFoundational terms
A file format for quantized models used by llama.cpp and Ollama. When you see a model name like qwen2.5:7b-q4_K_M, the suffix indicates the quantization scheme. GGUF supports mixed quantization (different precision for different layers) and is the most common format for local inference.
HallucinationFoundational terms
When a model generates content that sounds confident but isn't supported by the evidence it was given, or fabricates details that don't exist. Not the same as "any wrong answer"; a model that misinterprets ambiguous instructions gave a bad answer but didn't hallucinate. Common causes: weak prompt, missing context, context rot, model limitation, or retrieval failure.
InferenceFoundational terms
Running a trained model to generate output from input. What happens when you call an API. Most AI engineering work is inference-time work: building systems around models, not training them. Use "inference," not "inferencing."
JSON (JavaScript Object Notation)Foundational terms
A lightweight text format for structured data. The lingua franca of API communication.
Lexical searchFoundational terms
Finding items by matching keywords or terms. Includes BM25, TF-IDF (Term Frequency–Inverse Document Frequency), and simple keyword matching. Returns exact term matches, not semantic similarity.
LLM (Large Language Model)Foundational terms
A neural network trained on large text corpora that generates text by predicting the next token. The core technology behind AI engineering; every tool, pattern, and pipeline in this curriculum runs on top of one.
MetadataFoundational terms
Structured information about a document or chunk (file path, language, author, date, symbol type). Used for filtering retrieval results.
Neural networkFoundational terms
A computing system loosely inspired by biological neurons, built from layers of mathematical functions that transform inputs into outputs. LLMs are a specific type of neural network (transformers) trained on text. You don't need to understand neural network internals to do AI engineering, but knowing the term helps when reading external resources.
Reasoning modelFoundational terms
A model optimized for complex multi-step planning, math, and logic (e.g., o3, o4-mini). Slower and more expensive but better on hard problems. Sometimes called "LRM" (large reasoning model), but "reasoning model" is the more consistent term across provider docs.
RerankingFoundational terms
A second-pass scoring step that re-orders retrieved results using a more expensive model. Improves precision after an initial broad retrieval.
SchemaFoundational terms
A formal description of the shape and types of a data structure. Used to validate inputs and outputs.
SLM (small language model)Foundational terms
A compact model (typically 1-7B parameters) that runs on consumer hardware with lower cost, latency, and better privacy (e.g., Phi, small Llama variants, Gemma). The right choice when privacy, offline operation, predictable cost, or low latency matter more than peak capability.
System promptFoundational terms
A special message that sets the model's behavior, role, and constraints for a conversation.
TemperatureFoundational terms
A parameter controlling output randomness. Lower values produce more deterministic output; higher values produce more varied output. Does not affect the model's intelligence.
TokenFoundational terms
The basic unit an LLM processes. Not a word. Tokens are sub-word fragments. "unhappiness" might be three tokens: "un", "happi", "ness". Token count determines cost and context window usage.
Top-kFoundational terms
The number of results returned from a retrieval query. "Top-5" means the five highest-scoring results.
Top-p (nucleus sampling)Foundational terms
An alternative to temperature for controlling output diversity. Selects from the smallest set of tokens whose cumulative probability exceeds p.
Vector searchFoundational terms
Finding items by proximity in embedding space (nearest neighbors). Returns "similar" results, not "exact match" results.
vLLM (virtual LLM)Foundational terms
An inference serving engine (not a model) that hosts open-weight models behind an OpenAI-compatible HTTP endpoint. Infrastructure layer, not model layer. Relevant when moving from hosted APIs to self-hosting.
**Weights** (Foundational terms)
The learned parameters inside a model. Changed during training, fixed during inference.
**Workhorse model** (Foundational terms)
A general-purpose LLM optimized for speed and broad capability (e.g., GPT-4o-mini, Claude Haiku, Gemini Flash). The default for most tasks. When someone says "LLM" without qualification, they usually mean this.
**Baseline** (Benchmark and Harness terms)
The first measured performance of your system on a benchmark. Everything else is compared against this. Without a baseline, you can't tell whether a change helped.
**Benchmark** (Benchmark and Harness terms)
A fixed set of questions or tasks with known-good answers, used to measure system performance over time.
**Run log** (Benchmark and Harness terms)
A structured record (typically JSONL) of every system run: what input was given, what output was produced, what tools were called, how long it took, and what it cost. The raw data that evals, telemetry, and cost analysis are built from.
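A sketch of the JSONL shape using only the standard library. The field names here are illustrative, not a fixed standard; an in-memory buffer stands in for an open log file.

```python
import io
import json

# One run = one JSON object = one line in the log file.
record = {
    "ts": 1700000000.0,
    "input": "What does parse_file do?",
    "output": "It parses a source file into an AST.",
    "tool_calls": ["read_file"],
    "latency_ms": 840,
    "cost_usd": 0.0004,
}

buf = io.StringIO()                      # stands in for open("runs.jsonl", "a")
buf.write(json.dumps(record) + "\n")     # append one record per line

# Reading it back: one json.loads per line.
buf.seek(0)
rows = [json.loads(line) for line in buf]
```

Because each line is an independent JSON object, the log can be appended to safely mid-run and processed line by line by evals and cost analysis later.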
**A2A (Agent-to-Agent protocol)** (Agent and Tool Building terms)
An open protocol for peer-to-peer agent collaboration. Agents discover each other's capabilities and delegate or negotiate tasks as equals. Different from MCP (which connects agents to tools, not to other agents) and from handoffs (which transfer control within one system).
**Agent** (Agent and Tool Building terms)
A system where an LLM decides which tools to call, observes results, and iterates until a task is complete. Agent = model + tools + control loop.
**Control loop** (Agent and Tool Building terms)
The code that manages the agent's cycle: send prompt, check for tool calls, execute tools, append results, repeat or finish.
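A stripped-down sketch of that cycle. The model is stubbed with a deterministic fake (and the tool set is a toy `add` function) so the loop itself is visible; real code would call a provider API and dispatch real tools.

```python
def fake_model(messages):
    # Stub: first turn requests a tool, second turn answers.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "add", "args": {"a": 2, "b": 3}}}
    return {"content": "The sum is 5."}

TOOLS = {"add": lambda a, b: a + b}

def run_agent(user_input, max_turns=5):
    messages = [{"role": "user", "content": user_input}]
    for _ in range(max_turns):
        reply = fake_model(messages)         # send prompt
        call = reply.get("tool_call")        # check for tool calls
        if call is None:
            return reply["content"]          # no tool call: finish
        result = TOOLS[call["name"]](**call["args"])   # execute tool
        messages.append({"role": "tool", "content": str(result)})  # append result
    raise RuntimeError("agent did not finish within max_turns")

print(run_agent("What is 2 + 3?"))  # The sum is 5.
```

The `max_turns` cap is the part beginners forget: without it, a confused model can loop forever.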
**Handoff** (Agent and Tool Building terms)
Passing control from one agent or specialist to another within an orchestrated system.
**MCP (Model Context Protocol)** (Agent and Tool Building terms)
An open protocol for exposing tools, resources, and prompts to AI applications in a standardized way. Connects agents to capabilities (tools and data), not to other agents.
**Tool calling / function calling** (Agent and Tool Building terms)
The model's ability to request execution of a specific function with structured arguments, rather than just generating text.
**Context compilation / context packing** (Code Retrieval terms)
The process of selecting and assembling the smallest useful set of evidence for a specific task. Not "dump everything retrieved into the prompt."
**Grounding** (Code Retrieval terms)
Tying model assertions to specific evidence. A grounded answer cites what it found; an ungrounded answer asserts without evidence.
**Hybrid retrieval** (Code Retrieval terms)
Combining multiple retrieval methods (e.g., vector search + keyword search + metadata filters) and merging or reranking the results.
**Knowledge graph** (Code Retrieval terms)
A data structure that stores entities and their relationships explicitly (e.g., "function A calls function B," "module X imports module Y"). Useful for traversal and dependency reasoning. One retrieval strategy among several, often overused when simpler metadata or adjacency tables would suffice.
**RAG (Retrieval-Augmented Generation)** (Code Retrieval terms)
A pattern where the model's response is grounded in retrieved external evidence rather than relying solely on its training data.
**Symbol table** (Code Retrieval terms)
A mapping of code identifiers (functions, classes, variables) to their locations and metadata.
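A toy version of the structure, filled in by hand here; in this module it is built from Tree-sitter output. All names and line ranges below are made up for illustration.

```python
# identifier -> location and metadata
symbol_table = {
    "parse_file":  {"kind": "function", "file": "parser.py",  "lines": (10, 42)},
    "Indexer":     {"kind": "class",    "file": "indexer.py", "lines": (5, 88)},
    "Indexer.add": {"kind": "method",   "file": "indexer.py", "lines": (20, 31)},
}

def lookup(name):
    """Resolve a symbol to a file:start-end string, or None if unknown."""
    entry = symbol_table.get(name)
    if entry is None:
        return None
    start, end = entry["lines"]
    return f'{entry["file"]}:{start}-{end}'

print(lookup("parse_file"))  # parser.py:10-42
```

Qualified keys like `Indexer.add` are one simple way to keep methods attached to their class, which is exactly the relationship flat chunking loses.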
**Tree-sitter** (Code Retrieval terms)
An incremental parsing library that builds ASTs for source code. Used in this curriculum for code-aware chunking and symbol extraction.
**Context pack** (RAG and Grounded Answers terms)
A structured bundle of evidence assembled for a specific task, with metadata about provenance, relevance, and token budget.
**Evidence bundle** (RAG and Grounded Answers terms)
A collection of retrieved items grouped for a specific sub-task, with enough metadata to evaluate whether the evidence is relevant and sufficient.
**Retrieval routing** (RAG and Grounded Answers terms)
Deciding which retrieval strategy or method to use for a given query. Different questions need different retrieval methods.
**Eval** (Observability and Evals terms)
A structured test that measures system quality. Not the same as training. Evals measure, they don't change the model.
**Harness (AI harness / eval harness)** (Observability and Evals terms)
The experiment and evaluation framework around your model or agent. It runs benchmark tasks, captures outputs, logs traces, grades results, and compares system versions. It turns ad hoc "try it and see" into repeatable, comparable experiments. Typically includes: input dataset, prompt and tool configuration, model/provider selection, execution loop, logging, grading, and artifact capture.
**LLM-as-judge** (Observability and Evals terms)
Using a language model to evaluate or grade the output of another model or system. Useful for scaling evaluation beyond manual review, but requires rubric quality, judge consistency checks, and human spot-checking. Not a replacement for exact-match checks where they apply.
**OpenTelemetry (OTel)** (Observability and Evals terms)
An open standard for collecting and exporting telemetry data (traces, metrics, logs). Vendor-agnostic.
**RAGAS** (Observability and Evals terms)
A specific eval framework for retrieval-augmented generation. Measures metrics like faithfulness, relevance, and context precision. One tool example, not a foundational concept. Learn the metrics first, then the tool.
**Span** (Observability and Evals terms)
A single operation within a trace (e.g., one tool call, one retrieval query). Traces are made of spans.
**Telemetry** (Observability and Evals terms)
Structured data about system behavior: what happened, when, how long it took, what it cost. Includes traces, metrics, and events.
**Trace** (Observability and Evals terms)
A structured record of one complete run through the system, including all steps, tool calls, and decisions.
**Long-term memory** (Orchestration and Memory terms)
Persistent facts that survive across conversations. Requires write policies to manage what gets stored, updated, or deleted.
**Orchestration** (Orchestration and Memory terms)
Explicit control over how tasks are routed, delegated, and synthesized across multiple agents or specialists.
**Router** (Orchestration and Memory terms)
A component that decides which specialist or workflow path to use for a given query.
**Specialist** (Orchestration and Memory terms)
An agent or workflow tuned for a narrow task (e.g., "code search," "documentation lookup," "test generation"). Specialists are composed by an orchestrator.
**Thread memory** (Orchestration and Memory terms)
Conversation state that persists within a single session or thread.
**Workflow memory** (Orchestration and Memory terms)
Intermediate state that persists within a multi-step task but doesn't survive beyond the workflow's completion.
**Catastrophic forgetting** (Optimization terms)
When fine-tuning causes a model to lose capabilities it had before training. The model gets better at the fine-tuned task but worse at tasks it previously handled. PEFT methods like LoRA reduce this risk by freezing original weights.
**Distillation** (Optimization terms)
Training a smaller (student) model to reproduce the behavior of a larger (teacher) model on a specific task.
**DPO (Direct Preference Optimization)** (Optimization terms)
A method for preference-based model optimization that's simpler than RLHF, training the model directly on preference pairs without a separate reward model.
**Fine-tuning** (Optimization terms)
Updating a model's weights on task-specific data to change its behavior permanently. An umbrella term that includes SFT, instruction tuning, RLHF, DPO, and other techniques. See the fine-tuning landscape table in Lesson 8.3 for how these relate.
**Full fine-tuning** (Optimization terms)
Updating all of a model's parameters during training, as opposed to PEFT methods that update only a small subset. Requires significantly more GPU memory and compute. Produces the most thorough adaptation but carries higher risk of catastrophic forgetting.
**Inference server** (Optimization terms)
Software (like vLLM or Ollama) that hosts a model and serves inference requests.
**Instruction tuning** (Optimization terms)
A specific application of SFT where the training data consists of instruction-response pairs. This is how base models become chat models: the technique is SFT, the data format is instructions. Not a separate technique from SFT.
**LoRA (Low-Rank Adaptation)** (Optimization terms)
A parameter-efficient fine-tuning method that trains small adapter matrices instead of updating all model weights. Dramatically reduces GPU memory and compute requirements.
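The savings come from simple arithmetic: a full d×d weight matrix is replaced, for training purposes, by two low-rank adapters of shape d×r and r×d. The numbers below are illustrative (a hidden size and rank in the common range), not from any specific model.

```python
d = 4096   # hidden size of one square weight matrix (d x d)
r = 8      # LoRA rank (small by design)

full_params = d * d              # parameters updated by full fine-tuning of this matrix
lora_params = d * r + r * d      # trainable adapter parameters: (d x r) plus (r x d)

print(full_params)                   # 16777216
print(lora_params)                   # 65536
print(full_params // lora_params)    # 256x fewer trainable parameters
```

The original d×d weights stay frozen, which is also why LoRA reduces the risk of catastrophic forgetting.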
**Parameter count** (Optimization terms)
The number of learned weights in a model, commonly expressed in billions (e.g., "7B" = 7 billion parameters). Determines memory requirements (roughly 2 bytes per parameter at FP16) and broadly correlates with capability, though training quality and architecture matter as much as size. See Model Selection and Serving for sizing guidance.
**PEFT (Parameter-Efficient Fine-Tuning)** (Optimization terms)
A family of methods (including LoRA) that fine-tune a small subset of parameters instead of the full model.
**Preference optimization** (Optimization terms)
Training methods (RLHF, DPO) that use human or automated preference signals to improve model behavior. "This output is better than that output" rather than "this is the correct output."
**QLoRA (Quantized LoRA)** (Optimization terms)
LoRA applied to a quantized (compressed) base model. Further reduces memory requirements, enabling fine-tuning on consumer hardware.
**Quantization** (Optimization terms)
Reducing the precision of model weights (e.g., FP16 → INT4) to shrink memory usage and increase inference speed at some quality cost. A 7B model at FP16 needs ~14 GB VRAM; quantized to 4-bit, it fits in ~4 GB. Common formats include GGUF (llama.cpp/Ollama), GPTQ and AWQ (vLLM/HuggingFace). See Model Selection and Serving for format details and tradeoffs.
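The back-of-envelope math behind those figures, counting weights only (KV cache, activations, and runtime overhead push real totals a bit higher, which is why the 4-bit figure above is quoted as ~4 GB rather than 3.5):

```python
params = 7e9                   # a "7B" model

fp16_gb = params * 2 / 1e9     # 2 bytes per parameter at FP16
int4_gb = params * 0.5 / 1e9   # 0.5 bytes per parameter at 4-bit

print(fp16_gb)   # 14.0 (GB for weights alone)
print(int4_gb)   # 3.5  (GB for weights alone)
```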
**Overfitting** (Optimization terms)
When a model memorizes training examples instead of learning generalizable patterns. The model performs well on training data but poorly on new inputs. Detected by monitoring validation loss alongside training loss.
**RLHF (Reinforcement Learning from Human Feedback)** (Optimization terms)
A training method that uses human preference signals to improve model behavior through a reward model. More complex than DPO (requires training a separate reward model) but offers more control over the optimization objective.
**SFT (Supervised Fine-Tuning)** (Optimization terms)
Fine-tuning using input-output pairs where the desired output is known. The most common fine-tuning approach.
**TRL (Transformer Reinforcement Learning)** (Optimization terms)
A Hugging Face library for training language models with reinforcement learning, SFT, and other optimization methods.
**Consumer chat app** (Cross-cutting terms)
The browser or desktop product meant for human conversation (ChatGPT, Claude, HuggingChat). Useful for experimentation, but not the same as API access.
**Developer platform** (Cross-cutting terms)
The provider's API, billing, API-key, and developer-docs surface. This is what you need for this learning path.
**Hosted API** (Cross-cutting terms)
The provider runs the model for you and you call it over HTTP.
**Local inference** (Cross-cutting terms)
You run the model on your own machine.
**Provider** (Cross-cutting terms)
The company or service that hosts a model API you call from code.
**Prompt caching** (Cross-cutting terms)
Reusing computation from repeated prompt prefixes to reduce latency and cost on subsequent requests with the same prefix.
**Rate limiting** (Cross-cutting terms)
Constraints on how many API requests you can make per unit of time. An operational concern that affects system design and cost.
**Token budget** (Cross-cutting terms)
The maximum number of tokens you allocate for a specific part of the context (e.g., "retrieval evidence gets at most 4K tokens"). A context engineering tool for preventing any single component from dominating the context window.
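A minimal sketch of budget enforcement, using the crude 4-characters-per-token rule of thumb in place of a real tokenizer (real code should count with the model's actual tokenizer):

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic only: ~4 characters per token for English-like text.
    return max(1, len(text) // 4)

def pack_evidence(chunks, budget_tokens=4096):
    """Greedily keep chunks (assumed pre-sorted by relevance) until the budget is spent."""
    packed, used = [], 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget_tokens:
            break                      # budget exhausted: stop, don't truncate mid-chunk
        packed.append(chunk)
        used += cost
    return packed
```

Stopping at whole-chunk boundaries keeps each included piece of evidence intact, at the cost of leaving some budget unused.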