Module 4: Code Retrieval Context Compilation

Context Compilation (Tier 4)

Through Tiers 1-3, we've progressively improved what gets retrieved. AST-aware chunking fixed broken boundaries. Graph traversal and lexical search added structural and exact-match signals. But retrieval quality is only half the problem. The other half, and I'd argue the harder half, is what you do with retrieved evidence before it enters the model's context window.

Right now, our pipeline takes the top-k chunks, concatenates them, and stuffs them into the prompt. That's wasteful. Some chunks overlap; some are irrelevant to the specific question even though they scored well; some are too long when only three lines matter. And as we retrieve from more sources (vector, lexical, graph), the total evidence grows beyond what the model can usefully process, even with large context windows.

This lesson treats context as a compilation problem. Just as a compiler transforms source code into optimized machine code, a context compiler transforms raw retrieval results into a focused, deduplicated, token-budgeted context pack. This is where "context engineering" becomes a concrete engineering practice, not just a term people use on social media.

What you'll learn

  • Build a context compiler with five stages: planner, workspace, slicer, context pack builder, and token budgeter
  • Detect and handle context rot: oversized context, stale evidence, conflicting chunks, and accumulated noise
  • Produce context packs with provenance metadata that connect every piece of evidence back to its source
  • Measure context quality: are we sending the model what it actually needs?
  • Compare the full retrieval progression (naive through compiled) on your benchmark

Concepts

Context engineering: the practice of controlling what information enters a model's context window, in what form, and in what order. It's a named discipline because the context you provide shapes the model's reasoning as much as the prompt does. Context engineering includes retrieval, selection, formatting, ordering, deduplication, and token budgeting. We've been doing pieces of it since Module 1; this lesson makes it explicit.

Context rot: the degradation of answer quality when context becomes stale, contradictory, duplicated, or bloated. Context rot is the retrieval equivalent of technical debt. It accumulates silently: each retrieval improvement adds more evidence, and without active management, the context window fills with noise that dilutes the signal. I've seen production systems where retrieval was excellent but answers were poor because the context assembly was careless.

Context rot has four common forms:

  1. Oversized context: more tokens than the model can attend to effectively, even if the window fits them
  2. Stale evidence: chunks that were relevant to an earlier version of the question or conversation
  3. Conflicting evidence: chunks that give contradictory information (e.g., two versions of the same function)
  4. Accumulated noise: irrelevant chunks that scored just above the retrieval threshold
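Conflicting evidence (form 3) can be detected mechanically: group chunks by the symbol they cover and compare content hashes; two different hashes for the same symbol means two versions are in play. A minimal sketch, where the chunk dicts and their field names are illustrative:

```python
import hashlib
from collections import defaultdict


def find_conflicts(chunks: list[dict]) -> list[str]:
    """Flag symbols that appear with two or more differing bodies (form 3 above)."""
    by_symbol: dict[tuple[str, str], set[str]] = defaultdict(set)
    for chunk in chunks:
        digest = hashlib.md5(chunk["text"].encode()).hexdigest()
        by_symbol[(chunk["file_path"], chunk["symbol_name"])].add(digest)
    return [
        f"CONFLICT: {path} :: {symbol} retrieved in {len(hashes)} differing versions"
        for (path, symbol), hashes in by_symbol.items()
        if len(hashes) > 1
    ]


# Two retrieved versions of the same (made-up) function:
chunks = [
    {"file_path": "agent/tools.py", "symbol_name": "validate_path",
     "text": "def validate_path(p): return Path(p)"},
    {"file_path": "agent/tools.py", "symbol_name": "validate_path",
     "text": "def validate_path(p): return Path(p).resolve()"},
]
print(find_conflicts(chunks))
```

When this fires, the right response is usually to keep the chunk from the current index build and drop the stale one, not to pass both to the model.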

Context pack: a structured bundle of evidence assembled for a specific task. A context pack includes the selected code chunks, their provenance (where they came from and why), a token budget, and metadata the model can use to assess evidence quality. Think of it as a dossier prepared for the model, not a pile of search results.

Token budget: a deliberate limit on how many tokens of context you provide, independent of the model's maximum context window. A 128k-token context window doesn't mean you should use 128k tokens. In my experience, answer quality peaks well before the window is full, and it degrades as noise accumulates. A token budget forces you to be selective.
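The selection logic behind a budget is just a greedy loop: rank the evidence, then keep the best items until the budget runs out. A minimal sketch, using a rough 4-characters-per-token estimate in place of a real tokenizer (the scores, texts, and budget below are illustrative):

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English prose and code."""
    return max(1, len(text) // 4)


def select_within_budget(chunks: list[tuple[float, str]], budget: int) -> list[str]:
    """Greedily keep the highest-scoring chunks whose combined cost fits the budget."""
    selected, used = [], 0
    for score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = estimate_tokens(text)
        if used + cost <= budget:
            selected.append(text)
            used += cost
    return selected


chunks = [
    (0.9, "def a(): ..." * 10),   # small, high score -> kept
    (0.5, "def b(): ..." * 200),  # huge, low score -> skipped at budget 100
    (0.7, "def c(): ..." * 10),   # small, mid score -> kept
]
print(len(select_within_budget(chunks, budget=100)))  # → 2
```

The budgeter later in this lesson does the same thing against a real tokenizer, with provenance and warnings attached.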

Provenance: tracking where each piece of context came from: which file, which retrieval method, what score, why it was included. Provenance lets the model (and you) assess evidence quality and enables citations in the answer.
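Concretely, one small record per chunk is enough. This illustrative record uses the field names the walkthrough's pack builder emits; the values are taken from the sample output later in the lesson:

```python
import json

# One provenance record per chunk in the compiled pack (values illustrative).
record = {
    "chunk_id": "ast-00012",         # stable ID assigned by the chunker
    "file_path": "agent/tools.py",   # where the evidence lives
    "symbol_name": "validate_path",  # function/class the chunk covers
    "retrieval_method": "hybrid",    # which retrieval leg surfaced it
    "retrieval_score": 0.0489,       # score used for ordering and trimming
    "token_count": 312,              # cost against the token budget
}
print(json.dumps(record, indent=2))
```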

Problem-to-Tool Map

| Problem class | Symptom | Cheapest thing to try first | Tool or approach |
|---|---|---|---|
| Context too large | Model ignores relevant evidence buried in a wall of text | Lower top-k | Token budgeter with priority ranking |
| Duplicated evidence | Same code appears in multiple chunks (overlapping retrieval) | Deduplicate on chunk_id | Content-level deduplication |
| Irrelevant chunks | Retrieval returns chunks that scored well but aren't useful for this specific question | Increase relevance threshold | Planner that assesses chunk relevance to the question |
| Missing targeted evidence | The right file was retrieved but the relevant function is 200 lines long and only 5 lines matter | Retrieve the whole function | Slicer that extracts the relevant subsection |
| No provenance | Can't trace the model's answer back to specific code | Manual inspection | Context pack with provenance metadata |

Walkthrough

Architecture of the context compiler

The context compiler has five stages. Each stage is a separate function, which means you can improve or replace any stage independently.

Context compilation stages: question flows through planner, workspace, slicer, context pack builder, and token budgeter

Build the context compiler

# retrieval/context_compiler.py
"""Context compiler: workspace, planner, slicer, pack builder, token budgeter."""
import json
import hashlib
import re
from dataclasses import dataclass, field, asdict
from pathlib import Path

# We'll use tiktoken for accurate token counting.
# Install: pip install tiktoken
import tiktoken

ENCODING = tiktoken.encoding_for_model("gpt-4o-mini")


def count_tokens(text: str) -> int:
    """Count tokens for a text string.

    Args:
        text: Text whose token usage should be measured.

    Returns:
        int: Token count for the configured model encoding.
    """
    return len(ENCODING.encode(text))


# ---------------------------------------------------------------------------
# Data structures
# ---------------------------------------------------------------------------

@dataclass
class EvidenceChunk:
    """A single piece of evidence with provenance."""
    chunk_id: str
    file_path: str
    symbol_name: str
    text: str
    start_line: int | None = None
    end_line: int | None = None
    retrieval_method: str = ""
    retrieval_score: float = 0.0
    token_count: int = 0
    content_hash: str = ""

    def __post_init__(self):
        self.token_count = count_tokens(self.text)
        self.content_hash = hashlib.md5(self.text.encode()).hexdigest()[:12]


@dataclass
class ContextPack:
    """A compiled context pack ready for the model."""
    question: str
    chunks: list[EvidenceChunk] = field(default_factory=list)
    total_tokens: int = 0
    token_budget: int = 0
    provenance: list[dict] = field(default_factory=list)
    warnings: list[str] = field(default_factory=list)

    def to_prompt_context(self) -> str:
        """Format the pack for inclusion in a prompt.

        Returns:
            str: Rendered evidence sections with provenance headers.
        """
        sections = []
        for i, chunk in enumerate(self.chunks):
            header = f"[Evidence {i+1}] {chunk.file_path}"
            if chunk.symbol_name and chunk.symbol_name != "__module__":
                header += f" :: {chunk.symbol_name}"
            if chunk.start_line:
                header += f" (lines {chunk.start_line}-{chunk.end_line})"
            header += f" [{chunk.retrieval_method}, score: {chunk.retrieval_score:.4f}]"
            sections.append(f"{header}\n{chunk.text}")
        return "\n\n---\n\n".join(sections)

    def to_dict(self) -> dict:
        """Serialize the pack into a JSON-friendly dictionary.

        Returns:
            dict: Structured context-pack payload for logging or grading.
        """
        return {
            "question": self.question,
            "chunks": [asdict(c) for c in self.chunks],
            "total_tokens": self.total_tokens,
            "token_budget": self.token_budget,
            "warnings": self.warnings,
            "provenance": self.provenance,
        }


# ---------------------------------------------------------------------------
# Stage 1: Planner
# ---------------------------------------------------------------------------

def plan_retrieval(question: str) -> dict:
    """Analyze the question and decide what retrieval strategies to use.

    In a production system, this could be an LLM call that classifies the
    question type. For now, we'll use heuristics.
    
    Args:
        question: User question that will drive retrieval.

    Returns:
        dict: Retrieval plan containing selected strategies and extracted hints.
    """
    plan = {
        "question": question,
        "strategies": ["vector"],  # always include vector
        "identifier_hints": [],
        "file_hints": [],
    }

    # Detect identifiers (CamelCase, snake_case)
    identifiers = re.findall(
        r'\b[A-Z][a-z]+(?:[A-Z][a-z]+)+\b|\b[a-z_]+(?:_[a-z]+)+\b',
        question,
    )
    if identifiers:
        plan["strategies"].append("lexical")
        plan["identifier_hints"] = identifiers

    # Detect relationship keywords
    relationship_words = {"calls", "imports", "depends", "affects", "breaks", "changes", "uses"}
    if any(w in question.lower() for w in relationship_words):
        plan["strategies"].append("graph")

    # Detect file path mentions
    file_mentions = re.findall(r'[\w/]+\.py\b', question)
    if file_mentions:
        plan["file_hints"] = file_mentions

    return plan


# ---------------------------------------------------------------------------
# Stage 2: Workspace (collects raw evidence)
# ---------------------------------------------------------------------------

def collect_evidence(plan: dict, hybrid_retrieve_fn, graph_traverse_fn=None) -> list[EvidenceChunk]:
    """Collect raw evidence using the strategies the planner selected.

    In a production system, each strategy would have its own retrieval
    path. Here we use hybrid retrieval for vector+lexical and optionally
    add graph traversal if the planner flagged relationship keywords.
    
    Args:
        plan: Planner output describing which retrieval legs to use.
        hybrid_retrieve_fn: Callable that returns the base hybrid retrieval results.
        graph_traverse_fn: Optional callable for graph-specific evidence expansion.

    Returns:
        list[EvidenceChunk]: Raw evidence objects ready for slicing and packing.
    """
    raw_results = hybrid_retrieve_fn(plan["question"])

    # If the planner detected relationship keywords, add graph evidence
    if "graph" in plan.get("strategies", []) and graph_traverse_fn:
        for hint in plan.get("identifier_hints", []):
            graph_results = graph_traverse_fn(hint)
            raw_results.extend(graph_results)

    chunks = []
    for result in raw_results:
        chunks.append(EvidenceChunk(
            chunk_id=result.get("chunk_id", "unknown"),
            file_path=result.get("file_path", "unknown"),
            symbol_name=result.get("symbol_name", "unknown"),
            text=result.get("text", ""),
            start_line=result.get("start_line"),
            end_line=result.get("end_line"),
            retrieval_method=result.get("retrieval_method", "hybrid"),
            retrieval_score=result.get("rrf_score", 0.0),
        ))

    return chunks


# ---------------------------------------------------------------------------
# Stage 3: Slicer
# ---------------------------------------------------------------------------

def slice_evidence(chunks: list[EvidenceChunk], question: str) -> list[EvidenceChunk]:
    """Slice oversized chunks to their relevant portion.

    For now, we use a simple heuristic: if a chunk exceeds 1500 tokens,
    try to find the most relevant section. In a production system, you'd
    use an LLM to identify the relevant lines.
    
    Args:
        chunks: Raw evidence chunks gathered during retrieval.
        question: User question used to score relevant lines.

    Returns:
        list[EvidenceChunk]: Chunks with oversized entries trimmed to denser regions.
    """
    MAX_CHUNK_TOKENS = 1500
    sliced = []

    for chunk in chunks:
        if chunk.token_count <= MAX_CHUNK_TOKENS:
            sliced.append(chunk)
            continue

        # Simple heuristic: extract lines containing query keywords
        keywords = set(re.findall(r'\w+', question.lower()))
        lines = chunk.text.split("\n")
        scored_lines = []
        for i, line in enumerate(lines):
            line_words = set(re.findall(r'\w+', line.lower()))
            score = len(keywords & line_words)
            scored_lines.append((i, score))

        # Find the densest region
        best_start = 0
        best_score = 0
        window = min(40, len(lines))  # ~40 lines of context
        for start in range(len(lines) - window + 1):
            window_score = sum(s for _, s in scored_lines[start:start + window])
            if window_score > best_score:
                best_score = window_score
                best_start = start

        sliced_text = "\n".join(lines[best_start:best_start + window])
        sliced_chunk = EvidenceChunk(
            chunk_id=chunk.chunk_id + "-sliced",
            file_path=chunk.file_path,
            symbol_name=chunk.symbol_name,
            text=sliced_text,
            start_line=(chunk.start_line or 1) + best_start,
            end_line=(chunk.start_line or 1) + best_start + window - 1,
            retrieval_method=chunk.retrieval_method,
            retrieval_score=chunk.retrieval_score,
        )
        sliced.append(sliced_chunk)

    return sliced


# ---------------------------------------------------------------------------
# Stage 4: Context Pack Builder
# ---------------------------------------------------------------------------

def build_context_pack(
    question: str,
    chunks: list[EvidenceChunk],
) -> ContextPack:
    """Deduplicate, order, and annotate evidence for model consumption.

    Args:
        question: User question the pack is being assembled for.
        chunks: Candidate evidence chunks after slicing.

    Returns:
        ContextPack: Ordered pack with provenance and warnings attached.
    """
    pack = ContextPack(question=question)

    # Deduplicate by content hash
    seen_hashes = set()
    unique_chunks = []
    for chunk in chunks:
        if chunk.content_hash not in seen_hashes:
            seen_hashes.add(chunk.content_hash)
            unique_chunks.append(chunk)
        else:
            pack.warnings.append(f"Deduplicated: {chunk.chunk_id} (same content as existing chunk)")

    # Order by retrieval score (highest first)
    unique_chunks.sort(key=lambda c: c.retrieval_score, reverse=True)

    pack.chunks = unique_chunks
    pack.total_tokens = sum(c.token_count for c in unique_chunks)

    # Build provenance
    pack.provenance = [
        {
            "chunk_id": c.chunk_id,
            "file_path": c.file_path,
            "symbol_name": c.symbol_name,
            "retrieval_method": c.retrieval_method,
            "retrieval_score": c.retrieval_score,
            "token_count": c.token_count,
        }
        for c in unique_chunks
    ]

    return pack


# ---------------------------------------------------------------------------
# Stage 5: Token Budgeter
# ---------------------------------------------------------------------------

def apply_token_budget(pack: ContextPack, budget: int = 4000) -> ContextPack:
    """Trim the pack to fit within the token budget.

    Removes the lowest-scoring chunks until the pack fits. If a single
    chunk exceeds the budget, it will be sliced further.
    
    Args:
        pack: Context pack to trim in place.
        budget: Maximum token count allowed for the final pack.

    Returns:
        ContextPack: The same pack after trimming, truncation, and provenance refresh.
    """
    pack.token_budget = budget

    if pack.total_tokens <= budget:
        return pack

    # Remove lowest-scoring chunks until we fit
    while pack.total_tokens > budget and len(pack.chunks) > 1:
        removed = pack.chunks.pop()
        pack.total_tokens -= removed.token_count
        pack.warnings.append(
            f"Budget trim: removed {removed.chunk_id} ({removed.token_count} tokens, "
            f"score {removed.retrieval_score:.4f})"
        )

    # If the single remaining chunk still exceeds budget, truncate it
    if pack.total_tokens > budget and pack.chunks:
        chunk = pack.chunks[0]
        target_chars = int(budget * 3.5)  # rough chars-per-token ratio
        chunk.text = chunk.text[:target_chars]
        chunk.token_count = count_tokens(chunk.text)
        pack.total_tokens = chunk.token_count
        pack.warnings.append(f"Truncated {chunk.chunk_id} to fit budget")

    # Recalculate provenance
    pack.provenance = [
        {
            "chunk_id": c.chunk_id,
            "file_path": c.file_path,
            "symbol_name": c.symbol_name,
            "retrieval_method": c.retrieval_method,
            "retrieval_score": c.retrieval_score,
            "token_count": c.token_count,
        }
        for c in pack.chunks
    ]

    return pack


# ---------------------------------------------------------------------------
# Full pipeline
# ---------------------------------------------------------------------------

def compile_context(
    question: str,
    hybrid_retrieve_fn,
    graph_traverse_fn=None,
    token_budget: int = 4000,
) -> ContextPack:
    """Run the full context compilation pipeline.

    In this version, the planner detects strategies but the workspace
    stage uses hybrid retrieval for most of them. Passing graph_traverse_fn
    enables the graph branch for relationship questions. Extending this to
    route each strategy independently is a natural next step.
    
    Args:
        question: User question to compile evidence for.
        hybrid_retrieve_fn: Callable that performs the base hybrid retrieval step.
        graph_traverse_fn: Optional callable that adds graph-only evidence.
        token_budget: Maximum token budget for the final context pack.

    Returns:
        ContextPack: Final compiled pack after planning, slicing, and budgeting.
    """
    # Stage 1: Plan
    plan = plan_retrieval(question)

    # Stage 2: Collect (uses plan to optionally add graph evidence)
    raw_chunks = collect_evidence(plan, hybrid_retrieve_fn, graph_traverse_fn)

    # Stage 3: Slice
    sliced_chunks = slice_evidence(raw_chunks, question)

    # Stage 4: Build pack
    pack = build_context_pack(question, sliced_chunks)

    # Stage 5: Budget
    pack = apply_token_budget(pack, budget=token_budget)

    return pack


if __name__ == "__main__":
    import sys
    sys.path.insert(0, ".")
    from retrieval.hybrid_retrieve import hybrid_retrieve

    question = sys.argv[1] if len(sys.argv) > 1 else "What functions call validate_path and how would changing it affect the codebase?"
    print(f"Question: {question}\n")

    pack = compile_context(question, hybrid_retrieve, token_budget=4000)

    print(f"Context pack: {len(pack.chunks)} chunks, {pack.total_tokens} tokens (budget: {pack.token_budget})")
    print(f"\nProvenance:")
    for p in pack.provenance:
        print(f"  {p['chunk_id']}: {p['file_path']} :: {p['symbol_name']} ({p['token_count']} tokens)")
    if pack.warnings:
        print(f"\nWarnings:")
        for w in pack.warnings:
            print(f"  {w}")
    print(f"\nFormatted context preview (first 500 chars):")
    print(pack.to_prompt_context()[:500])
Install the dependency and run the compiler:

pip install tiktoken
python retrieval/context_compiler.py "What functions call validate_path and how would changing it affect the codebase?"

Expected output:

Question: What functions call validate_path and how would changing it affect the codebase?

Context pack: 4 chunks, 1243 tokens (budget: 4000)

Provenance:
  ast-00012: agent/tools.py :: validate_path (312 tokens)
  ast-00015: agent/tools.py :: read_file (287 tokens)
  ast-00034: retrieval/query_metadata.py :: find_symbol (241 tokens)
  ast-00008: agent/loop.py :: run_agent (403 tokens)

Formatted context preview (first 500 chars):
[Evidence 1] agent/tools.py :: validate_path (lines 108-117) [hybrid, score: 0.0489]
def validate_path(path_str: str) -> Path:
...
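The slicer's densest-window heuristic is easy to exercise in isolation. Here's a stripped-down version of the same idea over a toy file (the code sample and question are made up):

```python
import re


def densest_window(text: str, question: str, window: int = 3) -> str:
    """Return the contiguous run of `window` lines that shares the most words with the question."""
    keywords = set(re.findall(r"\w+", question.lower()))
    lines = text.split("\n")
    window = min(window, len(lines))
    # Score each line by keyword overlap, then pick the best contiguous run.
    scores = [len(keywords & set(re.findall(r"\w+", ln.lower()))) for ln in lines]
    best = max(range(len(lines) - window + 1),
               key=lambda s: sum(scores[s:s + window]))
    return "\n".join(lines[best:best + window])


code = "\n".join([
    "import os",
    "",
    "def validate_path(path_str):",
    "    path = Path(path_str).resolve()",
    "    return path",
    "",
    "def unrelated():",
    "    pass",
])
print(densest_window(code, "What does validate_path do with path_str?"))
```

The window lands on the function that mentions the query's identifiers and skips the unrelated code, which is all the slicer needs for a 1500-token chunk where only a few lines matter.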

Detect context rot

Context rot happens when your pipeline accumulates evidence without managing its quality. Here's a detector you can run after any retrieval:

# retrieval/detect_context_rot.py
"""Detect context rot patterns in a context pack."""
from retrieval.context_compiler import ContextPack, count_tokens


def detect_rot(pack: ContextPack) -> list[str]:
    """Check a context pack for common context-rot patterns.

    Args:
        pack: Compiled context pack to inspect.

    Returns:
        list[str]: Human-readable issue labels describing any detected problems.
    """
    issues = []

    # 1. Oversized context
    if pack.total_tokens > pack.token_budget * 0.9:
        issues.append(
            f"OVERSIZED: Pack uses {pack.total_tokens}/{pack.token_budget} tokens "
            f"({pack.total_tokens / pack.token_budget * 100:.0f}% of budget). "
            "Consider tightening the slicer or lowering top-k."
        )

    # 2. Low-score chunks taking budget
    if pack.chunks:
        lowest = min(pack.chunks, key=lambda c: c.retrieval_score)
        highest = max(pack.chunks, key=lambda c: c.retrieval_score)
        if highest.retrieval_score > 0 and lowest.retrieval_score < highest.retrieval_score * 0.3:
            issues.append(
                f"NOISE: Chunk {lowest.chunk_id} has score {lowest.retrieval_score:.4f} "
                f"vs. best {highest.retrieval_score:.4f}. This chunk may be noise."
            )

    # 3. Duplicate files
    files = [c.file_path for c in pack.chunks]
    file_counts = {}
    for f in files:
        file_counts[f] = file_counts.get(f, 0) + 1
    for f, count in file_counts.items():
        if count > 2:
            issues.append(
                f"DUPLICATION: {count} chunks from {f}. Consider merging or selecting "
                "the most relevant section."
            )

    # 4. Large single chunk dominating budget
    for chunk in pack.chunks:
        if pack.total_tokens > 0 and chunk.token_count > pack.total_tokens * 0.6:
            issues.append(
                f"DOMINATION: Chunk {chunk.chunk_id} uses {chunk.token_count} tokens "
                f"({chunk.token_count / pack.total_tokens * 100:.0f}% of total). "
                "Consider slicing to the relevant section."
            )

    if not issues:
        issues.append("No context rot detected.")

    return issues

Run the context-compiled benchmark and compare all tiers

# retrieval/run_compiled_benchmark.py
"""Run benchmark through context-compiled retrieval. Full Tier 4."""
import json
import os
import sys
from datetime import datetime, timezone
from pathlib import Path

sys.path.insert(0, ".")
from openai import OpenAI
from retrieval.context_compiler import compile_context
from retrieval.hybrid_retrieve import hybrid_retrieve
from retrieval.detect_context_rot import detect_rot

RUN_ID = "compiled-v1-" + datetime.now(timezone.utc).strftime("%Y-%m-%d-%H%M%S")
MODEL = "gpt-4o-mini"
PROVIDER = "openai"
BENCHMARK_FILE = Path("benchmark-questions.jsonl")
OUTPUT_FILE = Path(f"harness/runs/{RUN_ID}.jsonl")
REPO_SHA = os.popen("git rev-parse --short HEAD").read().strip()
TOKEN_BUDGET = 4000

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a code assistant. Answer the question using ONLY the "
    "retrieved evidence below. Each piece of evidence includes its "
    "source file and retrieval method. If the evidence is insufficient, "
    "say so and explain what's missing."
)


def answer_with_compiled_context(question: str) -> dict:
    """Answer a benchmark question using the compiled-context pipeline.

    Args:
        question: Benchmark question to answer from retrieved evidence.

    Returns:
        dict: Final answer text plus the compiled pack and any rot warnings.
    """
    pack = compile_context(question, hybrid_retrieve, token_budget=TOKEN_BUDGET)
    rot_issues = detect_rot(pack)

    context = pack.to_prompt_context()
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": f"{SYSTEM_PROMPT}\n\nEvidence:\n{context}"},
            {"role": "user", "content": question},
        ],
        temperature=0,
    )

    return {
        "answer": response.choices[0].message.content,
        "pack": pack.to_dict(),
        "context_rot": rot_issues,
    }


def run_benchmark():
    """Run the benchmark set through the compiled-context pipeline.

    Returns:
        None
    """
    questions = []
    with open(BENCHMARK_FILE) as f:
        for line in f:
            if line.strip():
                questions.append(json.loads(line))

    print(f"Running {len(questions)} questions through context-compiled retrieval")
    print(f"Token budget: {TOKEN_BUDGET}")
    print(f"Run ID: {RUN_ID}\n")

    results = []
    for i, q in enumerate(questions):
        print(f"[{i+1}/{len(questions)}] {q['category']}: {q['question'][:60]}...")
        result = answer_with_compiled_context(q["question"])

        pack = result["pack"]
        print(f"  Context: {pack['total_tokens']} tokens, {len(pack['chunks'])} chunks")
        if result["context_rot"] and "No context rot" not in result["context_rot"][0]:
            for issue in result["context_rot"]:
                print(f"  Rot: {issue}")

        entry = {
            "run_id": RUN_ID,
            "question_id": q["id"],
            "question": q["question"],
            "category": q["category"],
            "answer": result["answer"],
            "model": MODEL,
            "provider": PROVIDER,
            "evidence_files": [p["file_path"] for p in pack["provenance"]],
            "context_tokens": pack["total_tokens"],
            "token_budget": pack["token_budget"],
            "context_rot_issues": result["context_rot"],
            "retrieval_method": "compiled_hybrid",
            "grade": None,
            "failure_label": None,
            "grading_notes": "",
            "repo_sha": REPO_SHA,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "harness_version": "v0.2",
        }
        results.append(entry)

    os.makedirs("harness/runs", exist_ok=True)
    with open(OUTPUT_FILE, "w") as f:
        for entry in results:
            f.write(json.dumps(entry) + "\n")

    print(f"\nDone. {len(results)} results saved to {OUTPUT_FILE}")
    print("Grade and compare across all four tiers.")


if __name__ == "__main__":
    run_benchmark()
Run it:

python -m retrieval.run_compiled_benchmark

The full progression

After grading all four benchmark runs, you'll have a progression that looks something like this (your specific numbers will vary):

| Tier | Retrieval method | Typical accuracy range | Context tokens per question |
|---|---|---|---|
| Tier 1 | Naive vector | 30-45% | ~4,000 (mostly noise) |
| Tier 2 | AST-aware vector | 45-60% | ~3,500 (better chunks) |
| Tier 3 | Hybrid (vector + lexical + graph) | 55-70% | ~4,500 (more sources) |
| Tier 4 | Compiled hybrid | 60-75% | ~2,500 (focused, deduplicated) |

Notice the compiled tier: accuracy goes up while context tokens go down. That's the core insight of context compilation. Less noise means better signal. Bigger context windows don't fix noisy context; careful context engineering does.

See the Context-Pack Contract reference page for the full schema, provenance rules, and anti-patterns.

Exercises

  1. Build the context compiler (context_compiler.py). Run it on three questions and inspect the output: check provenance, warnings, and token counts.
  2. Run the rot detector (detect_context_rot.py) on your context packs. Fix any issues it flags by adjusting the planner, slicer, or budgeter.
  3. Run the context-compiled benchmark (run_compiled_benchmark.py). Grade at least 15 answers.
  4. Build a comparison table across all four retrieval approaches: naive, AST-aware, hybrid, and compiled. For each benchmark question category, note which approach first reached an acceptable accuracy. Which categories needed all four? Which were solved by AST-aware retrieval alone?
  5. Experiment with the token budget. Run the same benchmark at 2000, 4000, and 8000 tokens. Does accuracy improve with more tokens, plateau, or degrade? Find the sweet spot for your questions.

Completion checkpoint

You have:

  • A working context compiler with all five stages: planner, workspace, slicer, pack builder, token budgeter
  • Context packs with provenance metadata for every piece of evidence
  • A context rot detector that flags common quality issues
  • A full naive-through-compiled progression showing measurable improvement
  • Evidence that focused context (fewer tokens, better selection) outperforms raw retrieval (more tokens, no curation)

Reflection prompts

  • How much did context compilation improve accuracy compared to hybrid retrieval? Was the improvement from better evidence selection, deduplication, or token budgeting?
  • What's the relationship between context size and answer quality in your benchmark? Is there a point where more context makes answers worse?
  • Which context rot pattern appeared most often in your results? What does that tell you about your retrieval pipeline's weaknesses?
  • Looking back across all four tiers, which single upgrade produced the largest accuracy improvement? Was it the one you expected?

Connecting to the project

This context compiler is now a core component of your anchor project. Every question your assistant handles will pass through this pipeline: retrieve from multiple substrates, compile a focused context pack, and present it to the model with provenance. In Module 5, we'll build the full RAG pipeline on top of this, adding response generation, citation, grounding verification, and evaluation. The context compiler you built here will be the engine underneath all of it.

The progression you've documented (from naive retrieval to compiled context) is also a demonstration of the iterative methodology this curriculum teaches. You didn't build the "right" retrieval system on day one. You built the simplest version, observed its failures, and upgraded each layer based on evidence. That pattern will serve you well beyond this curriculum.

What's next

RAG Pipeline. You can assemble focused evidence now; the next lesson turns that evidence into grounded answers with citations and abstention.



Glossary
Foundational terms

API (Application Programming Interface)
A structured way for programs to communicate. In this context, usually an HTTP endpoint you call to interact with an LLM.
AST (Abstract Syntax Tree)
A tree representation of source code structure. Used by parsers like Tree-sitter to understand code as a hierarchy of functions, classes, and statements. You'll encounter this more deeply in the Code Retrieval module, but the concept appears briefly in retrieval fundamentals.
BM25 (Best Match 25)
A classical ranking function for keyword search. Scores documents by term frequency and inverse document frequency. Often competitive with or complementary to vector search.
Chunking
Splitting a document into smaller pieces for indexing and retrieval. Chunk boundaries significantly affect retrieval quality. Split at the wrong place and your retrieval will return half a function or the end of one paragraph glued to the start of another.
Context engineering
The discipline of selecting, packaging, and budgeting the information a model sees at inference time. Prompts, retrieved evidence, tool results, memory, and state are all parts of context. Context engineering is arguably the core skill of AI engineering. Bigger context windows are not a substitute for better context selection.
Context rot
Degradation of output quality caused by stale, noisy, or accumulated context. Symptoms include stale memory facts, conflicting retrieved evidence, bloated prompt history, and accumulated instructions that contradict each other. A form of technical debt in AI systems.
Context window
The maximum number of tokens an LLM can process in a single request (input + output combined).
Embedding
A fixed-length numeric vector representing a piece of text. Used for similarity search: texts with similar meanings have nearby embeddings.
Endpoint
A specific URL path that accepts requests and returns responses (e.g., POST /v1/chat/completions).
GGUF
A file format for quantized models used by llama.cpp and Ollama. When you see a model name like qwen2.5:7b-q4_K_M, the suffix indicates the quantization scheme. GGUF supports mixed quantization (different precision for different layers) and is the most common format for local inference.
Hallucination
When a model generates content that sounds confident but isn't supported by the evidence it was given, or fabricates details that don't exist. Not the same as "any wrong answer": a model that misinterprets ambiguous instructions gave a bad answer but didn't hallucinate. Common causes: weak prompt, missing context, context rot, model limitation, or retrieval failure.
Inference
Running a trained model to generate output from input. What happens when you call an API. Most AI engineering work is inference-time work: building systems around models, not training them. Use "inference," not "inferencing."
JSON (JavaScript Object Notation)
A lightweight text format for structured data. The lingua franca of API communication.
Lexical search
Finding items by matching keywords or terms. Includes BM25, TF-IDF (Term Frequency–Inverse Document Frequency), and simple keyword matching. Returns exact term matches, not semantic similarity.
LLM (Large Language Model)
A neural network trained on large text corpora that generates text by predicting the next token. The core technology behind AI engineering; every tool, pattern, and pipeline in this curriculum runs on top of one.
Metadata
Structured information about a document or chunk (file path, language, author, date, symbol type). Used for filtering retrieval results.
Neural network
A computing system loosely inspired by biological neurons, built from layers of mathematical functions that transform inputs into outputs. LLMs are a specific type of neural network (transformers) trained on text. You don't need to understand neural network internals to do AI engineering, but knowing the term helps when reading external resources.
Reasoning model
A model optimized for complex multi-step planning, math, and logic (e.g., o3, o4-mini). Slower and more expensive but better on hard problems. Sometimes called an "LRM" (large reasoning model), but "reasoning model" is the more consistent term across provider docs.
Reranking
A second-pass scoring step that re-orders retrieved results using a more expensive model. Improves precision after an initial broad retrieval.
Schema
A formal description of the shape and types of a data structure. Used to validate inputs and outputs.
SLM (small language model)
A compact model (typically 1-7B parameters) that runs on consumer hardware with lower cost, lower latency, and better privacy (e.g., Phi, small Llama variants, Gemma). The right choice when privacy, offline operation, predictable cost, or low latency matter more than peak capability.
System prompt
A special message that sets the model's behavior, role, and constraints for a conversation.
Temperature
A parameter controlling output randomness. Lower values produce more deterministic output; higher values produce more varied output. Does not affect the model's intelligence.
Token
The basic unit an LLM processes. Not a word. Tokens are sub-word fragments: "unhappiness" might be three tokens, "un", "happi", "ness". Token count determines cost and context window usage.
Top-k
The number of results returned from a retrieval query. "Top-5" means the five highest-scoring results.
Top-p (nucleus sampling)
An alternative to temperature for controlling output diversity. Selects from the smallest set of tokens whose cumulative probability exceeds p.
Vector search
Finding items by proximity in embedding space (nearest neighbors). Returns "similar" results, not "exact match" results.
vLLM (virtual LLM)
An inference serving engine (not a model) that hosts open-weight models behind an OpenAI-compatible HTTP endpoint. Infrastructure layer, not model layer. Relevant when moving from hosted APIs to self-hosting.
Weights
The learned parameters inside a model. Changed during training, fixed during inference.
Workhorse model
A general-purpose LLM optimized for speed and broad capability (e.g., GPT-4o-mini, Claude Haiku, Gemini Flash). The default for most tasks. When someone says "LLM" without qualification, they usually mean this.
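The BM25 entry above can be made concrete with a short sketch. This is a minimal, self-contained implementation of the classic scoring formula (the corpus is made up for illustration; k1=1.5 and b=0.75 are the commonly cited defaults, and real search engines add refinements on top):

```python
import math

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against the query with classic BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N  # average document length
    scores = []
    for doc in docs:
        score = 0.0
        for term in query_terms:
            n = sum(1 for d in docs if term in d)          # docs containing term
            idf = math.log((N - n + 0.5) / (n + 0.5) + 1)  # smoothed IDF
            tf = doc.count(term)                            # term frequency in doc
            score += idf * (tf * (k1 + 1)) / (
                tf + k1 * (1 - b + b * len(doc) / avgdl)
            )
        scores.append(score)
    return scores

docs = [
    "parse the abstract syntax tree".split(),
    "vector search returns similar results".split(),
    "bm25 scores documents by term frequency".split(),
]
scores = bm25_scores("term frequency".split(), docs)
best = scores.index(max(scores))  # only the third doc mentions the query terms
```

Note how a document that never mentions a query term scores zero: BM25 rewards exact term matches, which is exactly why it complements vector search.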
Benchmark and Harness terms

Baseline
The first measured performance of your system on a benchmark. Everything else is compared against this. Without a baseline, you can't tell whether a change helped.
Benchmark
A fixed set of questions or tasks with known-good answers, used to measure system performance over time.
Run log
A structured record (typically JSONL) of every system run: what input was given, what output was produced, what tools were called, how long it took, and what it cost. The raw data that evals, telemetry, and cost analysis are built from.
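A run log in the sense above is just append-only JSONL. Here is a minimal sketch; the field names (`run_id`, `latency_s`, `cost_usd`) are illustrative, not a standard schema:

```python
import io
import json
import time

def log_run(fh, run_id, prompt, output, latency_s, cost_usd):
    """Append one structured run record as a single JSON line."""
    record = {
        "run_id": run_id,
        "prompt": prompt,
        "output": output,
        "latency_s": round(latency_s, 3),
        "cost_usd": cost_usd,
        "ts": time.time(),  # when the run was logged
    }
    fh.write(json.dumps(record) + "\n")

# In practice fh would be an open file in append mode; StringIO keeps this runnable.
buf = io.StringIO()
log_run(buf, "run-001", "What does parse_config do?", "It loads...", 1.234, 0.0007)
first = json.loads(buf.getvalue().splitlines()[0])
```

Because each line is an independent JSON object, evals and cost analysis can stream the file without loading it all at once.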
Agent and Tool Building terms

A2A (Agent-to-Agent protocol)
An open protocol for peer-to-peer agent collaboration. Agents discover each other's capabilities and delegate or negotiate tasks as equals. Different from MCP (which connects agents to tools, not to other agents) and from handoffs (which transfer control within one system).
Agent
A system where an LLM decides which tools to call, observes results, and iterates until a task is complete. Agent = model + tools + control loop.
Control loop
The code that manages the agent's cycle: send prompt, check for tool calls, execute tools, append results, repeat or finish.
Handoff
Passing control from one agent or specialist to another within an orchestrated system.
MCP (Model Context Protocol)
An open protocol for exposing tools, resources, and prompts to AI applications in a standardized way. Connects agents to capabilities (tools and data), not to other agents.
Tool calling / function calling
The model's ability to request execution of a specific function with structured arguments, rather than just generating text.
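The "agent = model + tools + control loop" equation can be sketched with no API at all. Everything here is illustrative: `fake_model` stands in for an LLM, and the dict-based tool-call format is invented for the sketch, not any provider's wire format:

```python
def calculator(expression: str) -> str:
    """A toy tool: evaluate a simple arithmetic expression."""
    return str(eval(expression, {"__builtins__": {}}))

TOOLS = {"calculator": calculator}

def fake_model(messages):
    """Stand-in for an LLM: request a tool once, then answer."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "calculator", "args": "2 + 3"}
    return {"answer": f"The result is {messages[-1]['content']}"}

def control_loop(task: str, max_steps: int = 5) -> str:
    """The agent cycle: call model, execute requested tools, repeat or finish."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = fake_model(messages)
        if "answer" in reply:                         # model is done
            return reply["answer"]
        result = TOOLS[reply["tool"]](reply["args"])  # execute the requested tool
        messages.append({"role": "tool", "content": result})
    return "step budget exhausted"

answer = control_loop("What is 2 + 3?")
```

Swapping `fake_model` for a real chat-completions call (and the dict format for the provider's tool-call schema) is essentially all that separates this skeleton from a working agent.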
Code Retrieval terms

Context compilation / context packing
The process of selecting and assembling the smallest useful set of evidence for a specific task. Not "dump everything retrieved into the prompt."
Grounding
Tying model assertions to specific evidence. A grounded answer cites what it found; an ungrounded answer asserts without evidence.
Hybrid retrieval
Combining multiple retrieval methods (e.g., vector search + keyword search + metadata filters) and merging or reranking the results.
Knowledge graph
A data structure that stores entities and their relationships explicitly (e.g., "function A calls function B," "module X imports module Y"). Useful for traversal and dependency reasoning. One retrieval strategy among several, often overused when simpler metadata or adjacency tables would suffice.
RAG (Retrieval-Augmented Generation)
A pattern where the model's response is grounded in retrieved external evidence rather than relying solely on its training data.
Symbol table
A mapping of code identifiers (functions, classes, variables) to their locations and metadata.
Tree-sitter
An incremental parsing library that builds ASTs for source code. Used in this curriculum for code-aware chunking and symbol extraction.
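One common way to do the "merging" half of hybrid retrieval is reciprocal rank fusion (RRF): each result earns 1/(k + rank) from every list it appears in, so items ranked well by multiple retrievers float to the top. A minimal sketch (the chunk IDs are made up; k=60 is the constant from the original RRF paper):

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked result lists by summing 1/(k + rank) per item."""
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["chunk_a", "chunk_b", "chunk_c"]  # from embedding search
bm25_hits = ["chunk_b", "chunk_d", "chunk_a"]    # from lexical search
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

chunk_b wins because both retrievers ranked it highly, even though neither put it first; that cross-retriever agreement is the signal RRF exploits, and it needs no score normalization between methods.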
RAG and Grounded Answers terms

Context pack
A structured bundle of evidence assembled for a specific task, with metadata about provenance, relevance, and token budget.
Evidence bundle
A collection of retrieved items grouped for a specific sub-task, with enough metadata to evaluate whether the evidence is relevant and sufficient.
Retrieval routing
Deciding which retrieval strategy or method to use for a given query. Different questions need different retrieval methods.
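A context pack with provenance and a token budget, as defined above, might look like this minimal sketch (the field names and the `add` policy are illustrative choices, not a fixed schema):

```python
from dataclasses import dataclass, field

@dataclass
class EvidenceItem:
    text: str      # the evidence itself
    source: str    # provenance: file path and line range
    score: float   # retrieval relevance
    tokens: int    # estimated token cost

@dataclass
class ContextPack:
    task: str
    budget_tokens: int
    items: list = field(default_factory=list)

    def add(self, item: EvidenceItem) -> bool:
        """Admit evidence only while it fits within the token budget."""
        used = sum(i.tokens for i in self.items)
        if used + item.tokens > self.budget_tokens:
            return False
        self.items.append(item)
        return True

pack = ContextPack(task="explain parse_config", budget_tokens=100)
ok1 = pack.add(EvidenceItem("def parse_config...", "config.py:10-40", 0.91, 80))
ok2 = pack.add(EvidenceItem("unrelated helper", "util.py:1-30", 0.40, 50))  # over budget
```

Carrying `source` on every item is what lets the generation stage emit citations later; the budget check is what keeps one greedy retriever from crowding out everything else.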
Observability and Evals terms

Eval
A structured test that measures system quality. Not the same as training: evals measure, they don't change the model.
Harness (AI harness / eval harness)
The experiment and evaluation framework around your model or agent. It runs benchmark tasks, captures outputs, logs traces, grades results, and compares system versions. It turns ad hoc "try it and see" into repeatable, comparable experiments. Typically includes: input dataset, prompt and tool configuration, model/provider selection, execution loop, logging, grading, and artifact capture.
LLM-as-judge
Using a language model to evaluate or grade the output of another model or system. Useful for scaling evaluation beyond manual review, but requires rubric quality, judge consistency checks, and human spot-checking. Not a replacement for exact-match checks where they apply.
OpenTelemetry (OTel)
An open standard for collecting and exporting telemetry data (traces, metrics, logs). Vendor-agnostic.
RAGAS
A specific eval framework for retrieval-augmented generation. Measures metrics like faithfulness, relevance, and context precision. One tool example, not a foundational concept. Learn the metrics first, then the tool.
Span
A single operation within a trace (e.g., one tool call, one retrieval query). Traces are made of spans.
Telemetry
Structured data about system behavior: what happened, when, how long it took, what it cost. Includes traces, metrics, and events.
Trace
A structured record of one complete run through the system, including all steps, tool calls, and decisions.
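The trace/span relationship can be sketched without any telemetry library. This is an illustrative toy, not the OpenTelemetry API (a real setup would use OTel's tracer and exporters):

```python
import time
from contextlib import contextmanager

trace = []  # one run's spans, in order

@contextmanager
def span(name):
    """Record one operation's name and duration into the trace."""
    start = time.perf_counter()
    try:
        yield
    finally:
        trace.append({"name": name, "duration_s": time.perf_counter() - start})

# A run made of two spans: retrieval, then generation.
with span("retrieval"):
    time.sleep(0.01)  # stand-in for a retrieval query
with span("generation"):
    time.sleep(0.01)  # stand-in for a model call

names = [s["name"] for s in trace]
```

Even this toy shows the key property: the trace survives after the run, so you can ask "which span ate the latency?" without re-running anything.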
Orchestration and Memory terms

Long-term memory
Persistent facts that survive across conversations. Requires write policies to manage what gets stored, updated, or deleted.
Orchestration
Explicit control over how tasks are routed, delegated, and synthesized across multiple agents or specialists.
Router
A component that decides which specialist or workflow path to use for a given query.
Specialist
An agent or workflow tuned for a narrow task (e.g., "code search," "documentation lookup," "test generation"). Specialists are composed by an orchestrator.
Thread memory
Conversation state that persists within a single session or thread.
Workflow memory
Intermediate state that persists within a multi-step task but doesn't survive beyond the workflow's completion.
Optimization terms

Catastrophic forgetting
When fine-tuning causes a model to lose capabilities it had before training. The model gets better at the fine-tuned task but worse at tasks it previously handled. PEFT methods like LoRA reduce this risk by freezing original weights.
Distillation
Training a smaller (student) model to reproduce the behavior of a larger (teacher) model on a specific task.
DPO (Direct Preference Optimization)
A method for preference-based model optimization that's simpler than RLHF, training the model directly on preference pairs without a separate reward model.
Fine-tuning
Updating a model's weights on task-specific data to change its behavior permanently. An umbrella term that includes SFT, instruction tuning, RLHF, DPO, and other techniques. See the fine-tuning landscape table in Lesson 8.3 for how these relate.
Full fine-tuning
Updating all of a model's parameters during training, as opposed to PEFT methods that update only a small subset. Requires significantly more GPU memory and compute. Produces the most thorough adaptation but carries a higher risk of catastrophic forgetting.
Inference server
Software (like vLLM or Ollama) that hosts a model and serves inference requests.
Instruction tuning
A specific application of SFT where the training data consists of instruction-response pairs. This is how base models become chat models: the technique is SFT, the data format is instructions. Not a separate technique from SFT.
LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning method that trains small adapter matrices instead of updating all model weights. Dramatically reduces GPU memory and compute requirements.
Overfitting
When a model memorizes training examples instead of learning generalizable patterns. The model performs well on training data but poorly on new inputs. Detected by monitoring validation loss alongside training loss.
Parameter count
The number of learned weights in a model, commonly expressed in billions (e.g., "7B" = 7 billion parameters). Determines memory requirements (roughly 2 bytes per parameter at FP16) and broadly correlates with capability, though training quality and architecture matter as much as size. See Model Selection and Serving for sizing guidance.
PEFT (Parameter-Efficient Fine-Tuning)
A family of methods (including LoRA) that fine-tune a small subset of parameters instead of the full model.
Preference optimization
Training methods (RLHF, DPO) that use human or automated preference signals to improve model behavior. "This output is better than that output" rather than "this is the correct output."
QLoRA (Quantized LoRA)
LoRA applied to a quantized (compressed) base model. Further reduces memory requirements, enabling fine-tuning on consumer hardware.
Quantization
Reducing the precision of model weights (e.g., FP16 → INT4) to shrink memory usage and increase inference speed at some quality cost. A 7B model at FP16 needs ~14 GB VRAM; quantized to 4-bit, it fits in ~4 GB. Common formats include GGUF (llama.cpp/Ollama), GPTQ, and AWQ (vLLM/HuggingFace). See Model Selection and Serving for format details and tradeoffs.
RLHF (Reinforcement Learning from Human Feedback)
A training method that uses human preference signals to improve model behavior through a reward model. More complex than DPO (requires training a separate reward model) but offers more control over the optimization objective.
SFT (Supervised Fine-Tuning)
Fine-tuning using input-output pairs where the desired output is known. The most common fine-tuning approach.
TRL (Transformer Reinforcement Learning)
A Hugging Face library for training language models with reinforcement learning, SFT, and other optimization methods.
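The memory arithmetic behind the parameter-count and quantization entries is simple enough to sketch. The function below estimates weight memory only; real deployments also need room for the KV cache and activations, which is why a 4-bit 7B model is usually quoted at ~4 GB rather than the bare 3.5 GB of weights:

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Rough memory needed for model weights alone (no KV cache, no activations)."""
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

fp16 = weight_memory_gb(7, 16)  # 7B params at 2 bytes each
int4 = weight_memory_gb(7, 4)   # same model quantized to 4-bit
```

This is the back-of-envelope check to run before picking a model: does the weight footprint, plus headroom for the cache, fit your GPU?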
Cross-cutting terms

Consumer chat app
The browser or desktop product meant for human conversation (ChatGPT, Claude, HuggingChat). Useful for experimentation, but not the same as API access.
Developer platform
The provider's API, billing, API-key, and developer-docs surface. This is what you need for this learning path.
Hosted API
The provider runs the model for you and you call it over HTTP.
Local inference
You run the model on your own machine.
Provider
The company or service that hosts a model API you call from code.
Prompt caching
Reusing computation from repeated prompt prefixes to reduce latency and cost on subsequent requests with the same prefix.
Rate limiting
Constraints on how many API requests you can make per unit of time. An operational concern that affects system design and cost.
Token budget
The maximum number of tokens you allocate for a specific part of the context (e.g., "retrieval evidence gets at most 4K tokens"). A context engineering tool for preventing any single component from dominating the context window.
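A token budget can be enforced even before a real tokenizer enters the picture. The sketch below uses the rough 4-characters-per-token heuristic for English text (an approximation only; swap in your provider's tokenizer for exact counts):

```python
def estimate_tokens(text: str) -> int:
    """Crude token estimate: roughly 4 characters per token for English."""
    return max(1, len(text) // 4)

def enforce_budget(chunks, budget_tokens):
    """Keep chunks in ranked order until the evidence budget is exhausted."""
    kept, used = [], 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget_tokens:
            break  # the budget is a hard cap, not a suggestion
        kept.append(chunk)
        used += cost
    return kept

chunks = ["a" * 400, "b" * 400, "c" * 400]  # ~100 estimated tokens each
kept = enforce_budget(chunks, budget_tokens=250)
```

Because the chunks arrive in relevance order, truncating at the budget drops the weakest evidence first, which is the behavior a context compiler's budgeter needs.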