Context Compilation (Tier 4)
Through Tiers 1-3, we've progressively improved what gets retrieved. AST-aware chunking fixed broken boundaries. Graph traversal and lexical search added structural and exact-match signals. But retrieval quality is only half the problem. The other half, and I'd argue the harder half, is what you do with retrieved evidence before it enters the model's context window.
Right now, our pipeline takes the top-k chunks, concatenates them, and stuffs them into the prompt. That's wasteful. Some chunks overlap; some are irrelevant to the specific question even though they scored well; some are too long when only three lines matter. And as we retrieve from more sources (vector, lexical, graph), the total evidence grows beyond what the model can usefully process, even with large context windows.
This lesson treats context as a compilation problem. Just as a compiler transforms source code into optimized machine code, a context compiler transforms raw retrieval results into a focused, deduplicated, token-budgeted context pack. This is where "context engineering" becomes a concrete engineering practice, not just a term people use on social media.
What you'll learn
- Build a context compiler with five stages: planner, workspace, slicer, context pack builder, and token budgeter
- Detect and handle context rot: oversized context, stale evidence, conflicting chunks, and accumulated noise
- Produce context packs with provenance metadata that connect every piece of evidence back to its source
- Measure context quality: are we sending the model what it actually needs?
- Compare the full retrieval progression (naive through compiled) on your benchmark
Concepts
Context engineering: the practice of controlling what information enters a model's context window, in what form, and in what order. It's a named discipline because the context you provide shapes the model's reasoning as much as the prompt does. Context engineering includes retrieval, selection, formatting, ordering, deduplication, and token budgeting. We've been doing pieces of it since Module 1; this lesson makes it explicit.
Context rot: the degradation of answer quality when context becomes stale, contradictory, duplicated, or bloated. Context rot is the retrieval equivalent of technical debt. It accumulates silently: each retrieval improvement adds more evidence, and without active management, the context window fills with noise that dilutes the signal. I've seen production systems where retrieval was excellent but answers were poor because the context assembly was careless.
Context rot has four common forms:
- Oversized context: more tokens than the model can attend to effectively, even if the window fits them
- Stale evidence: chunks that were relevant to an earlier version of the question or conversation
- Conflicting evidence: chunks that give contradictory information (e.g., two versions of the same function)
- Accumulated noise: irrelevant chunks that scored just above the retrieval threshold
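Of the four, conflicting evidence is the most mechanical to detect: when two chunks claim the same (file, symbol) pair but their contents differ, at least one is likely stale. A minimal sketch, using a hypothetical list-of-dicts evidence format:

```python
import hashlib
from collections import defaultdict

def find_conflicts(chunks: list[dict]) -> list[tuple[str, str]]:
    """Flag (file, symbol) pairs that appear with differing content."""
    by_symbol = defaultdict(set)
    for c in chunks:
        digest = hashlib.md5(c["text"].encode()).hexdigest()
        by_symbol[(c["file_path"], c["symbol_name"])].add(digest)
    # Any symbol with more than one distinct content hash is a potential conflict.
    return [key for key, hashes in by_symbol.items() if len(hashes) > 1]

chunks = [
    {"file_path": "a.py", "symbol_name": "f", "text": "def f(): return 1"},
    {"file_path": "a.py", "symbol_name": "f", "text": "def f(): return 2"},
    {"file_path": "b.py", "symbol_name": "g", "text": "def g(): pass"},
]
print(find_conflicts(chunks))  # → [('a.py', 'f')]
```

Resolving the conflict (prefer the chunk from the current commit, or drop both and re-retrieve) is a policy decision; detecting it is cheap.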
Context pack: a structured bundle of evidence assembled for a specific task. A context pack includes the selected code chunks, their provenance (where they came from and why), a token budget, and metadata the model can use to assess evidence quality. Think of it as a dossier prepared for the model, not a pile of search results.
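Concretely, a minimal context pack can be a plain dictionary. The field names below mirror the ContextPack dataclass built in the walkthrough; the values are illustrative:

```python
pack = {
    "question": "What does validate_path return on traversal attempts?",
    "token_budget": 4000,
    "total_tokens": 312,
    "chunks": [
        {
            "chunk_id": "ast-00012",
            "file_path": "agent/tools.py",
            "symbol_name": "validate_path",
            "text": "def validate_path(path_str: str) -> Path: ...",
        },
    ],
    # One provenance record per chunk: where it came from and why it's here.
    "provenance": [
        {
            "chunk_id": "ast-00012",
            "retrieval_method": "hybrid",
            "retrieval_score": 0.0489,
            "token_count": 312,
        },
    ],
    "warnings": [],
}
```

Everything the model sees is traceable through provenance, which is what makes citations and post-hoc debugging possible downstream.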
Token budget: a deliberate limit on how many tokens of context you provide, independent of the model's maximum context window. A 128k-token context window doesn't mean you should use 128k tokens. In my experience, answer quality peaks well before the window is full, and it degrades as noise accumulates. A token budget forces you to be selective.
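The selectivity a budget forces can be sketched as a greedy fill: take chunks in score order and keep each one only if it still fits. Token counts here are illustrative; the walkthrough uses tiktoken for real counts:

```python
def fit_to_budget(chunks: list[dict], budget: int) -> list[dict]:
    """Greedily keep the highest-scoring chunks that fit within the budget."""
    selected, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        if used + chunk["tokens"] <= budget:
            selected.append(chunk)
            used += chunk["tokens"]
    return selected

chunks = [
    {"id": "a", "score": 0.9, "tokens": 1200},
    {"id": "b", "score": 0.7, "tokens": 2600},
    {"id": "c", "score": 0.4, "tokens": 900},
]
kept = fit_to_budget(chunks, budget=3000)
print([c["id"] for c in kept])  # → ['a', 'c']
```

This is one of several reasonable policies: unlike the budgeter built later in this lesson, which drops chunks from the low-scoring tail until the pack fits, a greedy fill can skip one oversized mid-ranked chunk and still keep smaller ones ranked below it.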
Provenance: tracking where each piece of context came from: which file, which retrieval method, what score, why it was included. Provenance lets the model (and you) assess evidence quality and enables citations in the answer.
Problem-to-Tool Map
| Problem class | Symptom | Cheapest thing to try first | Tool or approach |
|---|---|---|---|
| Context too large | Model ignores relevant evidence buried in a wall of text | Lower top-k | Token budgeter with priority ranking |
| Duplicated evidence | Same code appears in multiple chunks (overlapping retrieval) | Deduplicate on chunk_id | Content-level deduplication |
| Irrelevant chunks | Retrieval returns chunks that scored well but aren't useful for this specific question | Increase relevance threshold | Planner that assesses chunk relevance to the question |
| Missing targeted evidence | The right file was retrieved but the relevant function is 200 lines long and only 5 lines matter | Retrieve the whole function | Slicer that extracts the relevant subsection |
| No provenance | Can't trace the model's answer back to specific code | Manual inspection | Context pack with provenance metadata |
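The "deduplicate on chunk_id" fix in the table misses a common case: the same code retrieved through different paths (or re-chunked between index versions) carries different IDs. Content-level deduplication keys on a hash of normalized text instead. A sketch, where collapsing whitespace is an assumed normalization rule you would tune per codebase:

```python
import hashlib
import re

def content_key(text: str) -> str:
    """Hash code with whitespace collapsed, so cosmetic differences dedupe."""
    normalized = re.sub(r"\s+", " ", text).strip()
    return hashlib.md5(normalized.encode()).hexdigest()

a = "def f(x):\n    return x + 1"
b = "def f(x):  return x + 1"   # same code, different layout
print(content_key(a) == content_key(b))  # → True
```

The compiler below hashes raw text, which catches exact duplicates; normalizing first also catches near-duplicates that differ only in formatting.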
Walkthrough
Architecture of the context compiler
The context compiler has five stages. Each stage is a separate function, which means you can improve or replace any stage independently.
Build the context compiler
# retrieval/context_compiler.py
"""Context compiler: workspace, planner, slicer, pack builder, token budgeter."""
import json
import hashlib
import re
from dataclasses import dataclass, field, asdict
from pathlib import Path
# We'll use tiktoken for accurate token counting.
# Install: pip install tiktoken
import tiktoken
ENCODING = tiktoken.encoding_for_model("gpt-4o-mini")
def count_tokens(text: str) -> int:
"""Count tokens for a text string.
Args:
text: Text whose token usage should be measured.
Returns:
int: Token count for the configured model encoding.
"""
return len(ENCODING.encode(text))
# ---------------------------------------------------------------------------
# Data structures
# ---------------------------------------------------------------------------
@dataclass
class EvidenceChunk:
"""A single piece of evidence with provenance."""
chunk_id: str
file_path: str
symbol_name: str
text: str
start_line: int | None = None
end_line: int | None = None
retrieval_method: str = ""
retrieval_score: float = 0.0
token_count: int = 0
content_hash: str = ""
def __post_init__(self):
self.token_count = count_tokens(self.text)
self.content_hash = hashlib.md5(self.text.encode()).hexdigest()[:12]
@dataclass
class ContextPack:
"""A compiled context pack ready for the model."""
question: str
chunks: list[EvidenceChunk] = field(default_factory=list)
total_tokens: int = 0
token_budget: int = 0
provenance: list[dict] = field(default_factory=list)
warnings: list[str] = field(default_factory=list)
def to_prompt_context(self) -> str:
"""Format the pack for inclusion in a prompt.
Returns:
str: Rendered evidence sections with provenance headers.
"""
sections = []
for i, chunk in enumerate(self.chunks):
header = f"[Evidence {i+1}] {chunk.file_path}"
if chunk.symbol_name and chunk.symbol_name != "__module__":
header += f" :: {chunk.symbol_name}"
if chunk.start_line:
header += f" (lines {chunk.start_line}-{chunk.end_line})"
header += f" [{chunk.retrieval_method}, score: {chunk.retrieval_score:.4f}]"
sections.append(f"{header}\n{chunk.text}")
return "\n\n---\n\n".join(sections)
def to_dict(self) -> dict:
"""Serialize the pack into a JSON-friendly dictionary.
Returns:
dict: Structured context-pack payload for logging or grading.
"""
return {
"question": self.question,
"chunks": [asdict(c) for c in self.chunks],
"total_tokens": self.total_tokens,
"token_budget": self.token_budget,
"warnings": self.warnings,
"provenance": self.provenance,
}
# ---------------------------------------------------------------------------
# Stage 1: Planner
# ---------------------------------------------------------------------------
def plan_retrieval(question: str) -> dict:
"""Analyze the question and decide what retrieval strategies to use.
In a production system, this could be an LLM call that classifies the
question type. For now, we'll use heuristics.
Args:
question: User question that will drive retrieval.
Returns:
dict: Retrieval plan containing selected strategies and extracted hints.
"""
plan = {
"question": question,
"strategies": ["vector"], # always include vector
"identifier_hints": [],
"file_hints": [],
}
# Detect identifiers (CamelCase, snake_case)
identifiers = re.findall(
r'\b[A-Z][a-z]+(?:[A-Z][a-z]+)+\b|\b[a-z_]+(?:_[a-z]+)+\b',
question,
)
if identifiers:
plan["strategies"].append("lexical")
plan["identifier_hints"] = identifiers
    # Detect relationship keywords. Match on word tokens rather than raw
    # substrings so both "call" and "calls" trigger the graph strategy.
    relationship_words = {
        "call", "calls", "import", "imports", "depend", "depends",
        "affect", "affects", "break", "breaks", "change", "changes",
        "use", "uses",
    }
    question_words = set(re.findall(r'\w+', question.lower()))
    if relationship_words & question_words:
        plan["strategies"].append("graph")
# Detect file path mentions
file_mentions = re.findall(r'[\w/]+\.py\b', question)
if file_mentions:
plan["file_hints"] = file_mentions
return plan
# ---------------------------------------------------------------------------
# Stage 2: Workspace (collects raw evidence)
# ---------------------------------------------------------------------------
def collect_evidence(plan: dict, hybrid_retrieve_fn, graph_traverse_fn=None) -> list[EvidenceChunk]:
"""Collect raw evidence using the strategies the planner selected.
In a production system, each strategy would have its own retrieval
path. Here we use hybrid retrieval for vector+lexical and optionally
add graph traversal if the planner flagged relationship keywords.
Args:
plan: Planner output describing which retrieval legs to use.
hybrid_retrieve_fn: Callable that returns the base hybrid retrieval results.
graph_traverse_fn: Optional callable for graph-specific evidence expansion.
Returns:
list[EvidenceChunk]: Raw evidence objects ready for slicing and packing.
"""
raw_results = hybrid_retrieve_fn(plan["question"])
# If the planner detected relationship keywords, add graph evidence
if "graph" in plan.get("strategies", []) and graph_traverse_fn:
for hint in plan.get("identifier_hints", []):
graph_results = graph_traverse_fn(hint)
raw_results.extend(graph_results)
chunks = []
for result in raw_results:
chunks.append(EvidenceChunk(
chunk_id=result.get("chunk_id", "unknown"),
file_path=result.get("file_path", "unknown"),
symbol_name=result.get("symbol_name", "unknown"),
text=result.get("text", ""),
start_line=result.get("start_line"),
end_line=result.get("end_line"),
retrieval_method=result.get("retrieval_method", "hybrid"),
retrieval_score=result.get("rrf_score", 0.0),
))
return chunks
# ---------------------------------------------------------------------------
# Stage 3: Slicer
# ---------------------------------------------------------------------------
def slice_evidence(chunks: list[EvidenceChunk], question: str) -> list[EvidenceChunk]:
"""Slice oversized chunks to their relevant portion.
For now, we use a simple heuristic: if a chunk exceeds 1500 tokens,
try to find the most relevant section. In a production system, you'd
use an LLM to identify the relevant lines.
Args:
chunks: Raw evidence chunks gathered during retrieval.
question: User question used to score relevant lines.
Returns:
list[EvidenceChunk]: Chunks with oversized entries trimmed to denser regions.
"""
MAX_CHUNK_TOKENS = 1500
sliced = []
for chunk in chunks:
if chunk.token_count <= MAX_CHUNK_TOKENS:
sliced.append(chunk)
continue
# Simple heuristic: extract lines containing query keywords
keywords = set(re.findall(r'\w+', question.lower()))
lines = chunk.text.split("\n")
scored_lines = []
for i, line in enumerate(lines):
line_words = set(re.findall(r'\w+', line.lower()))
score = len(keywords & line_words)
scored_lines.append((i, score))
# Find the densest region
best_start = 0
best_score = 0
window = min(40, len(lines)) # ~40 lines of context
for start in range(len(lines) - window + 1):
window_score = sum(s for _, s in scored_lines[start:start + window])
if window_score > best_score:
best_score = window_score
best_start = start
sliced_text = "\n".join(lines[best_start:best_start + window])
sliced_chunk = EvidenceChunk(
chunk_id=chunk.chunk_id + "-sliced",
file_path=chunk.file_path,
symbol_name=chunk.symbol_name,
text=sliced_text,
start_line=(chunk.start_line or 1) + best_start,
            end_line=(chunk.start_line or 1) + best_start + window - 1,
retrieval_method=chunk.retrieval_method,
retrieval_score=chunk.retrieval_score,
)
sliced.append(sliced_chunk)
return sliced
# ---------------------------------------------------------------------------
# Stage 4: Context Pack Builder
# ---------------------------------------------------------------------------
def build_context_pack(
question: str,
chunks: list[EvidenceChunk],
) -> ContextPack:
"""Deduplicate, order, and annotate evidence for model consumption.
Args:
question: User question the pack is being assembled for.
chunks: Candidate evidence chunks after slicing.
Returns:
ContextPack: Ordered pack with provenance and warnings attached.
"""
pack = ContextPack(question=question)
# Deduplicate by content hash
seen_hashes = set()
unique_chunks = []
for chunk in chunks:
if chunk.content_hash not in seen_hashes:
seen_hashes.add(chunk.content_hash)
unique_chunks.append(chunk)
else:
pack.warnings.append(f"Deduplicated: {chunk.chunk_id} (same content as existing chunk)")
# Order by retrieval score (highest first)
unique_chunks.sort(key=lambda c: c.retrieval_score, reverse=True)
pack.chunks = unique_chunks
pack.total_tokens = sum(c.token_count for c in unique_chunks)
# Build provenance
pack.provenance = [
{
"chunk_id": c.chunk_id,
"file_path": c.file_path,
"symbol_name": c.symbol_name,
"retrieval_method": c.retrieval_method,
"retrieval_score": c.retrieval_score,
"token_count": c.token_count,
}
for c in unique_chunks
]
return pack
# ---------------------------------------------------------------------------
# Stage 5: Token Budgeter
# ---------------------------------------------------------------------------
def apply_token_budget(pack: ContextPack, budget: int = 4000) -> ContextPack:
"""Trim the pack to fit within the token budget.
Removes the lowest-scoring chunks until the pack fits. If a single
chunk exceeds the budget, it will be sliced further.
Args:
pack: Context pack to trim in place.
budget: Maximum token count allowed for the final pack.
Returns:
ContextPack: The same pack after trimming, truncation, and provenance refresh.
"""
pack.token_budget = budget
if pack.total_tokens <= budget:
return pack
# Remove lowest-scoring chunks until we fit
while pack.total_tokens > budget and len(pack.chunks) > 1:
removed = pack.chunks.pop()
pack.total_tokens -= removed.token_count
pack.warnings.append(
f"Budget trim: removed {removed.chunk_id} ({removed.token_count} tokens, "
f"score {removed.retrieval_score:.4f})"
)
# If the single remaining chunk still exceeds budget, truncate it
if pack.total_tokens > budget and pack.chunks:
chunk = pack.chunks[0]
        target_chars = int(budget * 3.5)  # ~3.5 characters per token, a rough heuristic
chunk.text = chunk.text[:target_chars]
chunk.token_count = count_tokens(chunk.text)
pack.total_tokens = chunk.token_count
pack.warnings.append(f"Truncated {chunk.chunk_id} to fit budget")
# Recalculate provenance
pack.provenance = [
{
"chunk_id": c.chunk_id,
"file_path": c.file_path,
"symbol_name": c.symbol_name,
"retrieval_method": c.retrieval_method,
"retrieval_score": c.retrieval_score,
"token_count": c.token_count,
}
for c in pack.chunks
]
return pack
# ---------------------------------------------------------------------------
# Full pipeline
# ---------------------------------------------------------------------------
def compile_context(
question: str,
hybrid_retrieve_fn,
graph_traverse_fn=None,
token_budget: int = 4000,
) -> ContextPack:
"""Run the full context compilation pipeline.
In this version, the planner detects strategies but the workspace
stage uses hybrid retrieval for most of them. Passing graph_traverse_fn
enables the graph branch for relationship questions. Extending this to
route each strategy independently is a natural next step.
Args:
question: User question to compile evidence for.
hybrid_retrieve_fn: Callable that performs the base hybrid retrieval step.
graph_traverse_fn: Optional callable that adds graph-only evidence.
token_budget: Maximum token budget for the final context pack.
Returns:
ContextPack: Final compiled pack after planning, slicing, and budgeting.
"""
# Stage 1: Plan
plan = plan_retrieval(question)
# Stage 2: Collect (uses plan to optionally add graph evidence)
raw_chunks = collect_evidence(plan, hybrid_retrieve_fn, graph_traverse_fn)
# Stage 3: Slice
sliced_chunks = slice_evidence(raw_chunks, question)
# Stage 4: Build pack
pack = build_context_pack(question, sliced_chunks)
# Stage 5: Budget
pack = apply_token_budget(pack, budget=token_budget)
return pack
if __name__ == "__main__":
import sys
sys.path.insert(0, ".")
from retrieval.hybrid_retrieve import hybrid_retrieve
question = sys.argv[1] if len(sys.argv) > 1 else "What functions call validate_path and how would changing it affect the codebase?"
print(f"Question: {question}\n")
pack = compile_context(question, hybrid_retrieve, token_budget=4000)
print(f"Context pack: {len(pack.chunks)} chunks, {pack.total_tokens} tokens (budget: {pack.token_budget})")
print(f"\nProvenance:")
for p in pack.provenance:
print(f" {p['chunk_id']}: {p['file_path']} :: {p['symbol_name']} ({p['token_count']} tokens)")
if pack.warnings:
print(f"\nWarnings:")
for w in pack.warnings:
print(f" {w}")
print(f"\nFormatted context preview (first 500 chars):")
    print(pack.to_prompt_context()[:500])
pip install tiktoken
python retrieval/context_compiler.py "What functions call validate_path and how would changing it affect the codebase?"
Expected output:
Question: What functions call validate_path and how would changing it affect the codebase?
Context pack: 4 chunks, 1823 tokens (budget: 4000)
Provenance:
ast-00012: agent/tools.py :: validate_path (312 tokens)
ast-00015: agent/tools.py :: read_file (287 tokens)
ast-00034: retrieval/query_metadata.py :: find_symbol (241 tokens)
ast-00008: agent/loop.py :: run_agent (403 tokens)
Formatted context preview (first 500 chars):
[Evidence 1] agent/tools.py :: validate_path (lines 108-117) [hybrid, score: 0.0489]
def validate_path(path_str: str) -> Path:
...
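Before trusting the slicer on real chunks, its densest-window heuristic is easy to exercise in isolation. A self-contained miniature with the same keyword-overlap scoring as slice_evidence, shrunk to a two-line window:

```python
import re

def densest_window(text: str, question: str, window: int = 2) -> str:
    """Return the run of lines with the most question-keyword overlap."""
    keywords = set(re.findall(r"\w+", question.lower()))
    lines = text.split("\n")
    window = min(window, len(lines))
    # Score each line by how many question keywords it contains.
    scores = [len(keywords & set(re.findall(r"\w+", l.lower()))) for l in lines]
    best = max(range(len(lines) - window + 1),
               key=lambda s: sum(scores[s:s + window]))
    return "\n".join(lines[best:best + window])

code = "import os\n\ndef validate_path(p):\n    return os.path.realpath(p)\n"
print(densest_window(code, "where does validate_path use realpath?"))
# prints the two validate_path lines, not the import
```

The heuristic is crude (a line mentioning a keyword in a comment scores the same as one defining it), which is why the walkthrough suggests an LLM-based slicer for production.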
Detect context rot
Context rot happens when your pipeline accumulates evidence without managing its quality. Here's a detector you can run after any retrieval:
# retrieval/detect_context_rot.py
"""Detect context rot patterns in a context pack."""
from retrieval.context_compiler import ContextPack
def detect_rot(pack: ContextPack) -> list[str]:
"""Check a context pack for common context-rot patterns.
Args:
pack: Compiled context pack to inspect.
Returns:
list[str]: Human-readable issue labels describing any detected problems.
"""
issues = []
# 1. Oversized context
if pack.total_tokens > pack.token_budget * 0.9:
issues.append(
f"OVERSIZED: Pack uses {pack.total_tokens}/{pack.token_budget} tokens "
f"({pack.total_tokens / pack.token_budget * 100:.0f}% of budget). "
"Consider tightening the slicer or lowering top-k."
)
# 2. Low-score chunks taking budget
if pack.chunks:
lowest = min(pack.chunks, key=lambda c: c.retrieval_score)
highest = max(pack.chunks, key=lambda c: c.retrieval_score)
if highest.retrieval_score > 0 and lowest.retrieval_score < highest.retrieval_score * 0.3:
issues.append(
f"NOISE: Chunk {lowest.chunk_id} has score {lowest.retrieval_score:.4f} "
f"vs. best {highest.retrieval_score:.4f}. This chunk may be noise."
)
# 3. Duplicate files
files = [c.file_path for c in pack.chunks]
file_counts = {}
for f in files:
file_counts[f] = file_counts.get(f, 0) + 1
for f, count in file_counts.items():
if count > 2:
issues.append(
f"DUPLICATION: {count} chunks from {f}. Consider merging or selecting "
"the most relevant section."
)
# 4. Large single chunk dominating budget
for chunk in pack.chunks:
if pack.total_tokens > 0 and chunk.token_count > pack.total_tokens * 0.6:
issues.append(
f"DOMINATION: Chunk {chunk.chunk_id} uses {chunk.token_count} tokens "
f"({chunk.token_count / pack.total_tokens * 100:.0f}% of total). "
"Consider slicing to the relevant section."
)
if not issues:
issues.append("No context rot detected.")
    return issues
Run the context-compiled benchmark and compare all tiers
# retrieval/run_compiled_benchmark.py
"""Run benchmark through context-compiled retrieval. Full Tier 4."""
import json
import os
import sys
from datetime import datetime, timezone
from pathlib import Path
sys.path.insert(0, ".")
from openai import OpenAI
from retrieval.context_compiler import compile_context
from retrieval.hybrid_retrieve import hybrid_retrieve
from retrieval.detect_context_rot import detect_rot
RUN_ID = "compiled-v1-" + datetime.now(timezone.utc).strftime("%Y-%m-%d-%H%M%S")
MODEL = "gpt-4o-mini"
PROVIDER = "openai"
BENCHMARK_FILE = Path("benchmark-questions.jsonl")
OUTPUT_FILE = Path(f"harness/runs/{RUN_ID}.jsonl")
REPO_SHA = os.popen("git rev-parse --short HEAD").read().strip()
TOKEN_BUDGET = 4000
client = OpenAI()
SYSTEM_PROMPT = (
"You are a code assistant. Answer the question using ONLY the "
"retrieved evidence below. Each piece of evidence includes its "
"source file and retrieval method. If the evidence is insufficient, "
"say so and explain what's missing."
)
def answer_with_compiled_context(question: str) -> dict:
"""Answer a benchmark question using the compiled-context pipeline.
Args:
question: Benchmark question to answer from retrieved evidence.
Returns:
dict: Final answer text plus the compiled pack and any rot warnings.
"""
pack = compile_context(question, hybrid_retrieve, token_budget=TOKEN_BUDGET)
rot_issues = detect_rot(pack)
context = pack.to_prompt_context()
response = client.chat.completions.create(
model=MODEL,
messages=[
{"role": "system", "content": f"{SYSTEM_PROMPT}\n\nEvidence:\n{context}"},
{"role": "user", "content": question},
],
temperature=0,
)
return {
"answer": response.choices[0].message.content,
"pack": pack.to_dict(),
"context_rot": rot_issues,
}
def run_benchmark():
"""Run the benchmark set through the compiled-context pipeline.
Returns:
None
"""
questions = []
with open(BENCHMARK_FILE) as f:
for line in f:
if line.strip():
questions.append(json.loads(line))
print(f"Running {len(questions)} questions through context-compiled retrieval")
print(f"Token budget: {TOKEN_BUDGET}")
print(f"Run ID: {RUN_ID}\n")
results = []
for i, q in enumerate(questions):
print(f"[{i+1}/{len(questions)}] {q['category']}: {q['question'][:60]}...")
result = answer_with_compiled_context(q["question"])
pack = result["pack"]
print(f" Context: {pack['total_tokens']} tokens, {len(pack['chunks'])} chunks")
if result["context_rot"] and "No context rot" not in result["context_rot"][0]:
for issue in result["context_rot"]:
print(f" Rot: {issue}")
entry = {
"run_id": RUN_ID,
"question_id": q["id"],
"question": q["question"],
"category": q["category"],
"answer": result["answer"],
"model": MODEL,
"provider": PROVIDER,
"evidence_files": [p["file_path"] for p in pack["provenance"]],
"context_tokens": pack["total_tokens"],
"token_budget": pack["token_budget"],
"context_rot_issues": result["context_rot"],
"retrieval_method": "compiled_hybrid",
"grade": None,
"failure_label": None,
"grading_notes": "",
"repo_sha": REPO_SHA,
"timestamp": datetime.now(timezone.utc).isoformat(),
"harness_version": "v0.2",
}
results.append(entry)
os.makedirs("harness/runs", exist_ok=True)
with open(OUTPUT_FILE, "w") as f:
for entry in results:
f.write(json.dumps(entry) + "\n")
print(f"\nDone. {len(results)} results saved to {OUTPUT_FILE}")
print("Grade and compare across all four tiers.")
if __name__ == "__main__":
    run_benchmark()
python -m retrieval.run_compiled_benchmark
The full progression
After grading all four benchmark runs, you'll have a progression that looks something like this (your specific numbers will vary):
| Tier | Retrieval method | Typical accuracy range | Context tokens per question |
|---|---|---|---|
| Tier 1 | Naive vector | 30-45% | ~4,000 (mostly noise) |
| Tier 2 | AST-aware vector | 45-60% | ~3,500 (better chunks) |
| Tier 3 | Hybrid (vector + lexical + graph) | 55-70% | ~4,500 (more sources) |
| Tier 4 | Compiled hybrid | 60-75% | ~2,500 (focused, deduplicated) |
Notice the compiled tier: accuracy goes up while context tokens go down. That's the core insight of context compilation. Less noise means better signal. Bigger context windows don't fix noisy context; careful context engineering does.
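One way to see that trade-off as a single number is accuracy points per thousand context tokens. A quick sketch using the midpoints of the ranges in the table above (illustrative arithmetic on illustrative numbers, not a standard metric):

```python
# (accuracy midpoint %, typical context tokens) per tier, from the table above
tiers = {
    "naive":    (37.5, 4000),
    "ast":      (52.5, 3500),
    "hybrid":   (62.5, 4500),
    "compiled": (67.5, 2500),
}
for name, (acc, tokens) in tiers.items():
    print(f"{name:9s} {acc / (tokens / 1000):5.1f} accuracy pts per 1k context tokens")
```

The compiled tier is the only one that improves the numerator and the denominator at once, which is why its efficiency nearly doubles the next-best tier's.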
See the Context-Pack Contract reference page for the full schema, provenance rules, and anti-patterns.
Exercises
- Build the context compiler (context_compiler.py). Run it on three questions and inspect the output: check provenance, warnings, and token counts.
- Run the rot detector (detect_context_rot.py) on your context packs. Fix any issues it flags by adjusting the planner, slicer, or budgeter.
- Run the context-compiled benchmark (run_compiled_benchmark.py). Grade at least 15 answers.
- Build a comparison table across all four retrieval approaches: naive, AST-aware, hybrid, and compiled. For each benchmark question category, note which approach first reached an acceptable accuracy. Which categories needed all four? Which were solved by AST-aware retrieval alone?
- Experiment with the token budget. Run the same benchmark at 2000, 4000, and 8000 tokens. Does accuracy improve with more tokens, plateau, or degrade? Find the sweet spot for your questions.
Completion checkpoint
You have:
- A working context compiler with all five stages: planner, workspace, slicer, pack builder, token budgeter
- Context packs with provenance metadata for every piece of evidence
- A context rot detector that flags common quality issues
- A full naive-through-compiled progression showing measurable improvement
- Evidence that focused context (fewer tokens, better selection) outperforms raw retrieval (more tokens, no curation)
Reflection prompts
- How much did context compilation improve accuracy compared to hybrid retrieval? Was the improvement from better evidence selection, deduplication, or token budgeting?
- What's the relationship between context size and answer quality in your benchmark? Is there a point where more context makes answers worse?
- Which context rot pattern appeared most often in your results? What does that tell you about your retrieval pipeline's weaknesses?
- Looking back across all four tiers, which single upgrade produced the largest accuracy improvement? Was it the one you expected?
Connecting to the project
This context compiler is now a core component of your anchor project. Every question your assistant handles will pass through this pipeline: retrieve from multiple substrates, compile a focused context pack, and present it to the model with provenance. In Module 5, we'll build the full RAG pipeline on top of this, adding response generation, citation, grounding verification, and evaluation. The context compiler you built here will be the engine underneath all of it.
The progression you've documented (from naive retrieval to compiled context) is also a demonstration of the iterative methodology this curriculum teaches. You didn't build the "right" retrieval system on day one. You built the simplest version, observed its failures, and upgraded each layer based on evidence. That pattern will serve you well beyond this curriculum.
What's next
RAG Pipeline. You can assemble focused evidence now; the next lesson turns that evidence into grounded answers with citations and abstention.
References
Start here
- Context-Pack Contract — the full schema, validation rules, and anti-patterns for context packs
Build with this
- tiktoken on PyPI — the tokenizer library we use for accurate token counting
- OpenAI: Managing tokens — practical guide to token counting and context window management
Deep dive
- Anthropic: Long context tips — strategies for working with large context windows effectively
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al., 2020) — the original RAG paper; useful for understanding the pattern's foundations
- LlamaIndex: Response synthesis — how LlamaIndex approaches context assembly and response generation