Module 5: RAG and Grounded Answers

Evidence Bundles and Context Packs

In the previous lesson, we built a RAG pipeline that generates grounded answers with citations. The context compiler from Module 4 produced context packs, and the answer generator consumed them. But we treated the handoff between those two stages informally. The context pack was whatever the compiler happened to produce, and the generator just formatted it into a prompt.

The informality here creates problems as the system grows. When retrieval and generation are separate services (as they will be in production), the contract between them needs to be explicit. What fields does an evidence bundle contain? What metadata does the generator need to produce good citations? What information helps the model assess evidence quality? And, critically: how do you measure whether structured evidence bundles actually improve answer quality compared to raw retrieval?

In this lesson we formalize the evidence bundle schema, turn context packs into a deliberate contract, and build the benchmark comparison that proves they're worth the effort.

What you'll learn

  • Define a complete evidence bundle schema with snippets, scores, metadata, file paths, line ranges, and selection rationale
  • Formalize context packs as the structured output of the context compiler, the contract between retrieval and generation
  • Measure the impact of structured context packs on answer quality by benchmarking against raw retrieval
  • Identify cost optimization opportunities through smaller, more focused context packs
  • Understand how evidence bundle design affects downstream eval quality

Concepts

Evidence bundle: a structured collection of retrieved evidence for a specific question. Unlike a raw list of search results, an evidence bundle includes metadata that helps the model and the eval system assess evidence quality: relevance scores, retrieval method, file paths, line ranges, and a selection rationale explaining why each piece of evidence was included. The evidence bundle is the atomic unit of the retrieval-to-generation handoff.

Context pack: the assembled, token-budgeted output of the context compiler, ready for inclusion in a prompt. A context pack is built from one or more evidence bundles after deduplication, slicing, ordering, and budgeting. We introduced context packs in Module 4's Context Compilation lesson; here we'll formalize their schema and make the contract explicit.
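
To make the dedup-and-budget step concrete, here's a minimal sketch using hypothetical dict-shaped snippets (your compiler's real types differ):

```python
import hashlib

def budget_snippets(snippets: list[dict], token_budget: int) -> list[dict]:
    """Greedily pack deduplicated snippets under a token budget.

    Assumes snippets are dicts with "text" and "token_count" keys and arrive
    sorted by relevance, best first (hypothetical shapes for illustration).
    """
    seen_hashes: set[str] = set()
    packed, used = [], 0
    for s in snippets:
        h = hashlib.sha256(s["text"].encode()).hexdigest()
        if h in seen_hashes:
            continue  # content-level dedup, not just chunk-ID dedup
        if used + s["token_count"] > token_budget:
            continue  # skip snippets that would bust the budget
        seen_hashes.add(h)
        packed.append(s)
        used += s["token_count"]
    return packed
```

The real compiler also slices and reorders; this shows only why dedup must happen on content hashes before budgeting.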

Selection rationale: a short explanation of why a particular evidence chunk was included in the bundle. This serves two purposes: it helps the model weigh evidence appropriately ("this chunk was included because it contains the function definition" vs. "this chunk was included because it's a caller of the target function"), and it helps you debug retrieval quality when answers go wrong.

Evidence coverage: a measure of how well the evidence bundle addresses the question. A question about "what calls validate_path" needs evidence showing callers; if the bundle only contains the function definition, coverage is incomplete regardless of how high the relevance scores are. Coverage is harder to measure than relevance, but it's what actually determines answer quality.
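
One way to make coverage checkable is to map question types to the evidence roles they require and report what's missing. The mapping below is an illustrative assumption, not a rule from the pipeline:

```python
# Illustrative mapping from question type to required evidence roles
# (an assumption for this sketch, not part of the lesson's pipeline).
REQUIRED_ROLES = {
    "code_lookup": {"definition"},
    "relationship": {"definition", "caller"},
    "explanation": {"definition", "related"},
}

def coverage_gaps(question_type: str, roles_present: set[str]) -> set[str]:
    """Return the evidence roles the bundle is missing for this question type."""
    return REQUIRED_ROLES.get(question_type, set()) - roles_present
```

A "what calls X" question with only a definition in the bundle would report `{"caller"}` as a gap, regardless of relevance scores.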

Walkthrough

The evidence bundle schema

This is our formalized schema. Each field exists for a reason, and we'll walk through them after the definition.

# rag/evidence_bundle.py
"""Evidence bundle schema: the contract between retrieval and generation."""
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass
class EvidenceSnippet:
    """A single piece of retrieved evidence with full metadata."""

    # Identity
    chunk_id: str                        # Unique ID from the retrieval system
    content_hash: str                    # Hash of the text content for dedup

    # Content
    text: str                            # The actual evidence text
    token_count: int                     # Accurate token count (via tiktoken)

    # Source location
    file_path: str                       # Relative path in the repository
    start_line: int | None = None        # First line of the snippet
    end_line: int | None = None          # Last line of the snippet
    symbol_name: str | None = None       # Function/class name if applicable

    # Retrieval metadata
    retrieval_method: str = ""           # "vector", "lexical", "graph", "hybrid"
    relevance_score: float = 0.0         # Score from the retrieval system
    retrieval_rank: int = 0              # Position in the original result list

    # Selection rationale
    selection_reason: str = ""           # Why this snippet was included
    evidence_role: str = ""              # "definition", "caller", "related", "context"


@dataclass
class EvidenceBundle:
    """A complete evidence bundle for a single question."""

    # The question this evidence is for
    question: str
    question_type: str = ""              # "code_lookup", "explanation", "relationship"

    # Evidence
    snippets: list[EvidenceSnippet] = field(default_factory=list)

    # Bundle metadata
    total_tokens: int = 0
    token_budget: int = 0
    retrieval_strategies_used: list[str] = field(default_factory=list)
    coverage_notes: str = ""             # Assessment of evidence completeness

    # Provenance
    created_at: str = ""
    compiler_version: str = "v1"

    # Warnings
    warnings: list[str] = field(default_factory=list)

    def __post_init__(self):
        if not self.created_at:
            self.created_at = datetime.now(timezone.utc).isoformat()

    def to_dict(self) -> dict:
        return asdict(self)

    def to_json(self, indent: int = 2) -> str:
        return json.dumps(self.to_dict(), indent=indent)

    def to_prompt_context(self) -> str:
        """Format the bundle for inclusion in a model prompt."""
        sections = []
        for i, snippet in enumerate(self.snippets):
            header = f"[Evidence {i+1}]"
            header += f" {snippet.file_path}"
            if snippet.symbol_name:
                header += f" :: {snippet.symbol_name}"
            if snippet.start_line is not None:
                header += f" (lines {snippet.start_line}-{snippet.end_line})"
            header += f" [{snippet.retrieval_method}, score: {snippet.relevance_score:.4f}]"
            if snippet.selection_reason:
                header += f"\nReason included: {snippet.selection_reason}"
            sections.append(f"{header}\n{snippet.text}")
        return "\n\n---\n\n".join(sections)

Why each field matters

selection_reason and evidence_role: these are the fields most people skip when building evidence bundles, and they're the ones that improve answer quality the most. When the model sees "Reason included: this function is a caller of validate_path," it understands how to use the evidence, not just what the evidence contains.

question_type: different question types need different evidence shapes. A "code_lookup" question needs the function definition and maybe its callers. An "explanation" question needs broader context. A "relationship" question needs graph evidence that shows connections. Tagging the question type helps the retrieval router (next lesson) and helps evals categorize failures.

coverage_notes: this is an honest assessment of what the evidence does and doesn't cover. "Contains the function definition and two callers; no tests were found" is more useful than an empty field. The model can use this to calibrate its confidence, and you can use it to debug retrieval gaps.

compiler_version: when you're iterating on the context compiler, you'll want to know which version produced each bundle. This is essential for reproducible benchmarks.
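
As a concrete example of why the provenance fields pay off: once each benchmark record carries the bundle's compiler version, you can slice results by version when comparing runs. A minimal sketch, assuming each JSONL record includes a `compiler_version` field:

```python
import json
from collections import defaultdict

def group_runs_by_compiler(jsonl_lines: list[str]) -> dict[str, int]:
    """Count benchmark records per compiler version.

    Sketch; assumes each record is a JSON object carrying the bundle's
    "compiler_version" field.
    """
    counts: dict[str, int] = defaultdict(int)
    for line in jsonl_lines:
        record = json.loads(line)
        counts[record.get("compiler_version", "unknown")] += 1
    return dict(counts)
```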

Convert context packs to evidence bundles

Your context compiler from Module 4 produces ContextPack objects. Here's an adapter that converts them into the formalized evidence bundle schema:

# rag/pack_to_bundle.py
"""Convert Module 4 context packs into formalized evidence bundles."""
import sys
sys.path.insert(0, ".")

from retrieval.context_compiler import ContextPack, EvidenceChunk
from rag.evidence_bundle import EvidenceBundle, EvidenceSnippet


def classify_question(question: str) -> str:
    """Classify a question into a coarse evidence-shaping category.

    Args:
        question: User question to classify.

    Returns:
        A question type label used to shape evidence selection.
    """
    q_lower = question.lower()
    if any(w in q_lower for w in ["what does", "how does", "explain", "describe"]):
        return "explanation"
    if any(w in q_lower for w in ["calls", "imports", "depends", "affects", "uses"]):
        return "relationship"
    if any(w in q_lower for w in ["where is", "find", "show me", "what is"]):
        return "code_lookup"
    return "general"


def infer_evidence_role(chunk: EvidenceChunk, question: str) -> str:
    """Infer the role one chunk plays in answering the question.

    Args:
        chunk: Retrieved evidence chunk under consideration.
        question: User question that the chunk may help answer.

    Returns:
        An evidence-role label such as ``definition``, ``caller``, or ``related``.
    """
    q_lower = question.lower()
    symbol = (chunk.symbol_name or "").lower()

    # If the question mentions this symbol directly, it's likely the definition
    if symbol and symbol in q_lower:
        return "definition"

    # If the question asks about callers/usage and this is a different symbol
    if any(w in q_lower for w in ["calls", "uses", "imports"]):
        return "caller"

    return "related"


def infer_selection_reason(chunk: EvidenceChunk, role: str) -> str:
    """Generate a short human-readable reason for selecting one chunk.

    Args:
        chunk: Retrieved evidence chunk under consideration.
        role: Evidence role assigned to the chunk.

    Returns:
        A sentence explaining why the chunk was included.
    """
    name = chunk.symbol_name or chunk.file_path  # fall back to the path when there's no symbol name
    reasons = {
        "definition": f"Contains the definition of {name}",
        "caller": f"Contains {name}, which may call or use the target",
        "related": f"Related code from {chunk.file_path} with relevance score {chunk.retrieval_score:.4f}",
    }
    return reasons.get(role, f"Retrieved via {chunk.retrieval_method} with score {chunk.retrieval_score:.4f}")


def context_pack_to_bundle(pack: ContextPack) -> EvidenceBundle:
    """Convert a compiled context pack into a formal evidence bundle.

    Args:
        pack: Context pack produced by the retrieval pipeline.

    Returns:
        An ``EvidenceBundle`` with labeled snippets, coverage notes, and provenance.
    """
    question_type = classify_question(pack.question)

    snippets = []
    for i, chunk in enumerate(pack.chunks):
        role = infer_evidence_role(chunk, pack.question)
        snippets.append(EvidenceSnippet(
            chunk_id=chunk.chunk_id,
            content_hash=chunk.content_hash,
            text=chunk.text,
            token_count=chunk.token_count,
            file_path=chunk.file_path,
            start_line=chunk.start_line,
            end_line=chunk.end_line,
            symbol_name=chunk.symbol_name,
            retrieval_method=chunk.retrieval_method,
            relevance_score=chunk.retrieval_score,
            retrieval_rank=i + 1,
            selection_reason=infer_selection_reason(chunk, role),
            evidence_role=role,
        ))

    # Assess coverage
    roles_present = set(s.evidence_role for s in snippets)
    coverage_parts = []
    if "definition" in roles_present:
        coverage_parts.append("target definition found")
    else:
        coverage_parts.append("no direct definition found")
    if "caller" in roles_present:
        caller_count = sum(1 for s in snippets if s.evidence_role == "caller")
        coverage_parts.append(f"{caller_count} caller(s) found")
    coverage_notes = "; ".join(coverage_parts)

    strategies = sorted({s.retrieval_method for s in snippets})  # sorted for deterministic output

    return EvidenceBundle(
        question=pack.question,
        question_type=question_type,
        snippets=snippets,
        total_tokens=pack.total_tokens,
        token_budget=pack.token_budget,
        retrieval_strategies_used=strategies,
        coverage_notes=coverage_notes,
        warnings=pack.warnings,
    )


if __name__ == "__main__":
    from retrieval.context_compiler import compile_context
    from retrieval.hybrid_retrieve import hybrid_retrieve

    question = sys.argv[1] if len(sys.argv) > 1 else (
        "What does validate_path do and what functions call it?"
    )
    pack = compile_context(question, hybrid_retrieve, token_budget=4000)
    bundle = context_pack_to_bundle(pack)

    print(f"Question: {bundle.question}")
    print(f"Type: {bundle.question_type}")
    print(f"Snippets: {len(bundle.snippets)}")
    print(f"Tokens: {bundle.total_tokens} / {bundle.token_budget}")
    print(f"Coverage: {bundle.coverage_notes}")
    print(f"Strategies: {bundle.retrieval_strategies_used}")
    print("\nSnippets:")
    for s in bundle.snippets:
        print(f"  [{s.evidence_role}] {s.file_path} :: {s.symbol_name}")
        print(f"    Reason: {s.selection_reason}")
        print(f"    Score: {s.relevance_score:.4f}, Tokens: {s.token_count}")

Run the converter:

python rag/pack_to_bundle.py "What does validate_path do and what functions call it?"

Expected output:

Question: What does validate_path do and what functions call it?
Type: relationship
Snippets: 4
Tokens: 1823 / 4000
Coverage: target definition found; 2 caller(s) found
Strategies: ['hybrid']

Snippets:
  [definition] agent/tools.py :: validate_path
    Reason: Contains the definition of validate_path
    Score: 0.0489, Tokens: 312
  [caller] agent/tools.py :: read_file
    Reason: Contains read_file, which may call or use the target
    Score: 0.0412, Tokens: 287
  [related] retrieval/query_metadata.py :: find_symbol
    Reason: Related code from retrieval/query_metadata.py with relevance score 0.0387
    Score: 0.0387, Tokens: 241
  [caller] agent/loop.py :: run_agent
    Reason: Contains run_agent, which may call or use the target
    Score: 0.0301, Tokens: 403

Context packs as a cost optimization lever

Here's something worth noting: the evidence bundle above uses 1,823 tokens out of a 4,000-token budget. That's less than half. Smaller context packs mean fewer input tokens, which directly reduces API costs.

This connection will play a significant role in Module 6 when we cover cost optimization, but the insight is available now: every token in the context pack that doesn't contribute to the answer is wasted money. The context compiler's deduplication, slicing, and token budgeting aren't just quality improvements, but cost optimizations.

A rough calculation: if your system handles 1,000 questions per day, and you reduce average context size from 4,000 tokens to 2,000 tokens, that's 2 million fewer input tokens per day. At typical API pricing, that's a meaningful operational savings. We'll build the cost tracking to make this visible in Module 6.
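
The arithmetic is simple enough to script. The price below is an assumed illustrative rate, not any provider's actual pricing:

```python
# Back-of-envelope savings from smaller context packs.
# PRICE_PER_MILLION_INPUT_TOKENS is an assumption for illustration;
# substitute your provider's actual rate.
QUESTIONS_PER_DAY = 1_000
TOKENS_BEFORE = 4_000
TOKENS_AFTER = 2_000
PRICE_PER_MILLION_INPUT_TOKENS = 0.15  # assumed $/1M input tokens

saved_tokens = (TOKENS_BEFORE - TOKENS_AFTER) * QUESTIONS_PER_DAY  # 2,000,000 per day
daily_savings = saved_tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS
print(f"{saved_tokens:,} tokens/day saved, "
      f"${daily_savings:.2f}/day, ${daily_savings * 30:.2f}/month")
```

At the assumed rate this is small per day, but it scales linearly with question volume and compounds with output-token savings from more focused answers.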

The context-pack contract

The Context-Pack Contract reference page has the full schema specification, validation rules, and anti-patterns. Consult it when building integrations or debugging evidence quality issues.

The core contract is straightforward: a context pack is a valid evidence bundle that satisfies these properties:

  1. Every snippet has provenance: you can trace any piece of evidence back to its source file, line range, and retrieval method
  2. Token counts are accurate: the total_tokens field matches the actual token count of the formatted evidence
  3. No duplicate content: content-level deduplication has been applied (not just chunk-ID deduplication)
  4. Token budget is respected: the total tokens don't exceed the specified budget
  5. Selection rationale is present: each snippet explains why it was included

When any of these properties is violated, you'll see it in answer quality before you see it in metrics. Missing provenance leads to unhelpful citations. Inaccurate token counts lead to budget overruns. Duplicates waste tokens on redundant information. Missing rationale leads to the model treating all evidence as equally important.
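
These properties are mechanically checkable. Here's a sketch of a contract validator that operates on a bundle's dict form (e.g. the output of to_dict()); it approximates property 2 by summing snippet token counts, and it's an illustration, not a canonical validator:

```python
def validate_contract(bundle: dict) -> list[str]:
    """Check a serialized evidence bundle against the five contract properties.

    Returns a list of violations; an empty list means the contract holds.
    Sketch only: property 2 is approximated by summing per-snippet counts
    rather than re-tokenizing the formatted evidence.
    """
    violations = []
    snippets = bundle.get("snippets", [])
    for i, s in enumerate(snippets):
        # Property 1: every snippet has provenance
        if not s.get("file_path") or not s.get("retrieval_method"):
            violations.append(f"snippet {i}: missing provenance")
        # Property 5: selection rationale is present
        if not s.get("selection_reason"):
            violations.append(f"snippet {i}: missing selection rationale")
    # Property 2: token counts are accurate (approximated)
    if bundle.get("total_tokens", 0) != sum(s.get("token_count", 0) for s in snippets):
        violations.append("total_tokens does not match the sum of snippet token counts")
    # Property 3: no duplicate content (content-level, via content_hash)
    hashes = [s.get("content_hash") for s in snippets]
    if len(hashes) != len(set(hashes)):
        violations.append("duplicate content detected")
    # Property 4: token budget is respected
    if bundle.get("total_tokens", 0) > bundle.get("token_budget", 0):
        violations.append("token budget exceeded")
    return violations
```

Running this over every bundle in a benchmark run catches contract drift before it shows up as degraded answers.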

Benchmark: raw retrieval vs. structured context packs

This is where we prove the evidence bundle schema earns its complexity. We'll run the same benchmark questions through two pipelines:

  1. Raw retrieval: take the top-k results from hybrid retrieval, concatenate them, and stuff them into the prompt
  2. Structured context packs: run through the full context compiler and evidence bundle pipeline

# rag/benchmark_bundles.py
"""Benchmark: raw retrieval vs. structured evidence bundles."""
import json
import os
import sys
from datetime import datetime, timezone
from pathlib import Path

sys.path.insert(0, ".")
from openai import OpenAI
from retrieval.hybrid_retrieve import hybrid_retrieve
from retrieval.context_compiler import compile_context
from rag.pack_to_bundle import context_pack_to_bundle

client = OpenAI()
MODEL = "gpt-4o-mini"
BENCHMARK_FILE = Path("benchmark-questions.jsonl")
OUTPUT_DIR = Path("harness/runs")
TOKEN_BUDGET = 4000


def answer_raw(question: str) -> dict:
    """Answer a question using raw retrieval results only.

    Args:
        question: User question to answer.

    Returns:
        A dictionary containing the answer text, method name, and chunk count.
    """
    raw_results = hybrid_retrieve(question)
    context_parts = []
    for r in raw_results[:10]:
        context_parts.append(f"File: {r.get('file_path', 'unknown')}\n{r.get('text', '')}")
    raw_context = "\n\n---\n\n".join(context_parts)

    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": (
                "You are a code assistant. Answer the question using the "
                "retrieved context below. Cite specific files when possible.\n\n"
                f"Context:\n{raw_context}"
            )},
            {"role": "user", "content": question},
        ],
        temperature=0,
    )
    return {"answer": response.choices[0].message.content, "method": "raw_retrieval",
            "num_chunks": len(context_parts)}


def answer_bundled(question: str) -> dict:
    """Answer a question using the full evidence-bundle pipeline.

    Args:
        question: User question to answer.

    Returns:
        A dictionary containing the answer text plus bundle coverage metadata.
    """
    pack = compile_context(question, hybrid_retrieve, token_budget=TOKEN_BUDGET)
    bundle = context_pack_to_bundle(pack)
    context = bundle.to_prompt_context()

    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": (
                "You are a code assistant. Answer the question using ONLY the "
                "retrieved evidence below. Cite evidence by label. If evidence "
                "is insufficient, say so.\n\n"
                f"Evidence:\n{context}"
            )},
            {"role": "user", "content": question},
        ],
        temperature=0,
    )
    return {"answer": response.choices[0].message.content, "method": "evidence_bundle",
            "num_chunks": len(bundle.snippets), "total_tokens": bundle.total_tokens,
            "coverage": bundle.coverage_notes}


def run_comparison():
    """Benchmark raw retrieval against structured evidence bundles.

    Reads the benchmark questions, answers each with both pipelines, and
    appends the paired results to a timestamped JSONL file for manual grading.
    """
    questions = []
    with open(BENCHMARK_FILE) as f:
        for line in f:
            if line.strip():
                questions.append(json.loads(line))

    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d-%H%M%S")
    output_file = OUTPUT_DIR / f"bundle-comparison-{timestamp}.jsonl"
    os.makedirs(OUTPUT_DIR, exist_ok=True)

    print(f"Comparing raw retrieval vs. evidence bundles on {len(questions)} questions\n")
    for i, q in enumerate(questions):
        print(f"[{i+1}/{len(questions)}] {q['question'][:60]}...")
        raw_result = answer_raw(q["question"])
        bundle_result = answer_bundled(q["question"])
        entry = {
            "question_id": q["id"], "question": q["question"],
            "category": q["category"],
            "raw_answer": raw_result["answer"], "raw_chunks": raw_result["num_chunks"],
            "bundle_answer": bundle_result["answer"],
            "bundle_chunks": bundle_result["num_chunks"],
            "bundle_tokens": bundle_result.get("total_tokens", 0),
            "bundle_coverage": bundle_result.get("coverage", ""),
            "raw_grade": None, "bundle_grade": None, "grading_notes": "",
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
        with open(output_file, "a") as f:
            f.write(json.dumps(entry) + "\n")

    print(f"\nDone. Results saved to {output_file}")
    print("Grade both 'raw_grade' and 'bundle_grade' fields, then compare.")


if __name__ == "__main__":
    run_comparison()

Run the comparison:

python rag/benchmark_bundles.py

After running the benchmark, grade each pair of answers manually. For each question, you'll have a raw answer and a bundled answer. Score them on:

  • Correctness: is the answer factually accurate?
  • Grounding: does the answer cite specific evidence?
  • Completeness: does the answer address the full question?
  • Conciseness: does the answer avoid irrelevant information?

In my experience, structured bundles consistently outperform raw retrieval on grounding and conciseness, with smaller gains on correctness. The improvement is most visible on relationship questions ("what calls X?") where the evidence role metadata helps the model understand how to use each piece of evidence.
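
Once you've filled in the grade fields, a small script can tally the comparison. This sketch assumes you graded with "pass"/"fail" strings (an assumption; adapt it to whatever scale you chose):

```python
import json

def summarize_grades(jsonl_lines: list[str]) -> dict[str, float]:
    """Compute pass rates for raw vs. bundled answers from graded records.

    Sketch; assumes "raw_grade" and "bundle_grade" were filled in as
    "pass"/"fail" strings, and skips rows that are still ungraded.
    """
    raw_pass = bundle_pass = graded = 0
    for line in jsonl_lines:
        r = json.loads(line)
        if r.get("raw_grade") is None or r.get("bundle_grade") is None:
            continue  # skip ungraded rows
        graded += 1
        raw_pass += r["raw_grade"] == "pass"
        bundle_pass += r["bundle_grade"] == "pass"
    if graded == 0:
        return {"graded": 0}
    return {"graded": graded,
            "raw_pass_rate": raw_pass / graded,
            "bundle_pass_rate": bundle_pass / graded,
            "delta": (bundle_pass - raw_pass) / graded}
```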

Exercises

  1. Build the evidence bundle schema (rag/evidence_bundle.py) and the pack-to-bundle converter (rag/pack_to_bundle.py). Run them on three questions and inspect the full bundle JSON. Check that every snippet has a selection reason and evidence role.
  2. Run the benchmark comparison (rag/benchmark_bundles.py) on at least 10 questions. Grade both raw and bundled answers. Calculate the accuracy difference.
  3. Inspect the coverage_notes field for each bundle. For questions where the bundled answer was still wrong, was the coverage assessment accurate? Did it correctly identify what was missing?
  4. Experiment with the token budget as a cost lever. Run the same 10 questions at budgets of 1000, 2000, and 4000 tokens. Plot answer quality vs. context tokens. Where does quality plateau?
  5. Write three new benchmark questions specifically designed to test evidence coverage: one where the definition is sufficient, one where callers are needed, and one where the question requires evidence from multiple files.

Completion checkpoint

You should now have:

  • A formalized evidence bundle schema with snippets, scores, metadata, file paths, line ranges, and selection rationale
  • A converter that turns Module 4 context packs into evidence bundles
  • Benchmark results comparing raw retrieval vs. structured evidence bundles on at least 10 questions
  • Evidence that structured bundles improve answer quality (particularly grounding and conciseness)
  • An understanding of how context pack size functions as a cost optimization lever

Reflection prompts

  • Which fields in the evidence bundle schema contributed most to answer quality? Were there fields that felt redundant or unhelpful?
  • How did the selection_reason field affect the model's use of evidence? Could you see a difference in answers when the model knew why each piece of evidence was included?
  • At what token budget did answer quality start to degrade? Was it the same for all question types, or did relationship questions need more context than code lookups?

Connecting to the project

Your anchor project now has a formal contract between retrieval and generation: the evidence bundle schema. Every question gets structured evidence with provenance, selection rationale, and coverage assessment. This contract will remain stable as we add retrieval routing in the next lesson and cost optimization in Module 6.

The benchmark comparison you built here is also a template for future evaluations. Any time you change the retrieval pipeline, the context compiler, or the generation prompt, you can re-run this comparison to measure whether the change helped or hurt. We'll formalize this into an eval harness in Module 6.

What's next

Retrieval Routing. A single retrieval path stops being enough once questions span code, docs, and cases that need no retrieval at all; the next lesson adds deliberate routing between them.

References

Start here

  • Context-Pack Contract — the full schema specification, validation rules, and anti-patterns for context packs



Glossary
API (Application Programming Interface)Foundational terms
A structured way for programs to communicate. In this context, usually an HTTP endpoint you call to interact with an LLM.
AST (Abstract Syntax Tree)Foundational terms
A tree representation of source code structure. Used by parsers like Tree-sitter to understand code as a hierarchy of functions, classes, and statements. You'll encounter this more deeply in the Code Retrieval module, but the concept appears briefly in retrieval fundamentals.
BM25 (Best Match 25)Foundational terms
A classical ranking function for keyword search. Scores documents by term frequency and inverse document frequency. Often competitive with or complementary to vector search.
ChunkingFoundational terms
Splitting a document into smaller pieces for indexing and retrieval. Chunk boundaries significantly affect retrieval quality. Split at the wrong place and your retrieval will return half a function or the end of one paragraph glued to the start of another.
Context engineeringFoundational terms
The discipline of selecting, packaging, and budgeting the information a model sees at inference time. Prompts, retrieved evidence, tool results, memory, and state are all parts of context. Context engineering is arguably the core skill of AI engineering. Bigger context windows are not a substitute for better context selection.
Context rotFoundational terms
Degradation of output quality caused by stale, noisy, or accumulated context. Symptoms include stale memory facts, conflicting retrieved evidence, bloated prompt history, and accumulated instructions that contradict each other. A form of technical debt in AI systems.
Context windowFoundational terms
The maximum number of tokens an LLM can process in a single request (input + output combined).
EmbeddingFoundational terms
A fixed-length numeric vector representing a piece of text. Used for similarity search: texts with similar meanings have nearby embeddings.
EndpointFoundational terms
A specific URL path that accepts requests and returns responses (e.g., POST /v1/chat/completions).
GGUFFoundational terms
A file format for quantized models used by llama.cpp and Ollama. When you see a model name like qwen2.5:7b-q4_K_M, the suffix indicates the quantization scheme. GGUF supports mixed quantization (different precision for different layers) and is the most common format for local inference.
HallucinationFoundational terms
When a model generates content that sounds confident but isn't supported by the evidence it was given, or fabricates details that don't exist. Not the same as "any wrong answer"; a model that misinterprets ambiguous instructions gave a bad answer but didn't hallucinate. Common causes: weak prompt, missing context, context rot, model limitation, or retrieval failure.
InferenceFoundational terms
Running a trained model to generate output from input. What happens when you call an API. Most AI engineering work is inference-time work: building systems around models, not training them. Use "inference," not "inferencing."
JSON (JavaScript Object Notation)Foundational terms
A lightweight text format for structured data. The lingua franca of API communication.
Lexical searchFoundational terms
Finding items by matching keywords or terms. Includes BM25, TF-IDF (Term Frequency–Inverse Document Frequency), and simple keyword matching. Returns exact term matches, not semantic similarity.
LLM (Large Language Model)Foundational terms
A neural network trained on large text corpora that generates text by predicting the next token. The core technology behind AI engineering; every tool, pattern, and pipeline in this curriculum runs on top of one.
MetadataFoundational terms
Structured information about a document or chunk (file path, language, author, date, symbol type). Used for filtering retrieval results.
Neural networkFoundational terms
A computing system loosely inspired by biological neurons, built from layers of mathematical functions that transform inputs into outputs. LLMs are a specific type of neural network (transformers) trained on text. You don't need to understand neural network internals to do AI engineering, but knowing the term helps when reading external resources.
Reasoning modelFoundational terms
A model optimized for complex multi-step planning, math, and logic (e.g., o3, o4-mini). Slower and more expensive but better on hard problems. Sometimes called "LRM" (large reasoning model), but "reasoning model" is the more consistent term across provider docs.
RerankingFoundational terms
A second-pass scoring step that re-orders retrieved results using a more expensive model. Improves precision after an initial broad retrieval.
SchemaFoundational terms
A formal description of the shape and types of a data structure. Used to validate inputs and outputs.
SLM (small language model)Foundational terms
A compact model (typically 1-7B parameters) that runs on consumer hardware with lower cost, latency, and better privacy (e.g., Phi, small Llama variants, Gemma). The right choice when privacy, offline operation, predictable cost, or low latency matter more than peak capability.
System promptFoundational terms
A special message that sets the model's behavior, role, and constraints for a conversation.
TemperatureFoundational terms
A parameter controlling output randomness. Lower values produce more deterministic output; higher values produce more varied output. Does not affect the model's intelligence.
TokenFoundational terms
The basic unit an LLM processes. Not a word. Tokens are sub-word fragments. "unhappiness" might be three tokens: "un", "happi", "ness". Token count determines cost and context window usage.
Top-kFoundational terms
The number of results returned from a retrieval query. "Top-5" means the five highest-scoring results.
Top-p (nucleus sampling)Foundational terms
An alternative to temperature for controlling output diversity. Selects from the smallest set of tokens whose cumulative probability exceeds p.
Vector searchFoundational terms
Finding items by proximity in embedding space (nearest neighbors). Returns "similar" results, not "exact match" results.
vLLM (virtual LLM)Foundational terms
An inference serving engine (not a model) that hosts open-weight models behind an OpenAI-compatible HTTP endpoint. Infrastructure layer, not model layer. Relevant when moving from hosted APIs to self-hosting.
WeightsFoundational terms
The learned parameters inside a model. Changed during training, fixed during inference.
Workhorse modelFoundational terms
A general-purpose LLM optimized for speed and broad capability (e.g., GPT-4o-mini, Claude Haiku, Gemini Flash). The default for most tasks. When someone says "LLM" without qualification, they usually mean this.
**Benchmark and Harness terms**

Baseline
The first measured performance of your system on a benchmark. Everything else is compared against this. Without a baseline, you can't tell whether a change helped.
Benchmark
A fixed set of questions or tasks with known-good answers, used to measure system performance over time.
Run log
A structured record (typically JSONL) of every system run: what input was given, what output was produced, what tools were called, how long it took, and what it cost. The raw data that evals, telemetry, and cost analysis are built from.
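A minimal JSONL run log using only the standard library. The field names here are this sketch's choices, not a fixed standard; pick a schema and keep it stable so evals and cost analysis can parse old runs:

```python
import json
import time

def log_run(path, record):
    """Append one run record as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

log_run("runs.jsonl", {
    "ts": time.time(),
    "input": "Where is retry logic configured?",
    "output": "In src/http/client.py (cited).",
    "tool_calls": ["vector_search"],
    "latency_ms": 840,
    "cost_usd": 0.0021,
})

# Reading it back for evals or cost analysis:
with open("runs.jsonl", encoding="utf-8") as f:
    runs = [json.loads(line) for line in f]
```

One JSON object per line means you can append cheaply during runs and stream-process millions of records later without loading the whole file.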
**Agent and Tool Building terms**

A2A (Agent-to-Agent protocol)
An open protocol for peer-to-peer agent collaboration. Agents discover each other's capabilities and delegate or negotiate tasks as equals. Different from MCP (which connects agents to tools, not to other agents) and from handoffs (which transfer control within one system).
Agent
A system where an LLM decides which tools to call, observes results, and iterates until a task is complete. Agent = model + tools + control loop.
Control loop
The code that manages the agent's cycle: send prompt, check for tool calls, execute tools, append results, repeat or finish.
Handoff
Passing control from one agent or specialist to another within an orchestrated system.
MCP (Model Context Protocol)
An open protocol for exposing tools, resources, and prompts to AI applications in a standardized way. Connects agents to capabilities (tools and data), not to other agents.
Tool calling / function calling
The model's ability to request execution of a specific function with structured arguments, rather than just generating text.
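A sketch of the application side of a tool call. The dict below imitates the general shape providers return (a function name plus JSON-encoded arguments); the tool itself is hypothetical:

```python
import json

# Illustrative tool call as it might arrive from the model:
tool_call = {
    "name": "get_file_contents",
    "arguments": json.dumps({"path": "src/http/client.py"}),
}

def get_file_contents(path):
    # Hypothetical tool implementation.
    return f"<contents of {path}>"

# Your code, not the model, executes the function and sends the
# result back as a new message for the model to read.
TOOLS = {"get_file_contents": get_file_contents}

args = json.loads(tool_call["arguments"])
result = TOOLS[tool_call["name"]](**args)
```

The key point: the model only ever produces a structured *request*; dispatching it through an explicit registry like `TOOLS` keeps the model from invoking anything you didn't deliberately expose.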
**Code Retrieval terms**

Context compilation / context packing
The process of selecting and assembling the smallest useful set of evidence for a specific task. Not "dump everything retrieved into the prompt."
Grounding
Tying model assertions to specific evidence. A grounded answer cites what it found; an ungrounded answer asserts without evidence.
Hybrid retrieval
Combining multiple retrieval methods (e.g., vector search + keyword search + metadata filters) and merging or reranking the results.
Knowledge graph
A data structure that stores entities and their relationships explicitly (e.g., "function A calls function B," "module X imports module Y"). Useful for traversal and dependency reasoning. One retrieval strategy among several, often overused when simpler metadata or adjacency tables would suffice.
RAG (Retrieval-Augmented Generation)
A pattern where the model's response is grounded in retrieved external evidence rather than relying solely on its training data.
Symbol table
A mapping of code identifiers (functions, classes, variables) to their locations and metadata.
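A minimal sketch using the standard-library `ast` module (for Python source only; a multi-language indexer would use something like Tree-sitter instead):

```python
import ast

def build_symbol_table(source, path):
    """Map function and class names to their locations in one file."""
    table = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            table[node.name] = {"path": path, "line": node.lineno}
    return table

src = "class Client:\n    def retry(self):\n        pass\n"
build_symbol_table(src, "client.py")
# → {"Client": {"path": "client.py", "line": 1},
#    "retry": {"path": "client.py", "line": 2}}
```

A retrieval system consults this table to jump straight to a definition by name, which is often faster and more precise than searching for the identifier as text.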
Tree-sitter
An incremental parsing library that builds ASTs for source code. Used in this curriculum for code-aware chunking and symbol extraction.
**RAG and Grounded Answers terms**

Context pack
A structured bundle of evidence assembled for a specific task, with metadata about provenance, relevance, and token budget.
Evidence bundle
A collection of retrieved items grouped for a specific sub-task, with enough metadata to evaluate whether the evidence is relevant and sufficient.
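One possible shape for such a bundle, sketched as dataclasses. The field names are this sketch's choices; the point is that each item carries snippet, location, score, retrieval method, and rationale together, which is the metadata the lesson's schema formalizes:

```python
from dataclasses import dataclass, field

@dataclass
class EvidenceItem:
    snippet: str            # the retrieved text itself
    path: str               # file it came from
    start_line: int
    end_line: int
    score: float            # relevance score from the retriever
    retrieval_method: str   # e.g. "vector", "keyword", "symbol_table"
    rationale: str          # why this item was selected

@dataclass
class EvidenceBundle:
    question: str
    items: list[EvidenceItem] = field(default_factory=list)

bundle = EvidenceBundle(
    question="Where is retry logic configured?",
    items=[EvidenceItem(
        snippet="MAX_RETRIES = 3",
        path="src/http/client.py",
        start_line=12,
        end_line=12,
        score=0.91,
        retrieval_method="keyword",
        rationale="Direct match on the retry configuration constant.",
    )],
)
```

Because every item carries its own path and line range, the generator can emit precise citations, and the eval system can check whether cited evidence actually supports the answer.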
Retrieval routing
Deciding which retrieval strategy or method to use for a given query. Different questions need different retrieval methods.
**Observability and Evals terms**

Eval
A structured test that measures system quality. Not the same as training. Evals measure, they don't change the model.
Harness (AI harness / eval harness)
The experiment and evaluation framework around your model or agent. It runs benchmark tasks, captures outputs, logs traces, grades results, and compares system versions. It turns ad hoc "try it and see" into repeatable, comparable experiments. Typically includes: input dataset, prompt and tool configuration, model/provider selection, execution loop, logging, grading, and artifact capture.
LLM-as-judge
Using a language model to evaluate or grade the output of another model or system. Useful for scaling evaluation beyond manual review, but requires rubric quality, judge consistency checks, and human spot-checking. Not a replacement for exact-match checks where they apply.
OpenTelemetry (OTel)
An open standard for collecting and exporting telemetry data (traces, metrics, logs). Vendor-agnostic.
RAGAS
A specific eval framework for retrieval-augmented generation. Measures metrics like faithfulness, relevance, and context precision. One tool example, not a foundational concept. Learn the metrics first, then the tool.
Span
A single operation within a trace (e.g., one tool call, one retrieval query). Traces are made of spans.
Telemetry
Structured data about system behavior: what happened, when, how long it took, what it cost. Includes traces, metrics, and events.
Trace
A structured record of one complete run through the system, including all steps, tool calls, and decisions.
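A toy illustration of the trace/span relationship, using only the standard library (a real system would use an OpenTelemetry SDK rather than hand-rolling this):

```python
import time
import uuid

class Trace:
    """Minimal trace: a run ID plus a list of timed spans."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans = []

    def span(self, name):
        trace = self

        class _Span:
            def __enter__(self):
                self.start = time.perf_counter()
                return self

            def __exit__(self, *exc):
                trace.spans.append({
                    "name": name,
                    "duration_ms": (time.perf_counter() - self.start) * 1000,
                })

        return _Span()

trace = Trace()
with trace.span("retrieval"):
    pass  # run a retrieval query here
with trace.span("generation"):
    pass  # call the model here
```

One trace per run, one span per operation inside it: that nesting is what lets you answer "which step made this run slow or expensive?"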
**Orchestration and Memory terms**

Long-term memory
Persistent facts that survive across conversations. Requires write policies to manage what gets stored, updated, or deleted.
Orchestration
Explicit control over how tasks are routed, delegated, and synthesized across multiple agents or specialists.
Router
A component that decides which specialist or workflow path to use for a given query.
Specialist
An agent or workflow tuned for a narrow task (e.g., "code search," "documentation lookup," "test generation"). Specialists are composed by an orchestrator.
Thread memory
Conversation state that persists within a single session or thread.
Workflow memory
Intermediate state that persists within a multi-step task but doesn't survive beyond the workflow's completion.
**Optimization terms**

Catastrophic forgetting
When fine-tuning causes a model to lose capabilities it had before training. The model gets better at the fine-tuned task but worse at tasks it previously handled. PEFT methods like LoRA reduce this risk by freezing original weights.
Distillation
Training a smaller (student) model to reproduce the behavior of a larger (teacher) model on a specific task.
DPO (Direct Preference Optimization)
A method for preference-based model optimization that's simpler than RLHF, training the model directly on preference pairs without a separate reward model.
Fine-tuning
Updating a model's weights on task-specific data to change its behavior permanently. An umbrella term that includes SFT, instruction tuning, RLHF, DPO, and other techniques. See the fine-tuning landscape table in Lesson 8.3 for how these relate.
Full fine-tuning
Updating all of a model's parameters during training, as opposed to PEFT methods that update only a small subset. Requires significantly more GPU memory and compute. Produces the most thorough adaptation but carries higher risk of catastrophic forgetting.
Inference server
Software (like vLLM or Ollama) that hosts a model and serves inference requests.
Instruction tuning
A specific application of SFT where the training data consists of instruction-response pairs. This is how base models become chat models: the technique is SFT, the data format is instructions. Not a separate technique from SFT.
LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning method that trains small adapter matrices instead of updating all model weights. Dramatically reduces GPU memory and compute requirements.
Parameter count
The number of learned weights in a model, commonly expressed in billions (e.g., "7B" = 7 billion parameters). Determines memory requirements (roughly 2 bytes per parameter at FP16) and broadly correlates with capability, though training quality and architecture matter as much as size. See Model Selection and Serving for sizing guidance.
PEFT (Parameter-Efficient Fine-Tuning)
A family of methods (including LoRA) that fine-tune a small subset of parameters instead of the full model.
Preference optimization
Training methods (RLHF, DPO) that use human or automated preference signals to improve model behavior. "This output is better than that output" rather than "this is the correct output."
QLoRA (Quantized LoRA)
LoRA applied to a quantized (compressed) base model. Further reduces memory requirements, enabling fine-tuning on consumer hardware.
Quantization
Reducing the precision of model weights (e.g., FP16 → INT4) to shrink memory usage and increase inference speed at some quality cost. A 7B model at FP16 needs ~14 GB VRAM; quantized to 4-bit, it fits in ~4 GB. Common formats include GGUF (llama.cpp/Ollama), GPTQ and AWQ (vLLM/HuggingFace). See Model Selection and Serving for format details and tradeoffs.
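The arithmetic behind those figures, as a small helper (weights only; KV cache and runtime overhead add more in practice, which is why 3.5 GB of 4-bit weights is quoted as ~4 GB above):

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate memory for model weights alone.

    Excludes KV cache and runtime overhead, which add to the
    real VRAM requirement.
    """
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

weight_memory_gb(7, 16)  # → 14.0  (FP16: 2 bytes per parameter)
weight_memory_gb(7, 4)   # → 3.5   (4-bit quantized)
```

The same helper makes sizing questions quick to answer, e.g. whether a 13B model at 4-bit (~6.5 GB of weights) fits a given GPU.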
Overfitting
When a model memorizes training examples instead of learning generalizable patterns. The model performs well on training data but poorly on new inputs. Detected by monitoring validation loss alongside training loss.
RLHF (Reinforcement Learning from Human Feedback)
A training method that uses human preference signals to improve model behavior through a reward model. More complex than DPO (requires training a separate reward model) but offers more control over the optimization objective.
SFT (Supervised Fine-Tuning)
Fine-tuning using input-output pairs where the desired output is known. The most common fine-tuning approach.
TRL (Transformer Reinforcement Learning)
A Hugging Face library for training language models with reinforcement learning, SFT, and other optimization methods.
**Cross-cutting terms**

Consumer chat app
The browser or desktop product meant for human conversation (ChatGPT, Claude, HuggingChat). Useful for experimentation, but not the same as API access.
Developer platform
The provider's API, billing, API-key, and developer-docs surface. This is what you need for this learning path.
Hosted API
The provider runs the model for you and you call it over HTTP.
Local inference
You run the model on your own machine.
Provider
The company or service that hosts a model API you call from code.
Prompt caching
Reusing computation from repeated prompt prefixes to reduce latency and cost on subsequent requests with the same prefix.
Rate limiting
Constraints on how many API requests you can make per unit of time. An operational concern that affects system design and cost.
Token budget
The maximum number of tokens you allocate for a specific part of the context (e.g., "retrieval evidence gets at most 4K tokens"). A context engineering tool for preventing any single component from dominating the context window.
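A sketch of budget enforcement during context compilation. The whitespace-based token estimate is a deliberate simplification; a real compiler would use the model's actual tokenizer for accurate counts:

```python
def fit_to_budget(snippets, max_tokens, estimate=lambda s: len(s.split())):
    """Keep snippets in priority order until the budget is spent.

    `estimate` is a crude whitespace token count; swap in a real
    tokenizer for production use.
    """
    kept, used = [], 0
    for snippet in snippets:
        cost = estimate(snippet)
        if used + cost > max_tokens:
            break  # stop before the budget is exceeded
        kept.append(snippet)
        used += cost
    return kept

fit_to_budget(["a b c", "d e", "f g h i"], max_tokens=5)  # → ["a b c", "d e"]
```

Applying a budget like this per component (retrieval evidence, conversation history, system instructions) is what keeps any single component from crowding the others out of the context window.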