Module 4: Code Retrieval - Graph and Hybrid Retrieval

Graph and Hybrid Retrieval (Tier 3)

Your AST-aware benchmark results will have surfaced a clear pattern: symbol lookup and architecture questions improved significantly, but relationship questions ("what calls this function?", "what breaks if I change this file?", "trace the request path from the API endpoint to the database") are still pretty weak. Vector similarity can tell you which code looks like other code, but it can't tell you which code depends on other code. For that, you need explicit relationship data.

This lesson adds two capabilities: a graph of import and call edges between your code entities, and hybrid retrieval that combines vector search with lexical search and graph traversal. Together, these will handle the question types that pure vector search can't reach.

What you'll learn

  • Build an import graph and call graph from your anchor repo's AST data
  • Query the graph to answer relationship questions: importers, reverse dependencies, and file-level impact analysis
  • Combine vector search, lexical search, and graph traversal into a hybrid retrieval pipeline
  • Apply reciprocal rank fusion to merge results from multiple retrieval methods
  • Measure the improvement on change-impact and debugging benchmark questions

Concepts

Import graph: a directed graph where nodes are files (or modules) and edges represent import relationships. If routes/api.py imports services/auth.py, there's an edge from api.py to auth.py. Import graphs answer questions like "what depends on this module?" and "what would be affected if this file changed?"
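
To make the direction of those edges concrete, here is a minimal NetworkX sketch; the file paths are made-up examples, not files from your anchor repo:

# Minimal import-graph sketch. File paths are hypothetical examples.
import networkx as nx

g = nx.DiGraph()
g.add_edge("routes/api.py", "services/auth.py")   # api.py imports auth.py
g.add_edge("services/auth.py", "models/user.py")  # auth.py imports user.py

# "What does api.py import?" -> follow outgoing edges
print(list(g.successors("routes/api.py")))       # ['services/auth.py']
# "What depends on auth.py?" -> follow incoming edges
print(list(g.predecessors("services/auth.py")))  # ['routes/api.py']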

Call graph: conceptually, a directed graph where nodes are functions and edges represent call relationships. In a full-fidelity implementation, handle_request calling validate_input would create an edge between those two function nodes. In practice, building a precise call graph from static analysis is hard. Python's dynamic dispatch, decorators, and closures make the resulting graph ambiguous. The implementation we'll build here takes a simpler path, using file-to-file edges based on name resolution against a symbol table. It answers "which files likely call symbols defined in this file?" rather than the more precise "which function calls which function." That's fairly coarse, but it's meaningfully useful for impact analysis and good enough to demonstrate when graph retrieval helps.
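
The core of that simpler path fits in a few lines. This is a toy sketch with an illustrative symbol table; the names and files are not from the anchor repo:

# Toy sketch of name resolution: a call-site name is looked up in a symbol
# table, producing a coarse file-to-file edge rather than a function-level one.
symbol_table = {"validate_input": [{"file": "services/validation.py"}]}

caller_file = "routes/api.py"
call_name = "validate_input"  # discovered by walking the caller's AST

for definition in symbol_table.get(call_name, []):
    if definition["file"] != caller_file:
        print(f"{caller_file} likely calls into {definition['file']} (via {call_name})")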

Hybrid retrieval: combining multiple retrieval methods and merging their results. In our case, we'll combine vector search (semantic similarity), lexical search (exact term matching), and graph traversal (structural relationships). Each method covers a different weakness in the others.

Reciprocal Rank Fusion (RRF): a method for combining ranked result lists from different retrieval systems. RRF assigns a score to each result based on its rank position in each list, then sums those scores. It's simple, effective, and doesn't require training. The formula is: score(d) = sum(1 / (k + rank_i(d))) across all retrieval methods, where k is a constant (typically 60).
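
A quick worked example with k = 60: suppose a chunk is ranked 1st by vector search and 3rd by lexical search, and doesn't appear in the graph results. Its fused score is 1/(60+1) + 1/(60+3) ≈ 0.0164 + 0.0159 ≈ 0.0323. A chunk ranked 2nd by only one method scores 1/(60+2) ≈ 0.0161, so the chunk that multiple methods agree on wins even though neither of its individual ranks is the best.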

Reranking: taking the merged results from hybrid retrieval and re-scoring them with a more accurate model. A cross-encoder reranker sees the full query and each candidate together, which lets it make finer-grained relevance judgments than any individual retrieval method. We'll treat reranking as an optional enhancement in this lesson.
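
If you do want to try it, a minimal sketch with the sentence-transformers library looks something like this; the model name is just a commonly used example, and `candidates` is assumed to be the fused list from hybrid retrieval with a "text" field per chunk:

# Optional reranking sketch using a cross-encoder (sentence-transformers).
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[dict], top_k: int = 5) -> list[dict]:
    # The cross-encoder scores each (query, candidate) pair jointly,
    # which is why it can outrank the individual retrieval scores.
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]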

Problem-to-Tool Map

| Problem class | Symptom | Cheapest thing to try first | Tool or approach |
| --- | --- | --- | --- |
| "What calls this function?" | Vector search returns code that mentions the function, not code that calls it | Grep for the function name | Call graph with caller lookup |
| "What breaks if I change this file?" | No way to trace downstream dependencies | Manual import tracing | Import graph with transitive dependency walk |
| Exact identifier missed by vectors | Vector search returns semantically similar but wrong results | Grep / BM25 | Hybrid retrieval with lexical leg |
| Too many retrieval sources to rank manually | Vector, lexical, and graph each return different top-k lists | Pick one and ignore the rest | Reciprocal rank fusion |

Default: NetworkX (graph) + BM25 (lexical)

Why this is the default: NetworkX runs locally with no infrastructure, handles the graph sizes we need for single-repo analysis, and makes traversal queries easy. BM25 via the rank_bm25 library adds lexical search without a separate service. Together, they give us what we need to build the full hybrid pipeline with minimal setup.

Portable concept underneath: explicit relationship storage for dependency reasoning (graph) combined with exact-match retrieval for known identifiers (lexical). The same patterns work with any graph store or search engine.

Closest alternatives and when to switch:

  • Neo4j: use when your graph needs persistence, multi-user access, or you're working across multiple repositories. Neo4j's Cypher query language is more expressive for complex traversals.
  • Graph features in your primary database: use when you don't want a separate graph store; some vector databases and SQL databases support basic graph queries.
  • Elasticsearch/OpenSearch: use when you need production-grade lexical search with faceting, and BM25 on a local library isn't sufficient.

Walkthrough

Install dependencies

cd anchor-repo
pip install networkx rank-bm25 openai qdrant-client

Build the import and call graphs

We'll extract import and call relationships from the AST data we built in the previous lesson.

# retrieval/build_graph.py
"""Build import and call graphs from AST data."""
import ast
import json
from pathlib import Path
import networkx as nx

REPO_ROOT = Path(".").resolve()
EXCLUDED_DIRS = {".venv", ".git", "__pycache__", "node_modules", ".tox", ".mypy_cache"}
SYMBOL_TABLE_PATH = Path("retrieval/symbol_table.json")
GRAPH_PATH = Path("retrieval/code_graph.json")


def is_excluded(path: Path) -> bool:
    """Check whether a path should be skipped during repository traversal.

    Args:
        path: Repository-relative path to evaluate.

    Returns:
        ``True`` when the path lives under an excluded directory, otherwise ``False``.
    """
    return any(part in EXCLUDED_DIRS for part in path.parts)


def extract_imports(file_path: Path) -> list[dict]:
    """Extract import statements from one Python file.

    Args:
        file_path: Absolute path to the Python file to analyze.

    Returns:
        A list of dictionaries describing import statements and their line numbers.
    """
    try:
        source = file_path.read_text(errors="replace")
        tree = ast.parse(source)
    except SyntaxError:
        return []

    imports = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                imports.append({
                    "type": "import",
                    "module": alias.name,
                    "line": node.lineno,
                })
        elif isinstance(node, ast.ImportFrom):
            if node.module:
                imports.append({
                    "type": "from_import",
                    "module": node.module,
                    "names": [a.name for a in node.names],
                    "line": node.lineno,
                })
    return imports


def extract_calls(file_path: Path) -> list[dict]:
    """Extract function and method calls from one Python file.

    Args:
        file_path: Absolute path to the Python file to analyze.

    Returns:
        A list of dictionaries describing discovered call sites and line numbers.
    """
    try:
        source = file_path.read_text(errors="replace")
        tree = ast.parse(source)
    except SyntaxError:
        return []

    calls = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            # Simple function call: func_name()
            if isinstance(node.func, ast.Name):
                calls.append({
                    "type": "call",
                    "name": node.func.id,
                    "line": node.lineno,
                })
            # Method call: obj.method()
            elif isinstance(node.func, ast.Attribute):
                calls.append({
                    "type": "method_call",
                    "name": node.func.attr,
                    "line": node.lineno,
                })
    return calls


def resolve_import_to_file(module_name: str, repo_files: set) -> str | None:
    """Resolve an absolute import name to a repository file when possible.

    Args:
        module_name: Imported module name such as ``pkg.utils``.
        repo_files: Set of repository-relative file paths available for matching.

    Returns:
        The matching repository file path, or ``None`` when resolution fails.
    """
    # Convert module.path to file path patterns
    candidates = [
        module_name.replace(".", "/") + ".py",
        module_name.replace(".", "/") + "/__init__.py",
    ]
    for candidate in candidates:
        if candidate in repo_files:
            return candidate
    return None


def build_graphs():
    """Build repository-wide import and call graphs from Python source.

    Args:
        None.

    Returns:
        A tuple of ``(import_graph, call_graph)`` for downstream querying.
    """
    import_graph = nx.DiGraph()
    call_graph = nx.DiGraph()

    # Collect all Python files
    py_files = {}
    for path in sorted(REPO_ROOT.rglob("*.py")):
        if is_excluded(path.relative_to(REPO_ROOT)):
            continue
        rel = str(path.relative_to(REPO_ROOT))
        py_files[rel] = path

    repo_file_set = set(py_files.keys())

    # Add all files as nodes in the import graph
    for rel_path in py_files:
        import_graph.add_node(rel_path)

    # Load symbol table for call graph resolution
    symbol_table = {}
    if SYMBOL_TABLE_PATH.exists():
        table_data = json.loads(SYMBOL_TABLE_PATH.read_text())
        for sym in table_data.get("symbols", []):
            name = sym["name"]
            if name not in symbol_table:
                symbol_table[name] = []
            symbol_table[name].append(sym)

    # Extract imports and calls
    for rel_path, abs_path in py_files.items():
        # Import edges
        imports = extract_imports(abs_path)
        for imp in imports:
            target_file = resolve_import_to_file(imp["module"], repo_file_set)
            if target_file and target_file != rel_path:
                import_graph.add_edge(rel_path, target_file, type="imports", line=imp["line"])

        # Call edges
        calls = extract_calls(abs_path)
        for call in calls:
            # Try to resolve the call to a known symbol
            if call["name"] in symbol_table:
                for sym in symbol_table[call["name"]]:
                    if sym["file"] != rel_path:
                        call_graph.add_edge(
                            rel_path, sym["file"],
                            caller_file=rel_path,
                            callee=call["name"],
                            callee_file=sym["file"],
                            line=call["line"],
                        )

    # Serialize to JSON
    graph_data = {
        "import_graph": {
            "nodes": list(import_graph.nodes()),
            "edges": [
                {"source": u, "target": v, **d}
                for u, v, d in import_graph.edges(data=True)
            ],
        },
        "call_graph": {
            "nodes": list(call_graph.nodes()),
            "edges": [
                {"source": u, "target": v, **d}
                for u, v, d in call_graph.edges(data=True)
            ],
        },
        "stats": {
            "files": len(py_files),
            "import_edges": import_graph.number_of_edges(),
            "call_edges": call_graph.number_of_edges(),
        },
    }

    GRAPH_PATH.write_text(json.dumps(graph_data, indent=2))
    print(f"Built graphs for {len(py_files)} files")
    print(f"Import edges: {import_graph.number_of_edges()}")
    print(f"Call edges: {call_graph.number_of_edges()}")
    print(f"Graph saved to {GRAPH_PATH}")
    return import_graph, call_graph


if __name__ == "__main__":
    build_graphs()
python retrieval/build_graph.py

Expected output:

Built graphs for 23 files
Import edges: 31
Call edges: 47
Graph saved to retrieval/code_graph.json
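
Before moving on, it's worth a quick sanity check on the serialized graph. A small sketch like this, reading the JSON directly, shows which files the rest of the repo leans on most (high in-degree usually matches your core modules):

# Quick sanity check: which files have the most incoming import edges?
import json
from collections import Counter
from pathlib import Path

data = json.loads(Path("retrieval/code_graph.json").read_text())
in_degree = Counter(edge["target"] for edge in data["import_graph"]["edges"])
for path, count in in_degree.most_common(5):
    print(f"{count:3d}  {path}")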

Query the graph

# retrieval/query_graph.py
"""Query the code graph for relationship questions."""
import json
from pathlib import Path
import networkx as nx

GRAPH_PATH = Path("retrieval/code_graph.json")


def load_graphs() -> tuple[nx.DiGraph, nx.DiGraph]:
    """Load serialized import and call graphs from disk.

    Args:
        None.

    Returns:
        A tuple of ``(import_graph, call_graph)`` reconstructed from JSON.
    """
    data = json.loads(GRAPH_PATH.read_text())
    import_graph = nx.DiGraph()
    import_graph.add_nodes_from(data["import_graph"]["nodes"])
    for edge in data["import_graph"]["edges"]:
        import_graph.add_edge(edge["source"], edge["target"])

    call_graph = nx.DiGraph()
    for edge in data["call_graph"]["edges"]:
        call_graph.add_edge(
            edge["source"], edge["target"],
            callee=edge.get("callee", ""),
        )
    return import_graph, call_graph


def what_imports(file_path: str, import_graph: nx.DiGraph) -> list[str]:
    """List the files imported by one source file.

    Args:
        file_path: Repository-relative file path to inspect.
        import_graph: Directed import graph for the repository.

    Returns:
        A list of imported file paths.
    """
    if file_path not in import_graph:
        return []
    return list(import_graph.successors(file_path))


def what_depends_on(file_path: str, import_graph: nx.DiGraph) -> list[str]:
    """List the files that import a given source file.

    Args:
        file_path: Repository-relative file path to inspect.
        import_graph: Directed import graph for the repository.

    Returns:
        A list of reverse dependencies for the file.
    """
    if file_path not in import_graph:
        return []
    return list(import_graph.predecessors(file_path))


def change_impact(file_path: str, import_graph: nx.DiGraph, max_depth: int = 3) -> list[str]:
    """Estimate transitive reverse dependencies for a changed file.

    Args:
        file_path: Repository-relative file path to inspect.
        import_graph: Directed import graph for the repository.
        max_depth: Maximum predecessor depth to traverse.

    Returns:
        A sorted list of potentially affected files.
    """
    if file_path not in import_graph:
        return []
    affected = set()
    frontier = {file_path}
    for _ in range(max_depth):
        next_frontier = set()
        for node in frontier:
            for pred in import_graph.predecessors(node):
                if pred not in affected and pred != file_path:
                    affected.add(pred)
                    next_frontier.add(pred)
        frontier = next_frontier
        if not frontier:
            break
    return sorted(affected)


def callers_of(function_name: str, call_graph: nx.DiGraph) -> list[str]:
    """List files that contain calls to the named function.

    Args:
        function_name: Callee name to search for in the call graph.
        call_graph: Directed call graph for the repository.

    Returns:
        A sorted list of caller file paths.
    """
    callers = set()
    for u, v, data in call_graph.edges(data=True):
        if data.get("callee") == function_name:
            callers.add(u)
    return sorted(callers)


if __name__ == "__main__":
    import sys
    import_graph, call_graph = load_graphs()

    if len(sys.argv) > 2:
        cmd, arg = sys.argv[1], sys.argv[2]
        if cmd == "imports":
            print(f"Files imported by {arg}:")
            for f in what_imports(arg, import_graph):
                print(f"  {f}")
        elif cmd == "depends":
            print(f"Files that depend on {arg}:")
            for f in what_depends_on(arg, import_graph):
                print(f"  {f}")
        elif cmd == "impact":
            print(f"Change impact for {arg}:")
            for f in change_impact(arg, import_graph):
                print(f"  {f}")
        elif cmd == "callers":
            print(f"Files that call {arg}:")
            for f in callers_of(arg, call_graph):
                print(f"  {f}")
    else:
        print("Usage: python retrieval/query_graph.py [imports|depends|impact|callers] <file_or_symbol>")
# What depends on a core module?
python retrieval/query_graph.py depends "services/auth.py"

# What would break if you changed it?
python retrieval/query_graph.py impact "services/auth.py"

# Who calls a specific function?
python retrieval/query_graph.py callers "validate_path"

Build the hybrid retrieval pipeline

Now we'll combine all three retrieval methods: vector search (AST-aware), lexical search (BM25), and graph traversal, with reciprocal rank fusion.

# retrieval/hybrid_retrieve.py
"""Hybrid retrieval: vector + lexical + graph, merged with reciprocal rank fusion."""
import json
import re
import sys
from pathlib import Path
from openai import OpenAI
from qdrant_client import QdrantClient
from rank_bm25 import BM25Okapi
import networkx as nx

CHUNKS_PATH = Path("retrieval/chunks_ast.jsonl")
GRAPH_PATH = Path("retrieval/code_graph.json")
COLLECTION_NAME = "anchor-repo-ast"
EMBEDDING_MODEL = "text-embedding-3-small"
GENERATION_MODEL = "gpt-4o-mini"
TOP_K = 10  # retrieve more candidates for fusion, return fewer
FINAL_K = 5  # final number of results after fusion
RRF_K = 60  # RRF constant

client = OpenAI()
qdrant = QdrantClient(path="retrieval/qdrant_data")


def load_chunks() -> list[dict]:
    """Load AST-aware chunks from the JSONL chunk store.

    Args:
        None.

    Returns:
        A list of chunk dictionaries from the chunk store.
    """
    chunks = []
    with open(CHUNKS_PATH) as f:
        for line in f:
            if line.strip():
                chunks.append(json.loads(line))
    return chunks


def load_graphs() -> tuple[nx.DiGraph, nx.DiGraph]:
    """Load serialized import and call graphs from disk.

    Args:
        None.

    Returns:
        A tuple of ``(import_graph, call_graph)`` reconstructed from JSON.
    """
    data = json.loads(GRAPH_PATH.read_text())
    import_graph = nx.DiGraph()
    import_graph.add_nodes_from(data["import_graph"]["nodes"])
    for edge in data["import_graph"]["edges"]:
        import_graph.add_edge(edge["source"], edge["target"])
    call_graph = nx.DiGraph()
    for edge in data["call_graph"]["edges"]:
        call_graph.add_edge(edge["source"], edge["target"], callee=edge.get("callee", ""))
    return import_graph, call_graph


# Pre-load data
_chunks = load_chunks()
_tokenized = [re.findall(r'\w+', c["text"].lower()) for c in _chunks]
_bm25 = BM25Okapi(_tokenized)
_import_graph, _call_graph = load_graphs()


def vector_search(query: str, top_k: int = TOP_K) -> list[tuple[str, float]]:
    """Run vector retrieval against the chunk collection.

    Args:
        query: User query to embed and search.
        top_k: Number of top vector matches to return.

    Returns:
        A ranked list of ``(chunk_id, score)`` pairs from vector search.
    """
    response = client.embeddings.create(model=EMBEDDING_MODEL, input=[query])
    results = qdrant.query_points(
        collection_name=COLLECTION_NAME,
        query=response.data[0].embedding,
        limit=top_k,
    )
    return [(hit.payload["chunk_id"], hit.score) for hit in results.points]


def lexical_search(query: str, top_k: int = TOP_K) -> list[tuple[str, float]]:
    """Run BM25 retrieval over the pre-tokenized chunk corpus.

    Args:
        query: User query to score lexically.
        top_k: Number of top lexical matches to return.

    Returns:
        A ranked list of ``(chunk_id, score)`` pairs from BM25 search.
    """
    tokens = re.findall(r'\w+', query.lower())
    scores = _bm25.get_scores(tokens)
    ranked = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)[:top_k]
    return [(_chunks[idx]["chunk_id"], score) for idx, score in ranked if score > 0]


def graph_search(query: str, top_k: int = TOP_K) -> list[tuple[str, float]]:
    """Run graph-aware retrieval by matching identifiers to related files.

    Args:
        query: User query to inspect for files or symbol identifiers.
        top_k: Number of graph-derived matches to return.

    Returns:
        A ranked list of ``(chunk_id, score)`` pairs from graph search.
    """
    identifiers = re.findall(r'[A-Z][a-z]+(?:[A-Z][a-z]+)*|[a-z_]+(?:_[a-z]+)+|\w+\.py', query)
    if not identifiers:
        return []

    related_files = set()
    for ident in identifiers:
        if ident.endswith(".py"):
            if ident in _import_graph:
                related_files.update(_import_graph.predecessors(ident))
                related_files.update(_import_graph.successors(ident))
                related_files.add(ident)
        else:
            for u, v, data in _call_graph.edges(data=True):
                if data.get("callee") == ident:
                    related_files.add(u)
                    related_files.add(v)

    if not related_files:
        return []

    results = []
    for chunk in _chunks:
        if chunk["file_path"] in related_files:
            # Flat score: RRF only uses rank order, so the value itself doesn't matter.
            results.append((chunk["chunk_id"], 1.0))

    return results[:top_k]


def reciprocal_rank_fusion(
    *ranked_lists: list[tuple[str, float]],
    k: int = RRF_K,
) -> list[tuple[str, float]]:
    """Merge multiple ranked retrieval lists with Reciprocal Rank Fusion.

    Args:
        *ranked_lists: Ranked retrieval result lists to merge.
        k: Rank dampening constant used by the RRF formula.

    Returns:
        A fused ranked list of ``(chunk_id, fused_score)`` pairs.
    """
    scores = {}
    for ranked_list in ranked_lists:
        for rank, (chunk_id, _) in enumerate(ranked_list):
            if chunk_id not in scores:
                scores[chunk_id] = 0.0
            scores[chunk_id] += 1.0 / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)


def hybrid_retrieve(query: str, final_k: int = FINAL_K) -> list[dict]:
    """Run vector, lexical, and graph retrieval and fuse the results.

    Args:
        query: User query to retrieve evidence for.
        final_k: Number of fused results to keep.

    Returns:
        A list of fused retrieval results with chunk metadata.
    """
    vec_results = vector_search(query)
    lex_results = lexical_search(query)
    graph_results = graph_search(query)

    fused = reciprocal_rank_fusion(vec_results, lex_results, graph_results)

    chunk_map = {c["chunk_id"]: c for c in _chunks}
    results = []
    for chunk_id, score in fused[:final_k]:
        chunk = chunk_map.get(chunk_id, {})
        results.append({
            "chunk_id": chunk_id,
            "file_path": chunk.get("file_path", "unknown"),
            "symbol_name": chunk.get("symbol_name", "unknown"),
            "rrf_score": round(score, 6),
            "text": chunk.get("text", ""),
        })

    return results


def hybrid_retrieve_and_answer(question: str) -> dict:
    """Answer a question using hybrid retrieval plus grounded generation.

    Args:
        question: User question to answer.

    Returns:
        A dictionary containing the answer text and chunks used.
    """
    chunks = hybrid_retrieve(question)
    context = "\n\n---\n\n".join(
        f"File: {c['file_path']} | Symbol: {c['symbol_name']} (RRF score: {c['rrf_score']})\n{c['text']}"
        for c in chunks
    )

    response = client.chat.completions.create(
        model=GENERATION_MODEL,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a code assistant. Answer the question using ONLY the "
                    "retrieved code context below. If the context doesn't contain "
                    "enough information, say so.\n\n"
                    f"Retrieved context:\n{context}"
                ),
            },
            {"role": "user", "content": question},
        ],
        temperature=0,
    )

    return {
        "answer": response.choices[0].message.content,
        "chunks_used": chunks,
        "retrieval_method": "hybrid_vector_lexical_graph",
    }


if __name__ == "__main__":
    question = sys.argv[1] if len(sys.argv) > 1 else "What functions call validate_path?"
    print(f"Question: {question}\n")
    results = hybrid_retrieve(question)
    print(f"Top {len(results)} hybrid results:")
    for r in results:
        print(f"  [{r['rrf_score']}] {r['symbol_name']} in {r['file_path']}")
    print()
    result = hybrid_retrieve_and_answer(question)
    print(f"Answer:\n{result['answer']}")
python retrieval/hybrid_retrieve.py "What functions call validate_path?"
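
When you want to see which leg is doing the work for a given question (exercise 3 below asks for exactly this), a short debug sketch against the functions in hybrid_retrieve.py is enough. The file name is a suggestion, not part of the pipeline:

# retrieval/debug_legs.py (optional helper, not part of the main pipeline)
# Shows which retrieval legs surfaced each fused chunk for a query.
import sys
from retrieval.hybrid_retrieve import (
    vector_search, lexical_search, graph_search, reciprocal_rank_fusion
)

query = sys.argv[1] if len(sys.argv) > 1 else "What functions call validate_path?"
legs = {
    "vector": vector_search(query),
    "lexical": lexical_search(query),
    "graph": graph_search(query),
}
fused = reciprocal_rank_fusion(*legs.values())

for chunk_id, score in fused[:5]:
    sources = [name for name, results in legs.items()
               if any(cid == chunk_id for cid, _ in results)]
    print(f"{score:.4f}  {chunk_id}  <- {', '.join(sources)}")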

When graphs beat vector search (and when they don't)

I've found it helpful to be specific about this, because the temptation is to throw a graph at everything once you have one.

Graphs are clearly better for:

  • "What calls this function?": this is a direct edge query, not a similarity question
  • "What would break if I changed this file?": transitive dependency traversal
  • "Trace the request from the API endpoint to the database": path finding through call/import edges
  • "What are the entry points of this application?": graph centrality or root node detection

Graphs don't help (or hurt) for:

  • "How does authentication work?": this is a semantic question, not a structural one
  • "What does this error message mean?": no graph edges connect errors to explanations
  • "Find code similar to this pattern": similarity is a vector search problem

Hybrid is clearly better for:

  • "Where is validate_path and what calls it?": needs exact match (lexical) + relationship (graph)
  • Mixed workloads where you don't know the question type in advance

Run the hybrid retrieval benchmark

# retrieval/run_hybrid_benchmark.py
"""Run benchmark through hybrid retrieval."""
import json
import os
from datetime import datetime, timezone
from pathlib import Path
from retrieval.hybrid_retrieve import hybrid_retrieve_and_answer

RUN_ID = "hybrid-v1-" + datetime.now(timezone.utc).strftime("%Y-%m-%d-%H%M%S")
MODEL = "gpt-4o-mini"
PROVIDER = "openai"
BENCHMARK_FILE = Path("benchmark-questions.jsonl")
OUTPUT_FILE = Path(f"harness/runs/{RUN_ID}.jsonl")
REPO_SHA = os.popen("git rev-parse --short HEAD").read().strip()


def run_benchmark():
    """Run the hybrid retrieval benchmark over the question set.

    Args:
        None.

    Returns:
        None. The benchmark run is written to the JSONL output file.
    """
    questions = []
    with open(BENCHMARK_FILE) as f:
        for line in f:
            if line.strip():
                questions.append(json.loads(line))

    print(f"Running {len(questions)} questions through hybrid retrieval")
    print(f"Run ID: {RUN_ID}\n")

    results = []
    for i, q in enumerate(questions):
        print(f"[{i+1}/{len(questions)}] {q['category']}: {q['question'][:60]}...")
        result = hybrid_retrieve_and_answer(q["question"])

        entry = {
            "run_id": RUN_ID,
            "question_id": q["id"],
            "question": q["question"],
            "category": q["category"],
            "answer": result["answer"],
            "model": MODEL,
            "provider": PROVIDER,
            "evidence_files": list(set(c["file_path"] for c in result["chunks_used"])),
            "rrf_scores": [c["rrf_score"] for c in result["chunks_used"]],
            "retrieval_method": "hybrid_vector_lexical_graph",
            "grade": None,
            "failure_label": None,
            "grading_notes": "",
            "repo_sha": REPO_SHA,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "harness_version": "v0.2",
        }
        results.append(entry)

    os.makedirs("harness/runs", exist_ok=True)
    with open(OUTPUT_FILE, "w") as f:
        for entry in results:
            f.write(json.dumps(entry) + "\n")

    print(f"\nDone. {len(results)} results saved to {OUTPUT_FILE}")
    print("Grade and compare against Tier 1 and Tier 2.")


if __name__ == "__main__":
    run_benchmark()
python -m retrieval.run_hybrid_benchmark

After grading, you'll see improvement concentrated in two categories:

  1. Change-impact questions: the graph traversal provides evidence that vector search can't reach. "What files are affected if I change models/user.py?" now returns concrete import chains.

  2. Debugging questions: when a question references a specific function name, the lexical leg catches the exact match while the vector leg provides surrounding context. The fusion surfaces both.
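
A small comparison sketch can make that concrete once you've graded the run. It assumes the grade field has been filled in with "pass" or "fail" (that value scheme is an assumption; adjust it to whatever your grading workflow writes), and compares two run files by category:

# retrieval/compare_runs.py (hypothetical helper; name it whatever fits your harness)
# Prints pass rate per question category for two graded run files.
import json
import sys
from collections import defaultdict

def pass_rate_by_category(run_file: str) -> dict[str, float]:
    totals, passes = defaultdict(int), defaultdict(int)
    with open(run_file) as f:
        for line in f:
            entry = json.loads(line)
            totals[entry["category"]] += 1
            if entry["grade"] == "pass":  # assumed grading scheme
                passes[entry["category"]] += 1
    return {cat: passes[cat] / totals[cat] for cat in totals}

baseline = pass_rate_by_category(sys.argv[1])
hybrid = pass_rate_by_category(sys.argv[2])
for cat in sorted(set(baseline) | set(hybrid)):
    print(f"{cat:25s}  baseline {baseline.get(cat, 0):.0%}  ->  hybrid {hybrid.get(cat, 0):.0%}")

Point it at any two graded run files under harness/runs/ to see where the hybrid tier actually moved the needle.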

Exercises

  1. Build the import and call graphs (build_graph.py). Verify the edge counts match your repo's actual import and call structure.
  2. Test graph queries: pick three files and run impact, depends, and callers queries. Verify the results against manual inspection.
  3. Build the hybrid retrieval pipeline (hybrid_retrieve.py). Test it with a relationship question, a conceptual question, and an exact-identifier question. Observe which retrieval leg contributes to each.
  4. Run the full benchmark through hybrid retrieval (run_hybrid_benchmark.py). Grade at least 15 answers and compare against your naive and AST-aware grades.
  5. For three questions where hybrid retrieval still fails, analyze why. Is the graph missing edges? Is the lexical search failing on unusual identifiers? Is the fusion weighting wrong? Write down what a context compiler would need to fix.

Completion checkpoint

You should now have:

  • An import graph and call graph for your anchor repo, serialized to JSON and queryable
  • A hybrid retrieval pipeline that fuses vector, lexical, and graph results with RRF
  • Benchmark results graded and compared across all three tiers
  • A clear understanding of which question categories graph retrieval improves and which it doesn't affect
  • Notes on remaining failures that will inform your context compilation strategy

Reflection prompts

  • How much did the graph leg contribute to your benchmark improvement? Were there questions where it was the only retrieval method that surfaced the right evidence?
  • Did reciprocal rank fusion ever produce worse results than a single retrieval method? What happened in those cases?
  • Looking at your remaining failures across three tiers of retrieval, what's the common thread? Is it retrieval quality (wrong chunks) or context quality (right chunks, wrong assembly)?
  • If you could add one more retrieval leg to the hybrid pipeline, what would it be and what would it fix?

What's next

Context Compilation. Better retrieval is only half the job; the next lesson decides what actually earns space in the model's prompt and how much of it to send.

