Module 4: Code Retrieval
Choosing a Retrieval Strategy

When to Use Which Retrieval Method

By the end of Module 3, your agent can read files, search text, and answer questions about your codebase. That's useful, but you've probably noticed its limits: some questions need more than what grep can offer. At this point it may seem like the next step is to reach for a vector database, but that instinct leads people astray more often than it helps. So before we build any retrieval infrastructure, we'll spend this lesson developing the judgment to pick the right retrieval method for each problem class.

RAG is a pattern, not a database choice. The "R" in RAG (retrieval) can be a file path lookup, a grep command, a SQL query, a symbol table scan, a vector search, a graph traversal, or any combination. The retrieval method you choose should match the question you're answering, not the hype cycle you're in.

What you'll learn

  • Evaluate nine retrieval methods and identify which question types each one handles well
  • Recognize when simpler retrieval methods outperform vector search
  • Build a structured JSON metadata index for your anchor repo and query it
  • Compare structured retrieval against the grep-based tools from Module 3 on the same benchmark questions
  • Use the retrieval method chooser as an ongoing decision framework

Concepts

Retrieval method: the underlying mechanism you use to find relevant information. A vector database is one retrieval method. A grep command is another. A SQL query against a metadata table is a third. The method you choose determines what kinds of questions you can answer efficiently.

RAG (Retrieval-Augmented Generation): a pattern where you retrieve relevant information, insert it into the model's context, and let the model generate an answer grounded in that evidence. RAG doesn't require any specific database. It requires a retrieval step, a context assembly step, and a generation step. We'll build the full pipeline in Module 5; this module focuses on making the retrieval step excellent.
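Those three steps can be sketched in a few lines. This is a toy illustration, not the Module 5 pipeline: the corpus, the keyword-overlap retrieve step, and the generate() stub are all made-up stand-ins (a real system would swap in one of the retrieval methods below and an actual model call).

```python
import re

# Toy two-file "corpus"; a real pipeline retrieves from your indexed repo.
CORPUS = {
    "auth.py": "def verify_credentials(user, password): ...",
    "routes.py": "def register_routes(app): ...",
}

def retrieve(query: str) -> list[str]:
    """Step 1: retrieval. Here, naive keyword overlap; it could equally
    be grep, a SQL query, vector search, or a graph walk."""
    terms = set(re.findall(r"\w+", query.lower()))
    return [
        f"{path}: {text}"
        for path, text in CORPUS.items()
        if terms & set(re.findall(r"\w+", text.lower()))
    ]

def assemble_context(evidence: list[str], budget_chars: int = 2000) -> str:
    """Step 2: context assembly. Pack evidence under a size budget."""
    return "\n".join(evidence)[:budget_chars]

def generate(query: str, context: str) -> str:
    """Step 3: generation. A real system sends this prompt to an LLM."""
    return f"Question: {query}\nEvidence:\n{context}"

print(generate("where is verify_credentials defined?",
               assemble_context(retrieve("verify_credentials"))))
```

Swapping the body of retrieve() is all it takes to change retrieval method; the other two steps stay the same. That separation is the whole point of treating RAG as a pattern.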

Lexical search: finding documents by matching exact terms. Grep is the simplest form. BM25 is a more sophisticated version that accounts for term frequency and document length. Lexical search excels when the user knows the exact identifier, error message, or string they're looking for.

Semantic search: finding documents by meaning rather than exact terms. This is what vector databases do: they encode text as numerical vectors and find chunks whose vectors are close to the query's vector. Semantic search helps when the user describes what they want in different words than the code uses.
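The core operation underneath any vector database is nearest-neighbor lookup by similarity, usually cosine similarity. A sketch with tiny made-up 3-dimensional vectors (real embedding models produce hundreds of dimensions):

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product of the vectors over the product
    of their lengths. 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for two code chunks; the numbers are invented.
chunks = {
    "verify_credentials": [0.9, 0.1, 0.0],
    "render_template":    [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.3, 0.1]  # imagine this encodes "how does auth work?"

best = max(chunks, key=lambda name: cosine(query_vec, chunks[name]))
print(best)  # verify_credentials
```

Note that the query never mentions the string "verify_credentials"; proximity in embedding space is what bridges the vocabulary gap, which is exactly what lexical search cannot do.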

Hybrid search: combining lexical and semantic retrieval, then merging or reranking the results. This is often better than either approach alone, but it's also more complex to build and tune.
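One common merge strategy is reciprocal rank fusion (RRF), which you'll see again in the chooser table below. It needs only the rank positions from each method, not their (incomparable) scores. The two ranked lists here are invented examples:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked result lists: each doc scores sum(1 / (k + rank)).
    Docs ranked highly by multiple methods rise to the top; k damps
    the advantage of a single first-place finish."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

lexical = ["auth.py", "routes.py", "utils.py"]     # e.g., BM25 order
semantic = ["session.py", "auth.py", "tokens.py"]  # e.g., vector order
print(reciprocal_rank_fusion([lexical, semantic]))
```

auth.py wins because both methods ranked it, even though neither ranked it first in both lists; that "agreement bonus" is why hybrid search often beats either method alone.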

Reranker: a second-pass model that takes the initial retrieval results and re-scores them for relevance. A reranker sees both the query and each candidate together, which lets it make finer-grained relevance judgments than the initial retrieval pass.
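The reranker's interface is worth seeing even without a model: score each (query, candidate) pair together, then re-sort. In this sketch a crude token-overlap score stands in for the cross-encoder model a real reranker would use; the candidate strings are made up.

```python
def rerank(query: str, candidates: list[str], top_n: int = 3) -> list[str]:
    """Second-pass scoring: each candidate is scored together with the
    query. Token overlap is a stand-in for a cross-encoder's score."""
    q_tokens = set(query.lower().split())

    def score(candidate: str) -> float:
        c_tokens = set(candidate.lower().split())
        return len(q_tokens & c_tokens) / len(q_tokens)

    return sorted(candidates, key=score, reverse=True)[:top_n]

candidates = [  # imagine these came back from an initial retrieval pass
    "def load_config(path): parse the config file",
    "def verify_credentials(user): check the password hash",
    "class Router: dispatch requests to handlers",
]
print(rerank("how is the password checked", candidates, top_n=2))
```

The shape is the important part: a reranker cannot run without candidates, which is why the tables below list it as a layer on top of any retrieval method rather than a method of its own.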

Problem-to-Tool Map

| Problem class | Symptom | Cheapest thing to try first | Tool or approach |
| --- | --- | --- | --- |
| Know which file to look at | The answer is in a predictable location (README, config, specific module) | Hardcoded path rules | File tree + path metadata |
| Need an exact identifier | Looking for a function name, class name, error string, or config key | grep/regex in Module 3 tools | Grep / regex search |
| Keyword-heavy code search | Need documents containing specific terms but with ranking | grep with manual sorting | BM25 / lexical search |
| Structured metadata queries | "Which files were modified most recently?" or "List all Python files importing X" | Manual file inspection | JSON / SQL metadata index |
| Symbol lookup | "Where is UserService defined?" or "What methods does Router have?" | grep for the symbol name | AST / symbol index |
| Conceptual questions | "How does authentication work?" (user's words differ from code's terms) | Keyword search | Vector search |
| Mixed exact + conceptual | Some questions need exact matches, others need semantic similarity | Run both and eyeball | Hybrid lexical + vector |
| Relationship questions | "What calls this function?" or "What breaks if I change this file?" | grep for import/usage | Graph traversal |
| Too many results from any method | Top-k returns partly relevant, partly noise | Increase k and hope | Reranker on top of any of the above |

The retrieval method chooser

This table is your decision framework. Consult it before building anything; it will save you from over-engineering retrieval for problems that have simpler solutions.

| Method | Good for | Weak for | Cheapest implementation | Upgrade signal |
| --- | --- | --- | --- | --- |
| File tree + path metadata | Known locations, configuration files, READMEs, directory conventions | Anything requiring content understanding | os.listdir + path pattern matching | You need to search inside files, not just find them |
| Grep / regex | Exact identifiers, error strings, import statements, config keys | Semantic similarity, fuzzy matches, typo tolerance | Your existing search_text tool from Module 3 | Queries use different words than the code (e.g., "auth" vs. verify_credentials) |
| BM25 / lexical search | Ranked keyword search, documentation, comments, docstrings | Conceptual questions where vocabulary differs | rank_bm25 Python library over your chunked corpus | Relevant results rank below irrelevant ones because of vocabulary mismatch |
| JSON / SQL metadata index | File metadata, symbol lists, dependency tracking, structured queries | Free-text conceptual search | A JSON file mapping filenames to metadata (language, imports, exports, size) | You need to search content semantics, not just attributes |
| AST / symbol index | Symbol lookup, function signatures, class hierarchies, definition locations | Cross-file relationship reasoning, natural language questions | Tree-sitter parse + symbol table (we'll build this in the AST-aware lesson) | You need to answer "what calls this?" or "what depends on this?" |
| Vector search | Conceptual similarity, natural language questions, documentation search | Exact identifier lookup, structured queries, relationship traversal | Embedding model + Qdrant (we'll build this in the naive baseline lesson) | You need exact matches and semantic matches together |
| Hybrid lexical + vector | Mixed question types, production systems that serve varied queries | Simple use cases where one method is clearly sufficient | BM25 + vector search with reciprocal rank fusion | Your retrieval needs are narrow enough that one method works fine |
| Graph traversal | Import chains, call graphs, dependency impact, "what breaks if I change X?" | Similarity-based questions, conceptual search | NetworkX with import/call edges (we'll build this in the graph/hybrid lesson) | Your questions don't involve relationships between code entities |
| Reranker | Improving precision in the top results from any retrieval method | Being a standalone retrieval method (it needs candidates to rerank) | Cross-encoder model on top of initial results | Your initial retrieval already returns mostly relevant results |

For the full decision matrix with additional columns and edge cases, see the Retrieval Method Chooser reference page.
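As a rough first pass, the chooser's logic can even be expressed as a routing function. The keyword heuristics below are illustrative assumptions, not a production router; the identifier heuristic mirrors the CamelCase/snake_case check used later in compare_substrates.py.

```python
def looks_like_identifier(word: str) -> bool:
    """Heuristic: snake_case or mixed-case words are probably code symbols."""
    w = word.strip("?.,\"'")
    return "_" in w or (any(c.isupper() for c in w[1:]) and any(c.islower() for c in w))

def choose_method(question: str) -> str:
    """Route a question to a retrieval method, mirroring the chooser table.
    The phrase lists are illustrative; tune them to your own benchmark."""
    q = question.lower()
    has_identifier = any(looks_like_identifier(w) for w in question.split())
    if any(p in q for p in ("what calls", "what imports", "depends on", "what breaks")):
        return "graph traversal"
    if has_identifier and any(p in q for p in ("where is", "defined", "methods does")):
        return "AST / symbol index"
    if any(p in q for p in ("which files", "what files", "list all", "most recently")):
        return "metadata index"
    if has_identifier:
        return "grep / lexical"
    return "vector search"  # conceptual phrasing, no exact symbols to match

print(choose_method("Where is UserService defined?"))
print(choose_method("How does error handling work?"))
```

A routing function like this is also a useful artifact for the exercises below: every question it misroutes tells you which heuristic your repo actually needs.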

Walkthrough

Start cheap, upgrade on evidence

The most effective retrieval systems are built incrementally. Don't start with a vector database and graph store. Start with the simplest method that answers your questions, observe where it fails, and upgrade only the methods that need upgrading.

Your Module 3 agent already has grep-based retrieval. For many question types, like exact symbol lookup, error string search, and import tracing, grep is often good enough. The goal of this lesson is to build one more retrieval method (structured metadata) and see how far it gets before we need vectors.

Build a structured metadata index

We'll create a JSON index that stores metadata about every file in your anchor repo. This gives you a queryable data structure for questions like "which files define classes?" or "what are the entry points?". Grep can answer these, but not well and not efficiently.

cd anchor-repo
mkdir -p retrieval
# retrieval/build_metadata_index.py
"""Build a JSON metadata index for the anchor repository."""
import ast
import json
import os
from pathlib import Path

REPO_ROOT = Path(".").resolve()
EXCLUDED_DIRS = {".venv", ".git", "__pycache__", "node_modules", ".tox", ".mypy_cache"}
INDEX_PATH = Path("retrieval/metadata_index.json")


def is_excluded(path: Path) -> bool:
    return any(part in EXCLUDED_DIRS for part in path.parts)


def extract_python_metadata(file_path: Path) -> dict:
    """Extract metadata from a Python file using the ast module."""
    source = file_path.read_text(errors="replace")
    metadata = {
        "functions": [],
        "classes": [],
        "imports": [],
        "line_count": len(source.splitlines()),
    }
    try:
        tree = ast.parse(source)
    except SyntaxError:
        metadata["parse_error"] = True
        return metadata

    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            metadata["functions"].append({
                "name": node.name,
                "line": node.lineno,
                "args": [a.arg for a in node.args.args],
            })
        elif isinstance(node, ast.ClassDef):
            methods = [
                n.name for n in node.body
                if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))
            ]
            metadata["classes"].append({
                "name": node.name,
                "line": node.lineno,
                "methods": methods,
            })
        elif isinstance(node, ast.Import):
            for alias in node.names:
                metadata["imports"].append(alias.name)
        elif isinstance(node, ast.ImportFrom):
            if node.module:
                metadata["imports"].append(node.module)
    return metadata


def build_index() -> dict:
    """Walk the repo and build metadata for every file."""
    index = {}
    for root, dirs, files in os.walk(REPO_ROOT):
        # Skip excluded directories
        dirs[:] = [d for d in dirs if d not in EXCLUDED_DIRS]
        for fname in files:
            fpath = Path(root) / fname
            rel = str(fpath.relative_to(REPO_ROOT))
            entry = {
                "path": rel,
                "extension": fpath.suffix,
                "size_bytes": fpath.stat().st_size,
            }
            if fpath.suffix == ".py":
                entry.update(extract_python_metadata(fpath))
            index[rel] = entry
    return index


if __name__ == "__main__":
    index = build_index()
    INDEX_PATH.write_text(json.dumps(index, indent=2))
    py_files = [k for k, v in index.items() if v["extension"] == ".py"]
    total_functions = sum(len(v.get("functions", [])) for v in index.values())
    total_classes = sum(len(v.get("classes", [])) for v in index.values())
    print(f"Indexed {len(index)} files ({len(py_files)} Python)")
    print(f"Found {total_functions} functions, {total_classes} classes")
    print(f"Index saved to {INDEX_PATH}")
python retrieval/build_metadata_index.py

Expected output (these numbers will vary based on your anchor repo):

Indexed 47 files (23 Python)
Found 68 functions, 12 classes
Index saved to retrieval/metadata_index.json

Query the metadata index

Now build a query tool that can answer structured questions using this index:

# retrieval/query_metadata.py
"""Query the metadata index for structured code questions."""
import json
from pathlib import Path

INDEX_PATH = Path("retrieval/metadata_index.json")


def load_index() -> dict:
    return json.loads(INDEX_PATH.read_text())


def find_symbol(name: str, index: dict = None) -> list[dict]:
    """Find where a function or class is defined."""
    if index is None:
        index = load_index()
    results = []
    for path, meta in index.items():
        for fn in meta.get("functions", []):
            if name.lower() in fn["name"].lower():
                results.append({
                    "type": "function",
                    "name": fn["name"],
                    "file": path,
                    "line": fn["line"],
                    "args": fn["args"],
                })
        for cls in meta.get("classes", []):
            if name.lower() in cls["name"].lower():
                results.append({
                    "type": "class",
                    "name": cls["name"],
                    "file": path,
                    "line": cls["line"],
                    "methods": cls["methods"],
                })
    return results


def find_importers(module_name: str, index: dict = None) -> list[str]:
    """Find files that import a given module."""
    if index is None:
        index = load_index()
    return [
        path for path, meta in index.items()
        if module_name in meta.get("imports", [])
    ]


def list_entry_points(index: dict = None) -> list[dict]:
    """Find likely entry points: files with if __name__ == '__main__' or main()."""
    if index is None:
        index = load_index()
    results = []
    for path, meta in index.items():
        if meta.get("extension") != ".py":
            continue
        fn_names = [f["name"] for f in meta.get("functions", [])]
        if "main" in fn_names or path.endswith("__main__.py"):
            results.append({"file": path, "functions": fn_names})
    return results


if __name__ == "__main__":
    import sys
    index = load_index()
    if len(sys.argv) > 1:
        query = sys.argv[1]
        print(f"Searching for symbol: {query}")
        results = find_symbol(query, index)
        for r in results:
            print(f"  {r['type']} {r['name']} in {r['file']}:{r['line']}")
        if not results:
            print("  No matches found")
        print(f"\nFiles importing '{query}':")
        importers = find_importers(query, index)
        for f in importers:
            print(f"  {f}")
        if not importers:
            print("  None")
    else:
        print("Entry points:")
        for ep in list_entry_points(index):
            print(f"  {ep['file']}: {ep['functions']}")
# Search for a symbol in your repo
python retrieval/query_metadata.py "UserService"

# Or list entry points
python retrieval/query_metadata.py

Tier 0.5: Compare structured retrieval against your Module 3 tools

This is where the decision framework starts to pay off. We'll run a subset of your benchmark questions through three retrieval methods: grep from Module 3, the metadata index from this lesson, and, in the next lesson, vector search. Then we'll compare which questions each one handles well.

# retrieval/compare_substrates.py
"""Compare grep vs. metadata index on benchmark questions."""
import json
import subprocess
from pathlib import Path
from retrieval.query_metadata import load_index, find_symbol, find_importers

BENCHMARK_FILE = Path("benchmark-questions.jsonl")
REPO_ROOT = Path(".").resolve()
EXCLUDED_DIRS = {".venv", ".git", "__pycache__", "node_modules"}


def grep_search(query: str) -> list[str]:
    """Run grep and return matching file paths."""
    exclude_args = []
    for d in EXCLUDED_DIRS:
        exclude_args.extend(["--exclude-dir", d])
    cmd = ["grep", "-rl", "--include=*.py"] + exclude_args + [query, "."]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=10, cwd=REPO_ROOT)
        return [line.strip() for line in result.stdout.strip().split("\n") if line.strip()]
    except subprocess.TimeoutExpired:
        return []


def metadata_search(query: str, index: dict) -> list[str]:
    """Search the metadata index for symbols matching the query."""
    symbols = find_symbol(query, index)
    importers = find_importers(query, index)
    files = list(set([s["file"] for s in symbols] + importers))
    return files


def run_comparison():
    questions = []
    with open(BENCHMARK_FILE) as f:
        for line in f:
            if line.strip():
                questions.append(json.loads(line))

    index = load_index()

    print(f"{'Category':<20} {'Question (truncated)':<45} {'Grep':<8} {'Metadata':<8}")
    print("-" * 85)

    for q in questions[:15]:  # Compare first 15 questions
        # Extract a likely search term from the question
        # In practice, you'd use the model to extract terms; here we use a simple heuristic
        words = q["question"].split()
        # Look for CamelCase or snake_case terms as likely identifiers
        search_terms = [
            w.strip("?.,\"'") for w in words
            if ("_" in w or (any(c.isupper() for c in w[1:]) and any(c.islower() for c in w)))
        ]
        search_term = search_terms[0] if search_terms else words[-2] if len(words) > 1 else words[0]

        grep_results = grep_search(search_term)
        meta_results = metadata_search(search_term, index)

        print(f"{q['category']:<20} {q['question'][:43]:<45} {len(grep_results):<8} {len(meta_results):<8}")


if __name__ == "__main__":
    run_comparison()
python -m retrieval.compare_substrates

You should see a table showing how many files each method found for each question. Watch for these patterns:

  • Symbol lookup questions: the metadata index will often find the exact file and line, while grep returns more noise
  • "What imports X?" questions: the metadata index answers directly, grep gives partial matches
  • Conceptual questions ("How does authentication work?"): neither grep nor the metadata index handles these well. That's the signal that you'll need semantic search

These patterns are precisely what the retrieval method chooser predicts. The goal here isn't to build one retrieval method that handles everything, but rather to know which approach to reach for based on the question type.

When you don't need a vector database

I'll be blunt here, because a lot of content out there encourages teams to waste weeks building vector retrieval infrastructure for problems that grep solves in milliseconds. You don't need a vector database by default. Skip vector search when any of these apply:

  1. Your questions use the same vocabulary as your code. If the user asks "where is validate_path defined?" that's grep territory. Embeddings add latency and lose precision for exact matches.

  2. Your corpus is small enough to scan. If your repo has fewer than a few thousand files, grep over the full codebase runs in under a second. Vector search adds complexity without meaningful speed benefit at this scale.

  3. Your questions are structural, not semantic. "What files import datetime?" is a metadata query. "What methods does Router have?" is a symbol table lookup. Neither needs embeddings.

  4. You need exact recall. Vector search is approximate by nature. If you need to guarantee that a specific identifier appears in the results, lexical search is more reliable.

You do need vector search (or something beyond lexical) when:

  • The user's vocabulary differs from the code's vocabulary ("auth flow" vs. verify_credentials)
  • The question is conceptual ("how does error handling work in this codebase?")
  • You need to find code that's semantically similar to a description
  • Your corpus is large enough that scanning is too slow

We'll build that vector search baseline in the next lesson. But we'll build it knowing exactly which questions it needs to answer: the ones our simpler methods can't handle.

Exercises

  1. Build the metadata index (build_metadata_index.py) for your anchor repo. Inspect the JSON output and verify it captured your files, functions, and classes accurately.
  2. Run query_metadata.py against five symbol names from your repo. Compare the results against what you get from the search_text tool in agent/tools.py. Note which approach gives you more precise results for each query.
  3. Run compare_substrates.py against your benchmark questions. Categorize each question as "grep handles this," "metadata handles this," "neither handles this well."
  4. For the questions in the "neither handles this" category, write down what kind of retrieval you think would help. Don't look ahead. Form your own hypothesis first.
  5. Add a find_dependents function to query_metadata.py that answers "which files would be affected if I changed file X?" by tracing import relationships. Test it on a core module in your repo.

Completion checkpoint

You should now have:

  • A working metadata index covering all files in your anchor repo
  • Query functions that answer symbol lookup and import-tracing questions
  • A comparison showing which benchmark questions each retrieval method handles well
  • A categorized list of questions that need semantic retrieval (this will become your test set for the next lesson)
  • A clear understanding of when simpler retrieval outperforms vector search

Reflection prompts

  • Which of your benchmark questions were answered well by grep alone? What do those questions have in common?
  • Which questions did the metadata index handle better than grep? What structural information made the difference?
  • For the questions that neither method handled, what's missing: vocabulary mapping, conceptual understanding, or relationship awareness?
  • Looking at the retrieval method chooser table, which methods do you think your final system will need to combine? Why?

What's next

Naive Vector Baseline. Start with the obvious semantic-search baseline so you can see exactly what it helps with and what it still breaks.

References

Start here

  • Retrieval Method Chooser — the full decision matrix for all nine retrieval methods, with additional edge cases and upgrade signals

Build with this

  • rank_bm25 on PyPI — a lightweight BM25 implementation for when you need ranked lexical search beyond grep
  • Python ast module — the standard library module we used for extracting Python metadata



Glossary
Foundational terms

API (Application Programming Interface)
A structured way for programs to communicate. In this context, usually an HTTP endpoint you call to interact with an LLM.

AST (Abstract Syntax Tree)
A tree representation of source code structure. Used by parsers like Tree-sitter to understand code as a hierarchy of functions, classes, and statements. You'll encounter this more deeply in the Code Retrieval module, but the concept appears briefly in retrieval fundamentals.

BM25 (Best Match 25)
A classical ranking function for keyword search. Scores documents by term frequency and inverse document frequency. Often competitive with or complementary to vector search.

Chunking
Splitting a document into smaller pieces for indexing and retrieval. Chunk boundaries significantly affect retrieval quality. Split at the wrong place and your retrieval will return half a function or the end of one paragraph glued to the start of another.

Context engineering
The discipline of selecting, packaging, and budgeting the information a model sees at inference time. Prompts, retrieved evidence, tool results, memory, and state are all parts of context. Context engineering is arguably the core skill of AI engineering. Bigger context windows are not a substitute for better context selection.

Context rot
Degradation of output quality caused by stale, noisy, or accumulated context. Symptoms include stale memory facts, conflicting retrieved evidence, bloated prompt history, and accumulated instructions that contradict each other. A form of technical debt in AI systems.

Context window
The maximum number of tokens an LLM can process in a single request (input + output combined).

Embedding
A fixed-length numeric vector representing a piece of text. Used for similarity search: texts with similar meanings have nearby embeddings.

Endpoint
A specific URL path that accepts requests and returns responses (e.g., POST /v1/chat/completions).

GGUF
A file format for quantized models used by llama.cpp and Ollama. When you see a model name like qwen2.5:7b-q4_K_M, the suffix indicates the quantization scheme. GGUF supports mixed quantization (different precision for different layers) and is the most common format for local inference.

Hallucination
When a model generates content that sounds confident but isn't supported by the evidence it was given, or fabricates details that don't exist. Not the same as "any wrong answer"; a model that misinterprets ambiguous instructions gave a bad answer but didn't hallucinate. Common causes: weak prompt, missing context, context rot, model limitation, or retrieval failure.

Inference
Running a trained model to generate output from input. What happens when you call an API. Most AI engineering work is inference-time work: building systems around models, not training them. Use "inference," not "inferencing."

JSON (JavaScript Object Notation)
A lightweight text format for structured data. The lingua franca of API communication.

Lexical search
Finding items by matching keywords or terms. Includes BM25, TF-IDF (Term Frequency–Inverse Document Frequency), and simple keyword matching. Returns exact term matches, not semantic similarity.

LLM (Large Language Model)
A neural network trained on large text corpora that generates text by predicting the next token. The core technology behind AI engineering; every tool, pattern, and pipeline in this curriculum runs on top of one.

Metadata
Structured information about a document or chunk (file path, language, author, date, symbol type). Used for filtering retrieval results.

Neural network
A computing system loosely inspired by biological neurons, built from layers of mathematical functions that transform inputs into outputs. LLMs are a specific type of neural network (transformers) trained on text. You don't need to understand neural network internals to do AI engineering, but knowing the term helps when reading external resources.

Reasoning model
A model optimized for complex multi-step planning, math, and logic (e.g., o3, o4-mini). Slower and more expensive but better on hard problems. Sometimes called "LRM" (large reasoning model), but "reasoning model" is the more consistent term across provider docs.

Reranking
A second-pass scoring step that re-orders retrieved results using a more expensive model. Improves precision after an initial broad retrieval.

Schema
A formal description of the shape and types of a data structure. Used to validate inputs and outputs.

SLM (small language model)
A compact model (typically 1-7B parameters) that runs on consumer hardware with lower cost, latency, and better privacy (e.g., Phi, small Llama variants, Gemma). The right choice when privacy, offline operation, predictable cost, or low latency matter more than peak capability.

System prompt
A special message that sets the model's behavior, role, and constraints for a conversation.

Temperature
A parameter controlling output randomness. Lower values produce more deterministic output; higher values produce more varied output. Does not affect the model's intelligence.

Token
The basic unit an LLM processes. Not a word. Tokens are sub-word fragments. "unhappiness" might be three tokens: "un", "happi", "ness". Token count determines cost and context window usage.

Top-k
The number of results returned from a retrieval query. "Top-5" means the five highest-scoring results.

Top-p (nucleus sampling)
An alternative to temperature for controlling output diversity. Selects from the smallest set of tokens whose cumulative probability exceeds p.

Vector search
Finding items by proximity in embedding space (nearest neighbors). Returns "similar" results, not "exact match" results.

vLLM (virtual LLM)
An inference serving engine (not a model) that hosts open-weight models behind an OpenAI-compatible HTTP endpoint. Infrastructure layer, not model layer. Relevant when moving from hosted APIs to self-hosting.

Weights
The learned parameters inside a model. Changed during training, fixed during inference.

Workhorse model
A general-purpose LLM optimized for speed and broad capability (e.g., GPT-4o-mini, Claude Haiku, Gemini Flash). The default for most tasks. When someone says "LLM" without qualification, they usually mean this.
Benchmark and Harness terms

Baseline
The first measured performance of your system on a benchmark. Everything else is compared against this. Without a baseline, you can't tell whether a change helped.

Benchmark
A fixed set of questions or tasks with known-good answers, used to measure system performance over time.

Run log
A structured record (typically JSONL) of every system run: what input was given, what output was produced, what tools were called, how long it took, and what it cost. The raw data that evals, telemetry, and cost analysis are built from.
Agent and Tool Building terms

A2A (Agent-to-Agent protocol)
An open protocol for peer-to-peer agent collaboration. Agents discover each other's capabilities and delegate or negotiate tasks as equals. Different from MCP (which connects agents to tools, not to other agents) and from handoffs (which transfer control within one system).

Agent
A system where an LLM decides which tools to call, observes results, and iterates until a task is complete. Agent = model + tools + control loop.

Control loop
The code that manages the agent's cycle: send prompt, check for tool calls, execute tools, append results, repeat or finish.

Handoff
Passing control from one agent or specialist to another within an orchestrated system.

MCP (Model Context Protocol)
An open protocol for exposing tools, resources, and prompts to AI applications in a standardized way. Connects agents to capabilities (tools and data), not to other agents.

Tool calling / function calling
The model's ability to request execution of a specific function with structured arguments, rather than just generating text.
Code Retrieval terms

Context compilation / context packing
The process of selecting and assembling the smallest useful set of evidence for a specific task. Not "dump everything retrieved into the prompt."

Grounding
Tying model assertions to specific evidence. A grounded answer cites what it found; an ungrounded answer asserts without evidence.

Hybrid retrieval
Combining multiple retrieval methods (e.g., vector search + keyword search + metadata filters) and merging or reranking the results.

Knowledge graph
A data structure that stores entities and their relationships explicitly (e.g., "function A calls function B," "module X imports module Y"). Useful for traversal and dependency reasoning. One retrieval strategy among several, often overused when simpler metadata or adjacency tables would suffice.

RAG (Retrieval-Augmented Generation)
A pattern where the model's response is grounded in retrieved external evidence rather than relying solely on its training data.

Symbol table
A mapping of code identifiers (functions, classes, variables) to their locations and metadata.

Tree-sitter
An incremental parsing library that builds ASTs for source code. Used in this curriculum for code-aware chunking and symbol extraction.
### RAG and Grounded Answers terms

**Context pack**
A structured bundle of evidence assembled for a specific task, with metadata about provenance, relevance, and token budget.
**Evidence bundle**
A collection of retrieved items grouped for a specific sub-task, with enough metadata to evaluate whether the evidence is relevant and sufficient.
**Retrieval routing**
Deciding which retrieval strategy or method to use for a given query. Different questions need different retrieval methods.
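A router can start as a handful of heuristics over the query text. The patterns below are illustrative assumptions, not a production router:

```python
import re

def route(query: str) -> str:
    """Pick a retrieval method from surface features of the query.
    These patterns are illustrative heuristics only."""
    if re.search(r"\b(calls|imports|depends on)\b", query):
        return "graph_traversal"   # relationship question
    if re.search(r'"[^"]+"', query):
        return "grep"              # quoted literal: exact text search
    if re.search(r"[A-Za-z_]+\(\)|\bdef |\bclass ", query):
        return "symbol_table"      # mentions a specific code identifier
    return "vector_search"         # fuzzy conceptual question

print(route('what calls parse_config()?'))             # graph_traversal
print(route('where is "connection refused" logged?'))  # grep
print(route('how does authentication work?'))          # vector_search
```

Even a crude router like this captures the lesson's core claim: different question shapes map to different retrieval methods.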
### Observability and Evals terms

**Eval**
A structured test that measures system quality. Not the same as training: evals measure; they don't change the model.
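The skeleton of an eval is small. A toy exact-match sketch, where the "system" is just a dict lookup standing in for your agent:

```python
def run_eval(system, cases):
    """Minimal exact-match eval: score a system against known answers."""
    passed = sum(1 for question, expected in cases if system(question) == expected)
    return passed / len(cases)

# A stand-in "system" and a tiny benchmark, purely for illustration.
fake_agent = {"capital of France?": "Paris", "2+2?": "4"}.get
cases = [("capital of France?", "Paris"), ("2+2?", "5")]
print(run_eval(fake_agent, cases))  # 0.5
```

Everything else in an eval harness (datasets, grading, traces) is elaboration on this loop.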
**Harness (AI harness / eval harness)**
The experiment and evaluation framework around your model or agent. It runs benchmark tasks, captures outputs, logs traces, grades results, and compares system versions. It turns ad hoc "try it and see" into repeatable, comparable experiments. Typically includes: input dataset, prompt and tool configuration, model/provider selection, execution loop, logging, grading, and artifact capture.
**LLM-as-judge**
Using a language model to evaluate or grade the output of another model or system. Useful for scaling evaluation beyond manual review, but requires rubric quality, judge consistency checks, and human spot-checking. Not a replacement for exact-match checks where they apply.
**OpenTelemetry (OTel)**
An open standard for collecting and exporting telemetry data (traces, metrics, logs). Vendor-agnostic.
**RAGAS**
A specific eval framework for retrieval-augmented generation. Measures metrics like faithfulness, relevance, and context precision. One tool example, not a foundational concept. Learn the metrics first, then the tool.
**Span**
A single operation within a trace (e.g., one tool call, one retrieval query). Traces are made of spans.
**Telemetry**
Structured data about system behavior: what happened, when, how long it took, what it cost. Includes traces, metrics, and events.
**Trace**
A structured record of one complete run through the system, including all steps, tool calls, and decisions.
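The trace/span relationship can be sketched with nothing but the standard library; a real system would export these through OpenTelemetry rather than a module-level list:

```python
import time
from contextlib import contextmanager

spans = []  # a flat list standing in for a trace exporter

@contextmanager
def span(name):
    """Record one operation (a span) with its duration, in the OTel spirit."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({"name": name, "duration_s": time.perf_counter() - start})

with span("retrieval"):          # outer span: the whole operation
    with span("grep_query"):     # inner span: one step within it
        time.sleep(0.01)

print([s["name"] for s in spans])  # inner spans finish (and record) first
```

Nesting is what turns a pile of spans into a trace: each span knows its place inside a larger operation.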
### Orchestration and Memory terms

**Long-term memory**
Persistent facts that survive across conversations. Requires write policies to manage what gets stored, updated, or deleted.
**Orchestration**
Explicit control over how tasks are routed, delegated, and synthesized across multiple agents or specialists.
**Router**
A component that decides which specialist or workflow path to use for a given query.
**Specialist**
An agent or workflow tuned for a narrow task (e.g., "code search," "documentation lookup," "test generation"). Specialists are composed by an orchestrator.
**Thread memory**
Conversation state that persists within a single session or thread.
**Workflow memory**
Intermediate state that persists within a multi-step task but doesn't survive beyond the workflow's completion.
### Optimization terms

**Catastrophic forgetting**
When fine-tuning causes a model to lose capabilities it had before training. The model gets better at the fine-tuned task but worse at tasks it previously handled. PEFT methods like LoRA reduce this risk by freezing original weights.
**Distillation**
Training a smaller (student) model to reproduce the behavior of a larger (teacher) model on a specific task.
**DPO (Direct Preference Optimization)**
A method for preference-based model optimization that's simpler than RLHF, training the model directly on preference pairs without a separate reward model.
**Fine-tuning**
Updating a model's weights on task-specific data to change its behavior permanently. An umbrella term that includes SFT, instruction tuning, RLHF, DPO, and other techniques. See the fine-tuning landscape table in Lesson 8.3 for how these relate.
**Full fine-tuning**
Updating all of a model's parameters during training, as opposed to PEFT methods that update only a small subset. Requires significantly more GPU memory and compute. Produces the most thorough adaptation but carries higher risk of catastrophic forgetting.
**Inference server**
Software (like vLLM or Ollama) that hosts a model and serves inference requests.
**Instruction tuning**
A specific application of SFT where the training data consists of instruction-response pairs. This is how base models become chat models: the technique is SFT, the data format is instructions. Not a separate technique from SFT.
**LoRA (Low-Rank Adaptation)**
A parameter-efficient fine-tuning method that trains small adapter matrices instead of updating all model weights. Dramatically reduces GPU memory and compute requirements.
**Overfitting**
When a model memorizes training examples instead of learning generalizable patterns. The model performs well on training data but poorly on new inputs. Detected by monitoring validation loss alongside training loss.
**Parameter count**
The number of learned weights in a model, commonly expressed in billions (e.g., "7B" = 7 billion parameters). Determines memory requirements (roughly 2 bytes per parameter at FP16) and broadly correlates with capability, though training quality and architecture matter as much as size. See Model Selection and Serving for sizing guidance.
**PEFT (Parameter-Efficient Fine-Tuning)**
A family of methods (including LoRA) that fine-tune a small subset of parameters instead of the full model.
**Preference optimization**
Training methods (RLHF, DPO) that use human or automated preference signals to improve model behavior. "This output is better than that output" rather than "this is the correct output."
**QLoRA (Quantized LoRA)**
LoRA applied to a quantized (compressed) base model. Further reduces memory requirements, enabling fine-tuning on consumer hardware.
**Quantization**
Reducing the precision of model weights (e.g., FP16 → INT4) to shrink memory usage and increase inference speed at some quality cost. A 7B model at FP16 needs ~14 GB VRAM; quantized to 4-bit, it fits in ~4 GB. Common formats include GGUF (llama.cpp/Ollama), GPTQ, and AWQ (vLLM/Hugging Face). See Model Selection and Serving for format details and tradeoffs.
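The VRAM arithmetic in the quantization entry generalizes to a one-line formula; a minimal sketch that covers weights only (activations and KV cache add overhead on top):

```python
def vram_estimate_gb(params_billions: float, bits_per_weight: int) -> float:
    """Rough weight-memory estimate; ignores activations and KV cache."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

print(vram_estimate_gb(7, 16))  # 14.0 GB at FP16
print(vram_estimate_gb(7, 4))   # 3.5 GB at 4-bit (before overhead)
```

This is why the glossary's "~4 GB" figure for a 4-bit 7B model is slightly above the raw 3.5 GB of weights: the remainder is runtime overhead.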
**RLHF (Reinforcement Learning from Human Feedback)**
A training method that uses human preference signals to improve model behavior through a reward model. More complex than DPO (requires training a separate reward model) but offers more control over the optimization objective.
**SFT (Supervised Fine-Tuning)**
Fine-tuning using input-output pairs where the desired output is known. The most common fine-tuning approach.
**TRL (Transformer Reinforcement Learning)**
A Hugging Face library for training language models with reinforcement learning, SFT, and other optimization methods.
### Cross-cutting terms

**Consumer chat app**
The browser or desktop product meant for human conversation (ChatGPT, Claude, HuggingChat). Useful for experimentation, but not the same as API access.
**Developer platform**
The provider's API, billing, API-key, and developer-docs surface. This is what you need for this learning path.
**Hosted API**
The provider runs the model for you and you call it over HTTP.
**Local inference**
You run the model on your own machine.
**Provider**
The company or service that hosts a model API you call from code.
**Prompt caching**
Reusing computation from repeated prompt prefixes to reduce latency and cost on subsequent requests with the same prefix.
**Rate limiting**
Constraints on how many API requests you can make per unit of time. An operational concern that affects system design and cost.
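On the client side, the usual answer to a provider's cap is a token bucket. A minimal sketch, with an invented 2-requests-per-second limit:

```python
import time

class TokenBucket:
    """Client-side token bucket: stay under a provider's requests-per-second cap."""
    def __init__(self, rate_per_s: float, capacity: int):
        self.rate, self.capacity = rate_per_s, capacity
        self.tokens, self.last = float(capacity), time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, up to the bucket's capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate_per_s=2, capacity=2)
print([bucket.try_acquire() for _ in range(3)])  # third call is throttled
```

A denied acquire typically becomes a short sleep-and-retry rather than a dropped request.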
**Token budget**
The maximum number of tokens you allocate for a specific part of the context (e.g., "retrieval evidence gets at most 4K tokens"). A context engineering tool for preventing any single component from dominating the context window.
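Enforcing a budget is usually a greedy packing step. A minimal sketch; the 4-characters-per-token heuristic is a rough stand-in for a real tokenizer, and the evidence strings are invented:

```python
def pack_to_budget(snippets, budget_tokens, count_tokens=lambda s: len(s) // 4):
    """Greedily keep highest-priority snippets until the token budget is spent.
    `count_tokens` defaults to a crude 4-chars-per-token heuristic."""
    kept, used = [], 0
    for snippet in snippets:  # assumed pre-sorted, most relevant first
        cost = count_tokens(snippet)
        if used + cost <= budget_tokens:
            kept.append(snippet)
            used += cost
    return kept

evidence = ["def auth(): ..." * 10, "# README intro" * 50, "x = 1"]
print(len(pack_to_budget(evidence, budget_tokens=50)))  # 2: the middle snippet is too big
```

Swapping in a real tokenizer (e.g., your provider's counting endpoint) changes the cost function, not the packing logic.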