When to Use Which Retrieval Method
By the end of Module 3, your agent can read files, search text, and answer questions about your codebase. That's useful, but you've probably noticed its limits: some questions need more than grep can offer. At this point the obvious next step seems to be a vector database, but that instinct leads teams astray more often than it helps. So before we build any retrieval infrastructure, we'll spend this lesson developing the judgment to pick the right retrieval method for each class of problem.
RAG is a pattern, not a database choice. The "R" in RAG (retrieval) can be a file path lookup, a grep command, a SQL query, a symbol table scan, a vector search, a graph traversal, or any combination. The retrieval method you choose should match the question you're answering, not the hype cycle you're in.
What you'll learn
- Evaluate nine retrieval methods and identify which question types each one handles well
- Recognize when simpler retrieval methods outperform vector search
- Build a structured JSON metadata index for your anchor repo and query it
- Compare structured retrieval against the grep-based tools from Module 3 on the same benchmark questions
- Use the retrieval method chooser as an ongoing decision framework
Concepts
Retrieval method: the underlying mechanism you use to find relevant information. A vector database is one retrieval method. A grep command is another. A SQL query against a metadata table is a third. The method you choose determines what kinds of questions you can answer efficiently.
RAG (Retrieval-Augmented Generation): a pattern where you retrieve relevant information, insert it into the model's context, and let the model generate an answer grounded in that evidence. RAG doesn't require any specific database. It requires a retrieval step, a context assembly step, and a generation step. We'll build the full pipeline in Module 5; this module focuses on making the retrieval step excellent.
Lexical search: finding documents by matching exact terms. Grep is the simplest form. BM25 is a more sophisticated version that accounts for term frequency and document length. Lexical search excels when the user knows the exact identifier, error message, or string they're looking for.
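To make the BM25 idea concrete, here's a toy scorer over whitespace tokens. It is a sketch of the standard BM25 formula, not a library you'd ship; in practice you'd reach for something like rank_bm25, and the documents below are invented for illustration:

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with the BM25 formula:
    term frequency dampened by k1, length-normalized by b, and
    weighted by inverse document frequency."""
    corpus = [d.lower().split() for d in docs]
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    terms = query.lower().split()
    # Document frequency: how many documents contain each query term
    df = {t: sum(1 for d in corpus if t in d) for t in terms}
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for t in terms:
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores

docs = [
    "grep finds exact identifiers fast",
    "credentials are stored in the config file",
    "vector search maps meaning to geometry",
]
print(bm25_scores("credentials config", docs))  # the middle doc scores highest
```

Notice that the first document scores zero even though it's about search: BM25 only sees exact tokens. That vocabulary gap is precisely the upgrade signal toward semantic search.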
Semantic search: finding documents by meaning rather than exact terms. This is what vector databases do: they encode text as numerical vectors and find chunks whose vectors are close to the query's vector. Semantic search helps when the user describes what they want in different words than the code uses.
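Under the hood this is just nearest-vector lookup. Here's the geometry with hand-made three-dimensional vectors; a real embedding model produces hundreds of dimensions, and every number below is invented for illustration:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product over the product of vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Pretend embeddings for two code chunks (hypothetical names):
chunks = {
    "verify_credentials()": [0.9, 0.1, 0.2],  # auth-heavy direction
    "render_template()":    [0.1, 0.8, 0.3],  # ui-heavy direction
}
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of "how does auth work?"

best = max(chunks, key=lambda name: cosine(query_vec, chunks[name]))
print(best)  # verify_credentials() is nearest to the query vector
```

The query never mentions the word "credentials", yet it lands on the right chunk, because similarity is measured in vector space rather than by shared tokens.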
Hybrid search: combining lexical and semantic retrieval, then merging or reranking the results. This is often better than either approach alone, but it's also more complex to build and tune.
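One common merging strategy is reciprocal rank fusion (RRF), which needs only the rank positions from each method, never their incomparable raw scores. A minimal sketch, with made-up file names:

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists with reciprocal rank fusion: each list
    contributes 1 / (k + rank) for every document it returned."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

lexical  = ["auth.py", "utils.py", "routes.py"]   # e.g. BM25 results
semantic = ["login.py", "auth.py", "session.py"]  # e.g. vector results
print(rrf_merge([lexical, semantic]))  # auth.py ranks first: both methods returned it
```

Documents that appear in both lists accumulate score from each, which is why hybrid retrieval tends to surface the results the two methods agree on.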
Reranker: a second-pass model that takes the initial retrieval results and re-scores them for relevance. A reranker sees both the query and each candidate together, which lets it make finer-grained relevance judgments than the initial retrieval pass.
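To see why the second pass helps, here's a deliberately simple sketch: the first pass ranks by individual term overlap, and the "reranker" stands in for a cross-encoder by scoring the query and each candidate together, rewarding the query appearing as a phrase. Both scoring functions are toys, not real models:

```python
def first_pass(query: str, docs: list[str]) -> list[str]:
    """Rank by how many query terms each document shares (bag of words)."""
    terms = set(query.lower().split())
    return sorted(docs, key=lambda d: len(terms & set(d.lower().split())), reverse=True)

def rerank(query: str, candidates: list[str]) -> list[str]:
    """Second pass: score query and candidate *together*. A real reranker is
    a cross-encoder model; this toy just rewards the query as a phrase."""
    def pair_score(doc: str) -> int:
        overlap = len(set(query.lower().split()) & set(doc.lower().split()))
        phrase_bonus = 5 if query.lower() in doc.lower() else 0
        return overlap + phrase_bonus
    return sorted(candidates, key=pair_score, reverse=True)

docs = [
    "handling of config error paths and error codes",   # term hits, no phrase
    "the error handling middleware wraps every route",  # the actual phrase
    "unrelated logging notes",
]
candidates = first_pass("error handling", docs)
print(rerank("error handling", candidates)[0])  # the middleware doc wins the rerank
```

The first pass can't tell the two top candidates apart (both share two terms with the query); the joint-scoring pass can, which is exactly the precision boost rerankers provide.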
Problem-to-Tool Map
| Problem class | Symptom | Cheapest thing to try first | Tool or approach |
|---|---|---|---|
| Know which file to look at | The answer is in a predictable location (README, config, specific module) | Hardcoded path rules | File tree + path metadata |
| Need an exact identifier | Looking for a function name, class name, error string, or config key | grep/regex in Module 3 tools | Grep / regex search |
| Keyword-heavy code search | Need documents containing specific terms but with ranking | grep with manual sorting | BM25 / lexical search |
| Structured metadata queries | "Which files were modified most recently?" or "List all Python files importing X" | Manual file inspection | JSON / SQL metadata index |
| Symbol lookup | "Where is UserService defined?" or "What methods does Router have?" | grep for the symbol name | AST / symbol index |
| Conceptual questions | "How does authentication work?" (user's words differ from code's terms) | Keyword search | Vector search |
| Mixed exact + conceptual | Some questions need exact matches, others need semantic similarity | Run both and eyeball | Hybrid lexical + vector |
| Relationship questions | "What calls this function?" or "What breaks if I change this file?" | grep for import/usage | Graph traversal |
| Too many results from any method | Top-k returns partly relevant, partly noise | Increase k and hope | Reranker on top of any of the above |
The retrieval method chooser
This table is your decision framework. Consult it before building anything: it saves you from over-engineering retrieval for problems that have simpler solutions.
| Method | Good for | Weak for | Cheapest implementation | Upgrade signal |
|---|---|---|---|---|
| File tree + path metadata | Known locations, configuration files, READMEs, directory conventions | Anything requiring content understanding | os.listdir + path pattern matching | You need to search inside files, not just find them |
| Grep / regex | Exact identifiers, error strings, import statements, config keys | Semantic similarity, fuzzy matches, typo tolerance | Your existing search_text tool from Module 3 | Queries use different words than the code (e.g., "auth" vs. verify_credentials) |
| BM25 / lexical search | Ranked keyword search, documentation, comments, docstrings | Conceptual questions where vocabulary differs | rank_bm25 Python library over your chunked corpus | Relevant results rank below irrelevant ones because of vocabulary mismatch |
| JSON / SQL metadata index | File metadata, symbol lists, dependency tracking, structured queries | Free-text conceptual search | A JSON file mapping filenames to metadata (language, imports, exports, size) | You need to search content semantics, not just attributes |
| AST / symbol index | Symbol lookup, function signatures, class hierarchies, definition locations | Cross-file relationship reasoning, natural language questions | Tree-sitter parse + symbol table (we'll build this in the AST-aware lesson) | You need to answer "what calls this?" or "what depends on this?" |
| Vector search | Conceptual similarity, natural language questions, documentation search | Exact identifier lookup, structured queries, relationship traversal | Embedding model + Qdrant (we'll build this in the naive baseline lesson) | You need exact matches and semantic matches together |
| Hybrid lexical + vector | Mixed question types, production systems that serve varied queries | Simple use cases where one method is clearly sufficient | BM25 + vector search with reciprocal rank fusion | Your retrieval needs are narrow enough that one method works fine |
| Graph traversal | Import chains, call graphs, dependency impact, "what breaks if I change X?" | Similarity-based questions, conceptual search | NetworkX with import/call edges (we'll build this in the graph/hybrid lesson) | Your questions don't involve relationships between code entities |
| Reranker | Improving precision in the top results from any retrieval method | Being a standalone retrieval method (it needs candidates to rerank) | Cross-encoder model on top of initial results | Your initial retrieval already returns mostly relevant results |
For the full decision matrix with additional columns and edge cases, see the Retrieval Method Chooser reference page.
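If you want the chooser in executable form, here's an illustrative sketch. The substring heuristics are crude stand-ins for however your agent actually classifies questions (in practice you'd likely let the model do this), and the return strings are just labels from the table:

```python
def choose_retrieval_method(question: str) -> str:
    """A rough encoding of the retrieval method chooser. The checks are
    illustrative heuristics, not a production question classifier."""
    q = question.lower()
    if "what calls" in q or "what breaks" in q or "depends on" in q:
        return "graph traversal"          # relationship questions
    if q.startswith(("which files", "list all")) or "modified" in q:
        return "metadata index"           # structured attribute queries
    if "where is" in q and ("defined" in q or "_" in question):
        return "grep / symbol index"      # exact identifier lookup
    if q.startswith("how does") or "work" in q:
        return "vector search"            # conceptual, vocabulary-mismatched
    return "grep first, then escalate"    # default: start cheap

print(choose_retrieval_method("Where is validate_path defined?"))  # grep / symbol index
print(choose_retrieval_method("How does authentication work?"))    # vector search
```

The point isn't the heuristics; it's that "which method?" is a decision you can make per question, cheaply, before any retrieval runs.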
Walkthrough
Start cheap, upgrade on evidence
The most effective retrieval systems are built incrementally. Don't start with a vector database and graph store. Start with the simplest method that answers your questions, observe where it fails, and upgrade only the methods that need upgrading.
Your Module 3 agent already has grep-based retrieval. For many question types (exact symbol lookup, error string search, import tracing), grep is often good enough. The goal of this lesson is to build one more retrieval method (structured metadata) and see how far it gets before we need vectors.
Build a structured metadata index
We'll create a JSON index that stores metadata about every file in your anchor repo. This gives you a queryable data structure for questions like "which files define classes?" or "what are the entry points?". Grep can approximate answers to these, but neither precisely nor efficiently.
```shell
cd anchor-repo
mkdir -p retrieval
```

```python
# retrieval/build_metadata_index.py
"""Build a JSON metadata index for the anchor repository."""
import ast
import json
import os
from pathlib import Path

REPO_ROOT = Path(".").resolve()
EXCLUDED_DIRS = {".venv", ".git", "__pycache__", "node_modules", ".tox", ".mypy_cache"}
INDEX_PATH = Path("retrieval/metadata_index.json")

def is_excluded(path: Path) -> bool:
    return any(part in EXCLUDED_DIRS for part in path.parts)

def extract_python_metadata(file_path: Path) -> dict:
    """Extract metadata from a Python file using the ast module."""
    source = file_path.read_text(errors="replace")
    metadata = {
        "functions": [],
        "classes": [],
        "imports": [],
        "line_count": len(source.splitlines()),
    }
    try:
        tree = ast.parse(source)
    except SyntaxError:
        metadata["parse_error"] = True
        return metadata
    for node in ast.walk(tree):
        # Note: ast.walk visits nested nodes, so methods also appear in
        # this top-level "functions" list as well as in their class entry.
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            metadata["functions"].append({
                "name": node.name,
                "line": node.lineno,
                "args": [a.arg for a in node.args.args],
            })
        elif isinstance(node, ast.ClassDef):
            methods = [
                n.name for n in node.body
                if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))
            ]
            metadata["classes"].append({
                "name": node.name,
                "line": node.lineno,
                "methods": methods,
            })
        elif isinstance(node, ast.Import):
            for alias in node.names:
                metadata["imports"].append(alias.name)
        elif isinstance(node, ast.ImportFrom):
            if node.module:
                metadata["imports"].append(node.module)
    return metadata

def build_index() -> dict:
    """Walk the repo and build metadata for every file."""
    index = {}
    for root, dirs, files in os.walk(REPO_ROOT):
        # Prune excluded directories in place so os.walk skips them
        dirs[:] = [d for d in dirs if d not in EXCLUDED_DIRS]
        for fname in files:
            fpath = Path(root) / fname
            rel = str(fpath.relative_to(REPO_ROOT))
            entry = {
                "path": rel,
                "extension": fpath.suffix,
                "size_bytes": fpath.stat().st_size,
            }
            if fpath.suffix == ".py":
                entry.update(extract_python_metadata(fpath))
            index[rel] = entry
    return index

if __name__ == "__main__":
    index = build_index()
    INDEX_PATH.write_text(json.dumps(index, indent=2))
    py_files = [k for k, v in index.items() if v["extension"] == ".py"]
    total_functions = sum(len(v.get("functions", [])) for v in index.values())
    total_classes = sum(len(v.get("classes", [])) for v in index.values())
    print(f"Indexed {len(index)} files ({len(py_files)} Python)")
    print(f"Found {total_functions} functions, {total_classes} classes")
    print(f"Index saved to {INDEX_PATH}")
```

```shell
python retrieval/build_metadata_index.py
```

Expected output (these numbers will vary based on your anchor repo):

```
Indexed 47 files (23 Python)
Found 68 functions, 12 classes
Index saved to retrieval/metadata_index.json
```
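Even before we build a dedicated query tool, the index is already queryable with plain Python. As a quick sanity check, here's a sketch of a helper that lists the largest Python files; the inline sample index is made up so the function is demonstrable on its own:

```python
import json
from pathlib import Path

def largest_python_files(index: dict, n: int = 5) -> list[str]:
    """Return the paths of the n largest Python files in a metadata index."""
    py = [v for v in index.values() if v["extension"] == ".py"]
    return [v["path"] for v in sorted(py, key=lambda v: v["size_bytes"], reverse=True)[:n]]

# Against the real index you'd load it first:
#   index = json.loads(Path("retrieval/metadata_index.json").read_text())
# Tiny hand-made sample so the helper runs standalone:
sample = {
    "a.py": {"path": "a.py", "extension": ".py", "size_bytes": 120},
    "b.py": {"path": "b.py", "extension": ".py", "size_bytes": 900},
    "README.md": {"path": "README.md", "extension": ".md", "size_bytes": 50},
}
print(largest_python_files(sample, n=2))  # ['b.py', 'a.py']
```

This is the appeal of a structured index: questions about attributes (size, extension, imports) become a few lines of filtering and sorting.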
Query the metadata index
Now build a query tool that can answer structured questions using this index:
```python
# retrieval/query_metadata.py
"""Query the metadata index for structured code questions."""
import json
from pathlib import Path

INDEX_PATH = Path("retrieval/metadata_index.json")

def load_index() -> dict:
    return json.loads(INDEX_PATH.read_text())

def find_symbol(name: str, index: dict | None = None) -> list[dict]:
    """Find where a function or class is defined."""
    if index is None:
        index = load_index()
    results = []
    for path, meta in index.items():
        for fn in meta.get("functions", []):
            if name.lower() in fn["name"].lower():
                results.append({
                    "type": "function",
                    "name": fn["name"],
                    "file": path,
                    "line": fn["line"],
                    "args": fn["args"],
                })
        for cls in meta.get("classes", []):
            if name.lower() in cls["name"].lower():
                results.append({
                    "type": "class",
                    "name": cls["name"],
                    "file": path,
                    "line": cls["line"],
                    "methods": cls["methods"],
                })
    return results

def find_importers(module_name: str, index: dict | None = None) -> list[str]:
    """Find files that import a given module."""
    if index is None:
        index = load_index()
    return [
        path for path, meta in index.items()
        if module_name in meta.get("imports", [])
    ]

def list_entry_points(index: dict | None = None) -> list[dict]:
    """Find likely entry points: files defining main() or named __main__.py."""
    if index is None:
        index = load_index()
    results = []
    for path, meta in index.items():
        if meta.get("extension") != ".py":
            continue
        fn_names = [f["name"] for f in meta.get("functions", [])]
        if "main" in fn_names or path.endswith("__main__.py"):
            results.append({"file": path, "functions": fn_names})
    return results

if __name__ == "__main__":
    import sys
    index = load_index()
    if len(sys.argv) > 1:
        query = sys.argv[1]
        print(f"Searching for symbol: {query}")
        results = find_symbol(query, index)
        for r in results:
            print(f"  {r['type']} {r['name']} in {r['file']}:{r['line']}")
        if not results:
            print("  No matches found")
        print(f"\nFiles importing '{query}':")
        importers = find_importers(query, index)
        for f in importers:
            print(f"  {f}")
        if not importers:
            print("  None")
    else:
        print("Entry points:")
        for ep in list_entry_points(index):
            print(f"  {ep['file']}: {ep['functions']}")
```

```shell
# Search for a symbol in your repo
python retrieval/query_metadata.py "UserService"

# Or list entry points
python retrieval/query_metadata.py
```

Tier 0.5: Compare structured retrieval against your Module 3 tools
This is where the decision framework earns its keep. We'll run a subset of your benchmark questions through the retrieval methods we have so far, grep from Module 3 and the metadata index from this lesson, with vector search joining the comparison next lesson, and see which questions each one handles well.
```python
# retrieval/compare_substrates.py
"""Compare grep vs. metadata index on benchmark questions."""
import json
import subprocess
from pathlib import Path

from retrieval.query_metadata import load_index, find_symbol, find_importers

BENCHMARK_FILE = Path("benchmark-questions.jsonl")
REPO_ROOT = Path(".").resolve()
EXCLUDED_DIRS = {".venv", ".git", "__pycache__", "node_modules"}

def grep_search(query: str) -> list[str]:
    """Run grep and return matching file paths."""
    exclude_args = []
    for d in EXCLUDED_DIRS:
        exclude_args.extend(["--exclude-dir", d])
    cmd = ["grep", "-rl", "--include=*.py"] + exclude_args + [query, "."]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=10, cwd=REPO_ROOT)
        return [line.strip() for line in result.stdout.strip().split("\n") if line.strip()]
    except subprocess.TimeoutExpired:
        return []

def metadata_search(query: str, index: dict) -> list[str]:
    """Search the metadata index for symbols matching the query."""
    symbols = find_symbol(query, index)
    importers = find_importers(query, index)
    return list(set([s["file"] for s in symbols] + importers))

def run_comparison():
    questions = []
    with open(BENCHMARK_FILE) as f:
        for line in f:
            if line.strip():
                questions.append(json.loads(line))
    index = load_index()
    print(f"{'Category':<20} {'Question (truncated)':<45} {'Grep':<8} {'Metadata':<8}")
    print("-" * 85)
    for q in questions[:15]:  # Compare the first 15 questions
        # Extract a likely search term from the question. In practice you'd
        # use the model to extract terms; here a simple heuristic looks for
        # CamelCase or snake_case words as likely identifiers.
        words = q["question"].split()
        search_terms = [
            w.strip("?.,\"'") for w in words
            if ("_" in w or (any(c.isupper() for c in w[1:]) and any(c.islower() for c in w)))
        ]
        search_term = search_terms[0] if search_terms else words[-2] if len(words) > 1 else words[0]
        grep_results = grep_search(search_term)
        meta_results = metadata_search(search_term, index)
        print(f"{q['category']:<20} {q['question'][:43]:<45} {len(grep_results):<8} {len(meta_results):<8}")

if __name__ == "__main__":
    run_comparison()
```

```shell
python -m retrieval.compare_substrates
```

You should see a table showing how many files each method found for each question. Watch for these patterns:
- Symbol lookup questions: the metadata index will often find the exact file and line, while grep returns more noise
- "What imports X?" questions: the metadata index answers directly, grep gives partial matches
- Conceptual questions ("How does authentication work?"): neither grep nor the metadata index handles these well. That's the signal that you'll need semantic search
These patterns are precisely what the retrieval method chooser predicts. The goal here isn't to build one retrieval method that handles everything, but rather to know which approach to reach for based on the question type.
When you don't need a vector database
I'll be blunt here, because a lot of content encourages teams to waste weeks building vector retrieval infrastructure for problems that grep solves in milliseconds. You don't need a vector database by default. Skip vector search in these scenarios:
- Your questions use the same vocabulary as your code. If the user asks "where is `validate_path` defined?" that's grep territory. Embeddings add latency and lose precision for exact matches.
- Your corpus is small enough to scan. If your repo has fewer than a few thousand files, grep over the full codebase runs in under a second. Vector search adds complexity without meaningful speed benefit at this scale.
- Your questions are structural, not semantic. "What files import `datetime`?" is a metadata query. "What methods does `Router` have?" is a symbol table lookup. Neither needs embeddings.
- You need exact recall. Vector search is approximate by nature. If you need to guarantee that a specific identifier appears in the results, lexical search is more reliable.
You do need vector search (or something beyond lexical) when:
- The user's vocabulary differs from the code's vocabulary ("auth flow" vs. `verify_credentials`)
- The question is conceptual ("how does error handling work in this codebase?")
- You need to find code that's semantically similar to a description
- Your corpus is large enough that scanning is too slow
We'll build that vector search baseline in the next lesson. But we'll build it knowing exactly which questions it needs to answer: the ones our simpler methods can't handle.
Exercises
- Build the metadata index (`build_metadata_index.py`) for your anchor repo. Inspect the JSON output and verify it captured your files, functions, and classes accurately.
- Run `query_metadata.py` against five symbol names from your repo. Compare the results against what you get from the `search_text` tool in `agent/tools.py`. Note which approach gives you more precise results for each query.
- Run `compare_substrates.py` against your benchmark questions. Categorize each question as "grep handles this," "metadata handles this," or "neither handles this well."
- For the questions in the "neither handles this" category, write down what kind of retrieval you think would help. Don't look ahead. Form your own hypothesis first.
- Add a `find_dependents` function to `query_metadata.py` that answers "which files would be affected if I changed file X?" by tracing import relationships. Test it on a core module in your repo.
Completion checkpoint
You should now have:
- A working metadata index covering all files in your anchor repo
- Query functions that answer symbol lookup and import-tracing questions
- A comparison showing which benchmark questions each retrieval method handles well
- A categorized list of questions that need semantic retrieval (this will become your test set for the next lesson)
- A clear understanding of when simpler retrieval outperforms vector search
Reflection prompts
- Which of your benchmark questions were answered well by grep alone? What do those questions have in common?
- Which questions did the metadata index handle better than grep? What structural information made the difference?
- For the questions that neither method handled, what's missing: vocabulary mapping, conceptual understanding, or relationship awareness?
- Looking at the retrieval method chooser table, which methods do you think your final system will need to combine? Why?
What's next
Naive Vector Baseline. Start with the obvious semantic-search baseline so you can see exactly what it helps with and where it still falls short.
References
Start here
- Retrieval Method Chooser — the full decision matrix for all nine retrieval methods, with additional edge cases and upgrade signals
Build with this
- rank_bm25 on PyPI — a lightweight BM25 implementation for when you need ranked lexical search beyond grep
- Python ast module — the standard library module we used for extracting Python metadata
Deep dive
- Google Research: Rethinking Search — why lexical and structured retrieval remain essential even in a world of embeddings
- Pinecone: Hybrid Search — a practical overview of combining lexical and vector approaches