Retrieval Method Chooser

This is the decision matrix for choosing retrieval methods. Consult it before building retrieval infrastructure so you don't over-engineer a problem that has a simpler solution.

The core message: RAG is a pattern, not a database choice. The retrieval step can use lexical, structural, relational, vector, or hybrid methods. The method you choose should match the question you're answering.

How to use this matrix

Look at your benchmark questions and categorize them by the "Good for" column
Start with the cheapest method that covers each category
Watch for the "Upgrade signal," which tells you when to add the next method
Combine methods through hybrid retrieval when your questions span multiple categories
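
The steps above can be sketched as a small lookup. This is a minimal illustration, not part of the curriculum: the category labels, the `METHODS_BY_COST` ordering, and the `cheapest_methods` helper are all hypothetical, and the ordering only loosely follows a subset of the matrix rows from top to bottom.

```python
# Hypothetical cost ordering, cheapest first. Adjust it to your own stack.
METHODS_BY_COST = [
    ("file tree + path metadata", {"known_location"}),
    ("grep / regex", {"exact_identifier"}),
    ("BM25 / lexical search", {"ranked_keyword"}),
    ("AST / symbol index", {"symbol_lookup"}),
    ("vector search", {"conceptual"}),
    ("graph traversal", {"relationship"}),
]

def cheapest_methods(question_categories):
    """Map each category to the first (cheapest) method that covers it."""
    chosen = {}
    for category in question_categories:
        for method, covers in METHODS_BY_COST:
            if category in covers:
                chosen[category] = method
                break
    return chosen

# Benchmark questions tagged by hand with an assumed category label.
benchmark = {
    "Where is validate_path defined?": "exact_identifier",
    "How does error handling work?": "conceptual",
    "What breaks if I change X?": "relationship",
}
plan = cheapest_methods(set(benchmark.values()))
```

Here `plan` names three different methods, which is the cue from the last step to consider combining them through hybrid retrieval.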

The curriculum builds these methods incrementally in Module 4: structured metadata in the decision framework lesson, vector search in the naive baseline, AST/symbol indexing in the AST-aware lesson, then graph and hybrid retrieval, and finally context compilation.

Decision matrix

Method	Good for	Weak for	Cheapest implementation	Upgrade signal
File tree + path metadata	Known locations, configuration files, READMEs, directory conventions	Anything requiring content understanding	`os.listdir` + path pattern matching	You need to search inside files, not just find them
Grep / regex	Exact identifiers, error strings, import statements, config keys	Semantic similarity, fuzzy matches, typo tolerance	`grep -rn` or the `search_text` tool from Module 3	Queries use different words than the code (e.g., "auth" vs `verify_credentials`)
BM25 / lexical search	Ranked keyword search, documentation, comments, docstrings	Conceptual questions where vocabulary differs	`rank_bm25` Python library over your chunked corpus	Relevant results rank below irrelevant ones because of vocabulary mismatch
JSON / SQL metadata index	File metadata, symbol lists, dependency tracking, structured queries	Free-text conceptual search	A JSON file mapping filenames to metadata (language, imports, exports, size)	You need to search content semantics, not just attributes
AST / symbol index	Symbol lookup, function signatures, class hierarchies, definition locations	Cross-file relationship reasoning, natural language questions	Tree-sitter parse + symbol table (Module 4, Tier 2)	You need to answer "what calls this?" or "what depends on this?"
Vector search	Conceptual similarity, natural language questions, documentation search	Exact identifier lookup, structured queries, relationship traversal	Embedding model + Qdrant (Module 4, Tier 1)	You need exact matches and semantic matches together
Hybrid lexical + vector	Mixed question types, production systems that serve varied queries	Simple use cases where one method is clearly sufficient	BM25 + vector search with reciprocal rank fusion (Module 4, Tier 3)	Skip signal: your retrieval needs are narrow enough that one method works fine
Graph traversal	Import chains, call graphs, dependency impact, "what breaks if I change X?"	Similarity-based questions, conceptual search	NetworkX with import/call edges (Module 4, Tier 3)	Skip signal: your questions don't involve relationships between code entities
Reranker	Improving precision in the top results from any retrieval method	Being a standalone retrieval method (it needs candidates to rerank)	Cross-encoder model on top of initial results	Skip signal: your initial retrieval already returns mostly relevant results
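
The hybrid row names reciprocal rank fusion as its cheapest implementation. A minimal sketch of that fusion step follows, assuming each retriever returns document IDs ordered best-first. The file names are made up, and `k=60` is the commonly used default constant rather than a value taken from Module 4.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs; each list is ordered best-first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns more credit the higher it ranks in each list.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical results from a lexical and a vector retriever.
bm25_hits = ["auth.py", "login.py", "utils.py"]
vector_hits = ["session.py", "auth.py", "login.py"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
# fused == ["auth.py", "login.py", "session.py", "utils.py"]
```

Documents that appear in both lists rise to the top, and the fusion only needs ranks, so the BM25 and vector scores never have to be put on a common scale.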

When you don't need a vector database

You don't need vector search when:

Your questions use the same vocabulary as your code. "Where is `validate_path` defined?" is grep territory.
Your corpus is small enough to scan. Under a few thousand files, grep typically finishes in about a second.
Your questions are structural, not semantic. "What files import `datetime`?" is a metadata query.
You need exact recall. Vector search is approximate by nature: it ranks by similarity and can miss an exact match that grep would find.
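
The two example questions above can be answered with the standard library alone. The sketch below is an illustration under stated assumptions: it covers Python files only, it uses Python's built-in `ast` module as a stand-in for the Tree-sitter parse used in Module 4, and the helper names and the two-file corpus are hypothetical.

```python
import ast
import re
import tempfile
from pathlib import Path

def find_definitions(root, name):
    """Grep-style scan: return (file name, line number) for each `def name(`."""
    pattern = re.compile(rf"^\s*def\s+{re.escape(name)}\s*\(")
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if pattern.match(line):
                hits.append((path.name, lineno))
    return hits

def files_importing(root, module):
    """Structural query: which files import `module` or one of its submodules?"""
    matches = []
    for path in sorted(Path(root).rglob("*.py")):
        tree = ast.parse(path.read_text())
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom):
                names = [node.module or ""]
            else:
                continue
            if any(n == module or n.startswith(module + ".") for n in names):
                matches.append(path.name)
                break  # one hit per file is enough
    return matches

# Tiny throwaway corpus so the sketch runs on its own.
with tempfile.TemporaryDirectory() as root:
    Path(root, "paths.py").write_text(
        "import os\n\ndef validate_path(p):\n    return os.path.exists(p)\n"
    )
    Path(root, "clock.py").write_text(
        "from datetime import datetime\n\ndef now():\n    return datetime.now()\n"
    )
    definitions = find_definitions(root, "validate_path")
    importers = files_importing(root, "datetime")
```

Neither query involves an embedding model or an index: the first is a linear scan, the second is a parse plus an attribute check, and both return exact results.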

You do need vector search (or something beyond lexical) when:

The user's vocabulary differs from the code's vocabulary
The question is conceptual ("how does error handling work?")
You need to find semantically similar code
Your corpus is large enough that scanning is too slow