Retrieval Modes and Routing
Up to this point, every question in our pipeline has followed the same retrieval path: hybrid search over code chunks. That was the right starting point. A single retrieval strategy is easier to debug and benchmark. But your anchor project indexes both code and documentation, and those need different retrieval approaches. A question about "how does the caching layer work?" benefits from README and design doc retrieval. A question about "what does validate_path return?" needs precise symbol lookup. A question about "what changed in the auth module last week?" might not need traditional retrieval at all — it might need git log.
And some questions don't need retrieval at all. "What's the difference between a list and a tuple in Python?" is general knowledge. Retrieving from your codebase for that question wastes tokens, adds latency, and can actually hurt answer quality by injecting irrelevant code context.
This lesson builds a retrieve(query, mode) service that routes questions to the right retrieval strategy, or decides to skip retrieval entirely. This is the last piece of the RAG pipeline before we move to making the whole system visible and accountable in Module 6.
What you'll learn
- Build a retrieve(query, mode) service with four modes: code, docs, hybrid, and auto
- Implement auto-routing logic that classifies questions and selects the appropriate retrieval mode
- Decide when not to retrieve, and handle those cases gracefully
- Understand retrieval routing as both a quality optimization and a cost optimization
- Test routing accuracy on your benchmark questions
Concepts
Retrieval mode: a named retrieval strategy optimized for a specific type of evidence. In our system, code mode uses AST-aware vector search, lexical matching, and graph traversal over code files. docs mode is designed for documentation files (README, design docs, comments). hybrid mode combines both. auto mode selects the best mode based on the question. In a production system, each mode would have its own index and chunking strategy. In our implementation, docs mode currently reuses the code retriever — building a separate documentation index is a natural extension once you have documentation worth indexing separately.
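When you do build that separate documentation index, chunking is the main difference: documentation wants section-level chunks rather than AST-level ones. Here is a minimal sketch of a section-level markdown chunker (split_markdown_sections is a hypothetical helper, not part of the lesson's codebase):

```python
import re

def split_markdown_sections(text: str) -> list[dict]:
    """Split a markdown document into section-level chunks.

    Each chunk keeps its heading as metadata, so doc-mode retrieval can
    cite "README.md > Setup" instead of a line range. Headings with no
    body text are dropped, which is acceptable for a sketch.
    """
    sections = []
    current = {"heading": "(intro)", "lines": []}
    for line in text.splitlines():
        if re.match(r"^#{1,6}\s", line):
            # A new heading closes the previous section
            if current["lines"]:
                sections.append(current)
            current = {"heading": line.lstrip("#").strip(), "lines": []}
        else:
            current["lines"].append(line)
    if current["lines"]:
        sections.append(current)
    return [
        {"heading": s["heading"], "text": "\n".join(s["lines"]).strip()}
        for s in sections
    ]
```

Feed these chunks into the same embedding and indexing pipeline you built for code, and docs mode has its own index to search.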
Retrieval router: the component that decides which retrieval mode to use for a given question. The router sits between the question and the retrieval pipeline. A simple router uses keyword heuristics; a more sophisticated one uses an LLM to classify the question. The router's accuracy directly affects answer quality — routing a code question to docs retrieval (or vice versa) degrades results even when the underlying retrieval is good.
Retrieval skipping: the deliberate decision not to retrieve for a given question. This is itself a routing decision: the router classifies the question as answerable from general knowledge and skips the retrieval pipeline entirely. Retrieval skipping is a cost optimization (no embedding call, no vector search, no context tokens) and sometimes a quality optimization (irrelevant evidence can confuse the model). We'll build explicit skip logic with a fallback path.
Retrieval policy: the set of rules that govern when, how, and what to retrieve. A retrieval policy includes the routing rules, mode configurations, skip conditions, and fallback behavior. Making the policy explicit and configurable (rather than hardcoding it in the pipeline) lets you tune retrieval behavior without changing code.
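As a sketch of what an explicit, configurable policy can look like, here is a hypothetical policy file loaded at startup. The file contents and the load_policy helper are illustrative; the RetrievalPolicy dataclass in the walkthrough plays the same role in code.

```python
import json

# Hypothetical policy file contents; the fields mirror the
# RetrievalPolicy dataclass built later in this lesson.
POLICY_JSON = """
{
  "default_mode": "auto",
  "token_budget": 4000,
  "enable_skip": true,
  "skip_patterns": ["^what is (a |an |the )?python\\\\b"]
}
"""

def load_policy(raw: str) -> dict:
    """Parse a policy document and reject obviously invalid values."""
    policy = json.loads(raw)
    if policy["default_mode"] not in {"code", "docs", "hybrid", "auto"}:
        raise ValueError(f"unknown mode: {policy['default_mode']}")
    if policy["token_budget"] <= 0:
        raise ValueError("token_budget must be positive")
    return policy
```

Validating at load time catches typos like a misspelled mode name before they silently fall through to a default at query time.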
Problem-to-Tool Map
| Problem class | Symptom | Cheapest thing to try first | Tool or approach |
|---|---|---|---|
| Wrong retrieval mode | Code questions return documentation; doc questions return code | Single retrieval path | Mode-based routing with auto-classification |
| Always-on retrieval | System retrieves for every question, including general knowledge | No skip logic | Retrieval policy with explicit skip conditions |
| Mode mismatch noise | Model sees irrelevant evidence from the wrong corpus | Broader hybrid search | Targeted retrieval with mode-specific indexes |
| Routing errors | Auto-router picks the wrong mode and answer quality drops | Manual mode override | Routing accuracy benchmark with manual labels |
| Unnecessary cost | Every question incurs embedding + search costs | Reduce top-k | Skip retrieval for questions that don't need it |
Walkthrough
The retrieve(query, mode) service
This is the unified retrieval interface. All downstream code (the RAG pipeline, the evidence bundle builder, the answer generator) calls retrieve() and gets back a consistent evidence bundle regardless of which mode was used internally.
# rag/retrieval_service.py
"""Unified retrieval service with mode routing."""
import re
import sys
from dataclasses import dataclass
from enum import Enum
from typing import Callable
sys.path.insert(0, ".")
from retrieval.context_compiler import compile_context, ContextPack
from rag.pack_to_bundle import context_pack_to_bundle
from rag.evidence_bundle import EvidenceBundle
class RetrievalMode(Enum):
CODE = "code"
DOCS = "docs"
HYBRID = "hybrid"
AUTO = "auto"
SKIP = "skip"
@dataclass
class RetrievalPolicy:
"""Configuration for the retrieval router."""
default_mode: RetrievalMode = RetrievalMode.AUTO
token_budget: int = 4000
enable_skip: bool = True
# Questions matching these patterns will skip retrieval
    skip_patterns: list[str] | None = None
# Override: force a specific mode regardless of classification
force_mode: RetrievalMode | None = None
def __post_init__(self):
if self.skip_patterns is None:
self.skip_patterns = [
r"^what is (a |an |the )?(python|javascript|variable|function|class|api|http|rest)\b",
r"^(explain|describe) (what |how )?(a |an |the )?(list|dict|tuple|set|string|integer|float|boolean)",
r"^how (do|does) (python|javascript|git|docker|linux)",
r"^what('s| is) the difference between",
]
# ---------------------------------------------------------------------------
# Question classifier
# ---------------------------------------------------------------------------
@dataclass
class QuestionClassification:
"""Result of classifying a question for routing."""
mode: RetrievalMode
confidence: float
reasoning: str
def classify_question(question: str, policy: RetrievalPolicy) -> QuestionClassification:
"""Classify a question and determine the best retrieval mode.
This uses heuristics. In a production system, you might use an LLM
for classification, but heuristics are faster, cheaper, and easier
to debug. Start here and upgrade if routing accuracy is a bottleneck.
"""
q_lower = question.lower().strip()
# Check skip patterns first
if policy.enable_skip:
for pattern in policy.skip_patterns:
if re.match(pattern, q_lower):
return QuestionClassification(
mode=RetrievalMode.SKIP,
confidence=0.8,
                    reasoning="Matches skip pattern: general knowledge question",
)
# Code signals: identifiers, file paths, code-specific terms
code_signals = 0
# CamelCase or snake_case identifiers
    if re.search(r'\b[A-Z][a-z]+(?:[A-Z][a-z]+)+\b|\b[a-z]+(?:_[a-z]+)+\b', question):
code_signals += 2
# File paths
if re.search(r'[\w/]+\.(py|js|ts|go|rs|java|rb)\b', question):
code_signals += 2
# Code-specific verbs
if any(w in q_lower for w in ["return", "import", "call", "implement", "function", "class", "method"]):
code_signals += 1
# Line numbers or stack traces
if re.search(r'line \d+|traceback|error at', q_lower):
code_signals += 1
# Doc signals: design, architecture, explanation requests
doc_signals = 0
if any(w in q_lower for w in ["readme", "documentation", "design", "architecture", "overview"]):
doc_signals += 2
if any(w in q_lower for w in ["why", "decision", "tradeoff", "approach", "philosophy"]):
doc_signals += 1
    if any(w in q_lower for w in ["how to use", "getting started", "setup", "install"]):
        doc_signals += 1
    # Conceptual "how does X work" questions lean toward documentation
    if re.search(r"\bhow does\b.*\bwork\b", q_lower):
        doc_signals += 1
# Route based on signal strength
    if code_signals >= 2 and doc_signals == 0:
return QuestionClassification(
mode=RetrievalMode.CODE,
confidence=0.8,
reasoning=f"Strong code signals ({code_signals}), no doc signals",
)
    elif doc_signals >= 2 and code_signals == 0:
return QuestionClassification(
mode=RetrievalMode.DOCS,
confidence=0.8,
reasoning=f"Strong doc signals ({doc_signals}), no code signals",
)
elif code_signals > 0 and doc_signals > 0:
return QuestionClassification(
mode=RetrievalMode.HYBRID,
confidence=0.6,
reasoning=f"Mixed signals (code: {code_signals}, docs: {doc_signals})",
)
else:
# Default to hybrid when we're not sure
return QuestionClassification(
mode=RetrievalMode.HYBRID,
confidence=0.4,
reasoning="No strong signals; defaulting to hybrid",
)
# ---------------------------------------------------------------------------
# Mode-specific retrieval
# ---------------------------------------------------------------------------
def retrieve_code(
question: str,
hybrid_retrieve_fn: Callable,
graph_traverse_fn: Callable | None,
token_budget: int,
) -> ContextPack:
"""Retrieve from code indexes only."""
# In a full implementation, this would use code-specific indexes
# and scoring. For now, we use the context compiler which already
# does code-focused retrieval.
return compile_context(
question,
hybrid_retrieve_fn,
graph_traverse_fn=graph_traverse_fn,
token_budget=token_budget,
)
def retrieve_docs(
question: str,
hybrid_retrieve_fn: Callable,
token_budget: int,
) -> ContextPack:
"""Retrieve from documentation indexes only.
In a production system, this would use a separate doc index with
different chunking (e.g., section-level instead of AST-level).
For now, we use the same hybrid retrieval but you'd swap in a
doc-specific retriever when your doc index is ready.
"""
return compile_context(
question,
hybrid_retrieve_fn,
graph_traverse_fn=None, # No graph traversal for docs
token_budget=token_budget,
)
def retrieve_hybrid(
question: str,
hybrid_retrieve_fn: Callable,
graph_traverse_fn: Callable | None,
token_budget: int,
) -> ContextPack:
"""Retrieve from both code and doc indexes."""
# Split the budget between code and docs
code_budget = int(token_budget * 0.6)
doc_budget = token_budget - code_budget
code_pack = retrieve_code(question, hybrid_retrieve_fn, graph_traverse_fn, code_budget)
doc_pack = retrieve_docs(question, hybrid_retrieve_fn, doc_budget)
# Merge the packs: combine chunks, re-sort, and re-budget
all_chunks = code_pack.chunks + doc_pack.chunks
# Deduplicate by content hash
seen = set()
unique = []
for chunk in all_chunks:
if chunk.content_hash not in seen:
seen.add(chunk.content_hash)
unique.append(chunk)
# Sort by score and trim to budget
unique.sort(key=lambda c: c.retrieval_score, reverse=True)
    from retrieval.context_compiler import apply_token_budget
    merged = ContextPack(
question=question,
chunks=unique,
total_tokens=sum(c.token_count for c in unique),
token_budget=token_budget,
)
return apply_token_budget(merged, budget=token_budget)
# ---------------------------------------------------------------------------
# Unified retrieval service
# ---------------------------------------------------------------------------
@dataclass
class RetrievalResult:
"""Result from the retrieval service."""
bundle: EvidenceBundle | None
mode_used: RetrievalMode
classification: QuestionClassification
skipped: bool = False
skip_reason: str = ""
def retrieve(
question: str,
hybrid_retrieve_fn: Callable,
graph_traverse_fn: Callable | None = None,
policy: RetrievalPolicy | None = None,
mode: RetrievalMode | None = None,
) -> RetrievalResult:
"""Unified retrieval service.
Call this with a question and optionally a mode. If mode is None or
AUTO, the router will classify the question and pick the best mode.
"""
if policy is None:
policy = RetrievalPolicy()
# Determine the mode to use
if policy.force_mode:
classification = QuestionClassification(
mode=policy.force_mode,
confidence=1.0,
reasoning="Forced by policy",
)
elif mode and mode != RetrievalMode.AUTO:
classification = QuestionClassification(
mode=mode,
confidence=1.0,
reasoning=f"Explicitly requested: {mode.value}",
)
else:
classification = classify_question(question, policy)
# Handle skip
if classification.mode == RetrievalMode.SKIP:
return RetrievalResult(
bundle=None,
mode_used=RetrievalMode.SKIP,
classification=classification,
skipped=True,
skip_reason=classification.reasoning,
)
# Route to the appropriate retrieval function
if classification.mode == RetrievalMode.CODE:
pack = retrieve_code(
question, hybrid_retrieve_fn, graph_traverse_fn, policy.token_budget,
)
elif classification.mode == RetrievalMode.DOCS:
pack = retrieve_docs(
question, hybrid_retrieve_fn, policy.token_budget,
)
else: # HYBRID or fallback
pack = retrieve_hybrid(
question, hybrid_retrieve_fn, graph_traverse_fn, policy.token_budget,
)
bundle = context_pack_to_bundle(pack)
return RetrievalResult(
bundle=bundle,
mode_used=classification.mode,
classification=classification,
)
if __name__ == "__main__":
from retrieval.hybrid_retrieve import hybrid_retrieve
test_questions = [
"What does validate_path return?",
"What is the architecture of the retrieval system?",
"What is a Python decorator?",
"What functions call read_file and how does the caching work?",
]
print("Retrieval routing demo\n")
for q in test_questions:
result = retrieve(q, hybrid_retrieve)
print(f"Q: {q}")
print(f" Mode: {result.mode_used.value}")
print(f" Confidence: {result.classification.confidence:.1f}")
print(f" Reasoning: {result.classification.reasoning}")
if result.skipped:
print(f" SKIPPED: {result.skip_reason}")
else:
print(f" Snippets: {len(result.bundle.snippets)}, "
f"Tokens: {result.bundle.total_tokens}")
        print()

Run it:

python rag/retrieval_service.py

Expected output:
Retrieval routing demo
Q: What does validate_path return?
Mode: code
Confidence: 0.8
Reasoning: Strong code signals (3), no doc signals
Snippets: 4, Tokens: 1823
Q: What is the architecture of the retrieval system?
Mode: docs
Confidence: 0.8
Reasoning: Strong doc signals (2), no code signals
Snippets: 3, Tokens: 1450
Q: What is a Python decorator?
Mode: skip
Confidence: 0.8
Reasoning: Matches skip pattern: general knowledge question
SKIPPED: Matches skip pattern: general knowledge question
Q: What functions call read_file and how does the caching work?
Mode: hybrid
Confidence: 0.6
Reasoning: Mixed signals (code: 3, docs: 1)
Snippets: 5, Tokens: 2100
When NOT to retrieve
Retrieval skipping is the simplest cost optimization in the entire RAG pipeline, and it's often overlooked. Every retrieval call has a cost:
- Latency: embedding the query, searching the index, compiling the context pack
- Token cost: the retrieved evidence consumes input tokens in the generation call
- Quality risk: irrelevant evidence can confuse the model into producing worse answers than it would with no evidence at all
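To put rough numbers on the token cost, here is a back-of-envelope estimate. The price per million input tokens is a placeholder; substitute your model's current pricing.

```python
def monthly_savings_estimate(
    questions_per_day: int,
    skip_rate: float,
    context_tokens: int = 4000,    # matches the policy's default token budget
    price_per_mtok: float = 0.15,  # placeholder input price per million tokens
) -> float:
    """Rough dollars saved per 30-day month by skipping retrieval.

    Each skipped question avoids sending roughly context_tokens of
    evidence to the model; embedding and search costs come on top.
    """
    skipped_questions = questions_per_day * 30 * skip_rate
    return skipped_questions * context_tokens / 1_000_000 * price_per_mtok
```

At 1,000 questions per day and a 10% skip rate this works out to about $1.80 per month in input tokens. The dollar savings are modest at small scale, which is why latency and answer quality are usually the stronger arguments for skipping.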
The skip logic we built uses pattern matching, which catches the obvious cases (general knowledge questions). For more nuanced cases, you'll want to track skip accuracy in your benchmark: did skipping help or hurt for each question?
A useful heuristic: if you're not sure whether to retrieve, retrieve but track whether the evidence was actually cited in the answer. If the model consistently ignores the evidence for a class of questions, that's a signal to add those questions to the skip list. We'll build this tracking in Module 6.
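A minimal version of that tracking might look like this sketch; evidence_was_used is a hypothetical helper that treats a mention of a retrieved file path or symbol name as a coarse proxy for citation:

```python
def evidence_was_used(answer: str, snippets: list) -> bool:
    """Rough check: does the answer reference any retrieved evidence?

    A snippet counts as "used" if its file path or symbol name shows up
    in the answer text. This is a coarse proxy for citation; Module 6
    replaces it with proper citation tracking.
    """
    answer_lower = answer.lower()
    for s in snippets:
        if s.file_path.lower() in answer_lower:
            return True
        if s.symbol_name and s.symbol_name.lower() in answer_lower:
            return True
    return False
```

Log this boolean per question category, and a category whose answers consistently ignore the evidence becomes a candidate for the skip list.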
Auto-routing logic
The auto-router uses a signal-counting approach: count code signals, count doc signals, and route based on which is stronger. This is deliberately simple. Here's why:
- Debuggable: when routing goes wrong, you can inspect the signal counts and understand why. An LLM-based classifier is more accurate but harder to debug.
- Fast: no API call needed for classification. The router adds microseconds, not milliseconds.
- Cheap: no token cost for the classification step.
The tradeoff is accuracy. The heuristic router will misclassify some questions, particularly ambiguous ones like "how does the auth module handle token refresh?" (is that a code question or a design question?). If routing accuracy becomes a bottleneck, you can upgrade to an LLM classifier without changing the interface, as the retrieve() function's signature stays the same.
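If you do make that upgrade, the only genuinely new moving part is turning the model's free-text reply into a mode label. Here is a sketch using the same OpenAI chat completions call the lesson already uses for generation; the prompt wording and the parse_mode_label helper are illustrative:

```python
VALID_MODES = {"code", "docs", "hybrid", "skip"}

def parse_mode_label(raw: str) -> str:
    """Normalize an LLM reply to a known mode; fall back to hybrid."""
    label = raw.strip().lower().strip(".\"'")
    return label if label in VALID_MODES else "hybrid"

def classify_with_llm(question: str, client, model: str = "gpt-4o-mini") -> str:
    """Classify a question with a one-word LLM call."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": (
                "Classify the question for a codebase Q&A system. Reply "
                "with exactly one word: code, docs, hybrid, or skip."
            )},
            {"role": "user", "content": question},
        ],
        temperature=0,
    )
    return parse_mode_label(response.choices[0].message.content)
```

The defensive fallback to "hybrid" mirrors the heuristic router's default, so a malformed reply degrades gracefully instead of crashing the pipeline.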
Measuring routing accuracy
To know whether your router is working, you'll need ground truth labels. Add a routing_mode field to your benchmark questions:
{"id": "q001", "question": "What does validate_path do?", "category": "code_lookup", "routing_mode": "code"}
{"id": "q002", "question": "What is the project architecture?", "category": "explanation", "routing_mode": "docs"}
{"id": "q003", "question": "What is a Python list?", "category": "general", "routing_mode": "skip"}
{"id": "q004", "question": "What calls read_file and why was it designed that way?", "category": "relationship", "routing_mode": "hybrid"}

Then run the classifier on each question and compare:
# rag/test_routing.py
"""Test routing accuracy against labeled benchmark questions."""
import json
import sys
sys.path.insert(0, ".")
from rag.retrieval_service import (
classify_question, RetrievalPolicy, RetrievalMode,
)
from pathlib import Path
BENCHMARK_FILE = Path("benchmark-questions.jsonl")
def test_routing():
policy = RetrievalPolicy()
questions = []
with open(BENCHMARK_FILE) as f:
for line in f:
if line.strip():
q = json.loads(line)
if "routing_mode" in q:
questions.append(q)
if not questions:
print("No questions with routing_mode labels found.")
print("Add 'routing_mode' to your benchmark questions to test routing.")
return
correct = 0
total = len(questions)
for q in questions:
classification = classify_question(q["question"], policy)
expected = q["routing_mode"]
predicted = classification.mode.value
match = expected == predicted
if match:
correct += 1
else:
print(f"MISMATCH: {q['question'][:50]}...")
print(f" Expected: {expected}, Got: {predicted}")
print(f" Reasoning: {classification.reasoning}")
accuracy = correct / total * 100 if total > 0 else 0
print(f"\nRouting accuracy: {correct}/{total} ({accuracy:.0f}%)")
if accuracy < 80:
print("Consider adding more signal patterns or switching to LLM classification.")
if __name__ == "__main__":
    test_routing()

Run it:

python rag/test_routing.py

Integrating the router into the RAG pipeline
Now we'll update our RAG pipeline from Lesson 1 to use the retrieval service instead of calling the context compiler directly:
# rag/rag_with_routing.py
"""RAG pipeline with retrieval routing: the complete Module 5 pipeline."""
import sys
sys.path.insert(0, ".")
from rag.retrieval_service import retrieve, RetrievalPolicy, RetrievalMode
from rag.grounded_answer import (
GroundedAnswer, check_grounding, generate_answer,
)
from openai import OpenAI
client = OpenAI()
MODEL = "gpt-4o-mini"
def rag_pipeline_with_routing(
question: str,
hybrid_retrieve_fn,
graph_traverse_fn=None,
policy: RetrievalPolicy | None = None,
mode: RetrievalMode | None = None,
model: str = MODEL,
) -> GroundedAnswer:
result = retrieve(
question, hybrid_retrieve_fn,
graph_traverse_fn=graph_traverse_fn, policy=policy, mode=mode,
)
if result.skipped:
response = client.chat.completions.create(
model=model,
messages=[
{"role": "system", "content": (
"You are a helpful assistant. Answer the question from "
"your general knowledge. Be concise and accurate."
)},
{"role": "user", "content": question},
],
temperature=0,
)
return GroundedAnswer(
question=question,
answer=response.choices[0].message.content,
abstained=False, model=model,
)
from retrieval.context_compiler import ContextPack, EvidenceChunk
pack = ContextPack(
question=question,
chunks=[
EvidenceChunk(
chunk_id=s.chunk_id, file_path=s.file_path,
symbol_name=s.symbol_name or "", text=s.text,
start_line=s.start_line, end_line=s.end_line,
retrieval_method=s.retrieval_method,
retrieval_score=s.relevance_score,
)
for s in result.bundle.snippets
],
total_tokens=result.bundle.total_tokens,
token_budget=result.bundle.token_budget,
)
sufficient, reason = check_grounding(pack)
if not sufficient:
return GroundedAnswer(
question=question,
answer=f"I don't have enough evidence to answer this question reliably. "
f"{reason} Rather than guessing, I'd recommend checking the "
f"codebase directly or refining the question.",
abstained=True, abstention_reason=reason,
model=model, context_tokens=pack.total_tokens,
)
return generate_answer(question, pack, model=model)
if __name__ == "__main__":
from retrieval.hybrid_retrieve import hybrid_retrieve
test_questions = [
"What does validate_path return?",
"What is a Python decorator?",
"What is the architecture of the retrieval system?",
]
for q in test_questions:
print(f"Q: {q}")
answer = rag_pipeline_with_routing(q, hybrid_retrieve)
if answer.abstained:
print(f" ABSTAINED: {answer.abstention_reason}")
print(f" Answer: {answer.answer[:150]}...")
print(f" Citations: {len(answer.citations)}")
        print()

Run it:

python rag/rag_with_routing.py

Exercises
- Build the retrieval service (rag/retrieval_service.py). Test it with the four example questions and verify that each one routes to the expected mode.
- Add routing_mode labels to at least 15 of your benchmark questions. Run rag/test_routing.py and measure routing accuracy. If accuracy is below 80%, adjust the signal patterns.
- Ask the same question in all four modes (code, docs, hybrid, auto) and compare the answers. For which questions does the mode choice matter most? For which does it barely matter?
- Test retrieval skipping on five general-knowledge questions and five codebase-specific questions. Verify that skipping works correctly for general questions and doesn't trigger for codebase questions.
- Integrate the router into the full RAG pipeline (rag/rag_with_routing.py). Run your complete benchmark through the routed pipeline and compare answer quality to the non-routed version from Lesson 1. Does routing improve accuracy? For which question categories?
Completion checkpoint
You should now have:
- A retrieve(query, mode) service with four modes: code, docs, hybrid, and auto
- An auto-routing classifier that picks the best retrieval mode based on question signals
- Retrieval skip logic for general knowledge questions that don't need codebase evidence
- Routing accuracy measured against labeled benchmark questions (target: 80%+)
- A complete RAG pipeline with routing integrated, tested on your full benchmark
Reflection prompts
- Which questions did the router misclassify? What would you need to fix those: better heuristics, or an LLM-based classifier?
- How much did retrieval skipping save in terms of latency and tokens? Was the answer quality for skipped questions better, worse, or the same as when evidence was retrieved?
- Are there question categories where mode routing made a significant quality difference? Categories where it didn't matter?
Connecting to the project
We've built a complete answer pipeline. A question comes in, the router classifies it and selects a retrieval strategy (or skips retrieval entirely), the context compiler produces a structured evidence bundle, and the answer generator produces a grounded response with citations and abstention. Every piece of retrieved evidence has provenance. Every claim in the answer can be traced to a source.
Now we need to make it visible and accountable. The pipeline works, but we can't see inside it during operation. How long does each stage take? How much does each answer cost? When routing goes wrong, how do we detect it? When the model hallucinates despite grounding, how do we catch it? Module 6 will add the observability and evaluation layers that turn this from a working prototype into a system you can operate and improve with confidence.
What's next
Telemetry. You now have a full pipeline with branching behavior; the next lesson makes that behavior visible so you can see routes, tool calls, latency, and failures instead of guessing.
References
Start here
- OpenAI: Text generation guide — the generation API underlying both the routed and non-routed answer paths
Build with this
- LlamaIndex: Query engine — LlamaIndex's approach to query routing and retrieval strategy selection
- Anthropic: Prompt engineering guide — useful for refining the system prompts used in both the routed and skip paths
Deep dive
- Semantic Router — an open-source library for semantic query routing, useful if you want to replace heuristic routing with embedding-based classification
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al., 2020) — the original RAG paper; the routing concept extends the paper's single-retriever design to multiple specialized retrievers