Module 5: RAG and Grounded Answers / Retrieval Routing

Retrieval Modes and Routing

Up to this point, every question in our pipeline has followed the same retrieval path: hybrid search over code chunks. That was the right starting point. A single retrieval strategy is easier to debug and benchmark. But your anchor project indexes both code and documentation, and those need different retrieval approaches. A question about "how does the caching layer work?" benefits from README and design doc retrieval. A question about "what does validate_path return?" needs precise symbol lookup. And a question about "what changed in the auth module last week?" might not need traditional retrieval at all; it might need git log.

And some questions don't need retrieval at all. "What's the difference between a list and a tuple in Python?" is general knowledge. Retrieving from your codebase for that question wastes tokens, adds latency, and can actually hurt answer quality by injecting irrelevant code context.

This lesson builds a retrieve(query, mode) service that routes questions to the right retrieval strategy, or decides to skip retrieval entirely. This is the last piece of the RAG pipeline before we move to making the whole system visible and accountable in Module 6.

What you'll learn

  • Build a retrieve(query, mode) service with four modes: code, docs, hybrid, and auto
  • Implement auto-routing logic that classifies questions and selects the appropriate retrieval mode
  • Decide when not to retrieve, and handle those cases gracefully
  • Understand retrieval routing as both a quality optimization and a cost optimization
  • Test routing accuracy on your benchmark questions

Concepts

Retrieval mode: a named retrieval strategy optimized for a specific type of evidence. In our system, code mode uses AST-aware vector search, lexical matching, and graph traversal over code files. docs mode is designed for documentation files (README, design docs, comments). hybrid mode combines both. auto mode selects the best mode based on the question. In a production system, each mode would have its own index and chunking strategy. In our implementation, docs mode currently reuses the code retriever — building a separate documentation index is a natural extension once you have documentation worth indexing separately.

Retrieval router: the component that decides which retrieval mode to use for a given question. The router sits between the question and the retrieval pipeline. A simple router uses keyword heuristics; a more sophisticated one uses an LLM to classify the question. The router's accuracy directly affects answer quality — routing a code question to docs retrieval (or vice versa) degrades results even when the underlying retrieval is good.

Retrieval skipping: the deliberate decision not to retrieve for a given question. This is itself a routing decision: the router classifies the question as answerable from general knowledge and skips the retrieval pipeline entirely. Retrieval skipping is a cost optimization (no embedding call, no vector search, no context tokens) and sometimes a quality optimization (irrelevant evidence can confuse the model). We'll build explicit skip logic with a fallback path.

Retrieval policy: the set of rules that govern when, how, and what to retrieve. A retrieval policy includes the routing rules, mode configurations, skip conditions, and fallback behavior. Making the policy explicit and configurable (rather than hardcoding it in the pipeline) lets you tune retrieval behavior without changing code.
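To make "explicit and configurable" concrete, here is a minimal sketch of loading policy overrides from a JSON file instead of hardcoding them. The field names mirror the RetrievalPolicy dataclass we build in the walkthrough below; the simplified string mode and the config file name are placeholders for illustration.

```python
# Sketch: a retrieval policy loaded from config rather than hardcoded.
# Field names mirror the walkthrough's RetrievalPolicy; "policy.json"
# is a placeholder path.
import json
from dataclasses import dataclass, field


@dataclass
class RetrievalPolicy:
    default_mode: str = "auto"
    token_budget: int = 4000
    enable_skip: bool = True
    skip_patterns: list[str] = field(default_factory=list)


def load_policy(path: str) -> RetrievalPolicy:
    """Read policy overrides from JSON so retrieval behavior can be
    tuned without a code change or redeploy."""
    with open(path) as f:
        overrides = json.load(f)
    return RetrievalPolicy(**overrides)
```

Keys absent from the file fall back to the dataclass defaults, so a config containing only `{"token_budget": 2000}` still yields a complete policy.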

Problem-to-Tool Map

| Problem class | Symptom | Cheapest thing to try first | Tool or approach |
| --- | --- | --- | --- |
| Wrong retrieval mode | Code questions return documentation; doc questions return code | Single retrieval path | Mode-based routing with auto-classification |
| Always-on retrieval | System retrieves for every question, including general knowledge | No skip logic | Retrieval policy with explicit skip conditions |
| Mode mismatch noise | Model sees irrelevant evidence from the wrong corpus | Broader hybrid search | Targeted retrieval with mode-specific indexes |
| Routing errors | Auto-router picks the wrong mode and answer quality drops | Manual mode override | Routing accuracy benchmark with manual labels |
| Unnecessary cost | Every question incurs embedding + search costs | Reduce top-k | Skip retrieval for questions that don't need it |

Walkthrough

The retrieve(query, mode) service

This is the unified retrieval interface. All downstream code (the RAG pipeline, the evidence bundle builder, the answer generator) calls retrieve() and gets back a consistent evidence bundle regardless of which mode was used internally.

# rag/retrieval_service.py
"""Unified retrieval service with mode routing."""
import re
import sys
from dataclasses import dataclass
from enum import Enum
from typing import Callable

sys.path.insert(0, ".")
from retrieval.context_compiler import compile_context, ContextPack
from rag.pack_to_bundle import context_pack_to_bundle
from rag.evidence_bundle import EvidenceBundle


class RetrievalMode(Enum):
    CODE = "code"
    DOCS = "docs"
    HYBRID = "hybrid"
    AUTO = "auto"
    SKIP = "skip"


@dataclass
class RetrievalPolicy:
    """Configuration for the retrieval router."""
    default_mode: RetrievalMode = RetrievalMode.AUTO
    token_budget: int = 4000
    enable_skip: bool = True
    # Questions matching these patterns will skip retrieval
    skip_patterns: list[str] | None = None
    # Override: force a specific mode regardless of classification
    force_mode: RetrievalMode | None = None

    def __post_init__(self):
        if self.skip_patterns is None:
            self.skip_patterns = [
                r"^what is (a |an |the )?(python|javascript|variable|function|class|api|http|rest)\b",
                r"^(explain|describe) (what |how )?(a |an |the )?(list|dict|tuple|set|string|integer|float|boolean)",
                r"^how (do|does) (python|javascript|git|docker|linux)",
                r"^what('s| is) the difference between",
            ]


# ---------------------------------------------------------------------------
# Question classifier
# ---------------------------------------------------------------------------

@dataclass
class QuestionClassification:
    """Result of classifying a question for routing."""
    mode: RetrievalMode
    confidence: float
    reasoning: str


def classify_question(question: str, policy: RetrievalPolicy) -> QuestionClassification:
    """Classify a question and determine the best retrieval mode.

    This uses heuristics. In a production system, you might use an LLM
    for classification, but heuristics are faster, cheaper, and easier
    to debug. Start here and upgrade if routing accuracy is a bottleneck.
    """
    q_lower = question.lower().strip()

    # Check skip patterns first
    if policy.enable_skip:
        for pattern in policy.skip_patterns:
            if re.match(pattern, q_lower):
                return QuestionClassification(
                    mode=RetrievalMode.SKIP,
                    confidence=0.8,
                    reasoning="Matches skip pattern: general knowledge question",
                )

    # Code signals: identifiers, file paths, code-specific terms
    code_signals = 0
    # CamelCase or snake_case identifiers
    if re.search(r'\b[A-Z][a-z]+(?:[A-Z][a-z]+)+\b|\b[a-z_]+(?:_[a-z]+){2,}\b', question):
        code_signals += 2
    # File paths
    if re.search(r'[\w/]+\.(py|js|ts|go|rs|java|rb)\b', question):
        code_signals += 2
    # Code-specific verbs
    if any(w in q_lower for w in ["return", "import", "call", "implement", "function", "class", "method"]):
        code_signals += 1
    # Line numbers or stack traces
    if re.search(r'line \d+|traceback|error at', q_lower):
        code_signals += 1

    # Doc signals: design, architecture, explanation requests
    doc_signals = 0
    if any(w in q_lower for w in ["readme", "documentation", "design", "architecture", "overview"]):
        doc_signals += 2
    if any(w in q_lower for w in ["why", "decision", "tradeoff", "approach", "philosophy"]):
        doc_signals += 1
    if any(w in q_lower for w in ["how to use", "getting started", "setup", "install"]):
        doc_signals += 1

    # Route based on signal strength
    if code_signals >= 3 and doc_signals == 0:
        return QuestionClassification(
            mode=RetrievalMode.CODE,
            confidence=0.8,
            reasoning=f"Strong code signals ({code_signals}), no doc signals",
        )
    elif doc_signals >= 3 and code_signals == 0:
        return QuestionClassification(
            mode=RetrievalMode.DOCS,
            confidence=0.8,
            reasoning=f"Strong doc signals ({doc_signals}), no code signals",
        )
    elif code_signals > 0 and doc_signals > 0:
        return QuestionClassification(
            mode=RetrievalMode.HYBRID,
            confidence=0.6,
            reasoning=f"Mixed signals (code: {code_signals}, docs: {doc_signals})",
        )
    else:
        # Default to hybrid when we're not sure
        return QuestionClassification(
            mode=RetrievalMode.HYBRID,
            confidence=0.4,
            reasoning="No strong signals; defaulting to hybrid",
        )


# ---------------------------------------------------------------------------
# Mode-specific retrieval
# ---------------------------------------------------------------------------

def retrieve_code(
    question: str,
    hybrid_retrieve_fn: Callable,
    graph_traverse_fn: Callable | None,
    token_budget: int,
) -> ContextPack:
    """Retrieve from code indexes only."""
    # In a full implementation, this would use code-specific indexes
    # and scoring. For now, we use the context compiler which already
    # does code-focused retrieval.
    return compile_context(
        question,
        hybrid_retrieve_fn,
        graph_traverse_fn=graph_traverse_fn,
        token_budget=token_budget,
    )


def retrieve_docs(
    question: str,
    hybrid_retrieve_fn: Callable,
    token_budget: int,
) -> ContextPack:
    """Retrieve from documentation indexes only.

    In a production system, this would use a separate doc index with
    different chunking (e.g., section-level instead of AST-level).
    For now, we use the same hybrid retrieval but you'd swap in a
    doc-specific retriever when your doc index is ready.
    """
    return compile_context(
        question,
        hybrid_retrieve_fn,
        graph_traverse_fn=None,  # No graph traversal for docs
        token_budget=token_budget,
    )


def retrieve_hybrid(
    question: str,
    hybrid_retrieve_fn: Callable,
    graph_traverse_fn: Callable | None,
    token_budget: int,
) -> ContextPack:
    """Retrieve from both code and doc indexes."""
    # Split the budget between code and docs
    code_budget = int(token_budget * 0.6)
    doc_budget = token_budget - code_budget

    code_pack = retrieve_code(question, hybrid_retrieve_fn, graph_traverse_fn, code_budget)
    doc_pack = retrieve_docs(question, hybrid_retrieve_fn, doc_budget)

    # Merge the packs: combine chunks, re-sort, and re-budget
    all_chunks = code_pack.chunks + doc_pack.chunks

    # Deduplicate by content hash
    seen = set()
    unique = []
    for chunk in all_chunks:
        if chunk.content_hash not in seen:
            seen.add(chunk.content_hash)
            unique.append(chunk)

    # Sort by score and trim to budget
    unique.sort(key=lambda c: c.retrieval_score, reverse=True)

    # ContextPack is already imported at module level; we only need the
    # budgeting helper here.
    from retrieval.context_compiler import apply_token_budget
    merged = ContextPack(
        question=question,
        chunks=unique,
        total_tokens=sum(c.token_count for c in unique),
        token_budget=token_budget,
    )
    return apply_token_budget(merged, budget=token_budget)


# ---------------------------------------------------------------------------
# Unified retrieval service
# ---------------------------------------------------------------------------

@dataclass
class RetrievalResult:
    """Result from the retrieval service."""
    bundle: EvidenceBundle | None
    mode_used: RetrievalMode
    classification: QuestionClassification
    skipped: bool = False
    skip_reason: str = ""


def retrieve(
    question: str,
    hybrid_retrieve_fn: Callable,
    graph_traverse_fn: Callable | None = None,
    policy: RetrievalPolicy | None = None,
    mode: RetrievalMode | None = None,
) -> RetrievalResult:
    """Unified retrieval service.

    Call this with a question and optionally a mode. If mode is None or
    AUTO, the router will classify the question and pick the best mode.
    """
    if policy is None:
        policy = RetrievalPolicy()

    # Determine the mode to use
    if policy.force_mode:
        classification = QuestionClassification(
            mode=policy.force_mode,
            confidence=1.0,
            reasoning="Forced by policy",
        )
    elif mode and mode != RetrievalMode.AUTO:
        classification = QuestionClassification(
            mode=mode,
            confidence=1.0,
            reasoning=f"Explicitly requested: {mode.value}",
        )
    else:
        classification = classify_question(question, policy)

    # Handle skip
    if classification.mode == RetrievalMode.SKIP:
        return RetrievalResult(
            bundle=None,
            mode_used=RetrievalMode.SKIP,
            classification=classification,
            skipped=True,
            skip_reason=classification.reasoning,
        )

    # Route to the appropriate retrieval function
    if classification.mode == RetrievalMode.CODE:
        pack = retrieve_code(
            question, hybrid_retrieve_fn, graph_traverse_fn, policy.token_budget,
        )
    elif classification.mode == RetrievalMode.DOCS:
        pack = retrieve_docs(
            question, hybrid_retrieve_fn, policy.token_budget,
        )
    else:  # HYBRID or fallback
        pack = retrieve_hybrid(
            question, hybrid_retrieve_fn, graph_traverse_fn, policy.token_budget,
        )

    bundle = context_pack_to_bundle(pack)

    return RetrievalResult(
        bundle=bundle,
        mode_used=classification.mode,
        classification=classification,
    )


if __name__ == "__main__":
    from retrieval.hybrid_retrieve import hybrid_retrieve

    test_questions = [
        "What does validate_path return?",
        "What is the architecture of the retrieval system?",
        "What is a Python decorator?",
        "What functions call read_file and how does the caching work?",
    ]

    print("Retrieval routing demo\n")
    for q in test_questions:
        result = retrieve(q, hybrid_retrieve)
        print(f"Q: {q}")
        print(f"  Mode: {result.mode_used.value}")
        print(f"  Confidence: {result.classification.confidence:.1f}")
        print(f"  Reasoning: {result.classification.reasoning}")
        if result.skipped:
            print(f"  SKIPPED: {result.skip_reason}")
        else:
            print(f"  Snippets: {len(result.bundle.snippets)}, "
                  f"Tokens: {result.bundle.total_tokens}")
        print()
Run it:

python rag/retrieval_service.py

Expected output:

Retrieval routing demo

Q: What does validate_path return?
  Mode: code
  Confidence: 0.8
  Reasoning: Strong code signals (3), no doc signals
  Snippets: 4, Tokens: 1823

Q: What is the architecture of the retrieval system?
  Mode: docs
  Confidence: 0.8
  Reasoning: Strong doc signals (3), no code signals
  Snippets: 3, Tokens: 1450

Q: What is a Python decorator?
  Mode: skip
  Confidence: 0.8
  Reasoning: Matches skip pattern: general knowledge question
  SKIPPED: Matches skip pattern: general knowledge question

Q: What functions call read_file and how does the caching work?
  Mode: hybrid
  Confidence: 0.6
  Reasoning: Mixed signals (code: 3, docs: 1)
  Snippets: 5, Tokens: 2100

When NOT to retrieve

Retrieval skipping is the simplest cost optimization in the entire RAG pipeline, and it's often overlooked. Every retrieval call has a cost:

  • Latency: embedding the query, searching the index, compiling the context pack
  • Token cost: the retrieved evidence consumes input tokens in the generation call
  • Quality risk: irrelevant evidence can confuse the model into producing worse answers than it would with no evidence at all

The skip logic we built uses pattern matching, which catches the obvious cases (general knowledge questions). For more nuanced cases, you'll want to track skip accuracy in your benchmark: did skipping help or hurt for each question?

A useful heuristic: if you're not sure whether to retrieve, retrieve but track whether the evidence was actually cited in the answer. If the model consistently ignores the evidence for a class of questions, that's a signal to add those questions to the skip list. We'll build this tracking in Module 6.
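As a preview of that tracking, here is a minimal sketch of measuring whether retrieved evidence was actually used. It assumes answers cite sources with bracketed chunk ids like [src/auth.py:validate_path]; adapt the matching to however your answer generator formats citations.

```python
# Sketch: how much of the retrieved evidence did the answer actually cite?
# Assumes citations appear as bracketed chunk ids, e.g. "[chunk_id]".
def evidence_usage_rate(answer: str, chunk_ids: list[str]) -> float:
    """Return the fraction of retrieved chunks cited in the answer."""
    if not chunk_ids:
        return 0.0
    cited = sum(1 for cid in chunk_ids if f"[{cid}]" in answer)
    return cited / len(chunk_ids)


# If a question category's usage rate stays near zero across benchmark
# runs, that category is a candidate for the skip list.
```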

Auto-routing logic

The auto-router uses a signal-counting approach: count code signals, count doc signals, and route based on which is stronger. This is deliberately simple. Here's why:

  1. Debuggable: when routing goes wrong, you can inspect the signal counts and understand why. An LLM-based classifier is more accurate but harder to debug.
  2. Fast: no API call needed for classification. The router adds microseconds, not milliseconds.
  3. Cheap: no token cost for the classification step.

The tradeoff is accuracy. The heuristic router will misclassify some questions, particularly ambiguous ones like "how does the auth module handle token refresh?" (is that a code question or a design question?). If routing accuracy becomes a bottleneck, you can upgrade to an LLM classifier without changing the interface, as the retrieve() function's signature stays the same.
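If you do make that upgrade, the swap can be small. The sketch below is one way to wrap an LLM classifier behind the same mode vocabulary; the prompt wording and the gpt-4o-mini default are assumptions, and the parsing helper is the part worth unit-testing, since model replies are not guaranteed to be a clean single word.

```python
# Sketch: an LLM-backed classifier that returns the same mode strings as
# the heuristic router. Prompt text and model name are assumptions.
VALID_MODES = {"code", "docs", "hybrid", "skip"}

CLASSIFY_PROMPT = (
    "Classify this question for retrieval routing over a codebase. "
    "Reply with exactly one word: code, docs, hybrid, or skip.\n\n"
    "Question: {question}"
)


def parse_mode(raw: str) -> str:
    """Normalize a model reply to a valid mode, defaulting to hybrid."""
    word = raw.strip().lower().rstrip(".")
    return word if word in VALID_MODES else "hybrid"


def classify_with_llm(question: str, client, model: str = "gpt-4o-mini") -> str:
    """Classify a question using a chat-completions-style client."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": CLASSIFY_PROMPT.format(question=question)}],
        temperature=0,
    )
    return parse_mode(response.choices[0].message.content)
```

Defaulting unparseable replies to hybrid mirrors the heuristic router's fallback, so a flaky classification degrades gracefully instead of crashing the pipeline.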

Measuring routing accuracy

To know whether your router is working, you'll need ground truth labels. Add a routing_mode field to your benchmark questions:

{"id": "q001", "question": "What does validate_path do?", "category": "code_lookup", "routing_mode": "code"}
{"id": "q002", "question": "What is the project architecture?", "category": "explanation", "routing_mode": "docs"}
{"id": "q003", "question": "What is a Python list?", "category": "general", "routing_mode": "skip"}
{"id": "q004", "question": "What calls read_file and why was it designed that way?", "category": "relationship", "routing_mode": "hybrid"}

Then run the classifier on each question and compare:

# rag/test_routing.py
"""Test routing accuracy against labeled benchmark questions."""
import json
import sys

sys.path.insert(0, ".")
from rag.retrieval_service import (
    classify_question, RetrievalPolicy, RetrievalMode,
)
from pathlib import Path

BENCHMARK_FILE = Path("benchmark-questions.jsonl")


def test_routing():
    policy = RetrievalPolicy()

    questions = []
    with open(BENCHMARK_FILE) as f:
        for line in f:
            if line.strip():
                q = json.loads(line)
                if "routing_mode" in q:
                    questions.append(q)

    if not questions:
        print("No questions with routing_mode labels found.")
        print("Add 'routing_mode' to your benchmark questions to test routing.")
        return

    correct = 0
    total = len(questions)

    for q in questions:
        classification = classify_question(q["question"], policy)
        expected = q["routing_mode"]
        predicted = classification.mode.value
        match = expected == predicted

        if match:
            correct += 1
        else:
            print(f"MISMATCH: {q['question'][:50]}...")
            print(f"  Expected: {expected}, Got: {predicted}")
            print(f"  Reasoning: {classification.reasoning}")

    accuracy = correct / total * 100 if total > 0 else 0
    print(f"\nRouting accuracy: {correct}/{total} ({accuracy:.0f}%)")
    if accuracy < 80:
        print("Consider adding more signal patterns or switching to LLM classification.")


if __name__ == "__main__":
    test_routing()
Run it:

python rag/test_routing.py

Integrating the router into the RAG pipeline

Now we'll update our RAG pipeline from Lesson 1 to use the retrieval service instead of calling the context compiler directly:

# rag/rag_with_routing.py
"""RAG pipeline with retrieval routing: the complete Module 5 pipeline."""
import sys
sys.path.insert(0, ".")

from rag.retrieval_service import retrieve, RetrievalPolicy, RetrievalMode
from rag.grounded_answer import (
    GroundedAnswer, check_grounding, generate_answer,
)
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"


def rag_pipeline_with_routing(
    question: str,
    hybrid_retrieve_fn,
    graph_traverse_fn=None,
    policy: RetrievalPolicy | None = None,
    mode: RetrievalMode | None = None,
    model: str = MODEL,
) -> GroundedAnswer:
    result = retrieve(
        question, hybrid_retrieve_fn,
        graph_traverse_fn=graph_traverse_fn, policy=policy, mode=mode,
    )

    if result.skipped:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": (
                    "You are a helpful assistant. Answer the question from "
                    "your general knowledge. Be concise and accurate."
                )},
                {"role": "user", "content": question},
            ],
            temperature=0,
        )
        return GroundedAnswer(
            question=question,
            answer=response.choices[0].message.content,
            abstained=False, model=model,
        )

    from retrieval.context_compiler import ContextPack, EvidenceChunk
    pack = ContextPack(
        question=question,
        chunks=[
            EvidenceChunk(
                chunk_id=s.chunk_id, file_path=s.file_path,
                symbol_name=s.symbol_name or "", text=s.text,
                start_line=s.start_line, end_line=s.end_line,
                retrieval_method=s.retrieval_method,
                retrieval_score=s.relevance_score,
            )
            for s in result.bundle.snippets
        ],
        total_tokens=result.bundle.total_tokens,
        token_budget=result.bundle.token_budget,
    )

    sufficient, reason = check_grounding(pack)
    if not sufficient:
        return GroundedAnswer(
            question=question,
            answer=f"I don't have enough evidence to answer this question reliably. "
                   f"{reason} Rather than guessing, I'd recommend checking the "
                   f"codebase directly or refining the question.",
            abstained=True, abstention_reason=reason,
            model=model, context_tokens=pack.total_tokens,
        )

    return generate_answer(question, pack, model=model)


if __name__ == "__main__":
    from retrieval.hybrid_retrieve import hybrid_retrieve

    test_questions = [
        "What does validate_path return?",
        "What is a Python decorator?",
        "What is the architecture of the retrieval system?",
    ]
    for q in test_questions:
        print(f"Q: {q}")
        answer = rag_pipeline_with_routing(q, hybrid_retrieve)
        if answer.abstained:
            print(f"  ABSTAINED: {answer.abstention_reason}")
        print(f"  Answer: {answer.answer[:150]}...")
        print(f"  Citations: {len(answer.citations)}")
        print()
Run it:

python rag/rag_with_routing.py

Exercises

  1. Build the retrieval service (rag/retrieval_service.py). Test it with the four example questions and verify that each one routes to the expected mode.
  2. Add routing_mode labels to at least 15 of your benchmark questions. Run rag/test_routing.py and measure routing accuracy. If accuracy is below 80%, adjust the signal patterns.
  3. Ask the same question in all four modes (code, docs, hybrid, auto) and compare the answers. For which questions does the mode choice matter most? For which does it barely matter?
  4. Test retrieval skipping on five general-knowledge questions and five codebase-specific questions. Verify that skipping works correctly for general questions and doesn't trigger for codebase questions.
  5. Integrate the router into the full RAG pipeline (rag/rag_with_routing.py). Run your complete benchmark through the routed pipeline and compare answer quality to the non-routed version from Lesson 1. Does routing improve accuracy? For which question categories?

Completion checkpoint

You should now have:

  • A retrieve(query, mode) service with four modes: code, docs, hybrid, and auto
  • An auto-routing classifier that picks the best retrieval mode based on question signals
  • Retrieval skip logic for general knowledge questions that don't need codebase evidence
  • Routing accuracy measured against labeled benchmark questions (target: 80%+)
  • A complete RAG pipeline with routing integrated, tested on your full benchmark

Reflection prompts

  • Which questions did the router misclassify? What would you need to fix those: better heuristics, or an LLM-based classifier?
  • How much did retrieval skipping save in terms of latency and tokens? Was the answer quality for skipped questions better, worse, or the same as when evidence was retrieved?
  • Are there question categories where mode routing made a significant quality difference? Categories where it didn't matter?

Connecting to the project

We've built a complete answer pipeline. A question comes in, the router classifies it and selects a retrieval strategy (or skips retrieval entirely), the context compiler produces a structured evidence bundle, and the answer generator produces a grounded response with citations and abstention. Every piece of retrieved evidence has provenance. Every claim in the answer can be traced to a source.

Now we need to make it visible and accountable. The pipeline works, but we can't see inside it during operation. How long does each stage take? How much does each answer cost? When routing goes wrong, how do we detect it? When the model hallucinates despite grounding, how do we catch it? Module 6 will add the observability and evaluation layers that turn this from a working prototype into a system you can operate and improve with confidence.

What's next

Telemetry. You now have a full pipeline with branching behavior; the next lesson makes that behavior visible so you can see routes, tool calls, latency, and failures instead of guessing.



Glossary
API (Application Programming Interface)Foundational terms
A structured way for programs to communicate. In this context, usually an HTTP endpoint you call to interact with an LLM.
AST (Abstract Syntax Tree)Foundational terms
A tree representation of source code structure. Used by parsers like Tree-sitter to understand code as a hierarchy of functions, classes, and statements. You'll encounter this more deeply in the Code Retrieval module, but the concept appears briefly in retrieval fundamentals.
BM25 (Best Match 25)Foundational terms
A classical ranking function for keyword search. Scores documents by term frequency and inverse document frequency. Often competitive with or complementary to vector search.
ChunkingFoundational terms
Splitting a document into smaller pieces for indexing and retrieval. Chunk boundaries significantly affect retrieval quality. Split at the wrong place and your retrieval will return half a function or the end of one paragraph glued to the start of another.
Context engineeringFoundational terms
The discipline of selecting, packaging, and budgeting the information a model sees at inference time. Prompts, retrieved evidence, tool results, memory, and state are all parts of context. Context engineering is arguably the core skill of AI engineering. Bigger context windows are not a substitute for better context selection.
Context rotFoundational terms
Degradation of output quality caused by stale, noisy, or accumulated context. Symptoms include stale memory facts, conflicting retrieved evidence, bloated prompt history, and accumulated instructions that contradict each other. A form of technical debt in AI systems.
Context windowFoundational terms
The maximum number of tokens an LLM can process in a single request (input + output combined).
EmbeddingFoundational terms
A fixed-length numeric vector representing a piece of text. Used for similarity search: texts with similar meanings have nearby embeddings.
EndpointFoundational terms
A specific URL path that accepts requests and returns responses (e.g., POST /v1/chat/completions).
GGUFFoundational terms
A file format for quantized models used by llama.cpp and Ollama. When you see a model name like qwen2.5:7b-q4_K_M, the suffix indicates the quantization scheme. GGUF supports mixed quantization (different precision for different layers) and is the most common format for local inference.
HallucinationFoundational terms
When a model generates content that sounds confident but isn't supported by the evidence it was given, or fabricates details that don't exist. Not the same as "any wrong answer"; a model that misinterprets ambiguous instructions gave a bad answer but didn't hallucinate. Common causes: weak prompt, missing context, context rot, model limitation, or retrieval failure.
InferenceFoundational terms
Running a trained model to generate output from input. What happens when you call an API. Most AI engineering work is inference-time work: building systems around models, not training them. Use "inference," not "inferencing."
JSON (JavaScript Object Notation)Foundational terms
A lightweight text format for structured data. The lingua franca of API communication.
Lexical searchFoundational terms
Finding items by matching keywords or terms. Includes BM25, TF-IDF (Term Frequency–Inverse Document Frequency), and simple keyword matching. Returns exact term matches, not semantic similarity.
LLM (Large Language Model)Foundational terms
A neural network trained on large text corpora that generates text by predicting the next token. The core technology behind AI engineering; every tool, pattern, and pipeline in this curriculum runs on top of one.
MetadataFoundational terms
Structured information about a document or chunk (file path, language, author, date, symbol type). Used for filtering retrieval results.
Neural networkFoundational terms
A computing system loosely inspired by biological neurons, built from layers of mathematical functions that transform inputs into outputs. LLMs are a specific type of neural network (transformers) trained on text. You don't need to understand neural network internals to do AI engineering, but knowing the term helps when reading external resources.
Reasoning modelFoundational terms
A model optimized for complex multi-step planning, math, and logic (e.g., o3, o4-mini). Slower and more expensive but better on hard problems. Sometimes called "LRM" (large reasoning model), but "reasoning model" is the more consistent term across provider docs.
RerankingFoundational terms
A second-pass scoring step that re-orders retrieved results using a more expensive model. Improves precision after an initial broad retrieval.
SchemaFoundational terms
A formal description of the shape and types of a data structure. Used to validate inputs and outputs.
SLM (small language model)Foundational terms
A compact model (typically 1-7B parameters) that runs on consumer hardware with lower cost, latency, and better privacy (e.g., Phi, small Llama variants, Gemma). The right choice when privacy, offline operation, predictable cost, or low latency matter more than peak capability.
System promptFoundational terms
A special message that sets the model's behavior, role, and constraints for a conversation.
TemperatureFoundational terms
A parameter controlling output randomness. Lower values produce more deterministic output; higher values produce more varied output. Does not affect the model's intelligence.
TokenFoundational terms
The basic unit an LLM processes. Not a word. Tokens are sub-word fragments. "unhappiness" might be three tokens: "un", "happi", "ness". Token count determines cost and context window usage.
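For quick budget math, a common rule of thumb is that English text averages roughly four characters per token. The helper below is a hypothetical sketch built on that heuristic, not a real tokenizer; exact counts require the model's own tokenizer (e.g., tiktoken for OpenAI models):

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: English text averages ~4 characters per token.
    # Use the model's own tokenizer (e.g., tiktoken) for exact counts.
    return max(1, len(text) // 4)

print(estimate_tokens("unhappiness"))  # 11 chars // 4 → 2
```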
Top-k
The number of results returned from a retrieval query. "Top-5" means the five highest-scoring results.
Top-p (nucleus sampling)
An alternative to temperature for controlling output diversity. Selects from the smallest set of tokens whose cumulative probability exceeds p.
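The selection step can be sketched in a few lines. `nucleus_set` is an illustrative helper with made-up probabilities; a real sampler would renormalize over the kept set and then draw one token:

```python
def nucleus_set(probs: dict[str, float], p: float) -> list[str]:
    # Keep the highest-probability tokens until their cumulative
    # probability reaches p; everything else is excluded from sampling.
    total = 0.0
    kept = []
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append(token)
        total += prob
        if total >= p:
            break
    return kept

print(nucleus_set({"the": 0.5, "a": 0.3, "cat": 0.15, "dog": 0.05}, p=0.7))
# → ['the', 'a']
```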
Vector search
Finding items by proximity in embedding space (nearest neighbors). Returns "similar" results, not "exact match" results.
vLLM (virtual LLM)
An inference serving engine (not a model) that hosts open-weight models behind an OpenAI-compatible HTTP endpoint. Infrastructure layer, not model layer. Relevant when moving from hosted APIs to self-hosting.
Weights
The learned parameters inside a model. Changed during training, fixed during inference.
Workhorse model
A general-purpose LLM optimized for speed and broad capability (e.g., GPT-4o-mini, Claude Haiku, Gemini Flash). The default for most tasks. When someone says "LLM" without qualification, they usually mean this.
Benchmark and Harness terms

Baseline
The first measured performance of your system on a benchmark. Everything else is compared against this. Without a baseline, you can't tell whether a change helped.
Benchmark
A fixed set of questions or tasks with known-good answers, used to measure system performance over time.
Run log
A structured record (typically JSONL) of every system run: what input was given, what output was produced, what tools were called, how long it took, and what it cost. The raw data that evals, telemetry, and cost analysis are built from.
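A minimal JSONL logger sketch. The field names (`question`, `tools_called`, `latency_s`, `cost_usd`) are illustrative, not a required schema:

```python
import json
import time

def log_run(path: str, record: dict) -> None:
    # Append one JSON object per line -- the JSONL format described above.
    record.setdefault("ts", time.time())
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

log_run("runs.jsonl", {
    "question": "what does validate_path return?",
    "answer": "...",
    "tools_called": ["retrieve"],
    "latency_s": 1.8,
    "cost_usd": 0.0021,
})
```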
Agent and Tool Building terms

A2A (Agent-to-Agent protocol)
An open protocol for peer-to-peer agent collaboration. Agents discover each other's capabilities and delegate or negotiate tasks as equals. Different from MCP (which connects agents to tools, not to other agents) and from handoffs (which transfer control within one system).
Agent
A system where an LLM decides which tools to call, observes results, and iterates until a task is complete. Agent = model + tools + control loop.
Control loop
The code that manages the agent's cycle: send prompt, check for tool calls, execute tools, append results, repeat or finish.
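That cycle fits in a short loop. Everything below is a stand-in for the sketch: `model` is any callable playing the role of a chat-completions call, and the `{"tool": ...}` / `{"answer": ...}` reply shapes are invented here, not a provider API:

```python
def run_agent(model, tools: dict, user_message: str, max_steps: int = 10):
    # Minimal control loop: send messages, execute requested tools,
    # append results, repeat until the model returns a final answer.
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        reply = model(messages)
        if "tool" in reply:
            # Execute the requested tool; append the result for the model to observe.
            result = tools[reply["tool"]](**reply["args"])
            messages.append({"role": "tool", "content": str(result)})
        else:
            return reply["answer"]
    raise RuntimeError("agent did not finish within max_steps")

# Demo with a scripted stand-in model: request one tool call, then answer.
def scripted_model(messages):
    if any(m["role"] == "tool" for m in messages):
        return {"answer": "2 + 2 = " + messages[-1]["content"]}
    return {"tool": "add", "args": {"a": 2, "b": 2}}

print(run_agent(scripted_model, {"add": lambda a, b: a + b}, "what is 2+2?"))
# → 2 + 2 = 4
```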
Handoff
Passing control from one agent or specialist to another within an orchestrated system.
MCP (Model Context Protocol)
An open protocol for exposing tools, resources, and prompts to AI applications in a standardized way. Connects agents to capabilities (tools and data), not to other agents.
Tool calling / function calling
The model's ability to request execution of a specific function with structured arguments, rather than just generating text.
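A tool is typically declared to the model as a JSON Schema description of its parameters. The shape below follows the common OpenAI-style envelope, but the exact format varies by provider, and `search_code` is a made-up tool for illustration:

```python
# An illustrative tool declaration in the OpenAI-style format.
# The model never runs this function -- it only emits a structured
# request ({"name": "search_code", "arguments": {...}}) that your
# control loop executes.
search_tool = {
    "type": "function",
    "function": {
        "name": "search_code",
        "description": "Search the indexed codebase for a symbol or phrase.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search text."},
                "top_k": {"type": "integer", "description": "Results to return."},
            },
            "required": ["query"],
        },
    },
}
```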
Code Retrieval terms

Context compilation / context packing
The process of selecting and assembling the smallest useful set of evidence for a specific task. Not "dump everything retrieved into the prompt."
Grounding
Tying model assertions to specific evidence. A grounded answer cites what it found; an ungrounded answer asserts without evidence.
Hybrid retrieval
Combining multiple retrieval methods (e.g., vector search + keyword search + metadata filters) and merging or reranking the results.
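One common merging strategy is reciprocal rank fusion (RRF), sketched below with made-up chunk IDs; reranking with a model is the more expensive alternative:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    # Each document earns 1 / (k + rank) per list it appears in;
    # documents ranked well by multiple retrievers rise to the top.
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["chunk_a", "chunk_b", "chunk_c"]   # from vector search
keyword_hits = ["chunk_b", "chunk_d"]             # from keyword search
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
# → ['chunk_b', 'chunk_a', 'chunk_d', 'chunk_c']
```

chunk_b wins because it appears in both lists, even though neither retriever ranked it first.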
Knowledge graph
A data structure that stores entities and their relationships explicitly (e.g., "function A calls function B," "module X imports module Y"). Useful for traversal and dependency reasoning. One retrieval strategy among several, often overused when simpler metadata or adjacency tables would suffice.
RAG (Retrieval-Augmented Generation)
A pattern where the model's response is grounded in retrieved external evidence rather than relying solely on its training data.
Symbol table
A mapping of code identifiers (functions, classes, variables) to their locations and metadata.
Tree-sitter
An incremental parsing library that builds ASTs for source code. Used in this curriculum for code-aware chunking and symbol extraction.
RAG and Grounded Answers terms

Context pack
A structured bundle of evidence assembled for a specific task, with metadata about provenance, relevance, and token budget.
Evidence bundle
A collection of retrieved items grouped for a specific sub-task, with enough metadata to evaluate whether the evidence is relevant and sufficient.
Retrieval routing
Deciding which retrieval strategy or method to use for a given query. Different questions need different retrieval methods.
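A toy keyword router in the spirit of this module's auto mode. A production router would more likely use an LLM classifier; the keyword lists here are invented for illustration:

```python
def route(query: str) -> str:
    # Toy router: map a question to a retrieval mode, or skip retrieval.
    # The mode names match the lesson's retrieve(query, mode) service.
    q = query.lower()
    if any(w in q for w in ("in python", "difference between", "what is a")):
        return "none"      # general knowledge: skip retrieval entirely
    if any(w in q for w in ("readme", "design doc", "docs", "how does")):
        return "docs"      # conceptual questions: documentation retrieval
    if any(w in q for w in ("return", "def ", "class ", "function", "symbol")):
        return "code"      # precise symbol questions: code retrieval
    return "hybrid"        # no strong signal: combine both

print(route("what does validate_path return?"))  # → code
```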
Observability and Evals terms

Eval
A structured test that measures system quality. Not the same as training. Evals measure, they don't change the model.
Harness (AI harness / eval harness)
The experiment and evaluation framework around your model or agent. It runs benchmark tasks, captures outputs, logs traces, grades results, and compares system versions. It turns ad hoc "try it and see" into repeatable, comparable experiments. Typically includes: input dataset, prompt and tool configuration, model/provider selection, execution loop, logging, grading, and artifact capture.
LLM-as-judge
Using a language model to evaluate or grade the output of another model or system. Useful for scaling evaluation beyond manual review, but it requires a well-designed rubric, judge-consistency checks, and human spot-checking. Not a replacement for exact-match checks where they apply.
OpenTelemetry (OTel)
An open standard for collecting and exporting telemetry data (traces, metrics, logs). Vendor-agnostic.
RAGAS
A specific eval framework for retrieval-augmented generation. Measures metrics like faithfulness, relevance, and context precision. One tool example, not a foundational concept. Learn the metrics first, then the tool.
Span
A single operation within a trace (e.g., one tool call, one retrieval query). Traces are made of spans.
Telemetry
Structured data about system behavior: what happened, when, how long it took, what it cost. Includes traces, metrics, and events.
Trace
A structured record of one complete run through the system, including all steps, tool calls, and decisions.
Orchestration and Memory terms

Long-term memory
Persistent facts that survive across conversations. Requires write policies to manage what gets stored, updated, or deleted.
Orchestration
Explicit control over how tasks are routed, delegated, and synthesized across multiple agents or specialists.
Router
A component that decides which specialist or workflow path to use for a given query.
Specialist
An agent or workflow tuned for a narrow task (e.g., "code search," "documentation lookup," "test generation"). Specialists are composed by an orchestrator.
Thread memory
Conversation state that persists within a single session or thread.
Workflow memory
Intermediate state that persists within a multi-step task but doesn't survive beyond the workflow's completion.
Optimization terms

Catastrophic forgetting
When fine-tuning causes a model to lose capabilities it had before training. The model gets better at the fine-tuned task but worse at tasks it previously handled. PEFT methods like LoRA reduce this risk by freezing original weights.
Distillation
Training a smaller (student) model to reproduce the behavior of a larger (teacher) model on a specific task.
DPO (Direct Preference Optimization)
A method for preference-based model optimization that's simpler than RLHF, training the model directly on preference pairs without a separate reward model.
Fine-tuning
Updating a model's weights on task-specific data to change its behavior permanently. An umbrella term that includes SFT, instruction tuning, RLHF, DPO, and other techniques. See the fine-tuning landscape table in Lesson 8.3 for how these relate.
Full fine-tuning
Updating all of a model's parameters during training, as opposed to PEFT methods that update only a small subset. Requires significantly more GPU memory and compute. Produces the most thorough adaptation but carries higher risk of catastrophic forgetting.
Inference server
Software (like vLLM or Ollama) that hosts a model and serves inference requests.
Instruction tuning
A specific application of SFT where the training data consists of instruction-response pairs. This is how base models become chat models: the technique is SFT, the data format is instructions. Not a separate technique from SFT.
LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning method that trains small adapter matrices instead of updating all model weights. Dramatically reduces GPU memory and compute requirements.
Parameter count
The number of learned weights in a model, commonly expressed in billions (e.g., "7B" = 7 billion parameters). Determines memory requirements (roughly 2 bytes per parameter at FP16) and broadly correlates with capability, though training quality and architecture matter as much as size. See Model Selection and Serving for sizing guidance.
PEFT (Parameter-Efficient Fine-Tuning)
A family of methods (including LoRA) that fine-tune a small subset of parameters instead of the full model.
Preference optimization
Training methods (RLHF, DPO) that use human or automated preference signals to improve model behavior. "This output is better than that output" rather than "this is the correct output."
QLoRA (Quantized LoRA)
LoRA applied to a quantized (compressed) base model. Further reduces memory requirements, enabling fine-tuning on consumer hardware.
Quantization
Reducing the precision of model weights (e.g., FP16 → INT4) to shrink memory usage and increase inference speed at some quality cost. A 7B model at FP16 needs ~14 GB VRAM; quantized to 4-bit, it fits in ~4 GB. Common formats include GGUF (llama.cpp/Ollama), GPTQ and AWQ (vLLM/HuggingFace). See Model Selection and Serving for format details and tradeoffs.
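The memory figures in the definition fall out of simple arithmetic (weights only; KV cache and activations add overhead on top):

```python
def model_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    # Weight memory = parameter count x bytes per parameter.
    # FP16 = 16 bits = 2 bytes; 4-bit quantization = 0.5 bytes.
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

print(model_memory_gb(7, 16))  # FP16:  14.0 GB
print(model_memory_gb(7, 4))   # 4-bit:  3.5 GB
```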
Overfitting
When a model memorizes training examples instead of learning generalizable patterns. The model performs well on training data but poorly on new inputs. Detected by monitoring validation loss alongside training loss.
RLHF (Reinforcement Learning from Human Feedback)
A training method that uses human preference signals to improve model behavior through a reward model. More complex than DPO (requires training a separate reward model) but offers more control over the optimization objective.
SFT (Supervised Fine-Tuning)
Fine-tuning using input-output pairs where the desired output is known. The most common fine-tuning approach.
TRL (Transformer Reinforcement Learning)
A Hugging Face library for training language models with reinforcement learning, SFT, and other optimization methods.
Cross-cutting terms

Consumer chat app
The browser or desktop product meant for human conversation (ChatGPT, Claude, HuggingChat). Useful for experimentation, but not the same as API access.
Developer platform
The provider's API, billing, API-key, and developer-docs surface. This is what you need for this learning path.
Hosted API
The provider runs the model for you and you call it over HTTP.
Local inference
You run the model on your own machine.
Provider
The company or service that hosts a model API you call from code.
Prompt caching
Reusing computation from repeated prompt prefixes to reduce latency and cost on subsequent requests with the same prefix.
Rate limiting
Constraints on how many API requests you can make per unit of time. An operational concern that affects system design and cost.
Token budget
The maximum number of tokens you allocate for a specific part of the context (e.g., "retrieval evidence gets at most 4K tokens"). A context engineering tool for preventing any single component from dominating the context window.
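Enforcing a budget can be as simple as keeping ranked chunks until the allocation is spent. `trim_to_budget` is a sketch that uses the rough chars-per-token heuristic in place of a real tokenizer:

```python
def trim_to_budget(chunks: list[str], budget_tokens: int,
                   chars_per_token: int = 4) -> list[str]:
    # Keep whole chunks (assumed already ranked by relevance)
    # until the token budget is spent; drop the rest.
    kept, used = [], 0
    for chunk in chunks:
        cost = max(1, len(chunk) // chars_per_token)
        if used + cost > budget_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept
```

Keeping whole chunks (rather than cutting mid-chunk) trades a few unused tokens for evidence the model can actually read.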