Module 1: Foundations of AI Engineering
Retrieval Basics

Retrieval Fundamentals

In this lesson we'll build the simplest possible retrieval pipeline, stress-test it, and see exactly where it breaks. That hands-on experience with failure modes is what makes the rest of the retrieval curriculum click. You'll know what problems you're solving before we introduce more sophisticated tools.

We're deliberately building something naive here. The goal isn't a production retrieval system. It's building your intuition for what goes wrong when retrieval is too simple, so you don't learn the wrong lesson that "embeddings + vector DB = done."

What you'll learn

  • Build a minimal embedding-based search over a small text corpus
  • Explain what chunking is and why chunk boundaries affect retrieval quality
  • Compare vector search against lexical search on the same queries
  • Use metadata filters to narrow retrieval results
  • Identify the specific failure modes of naive retrieval
  • Write a Retrieval Lab Notes memo documenting what broke and why

Concepts

Retrieval: the process of finding relevant information from a corpus to include in the model's context. Retrieval is a pattern, not a specific technology. It can be powered by vector search, lexical search, metadata filters, structural indexes, or any combination. The goal is always the same: select the right evidence for the current task.

Embedding: a fixed-length numeric vector representing a piece of text. Texts with similar meanings have nearby embeddings in vector space. You generate embeddings using an embedding model (from OpenAI, Gemini, Hugging Face-hosted providers, Ollama/open-weight models, or another embedding provider), then compare them using distance metrics like cosine similarity.

Vector search: finding the most similar embeddings to a query embedding. You embed the query, compare it against all stored embeddings, and return the closest matches. This is the most common retrieval method in AI systems, and also the most overused when simpler methods would suffice.

Chunking: splitting a document into smaller pieces before embedding. The model's context window is limited, so you cannot retrieve whole documents. How you split matters: chunks that break in the middle of a function, a paragraph, or a logical unit produce bad retrieval results. Chunk boundaries are one of the most common sources of retrieval failure.
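
To make the boundary problem concrete before the walkthrough, here is a minimal, self-contained sketch (ours, not part of the lab script; the toy document and helper names are illustrative) contrasting fixed-size word windows with heading-aware splits:

# chunking_demo.py -- illustrative sketch, not part of the lab script
doc = """# API Server

The development server runs on port 8000 by default.

## Authentication
Users authenticate via JWT tokens."""

def naive_chunks(text, size=8, overlap=2):
    """Fixed-size word windows: boundaries ignore document structure."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size - overlap)]

def heading_chunks(text):
    """Split on markdown headings so each chunk is one logical section."""
    chunks, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

print(naive_chunks(doc))    # "runs on" and "port 8000" can land in different windows
print(heading_chunks(doc))  # each section survives intact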

Lexical search: finding documents by keyword matching rather than semantic similarity. BM25 is the most common lexical ranking function. Lexical search excels at exact terms, identifiers, and names: the cases where vector search often fails because the embedding does not capture the exact string.

Metadata filters: narrowing retrieval results using structured attributes (file path, language, author, date, symbol type) before or after similarity search. Metadata filters are cheap and often more effective than improving the embedding model.

Reranking: a second-pass scoring step that re-orders retrieved results using a more expensive model. The first retrieval pass (vector or lexical) casts a wide net; the reranker picks the best results from that net. Reranking improves precision without changing your index.
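
We don't build a reranker in this lab, but the shape is worth seeing. Here is a minimal sketch, assuming the sentence-transformers package (pip install sentence-transformers) and its public ms-marco cross-encoder checkpoint; the rerank() helper is ours:

# rerank_demo.py -- hedged sketch of a second-pass reranker
from sentence_transformers import CrossEncoder

# A cross-encoder scores (query, text) pairs jointly -- slower but more precise
# than comparing precomputed embeddings.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidate_texts, top_k=3):
    """Re-score a wide-net candidate list and keep the best few."""
    pairs = [(query, text) for text in candidate_texts]
    scores = reranker.predict(pairs)  # one relevance score per pair
    ranked = sorted(zip(candidate_texts, scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]

# Typical use: the first pass (vector or BM25) returns ~20 candidates; rerank() keeps 3.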

RAG (Retrieval-Augmented Generation): a pattern where the model's response is grounded in retrieved evidence rather than relying solely on its training data. RAG is not a database. RAG is not a vector store. RAG is the pipeline: retrieve evidence, package it into context, and generate an answer that stays grounded in what was retrieved. The retrieval step can use any method: vector, lexical, structural, or hybrid.
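
To see the whole pattern in one place, here is a minimal sketch that reuses the load_corpus(), embed_texts(), and vector_search() functions you'll build in the walkthrough below; the chat model name is just an example:

# rag_demo.py -- hedged sketch of the retrieve -> package -> generate loop
from openai import OpenAI
from retrieval import load_corpus, embed_texts, vector_search

client = OpenAI()

chunks = load_corpus()
chunk_embeddings = embed_texts([c["text"] for c in chunks])

query = "What does the process_webhook function do?"
# Retrieve: top chunks become the evidence. Package: label each with its source.
evidence = "\n\n".join(
    f"[{c['source']}] {c['text']}"
    for c, _score in vector_search(query, chunks, chunk_embeddings)
)

# Generate: instruct the model to stay grounded in the retrieved evidence.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model; use whichever generation model you have
    messages=[
        {"role": "system", "content": "Answer using only the provided evidence. "
                                      "If the evidence is insufficient, say so."},
        {"role": "user", "content": f"Evidence:\n{evidence}\n\nQuestion: {query}"},
    ],
)
print(response.choices[0].message.content)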

Walkthrough

Project setup

Provider choice is made per capability, not as a single global vendor decision. Using one provider for generation and another for embeddings is normal. See Choosing a Provider for the platform distinctions and mixing rules.

Create a retrieval lab as a standalone sidecar project. Pick your provider tab for the setup and embeddings function. Everything else in the lab stays the same regardless of provider.

OpenAI Platform is the most direct path for this lab. The embeddings API is straightforward and well-documented.

mkdir retrieval-lab && cd retrieval-lab
python -m venv .venv && source .venv/bin/activate
pip install openai rank-bm25 numpy
export OPENAI_API_KEY="sk-..."

Create a small sample corpus. Make a corpus/ directory with three short markdown files:

mkdir corpus

<!-- corpus/api-server.md -->
# API Server

The development server runs on port 8000 by default.
Use `uvicorn app:app --reload` to start it.

## Authentication
Users authenticate via JWT tokens passed in the Authorization header.
The `verify_token` middleware validates tokens on every request.

## Authorization
Authorization is role-based. Roles are: admin, editor, viewer.
The `check_permission` decorator enforces role checks on endpoints.

<!-- corpus/webhooks.md -->
# Webhooks

## process_webhook function
The `process_webhook` function validates incoming webhook payloads,
verifies the HMAC signature, and dispatches to the appropriate handler.

It accepts a `WebhookEvent` Pydantic model and returns a `ProcessResult`.

## Retry policy
Failed webhook deliveries are retried 3 times with exponential backoff.
The retry queue is stored in Redis.

<!-- corpus/deployment.md -->
# Deployment

## Docker
Build with `docker build -t myapp .`
The Dockerfile uses a multi-stage build for smaller images.

## Environment variables
- `DATABASE_URL`: PostgreSQL connection string
- `REDIS_URL`: Redis connection string
- `SECRET_KEY`: JWT signing key (required, no default)

Build the simplest possible retrieval pipeline

Create retrieval.py by copying the full file from your provider tab:

# retrieval.py
import numpy as np
from openai import OpenAI
from pathlib import Path
from rank_bm25 import BM25Okapi

client = OpenAI()


def embed_texts(texts):
    """Generate embeddings for a batch of texts.

    Args:
        texts: Query or chunk texts to embed.

    Returns:
        A list of embedding vectors in the same order as the input texts.
    """
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [item.embedding for item in response.data]


# --- Step 1: Load and chunk the corpus ---

def load_corpus(corpus_dir="corpus"):
    """Load markdown files from disk and split them into naive chunks.

    Args:
        corpus_dir: Directory containing the markdown corpus to index.

    Returns:
        A list of chunk dictionaries with text, source filename, and chunk index.
    """
    chunks = []
    for path in sorted(Path(corpus_dir).glob("*.md")):
        text = path.read_text()
        # Naive chunking: fixed 50-word windows (roughly 300 characters) with a 10-word overlap
        words = text.split()
        chunk_size = 50  # words per chunk
        overlap = 10
        for i in range(0, len(words), chunk_size - overlap):
            chunk_text = " ".join(words[i : i + chunk_size])
            if chunk_text.strip():
                chunks.append({
                    "text": chunk_text,
                    "source": path.name,
                    "chunk_index": len(chunks),
                })
    return chunks


# --- Step 2: Similarity metric ---

def cosine_similarity(a, b):
    """Compute cosine similarity between two embedding vectors.

    Args:
        a: First embedding vector.
        b: Second embedding vector.

    Returns:
        Cosine similarity score between the two vectors.
    """
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


# --- Step 3: Vector search ---

def vector_search(query, chunks, chunk_embeddings, top_k=3):
    """Rank chunks by vector similarity to the query embedding.

    Args:
        query: Search query to embed.
        chunks: Chunk records aligned with ``chunk_embeddings``.
        chunk_embeddings: Precomputed embeddings for each chunk.
        top_k: Number of matches to return.

    Returns:
        The top ``top_k`` ``(chunk, score)`` pairs sorted by similarity.
    """
    query_embedding = embed_texts([query])[0]
    scores = [cosine_similarity(query_embedding, ce) for ce in chunk_embeddings]
    ranked = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
    return [(chunks[i], score) for i, score in ranked[:top_k]]


# --- Step 4: Lexical search (BM25) ---

def lexical_search(query, chunks, top_k=3):
    """Rank chunks with BM25 keyword search.

    Args:
        query: Search query to score lexically.
        chunks: Chunk records to search.
        top_k: Number of matches to return.

    Returns:
        The top ``top_k`` ``(chunk, score)`` pairs sorted by BM25 score.
    """
    tokenized = [c["text"].lower().split() for c in chunks]
    bm25 = BM25Okapi(tokenized)
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
    return [(chunks[i], score) for i, score in ranked[:top_k]]


# --- Step 5: Metadata-filtered search ---

def vector_search_filtered(query, chunks, chunk_embeddings, source_filter, top_k=3):
    """Run vector search after restricting candidates to one source file.

    Args:
        query: Search query to embed.
        chunks: Chunk records aligned with ``chunk_embeddings``.
        chunk_embeddings: Precomputed embeddings for each chunk.
        source_filter: Filename to keep during the search.
        top_k: Number of matches to return.

    Returns:
        The top ``top_k`` ``(chunk, score)`` pairs from the filtered source.
    """
    query_embedding = embed_texts([query])[0]
    scored = []
    for chunk, emb in zip(chunks, chunk_embeddings):
        if chunk["source"] == source_filter:
            scored.append((chunk, cosine_similarity(query_embedding, emb)))
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:top_k]


# --- Run it ---

if __name__ == "__main__":
    print("Loading corpus...")
    chunks = load_corpus()
    print(f"  {len(chunks)} chunks from {len(set(c['source'] for c in chunks))} files\n")

    print("Generating embeddings...")
    chunk_embeddings = embed_texts([c["text"] for c in chunks])
    print(f"  {len(chunk_embeddings)} embeddings generated\n")

    # --- Test queries ---
    queries = [
        ("Exact-name lookup", "What does the process_webhook function do?"),
        ("Near-duplicate topics", "What is the difference between authentication and authorization?"),
        ("Narrow fact", "What port does the development server run on?"),
    ]

    for label, query in queries:
        print(f"\n{'='*60}")
        print(f"Query [{label}]: {query}")
        print(f"{'='*60}")

        print("\n  VECTOR SEARCH:")
        for chunk, score in vector_search(query, chunks, chunk_embeddings):
            print(f"    [{score:.3f}] ({chunk['source']}) {chunk['text'][:80]}...")

        print("\n  LEXICAL SEARCH (BM25):")
        for chunk, score in lexical_search(query, chunks):
            if score > 0:
                print(f"    [{score:.2f}] ({chunk['source']}) {chunk['text'][:80]}...")

    # --- Metadata filter demo ---
    print(f"\n{'='*60}")
    print("METADATA FILTER: 'process_webhook' filtered to webhooks.md only")
    print(f"{'='*60}")
    for chunk, score in vector_search_filtered(
        "What does process_webhook do?", chunks, chunk_embeddings, "webhooks.md"
    ):
        print(f"  [{score:.3f}] ({chunk['source']}) {chunk['text'][:80]}...")

If you picked the Ollama tab instead, run `ollama pull embeddinggemma:latest` before you test the script so the embedding model is available locally.
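
For comparison, the Ollama-tab embed_texts() looks roughly like this. This is a sketch assuming a recent ollama-python package (pip install ollama) and its batch embed() API; field access details can vary between package versions:

# embed_texts() for the Ollama tab -- hedged sketch, same signature as above
import ollama

def embed_texts(texts):
    """Generate embeddings locally via Ollama instead of a hosted API."""
    response = ollama.embed(model="embeddinggemma:latest", input=texts)
    return response["embeddings"]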

Run it:

python retrieval.py

What to observe

Compare the three search methods across the three queries:

  • Exact-name lookup ("process_webhook"): BM25 should find the chunk with the function name directly. Vector search may return chunks about webhooks generally but miss the exact function reference.
  • Near-duplicate topics (auth vs authz): Vector search will likely return chunks about both authentication and authorization mixed together, because the two topics' embeddings are close. BM25 may do better if the exact words appear.
  • Narrow fact (port 8000): The answer is one line. If the naive chunking split it away from its surrounding context, neither search may rank it highly.

The metadata filter demo shows how filtering to a specific file eliminates irrelevant results entirely. It's a trivially cheap improvement.
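
If you're curious what "the next retrieval layer" often looks like: rather than picking vector or lexical, you can merge both rankings. Below is a minimal sketch of reciprocal rank fusion, one common merging scheme; the rrf_merge() helper is ours, and Module 4 treats hybrid retrieval properly:

# hybrid_demo.py -- hedged sketch of reciprocal rank fusion (RRF)
def rrf_merge(vector_results, lexical_results, k=60, top_k=3):
    """Merge two ranked (chunk, score) lists by reciprocal rank, not raw score."""
    fused = {}
    for results in (vector_results, lexical_results):
        for rank, (chunk, _score) in enumerate(results):
            key = (chunk["source"], chunk["chunk_index"])
            fused.setdefault(key, [0.0, chunk])
            fused[key][0] += 1.0 / (k + rank + 1)  # standard RRF weighting
    ranked = sorted(fused.values(), key=lambda item: item[0], reverse=True)
    return [(chunk, score) for score, chunk in ranked[:top_k]]

# merged = rrf_merge(vector_search(q, chunks, embs, top_k=10),
#                    lexical_search(q, chunks, top_k=10))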

A note on retrieval and hallucination

In LLM Mental Models we introduced hallucination as "the model generates confident content that isn't supported by evidence." You might expect that adding retrieval solves this; if the model has evidence, it won't make things up. That's partially true, but retrieval introduces its own failure mode: grounded-looking wrong answers.

If your retrieval pipeline returns the wrong chunks (because of a chunk boundary issue, an embedding miss, or a missing metadata filter), the model will faithfully use that wrong evidence to produce a confident, well-structured, wrong answer. It won't look like a hallucination... it'll look like the model did its job. In that case, the generation step did exactly what it was asked. The retrieval layer is where the error entered.

This is why the lab notes we're keeping are so important. When you document what the retrieval pipeline gets wrong, you're building the foundation for evaluating answer quality in Modules 5 and 6, where we'll teach grounding, citation, faithfulness checks, and the ability to say "I don't know."

Write your Retrieval Lab Notes

Create retrieval-lab-notes.md and document what the pipeline gets wrong:

<!-- retrieval-lab-notes.md -->
# Retrieval Lab Notes

## Query: "What does the process_webhook function do?"
- **Vector search returned:** [paste top results]
- **BM25 returned:** [paste top results]
- **Correct answer:** The function validates incoming webhook payloads, verifies HMAC signature, dispatches to handler. Accepts WebhookEvent, returns ProcessResult.
- **Failure cause:** [chunk boundary? embedding miss? both?]

## Query: "What is the difference between authentication and authorization?"
- **Vector search returned:** [paste top results]
- **BM25 returned:** [paste top results]
- **Correct answer:** Authentication = JWT token verification. Authorization = role-based (admin/editor/viewer) permission checks.
- **Failure cause:** [embeddings too similar? chunks mixed content?]

## Query: "What port does the development server run on?"
- **Vector search returned:** [paste top results]
- **BM25 returned:** [paste top results]
- **Correct answer:** Port 8000.
- **Failure cause:** [narrow fact lost in chunk? low embedding similarity?]

## What the next retrieval layer would need
- [your observations here]

Fill in the actual results from your run. This memo is the requirements document for Module 4, where you will build progressively better retrieval and measure each tier against these same failures.

Exercises

  1. Build the simplest possible embedding search over a small text corpus (10-20 documents).
  2. Stress-test it with at least 5 queries: exact-name lookups, near-duplicate topics, and narrow fact retrieval.
  3. Add metadata filters (at minimum, file path) and compare filtered vs unfiltered results.
  4. Add lexical search and compare against embedding-only retrieval on the same queries.
  5. Write your Retrieval Lab Notes: what the naive retrieval pipeline gets wrong and why. We'll want to keep this document. It'll become the baseline for Module 4.
  6. Verify your provider-specific embed_texts() from the setup tabs above works end to end. If you are Anthropic-first, note which embeddings provider you paired with it and why.

Reflection prompts

  • What types of queries did the naive pipeline fail on?
  • For each failure, was the cause a chunk boundary issue, an embedding miss, a missing filter, or something else?
  • What capability would the next retrieval layer need to add to fix these failures?

Completion checkpoint

You should now be able to:

  • Run an embedding search over a small corpus and return top-k results
  • Show at least two queries where vector search fails and explain why
  • Show at least one query where lexical search outperforms vector search
  • Show at least one query where metadata filters improve results
  • Produce a Retrieval Lab Notes memo that identifies specific failure modes and their causes
  • Explain how the same retrieval pipeline swaps between OpenAI, Gemini, Hugging Face, and Ollama embeddings, and why an Anthropic-first setup still needs a separate embeddings provider

Connecting to the project

The retrieval pipeline and lab notes you built here are standalone experiments over a small text corpus. Starting in Module 2, you'll choose an anchor repository (a real codebase) and your code assistant will need to retrieve code from it to answer questions.

We'll want to keep those lab notes. In Module 4 (Code Retrieval) they'll become the baseline you measure each retrieval improvement against: you'll build four progressively better tiers, each evaluated on the same failure cases you documented here.

The naive retrieval pipeline you built here will also reappear in Module 5 (RAG) as the baseline you'll improve upon. Every retrieval decision you make later (vector vs lexical vs structural vs hybrid) traces back to the failure modes you just observed.

What's next

Security Basics. Before the model starts touching tools, files, and larger loops, lock in the small safety habits that keep the rest of the build sane.


Glossary
Foundational terms

API (Application Programming Interface)
A structured way for programs to communicate. In this context, usually an HTTP endpoint you call to interact with an LLM.

AST (Abstract Syntax Tree)
A tree representation of source code structure. Used by parsers like Tree-sitter to understand code as a hierarchy of functions, classes, and statements. You'll encounter this more deeply in the Code Retrieval module, but the concept appears briefly in retrieval fundamentals.

BM25 (Best Match 25)
A classical ranking function for keyword search. Scores documents by term frequency and inverse document frequency. Often competitive with or complementary to vector search.

Chunking
Splitting a document into smaller pieces for indexing and retrieval. Chunk boundaries significantly affect retrieval quality. Split at the wrong place and your retrieval will return half a function or the end of one paragraph glued to the start of another.

Context engineering
The discipline of selecting, packaging, and budgeting the information a model sees at inference time. Prompts, retrieved evidence, tool results, memory, and state are all parts of context. Context engineering is arguably the core skill of AI engineering. Bigger context windows are not a substitute for better context selection.

Context rot
Degradation of output quality caused by stale, noisy, or accumulated context. Symptoms include stale memory facts, conflicting retrieved evidence, bloated prompt history, and accumulated instructions that contradict each other. A form of technical debt in AI systems.

Context window
The maximum number of tokens an LLM can process in a single request (input + output combined).

Embedding
A fixed-length numeric vector representing a piece of text. Used for similarity search: texts with similar meanings have nearby embeddings.

Endpoint
A specific URL path that accepts requests and returns responses (e.g., POST /v1/chat/completions).

GGUF
A file format for quantized models used by llama.cpp and Ollama. When you see a model name like qwen2.5:7b-q4_K_M, the suffix indicates the quantization scheme. GGUF supports mixed quantization (different precision for different layers) and is the most common format for local inference.

Hallucination
When a model generates content that sounds confident but isn't supported by the evidence it was given, or fabricates details that don't exist. Not the same as "any wrong answer"; a model that misinterprets ambiguous instructions gave a bad answer but didn't hallucinate. Common causes: weak prompt, missing context, context rot, model limitation, or retrieval failure.

Inference
Running a trained model to generate output from input. What happens when you call an API. Most AI engineering work is inference-time work: building systems around models, not training them. Use "inference," not "inferencing."

JSON (JavaScript Object Notation)
A lightweight text format for structured data. The lingua franca of API communication.

Lexical search
Finding items by matching keywords or terms. Includes BM25, TF-IDF (Term Frequency–Inverse Document Frequency), and simple keyword matching. Returns exact term matches, not semantic similarity.

LLM (Large Language Model)
A neural network trained on large text corpora that generates text by predicting the next token. The core technology behind AI engineering; every tool, pattern, and pipeline in this curriculum runs on top of one.

Metadata
Structured information about a document or chunk (file path, language, author, date, symbol type). Used for filtering retrieval results.

Neural network
A computing system loosely inspired by biological neurons, built from layers of mathematical functions that transform inputs into outputs. LLMs are a specific type of neural network (transformers) trained on text. You don't need to understand neural network internals to do AI engineering, but knowing the term helps when reading external resources.

Reasoning model
A model optimized for complex multi-step planning, math, and logic (e.g., o3, o4-mini). Slower and more expensive but better on hard problems. Sometimes called "LRM" (large reasoning model), but "reasoning model" is the more consistent term across provider docs.

Reranking
A second-pass scoring step that re-orders retrieved results using a more expensive model. Improves precision after an initial broad retrieval.

Schema
A formal description of the shape and types of a data structure. Used to validate inputs and outputs.

SLM (small language model)
A compact model (typically 1-7B parameters) that runs on consumer hardware with lower cost, latency, and better privacy (e.g., Phi, small Llama variants, Gemma). The right choice when privacy, offline operation, predictable cost, or low latency matter more than peak capability.

System prompt
A special message that sets the model's behavior, role, and constraints for a conversation.

Temperature
A parameter controlling output randomness. Lower values produce more deterministic output; higher values produce more varied output. Does not affect the model's intelligence.

Token
The basic unit an LLM processes. Not a word. Tokens are sub-word fragments. "unhappiness" might be three tokens: "un", "happi", "ness". Token count determines cost and context window usage.

Top-k
The number of results returned from a retrieval query. "Top-5" means the five highest-scoring results.

Top-p (nucleus sampling)
An alternative to temperature for controlling output diversity. Selects from the smallest set of tokens whose cumulative probability exceeds p.

Vector search
Finding items by proximity in embedding space (nearest neighbors). Returns "similar" results, not "exact match" results.

vLLM (virtual LLM)
An inference serving engine (not a model) that hosts open-weight models behind an OpenAI-compatible HTTP endpoint. Infrastructure layer, not model layer. Relevant when moving from hosted APIs to self-hosting.

Weights
The learned parameters inside a model. Changed during training, fixed during inference.

Workhorse model
A general-purpose LLM optimized for speed and broad capability (e.g., GPT-4o-mini, Claude Haiku, Gemini Flash). The default for most tasks. When someone says "LLM" without qualification, they usually mean this.

Benchmark and Harness terms

Baseline
The first measured performance of your system on a benchmark. Everything else is compared against this. Without a baseline, you can't tell whether a change helped.

Benchmark
A fixed set of questions or tasks with known-good answers, used to measure system performance over time.

Run log
A structured record (typically JSONL) of every system run: what input was given, what output was produced, what tools were called, how long it took, and what it cost. The raw data that evals, telemetry, and cost analysis are built from.

Agent and Tool Building terms

A2A (Agent-to-Agent protocol)
An open protocol for peer-to-peer agent collaboration. Agents discover each other's capabilities and delegate or negotiate tasks as equals. Different from MCP (which connects agents to tools, not to other agents) and from handoffs (which transfer control within one system).

Agent
A system where an LLM decides which tools to call, observes results, and iterates until a task is complete. Agent = model + tools + control loop.

Control loop
The code that manages the agent's cycle: send prompt, check for tool calls, execute tools, append results, repeat or finish.

Handoff
Passing control from one agent or specialist to another within an orchestrated system.

MCP (Model Context Protocol)
An open protocol for exposing tools, resources, and prompts to AI applications in a standardized way. Connects agents to capabilities (tools and data), not to other agents.

Tool calling / function calling
The model's ability to request execution of a specific function with structured arguments, rather than just generating text.

Code Retrieval terms

Context compilation / context packing
The process of selecting and assembling the smallest useful set of evidence for a specific task. Not "dump everything retrieved into the prompt."

Grounding
Tying model assertions to specific evidence. A grounded answer cites what it found; an ungrounded answer asserts without evidence.

Hybrid retrieval
Combining multiple retrieval methods (e.g., vector search + keyword search + metadata filters) and merging or reranking the results.

Knowledge graph
A data structure that stores entities and their relationships explicitly (e.g., "function A calls function B," "module X imports module Y"). Useful for traversal and dependency reasoning. One retrieval strategy among several, often overused when simpler metadata or adjacency tables would suffice.

RAG (Retrieval-Augmented Generation)
A pattern where the model's response is grounded in retrieved external evidence rather than relying solely on its training data.

Symbol table
A mapping of code identifiers (functions, classes, variables) to their locations and metadata.

Tree-sitter
An incremental parsing library that builds ASTs for source code. Used in this curriculum for code-aware chunking and symbol extraction.

RAG and Grounded Answers terms

Context pack
A structured bundle of evidence assembled for a specific task, with metadata about provenance, relevance, and token budget.

Evidence bundle
A collection of retrieved items grouped for a specific sub-task, with enough metadata to evaluate whether the evidence is relevant and sufficient.

Retrieval routing
Deciding which retrieval strategy or method to use for a given query. Different questions need different retrieval methods.

Observability and Evals terms

Eval
A structured test that measures system quality. Not the same as training. Evals measure, they don't change the model.

Harness (AI harness / eval harness)
The experiment and evaluation framework around your model or agent. It runs benchmark tasks, captures outputs, logs traces, grades results, and compares system versions. It turns ad hoc "try it and see" into repeatable, comparable experiments. Typically includes: input dataset, prompt and tool configuration, model/provider selection, execution loop, logging, grading, and artifact capture.

LLM-as-judge
Using a language model to evaluate or grade the output of another model or system. Useful for scaling evaluation beyond manual review, but requires rubric quality, judge consistency checks, and human spot-checking. Not a replacement for exact-match checks where they apply.

OpenTelemetry (OTel)
An open standard for collecting and exporting telemetry data (traces, metrics, logs). Vendor-agnostic.

RAGAS
A specific eval framework for retrieval-augmented generation. Measures metrics like faithfulness, relevance, and context precision. One tool example, not a foundational concept. Learn the metrics first, then the tool.

Span
A single operation within a trace (e.g., one tool call, one retrieval query). Traces are made of spans.

Telemetry
Structured data about system behavior: what happened, when, how long it took, what it cost. Includes traces, metrics, and events.

Trace
A structured record of one complete run through the system, including all steps, tool calls, and decisions.

Orchestration and Memory terms

Long-term memory
Persistent facts that survive across conversations. Requires write policies to manage what gets stored, updated, or deleted.

Orchestration
Explicit control over how tasks are routed, delegated, and synthesized across multiple agents or specialists.

Router
A component that decides which specialist or workflow path to use for a given query.

Specialist
An agent or workflow tuned for a narrow task (e.g., "code search," "documentation lookup," "test generation"). Specialists are composed by an orchestrator.

Thread memory
Conversation state that persists within a single session or thread.

Workflow memory
Intermediate state that persists within a multi-step task but doesn't survive beyond the workflow's completion.

Optimization terms

Catastrophic forgetting
When fine-tuning causes a model to lose capabilities it had before training. The model gets better at the fine-tuned task but worse at tasks it previously handled. PEFT methods like LoRA reduce this risk by freezing original weights.

Distillation
Training a smaller (student) model to reproduce the behavior of a larger (teacher) model on a specific task.

DPO (Direct Preference Optimization)
A method for preference-based model optimization that's simpler than RLHF, training the model directly on preference pairs without a separate reward model.

Fine-tuning
Updating a model's weights on task-specific data to change its behavior permanently. An umbrella term that includes SFT, instruction tuning, RLHF, DPO, and other techniques. See the fine-tuning landscape table in Lesson 8.3 for how these relate.

Full fine-tuning
Updating all of a model's parameters during training, as opposed to PEFT methods that update only a small subset. Requires significantly more GPU memory and compute. Produces the most thorough adaptation but carries higher risk of catastrophic forgetting.

Inference server
Software (like vLLM or Ollama) that hosts a model and serves inference requests.

Instruction tuning
A specific application of SFT where the training data consists of instruction-response pairs. This is how base models become chat models: the technique is SFT, the data format is instructions. Not a separate technique from SFT.

LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning method that trains small adapter matrices instead of updating all model weights. Dramatically reduces GPU memory and compute requirements.

Parameter count
The number of learned weights in a model, commonly expressed in billions (e.g., "7B" = 7 billion parameters). Determines memory requirements (roughly 2 bytes per parameter at FP16) and broadly correlates with capability, though training quality and architecture matter as much as size. See Model Selection and Serving for sizing guidance.

PEFT (Parameter-Efficient Fine-Tuning)
A family of methods (including LoRA) that fine-tune a small subset of parameters instead of the full model.

Preference optimization
Training methods (RLHF, DPO) that use human or automated preference signals to improve model behavior. "This output is better than that output" rather than "this is the correct output."

QLoRA (Quantized LoRA)
LoRA applied to a quantized (compressed) base model. Further reduces memory requirements, enabling fine-tuning on consumer hardware.

Quantization
Reducing the precision of model weights (e.g., FP16 → INT4) to shrink memory usage and increase inference speed at some quality cost. A 7B model at FP16 needs ~14 GB VRAM; quantized to 4-bit, it fits in ~4 GB. Common formats include GGUF (llama.cpp/Ollama), GPTQ and AWQ (vLLM/HuggingFace). See Model Selection and Serving for format details and tradeoffs.

Overfitting
When a model memorizes training examples instead of learning generalizable patterns. The model performs well on training data but poorly on new inputs. Detected by monitoring validation loss alongside training loss.

RLHF (Reinforcement Learning from Human Feedback)
A training method that uses human preference signals to improve model behavior through a reward model. More complex than DPO (requires training a separate reward model) but offers more control over the optimization objective.

SFT (Supervised Fine-Tuning)
Fine-tuning using input-output pairs where the desired output is known. The most common fine-tuning approach.

TRL (Transformer Reinforcement Learning)
A Hugging Face library for training language models with reinforcement learning, SFT, and other optimization methods.

Cross-cutting terms

Consumer chat app
The browser or desktop product meant for human conversation (ChatGPT, Claude, HuggingChat). Useful for experimentation, but not the same as API access.

Developer platform
The provider's API, billing, API-key, and developer-docs surface. This is what you need for this learning path.

Hosted API
The provider runs the model for you and you call it over HTTP.

Local inference
You run the model on your own machine.

Provider
The company or service that hosts a model API you call from code.

Prompt caching
Reusing computation from repeated prompt prefixes to reduce latency and cost on subsequent requests with the same prefix.

Rate limiting
Constraints on how many API requests you can make per unit of time. An operational concern that affects system design and cost.

Token budget
The maximum number of tokens you allocate for a specific part of the context (e.g., "retrieval evidence gets at most 4K tokens"). A context engineering tool for preventing any single component from dominating the context window.