Module 4: Code Retrieval Naive Vector Baseline

Flat Chunk/Vector Retrieval (Tier 1)

In the previous lesson, you categorized your benchmark questions by which retrieval method handles them. Some questions (the conceptual and vocabulary-mismatch ones) need semantic search. This lesson builds that semantic search as a naive baseline: chunk your files as plain text, embed the chunks, store them in a vector database, and retrieve the top-k results for each question.

This is an intentional failure lesson. We're building the simplest possible vector retrieval pipeline on purpose, running it against your benchmark, and carefully documenting what breaks. I've found this to be one of the most valuable exercises in my own journey, which is why it's emphasized here. The specific ways naive retrieval fails will tell you exactly what to fix in the upcoming Tiers 2, 3, and 4.

What you'll learn

  • Build a complete chunk-embed-store-retrieve pipeline for a code repository
  • Use Qdrant as a local vector database to index and query code chunks
  • Run your benchmark questions through naive vector retrieval and grade the results
  • Identify five common failure classes in naive code retrieval
  • Compare vector retrieval against the structured retrieval from the previous lesson on the same benchmark

Concepts

Chunking: the process of splitting documents into smaller pieces for embedding and retrieval. In naive retrieval, you chunk by character count or line count with no regard for code structure. A chunk might split a function in half, merge unrelated code, or separate a docstring from the function it describes. These boundary violations are the primary source of failures in naive retrieval.
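
To make the boundary problem concrete, here is a toy illustration (the validate_path function is made up for this example, and overlap is omitted to keep it short):

# Toy example: fixed-size chunking ignores code structure.
source = '''def validate_path(path):
    """Reject paths outside the repository root."""
    resolved = path.resolve()
    if REPO_ROOT not in resolved.parents:
        raise ValueError("path escapes repository root")
    return resolved
'''
chunk_size = 120  # deliberately tiny so the split is visible
chunks = [source[i:i + chunk_size] for i in range(0, len(source), chunk_size)]
for i, chunk in enumerate(chunks):
    print(f"--- chunk {i} ---")
    print(chunk)
# Chunk 0 cuts off in the middle of the if statement; chunk 1 starts mid-word
# with no indication of which function it belongs to.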

Embedding: converting text into a numerical vector that captures its meaning. An embedding model maps text to a point in a high-dimensional space where semantically similar texts are nearby. We'll use an embedding model to convert our code chunks into vectors for storage and search.

Vector database: a database optimized for storing vectors and finding the nearest neighbors to a query vector. It handles the math of similarity search so you can focus on what you store and how you query it.

Top-k retrieval: retrieving the k most similar chunks to a query. With naive retrieval, this is your only knob: retrieve more chunks (higher k) for better recall at the cost of more noise, or fewer chunks (lower k) for precision at the cost of missing relevant code.

Cosine similarity: a measure of how similar two vectors are, based on the angle between them. Values range from -1 (opposite) to 1 (identical direction). Most embedding models are trained so that cosine similarity correlates with semantic similarity.
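
Under the hood, cosine similarity is just the dot product of two vectors divided by the product of their lengths. Qdrant computes this for you, but a small sketch demystifies the score you'll see on each hit:

import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Angle-based similarity: 1.0 = same direction, 0.0 = orthogonal, -1.0 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [0.7, 0.7]))  # ~0.71: similar direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0: unrelated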

Problem-to-Tool Map

Problem class | Symptom | Cheapest thing to try first | Tool or approach
Need semantic code retrieval | Grep misses conceptual matches; the metadata index doesn't cover natural-language questions | The metadata index from the previous lesson | Chunk + embed + vector search (this lesson)
Retrieval returns irrelevant chunks | Top-k results don't contain the code the model needs to answer | Increase k | Better chunking (Tier 2)
Symbol boundaries are split | A function definition is split across two chunks; retrieval returns half a function | Larger chunk size | AST-aware chunking (Tier 2)
Code outranked by comments | Docstrings or comments rank higher than the actual implementation | Filter by file type | Metadata-enriched embeddings (Tier 2)

Default: Qdrant

Why this is the default: Qdrant runs locally with no external dependencies, supports metadata filtering and hybrid search, and scales beyond toy corpora. It's a good fit for the progression we're building. We'll use its filtering capabilities in the AST-aware lesson and its hybrid features in the graph/hybrid lesson.

Portable concept underneath: a retrieval store that accepts vectors, stores them with metadata, and returns nearest neighbors filtered by arbitrary conditions. Any vector database provides this.
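
As a sketch, that portable contract looks something like this (the names are illustrative, not any particular library's API):

from typing import Any, Protocol

class VectorStore(Protocol):
    """Minimal interface shared by Qdrant, Chroma, pgvector, and friends (illustrative only)."""

    def upsert(self, ids: list[int], vectors: list[list[float]], payloads: list[dict[str, Any]]) -> None:
        """Store vectors alongside arbitrary metadata payloads."""
        ...

    def query(self, vector: list[float], top_k: int, filters: dict[str, Any] | None = None) -> list[dict[str, Any]]:
        """Return the top_k nearest neighbors, optionally restricted by metadata filters."""
        ...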

Closest alternatives and when to switch:

  • Chroma: use when you want the absolute simplest local setup and don't need filtering or hybrid features yet
  • pgvector: use when PostgreSQL is already your center of gravity and you don't want a separate database process
  • FAISS: use when you need raw speed for in-memory search and don't need persistence or metadata filtering
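
To show how little the walkthrough depends on Qdrant specifically, here is roughly what the store-and-query half would look like on Chroma, reusing the batch, embeddings, and query_vector variables from the scripts later in this lesson (a sketch assuming chromadb's PersistentClient API; check the Chroma docs before relying on it):

# Hypothetical Chroma equivalent of the Qdrant collection built below.
import chromadb

chroma = chromadb.PersistentClient(path="retrieval/chroma_data")
collection = chroma.get_or_create_collection("anchor-repo-naive")

# Store: Chroma calls payloads "metadatas" and keeps the raw text in "documents".
collection.add(
    ids=[c["chunk_id"] for c in batch],
    embeddings=embeddings,
    metadatas=[{"file_path": c["file_path"], "chunk_index": c["chunk_index"]} for c in batch],
    documents=[c["text"] for c in batch],
)

# Query: nearest neighbors for an already-embedded query vector.
results = collection.query(query_embeddings=[query_vector], n_results=5)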

Walkthrough

Install dependencies

cd anchor-repo
pip install qdrant-client openai

The embedding and answer steps below call the OpenAI API, so make sure OPENAI_API_KEY is set in your environment before running the scripts.

Chunk your repository

Let's start with the simplest possible chunking: split every file into fixed-size text chunks with overlap. This is intentionally naive so we can see the failures it produces.

# retrieval/chunk_files.py
"""Naive text chunking for the anchor repository."""
import json
from pathlib import Path

REPO_ROOT = Path(".").resolve()
EXCLUDED_DIRS = {".venv", ".git", "__pycache__", "node_modules", ".tox", ".mypy_cache", "qdrant_data"}  # qdrant_data is the local vector store created later in this lesson
CHUNK_SIZE = 800  # characters
CHUNK_OVERLAP = 200  # characters
OUTPUT_PATH = Path("retrieval/chunks.jsonl")

# File extensions to index
CODE_EXTENSIONS = {".py", ".js", ".ts", ".jsx", ".tsx", ".go", ".rs", ".java", ".md", ".yaml", ".yml", ".toml", ".json"}


def is_excluded(path: Path) -> bool:
    """Check whether a path should be skipped during indexing.

    Args:
        path: Path relative to the repository root.

    Returns:
        bool: True when the path belongs to an excluded directory.
    """
    return any(part in EXCLUDED_DIRS for part in path.parts)


def chunk_text(text: str, chunk_size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> list[str]:
    """Split text into overlapping chunks by character count.

    Args:
        text: Full file contents to break into retrieval chunks.
        chunk_size: Maximum characters to place in each chunk.
        overlap: Number of characters to carry into the next chunk.

    Returns:
        list[str]: Overlapping text chunks in source order.
    """
    if len(text) <= chunk_size:
        return [text]
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap
    return chunks


def build_chunks():
    """Walk the repository and emit naive retrieval chunks.

    Returns:
        list[dict]: Chunk records ready to be written to the JSONL index file.
    """
    all_chunks = []
    chunk_id = 0

    for path in sorted(REPO_ROOT.rglob("*")):
        if not path.is_file():
            continue
        if is_excluded(path.relative_to(REPO_ROOT)):
            continue
        if path.suffix not in CODE_EXTENSIONS:
            continue

        try:
            text = path.read_text(errors="replace")
        except Exception:
            continue

        if not text.strip():
            continue

        rel_path = str(path.relative_to(REPO_ROOT))
        chunks = chunk_text(text)

        for i, chunk in enumerate(chunks):
            all_chunks.append({
                "chunk_id": f"chunk-{chunk_id:05d}",
                "file_path": rel_path,
                "chunk_index": i,
                "total_chunks": len(chunks),
                "text": chunk,
                "char_count": len(chunk),
            })
            chunk_id += 1

    # Write chunks to JSONL
    with open(OUTPUT_PATH, "w") as f:
        for chunk in all_chunks:
            f.write(json.dumps(chunk) + "\n")

    print(f"Created {len(all_chunks)} chunks from {len(set(c['file_path'] for c in all_chunks))} files")
    print(f"Average chunk size: {sum(c['char_count'] for c in all_chunks) // len(all_chunks)} chars")
    print(f"Chunks saved to {OUTPUT_PATH}")
    return all_chunks


if __name__ == "__main__":
    build_chunks()
Run the script:

python retrieval/chunk_files.py

Expected output:

Created 142 chunks from 23 files
Average chunk size: 723 chars
Chunks saved to retrieval/chunks.jsonl

Take a moment to look at the chunks. Open retrieval/chunks.jsonl and scan a few entries. You'll notice chunks that split functions mid-body, merge a docstring from one function with the body of another, or contain a fragment of a class with no context about which class it belongs to. These are the boundary violations we'll fix with AST-aware chunking in the next lesson.
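
If you'd rather not eyeball the whole file, a small helper can surface likely boundary violations. The heuristic below (a hypothetical helper, not part of the core pipeline) flags chunks that start with indented code but never open a def or class, which usually means they're fragments of a larger symbol:

# retrieval/spot_check_chunks.py (optional helper)
"""Flag chunks that look like they start in the middle of a function or class."""
import json
from pathlib import Path

suspects = []
with open(Path("retrieval/chunks.jsonl")) as f:
    for line in f:
        chunk = json.loads(line)
        text = chunk["text"]
        first_line = text.lstrip("\n").splitlines()[0] if text.strip() else ""
        # Heuristic: indented first line and no def/class opener anywhere in the chunk.
        starts_indented = first_line.startswith((" ", "\t"))
        has_opener = "def " in text or "class " in text
        if starts_indented and not has_opener and chunk["chunk_index"] > 0:
            suspects.append(chunk)

print(f"{len(suspects)} chunks look like mid-symbol fragments")
for chunk in suspects[:5]:
    print(f"  {chunk['chunk_id']}  {chunk['file_path']}  (chunk {chunk['chunk_index'] + 1}/{chunk['total_chunks']})")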

Embed and store in Qdrant

# retrieval/embed_and_store.py
"""Embed chunks and store them in a local Qdrant collection."""
import json
from pathlib import Path
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

CHUNKS_PATH = Path("retrieval/chunks.jsonl")
COLLECTION_NAME = "anchor-repo-naive"
EMBEDDING_MODEL = "text-embedding-3-small"
EMBEDDING_DIM = 1536
BATCH_SIZE = 50

client = OpenAI()
qdrant = QdrantClient(path="retrieval/qdrant_data")


def load_chunks() -> list[dict]:
    """Load chunk records from the JSONL file.

    Returns:
        list[dict]: Chunk payloads ready for embedding and storage.
    """
    chunks = []
    with open(CHUNKS_PATH) as f:
        for line in f:
            if line.strip():
                chunks.append(json.loads(line))
    return chunks


def embed_texts(texts: list[str]) -> list[list[float]]:
    """Embed a batch of chunk texts with the configured provider model.

    Args:
        texts: Chunk texts to convert into embedding vectors.

    Returns:
        list[list[float]]: Embedding vectors aligned with the input order.
    """
    response = client.embeddings.create(model=EMBEDDING_MODEL, input=texts)
    return [item.embedding for item in response.data]


def create_collection():
    """Create or recreate the Qdrant collection used for naive retrieval.

    Returns:
        None
    """
    collections = [c.name for c in qdrant.get_collections().collections]
    if COLLECTION_NAME in collections:
        qdrant.delete_collection(COLLECTION_NAME)
        print(f"Deleted existing collection '{COLLECTION_NAME}'")
    qdrant.create_collection(
        collection_name=COLLECTION_NAME,
        vectors_config=VectorParams(size=EMBEDDING_DIM, distance=Distance.COSINE),
    )
    print(f"Created collection '{COLLECTION_NAME}'")


def embed_and_store():
    """Embed all indexed chunks and upsert them into Qdrant.

    Returns:
        None
    """
    chunks = load_chunks()
    create_collection()
    for batch_start in range(0, len(chunks), BATCH_SIZE):
        batch = chunks[batch_start:batch_start + BATCH_SIZE]
        texts = [c["text"] for c in batch]
        embeddings = embed_texts(texts)
        points = [
            PointStruct(
                id=batch_start + i, vector=embedding,
                payload={"chunk_id": chunk["chunk_id"], "file_path": chunk["file_path"],
                         "chunk_index": chunk["chunk_index"], "total_chunks": chunk["total_chunks"],
                         "text": chunk["text"]},
            )
            for i, (chunk, embedding) in enumerate(zip(batch, embeddings))
        ]
        qdrant.upsert(collection_name=COLLECTION_NAME, points=points)
        print(f"  Stored {batch_start + len(batch)}/{len(chunks)} chunks")
    print(f"\nDone. {len(chunks)} chunks embedded and stored in '{COLLECTION_NAME}'")


if __name__ == "__main__":
    embed_and_store()
Run the script:

python retrieval/embed_and_store.py

Expected output (the "Deleted existing collection" line only appears on re-runs):

Deleted existing collection 'anchor-repo-naive'
Created collection 'anchor-repo-naive'
  Stored 50/142 chunks
  Stored 100/142 chunks
  Stored 142/142 chunks

Done. 142 chunks embedded and stored in 'anchor-repo-naive'

Retrieve and answer benchmark questions

Now we'll build a retrieval function and run it through the benchmark:

# retrieval/naive_retrieve.py
"""Naive vector retrieval: embed the query, find top-k chunks, return them."""
import sys
from openai import OpenAI
from qdrant_client import QdrantClient

COLLECTION_NAME = "anchor-repo-naive"
EMBEDDING_MODEL = "text-embedding-3-small"
TOP_K = 5

client = OpenAI()
qdrant = QdrantClient(path="retrieval/qdrant_data")

SYSTEM_PROMPT = (
    "You are a code assistant. Answer the question using ONLY the "
    "retrieved code context below. If the context doesn't contain "
    "enough information, say so."
)


def retrieve(query: str, top_k: int = TOP_K) -> list[dict]:
    """Retrieve the top matching chunks for a query.

    Args:
        query: Natural-language or code query to embed and search.
        top_k: Number of top-ranked chunks to return.

    Returns:
        list[dict]: Retrieved chunk metadata with scores and text previews.
    """
    response = client.embeddings.create(model=EMBEDDING_MODEL, input=[query])
    query_vector = response.data[0].embedding
    results = qdrant.query_points(
        collection_name=COLLECTION_NAME, query=query_vector, limit=top_k,
    )
    return [
        {"file_path": hit.payload["file_path"], "chunk_id": hit.payload["chunk_id"],
         "score": round(hit.score, 4), "text": hit.payload["text"]}
        for hit in results.points
    ]


def retrieve_and_answer(question: str, model: str = "gpt-4o-mini") -> dict:
    """Retrieve evidence and generate an answer from the naive vector baseline.

    Args:
        question: User question to answer from retrieved chunks.
        model: Generation model used for the answer step.

    Returns:
        dict: Final answer, supporting chunks, and retrieval-method metadata.
    """
    chunks = retrieve(question)
    context = "\n\n---\n\n".join(
        f"File: {c['file_path']} (score: {c['score']})\n{c['text']}" for c in chunks
    )
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": f"{SYSTEM_PROMPT}\n\nRetrieved context:\n{context}"},
            {"role": "user", "content": question},
        ],
        temperature=0,
    )
    return {"answer": response.choices[0].message.content, "chunks_used": chunks,
            "retrieval_method": "naive_vector"}


if __name__ == "__main__":
    question = sys.argv[1] if len(sys.argv) > 1 else "What are the main modules in this repository?"
    print(f"Question: {question}\n")
    chunks = retrieve(question)
    print(f"Top {len(chunks)} chunks:")
    for c in chunks:
        print(f"  [{c['score']}] {c['file_path']}: {c['text'][:80]}...")
    print()
    result = retrieve_and_answer(question)
    print(f"Answer:\n{result['answer']}")
Test it with a single question:

python retrieval/naive_retrieve.py "Where is the main entry point of the application?"
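
This is also a good place to see the top-k knob from the Concepts section in action. A quick sketch (gold_file is a hypothetical expected evidence file; substitute one from your own benchmark):

# Sketch: does the expected file appear in the top-k results as k grows?
from retrieval.naive_retrieve import retrieve

question = "Where is the main entry point of the application?"
gold_file = "app/main.py"  # hypothetical; use a file from your own benchmark

for k in (1, 3, 5, 10):
    hits = retrieve(question, top_k=k)
    found = any(h["file_path"] == gold_file for h in hits)
    print(f"k={k:2d}  hit={'yes' if found else 'no '}  files={[h['file_path'] for h in hits]}")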

Run the benchmark and grade

# retrieval/run_naive_benchmark.py
"""Run benchmark questions through naive vector retrieval and grade."""
import json
import os
from datetime import datetime, timezone
from pathlib import Path
from retrieval.naive_retrieve import retrieve_and_answer

RUN_ID = "naive-vector-v1-" + datetime.now(timezone.utc).strftime("%Y-%m-%d-%H%M%S")
MODEL = "gpt-4o-mini"
PROVIDER = "openai"
BENCHMARK_FILE = Path("benchmark-questions.jsonl")
OUTPUT_FILE = Path(f"harness/runs/{RUN_ID}.jsonl")
REPO_SHA = os.popen("git rev-parse --short HEAD").read().strip()


def run_benchmark():
    """Run the benchmark set through the naive vector baseline.

    Returns:
        None
    """
    questions = []
    with open(BENCHMARK_FILE) as f:
        for line in f:
            if line.strip():
                questions.append(json.loads(line))

    print(f"Running {len(questions)} benchmark questions through naive vector retrieval")
    print(f"Run ID: {RUN_ID}\n")

    results = []
    for i, q in enumerate(questions):
        print(f"[{i+1}/{len(questions)}] {q['category']}: {q['question'][:60]}...")
        result = retrieve_and_answer(q["question"], model=MODEL)

        entry = {
            "run_id": RUN_ID,
            "question_id": q["id"],
            "question": q["question"],
            "category": q["category"],
            "answer": result["answer"],
            "model": MODEL,
            "provider": PROVIDER,
            "evidence_files": list(set(c["file_path"] for c in result["chunks_used"])),
            "chunk_scores": [c["score"] for c in result["chunks_used"]],
            "retrieval_method": "naive_vector",
            "grade": None,
            "failure_label": None,
            "grading_notes": "",
            "repo_sha": REPO_SHA,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "harness_version": "v0.2",
        }
        results.append(entry)

    os.makedirs("harness/runs", exist_ok=True)
    with open(OUTPUT_FILE, "w") as f:
        for entry in results:
            f.write(json.dumps(entry) + "\n")

    print(f"\nDone. {len(results)} results saved to {OUTPUT_FILE}")
    print("Next: grade these answers and compare against your Module 3 agent baseline.")


if __name__ == "__main__":
    run_benchmark()
Run it as a module from the repository root so the retrieval package import resolves:

python -m retrieval.run_naive_benchmark

After running, grade the results using the same grade_baseline.py from Module 2, then compare:

python harness/summarize_run.py harness/runs/naive-vector-v1-*-graded.jsonl
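
If you want a per-category comparison without re-reading every row, a short sketch like this works. It assumes both graded files follow the run-log schema above, that grades are recorded as "pass"/"fail", and that your Module 3 run lives at a path like the one shown (adjust the filenames to match your actual graded runs):

# compare_runs.py (hypothetical helper; point it at your own graded run files)
"""Per-category pass rates for two graded runs, side by side."""
import json
from collections import defaultdict

def pass_rate_by_category(path: str) -> dict[str, float]:
    """Fraction of graded answers marked 'pass' in each question category."""
    totals, passes = defaultdict(int), defaultdict(int)
    with open(path) as f:
        for line in f:
            row = json.loads(line)
            if row.get("grade") is None:
                continue  # skip ungraded rows
            totals[row["category"]] += 1
            if row["grade"] == "pass":
                passes[row["category"]] += 1
    return {cat: passes[cat] / totals[cat] for cat in totals}

agent = pass_rate_by_category("harness/runs/agent-baseline-v1-graded.jsonl")       # Module 3 run (example path)
naive = pass_rate_by_category("harness/runs/naive-vector-v1-YOURID-graded.jsonl")  # this lesson's run

for category in sorted(set(agent) | set(naive)):
    print(f"{category:30} agent={agent.get(category, 0.0):.0%}  naive={naive.get(category, 0.0):.0%}")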

What you'll see break

When you grade the results, you'll notice five common failure classes. I've listed them here because I've seen every one of them in real-world code retrieval systems:

  1. Exact symbol misses: The question asks "where is validate_path defined?" and vector search returns chunks about validation in general, but not the specific function. Grep would have found this instantly.

  2. Broken semantic units: A function was split across two chunks. The retrieval returns the bottom half (with the return statement) but not the top half (with the function signature and docstring). The model can't answer "what does this function do?" from half a function.

  3. Irrelevant neighbors: Chunks from unrelated files rank highly because they happen to use similar vocabulary. A question about error handling returns chunks about logging because both mention "error."

  4. No relationship structure: "What calls process_request?" requires knowing the call graph. Vector search finds chunks that mention the function but doesn't know which files call it versus which files define it.

  5. Oversized evidence bundles: You retrieve 5 chunks of 800 characters each. That's 4,000 characters of context, and maybe 400 of them are relevant. The model has to work through noise to find the signal.

These five failures map directly to the retrieval substrates we'll add in the next three lessons:

Failure class | What fixes it | Tier
Exact symbol misses | Combine vector search with lexical search | Tier 3 (hybrid)
Broken semantic units | Chunk by function/class boundaries, not character count | Tier 2 (AST-aware)
Irrelevant neighbors | Better chunk boundaries + reranking | Tiers 2 and 3
No relationship structure | Import and call graph edges | Tier 3 (graph)
Oversized evidence bundles | Context compilation and token budgeting | Tier 4 (context compiler)

Exercises

  1. Build the full pipeline: chunk_files.py → embed_and_store.py → naive_retrieve.py. Verify you can retrieve chunks for a simple query.
  2. Run run_naive_benchmark.py against your full benchmark. Grade at least 15 answers.
  3. For each graded answer, assign a failure label from the five classes above (or add your own if you see a different pattern); a tally sketch follows this list.
  4. Compare your naive vector results against your Module 3 agent results. Which categories improved? Which got worse? (Exact symbol lookups will often be worse with vector search than with grep. That's expected.)
  5. Open retrieval/chunks.jsonl and find three chunks where the character-boundary splitting produced obviously bad boundaries. Write down what the chunk should contain if you could split on code structure.
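
For exercise 3, a quick tally over your graded file shows which failure class dominates (a sketch, assuming you record labels in the failure_label field from the run-log schema above):

# Tally failure labels in a graded run (adjust the path to your own run file).
import json
from collections import Counter

labels = Counter()
with open("harness/runs/naive-vector-v1-YOURID-graded.jsonl") as f:
    for line in f:
        row = json.loads(line)
        if row.get("failure_label"):
            labels[row["failure_label"]] += 1

for label, count in labels.most_common():
    print(f"{count:3d}  {label}")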

Completion checkpoint

You have:

  • A working Qdrant collection with embedded chunks from your anchor repo
  • A benchmark run graded and compared against your Module 3 agent baseline
  • Failure labels assigned to at least 15 graded answers
  • A clear picture of which failure classes dominate your results
  • Three specific examples of bad chunk boundaries that you'll fix with AST-aware chunking

Retrieval Lab Notes

Before moving on, write up your observations in retrieval/tier1-lab-notes.md. This is the same practice from Module 1's retrieval fundamentals, now applied to your anchor repo at scale. These notes will be your requirements document for Tiers 2-4.

For each failure class you observed, document one specific benchmark question, what the retrieval returned, what it should have returned, and which failure class it falls into. We'll reference these notes throughout the remaining tiers.

Reflection prompts

  • Which failure class appeared most often in your graded results? What does that tell you about the biggest gap in naive vector retrieval?
  • Were there questions where vector search outperformed your Module 3 grep-based agent? What made those questions different?
  • Were there questions where vector search was worse than grep? What do those questions have in common?
  • If you could fix only one failure class before moving to AST-aware retrieval, which would have the biggest impact on your benchmark scores?

What's next

AST and Symbol Retrieval. The baseline will show boundary and symbol failures; the next lesson fixes the representation, not just the ranking.

