AST and Symbol-Aware Retrieval (Tier 2)
Your naive vector baseline showed you exactly where flat chunking breaks: functions split across chunk boundaries, classes severed from their methods, docstrings separated from the code they describe. You might be tempted to think of these as edge cases, but they're actually the normal behavior of character-based chunking on code. In this lesson, we'll fix those failures by parsing your code into its actual structure and chunking along the boundaries that the language itself defines.
The key idea is that code has grammar. Unlike prose, where paragraph breaks are soft suggestions, code has hard structural boundaries: functions, classes, methods, module-level blocks. A chunk that respects those boundaries is drastically more useful to a model than one that splits a function at character 800.
What you'll learn
- Parse code files with Tree-sitter to extract the abstract syntax tree (AST)
- Build a symbol table mapping every function, class, and method to its file and line range
- Create code-aware chunks that respect structural boundaries
- Re-index your anchor repo with AST-aware chunks and compare retrieval quality against the naive baseline
- Measure the improvement on symbol lookup and architecture benchmark questions
Concepts
Abstract Syntax Tree (AST): a tree representation of the syntactic structure of source code. Each node represents a construct: a function definition, a class, an if-statement, an import. The AST captures what the code means structurally, independent of formatting. We'll use the AST to know where functions start and end, which methods belong to which classes, and what each file exports.
Tree-sitter: an incremental parsing library that builds ASTs for source code. It supports many languages through grammar files, parses quickly enough for editor-scale use, and produces concrete syntax trees that include every token. We use it because it works across languages and doesn't require the code to be valid (it can parse partial or broken files).
Symbol table: a data structure that maps symbol names (functions, classes, variables) to their locations (file, start line, end line) and metadata (arguments, return types, parent class). In compilers, symbol tables are used for name resolution. In retrieval, we'll use them for precise symbol lookup: "where is X defined?" becomes a table lookup instead of a search query.
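To make the "table lookup instead of a search query" point concrete, here is a minimal sketch of that lookup. The table schema mirrors the `by_name` index we build later in this lesson, but the symbol records themselves are hypothetical examples, not from any real repo:

```python
# Minimal symbol-table lookup: "where is X defined?" becomes a dict access.
# The records below are hypothetical, using this lesson's by_name schema.
symbol_table = {
    "by_name": {
        "handle_request": [
            {"type": "function", "file": "routes/api.py",
             "start_line": 42, "end_line": 67, "parent_class": None},
        ],
        "save": [
            {"type": "function", "file": "models/user.py",
             "start_line": 30, "end_line": 41, "parent_class": "User"},
            {"type": "function", "file": "models/post.py",
             "start_line": 55, "end_line": 70, "parent_class": "Post"},
        ],
    }
}

def lookup(name: str) -> list[dict]:
    """Return every definition site for a symbol name (names can be ambiguous)."""
    return symbol_table["by_name"].get(name, [])

for sym in lookup("save"):
    qualified = f"{sym['parent_class']}.save" if sym["parent_class"] else "save"
    print(f"{qualified} defined in {sym['file']}:{sym['start_line']}-{sym['end_line']}")
```

Note that the lookup returns a list: a bare name like `save` can resolve to several definitions, which is exactly the ambiguity the `qualified_name` field exists to disambiguate.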
Structural chunking: splitting code into chunks that follow the code's own structure. Instead of splitting at every N characters, you split at function boundaries, class boundaries, and module-level blocks. Each chunk is a complete semantic unit that a model can understand without needing context from adjacent chunks.
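A toy comparison makes the difference visible. The source string below is a made-up two-function example, and the structural split here is a crude string-based stand-in for the Tree-sitter version built later in this lesson:

```python
# Toy comparison: fixed-size chunking vs. splitting at function boundaries.
# The source string is a hypothetical example, not from the anchor repo.
source = '''def parse_config(path):
    """Load and validate a config file."""
    raw = open(path).read()
    return validate(raw)

def validate(raw):
    """Reject configs missing required keys."""
    assert "name" in raw
    return raw
'''

# Fixed-size chunking: split every N characters, ignoring structure.
N = 80
fixed_chunks = [source[i:i + N] for i in range(0, len(source), N)]

# Structural chunking: split at function boundaries (a crude stand-in
# for the AST-based version; real code uses the parser, not str.split).
structural = [f"def {part}".rstrip() for part in source.split("def ") if part.strip()]

# At least one fixed-size chunk starts mid-definition; every structural
# chunk starts at a definition and contains the whole function.
print(any(not c.startswith("def ") for c in fixed_chunks))  # True: broken boundary
print(all(c.startswith("def ") for c in structural))        # True: whole functions
```

Each structural chunk here is a complete semantic unit: a model reading it sees the signature, docstring, and body together, with no dependence on an adjacent chunk.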
Problem-to-Tool Map
| Problem class | Symptom | Cheapest thing to try first | Tool or approach |
|---|---|---|---|
| Broken function boundaries | Retrieved chunks contain half a function | Increase chunk size | Parse with Tree-sitter, chunk by function/class boundary |
| Exact symbol lookup misses | Vector search returns related but wrong symbols | Grep for the symbol name | Symbol table with direct lookup |
| Missing structural context | Model doesn't know which class a method belongs to | Add file path to chunk metadata | Include parent class/module context in chunk |
| Language-specific failures | Python parser doesn't handle TypeScript files | Single-language indexing | Multi-language Tree-sitter grammars |
Default: Tree-sitter
Why this is the default: Tree-sitter parses many languages with a single API, handles broken/partial files gracefully, and runs fast enough to re-index on every commit. It gives us a consistent structural representation regardless of language.
Portable concept underneath: parse code into meaningful structural units instead of treating it as plain text. The specific parser matters less than the principle: code structure should inform chunking.
Closest alternatives and when to switch:
- Python ast module: use when your codebase is pure Python and you don't need multi-language support (we used this in the metadata index lesson; it works well for Python-only analysis)
- LSP-based symbol extraction: use when you need type information, cross-file resolution, or refactoring-grade accuracy
- ctags / universal-ctags: use when you only need symbol definitions and don't need full AST traversal
Walkthrough
Install Tree-sitter
cd anchor-repo
pip install tree-sitter tree-sitter-python

If your anchor repo includes other languages, install those grammars too:
# Only install what you need
pip install tree-sitter-javascript # for JS/JSX projects
pip install tree-sitter-typescript # for TypeScript/TSX projects (separate package)
pip install tree-sitter-go         # for Go projects

The tree-sitter Python package version 0.23+ uses a new API, and the code below targets that API. If you're using an older version, the Language import path and parser setup will differ. Check the tree-sitter Python bindings documentation if you're unsure which version you have.
Parse files and extract symbols
# retrieval/parse_ast.py
"""Parse code files with Tree-sitter and extract structural metadata."""
import json
from pathlib import Path
import tree_sitter_python as tspython
from tree_sitter import Language, Parser
REPO_ROOT = Path(".").resolve()
EXCLUDED_DIRS = {".venv", ".git", "__pycache__", "node_modules", ".tox", ".mypy_cache"}
SYMBOL_TABLE_PATH = Path("retrieval/symbol_table.json")
# Initialize the Python parser
PY_LANGUAGE = Language(tspython.language())
parser = Parser(PY_LANGUAGE)
def is_excluded(path: Path) -> bool:
"""Check whether a path should be skipped during repository traversal.
Args:
path: Repository-relative path to evaluate.
Returns:
``True`` when the path lives under an excluded directory, otherwise ``False``.
"""
return any(part in EXCLUDED_DIRS for part in path.parts)
def extract_symbols(file_path: Path) -> list[dict]:
"""Extract function and class symbols from one Python file.
Args:
file_path: Absolute path to the Python file to parse.
Returns:
A list of symbol dictionaries describing discovered classes and functions.
"""
source = file_path.read_bytes()
tree = parser.parse(source)
rel_path = str(file_path.relative_to(REPO_ROOT))
symbols = []
def visit(node, parent_class=None):
if node.type == "function_definition":
name_node = node.child_by_field_name("name")
params_node = node.child_by_field_name("parameters")
name = name_node.text.decode() if name_node else "<anonymous>"
params = params_node.text.decode() if params_node else "()"
# Extract docstring if present
body = node.child_by_field_name("body")
docstring = None
if body and body.children:
first_stmt = body.children[0]
if first_stmt.type == "expression_statement":
expr = first_stmt.children[0] if first_stmt.children else None
if expr and expr.type == "string":
docstring = expr.text.decode().strip("\"'")
symbols.append({
"type": "function",
"name": name,
"qualified_name": f"{parent_class}.{name}" if parent_class else name,
"file": rel_path,
"start_line": node.start_point[0] + 1,
"end_line": node.end_point[0] + 1,
"start_byte": node.start_byte,
"end_byte": node.end_byte,
"params": params,
"docstring": docstring,
"parent_class": parent_class,
})
elif node.type == "class_definition":
name_node = node.child_by_field_name("name")
name = name_node.text.decode() if name_node else "<anonymous>"
body = node.child_by_field_name("body")
# Extract class docstring
docstring = None
if body and body.children:
first_stmt = body.children[0]
if first_stmt.type == "expression_statement":
expr = first_stmt.children[0] if first_stmt.children else None
if expr and expr.type == "string":
docstring = expr.text.decode().strip("\"'")
symbols.append({
"type": "class",
"name": name,
"qualified_name": name,
"file": rel_path,
"start_line": node.start_point[0] + 1,
"end_line": node.end_point[0] + 1,
"start_byte": node.start_byte,
"end_byte": node.end_byte,
"docstring": docstring,
})
# Visit class body with parent context
if body:
for child in body.children:
visit(child, parent_class=name)
return # Don't recurse into children again
for child in node.children:
visit(child, parent_class=parent_class)
visit(tree.root_node)
return symbols
def build_symbol_table() -> dict:
"""Build a repository-wide symbol table from parsed Python files.
Args:
None.
Returns:
A symbol table with full symbol records and lookup indexes.
"""
all_symbols = []
files_parsed = 0
for path in sorted(REPO_ROOT.rglob("*.py")):
if is_excluded(path.relative_to(REPO_ROOT)):
continue
try:
symbols = extract_symbols(path)
all_symbols.extend(symbols)
files_parsed += 1
except Exception as e:
print(f" Warning: failed to parse {path}: {e}")
# Build lookup indexes
by_name = {}
for sym in all_symbols:
name = sym["name"]
if name not in by_name:
by_name[name] = []
by_name[name].append(sym)
table = {
"symbols": all_symbols,
"by_name": by_name,
"files_parsed": files_parsed,
"total_symbols": len(all_symbols),
}
SYMBOL_TABLE_PATH.write_text(json.dumps(table, indent=2))
functions = [s for s in all_symbols if s["type"] == "function"]
classes = [s for s in all_symbols if s["type"] == "class"]
print(f"Parsed {files_parsed} files")
print(f"Found {len(functions)} functions, {len(classes)} classes")
print(f"Symbol table saved to {SYMBOL_TABLE_PATH}")
return table
if __name__ == "__main__":
    build_symbol_table()

python retrieval/parse_ast.py

Expected output:
Parsed 23 files
Found 68 functions, 12 classes
Symbol table saved to retrieval/symbol_table.json
Create code-aware chunks
Now let's build chunks that follow the code's structure. Each function becomes its own chunk. Each class becomes a chunk (with its methods). Module-level code becomes a chunk. Nothing gets split mid-definition.
# retrieval/chunk_ast.py
"""Create code-aware chunks using Tree-sitter AST boundaries."""
import json
from pathlib import Path
import tree_sitter_python as tspython
from tree_sitter import Language, Parser
REPO_ROOT = Path(".").resolve()
EXCLUDED_DIRS = {".venv", ".git", "__pycache__", "node_modules", ".tox", ".mypy_cache"}
CODE_EXTENSIONS = {".py"} # Start with Python; extend for other languages
OUTPUT_PATH = Path("retrieval/chunks_ast.jsonl")
MAX_CHUNK_CHARS = 2000 # If a single function/class exceeds this, we'll include it whole but flag it
PY_LANGUAGE = Language(tspython.language())
parser = Parser(PY_LANGUAGE)
def is_excluded(path: Path) -> bool:
"""Check whether a path should be skipped during repository traversal.
Args:
path: Repository-relative path to evaluate.
Returns:
``True`` when the path lives under an excluded directory, otherwise ``False``.
"""
return any(part in EXCLUDED_DIRS for part in path.parts)
def structural_chunks(file_path: Path) -> list[dict]:
"""Split one source file into AST-aligned structural chunks.
Args:
file_path: Absolute path to the source file to chunk.
Returns:
A list of chunk dictionaries aligned to module, class, and function boundaries.
"""
source_bytes = file_path.read_bytes()
source_text = source_bytes.decode(errors="replace")
tree = parser.parse(source_bytes)
rel_path = str(file_path.relative_to(REPO_ROOT))
chunks = []
# Collect top-level nodes
root = tree.root_node
module_header_lines = [] # imports, module docstring, etc.
current_header_end = 0
for child in root.children:
if child.type in ("function_definition", "class_definition", "decorated_definition"):
# If there's module-level code above this definition, capture it
if current_header_end < child.start_byte:
header_text = source_bytes[current_header_end:child.start_byte].decode(errors="replace").strip()
if header_text:
module_header_lines.append(header_text)
# Extract the full definition as a chunk
node_text = source_bytes[child.start_byte:child.end_byte].decode(errors="replace")
start_line = child.start_point[0] + 1
end_line = child.end_point[0] + 1
# Determine the symbol name
actual_node = child
if child.type == "decorated_definition":
for sub in child.children:
if sub.type in ("function_definition", "class_definition"):
actual_node = sub
break
name_node = actual_node.child_by_field_name("name")
symbol_name = name_node.text.decode() if name_node else "<anonymous>"
symbol_type = "class" if actual_node.type == "class_definition" else "function"
chunks.append({
"file_path": rel_path,
"symbol_name": symbol_name,
"symbol_type": symbol_type,
"start_line": start_line,
"end_line": end_line,
"text": node_text,
"char_count": len(node_text),
"is_oversized": len(node_text) > MAX_CHUNK_CHARS,
})
current_header_end = child.end_byte
else:
# Module-level code (imports, assignments, etc.)
current_header_end = max(current_header_end, child.end_byte)
# Capture trailing module-level code
trailing = source_bytes[current_header_end:].decode(errors="replace").strip()
if trailing:
module_header_lines.append(trailing)
# Add module-level code as a single chunk
if module_header_lines:
header_text = "\n".join(module_header_lines)
chunks.insert(0, {
"file_path": rel_path,
"symbol_name": "__module__",
"symbol_type": "module",
"start_line": 1,
"end_line": None,
"text": header_text,
"char_count": len(header_text),
"is_oversized": len(header_text) > MAX_CHUNK_CHARS,
})
return chunks
def build_ast_chunks():
"""Build AST-aware chunks for all eligible source files in the repository.
Args:
None.
Returns:
None. Chunk records are written to the AST chunk JSONL file.
"""
all_chunks = []
chunk_id = 0
for path in sorted(REPO_ROOT.rglob("*")):
if not path.is_file():
continue
if is_excluded(path.relative_to(REPO_ROOT)):
continue
if path.suffix not in CODE_EXTENSIONS:
continue
try:
file_chunks = structural_chunks(path)
for chunk in file_chunks:
chunk["chunk_id"] = f"ast-{chunk_id:05d}"
all_chunks.append(chunk)
chunk_id += 1
except Exception as e:
print(f" Warning: failed to parse {path}: {e}")
with open(OUTPUT_PATH, "w") as f:
for chunk in all_chunks:
f.write(json.dumps(chunk) + "\n")
oversized = [c for c in all_chunks if c.get("is_oversized")]
print(f"Created {len(all_chunks)} AST-aware chunks from {len(set(c['file_path'] for c in all_chunks))} files")
print(f" Functions: {len([c for c in all_chunks if c['symbol_type'] == 'function'])}")
print(f" Classes: {len([c for c in all_chunks if c['symbol_type'] == 'class'])}")
print(f" Module-level: {len([c for c in all_chunks if c['symbol_type'] == 'module'])}")
if oversized:
print(f" Oversized chunks (>{MAX_CHUNK_CHARS} chars): {len(oversized)}")
print(f"Chunks saved to {OUTPUT_PATH}")
if __name__ == "__main__":
    build_ast_chunks()

python retrieval/chunk_ast.py

Expected output:
Created 103 AST-aware chunks from 23 files
Functions: 68
Classes: 12
Module-level: 23
Oversized chunks (>2000 chars): 3
Chunks saved to retrieval/chunks_ast.jsonl
Notice the difference: naive chunking produced 142 arbitrary chunks. AST-aware chunking produces 103 chunks that align with code structure. Each function is whole. Each class is whole. No split definitions.
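If you want to quantify that comparison yourself (as Exercise 2 asks), a small helper can summarize any chunk JSONL file. The sketch below runs on two inline sample records rather than the real files; in practice you'd pass it the lines of chunks_ast.jsonl and of whatever chunk file your naive baseline produced:

```python
import json

def chunk_stats(lines: list[str]) -> dict:
    """Summarize chunk count and size distribution for a JSONL chunk set."""
    chunks = [json.loads(line) for line in lines if line.strip()]
    sizes = [len(c["text"]) for c in chunks]
    return {
        "count": len(chunks),
        "avg_chars": round(sum(sizes) / len(sizes)) if sizes else 0,
        "max_chars": max(sizes) if sizes else 0,
    }

# Inline sample records (hypothetical). In practice:
#   chunk_stats(open("retrieval/chunks_ast.jsonl").readlines())
sample = [
    json.dumps({"chunk_id": "ast-00000", "text": "def f():\n    return 1\n"}),
    json.dumps({"chunk_id": "ast-00001", "text": "class C:\n    pass\n"}),
]
print(chunk_stats(sample))
```

Expect the AST-aware set to show fewer chunks with higher variance in size: whole functions vary in length, while fixed-size chunks by construction do not.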
Re-index with AST-aware chunks
Pick your provider for the embedding script. The AST chunking and Qdrant storage are identical across providers; only the embedding call differs.
# retrieval/embed_ast_chunks.py
"""Embed AST-aware chunks and store in a separate Qdrant collection."""
import json
from pathlib import Path
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
CHUNKS_PATH = Path("retrieval/chunks_ast.jsonl")
COLLECTION_NAME = "anchor-repo-ast"
EMBEDDING_MODEL = "text-embedding-3-small"
EMBEDDING_DIM = 1536
BATCH_SIZE = 50
client = OpenAI()
qdrant = QdrantClient(path="retrieval/qdrant_data")
def load_chunks():
"""Load AST-aware chunks from the JSONL chunk store.
Args:
None.
Returns:
A list of chunk dictionaries from the chunk store.
"""
chunks = []
with open(CHUNKS_PATH) as f:
for line in f:
if line.strip():
chunks.append(json.loads(line))
return chunks
def embed_texts(texts):
"""Generate embeddings for a batch of AST-aware chunk texts.
Args:
texts: Chunk texts to embed.
Returns:
A list of embedding vectors in the same order as the input texts.
"""
response = client.embeddings.create(model=EMBEDDING_MODEL, input=texts)
return [item.embedding for item in response.data]
def create_and_store():
"""Embed AST-aware chunks and store them in the Qdrant collection.
Args:
None.
Returns:
None. The vector collection is recreated and populated in place.
"""
chunks = load_chunks()
collections = [c.name for c in qdrant.get_collections().collections]
if COLLECTION_NAME in collections:
qdrant.delete_collection(COLLECTION_NAME)
qdrant.create_collection(
collection_name=COLLECTION_NAME,
vectors_config=VectorParams(size=EMBEDDING_DIM, distance=Distance.COSINE),
)
print(f"Created collection '{COLLECTION_NAME}'")
for batch_start in range(0, len(chunks), BATCH_SIZE):
batch = chunks[batch_start:batch_start + BATCH_SIZE]
texts = [f"{c['symbol_type']} {c['symbol_name']} in {c['file_path']}\n\n{c['text']}" for c in batch]
embeddings = embed_texts(texts)
points = [
PointStruct(id=batch_start + i, vector=emb, payload={
"chunk_id": chunk["chunk_id"], "file_path": chunk["file_path"],
"symbol_name": chunk["symbol_name"], "symbol_type": chunk["symbol_type"],
"start_line": chunk["start_line"], "end_line": chunk["end_line"],
"text": chunk["text"],
})
for i, (chunk, emb) in enumerate(zip(batch, embeddings))
]
qdrant.upsert(collection_name=COLLECTION_NAME, points=points)
print(f" Stored {batch_start + len(batch)}/{len(chunks)} chunks")
print(f"\nDone. {len(chunks)} AST-aware chunks stored in '{COLLECTION_NAME}'")
if __name__ == "__main__":
    create_and_store()

python retrieval/embed_ast_chunks.py

Compare naive vs. AST-aware retrieval
Use the same embedding provider you used for indexing:
# retrieval/compare_tiers.py
"""Compare naive vs. AST-aware retrieval on benchmark questions."""
import json
from pathlib import Path
from openai import OpenAI
from qdrant_client import QdrantClient
BENCHMARK_FILE = Path("benchmark-questions.jsonl")
EMBEDDING_MODEL = "text-embedding-3-small"
TOP_K = 5
client = OpenAI()
qdrant = QdrantClient(path="retrieval/qdrant_data")
def retrieve_from(collection, query, top_k=TOP_K):
"""Query one vector collection and return the top AST-aware matches.
Args:
collection: Qdrant collection name to query.
query: User question or lookup string to embed.
top_k: Number of matches to return.
Returns:
A list of ranked retrieval hits with file, symbol, and preview metadata.
"""
response = client.embeddings.create(model=EMBEDDING_MODEL, input=[query])
query_vector = response.data[0].embedding
results = qdrant.query_points(collection_name=collection, query=query_vector, limit=top_k)
return [{"file_path": hit.payload["file_path"], "score": round(hit.score, 4),
"text_preview": hit.payload["text"][:120],
"symbol_name": hit.payload.get("symbol_name", "n/a")}
for hit in results.points]
def compare():
"""Compare naive retrieval against AST-aware retrieval on benchmark questions.
Args:
None.
Returns:
None. Results are printed for manual inspection.
"""
questions = []
with open(BENCHMARK_FILE) as f:
for line in f:
if line.strip():
questions.append(json.loads(line))
print(f"Comparing naive vs. AST-aware retrieval on {min(len(questions), 15)} questions\n")
for q in questions[:15]:
print(f"[{q['category']}] {q['question'][:70]}")
naive = retrieve_from("anchor-repo-naive", q["question"])
ast_aware = retrieve_from("anchor-repo-ast", q["question"])
print(f" Naive: {', '.join(set(r['file_path'] for r in naive))}")
print(f" AST-aware: {', '.join(set(r['file_path'] for r in ast_aware))}")
for r in ast_aware[:3]:
print(f" [{r['score']}] {r['symbol_name']} in {r['file_path']}")
print()
if __name__ == "__main__":
    compare()

python -m retrieval.compare_tiers

You should see improvements in two areas:
- Symbol lookup questions: AST-aware retrieval returns the complete function or class, not a fragment. The model gets a whole definition to work with.
- Architecture questions: because each chunk is a named symbol with metadata, the retrieval results are more meaningful. Instead of "some text from main.py," you get "function handle_request in routes/api.py."
The questions where you won't see much improvement yet are relationship questions ("what calls this function?") and questions requiring cross-file reasoning. Those are what graph and hybrid retrieval will address.
Exercises
- Build the Tree-sitter parser and symbol table (parse_ast.py). Verify the symbol table includes every function and class in your repo.
- Build AST-aware chunks (chunk_ast.py). Compare the chunk count and average chunk size against the naive baseline. Open both JSONL files and compare three chunks from the same file.
- Embed and store AST-aware chunks (embed_ast_chunks.py). Run the tier comparison script and note which question categories improved.
- Run a full benchmark through AST-aware retrieval (modify run_naive_benchmark.py to use the AST collection). Grade at least 15 answers and compare against your naive baseline grades.
- Find a question where AST-aware retrieval finds the right file but the wrong symbol. What metadata would help the retrieval rank the correct symbol higher?
Completion checkpoint
You now have:
- A working Tree-sitter parser that extracts symbols from your anchor repo
- A symbol table with file, line range, and parent-class metadata for every function and class
- AST-aware chunks stored in a separate Qdrant collection
- A side-by-side comparison showing AST-aware retrieval's improvement over the naive baseline on symbol lookup and architecture questions
- Benchmark grades showing the overall improvement and the remaining failure categories
Reflection prompts
- How much did AST-aware chunking improve your benchmark scores? Was the improvement concentrated in specific question categories?
- Did the symbol table metadata (symbol name, parent class, line range) change which chunks the retrieval returned, or just make the returned chunks more useful?
- Which failure classes from the naive baseline are now resolved? Which ones remain?
- When you look at the remaining failures, do they involve relationships between code entities (imports, calls, dependencies)? That's the pattern we'll address next.
What's next
Graph/Hybrid Retrieval. Structure fixes symbol lookup, but relationship questions still need traversal and exact-match signals. That is the gap the next tier closes.
References
Start here
- Tree-sitter documentation — the official docs for Tree-sitter's API and grammar development
Build with this
- tree-sitter Python bindings — the Python package we use; includes API reference and examples
- Tree-sitter playground — interactive parser to visualize ASTs for different languages
Deep dive
- Tree-sitter GitHub — the core library with links to all available grammars
- Aider: Repository maps — how Aider uses Tree-sitter to build repository maps for LLM context; a practical example of AST-informed retrieval at scale