Suggested Repository Layout
Annotated directory tree showing the anchor project's structure as it emerges across Modules 2-7. Each directory maps to a module's teaching content; each file is introduced in a specific lesson.
This layout isn't prescribed upfront -- it's what you'll have built by the time you finish the curriculum. Use it as a reference when you need to find where something lives or where new code should go.
Full directory tree
anchor-repo/
│
├── benchmark-questions.jsonl # Module 2: 30+ questions with gold answers
│ # Fields: id, question, category, gold_answer
│ # Extended with: expected_files, expected_symbols,
│ # expected_tools, expected_route (Modules 4-6)
│
├── harness/ # Module 2 + Module 6: experiment framework
│ ├── schema.py # Run-log schema definition (SCHEMA_DESCRIPTION)
│ ├── run_baseline.py # Module 2: manual baseline runner
│ ├── run_harness.py # Module 6: unified harness runner
│ ├── grade_baseline.py # Module 2: interactive hand-grading tool
│ ├── summarize_run.py # Module 2: run summary and metrics
│ ├── compare_runs.py # Module 2: compare two graded runs
│ ├── ci_eval.py # Module 6: CI-friendly eval runner
│ ├── graders/ # Module 6: automated grading
│ │ ├── answer_grader.py # LLM-as-judge answer evaluation
│ │ ├── retrieval_grader.py # Rule-based retrieval evaluation
│ │ ├── tool_grader.py # Tool-use evaluation
│ │ ├── trace_labeler.py # Trace-level labeling taxonomy
│ │ └── ragas_metrics.py # RAGAS-style retrieval metrics
│ └── runs/ # Saved JSONL run logs
│ ├── baseline-*.jsonl # Module 2: first baseline
│ ├── traced-*.jsonl # Module 6: traced runs
│ └── harness-*-graded.jsonl # Auto-graded runs
│
├── agent/ # Module 3: tool-calling agent
│ ├── tools.py # Tool definitions (read_file, search_code, etc.)
│ ├── schemas.py # JSON schemas for tool arguments
│ ├── loop.py # Raw tool-calling loop
│ ├── run_benchmark.py # Agent benchmark runner
│ ├── mcp_server.py # MCP server (stdio transport)
│ ├── mcp_server_http.py # MCP server (Streamable HTTP transport)
│ ├── tools_langchain.py # LangChain tool wrappers
│ └── graph_agent.py # LangGraph agent implementation
│
├── retrieval/ # Module 4: code retrieval pipeline
│ ├── chunk_files.py # Tier 1: naive file chunking
│ ├── embed_and_store.py # Tier 1: embedding and Qdrant storage
│ ├── naive_retrieve.py # Tier 1: basic vector retrieval
│ ├── parse_ast.py # Tier 2: AST parsing for Python files
│ ├── chunk_ast.py # Tier 2: AST-aware chunking
│ ├── embed_ast_chunks.py # Tier 2: embed AST chunks
│ ├── build_metadata_index.py # Tier 3: file/symbol metadata index
│ ├── query_metadata.py # Tier 3: metadata-aware queries
│ ├── compare_substrates.py # Tier 3: substrate comparison
│ ├── build_graph.py # Tier 3: call-graph construction
│ ├── query_graph.py # Tier 3: graph traversal queries
│ ├── hybrid_retrieve.py # Tier 3: vector + lexical + graph fusion
│ ├── context_compiler.py # Tier 4: five-stage context compilation
│ ├── detect_context_rot.py # Tier 4: context quality checker
│ ├── run_naive_benchmark.py # Tier 1 benchmark runner
│ ├── compare_tiers.py # Cross-tier comparison
│ └── run_compiled_benchmark.py # Tier 4 benchmark runner
│
├── rag/ # Module 5: RAG pipeline and grounding
│ ├── grounded_answer.py # Grounded answer generation with citations
│ ├── evidence_bundle.py # Evidence bundle schema
│ ├── pack_to_bundle.py # Context pack → evidence bundle converter
│ ├── retrieval_service.py # Routed retrieval service
│ ├── rag_with_routing.py # Full RAG pipeline with routing
│ ├── test_routing.py # Routing accuracy tests
│ └── benchmark_bundles.py # Raw vs. bundled benchmark comparison
│
├── observability/ # Module 6: telemetry and cost tracking
│ ├── traced_pipeline.py # Langfuse-instrumented RAG pipeline
│ ├── traced_benchmark.py # Traced benchmark runner
│ ├── cost_tracker.py # Per-request cost estimation
│ ├── cache_metrics.py # Prompt cache hit rate analysis
│ ├── model_router.py # Model routing by task complexity
│ ├── token_budget.py # Per-request token budget enforcement
│ ├── rate_limit_telemetry.py # Rate-limit event tracking
│ └── success_cost.py # Cost per successful task metric
│
├── orchestration/ # Module 7: multi-agent coordination
│ ├── router.py # Question classifier and routing logic
│ ├── specialists.py # Specialist implementations (code, docs, debug, general)
│ ├── graph.py # LangGraph orchestration graph
│ ├── approval.py # Human approval gate for side-effecting actions
│ └── benchmark_specialists.py # Specialist split vs single-agent comparison
│
├── memory/ # Module 7: memory layers
│ ├── thread_memory.py # Thread/session memory with summarization
│ ├── workflow_state.py # Multi-step workflow state tracking
│ ├── long_term_memory.py # Mem0-backed long-term memory with write policies
│ └── memory_eval.py # Memory usefulness evaluation
│
├── .env # API keys (git-ignored)
├── .gitignore # Ignores .env, .venv/, runs/, etc.
├── requirements.txt # Python dependencies
└── .venv/ # Virtual environment (git-ignored)
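To make the benchmark file concrete, here is a sketch of what a single line of benchmark-questions.jsonl might look like, combining the base fields from Module 2 with the extended fields added in Modules 4-6. The concrete values are hypothetical; only the field names come from the layout notes above.

```python
import json

# Hypothetical benchmark question record; field names follow the layout
# notes above (id, question, category, gold_answer, plus the Module 4-6
# extensions), but the values are invented for illustration.
record = {
    "id": "q-001",
    "question": "Where is the run-log schema defined?",
    "category": "code-location",
    "gold_answer": "harness/schema.py defines SCHEMA_DESCRIPTION.",
    # Extended fields (Modules 4-6):
    "expected_files": ["harness/schema.py"],
    "expected_symbols": ["SCHEMA_DESCRIPTION"],
    "expected_tools": ["search_code", "read_file"],
    "expected_route": "code",
}

# JSONL means one JSON object per line; round-trip to check it serializes.
line = json.dumps(record)
assert json.loads(line)["expected_route"] == "code"
```

Keeping one object per line is what lets the harness scripts stream the file and append graded results without rewriting it.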
Directory-to-module mapping
| Directory | Module | Purpose |
|---|---|---|
| harness/ | 2 (Benchmark) + 6 (Observability & Evals) | Experiment framework: run benchmarks, grade results, compare runs |
| agent/ | 3 (Agent & Tools) | Tool definitions, the tool-calling loop, MCP servers, framework agents |
| retrieval/ | 4 (Code Retrieval) | Four retrieval tiers: naive, AST-aware, hybrid, compiled |
| rag/ | 5 (RAG & Grounding) | Evidence bundles, grounded answers, retrieval routing |
| observability/ | 6 (Observability & Evals) | Traces, cost tracking, caching, budgets, rate limiting |
| orchestration/ | 7 (Multi-Agent Coordination) | Question routing, specialists, orchestration graph, approval gates |
| memory/ | 7 (Memory Layers) | Thread, workflow, and long-term memory |
Key files by function
| Function | Files |
|---|---|
| Running experiments | harness/run_harness.py, harness/ci_eval.py |
| Grading answers | harness/graders/answer_grader.py, harness/graders/retrieval_grader.py |
| Retrieval | retrieval/hybrid_retrieve.py, retrieval/context_compiler.py |
| Answer generation | rag/grounded_answer.py, rag/rag_with_routing.py |
| Telemetry | observability/traced_pipeline.py, observability/cost_tracker.py |
| Schema definitions | harness/schema.py, rag/evidence_bundle.py, retrieval/context_compiler.py |
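The run-log schema itself is defined by SCHEMA_DESCRIPTION in harness/schema.py. As an illustration only (the real schema may differ), a minimal run-log entry shared by the baseline and harness runners could carry fields like these:

```python
from dataclasses import dataclass, asdict
from typing import Optional

# Illustrative run-log entry; the actual fields are defined by
# SCHEMA_DESCRIPTION in harness/schema.py and may differ.
@dataclass
class RunLogEntry:
    question_id: str            # matches "id" in benchmark-questions.jsonl
    answer: str                 # model output, graded later
    model: str                  # which model produced the answer
    latency_s: float            # wall-clock time for the request
    tokens_in: int = 0
    tokens_out: int = 0
    grade: Optional[str] = None  # filled in by grade_baseline.py / graders

entry = RunLogEntry("q-001", "See harness/schema.py", "example-model", 1.2)
assert asdict(entry)["grade"] is None  # ungraded until a grading pass runs
```

Serializing each entry with json.dumps(asdict(entry)) is what produces the JSONL files under harness/runs/.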
How the project grows
The directory structure isn't created all at once. Here's the build-up order:
- Module 2: benchmark-questions.jsonl and harness/ with the baseline runner and grading tools
- Module 3: agent/ with tool definitions and the tool-calling loop
- Module 4: retrieval/ with four tiers of progressively better retrieval
- Module 5: rag/ with evidence bundles, grounded answers, and retrieval routing
- Module 6: observability/ with traces and cost tracking; harness/graders/ with automated evaluation
- Module 7: orchestration/ with specialists and routing; memory/ with thread, workflow, and long-term layers
Each module's code builds on the previous modules. The retrieval/ directory feeds into rag/, which feeds into observability/traced_pipeline.py, which feeds into harness/run_harness.py. By Module 7, the full pipeline includes orchestration and memory.
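That chaining can be pictured as thin layers wrapping each other. The function names below are placeholders standing in for the real modules, not their actual APIs; the point is only the dependency order:

```python
# Placeholder functions mirroring the dependency order described above;
# none of these are the real module APIs.

def hybrid_retrieve(question: str) -> list[str]:          # retrieval/
    return ["harness/schema.py:1-40"]                     # chunk references

def grounded_answer(question: str, chunks: list[str]) -> dict:  # rag/
    return {"answer": "(generated answer)", "citations": chunks}

def traced_pipeline(question: str) -> dict:               # observability/
    result = grounded_answer(question, hybrid_retrieve(question))
    result["trace"] = {"question": question}              # telemetry envelope
    return result

def run_harness(questions: list[str]) -> list[dict]:      # harness/
    return [traced_pipeline(q) for q in questions]

runs = run_harness(["Where is the run-log schema defined?"])
assert runs[0]["citations"] == ["harness/schema.py:1-40"]
```

Each outer layer only calls the layer directly beneath it, which is why the modules can be built (and debugged) one at a time.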
Module 1 project
Module 1 (Foundations) uses a separate project directory:
ai-eng-foundations/
├── main.py # FastAPI application
├── requirements.txt # Dependencies
└── .env # API keys
This foundation project is a learning scaffold. Starting in Module 2, you'll switch to the anchor repository shown above. The Code Continuity Contract in AGENTS.md explains the relationship between the two projects.