Suggested Repository Layout
Annotated directory tree showing the anchor project's structure as it emerges across Modules 2-7. Each directory maps to a module's teaching content; each file is introduced in a specific lesson.
This layout isn't prescribed upfront -- it's what you'll have built by the time you finish the curriculum. Use it as a reference when you need to find where something lives or where new code should go.
Full directory tree
anchor-repo/
│
├── benchmark-questions.jsonl # Module 2: 30+ questions with gold answers
│ # Fields: id, question, category, gold_answer
│ # Extended with: expected_files, expected_symbols,
│ # expected_tools, expected_route (Modules 4-6)
│
├── harness/ # Module 2 + Module 6: experiment framework
│ ├── schema.py # Run-log schema definition (SCHEMA_DESCRIPTION)
│ ├── run_baseline.py # Module 2: manual baseline runner
│ ├── run_harness.py # Module 6: unified harness runner
│ ├── grade_baseline.py # Module 2: interactive hand-grading tool
│ ├── summarize_run.py # Module 2: run summary and metrics
│ ├── compare_runs.py # Module 2: compare two graded runs
│ ├── ci_eval.py # Module 6: CI-friendly eval runner
│ ├── graders/ # Module 6: automated grading
│ │ ├── answer_grader.py # LLM-as-judge answer evaluation
│ │ ├── retrieval_grader.py # Rule-based retrieval evaluation
│ │ ├── tool_grader.py # Tool-use evaluation
│ │ ├── trace_labeler.py # Trace-level labeling taxonomy
│ │ └── ragas_metrics.py # RAGAS-style retrieval metrics
│ └── runs/ # Saved JSONL run logs
│ ├── baseline-*.jsonl # Module 2: first baseline
│ ├── traced-*.jsonl # Module 6: traced runs
│ └── harness-*-graded.jsonl # Auto-graded runs
│
├── agent/ # Module 3: tool-calling agent
│ ├── tools.py # Tool definitions (read_file, search_code, etc.)
│ ├── schemas.py # JSON schemas for tool arguments
│ ├── loop.py # Raw tool-calling loop
│ ├── run_benchmark.py # Agent benchmark runner
│ ├── mcp_server.py # MCP server (stdio transport)
│ ├── mcp_server_http.py # MCP server (Streamable HTTP transport)
│ ├── tools_langchain.py # LangChain tool wrappers
│ └── graph_agent.py # LangGraph agent implementation
│
├── retrieval/ # Module 4: code retrieval pipeline
│ ├── chunk_files.py # Tier 1: naive file chunking
│ ├── embed_and_store.py # Tier 1: embedding and Qdrant storage
│ ├── naive_retrieve.py # Tier 1: basic vector retrieval
│ ├── parse_ast.py # Tier 2: AST parsing for Python files
│ ├── chunk_ast.py # Tier 2: AST-aware chunking
│ ├── embed_ast_chunks.py # Tier 2: embed AST chunks
│ ├── build_metadata_index.py # Tier 3: file/symbol metadata index
│ ├── query_metadata.py # Tier 3: metadata-aware queries
│ ├── compare_substrates.py # Tier 3: substrate comparison
│ ├── build_graph.py # Tier 3: call-graph construction
│ ├── query_graph.py # Tier 3: graph traversal queries
│ ├── hybrid_retrieve.py # Tier 3: vector + lexical + graph fusion
│ ├── context_compiler.py # Tier 4: five-stage context compilation
│ ├── detect_context_rot.py # Tier 4: context quality checker
│ ├── run_naive_benchmark.py # Tier 1 benchmark runner
│ ├── compare_tiers.py # Cross-tier comparison
│ └── run_compiled_benchmark.py # Tier 4 benchmark runner
│
├── rag/ # Module 5: RAG pipeline and grounding
│ ├── grounded_answer.py # Grounded answer generation with citations
│ ├── evidence_bundle.py # Evidence bundle schema
│ ├── pack_to_bundle.py # Context pack → evidence bundle converter
│ ├── retrieval_service.py # Routed retrieval service
│ ├── rag_with_routing.py # Full RAG pipeline with routing
│ ├── test_routing.py # Routing accuracy tests
│ └── benchmark_bundles.py # Raw vs. bundled benchmark comparison
│
├── observability/ # Module 6: telemetry and cost tracking
│ ├── traced_pipeline.py # Langfuse-instrumented RAG pipeline
│ ├── traced_benchmark.py # Traced benchmark runner
│ ├── cost_tracker.py # Per-request cost estimation
│ ├── cache_metrics.py # Prompt cache hit rate analysis
│ ├── model_router.py # Model routing by task complexity
│ ├── token_budget.py # Per-request token budget enforcement
│ ├── rate_limit_telemetry.py # Rate-limit event tracking
│ └── success_cost.py # Cost per successful task metric
│
├── orchestration/ # Module 7: multi-agent coordination
│ ├── router.py # Question classifier and routing logic
│ ├── specialists.py # Specialist implementations (code, docs, debug, general)
│ ├── graph.py # LangGraph orchestration graph
│ ├── approval.py # Human approval gate for side-effecting actions
│ └── benchmark_specialists.py # Specialist split vs single-agent comparison
│
├── memory/ # Module 7: memory layers
│ ├── thread_memory.py # Thread/session memory with summarization
│ ├── workflow_state.py # Multi-step workflow state tracking
│ ├── long_term_memory.py # Mem0-backed long-term memory with write policies
│ └── memory_eval.py # Memory usefulness evaluation
│
├── .env # API keys (git-ignored)
├── .gitignore # Ignores .env, .venv/, runs/, etc.
├── requirements.txt # Python dependencies
└── .venv/ # Virtual environment (git-ignored)
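To make the benchmark file concrete, here is a sketch of what a single line of benchmark-questions.jsonl might look like, combining the base fields from Module 2 with the extended fields added in Modules 4-6. The concrete values are hypothetical; only the field names come from the layout notes above.

```python
import json

# Hypothetical benchmark question record; field names follow the layout
# notes above (id, question, category, gold_answer, plus the Module 4-6
# extensions), but the values are invented for illustration.
record = {
    "id": "q-001",
    "question": "Where is the run-log schema defined?",
    "category": "code-location",
    "gold_answer": "harness/schema.py defines SCHEMA_DESCRIPTION.",
    # Extended fields (Modules 4-6):
    "expected_files": ["harness/schema.py"],
    "expected_symbols": ["SCHEMA_DESCRIPTION"],
    "expected_tools": ["search_code", "read_file"],
    "expected_route": "code",
}

# JSONL means one JSON object per line; round-trip to check it serializes.
line = json.dumps(record)
assert json.loads(line)["expected_route"] == "code"
```

Keeping one object per line is what lets the harness scripts stream the file and append graded results without rewriting it.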
Directory-to-module mapping
| Directory | Module | Purpose |
|---|---|---|
| harness/ | 2 (Benchmark) + 6 (Observability & Evals) | Experiment framework: run benchmarks, grade results, compare runs |
| agent/ | 3 (Agent & Tools) | Tool definitions, the tool-calling loop, MCP servers, framework agents |
| retrieval/ | 4 (Code Retrieval) | Four retrieval tiers: naive, AST-aware, hybrid, compiled |
| rag/ | 5 (RAG & Grounding) | Evidence bundles, grounded answers, retrieval routing |
| observability/ | 6 (Observability & Evals) | Traces, cost tracking, caching, budgets, rate limiting |
| orchestration/ | 7 (Multi-Agent Coordination) | Question routing, specialists, orchestration graph, approval gates |
| memory/ | 7 (Memory Layers) | Thread, workflow, and long-term memory |
Key files by function
| Function | Files |
|---|---|
| Running experiments | harness/run_harness.py, harness/ci_eval.py |
| Grading answers | harness/graders/answer_grader.py, harness/graders/retrieval_grader.py |
| Retrieval | retrieval/hybrid_retrieve.py, retrieval/context_compiler.py |
| Answer generation | rag/grounded_answer.py, rag/rag_with_routing.py |
| Telemetry | observability/traced_pipeline.py, observability/cost_tracker.py |
| Schema definitions | harness/schema.py, rag/evidence_bundle.py, retrieval/context_compiler.py |
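The run-log schema itself is defined by SCHEMA_DESCRIPTION in harness/schema.py. As an illustration only (the real schema may differ), a minimal run-log entry shared by the baseline and harness runners could carry fields like these:

```python
from dataclasses import dataclass, asdict
from typing import Optional

# Illustrative run-log entry; the actual fields are defined by
# SCHEMA_DESCRIPTION in harness/schema.py and may differ.
@dataclass
class RunLogEntry:
    question_id: str            # matches "id" in benchmark-questions.jsonl
    answer: str                 # model output, graded later
    model: str                  # which model produced the answer
    latency_s: float            # wall-clock time for the request
    tokens_in: int = 0
    tokens_out: int = 0
    grade: Optional[str] = None  # filled in by grade_baseline.py / graders

entry = RunLogEntry("q-001", "See harness/schema.py", "example-model", 1.2)
assert asdict(entry)["grade"] is None  # ungraded until a grading pass runs
```

Serializing each entry with json.dumps(asdict(entry)) is what produces the JSONL files under harness/runs/.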
How the project grows
The directory structure isn't created all at once. Here's the build-up order:
- Module 2: benchmark-questions.jsonl and harness/ with the baseline runner and grading tools
- Module 3: agent/ with tool definitions and the tool-calling loop
- Module 4: retrieval/ with four tiers of progressively better retrieval
- Module 5: rag/ with evidence bundles, grounded answers, and retrieval routing
- Module 6: observability/ with traces and cost tracking; harness/graders/ with automated evaluation
- Module 7: orchestration/ with specialists and routing; memory/ with thread, workflow, and long-term layers
Each module's code builds on the previous modules. The retrieval/ directory feeds into rag/, which feeds into observability/traced_pipeline.py, which feeds into harness/run_harness.py. By Module 7, the full pipeline includes orchestration and memory.
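That chaining can be pictured as thin layers wrapping each other. The function names below are placeholders standing in for the real modules, not their actual APIs; the point is only the dependency order:

```python
# Placeholder functions mirroring the dependency order described above;
# none of these are the real module APIs.

def hybrid_retrieve(question: str) -> list[str]:          # retrieval/
    return ["harness/schema.py:1-40"]                     # chunk references

def grounded_answer(question: str, chunks: list[str]) -> dict:  # rag/
    return {"answer": "(generated answer)", "citations": chunks}

def traced_pipeline(question: str) -> dict:               # observability/
    result = grounded_answer(question, hybrid_retrieve(question))
    result["trace"] = {"question": question}              # telemetry envelope
    return result

def run_harness(questions: list[str]) -> list[dict]:      # harness/
    return [traced_pipeline(q) for q in questions]

runs = run_harness(["Where is the run-log schema defined?"])
assert runs[0]["citations"] == ["harness/schema.py:1-40"]
```

Each outer layer only calls the layer directly beneath it, which is why the modules can be built (and debugged) one at a time.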
Module 1 project
Module 1 (Foundations) uses a separate project directory:
ai-eng-foundations/
├── main.py # FastAPI application
├── requirements.txt # Dependencies
└── .env # API keys
This foundation project is a learning scaffold. Starting in Module 2, you'll switch to the anchor repository shown above. The Code Continuity Contract in AGENTS.md explains the relationship between the two projects.