Portfolio Milestones

These five artifacts demonstrate real AI engineering skill: not toy demos, but evidence that you can build, measure, and improve AI systems. Each milestone connects to specific modules and can be published as soon as you complete it.

What makes a portfolio piece different from a demo

A demo shows that something works. A portfolio piece shows engineering judgment: the decisions you made, the alternatives you considered, the measurements you took, and the tradeoffs you accepted. Employers and collaborators don't need to see that you can call an API. They need to see that you can make good decisions about which API to call, when to call it, and how to know if it worked.

Every milestone below has three parts: what to build, what to document, and what the artifact demonstrates to someone reviewing your work.

Milestone 1: Benchmarked Code Assistant

When: After completing Modules 1-3.

What to build: A tool-calling code assistant that answers questions about a real repository, with a benchmark suite and graded baseline results.

What to publish:

The assistant code (tool definitions, tool loop or framework agent, system prompt)
The benchmark: 15+ questions with expected answers and grading rubric
A baseline run log with grades and summary metrics
A brief write-up (1-2 pages) covering:
- Why you chose this repository as your anchor project
- How you designed the benchmark questions (what categories, what distribution)
- What the baseline results tell you about where the system fails
- What you'd improve first and why

What it demonstrates: You can build an AI system that does something useful, define what "good" means for that system, and measure where it falls short. This is the foundation. Most AI projects skip measurement entirely.
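
As a rough illustration of what summary metrics for a baseline run could look like, the sketch below computes a pass rate per question category and overall. The `GradedQuestion` fields, the three-level grade scale, and the category names are assumptions made for this example, not a format the curriculum prescribes.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class GradedQuestion:
    """One benchmark question plus its grade from a run (hypothetical schema)."""
    question_id: str
    category: str   # illustrative, e.g. "lookup" or "architecture"
    expected: str   # the expected answer the grader compares against
    grade: str      # assumed rubric levels: "pass", "partial", "fail"


def summarize(results: list[GradedQuestion]) -> dict[str, dict[str, float]]:
    """Return question count and pass rate per category, plus an "overall" row."""
    buckets: dict[str, list[GradedQuestion]] = defaultdict(list)
    for result in results:
        buckets[result.category].append(result)
        buckets["overall"].append(result)
    return {
        name: {
            "count": len(items),
            # Only a full "pass" counts here; "partial" is treated as not passing.
            "pass_rate": sum(item.grade == "pass" for item in items) / len(items),
        }
        for name, items in buckets.items()
    }
```

A per-category breakdown like this makes the "where the system fails" part of the write-up concrete: a low pass rate in one category is a natural candidate for what to improve first.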

Milestone 2: Retrieval Pipeline with Eval Coverage

When: After completing Modules 4-5.

What to build: A multi-tier retrieval pipeline (vector search plus at least one other retrieval method) feeding a RAG pipeline that produces grounded, cited answers, along with retrieval evals that measure precision and recall.

What to publish:

The retrieval pipeline code (indexing, multiple retrieval methods, hybrid retrieval)
The RAG pipeline with evidence bundles and citation
Retrieval eval results showing precision and recall across question types
A comparison showing how each retrieval tier improved (or didn't improve) specific question categories
Retrieval Lab Notes documenting what failed at each tier and why

What it demonstrates: You understand that retrieval is the foundation of grounded AI systems, you can build and evaluate multiple retrieval approaches, and you can make principled decisions about when to use which approach. The Retrieval Lab Notes are the most valuable part because they capture your debugging thought process, not just working code.
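
Precision and recall for retrieval are typically computed per question against a hand-labeled set of relevant documents. The following is a minimal sketch, assuming each retrieved item has a unique ID and that you have labeled the relevant IDs yourself; the function name and the cutoff `k` are illustrative choices.

```python
def precision_recall_at_k(
    retrieved: list[str], relevant: set[str], k: int
) -> tuple[float, float]:
    """Compute precision@k and recall@k for one question.

    retrieved: document IDs in ranked order (assumed unique).
    relevant:  hand-labeled IDs that should have been retrieved.
    """
    top_k = retrieved[:k]
    hits = len(set(top_k) & relevant)
    # Precision: what fraction of what we returned was relevant.
    precision = hits / len(top_k) if top_k else 0.0
    # Recall: what fraction of the relevant documents we found.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Averaging these values separately for each question category and each retrieval tier gives the kind of comparison described above, showing which tier helped which category.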

Milestone 3: Observable System with Automated Evals

When: After completing Module 6.

What to build: The full harness: a one-command benchmark runner that produces traced, costed, auto-graded run logs, with LLM-as-judge grading against a structured rubric.

What to publish:

The harness code (benchmark runner, cost tracking, auto-grading)
A comparison of two runs showing measurable improvement (before and after a specific change)
The grading rubric and a sample of judge grades with your assessment of judge quality
Cost-per-successful-task metrics
A write-up covering:
- How you designed the grading rubric and validated it against human judgment
- What the cost metrics revealed about your system's efficiency
- One specific improvement you made based on what the evals showed you

What it demonstrates: You can build observability into an AI system, automate evaluation, and use measurements to drive improvements. This is the artifact that most clearly separates AI engineering from prompt tinkering.
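
Two of the numbers above are simple to compute once run logs exist. The sketch below takes one reasonable reading of cost-per-successful-task (total spend divided by the number of tasks that passed, so failed attempts still count toward cost) and one simple way to check a judge against human grades (raw agreement rate on a sample). The `cost_usd` and `passed` keys are hypothetical field names, not part of any prescribed schema.

```python
from typing import Optional


def cost_per_successful_task(runs: list[dict]) -> Optional[float]:
    """Total cost of all runs divided by the number of passing runs.

    Returns None when nothing passed, since the metric is undefined then.
    """
    total_cost = sum(run["cost_usd"] for run in runs)
    successes = sum(1 for run in runs if run["passed"])
    return total_cost / successes if successes else None


def agreement_rate(judge_grades: list[str], human_grades: list[str]) -> float:
    """Fraction of sampled items where the judge grade matches the human grade."""
    if len(judge_grades) != len(human_grades) or not judge_grades:
        raise ValueError("Need two non-empty grade lists of equal length.")
    matches = sum(j == h for j, h in zip(judge_grades, human_grades))
    return matches / len(judge_grades)
```

Raw agreement is a coarse check. If most items pass, a judge that always says "pass" would score well, so it helps to also read the disagreements by hand.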

Milestone 4: Multi-Agent System with Memory

When: After completing Module 7.

What to build: An orchestrated system with at least two specialists, thread memory, and workflow state. Demonstrate that routing and memory improve answer quality on multi-turn interactions.

What to publish:

The orchestration code (router, specialists, memory layer)
Benchmark results comparing single-agent vs. orchestrated performance
A multi-turn conversation trace showing how memory and accumulated context improve answers as the conversation progresses
A write-up covering:
- How you decided which tasks to specialize (with eval evidence)
- How you handle routing errors, including an honest account of when a human needs to step in
- What memory policies you chose and why
- One example where orchestration made the answer worse and what you learned from it

What it demonstrates: You can design and build multi-component AI systems, make principled specialization decisions based on evidence, and handle the complexity of state management across agents.
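
One way to find both the improvements and the cases where orchestration made an answer worse is to compare per-question scores between the two configurations. This is a sketch under assumptions: scores are numeric and keyed by question ID, and the function name and `tolerance` parameter are invented for illustration.

```python
def compare_runs(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.0,
) -> dict[str, list[str]]:
    """Bucket question IDs present in both runs by how the score changed."""
    changes: dict[str, list[str]] = {"improved": [], "regressed": [], "unchanged": []}
    # Only compare questions that appear in both runs.
    for qid in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[qid] - baseline[qid]
        if delta > tolerance:
            changes["improved"].append(qid)
        elif delta < -tolerance:
            changes["regressed"].append(qid)
        else:
            changes["unchanged"].append(qid)
    return changes
```

With the single-agent run as `baseline` and the orchestrated run as `candidate`, the `regressed` list is where to look for the example the write-up asks for.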

Milestone 5: Optimized and Distilled Component

When: After completing Module 8.

What to build: A distilled or fine-tuned model component that demonstrably improves cost, latency, or quality on a specific task, plus the optimization ladder diagnostic showing why this was the right intervention.

What to publish:

The failure diagnostic output showing the optimization ladder analysis
Training data collection and curation pipeline
Evaluation results for the student or fine-tuned model against the teacher or base model
Cost and latency comparison (teacher vs. student, or base vs. fine-tuned)
A postmortem covering:
- What failure cluster or cost problem motivated the optimization
- What cheaper interventions you tried first and why they weren't sufficient
- Whether the optimization was worth the effort (honest assessment)
- The optimization tax: what ongoing maintenance the optimization requires

What it demonstrates: You understand the full optimization ladder and can make disciplined decisions about when to invest in model training. The postmortem is critical because it shows you can evaluate your own work honestly, including admitting when an intervention wasn't worth it.
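
For a cost-motivated optimization, one simple input to the "was it worth it" question is a break-even estimate: how many requests the cheaper model must serve before per-request savings cover the one-time training cost. This framing is an assumption of this sketch rather than something the curriculum prescribes, and it ignores quality differences and ongoing maintenance, which the postmortem should address separately. All parameter names are hypothetical.

```python
import math
from typing import Optional


def break_even_requests(
    one_time_cost_usd: float,
    teacher_cost_per_request_usd: float,
    student_cost_per_request_usd: float,
) -> Optional[int]:
    """Requests needed before per-request savings repay the one-time cost.

    Returns None if the student is not cheaper per request, in which case
    the optimization never pays for itself on cost alone.
    """
    savings = teacher_cost_per_request_usd - student_cost_per_request_usd
    if savings <= 0:
        return None
    return math.ceil(one_time_cost_usd / savings)
```

If the break-even point is far beyond the traffic the component is likely to see, that is worth stating plainly in the postmortem.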

Publishing guidance

Where to publish: GitHub is the default. Each milestone can be a separate repository or a branch/directory in a monorepo. Include a README that explains the project for someone who hasn't read this curriculum.

What to include in the README:

One-paragraph summary of what the system does
How to run it (setup, dependencies, commands)
Key design decisions and why you made them
Evaluation results with numbers
What you'd do differently with more time

What NOT to include:

API keys, credentials, or secrets (use environment variables)
Raw model outputs without analysis
Code without explanation of the decisions behind it
Claims without measurements
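
Keeping secrets out of a published repository usually means reading them from the environment at runtime. A minimal sketch, assuming a variable named `MODEL_API_KEY` (a placeholder; use whatever name your provider or project expects):

```python
import os


def load_api_key(var_name: str = "MODEL_API_KEY") -> str:
    """Read an API key from the environment and fail loudly if it is missing."""
    key = os.environ.get(var_name)
    if not key:
        # Name the variable in the error so a reader of the README knows what to set.
        raise RuntimeError(f"Set the {var_name} environment variable before running.")
    return key
```

Documenting the required variable names in the README's setup section lets someone else run the project without ever seeing your credentials.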

The difference between a good portfolio piece and a great one: A good piece shows working code with eval results. A great piece shows the reasoning behind the code, including why this approach over alternatives, what the measurements told you, and what you learned from failures. The write-up matters as much as the code.

Cross-references

Suggested Repository Layout — directory structure for your project
Run-Log Schema — the schema your benchmark results should follow
Eval Taxonomy — the four eval families your portfolio should demonstrate