Core Concept Map
This page defines the ideas that organize the entire learning path. I'd recommend reading it before starting Module 1. Every concept here will appear throughout your journey, and understanding how they relate to each other will make each lesson click faster.
We'll define each concept in plain language first, then show how it connects to the others.
The core concepts
Model
A model is a program that has been trained on data to predict outputs from inputs. In this curriculum, "model" almost always means a large language model (LLM): a model trained on text that generates text. We won't train models in this learning path until the very end; until then, you'll use them.
At its core, a model takes in a sequence of tokens and produces a sequence of tokens. Everything else in AI engineering, and everything in this curriculum, is about what you put in, what you do with what comes out, and how you measure whether it was good.
Model families you'll encounter:
| Family | What it is | When to use it | Examples |
|---|---|---|---|
| Workhorse model | A general-purpose LLM optimized for speed and broad capability | Most tasks. When someone says "LLM" without qualification, they usually mean this | GPT-4o-mini, Claude Haiku, Gemini Flash, Llama 3.1 8B |
| Reasoning model | A model optimized for complex multi-step planning, math, and logic. Slower and more expensive | Hard problems that require chained reasoning, code generation with complex constraints | o3, o4-mini, Claude with extended thinking |
| Small language model (SLM) | A compact model (typically 1-7B parameters) that runs on consumer hardware | When privacy, offline operation, predictable cost, or low latency matter more than peak capability | Phi, small Llama variants, Gemma |
Foundation model: A foundation model is a large model pre-trained on broad data and designed to be adapted to many downstream tasks. OpenAI's GPT models, Anthropic's Claude models, and Llama models are all examples of foundation models. In this learning path, the workhorse, reasoning, and SLM families in the table above are not separate from foundation models; they are practical categories for thinking about how different foundation models are used.
What a model is not: A model is not a product. ChatGPT is a product built on top of models. Claude is a product built on top of models. The model is the engine; the product is the car. In this learning path, you'll work with the engine directly through APIs.
Infrastructure vs. model: An inference server like vLLM or Ollama is not a model. It's the software that hosts a model and serves requests. Think of it as the power plant that runs the engine. You'll encounter this distinction in later modules when we discuss self-hosted inference.
Prompt
A prompt is the input you give to a model. At its simplest, it's a question or instruction. In practice, a prompt is a structured message (or sequence of messages) that includes:
- A system message that sets the model's role, behavior, and constraints
- One or more user messages with the actual request
- Optionally, assistant messages (previous model responses) for conversation continuity
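As a concrete sketch, the message structure above maps to a list like this. The `role`/`content` shape follows the widely used chat-completion convention; your provider's SDK may use a slightly different schema, so treat this as illustrative:

```python
# A hypothetical prompt for a code-review assistant, expressed as chat messages.
messages = [
    # System message: role, behavior, constraints.
    {"role": "system", "content": "You are a code reviewer. Respond in bullet points. Cite line numbers."},
    # User message: the actual request.
    {"role": "user", "content": "Review this function for bugs:\n\ndef add(a, b):\n    return a - b"},
]

# Conversation continuity: append the model's reply, then the next user turn.
messages.append({"role": "assistant", "content": "- Line 2: `a - b` should be `a + b`."})
messages.append({"role": "user", "content": "Apply the fix and show the corrected function."})
```

Each new turn is appended rather than replacing what came before, which is why long conversations consume context window space.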
Prompt engineering isn't about tricks. I've found it's much closer to writing clear contracts or requirements: telling the model what you want, what format you want it in, what constraints apply, and what evidence to use. If you're good at writing technical specs, you'll find prompt engineering surprisingly natural.
Relationship to other concepts: A prompt is one part of the model's context. It's the part you write directly. The other parts (retrieved evidence, tool results, memory) are assembled by your system.
Context
Context is everything the model sees when it generates a response. This includes the prompt, but it also includes:
- Retrieved documents or code snippets
- Results from tool calls
- Conversation history
- Memory entries
- System instructions
A model's context window is the maximum number of tokens it can process in a single request (input and output combined). Context windows have gotten large (100K+ tokens is common), but bigger is not always better. A model with 200K tokens of irrelevant context will often perform worse than a model with 2K tokens of precisely relevant context.
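Because the window covers input and output combined, you have to reserve room for the response. A rough budgeting sketch (the 4-characters-per-token estimate is a common rule of thumb, not exact; real systems use the model's actual tokenizer):

```python
def estimate_tokens(text: str) -> int:
    # Rough rule of thumb: ~4 characters per token for English text.
    # In practice, use the model's real tokenizer instead of this estimate.
    return max(1, len(text) // 4)

def fits_window(prompt: str, max_output_tokens: int, window: int = 128_000) -> bool:
    # The context window covers input AND output, so reserve output space up front.
    return estimate_tokens(prompt) + max_output_tokens <= window
```

The `window` default here is arbitrary; check your model's documented limit.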
Relationship to other concepts: Context is the stage where prompt, retrieval, memory, and tool results all meet. Context engineering is the discipline of managing this stage well.
Context engineering
Context engineering is the discipline of selecting, packaging, and budgeting the information a model sees at inference time. In my experience, it's the single most important skill in AI engineering, more impactful than choosing the right model or the right framework.
Context engineering involves:
- Selecting the right evidence for a specific task (not just "everything we have")
- Packaging it in a format the model can use effectively (structure, ordering, deduplication)
- Budgeting tokens so you don't waste context window space on low-value information
- Maintaining context quality over time (avoiding context rot)
This is different from "just make the context window bigger." A longer context window gives you more room, but it doesn't tell you what to put in it. Context engineering is the judgment layer.
Context rot is what happens when context quality degrades over time: stale memory facts, conflicting retrieved evidence, bloated prompt history, accumulated instructions that contradict each other. It's a form of technical debt in AI systems, and context engineering is how you prevent and fix it.
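The select-package-budget steps can be sketched as a single function. Everything here is a toy stand-in: the term-overlap "selection" substitutes for real retrieval scores, and the token estimate substitutes for a real tokenizer:

```python
def assemble_context(query: str, candidates: list[str], budget_tokens: int) -> str:
    """Select, package, and budget evidence for one request (toy version)."""
    # Select: keep only snippets sharing terms with the query
    # (a crude stand-in for real retrieval relevance scores).
    terms = set(query.lower().split())
    relevant = [c for c in candidates if terms & set(c.lower().split())]

    # Package: deduplicate while preserving order.
    seen, packaged = set(), []
    for c in relevant:
        if c not in seen:
            seen.add(c)
            packaged.append(c)

    # Budget: stop adding once the rough token estimate would exceed the budget.
    out, used = [], 0
    for c in packaged:
        cost = max(1, len(c) // 4)  # ~4 chars per token, a rough rule of thumb
        if used + cost > budget_tokens:
            break
        out.append(c)
        used += cost
    return "\n\n".join(out)
```

Note what the budget step implies: ordering matters, because whatever you put first is least likely to be cut.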
Relationship to other concepts: Context engineering sits on top of retrieval, memory, and prompting. It's the skill that ties them together. You'll practice it throughout the curriculum, starting with simple prompt construction and building toward compiled context packs.
Tool call
A tool call is when the model requests the execution of a specific function with structured arguments, rather than just generating text. The model doesn't run the tool itself. It outputs a structured request ("call function X with arguments Y"), your code executes it, and the result goes back into the model's context.
Tool calling is what turns a language model from a text generator into a component of a larger system. With tools, a model can:
- Look up information it doesn't have
- Execute actions in the real world (send emails, create files, query databases)
- Interact with APIs and external services
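The request-execute-return loop looks roughly like this. The tool registry and the JSON request format are hypothetical placeholders; real SDKs return structured tool-call objects rather than raw JSON strings, but the division of labor is the same:

```python
import json

# Your code owns the tools and their execution -- the model never runs them itself.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stand-in for a real API call

TOOLS = {"get_weather": get_weather}

def handle_tool_call(model_output: str) -> str:
    """The model emits a structured request; we execute it and return the
    result so it can be appended to the model's context."""
    request = json.loads(model_output)  # e.g. {"tool": "get_weather", "args": {"city": "Oslo"}}
    fn = TOOLS[request["tool"]]
    result = fn(**request["args"])
    return json.dumps({"tool_result": result})  # this string goes back into context
```

Because your code sits between the model's request and the execution, this is also where you validate arguments and enforce permissions.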
Relationship to other concepts: Tool results become part of the model's context. The quality of tool results directly affects the quality of the model's output. This is another place where context engineering matters: not just "did the tool return a result?" but "is this result the right thing to show the model?"
Retrieval
Retrieval is the process of finding relevant information from a larger corpus to include in the model's context. Instead of hoping the model "knows" the answer from training, you find the evidence first and give it to the model explicitly.
Retrieval methods include:
| Method | How it works | Good for |
|---|---|---|
| Lexical search (BM25, keyword) | Matches exact terms | Known identifiers, function names, exact phrases |
| Vector search (embeddings) | Matches by semantic similarity | Natural language queries, conceptual questions |
| AST / symbol index | Uses code structure (syntax trees) | "Where is this function defined?", "What calls this?" |
| Metadata filters | Filters by structured attributes (file type, date, author) | Narrowing results to relevant scope |
| Hybrid | Combines multiple methods and reranks | Most production systems |
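As a toy illustration of the lexical row above, here is a term-overlap scorer. Real systems use proper BM25 scoring (via a search engine or library) and typically combine it with vector similarity, but the shape of "score every document, return the top k" is the same:

```python
def lexical_score(query: str, doc: str) -> int:
    # Count query terms that appear in the document -- a crude stand-in for BM25,
    # which additionally weights terms by rarity and document length.
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms)

def search(query: str, docs: list[str], k: int = 3) -> list[str]:
    # Rank every document by score and return the top k.
    return sorted(docs, key=lambda d: lexical_score(query, d), reverse=True)[:k]
```

A hybrid system would compute a second score per document (for example, embedding cosine similarity) and rerank using both.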
Retrieval isn't the same thing as RAG. Retrieval is the act of finding things. RAG (Retrieval-Augmented Generation) is a specific pattern where you retrieve evidence, then generate a response grounded in that evidence. Retrieval is an ingredient; RAG is a recipe.
Relationship to other concepts: Retrieved results become part of the model's context. Bad retrieval leads to bad context, which leads to bad outputs, no matter how good the model is. This is why the curriculum spends an entire module on retrieval before introducing RAG.
Memory
Memory is how an AI system persists information across conversations or tasks. Without memory, every interaction starts from zero.
There are different kinds of memory:
| Type | Scope | Example |
|---|---|---|
| Thread memory | Within one conversation | "The user said they prefer Python" |
| Workflow memory | Within one multi-step task | "Step 2 produced these intermediate results" |
| Long-term memory | Across conversations | "This user's codebase uses FastAPI and PostgreSQL" |
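The three scopes in the table can be sketched as a store keyed by scope. The scope names and the API here are illustrative, not a standard:

```python
from collections import defaultdict

class MemoryStore:
    """Toy memory keyed by scope: 'thread', 'workflow', or 'long_term'."""

    def __init__(self):
        self._entries = defaultdict(list)

    def remember(self, scope: str, fact: str) -> None:
        self._entries[scope].append(fact)

    def recall(self, scope: str) -> list[str]:
        return list(self._entries[scope])

    def forget(self, scope: str) -> None:
        # Thread memory is typically cleared when the conversation ends;
        # long-term memory persists across conversations.
        self._entries[scope].clear()
```

The interesting design questions are not in this code: they're deciding *what* deserves a `remember` call and *when* `forget` should run.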
Memory isn't free. Writing to memory creates a commitment: the system will use this information in future contexts. Bad memory (stale facts, incorrect summaries, over-broad generalizations) causes context rot. This is why we'll introduce memory only after we can measure whether it helps or hurts, through evals and telemetry.
Relationship to other concepts: Memory entries become part of the model's context. Memory management is a context engineering problem: what to remember, when to forget, and how to keep memory entries from contradicting each other.
Eval
An eval (evaluation) is a structured test that measures system quality. It's not the same as training. Evals measure; they don't change the model.
Evals answer questions like:
- Did the retrieval step find the right evidence?
- Did the generated answer match the evidence?
- Did the tool call use the correct arguments?
- Did the system complete the task end-to-end?
The curriculum distinguishes several eval families:
| Family | What it measures |
|---|---|
| Retrieval evals | Did we find the right documents? |
| Answer evals | Is the generated response correct and grounded? |
| Tool-use evals | Did the model call the right tools with the right arguments? |
| Trace evals | Did the full system execution path make sense? |
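A minimal retrieval eval from the first row, as a sketch: did the top-k results include the documents we know are relevant? This metric is commonly reported as recall@k:

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the known-relevant documents found in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)
```

Run this over a fixed set of queries with known-relevant documents and you have a retrieval benchmark: a number that tells you whether a change to your retrieval pipeline helped.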
LLM-as-judge is a technique where you use a language model to evaluate or grade the output of another model. It's useful for scaling evaluation beyond manual review, but it requires careful rubric design and human spot-checking. It's not a replacement for exact-match checks where those apply.
Relationship to other concepts: Evals are what make iteration scientific instead of subjective. Without evals, we're guessing whether our changes helped. With evals, we know. This is why the curriculum starts with a benchmark before almost anything else.
Inference
Inference is the act of running a trained model to generate output from input. When you call an API and get a response, that's inference. When you run a model locally and it produces text, that's inference.
Most AI engineering work is inference-time work: building systems around models, not training them. The curriculum uses the word "inference," not "inferencing."
Inference is different from training. Training changes the model's weights. Inference uses the weights as they are. The entire learning path until the Optimization module is inference-time work.
Relationship to other concepts: Everything in this concept map (prompts, context, retrieval, tools, memory, evals) is about making inference better. The model's weights stay fixed; you improve the system around it.
Distillation
Distillation is training a smaller model (the "student") to reproduce the behavior of a larger model (the "teacher") on a specific task. The student doesn't learn from the original training data. It learns from the teacher's outputs.
Why distill? Because large models are expensive and slow. If you have a task where a large model reliably produces good results, you can distill that behavior into a smaller, cheaper, faster model that handles that specific task well enough.
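The data side of distillation can be sketched as a filter over logged teacher outputs: keep only the ones that passed your evals, and those become the student's training pairs. The log record fields here are hypothetical:

```python
def build_distillation_set(teacher_logs: list[dict], min_score: float = 0.9) -> list[dict]:
    """Keep only teacher outputs that scored well on evals; the student trains
    on these pairs, not on the original pre-training data.
    Field names ('prompt', 'teacher_output', 'eval_score') are illustrative."""
    return [
        {"input": rec["prompt"], "target": rec["teacher_output"]}
        for rec in teacher_logs
        if rec["eval_score"] >= min_score
    ]
```

The filter is the point: distilling unvetted teacher outputs teaches the student the teacher's mistakes too.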
Relationship to other concepts: Distillation only makes sense after you have evals (to measure whether the student matches the teacher) and logs (to collect training examples from the teacher's outputs). This is why it appears near the end of the curriculum.
Fine-tuning
Fine-tuning is updating a model's weights on task-specific data to change its behavior permanently. While distillation and fine-tuning both involve training, they solve different problems: distillation compresses a larger model's behavior into a smaller one, while fine-tuning adapts a model to your specific domain, format, or task. They're separate techniques in this curriculum, not subcategories of each other.
Fine-tuning is the most permanent intervention in this curriculum. It changes the model itself. This is why it comes last: we'll only consider fine-tuning after we've exhausted prompt engineering, retrieval improvements, context engineering, and workflow changes. Those interventions are cheaper, more reversible, and often sufficient.
Common fine-tuning approaches you'll encounter:
| Approach | What it does |
|---|---|
| SFT (Supervised Fine-Tuning) | Trains on input-output pairs where the desired output is known |
| LoRA / QLoRA | Parameter-efficient methods that train small adapter layers instead of updating all weights, dramatically reducing hardware requirements |
| Preference optimization (RLHF, DPO) | Uses human or automated preference signals to improve model behavior |
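SFT training data is commonly stored as JSONL: one JSON object per line, each holding one input-output pair. The exact schema varies by provider and framework; the chat-style `messages` shape below is one common convention, and the example content is made up:

```python
import json

# Two hypothetical training examples in a chat-style SFT schema.
examples = [
    {"messages": [
        {"role": "user", "content": "Summarize: FastAPI is a Python web framework."},
        {"role": "assistant", "content": "FastAPI: a Python web framework."},
    ]},
    {"messages": [
        {"role": "user", "content": "Summarize: PostgreSQL is a relational database."},
        {"role": "assistant", "content": "PostgreSQL: a relational database."},
    ]},
]

# JSONL: one JSON object per line, no enclosing array.
with open("sft_train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

Notice that each example pairs an input with the *desired* output, which is exactly the kind of data that eval-vetted production logs can supply.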
Relationship to other concepts: Fine-tuning requires evals (to measure improvement), logs (for training data), and a stable system (so you're not fine-tuning around bugs in retrieval or prompting). It's the capstone of the optimization sequence.
How these concepts connect
Here's the big picture of how these concepts relate to each other in a running AI system:
The flow is:
- You write a prompt and your system assembles context from retrieval, tools, and memory
- Context engineering decides what goes in and what stays out
- The model runs inference on the assembled context
- Evals measure whether the output was good
- You improve by fixing context (cheap, reversible) before resorting to distillation or fine-tuning (expensive, permanent)
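The five steps above can be compressed into one sketch. Every callable here is a hypothetical placeholder for a component you'll build in the curriculum:

```python
def run_and_evaluate(query, retrieve, assemble, call_model, evaluate):
    """One pass through the loop. All callables are placeholders:
    retrieve/assemble/call_model/evaluate stand in for the real components."""
    evidence = retrieve(query)            # retrieval: find the evidence
    context = assemble(query, evidence)   # context engineering: decide what goes in
    output = call_model(context)          # inference: weights stay fixed
    score = evaluate(query, output)       # evals: measure, don't guess
    return output, score                  # low score? fix context first
```

When the score is low, you change `retrieve` or `assemble` long before you touch the model itself.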
This loop is what we'll be building throughout the curriculum. Each module teaches you to build one part of it, measure it, and improve it.
What's next
Common Category Mistakes. The concept map gives you the nouns; that page covers the confusions that make those nouns blur together in practice.