
Common Category Mistakes

When you're new to AI engineering, you're not just missing answers. If you're like me, you don't know what you don't know, and you're probably unsure what questions you should even be asking. The concepts and terminology have enough overlap that it's easy to conflate things that are actually categorically unrelated, or to separate things that are actually the same idea at different scales.

This page collects many of my own original misunderstandings, plus the most common category mistakes I've seen others propagate along the way. Each entry explains why the confusion is natural, names the distinction you're missing, and offers a better version of the question to ask instead.

I'd recommend bookmarking this page. You'll likely come back to it more than once.


Training vs. inference

The confused question: "How do I train my AI to do X?"
Why people ask: A lot of content frames all AI work as "training." If the AI isn't doing what you want, surely you need to train it differently.
What distinction you're missing: Training changes the model's weights; it's how the model was built. Inference is running the model to generate output from input. Almost all AI engineering is inference-time work: improving what the model sees (context), how it's instructed (prompts), and what tools it can use. You don't need to train a model to make it behave differently. You need to give it better instructions and better evidence.
A better question: "How do I get better outputs from this model without changing its weights? And when would changing the weights actually be the right move?"

The entire journey until Module 8 (Optimization) is inference-time work. Training only enters the picture when we've exhausted every other option and have the evals to show it's needed.
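To make the inference-time levers concrete, here's a minimal sketch. The `build_request` helper is hypothetical (not from any SDK); the point is that everything you control here lives in the request payload, while the weights never change.

```python
# Sketch of the inference-time levers: instructions, evidence, constraints.
# The model's weights are fixed; behavior changes because the payload changes.

def build_request(task: str, evidence: list[str], tone: str) -> dict:
    """Assemble a hypothetical inference request. No training involved:
    better behavior comes from better instructions and better evidence."""
    return {
        "system": (
            f"You are a careful assistant. Answer in a {tone} tone. "
            "Cite only the evidence provided."
        ),
        "messages": [{
            "role": "user",
            "content": "\n\n".join(["Task: " + task, "Evidence:", *evidence]),
        }],
    }

req = build_request(
    task="Summarize the deploy failure",
    evidence=["Log: OOM at 14:02", "Config: memory limit 512Mi"],
    tone="concise",
)
```

Tweaking `tone`, swapping the evidence, or tightening the system instruction is all inference-time work; none of it touches the model.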


Context engineering vs. "more tokens"

The confused question: "My model's context window is 200K tokens. Why would I need context engineering?"
Why people ask: Context windows have grown dramatically. If you can fit everything in, why worry about what goes in?
What distinction you're missing: Context window size is capacity. Context engineering is selection. A 200K-token window that's full of irrelevant documents will produce worse results than a 4K-token window with exactly the right evidence. Models are sensitive to what's in their context: irrelevant information dilutes attention, conflicting evidence confuses reasoning, and stale facts produce confident wrong answers.
A better question: "Given that I have a large context window, how do I decide what deserves to be in it for this specific task?"
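A minimal sketch of selection over capacity: given scored candidate chunks, keep only the most relevant ones that fit a token budget. The chunk format and scores here are illustrative.

```python
# Selection beats capacity: rank candidates by relevance, then pack
# greedily under a token budget instead of dumping everything in.
def select_context(chunks: list[dict], budget_tokens: int) -> list[dict]:
    chosen, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        if used + chunk["tokens"] <= budget_tokens:
            chosen.append(chunk)
            used += chunk["tokens"]
    return chosen

chunks = [
    {"text": "relevant spec", "score": 0.92, "tokens": 900},
    {"text": "old meeting notes", "score": 0.31, "tokens": 2500},
    {"text": "exact error trace", "score": 0.88, "tokens": 600},
]
picked = select_context(chunks, budget_tokens=2000)
# The low-relevance 2500-token chunk never makes it in, even though a
# large window could technically hold it.
```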

Context rot vs. "longer is better"

The confused question: "Shouldn't I keep the full conversation history so the model has maximum context?"
Why people ask: Intuition says more information is better. If the model can see everything, it should make better decisions.
What distinction you're missing: Context rot is the degradation that happens when context accumulates without curation. Old conversation turns may contain facts that are no longer true. Retrieved evidence from earlier in a session may contradict newer evidence. Memory entries written hours ago may reflect outdated state. The model doesn't know which parts of its context are stale; it treats everything as equally current. Keeping everything means keeping the wrong things alongside the right things.
A better question: "What's my strategy for removing or demoting stale context as a conversation progresses?"
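One simple curation strategy, sketched below: timestamp every context entry and drop anything past a freshness window unless it's explicitly pinned. The entry shape and the `pinned` flag are illustrative assumptions, not a prescribed schema.

```python
# Sketch: prune stale context instead of accumulating everything.
def prune_context(entries: list[dict], now: float, max_age_s: float) -> list[dict]:
    """Keep fresh entries; drop anything older than max_age_s unless pinned."""
    return [
        e for e in entries
        if e.get("pinned") or now - e["written_at"] <= max_age_s
    ]

now = 10_000.0
entries = [
    {"fact": "user prefers Python", "written_at": 9_900.0},
    {"fact": "deploy is failing", "written_at": 2_000.0},   # stale, may no longer be true
    {"fact": "project name: atlas", "written_at": 1_000.0, "pinned": True},
]
fresh = prune_context(entries, now, max_age_s=3_600)
```

Real systems layer more on top (summarization of old turns, contradiction checks), but the core move is the same: curation is an explicit policy, not a side effect.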

LLM-as-judge vs. human eval

The confused question: "Can't I just use an LLM to evaluate all my outputs automatically?"
Why people ask: Manual evaluation is slow and doesn't scale. If an LLM can generate, surely it can judge.
What distinction you're missing: LLM-as-judge is a scaling technique, not a replacement for human judgment. It works well when you have a clear rubric, consistent evaluation criteria, and human spot-checks to validate the judge's calibration. It breaks down when the evaluation requires domain expertise the judge model doesn't have, when the rubric is ambiguous, or when you're grading the same model family that's doing the judging (self-evaluation bias). Use LLM-as-judge to cover breadth. Use human eval to ensure depth and catch systematic blind spots.
A better question: "For which of my eval dimensions can an LLM-as-judge reliably substitute for human review, and where do I still need human spot-checks?"
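The "human spot-checks" half of this pattern can be sketched in a few lines: sample a fraction of the judge's verdicts for human review, then track agreement. The record format is illustrative.

```python
import random

# Sketch: LLM judge covers breadth; a random sample goes to humans
# to check the judge's calibration on each eval dimension.
def spot_check_sample(judged: list[dict], rate: float, seed: int = 0) -> list[dict]:
    """Pick a reproducible random subset of judged rows for human review."""
    rng = random.Random(seed)
    return [row for row in judged if rng.random() < rate]

def agreement(judge_labels: list[str], human_labels: list[str]) -> float:
    """Fraction of spot-checked rows where judge and human agree."""
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)
```

If agreement on the sample drops below whatever threshold you trust, stop relying on the judge for that dimension and revisit the rubric.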

MCP vs. A2A vs. handoff vs. orchestration

The confused question: "What's the difference between MCP and A2A? Aren't they both ways for agents to talk to things?"
Why people ask: Both are protocols in the agent ecosystem. Both involve agents communicating with external capabilities. The naming and marketing blur the boundaries.
What distinction you're missing: These are four different concepts at different levels of abstraction:
MCP (Model Context Protocol)
What it does: Connects an agent to capabilities: tools, resources, and prompts exposed by a server. The agent calls tools; the server executes them.
Analogy: A worker using equipment. The worker decides what to do; the equipment does the physical task.

A2A (Agent-to-Agent protocol)
What it does: Connects peer agents that can negotiate, delegate, and collaborate as equals. Each agent has its own autonomy.
Analogy: Two colleagues discussing a project and splitting work.

Handoff
What it does: Transfers control from one agent or specialist to another within a single runtime or orchestration layer. One agent says "you take it from here."
Analogy: Passing a baton in a relay race.

Orchestration
What it does: The control logic that decides which agent, specialist, or workflow to activate for a given task. It's the routing and coordination layer.
Analogy: A project manager assigning tasks to team members.

A better question: "Is this component providing tools to an agent (MCP), enabling peer-to-peer agent collaboration (A2A), transferring control within a system (handoff), or routing tasks across the system (orchestration)?"
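Of the four, orchestration is the easiest to sketch in plain code, because it's just routing logic. The specialist names and matching rules below are hypothetical; MCP tool calls, A2A negotiation, or handoffs would live inside the specialists this router selects.

```python
# Sketch of the orchestration layer only: decide which specialist
# handles a task. Everything else (tools, peers, handoffs) sits below it.
def route(task: str) -> str:
    task = task.lower()
    if "search the codebase" in task or "find the function" in task:
        return "code-search-specialist"
    if "write tests" in task:
        return "test-generation-specialist"
    return "general-agent"
```

Production routers are usually model-driven rather than keyword-driven, but the category boundary is the same: routing across the system, not tool access within one agent.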


Knowledge graphs vs. vector search

The confused question: "Should I use a knowledge graph or a vector database for my retrieval?"
Why people ask: Both are retrieval methods. Both get mentioned in the same conversations about RAG. It feels like an either/or choice.
What distinction you're missing: They solve different retrieval problems. Vector search finds items that are semantically similar to a query. It's good for "find me things related to X." Knowledge graphs store explicit relationships between entities (function A calls function B, module X imports module Y). They're good for "what is connected to X, and how?" The right answer is often neither or both. Many production systems use hybrid retrieval that combines vector search, lexical search, metadata filters, and sometimes graph traversal. The decision should be driven by what questions your system needs to answer, not by which technology sounds more advanced.
A better question: "What retrieval questions does my system actually need to answer, and which method handles each question type best?"
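A toy sketch of the two question types side by side, with made-up 2-d "embeddings" and a made-up import graph. Real embeddings have hundreds of dimensions and real graphs use proper stores; the shape of the question is what matters.

```python
import math

# "What is similar to X?" — vector search over toy 2-d embeddings.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

emb = {"auth": (1.0, 0.1), "login": (0.9, 0.2), "billing": (0.1, 1.0)}
most_similar = max(
    (k for k in emb if k != "login"),
    key=lambda k: cosine(emb[k], emb["login"]),
)

# "What is connected to X, and how?" — explicit relationships,
# e.g. "module X imports module Y", answered by traversal, not similarity.
imports = {"auth": ["crypto"], "billing": ["auth"], "crypto": []}

def importers_of(module: str) -> list[str]:
    return [m for m, deps in imports.items() if module in deps]
```

No amount of similarity search answers "which modules import auth?", and no graph traversal answers "what's semantically close to login?". Hybrid systems keep both.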

See the Retrieval Method Chooser for the full decision matrix.


RAG vs. vector DB

The confused question: "We're using Pinecone/Qdrant/Chroma, so we're doing RAG."
Why people ask: RAG tutorials almost always start with "set up a vector database," so the two concepts feel synonymous.
What distinction you're missing: RAG (Retrieval Augmented Generation) is a pattern: retrieve evidence, then generate a response grounded in that evidence. A vector database is one possible retrieval method. RAG can use lexical search, SQL queries, API calls, file-system lookups, AST parsing, or any combination. The "R" in RAG is "retrieval," not "vector search." Conflating them leads to a real engineering mistake: assuming that if you have a vector DB, you have good retrieval. You might have a vector DB full of badly chunked documents that returns irrelevant results. That's still RAG; it's just bad RAG.
A better question: "Is my retrieval step actually finding the right evidence, regardless of which method it uses?"
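Here's RAG with no vector database at all, as a sketch: the "R" is crude keyword matching over an in-memory corpus, and the pattern is still retrieve-then-generate. The documents and prompt wording are illustrative.

```python
# RAG without a vector DB: the pattern is retrieve, then generate
# grounded in what was retrieved. The retrieval method is pluggable.
DOCS = {
    "deploy.md": "Deploys fail when the memory limit is below 1Gi.",
    "style.md": "Use snake_case for Python function names.",
}

def retrieve(query: str) -> list[str]:
    """Toy lexical retrieval: any doc sharing a word with the query."""
    terms = set(query.lower().split())
    return [
        text for text in DOCS.values()
        if terms & set(text.lower().split())
    ]

def build_prompt(query: str) -> str:
    evidence = retrieve(query)
    return (
        "Answer using only this evidence:\n"
        + "\n".join(evidence)
        + f"\n\nQuestion: {query}"
    )

prompt = build_prompt("why do deploys fail")
```

Swap `retrieve` for a SQL query, an API call, or yes, a vector search, and it's the same pattern. If `retrieve` returns junk, no generation step saves you.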

Hallucination vs. bad retrieval

The confused question: "The model is hallucinating. Should I fine-tune it to stop?"
Why people ask: "Hallucination" has become a catch-all term for "the model said something wrong." If the model is broken, fix the model.
What distinction you're missing: Hallucination is when a model generates content that sounds confident but isn't supported by the evidence it was given (or fabricates details entirely). Bad retrieval is when the system gave the model the wrong evidence in the first place. These look identical from the outside (the user sees a wrong answer either way) but the fixes are completely different. If retrieval is feeding the model irrelevant or contradictory documents, fixing the model won't help. Fix the retrieval. If retrieval is giving the model the right evidence and it's still making things up, that's a model behavior issue you might address with better prompting, constrained output, or (as a last resort) fine-tuning.
A better question: "Before I blame the model, did the retrieval step actually give it the right evidence to work with?"
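A crude diagnostic sketch of that question: before labeling a wrong answer a hallucination, check whether the retrieved evidence ever contained the facts the answer relied on. Substring matching here stands in for a real claim-verification step.

```python
# Sketch diagnostic: did retrieval actually supply the facts the
# answer needed? False entries point at retrieval, not the model.
def evidence_covers(answer_claims: list[str], retrieved: list[str]) -> dict:
    joined = " ".join(retrieved).lower()
    return {claim: claim.lower() in joined for claim in answer_claims}

coverage = evidence_covers(
    answer_claims=["memory limit", "restart policy"],
    retrieved=["Deploys fail when the memory limit is below 1Gi."],
)
# "restart policy" was never in the evidence: if the answer asserts it,
# either the model fabricated it or retrieval failed to surface it.
```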

LLM vs. SLM vs. reasoning model

The confused question: "What's better, an LLM or an SLM?" or "Should I use a reasoning model for everything?"
Why people ask: It seems like bigger and more capable should always be better. Or conversely, that smaller and cheaper should always be preferred.
What distinction you're missing: These aren't competing options on a single scale. They're different tools for different jobs:
Workhorse LLM
Strength: Broad capability, good tool use, reliable instruction following.
Cost/speed profile: Moderate cost, fast.
When to use it: Most tasks. The default choice.

Reasoning model
Strength: Complex multi-step reasoning, math, code generation with constraints.
Cost/speed profile: Higher cost, slower.
When to use it: Hard problems where the workhorse model fails. Not everything.

SLM
Strength: Runs on consumer hardware, predictable cost, low latency, data stays local.
Cost/speed profile: Very low cost, very fast.
When to use it: When privacy, offline operation, or cost predictability matters more than peak capability.

The question isn't "which is better?" The question is "what does this specific task need, and what constraints am I operating under?" Production systems often use multiple model types: a workhorse for most tasks, a reasoning model for hard problems, and an SLM for high-volume or privacy-sensitive operations.

A better question: "What capability level does this specific task require, and what are my cost, latency, and privacy constraints?"
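That routing logic can be sketched directly. The tier names below are illustrative labels, not real model IDs, and real routers weigh more constraints (latency budgets, cost ceilings, eval results).

```python
# Sketch: pick a model tier from the task's constraints, not from a
# single "which is better" scale. Tier names are placeholders.
def choose_model(needs_deep_reasoning: bool, privacy_sensitive: bool) -> str:
    if privacy_sensitive:
        return "local-slm"         # data stays on your hardware
    if needs_deep_reasoning:
        return "reasoning-model"   # slower and pricier; for hard problems
    return "workhorse-llm"         # the default for most tasks
```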


Vibe coding vs. engineering

The confused question: "I got it working by tweaking the prompt until the output looked right. Isn't that how AI engineering works?"
Why people ask: LLMs are remarkably responsive to prompt changes. It's genuinely possible to get impressive results by iterating on prompts until they "feel right." This works well enough in demos.
What distinction you're missing: Vibe coding is iterating until something looks right without measuring whether it actually is right. Engineering is iterating with evals, benchmarks, and traces so you know whether a change helped, hurt, or was neutral, and so you can reproduce your results. The gap between them isn't visible on day one. It becomes visible when the prompt that works for 5 test cases fails on the 50th, when you can't explain why last week's version was better, and when a teammate changes the prompt and you can't tell what broke. Evals close this gap. That's why the curriculum puts benchmarks and evals before most of the advanced modules.
A better question: "How do I know this change actually improved my system, and not just the five examples I happened to test?"
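The smallest possible version of "measure, don't guess" is a fixed test set and a pass rate. The cases and the stand-in system below are toy placeholders; the habit of comparing baseline vs. candidate on the same cases is the point.

```python
# Sketch: an eval is a fixed set of cases with known-good answers,
# scored the same way every run. Vibes can't do this.
def run_eval(system, cases: list[dict]) -> float:
    passed = sum(system(c["input"]) == c["expected"] for c in cases)
    return passed / len(cases)

cases = [
    {"input": "2+2", "expected": "4"},
    {"input": "3+3", "expected": "6"},
]
baseline = run_eval(lambda q: str(eval(q)), cases)  # toy stand-in "system"
# Record baseline, change one thing, re-run on the SAME cases, compare.
```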

Anthropic's developer platform vs. Claude on AWS Bedrock

The confused question: "I use Claude through AWS Bedrock. Is this curriculum's Anthropic path the same thing?"
Why people ask: Both paths give you Claude. The model is the same. The lesson code shows from anthropic import Anthropic, and it's natural to assume that's the only way to call Claude.
What distinction you're missing: Anthropic's developer platform (platform.claude.com) is the direct API with Anthropic-issued API keys. AWS Bedrock is a cloud platform surface that hosts Claude (and other models) behind AWS IAM credentials and the Bedrock Converse API. The model behavior, prompt engineering, and curriculum concepts are identical. The API shape, authentication, and model IDs differ. This curriculum standardizes on the direct provider SDKs for lesson code. If you're using Bedrock, follow the Anthropic tab and apply the translation table in the Cloud Provider Surfaces reference.
A better question: "How do I translate the Anthropic examples in this curriculum to Bedrock's Converse API?"
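A side-by-side sketch of the two request envelopes (payloads only, no network call). These follow the general shape of the Anthropic Messages API and the Bedrock Converse API; the model IDs are placeholders, so check your console for current IDs before using either.

```python
# Same model family, same prompt; only the request envelope differs.
prompt = "Summarize this incident report."

anthropic_request = {            # Anthropic Messages API shape
    "model": "claude-sonnet-example",            # placeholder ID
    "max_tokens": 512,
    "messages": [{"role": "user", "content": prompt}],
}

bedrock_converse_request = {     # Bedrock Converse API shape (boto3 kwargs)
    "modelId": "anthropic.claude-example-v1",    # placeholder ID
    "inferenceConfig": {"maxTokens": 512},
    "messages": [{"role": "user", "content": [{"text": prompt}]}],
}
```

Note the differences worth memorizing: `model` vs. `modelId`, top-level `max_tokens` vs. `inferenceConfig.maxTokens`, and a plain content string vs. a list of content blocks.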

Direct provider API vs. GitHub Models vs. GitHub Copilot SDK

The confused question: "If GitHub Models and GitHub Copilot SDK are both real APIs, why does the curriculum treat them differently?"
Why people ask: All three surfaces are programmable. All three can be involved in an AI app. From the outside, they can look like interchangeable ways to "call a model."
What distinction you're missing: Direct provider APIs are the model publisher's own contract surface: OpenAI, Gemini, Anthropic. Hosted inference platforms like Hugging Face, Ollama Cloud, and GitHub Models route to or host models from various publishers. GitHub Models is a hosted inference and routing layer that sits in front of multiple publishers, and the curriculum now supports it as a hosted path where that surface fits the lesson contract. GitHub Copilot SDK is an agent runtime that adds sessions, tools, and control flow on top. That is why GitHub Models appears in lesson tabs now while Copilot SDK still does not.
A better question: "Am I trying to learn the direct model API contract, a hosted routing platform, or an agent runtime?"

Quick-reference table

For fast lookup, here are all the confusions in one place:

Training vs. inference: Training changes weights; inference uses them. Almost all AI engineering is inference-time.
Context engineering vs. "more tokens": Window size is capacity; context engineering is selection.
Context rot vs. "longer is better": Accumulated context degrades quality. Curation beats accumulation.
LLM-as-judge vs. human eval: LLM-as-judge scales breadth; humans ensure depth. Use both.
MCP vs. A2A vs. handoff vs. orchestration: Four levels: tool access, peer agents, control transfer, routing logic.
Knowledge graphs vs. vector search: Graphs store relationships; vectors find similarity. Often complementary.
RAG vs. vector DB: RAG is a pattern; a vector DB is one possible method.
Hallucination vs. bad retrieval: Same symptom, different cause. Check retrieval before blaming the model.
LLM vs. SLM vs. reasoning model: Different tools for different jobs, not a single quality scale.
Vibe coding vs. engineering: Evals are the difference. Measure, don't guess.
Anthropic platform vs. Bedrock: Same model, different API surface. Concepts transfer; SDK calls need translation.
Direct provider API vs. GitHub Models vs. Copilot SDK: The provider API is the base contract; GitHub Models is routing/inference; the Copilot SDK is an agent runtime.

What's next

Choosing a Provider. Now that the category boundaries are clearer, pick the developer surface you'll use first.

Then read Choose Your Track to decide whether you're starting cloud-first, local-first, or balanced.


Glossary
Foundational terms

API (Application Programming Interface)
A structured way for programs to communicate. In this context, usually an HTTP endpoint you call to interact with an LLM.
AST (Abstract Syntax Tree)
A tree representation of source code structure. Used by parsers like Tree-sitter to understand code as a hierarchy of functions, classes, and statements. You'll encounter this more deeply in the Code Retrieval module, but the concept appears briefly in retrieval fundamentals.
BM25 (Best Match 25)
A classical ranking function for keyword search. Scores documents by term frequency and inverse document frequency. Often competitive with or complementary to vector search.
Chunking
Splitting a document into smaller pieces for indexing and retrieval. Chunk boundaries significantly affect retrieval quality. Split at the wrong place and your retrieval will return half a function or the end of one paragraph glued to the start of another.
Context engineering
The discipline of selecting, packaging, and budgeting the information a model sees at inference time. Prompts, retrieved evidence, tool results, memory, and state are all parts of context. Context engineering is arguably the core skill of AI engineering. Bigger context windows are not a substitute for better context selection.
Context rot
Degradation of output quality caused by stale, noisy, or accumulated context. Symptoms include stale memory facts, conflicting retrieved evidence, bloated prompt history, and accumulated instructions that contradict each other. A form of technical debt in AI systems.
Context window
The maximum number of tokens an LLM can process in a single request (input + output combined).
Embedding
A fixed-length numeric vector representing a piece of text. Used for similarity search: texts with similar meanings have nearby embeddings.
Endpoint
A specific URL path that accepts requests and returns responses (e.g., POST /v1/chat/completions).
GGUF
A file format for quantized models used by llama.cpp and Ollama. When you see a model name like qwen2.5:7b-q4_K_M, the suffix indicates the quantization scheme. GGUF supports mixed quantization (different precision for different layers) and is the most common format for local inference.
Hallucination
When a model generates content that sounds confident but isn't supported by the evidence it was given, or fabricates details that don't exist. Not the same as "any wrong answer"; a model that misinterprets ambiguous instructions gave a bad answer but didn't hallucinate. Common causes: weak prompt, missing context, context rot, model limitation, or retrieval failure.
Inference
Running a trained model to generate output from input. What happens when you call an API. Most AI engineering work is inference-time work: building systems around models, not training them. Use "inference," not "inferencing."
JSON (JavaScript Object Notation)
A lightweight text format for structured data. The lingua franca of API communication.
Lexical search
Finding items by matching keywords or terms. Includes BM25, TF-IDF (Term Frequency–Inverse Document Frequency), and simple keyword matching. Returns exact term matches, not semantic similarity.
LLM (Large Language Model)
A neural network trained on large text corpora that generates text by predicting the next token. The core technology behind AI engineering; every tool, pattern, and pipeline in this curriculum runs on top of one.
Metadata
Structured information about a document or chunk (file path, language, author, date, symbol type). Used for filtering retrieval results.
Neural network
A computing system loosely inspired by biological neurons, built from layers of mathematical functions that transform inputs into outputs. LLMs are a specific type of neural network (transformers) trained on text. You don't need to understand neural network internals to do AI engineering, but knowing the term helps when reading external resources.
Reasoning model
A model optimized for complex multi-step planning, math, and logic (e.g., o3, o4-mini). Slower and more expensive but better on hard problems. Sometimes called "LRM" (large reasoning model), but "reasoning model" is the more consistent term across provider docs.
Reranking
A second-pass scoring step that re-orders retrieved results using a more expensive model. Improves precision after an initial broad retrieval.
Schema
A formal description of the shape and types of a data structure. Used to validate inputs and outputs.
SLM (small language model)
A compact model (typically 1-7B parameters) that runs on consumer hardware with lower cost, lower latency, and better privacy (e.g., Phi, small Llama variants, Gemma). The right choice when privacy, offline operation, predictable cost, or low latency matter more than peak capability.
System prompt
A special message that sets the model's behavior, role, and constraints for a conversation.
Temperature
A parameter controlling output randomness. Lower values produce more deterministic output; higher values produce more varied output. Does not affect the model's intelligence.
Token
The basic unit an LLM processes. Not a word. Tokens are sub-word fragments. "unhappiness" might be three tokens: "un", "happi", "ness". Token count determines cost and context window usage.
Top-k
The number of results returned from a retrieval query. "Top-5" means the five highest-scoring results.
Top-p (nucleus sampling)
An alternative to temperature for controlling output diversity. Selects from the smallest set of tokens whose cumulative probability exceeds p.
Vector search
Finding items by proximity in embedding space (nearest neighbors). Returns "similar" results, not "exact match" results.
vLLM
An inference serving engine (not a model) that hosts open-weight models behind an OpenAI-compatible HTTP endpoint. Infrastructure layer, not model layer. Relevant when moving from hosted APIs to self-hosting.
Weights
The learned parameters inside a model. Changed during training, fixed during inference.
Workhorse model
A general-purpose LLM optimized for speed and broad capability (e.g., GPT-4o-mini, Claude Haiku, Gemini Flash). The default for most tasks. When someone says "LLM" without qualification, they usually mean this.
Benchmark and Harness terms

Baseline
The first measured performance of your system on a benchmark. Everything else is compared against this. Without a baseline, you can't tell whether a change helped.
Benchmark
A fixed set of questions or tasks with known-good answers, used to measure system performance over time.
Run log
A structured record (typically JSONL) of every system run: what input was given, what output was produced, what tools were called, how long it took, and what it cost. The raw data that evals, telemetry, and cost analysis are built from.
Agent and Tool Building terms

A2A (Agent-to-Agent protocol)
An open protocol for peer-to-peer agent collaboration. Agents discover each other's capabilities and delegate or negotiate tasks as equals. Different from MCP (which connects agents to tools, not to other agents) and from handoffs (which transfer control within one system).
Agent
A system where an LLM decides which tools to call, observes results, and iterates until a task is complete. Agent = model + tools + control loop.
Control loop
The code that manages the agent's cycle: send prompt, check for tool calls, execute tools, append results, repeat or finish.
Handoff
Passing control from one agent or specialist to another within an orchestrated system.
MCP (Model Context Protocol)
An open protocol for exposing tools, resources, and prompts to AI applications in a standardized way. Connects agents to capabilities (tools and data), not to other agents.
Tool calling / function calling
The model's ability to request execution of a specific function with structured arguments, rather than just generating text.
Code Retrieval terms

Context compilation / context packing
The process of selecting and assembling the smallest useful set of evidence for a specific task. Not "dump everything retrieved into the prompt."
Grounding
Tying model assertions to specific evidence. A grounded answer cites what it found; an ungrounded answer asserts without evidence.
Hybrid retrieval
Combining multiple retrieval methods (e.g., vector search + keyword search + metadata filters) and merging or reranking the results.
Knowledge graph
A data structure that stores entities and their relationships explicitly (e.g., "function A calls function B," "module X imports module Y"). Useful for traversal and dependency reasoning. One retrieval strategy among several, often overused when simpler metadata or adjacency tables would suffice.
RAG (Retrieval-Augmented Generation)
A pattern where the model's response is grounded in retrieved external evidence rather than relying solely on its training data.
Symbol table
A mapping of code identifiers (functions, classes, variables) to their locations and metadata.
Tree-sitter
An incremental parsing library that builds ASTs for source code. Used in this curriculum for code-aware chunking and symbol extraction.
RAG and Grounded Answers terms

Context pack
A structured bundle of evidence assembled for a specific task, with metadata about provenance, relevance, and token budget.
Evidence bundle
A collection of retrieved items grouped for a specific sub-task, with enough metadata to evaluate whether the evidence is relevant and sufficient.
Retrieval routing
Deciding which retrieval strategy or method to use for a given query. Different questions need different retrieval methods.
Observability and Evals terms

Eval
A structured test that measures system quality. Not the same as training. Evals measure; they don't change the model.
Harness (AI harness / eval harness)
The experiment and evaluation framework around your model or agent. It runs benchmark tasks, captures outputs, logs traces, grades results, and compares system versions. It turns ad hoc "try it and see" into repeatable, comparable experiments. Typically includes: input dataset, prompt and tool configuration, model/provider selection, execution loop, logging, grading, and artifact capture.
LLM-as-judge
Using a language model to evaluate or grade the output of another model or system. Useful for scaling evaluation beyond manual review, but requires rubric quality, judge consistency checks, and human spot-checking. Not a replacement for exact-match checks where they apply.
OpenTelemetry (OTel)
An open standard for collecting and exporting telemetry data (traces, metrics, logs). Vendor-agnostic.
RAGAS
A specific eval framework for retrieval-augmented generation. Measures metrics like faithfulness, relevance, and context precision. One tool example, not a foundational concept. Learn the metrics first, then the tool.
Span
A single operation within a trace (e.g., one tool call, one retrieval query). Traces are made of spans.
Telemetry
Structured data about system behavior: what happened, when, how long it took, what it cost. Includes traces, metrics, and events.
Trace
A structured record of one complete run through the system, including all steps, tool calls, and decisions.
Orchestration and Memory terms

Long-term memory
Persistent facts that survive across conversations. Requires write policies to manage what gets stored, updated, or deleted.
Orchestration
Explicit control over how tasks are routed, delegated, and synthesized across multiple agents or specialists.
Router
A component that decides which specialist or workflow path to use for a given query.
Specialist
An agent or workflow tuned for a narrow task (e.g., "code search," "documentation lookup," "test generation"). Specialists are composed by an orchestrator.
Thread memory
Conversation state that persists within a single session or thread.
Workflow memory
Intermediate state that persists within a multi-step task but doesn't survive beyond the workflow's completion.
Optimization terms

Catastrophic forgetting
When fine-tuning causes a model to lose capabilities it had before training. The model gets better at the fine-tuned task but worse at tasks it previously handled. PEFT methods like LoRA reduce this risk by freezing original weights.
Distillation
Training a smaller (student) model to reproduce the behavior of a larger (teacher) model on a specific task.
DPO (Direct Preference Optimization)
A method for preference-based model optimization that's simpler than RLHF, training the model directly on preference pairs without a separate reward model.
Fine-tuning
Updating a model's weights on task-specific data to change its behavior permanently. An umbrella term that includes SFT, instruction tuning, RLHF, DPO, and other techniques. See the fine-tuning landscape table in Lesson 8.3 for how these relate.
Full fine-tuning
Updating all of a model's parameters during training, as opposed to PEFT methods that update only a small subset. Requires significantly more GPU memory and compute. Produces the most thorough adaptation but carries higher risk of catastrophic forgetting.
Inference server
Software (like vLLM or Ollama) that hosts a model and serves inference requests.
Instruction tuning
A specific application of SFT where the training data consists of instruction-response pairs. This is how base models become chat models: the technique is SFT, the data format is instructions. Not a separate technique from SFT.
LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning method that trains small adapter matrices instead of updating all model weights. Dramatically reduces GPU memory and compute requirements.
Overfitting
When a model memorizes training examples instead of learning generalizable patterns. The model performs well on training data but poorly on new inputs. Detected by monitoring validation loss alongside training loss.
Parameter count
The number of learned weights in a model, commonly expressed in billions (e.g., "7B" = 7 billion parameters). Determines memory requirements (roughly 2 bytes per parameter at FP16) and broadly correlates with capability, though training quality and architecture matter as much as size. See Model Selection and Serving for sizing guidance.
PEFT (Parameter-Efficient Fine-Tuning)
A family of methods (including LoRA) that fine-tune a small subset of parameters instead of the full model.
Preference optimization
Training methods (RLHF, DPO) that use human or automated preference signals to improve model behavior. "This output is better than that output" rather than "this is the correct output."
QLoRA (Quantized LoRA)
LoRA applied to a quantized (compressed) base model. Further reduces memory requirements, enabling fine-tuning on consumer hardware.
Quantization
Reducing the precision of model weights (e.g., FP16 → INT4) to shrink memory usage and increase inference speed at some quality cost. A 7B model at FP16 needs ~14 GB VRAM; quantized to 4-bit, it fits in ~4 GB. Common formats include GGUF (llama.cpp/Ollama), GPTQ and AWQ (vLLM/HuggingFace). See Model Selection and Serving for format details and tradeoffs.
RLHF (Reinforcement Learning from Human Feedback)
A training method that uses human preference signals to improve model behavior through a reward model. More complex than DPO (requires training a separate reward model) but offers more control over the optimization objective.
SFT (Supervised Fine-Tuning)
Fine-tuning using input-output pairs where the desired output is known. The most common fine-tuning approach.
TRL (Transformer Reinforcement Learning)
A Hugging Face library for training language models with reinforcement learning, SFT, and other optimization methods.
Cross-cutting terms

Consumer chat app
The browser or desktop product meant for human conversation (ChatGPT, Claude, HuggingChat). Useful for experimentation, but not the same as API access.
Developer platform
The provider's API, billing, API-key, and developer-docs surface. This is what you need for this learning path.
Hosted API
The provider runs the model for you and you call it over HTTP.
Local inference
You run the model on your own machine.
Provider
The company or service that hosts a model API you call from code.
Prompt caching
Reusing computation from repeated prompt prefixes to reduce latency and cost on subsequent requests with the same prefix.
Rate limiting
Constraints on how many API requests you can make per unit of time. An operational concern that affects system design and cost.
Token budget
The maximum number of tokens you allocate for a specific part of the context (e.g., "retrieval evidence gets at most 4K tokens"). A context engineering tool for preventing any single component from dominating the context window.