
Common Category Mistakes

When you're new to AI engineering, you're not just missing answers. If you're like me, you don't know what you don't know, and you're probably unsure what questions you should even be asking. The concepts and terminology have enough overlap that it's easy to conflate things that are actually categorically unrelated, or to separate things that are actually the same idea at different scales.

This page collects many of my own original misunderstandings, plus the most common category mistakes I've seen others propagate along the way. Each entry explains why the confusion is natural, names the distinction you're missing, and offers a better version of the question to ask instead.

I'd recommend bookmarking this page. You'll likely come back to it more than once.


Training vs. inference

The confused question: "How do I train my AI to do X?"
Why people ask: A lot of content frames all AI work as "training." If the AI isn't doing what you want, surely you need to train it differently.
What distinction you're missing: Training changes the model's weights; it's how the model was built. Inference is running the model to generate output from input. Almost all AI engineering is inference-time work: improving what the model sees (context), how it's instructed (prompts), and what tools it can use. You don't need to train a model to make it behave differently. You need to give it better instructions and better evidence.
A better question: "How do I get better outputs from this model without changing its weights? And when would changing the weights actually be the right move?"

The entire journey until Module 8 (Optimization) is inference-time work. Training only enters the picture when we've exhausted every other option and have the evals to show it's needed.
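To make the inference-time levers concrete, here's a minimal sketch. The `build_request` helper is hypothetical (not from any SDK); the point is that everything you control here lives in the request payload, while the weights never change.

```python
# Sketch of the inference-time levers: instructions, evidence, constraints.
# The model's weights are fixed; behavior changes because the payload changes.

def build_request(task: str, evidence: list[str], tone: str) -> dict:
    """Assemble a hypothetical inference request. No training involved:
    better behavior comes from better instructions and better evidence."""
    return {
        "system": (
            f"You are a careful assistant. Answer in a {tone} tone. "
            "Cite only the evidence provided."
        ),
        "messages": [{
            "role": "user",
            "content": "\n\n".join(["Task: " + task, "Evidence:", *evidence]),
        }],
    }

req = build_request(
    task="Summarize the deploy failure",
    evidence=["Log: OOM at 14:02", "Config: memory limit 512Mi"],
    tone="concise",
)
```

Tweaking `tone`, swapping the evidence, or tightening the system instruction is all inference-time work; none of it touches the model.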


Context engineering vs. "more tokens"

The confused question: "My model's context window is 200K tokens. Why would I need context engineering?"
Why people ask: Context windows have grown dramatically. If you can fit everything in, why worry about what goes in?
What distinction you're missing: Context window size is capacity. Context engineering is selection. A 200K-token window that's full of irrelevant documents will produce worse results than a 4K-token window with exactly the right evidence. Models are sensitive to what's in their context: irrelevant information dilutes attention, conflicting evidence confuses reasoning, and stale facts produce confident wrong answers.
A better question: "Given that I have a large context window, how do I decide what deserves to be in it for this specific task?"
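A minimal sketch of selection over capacity: given scored candidate chunks, keep only the most relevant ones that fit a token budget. The chunk format and scores here are illustrative.

```python
# Selection beats capacity: rank candidates by relevance, then pack
# greedily under a token budget instead of dumping everything in.
def select_context(chunks: list[dict], budget_tokens: int) -> list[dict]:
    chosen, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        if used + chunk["tokens"] <= budget_tokens:
            chosen.append(chunk)
            used += chunk["tokens"]
    return chosen

chunks = [
    {"text": "relevant spec", "score": 0.92, "tokens": 900},
    {"text": "old meeting notes", "score": 0.31, "tokens": 2500},
    {"text": "exact error trace", "score": 0.88, "tokens": 600},
]
picked = select_context(chunks, budget_tokens=2000)
# The low-relevance 2500-token chunk never makes it in, even though a
# large window could technically hold it.
```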

Context rot vs. "longer is better"

The confused question: "Shouldn't I keep the full conversation history so the model has maximum context?"
Why people ask: Intuition says more information is better. If the model can see everything, it should make better decisions.
What distinction you're missing: Context rot is the degradation that happens when context accumulates without curation. Old conversation turns may contain facts that are no longer true. Retrieved evidence from earlier in a session may contradict newer evidence. Memory entries written hours ago may reflect outdated state. The model doesn't know which parts of its context are stale; it treats everything as equally current. Keeping everything means keeping the wrong things alongside the right things.
A better question: "What's my strategy for removing or demoting stale context as a conversation progresses?"
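One simple curation strategy, sketched below: timestamp every context entry and drop anything past a freshness window unless it's explicitly pinned. The entry shape and the `pinned` flag are illustrative assumptions, not a prescribed schema.

```python
# Sketch: prune stale context instead of accumulating everything.
def prune_context(entries: list[dict], now: float, max_age_s: float) -> list[dict]:
    """Keep fresh entries; drop anything older than max_age_s unless pinned."""
    return [
        e for e in entries
        if e.get("pinned") or now - e["written_at"] <= max_age_s
    ]

now = 10_000.0
entries = [
    {"fact": "user prefers Python", "written_at": 9_900.0},
    {"fact": "deploy is failing", "written_at": 2_000.0},   # stale, may no longer be true
    {"fact": "project name: atlas", "written_at": 1_000.0, "pinned": True},
]
fresh = prune_context(entries, now, max_age_s=3_600)
```

Real systems layer more on top (summarization of old turns, contradiction checks), but the core move is the same: curation is an explicit policy, not a side effect.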

LLM-as-judge vs. human eval

The confused question: "Can't I just use an LLM to evaluate all my outputs automatically?"
Why people ask: Manual evaluation is slow and doesn't scale. If an LLM can generate, surely it can judge.
What distinction you're missing: LLM-as-judge is a scaling technique, not a replacement for human judgment. It works well when you have a clear rubric, consistent evaluation criteria, and human spot-checks to validate the judge's calibration. It breaks down when the evaluation requires domain expertise the judge model doesn't have, when the rubric is ambiguous, or when you're grading the same model family that's doing the judging (self-evaluation bias). Use LLM-as-judge to cover breadth. Use human eval to ensure depth and catch systematic blind spots.
A better question: "For which of my eval dimensions can an LLM-as-judge reliably substitute for human review, and where do I still need human spot-checks?"
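The "human spot-checks" half of this pattern can be sketched in a few lines: sample a fraction of the judge's verdicts for human review, then track agreement. The record format is illustrative.

```python
import random

# Sketch: LLM judge covers breadth; a random sample goes to humans
# to check the judge's calibration on each eval dimension.
def spot_check_sample(judged: list[dict], rate: float, seed: int = 0) -> list[dict]:
    """Pick a reproducible random subset of judged rows for human review."""
    rng = random.Random(seed)
    return [row for row in judged if rng.random() < rate]

def agreement(judge_labels: list[str], human_labels: list[str]) -> float:
    """Fraction of spot-checked rows where judge and human agree."""
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)
```

If agreement on the sample drops below whatever threshold you trust, stop relying on the judge for that dimension and revisit the rubric.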

MCP vs. A2A vs. handoff vs. orchestration

The confused question: "What's the difference between MCP and A2A? Aren't they both ways for agents to talk to things?"
Why people ask: Both are protocols in the agent ecosystem. Both involve agents communicating with external capabilities. The naming and marketing blur the boundaries.
What distinction you're missing: These are four different concepts at different levels of abstraction:
MCP (Model Context Protocol)
What it does: Connects an agent to capabilities: tools, resources, and prompts exposed by a server. The agent calls tools; the server executes them.
Analogy: A worker using equipment. The worker decides what to do; the equipment does the physical task.

A2A (Agent-to-Agent protocol)
What it does: Connects peer agents that can negotiate, delegate, and collaborate as equals. Each agent has its own autonomy.
Analogy: Two colleagues discussing a project and splitting work.

Handoff
What it does: Transfers control from one agent or specialist to another within a single runtime or orchestration layer. One agent says "you take it from here."
Analogy: Passing a baton in a relay race.

Orchestration
What it does: The control logic that decides which agent, specialist, or workflow to activate for a given task. It's the routing and coordination layer.
Analogy: A project manager assigning tasks to team members.

A better question: "Is this component providing tools to an agent (MCP), enabling peer-to-peer agent collaboration (A2A), transferring control within a system (handoff), or routing tasks across the system (orchestration)?"
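Of the four, orchestration is the easiest to sketch in plain code, because it's just routing logic. The specialist names and matching rules below are hypothetical; MCP tool calls, A2A negotiation, or handoffs would live inside the specialists this router selects.

```python
# Sketch of the orchestration layer only: decide which specialist
# handles a task. Everything else (tools, peers, handoffs) sits below it.
def route(task: str) -> str:
    task = task.lower()
    if "search the codebase" in task or "find the function" in task:
        return "code-search-specialist"
    if "write tests" in task:
        return "test-generation-specialist"
    return "general-agent"
```

Production routers are usually model-driven rather than keyword-driven, but the category boundary is the same: routing across the system, not tool access within one agent.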


Knowledge graphs vs. vector search

The confused question: "Should I use a knowledge graph or a vector database for my retrieval?"
Why people ask: Both are retrieval methods. Both get mentioned in the same conversations about RAG. It feels like an either/or choice.
What distinction you're missing: They solve different retrieval problems. Vector search finds items that are semantically similar to a query. It's good for "find me things related to X." Knowledge graphs store explicit relationships between entities (function A calls function B, module X imports module Y). They're good for "what is connected to X, and how?" The right answer is often neither or both. Many production systems use hybrid retrieval that combines vector search, lexical search, metadata filters, and sometimes graph traversal. The decision should be driven by what questions your system needs to answer, not by which technology sounds more advanced.
A better question: "What retrieval questions does my system actually need to answer, and which method handles each question type best?"
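A toy sketch of the two question types side by side, with made-up 2-d "embeddings" and a made-up import graph. Real embeddings have hundreds of dimensions and real graphs use proper stores; the shape of the question is what matters.

```python
import math

# "What is similar to X?" — vector search over toy 2-d embeddings.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

emb = {"auth": (1.0, 0.1), "login": (0.9, 0.2), "billing": (0.1, 1.0)}
most_similar = max(
    (k for k in emb if k != "login"),
    key=lambda k: cosine(emb[k], emb["login"]),
)

# "What is connected to X, and how?" — explicit relationships,
# e.g. "module X imports module Y", answered by traversal, not similarity.
imports = {"auth": ["crypto"], "billing": ["auth"], "crypto": []}

def importers_of(module: str) -> list[str]:
    return [m for m, deps in imports.items() if module in deps]
```

No amount of similarity search answers "which modules import auth?", and no graph traversal answers "what's semantically close to login?". Hybrid systems keep both.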

See the Retrieval Method Chooser for the full decision matrix.


RAG vs. vector DB

The confused question: "We're using Pinecone/Qdrant/Chroma, so we're doing RAG."
Why people ask: RAG tutorials almost always start with "set up a vector database," so the two concepts feel synonymous.
What distinction you're missing: RAG (Retrieval Augmented Generation) is a pattern: retrieve evidence, then generate a response grounded in that evidence. A vector database is one possible retrieval method. RAG can use lexical search, SQL queries, API calls, file-system lookups, AST parsing, or any combination. The "R" in RAG is "retrieval," not "vector search." Conflating them leads to a real engineering mistake: assuming that if you have a vector DB, you have good retrieval. You might have a vector DB full of badly chunked documents that returns irrelevant results. That's still RAG; it's just bad RAG.
A better question: "Is my retrieval step actually finding the right evidence, regardless of which method it uses?"
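Here's RAG with no vector database at all, as a sketch: the "R" is crude keyword matching over an in-memory corpus, and the pattern is still retrieve-then-generate. The documents and prompt wording are illustrative.

```python
# RAG without a vector DB: the pattern is retrieve, then generate
# grounded in what was retrieved. The retrieval method is pluggable.
DOCS = {
    "deploy.md": "Deploys fail when the memory limit is below 1Gi.",
    "style.md": "Use snake_case for Python function names.",
}

def retrieve(query: str) -> list[str]:
    """Toy lexical retrieval: any doc sharing a word with the query."""
    terms = set(query.lower().split())
    return [
        text for text in DOCS.values()
        if terms & set(text.lower().split())
    ]

def build_prompt(query: str) -> str:
    evidence = retrieve(query)
    return (
        "Answer using only this evidence:\n"
        + "\n".join(evidence)
        + f"\n\nQuestion: {query}"
    )

prompt = build_prompt("why do deploys fail")
```

Swap `retrieve` for a SQL query, an API call, or yes, a vector search, and it's the same pattern. If `retrieve` returns junk, no generation step saves you.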

Hallucination vs. bad retrieval

The confused question: "The model is hallucinating. Should I fine-tune it to stop?"
Why people ask: "Hallucination" has become a catch-all term for "the model said something wrong." If the model is broken, fix the model.
What distinction you're missing: Hallucination is when a model generates content that sounds confident but isn't supported by the evidence it was given (or fabricates details entirely). Bad retrieval is when the system gave the model the wrong evidence in the first place. These look identical from the outside (the user sees a wrong answer either way) but the fixes are completely different. If retrieval is feeding the model irrelevant or contradictory documents, fixing the model won't help. Fix the retrieval. If retrieval is giving the model the right evidence and it's still making things up, that's a model behavior issue you might address with better prompting, constrained output, or (as a last resort) fine-tuning.
A better question: "Before I blame the model, did the retrieval step actually give it the right evidence to work with?"
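A crude diagnostic sketch of that question: before labeling a wrong answer a hallucination, check whether the retrieved evidence ever contained the facts the answer relied on. Substring matching here stands in for a real claim-verification step.

```python
# Sketch diagnostic: did retrieval actually supply the facts the
# answer needed? False entries point at retrieval, not the model.
def evidence_covers(answer_claims: list[str], retrieved: list[str]) -> dict:
    joined = " ".join(retrieved).lower()
    return {claim: claim.lower() in joined for claim in answer_claims}

coverage = evidence_covers(
    answer_claims=["memory limit", "restart policy"],
    retrieved=["Deploys fail when the memory limit is below 1Gi."],
)
# "restart policy" was never in the evidence: if the answer asserts it,
# either the model fabricated it or retrieval failed to surface it.
```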

LLM vs. SLM vs. reasoning model

The confused question: "What's better, an LLM or an SLM?" or "Should I use a reasoning model for everything?"
Why people ask: It seems like bigger and more capable should always be better. Or conversely, that smaller and cheaper should always be preferred.
What distinction you're missing: These aren't competing options on a single scale. They're different tools for different jobs:
Workhorse LLM
Strength: Broad capability, good tool use, reliable instruction following.
Cost/speed profile: Moderate cost, fast.
When to use it: Most tasks. The default choice.

Reasoning model
Strength: Complex multi-step reasoning, math, code generation with constraints.
Cost/speed profile: Higher cost, slower.
When to use it: Hard problems where the workhorse model fails. Not everything.

SLM
Strength: Runs on consumer hardware, predictable cost, low latency, data stays local.
Cost/speed profile: Very low cost, very fast.
When to use it: When privacy, offline operation, or cost predictability matters more than peak capability.

The question isn't "which is better?" The question is "what does this specific task need, and what constraints am I operating under?" Production systems often use multiple model types: a workhorse for most tasks, a reasoning model for hard problems, and an SLM for high-volume or privacy-sensitive operations.

A better question: "What capability level does this specific task require, and what are my cost, latency, and privacy constraints?"
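That routing logic can be sketched directly. The tier names below are illustrative labels, not real model IDs, and real routers weigh more constraints (latency budgets, cost ceilings, eval results).

```python
# Sketch: pick a model tier from the task's constraints, not from a
# single "which is better" scale. Tier names are placeholders.
def choose_model(needs_deep_reasoning: bool, privacy_sensitive: bool) -> str:
    if privacy_sensitive:
        return "local-slm"         # data stays on your hardware
    if needs_deep_reasoning:
        return "reasoning-model"   # slower and pricier; for hard problems
    return "workhorse-llm"         # the default for most tasks
```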


Vibe coding vs. engineering

The confused question: "I got it working by tweaking the prompt until the output looked right. Isn't that how AI engineering works?"
Why people ask: LLMs are remarkably responsive to prompt changes. It's genuinely possible to get impressive results by iterating on prompts until they "feel right." This works well enough in demos.
What distinction you're missing: Vibe coding is iterating until something looks right without measuring whether it actually is right. Engineering is iterating with evals, benchmarks, and traces so you know whether a change helped, hurt, or was neutral, and so you can reproduce your results. The gap between them isn't visible on day one. It becomes visible when the prompt that works for 5 test cases fails on the 50th, when you can't explain why last week's version was better, and when a teammate changes the prompt and you can't tell what broke. Evals close this gap. That's why the curriculum puts benchmarks and evals before most of the advanced modules.
A better question: "How do I know this change actually improved my system, and not just the five examples I happened to test?"
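The smallest possible version of "measure, don't guess" is a fixed test set and a pass rate. The cases and the stand-in system below are toy placeholders; the habit of comparing baseline vs. candidate on the same cases is the point.

```python
# Sketch: an eval is a fixed set of cases with known-good answers,
# scored the same way every run. Vibes can't do this.
def run_eval(system, cases: list[dict]) -> float:
    passed = sum(system(c["input"]) == c["expected"] for c in cases)
    return passed / len(cases)

cases = [
    {"input": "2+2", "expected": "4"},
    {"input": "3+3", "expected": "6"},
]
baseline = run_eval(lambda q: str(eval(q)), cases)  # toy stand-in "system"
# Record baseline, change one thing, re-run on the SAME cases, compare.
```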

Anthropic's developer platform vs. Claude on AWS Bedrock

The confused question: "I use Claude through AWS Bedrock. Is this curriculum's Anthropic path the same thing?"
Why people ask: Both paths give you Claude. The model is the same. The lesson code shows from anthropic import Anthropic, and it's natural to assume that's the only way to call Claude.
What distinction you're missing: Anthropic's developer platform (platform.claude.com) is the direct API with Anthropic-issued API keys. AWS Bedrock is a cloud platform surface that hosts Claude (and other models) behind AWS IAM credentials and the Bedrock Converse API. The model behavior, prompt engineering, and curriculum concepts are identical. The API shape, authentication, and model IDs differ. This curriculum standardizes on the direct provider SDKs for lesson code. If you're using Bedrock, follow the Anthropic tab and apply the translation table in the Cloud Provider Surfaces reference.
A better question: "How do I translate the Anthropic examples in this curriculum to Bedrock's Converse API?"
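A side-by-side sketch of the two request envelopes (payloads only, no network call). These follow the general shape of the Anthropic Messages API and the Bedrock Converse API; the model IDs are placeholders, so check your console for current IDs before using either.

```python
# Same model family, same prompt; only the request envelope differs.
prompt = "Summarize this incident report."

anthropic_request = {            # Anthropic Messages API shape
    "model": "claude-sonnet-example",            # placeholder ID
    "max_tokens": 512,
    "messages": [{"role": "user", "content": prompt}],
}

bedrock_converse_request = {     # Bedrock Converse API shape (boto3 kwargs)
    "modelId": "anthropic.claude-example-v1",    # placeholder ID
    "inferenceConfig": {"maxTokens": 512},
    "messages": [{"role": "user", "content": [{"text": prompt}]}],
}
```

Note the differences worth memorizing: `model` vs. `modelId`, top-level `max_tokens` vs. `inferenceConfig.maxTokens`, and a plain content string vs. a list of content blocks.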

Direct provider API vs. GitHub Models vs. GitHub Copilot SDK

The confused question: "If GitHub Models and GitHub Copilot SDK are both real APIs, why does the curriculum treat them differently?"
Why people ask: All three surfaces are programmable. All three can be involved in an AI app. From the outside, they can look like interchangeable ways to "call a model."
What distinction you're missing: Direct provider APIs are the model publisher's own contract surface: OpenAI, Gemini, Anthropic. Hosted inference platforms like Hugging Face, Ollama Cloud, and GitHub Models route to or host models from various publishers. GitHub Models is a hosted inference and routing layer that sits in front of multiple publishers, and the curriculum now supports it as a hosted path where that surface fits the lesson contract. GitHub Copilot SDK is an agent runtime that adds sessions, tools, and control flow on top. That is why GitHub Models appears in lesson tabs now while Copilot SDK still does not.
A better question: "Am I trying to learn the direct model API contract, a hosted routing platform, or an agent runtime?"

Quick-reference table

For fast lookup, here are all the confusions in one place:

Training vs. inference: Training changes weights; inference uses them. Almost all AI engineering is inference-time.
Context engineering vs. "more tokens": Window size is capacity; context engineering is selection.
Context rot vs. "longer is better": Accumulated context degrades quality. Curation beats accumulation.
LLM-as-judge vs. human eval: LLM-as-judge scales breadth; humans ensure depth. Use both.
MCP vs. A2A vs. handoff vs. orchestration: Four levels: tool access, peer agents, control transfer, routing logic.
Knowledge graphs vs. vector search: Graphs store relationships; vectors find similarity. Often complementary.
RAG vs. vector DB: RAG is a pattern; a vector DB is one possible method.
Hallucination vs. bad retrieval: Same symptom, different cause. Check retrieval before blaming the model.
LLM vs. SLM vs. reasoning model: Different tools for different jobs, not a single quality scale.
Vibe coding vs. engineering: Evals are the difference. Measure, don't guess.
Anthropic platform vs. Bedrock: Same model, different API surface. Concepts transfer; SDK calls need translation.
Direct provider API vs. GitHub Models vs. Copilot SDK: The provider API is the base contract; GitHub Models is routing/inference; the Copilot SDK is an agent runtime.

What's next

Choosing a Provider. Now that the category boundaries are clearer, pick the developer surface you'll use first.

Then read Choose Your Track to decide whether you're starting cloud-first, local-first, or balanced.


Glossary
Foundational terms

API (Application Programming Interface)
A structured way for programs to communicate. In this context, usually an HTTP endpoint you call to interact with an LLM.
AST (Abstract Syntax Tree)
A tree representation of source code structure. Used by parsers like Tree-sitter to understand code as a hierarchy of functions, classes, and statements. You'll encounter this more deeply in the Code Retrieval module, but the concept appears briefly in retrieval fundamentals.
BM25 (Best Match 25)
A classical ranking function for keyword search. Scores documents by term frequency and inverse document frequency. Often competitive with or complementary to vector search.
Chunking
Splitting a document into smaller pieces for indexing and retrieval. Chunk boundaries significantly affect retrieval quality. Split at the wrong place and your retrieval will return half a function or the end of one paragraph glued to the start of another.
Context engineering
The discipline of selecting, packaging, and budgeting the information a model sees at inference time. Prompts, retrieved evidence, tool results, memory, and state are all parts of context. Context engineering is arguably the core skill of AI engineering. Bigger context windows are not a substitute for better context selection.
Context rot
Degradation of output quality caused by stale, noisy, or accumulated context. Symptoms include stale memory facts, conflicting retrieved evidence, bloated prompt history, and accumulated instructions that contradict each other. A form of technical debt in AI systems.
Context window
The maximum number of tokens an LLM can process in a single request (input + output combined).
Embedding
A fixed-length numeric vector representing a piece of text. Used for similarity search: texts with similar meanings have nearby embeddings.
Endpoint
A specific URL path that accepts requests and returns responses (e.g., POST /v1/chat/completions).
GGUF
A file format for quantized models used by llama.cpp and Ollama. When you see a model name like qwen2.5:7b-q4_K_M, the suffix indicates the quantization scheme. GGUF supports mixed quantization (different precision for different layers) and is the most common format for local inference.
Hallucination
When a model generates content that sounds confident but isn't supported by the evidence it was given, or fabricates details that don't exist. Not the same as "any wrong answer"; a model that misinterprets ambiguous instructions gave a bad answer but didn't hallucinate. Common causes: weak prompt, missing context, context rot, model limitation, or retrieval failure.
Inference
Running a trained model to generate output from input. What happens when you call an API. Most AI engineering work is inference-time work: building systems around models, not training them. Use "inference," not "inferencing."
JSON (JavaScript Object Notation)
A lightweight text format for structured data. The lingua franca of API communication.
Lexical search
Finding items by matching keywords or terms. Includes BM25, TF-IDF (Term Frequency–Inverse Document Frequency), and simple keyword matching. Returns exact term matches, not semantic similarity.
LLM (Large Language Model)
A neural network trained on large text corpora that generates text by predicting the next token. The core technology behind AI engineering; every tool, pattern, and pipeline in this curriculum runs on top of one.
Metadata
Structured information about a document or chunk (file path, language, author, date, symbol type). Used for filtering retrieval results.
Neural network
A computing system loosely inspired by biological neurons, built from layers of mathematical functions that transform inputs into outputs. LLMs are a specific type of neural network (transformers) trained on text. You don't need to understand neural network internals to do AI engineering, but knowing the term helps when reading external resources.
Reasoning model
A model optimized for complex multi-step planning, math, and logic (e.g., o3, o4-mini). Slower and more expensive but better on hard problems. Sometimes called "LRM" (large reasoning model), but "reasoning model" is the more consistent term across provider docs.
Reranking
A second-pass scoring step that re-orders retrieved results using a more expensive model. Improves precision after an initial broad retrieval.
Schema
A formal description of the shape and types of a data structure. Used to validate inputs and outputs.
SLM (small language model)
A compact model (typically 1-7B parameters) that runs on consumer hardware with lower cost, lower latency, and better privacy (e.g., Phi, small Llama variants, Gemma). The right choice when privacy, offline operation, predictable cost, or low latency matter more than peak capability.
System prompt
A special message that sets the model's behavior, role, and constraints for a conversation.
Temperature
A parameter controlling output randomness. Lower values produce more deterministic output; higher values produce more varied output. Does not affect the model's intelligence.
Token
The basic unit an LLM processes. Not a word. Tokens are sub-word fragments. "unhappiness" might be three tokens: "un", "happi", "ness". Token count determines cost and context window usage.
Top-k
The number of results returned from a retrieval query. "Top-5" means the five highest-scoring results.
Top-p (nucleus sampling)
An alternative to temperature for controlling output diversity. Selects from the smallest set of tokens whose cumulative probability exceeds p.
Vector search
Finding items by proximity in embedding space (nearest neighbors). Returns "similar" results, not "exact match" results.
vLLM
An inference serving engine (not a model) that hosts open-weight models behind an OpenAI-compatible HTTP endpoint. Infrastructure layer, not model layer. Relevant when moving from hosted APIs to self-hosting.
Weights
The learned parameters inside a model. Changed during training, fixed during inference.
Workhorse model
A general-purpose LLM optimized for speed and broad capability (e.g., GPT-4o-mini, Claude Haiku, Gemini Flash). The default for most tasks. When someone says "LLM" without qualification, they usually mean this.
Benchmark and Harness terms

Baseline
The first measured performance of your system on a benchmark. Everything else is compared against this. Without a baseline, you can't tell whether a change helped.
Benchmark
A fixed set of questions or tasks with known-good answers, used to measure system performance over time.
Run log
A structured record (typically JSONL) of every system run: what input was given, what output was produced, what tools were called, how long it took, and what it cost. The raw data that evals, telemetry, and cost analysis are built from.
Agent and Tool Building terms

A2A (Agent-to-Agent protocol)
An open protocol for peer-to-peer agent collaboration. Agents discover each other's capabilities and delegate or negotiate tasks as equals. Different from MCP (which connects agents to tools, not to other agents) and from handoffs (which transfer control within one system).
Agent
A system where an LLM decides which tools to call, observes results, and iterates until a task is complete. Agent = model + tools + control loop.
Control loop
The code that manages the agent's cycle: send prompt, check for tool calls, execute tools, append results, repeat or finish.
Handoff
Passing control from one agent or specialist to another within an orchestrated system.
MCP (Model Context Protocol)
An open protocol for exposing tools, resources, and prompts to AI applications in a standardized way. Connects agents to capabilities (tools and data), not to other agents.
Tool calling / function calling
The model's ability to request execution of a specific function with structured arguments, rather than just generating text.
Code Retrieval terms

Context compilation / context packing
The process of selecting and assembling the smallest useful set of evidence for a specific task. Not "dump everything retrieved into the prompt."
Grounding
Tying model assertions to specific evidence. A grounded answer cites what it found; an ungrounded answer asserts without evidence.
Hybrid retrieval
Combining multiple retrieval methods (e.g., vector search + keyword search + metadata filters) and merging or reranking the results.
Knowledge graph
A data structure that stores entities and their relationships explicitly (e.g., "function A calls function B," "module X imports module Y"). Useful for traversal and dependency reasoning. One retrieval strategy among several, often overused when simpler metadata or adjacency tables would suffice.
RAG (Retrieval-Augmented Generation)
A pattern where the model's response is grounded in retrieved external evidence rather than relying solely on its training data.
Symbol table
A mapping of code identifiers (functions, classes, variables) to their locations and metadata.
Tree-sitter
An incremental parsing library that builds ASTs for source code. Used in this curriculum for code-aware chunking and symbol extraction.
RAG and Grounded Answers terms

Context pack
A structured bundle of evidence assembled for a specific task, with metadata about provenance, relevance, and token budget.
Evidence bundle
A collection of retrieved items grouped for a specific sub-task, with enough metadata to evaluate whether the evidence is relevant and sufficient.
Retrieval routing
Deciding which retrieval strategy or method to use for a given query. Different questions need different retrieval methods.
Observability and Evals terms

Eval
A structured test that measures system quality. Not the same as training. Evals measure; they don't change the model.
Harness (AI harness / eval harness)
The experiment and evaluation framework around your model or agent. It runs benchmark tasks, captures outputs, logs traces, grades results, and compares system versions. It turns ad hoc "try it and see" into repeatable, comparable experiments. Typically includes: input dataset, prompt and tool configuration, model/provider selection, execution loop, logging, grading, and artifact capture.
LLM-as-judge
Using a language model to evaluate or grade the output of another model or system. Useful for scaling evaluation beyond manual review, but requires rubric quality, judge consistency checks, and human spot-checking. Not a replacement for exact-match checks where they apply.
OpenTelemetry (OTel)
An open standard for collecting and exporting telemetry data (traces, metrics, logs). Vendor-agnostic.
RAGAS
A specific eval framework for retrieval-augmented generation. Measures metrics like faithfulness, relevance, and context precision. One tool example, not a foundational concept. Learn the metrics first, then the tool.
Span
A single operation within a trace (e.g., one tool call, one retrieval query). Traces are made of spans.
Telemetry
Structured data about system behavior: what happened, when, how long it took, what it cost. Includes traces, metrics, and events.
Trace
A structured record of one complete run through the system, including all steps, tool calls, and decisions.
Orchestration and Memory terms

Long-term memory
Persistent facts that survive across conversations. Requires write policies to manage what gets stored, updated, or deleted.
Orchestration
Explicit control over how tasks are routed, delegated, and synthesized across multiple agents or specialists.
Router
A component that decides which specialist or workflow path to use for a given query.
Specialist
An agent or workflow tuned for a narrow task (e.g., "code search," "documentation lookup," "test generation"). Specialists are composed by an orchestrator.
Thread memory
Conversation state that persists within a single session or thread.
Workflow memory
Intermediate state that persists within a multi-step task but doesn't survive beyond the workflow's completion.
Optimization terms

Catastrophic forgetting
When fine-tuning causes a model to lose capabilities it had before training. The model gets better at the fine-tuned task but worse at tasks it previously handled. PEFT methods like LoRA reduce this risk by freezing original weights.
Distillation
Training a smaller (student) model to reproduce the behavior of a larger (teacher) model on a specific task.
DPO (Direct Preference Optimization)
A method for preference-based model optimization that's simpler than RLHF, training the model directly on preference pairs without a separate reward model.
Fine-tuning
Updating a model's weights on task-specific data to change its behavior permanently. An umbrella term that includes SFT, instruction tuning, RLHF, DPO, and other techniques. See the fine-tuning landscape table in Lesson 8.3 for how these relate.
Full fine-tuning
Updating all of a model's parameters during training, as opposed to PEFT methods that update only a small subset. Requires significantly more GPU memory and compute. Produces the most thorough adaptation but carries higher risk of catastrophic forgetting.
Inference server
Software (like vLLM or Ollama) that hosts a model and serves inference requests.
Instruction tuning
A specific application of SFT where the training data consists of instruction-response pairs. This is how base models become chat models: the technique is SFT, the data format is instructions. Not a separate technique from SFT.
LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning method that trains small adapter matrices instead of updating all model weights. Dramatically reduces GPU memory and compute requirements.
Overfitting
When a model memorizes training examples instead of learning generalizable patterns. The model performs well on training data but poorly on new inputs. Detected by monitoring validation loss alongside training loss.
Parameter count
The number of learned weights in a model, commonly expressed in billions (e.g., "7B" = 7 billion parameters). Determines memory requirements (roughly 2 bytes per parameter at FP16) and broadly correlates with capability, though training quality and architecture matter as much as size. See Model Selection and Serving for sizing guidance.
PEFT (Parameter-Efficient Fine-Tuning)
A family of methods (including LoRA) that fine-tune a small subset of parameters instead of the full model.
Preference optimization
Training methods (RLHF, DPO) that use human or automated preference signals to improve model behavior. "This output is better than that output" rather than "this is the correct output."
QLoRA (Quantized LoRA)
LoRA applied to a quantized (compressed) base model. Further reduces memory requirements, enabling fine-tuning on consumer hardware.
Quantization
Reducing the precision of model weights (e.g., FP16 → INT4) to shrink memory usage and increase inference speed at some quality cost. A 7B model at FP16 needs ~14 GB VRAM; quantized to 4-bit, it fits in ~4 GB. Common formats include GGUF (llama.cpp/Ollama), GPTQ and AWQ (vLLM/HuggingFace). See Model Selection and Serving for format details and tradeoffs.
RLHF (Reinforcement Learning from Human Feedback)
A training method that uses human preference signals to improve model behavior through a reward model. More complex than DPO (requires training a separate reward model) but offers more control over the optimization objective.
SFT (Supervised Fine-Tuning)
Fine-tuning using input-output pairs where the desired output is known. The most common fine-tuning approach.
TRL (Transformer Reinforcement Learning)
A Hugging Face library for training language models with reinforcement learning, SFT, and other optimization methods.
Cross-cutting terms

Consumer chat app
The browser or desktop product meant for human conversation (ChatGPT, Claude, HuggingChat). Useful for experimentation, but not the same as API access.
Developer platform
The provider's API, billing, API-key, and developer-docs surface. This is what you need for this learning path.
Hosted API
The provider runs the model for you and you call it over HTTP.
Local inference
You run the model on your own machine.
Provider
The company or service that hosts a model API you call from code.
Prompt caching
Reusing computation from repeated prompt prefixes to reduce latency and cost on subsequent requests with the same prefix.
Rate limiting
Constraints on how many API requests you can make per unit of time. An operational concern that affects system design and cost.
Token budget
The maximum number of tokens you allocate for a specific part of the context (e.g., "retrieval evidence gets at most 4K tokens"). A context engineering tool for preventing any single component from dominating the context window.