LLM Mental Models
This lesson replaces magic with mechanics: a practical mental model for how LLMs actually behave.
Once tokens, context windows, messages, and the difference between weights and runtime context are clear, everything we build later will be much easier to reason about and debug. Most "why did the model do that?" moments trace back to one of the ideas in this lesson.
By the end, you'll have the internal vocabulary you need before writing your first prompt contract or making your first API call.
What you'll learn
- Explain what tokens are and why they determine cost, latency, and context limits
- Describe the difference between model weights (what the model learned during training) and runtime context (what you give it at inference time)
- Explain why model outputs vary between runs and what controls that variation
- Identify whether a problem is a prompt issue, a context issue, or a model limitation
- Explain the difference between training and inference
Concepts
Token: the basic unit a language model reads and produces. A token is roughly 3-4 characters of English text, but varies by language and model. Tokens determine three things you care about:
- Cost: you pay per token, and input tokens are priced separately from output tokens. Output tokens are typically 3-5x more expensive. The formula: cost = (input_tokens × input_price) + (output_tokens × output_price). For example, at $3/million input tokens and $15/million output tokens, a request with 10K input tokens and 1K output tokens costs $0.03 + $0.015 = $0.045.
- Latency: more tokens means slower responses, but input and output affect latency differently. Time to first token (TTFT) scales with input token count (the model processes your entire input before generating anything). Generation time scales linearly with output token count (the model generates one token at a time). So: total_latency ≈ TTFT(input_tokens) + (output_tokens × time_per_token).
- Context limits: every model has a hard ceiling: input_tokens + output_tokens ≤ context_window. If your system prompt is 2K tokens, your conversation history is 5K, and your retrieved evidence is 10K, you have used 17K tokens of input, and whatever remains is available for the model's response. This is why context budgeting matters from day one.
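The cost and budgeting formulas above can be combined into a quick back-of-the-envelope helper. This is a minimal sketch in pure Python; the prices and context window are illustrative assumptions, not tied to any specific model:

```python
# Back-of-the-envelope token budgeting, using the formulas above.
# The prices and context window here are illustrative assumptions.

CONTEXT_WINDOW = 128_000         # hypothetical model ceiling, in tokens
INPUT_PRICE = 3 / 1_000_000      # $3 per million input tokens
OUTPUT_PRICE = 15 / 1_000_000    # $15 per million output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """cost = (input_tokens x input_price) + (output_tokens x output_price)"""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

def remaining_output_budget(system: int, history: int, evidence: int) -> int:
    """Tokens left for the model's response after the input is accounted for."""
    used = system + history + evidence
    return max(CONTEXT_WINDOW - used, 0)

# The worked example from the text: 10K input + 1K output at $3/$15 per million.
print(f"${request_cost(10_000, 1_000):.3f}")   # 0.03 + 0.015 = $0.045

# 2K system + 5K history + 10K evidence: the rest of the window is for output.
print(remaining_output_budget(2_000, 5_000, 10_000))
```

Running a helper like this before each request is the habit the lesson recommends: know your input size and your cost ceiling before the model sees anything.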
Most providers offer a tokenizer or token-counting path you can use before sending a request (tiktoken for OpenAI models; Gemini exposes count_tokens; Anthropic exposes token counts in the response usage object and a count-tokens endpoint). Build the habit of checking token counts early. It prevents budget surprises and context overflow.
Context window: the maximum number of tokens a model can process in a single request. This includes everything: your system prompt, the conversation history, retrieved evidence, tool results, and the model's own response. When your input exceeds the context window, the model either truncates or refuses. Bigger context windows do not mean you should fill them. More context is not the same as better context.
Training vs inference: training is the process that creates the model's weights (its learned knowledge and behaviors). Inference is what happens when you send a request to the model and get a response. Training changes the model. Inference uses the model as-is. Most AI engineering work is inference-time work: choosing what context to provide, how to structure prompts, which tools to expose, and how to evaluate outputs. We won't train a model until Modules 7 and 8 of this curriculum.
Weights: the model's learned parameters, fixed at training time. Weights encode general knowledge, language patterns, and reasoning capabilities. You cannot change weights at inference time. When you send a prompt, you are not teaching the model; you are steering it within its existing capabilities.
Runtime context: everything you provide in a single request: system prompt, user message, conversation history, retrieved documents, tool results. This is the only thing you control at inference time. The quality of your system's output depends heavily on what context you select and how you structure it.
Temperature: a parameter that controls randomness in the model's output. Temperature 0 gives the most deterministic output (always picks the highest-probability token). Higher temperatures introduce more variation. Temperature does not make the model "more creative" in a reliable way; it makes it more random.
Top-p (nucleus sampling): an alternative to temperature for controlling output randomness. Instead of scaling all token probabilities, top-p truncates the distribution to the smallest set of tokens whose cumulative probability exceeds the threshold p. In practice, most applications set temperature or top-p, not both.
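To make the two knobs concrete, here is a toy sketch of both applied to a hand-made next-token distribution. This is illustrative pure Python, not any provider's actual sampler; the probabilities are invented:

```python
# A toy next-token distribution (token -> probability). Made up for illustration.
probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}

def apply_temperature(dist, temperature):
    """Rescale probabilities by exponent 1/temperature, then renormalize.
    Temperature 0 collapses onto the top token; higher values flatten the curve."""
    if temperature == 0:
        top = max(dist, key=dist.get)
        return {top: 1.0}
    scaled = {t: p ** (1 / temperature) for t, p in dist.items()}
    total = sum(scaled.values())
    return {t: p / total for t, p in scaled.items()}

def apply_top_p(dist, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p,
    then renormalize over that set (nucleus sampling)."""
    kept, cumulative = {}, 0.0
    for token, prob in sorted(dist.items(), key=lambda kv: -kv[1]):
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {t: pr / total for t, pr in kept.items()}

print(apply_temperature(probs, 0))   # {'the': 1.0} -- deterministic pick
print(apply_top_p(probs, 0.8))       # only 'the' and 'a' survive the cutoff
```

Note how top-p discards the tail entirely while temperature reshapes the whole distribution; that is why setting both at once gives hard-to-reason-about behavior.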
Message roles: LLM APIs structure conversations as sequences of messages, each with a role:
- System: instructions that frame the model's behavior for the entire conversation
- User: the human's input
- Assistant: the model's previous responses
- Tool: results returned from tool calls
The order, structure, and content of these messages are your primary levers for controlling model behavior.
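In OpenAI-style chat APIs, that sequence is literally a list of role-tagged dictionaries. A minimal illustration (the content strings are made up for this example):

```python
# A conversation as most chat-completion APIs represent it:
# an ordered list of {"role": ..., "content": ...} messages.
messages = [
    {"role": "system", "content": "Answer in one sentence. Cite only provided evidence."},
    {"role": "user", "content": "What does the retry flag do?"},
    {"role": "assistant", "content": "It re-sends failed requests up to three times."},
    {"role": "user", "content": "And where is it configured?"},
]

# The model sees the whole list on every call; "memory" is just you
# re-sending prior turns as assistant/user messages.
roles = [m["role"] for m in messages]
print(roles)  # ['system', 'user', 'assistant', 'user']
```

This is why context grows with every turn: each follow-up question carries the entire history back to the model as input tokens.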
Model families you'll encounter
The term "LLM" (large language model) is the umbrella for most models you'll work with, but not all models are the same kind of tool. You'll encounter three categories:
Workhorse models (standard LLMs): general-purpose models optimized for speed and broad capability. GPT-4o-mini, Claude Haiku, and Gemini Flash are examples. They handle most tasks well: summarization, extraction, classification, generation. This is what you'll use for most of your work and what the exercises in this lesson use. When someone says "LLM" without qualification, they usually mean this.
Reasoning models: models optimized for complex multi-step planning, math, and logic. OpenAI's o3 and o4-mini are examples. They spend more time "thinking" before answering, which makes them slower and more expensive but significantly better on hard problems. You will hear some people call these "LRMs" (large reasoning models), but "reasoning models" is the more consistent term across provider documentation. The key tradeoff: use reasoning models when the task genuinely requires multi-step planning or complex logic, not as a default for every call.
Small language models (SLMs): compact models (typically 1-7B parameters) that run on consumer hardware with lower cost, lower latency, and better privacy. Microsoft's Phi models, Meta's smaller Llama variants, and Google's Gemma models are examples. SLMs are weaker on broad tasks than frontier models, but they are the right choice when privacy, offline operation, predictable cost, or low latency matter more than peak capability. You will encounter SLMs in Module 8 (distillation and fine-tuning) when you compress a capable model's behavior into a smaller one.
A note on vLLM: You may see "vLLM" mentioned in AI engineering discussions. It's not a model, but an inference serving engine that lets you host open-weight models behind an OpenAI-compatible HTTP endpoint. It belongs to the infrastructure layer, not the model layer. You don't need it now. It becomes relevant if you move from calling hosted APIs to self-hosting models, which is an advanced topic covered in the Hardware and Model Size Guide.
For this curriculum, you'll primarily use workhorse models through hosted APIs. The concepts you learn (tokens, context windows, prompt contracts, tool calling, retrieval) apply identically to reasoning models, SLMs, and self-hosted models. The only things that change are cost, latency, capability boundaries, and where the model runs.
If you want the concrete version of this landscape, not just the abstract categories, read the Model-Provider Matrix. That page tracks which exact model/provider combinations actually held the lesson contracts when I smoke-tested the guide.
Walkthrough
Before you start, read Choosing a Provider. In this path, OpenAI Platform, Gemini API, Anthropic's developer platform, Hugging Face, GitHub Models, Ollama Local, and Ollama Cloud are all valid starting paths. You can keep more than one configured. The critical distinction: consumer chat apps (chatgpt.com, gemini.google.com, claude.ai, huggingface.co/chat) are not the same thing as the developer platforms used in these lessons.
- OpenAI: go to https://platform.openai.com/
- Gemini: https://aistudio.google.com/ for keys and https://ai.google.dev/ for the API docs
- Anthropic: https://platform.claude.com/
- Hugging Face: https://huggingface.co/settings/tokens for HF_TOKEN and https://huggingface.co/docs/inference-providers/index for the API path
- GitHub Models: use the GitHub Models inference API docs and a GitHub token
- Local Ollama: install it from https://ollama.com/download
- Ollama Cloud: https://ollama.com/api
Setup: your first model call
Before you can experiment with tokens, temperature, and prompts, you need to be able to make a model call. This is a minimal quickstart, just enough to run the exercises in this lesson. The full API lesson comes next.
The lesson concept is the same across all supported paths; only the SDK surface, auth variable, and response shape differ. Pick your provider tab and use that version all the way through this lesson.
```shell
mkdir llm-experiments && cd llm-experiments
python -m venv .venv && source .venv/bin/activate
pip install openai
export OPENAI_API_KEY="sk-..."
```

```python
# first_call.py
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is a token in the context of LLMs?"}
    ],
    temperature=0
)

print(response.choices[0].message.content)
print(f"\nTokens used: {response.usage.total_tokens}")
print(f"  Input: {response.usage.prompt_tokens}")
print(f"  Output: {response.usage.completion_tokens}")
```

Run it:

```shell
python first_call.py
```

If you see a response and token counts, you are ready. If you get an authentication error, check your API key. If you get a rate limit error, wait a moment and retry.
You'll use this setup throughout the exercises below. The Build with APIs lesson covers structured outputs, multi-turn conversations, tool calling, and error handling in depth.
Tokens are not words
A tokenizer is the tool that splits text into the tokens a specific model actually uses. Different models use different tokenizers, so the same sentence may produce different token counts depending on the model. Try these:
- OpenAI Tokenizer — a web-based tool where you paste text and see exactly how it splits into tokens, color-coded. The fastest way to build intuition. Supports GPT-4o and GPT-3.5 tokenization.
- tiktoken — OpenAI's Python library for counting tokens programmatically. Install with pip install tiktoken. Use this when you need token counts in your code (e.g., budgeting context before sending a request).
- Hugging Face Tokenizers docs — a good reference for how modern tokenizers work across open models, especially if you are learning with Hugging Face or self-hosted models.
- Anthropic token counts — Anthropic does not publish a standalone tokenizer, but every API response includes usage.input_tokens and usage.output_tokens in the response body. You can also use the /v1/messages/count_tokens endpoint to count tokens before sending a request.
- Ollama usage counters — Ollama responses report token-like usage counters such as prompt_eval_count and eval_count. If you are using Ollama Cloud or local Ollama, those fields are your first place to build token intuition.
Paste a few sentences into the OpenAI Tokenizer and observe: tokens do not align with word boundaries. "Unbelievable" might be 2-3 tokens. A code snippet with unusual variable names might tokenize into many small pieces. This matters because:
- You pay per token, not per word
- Your context window is measured in tokens
- Long variable names, formatting, and whitespace consume tokens
Build intuition for token counts early. You'll need it when budgeting context in retrieval and prompt design.
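Until you wire in a real tokenizer, a crude characters-divided-by-four heuristic is enough for ballpark budgeting. This sketch is an approximation only; real token counts vary by model, language, and text, as the tokenizer tools above will show you:

```python
def approx_tokens(text: str) -> int:
    """Very rough estimate: ~4 characters per token for English prose.
    Real tokenizers differ by model; use this only for ballpark budgeting."""
    return max(1, round(len(text) / 4))

prompt = "Summarize the following incident report in three bullet points."
print(approx_tokens(prompt))  # a ballpark figure, not an exact count
```

Swap this out for tiktoken (or your provider's count-tokens endpoint) as soon as exact numbers matter, e.g., when you are close to the context window.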
Weights vs context: the most important distinction
The model's weights are fixed. When you send a prompt, you are not "teaching" the model anything; you are selecting which capabilities to activate by providing context. This distinction matters for every decision you'll make:
- If the model does not know something, adding it to the prompt (context) can help. Training the model (weights) is a different, much heavier intervention.
- If the model "forgets" instructions, it is because the context is too long, not because it lost knowledge. The weights have not changed.
- If the model generates something that sounds confident but isn't supported by evidence, that's a hallucination. We'll cover this in more detail below.
Why outputs vary
Even at temperature 0, models are not perfectly deterministic across API calls (batching, infrastructure changes, and floating-point precision can cause minor variation). At higher temperatures, variation is by design. This means:
- You cannot rely on exact string matching for evaluation
- You need structured output schemas (covered in Build with APIs) to get predictable shapes
- Evaluation must account for acceptable variation
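One practical consequence: graders should normalize before comparing. Here is a minimal sketch of tolerant matching, assuming you only care that key facts appear rather than exact wording (the example strings are invented):

```python
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace -- so harmless
    run-to-run variation doesn't fail the check."""
    text = text.lower().strip()
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text)

def contains_facts(answer: str, required: list[str]) -> bool:
    """Pass if every required fact appears in the normalized answer,
    regardless of the phrasing around it."""
    norm = normalize(answer)
    return all(normalize(fact) in norm for fact in required)

# Two runs phrase the same correct answer differently; both should pass.
run_a = "A token is the basic unit the model reads."
run_b = "Tokens are   the basic unit the model reads!"
print(contains_facts(run_a, ["basic unit"]))  # True
print(contains_facts(run_b, ["basic unit"]))  # True
```

Substring checks like this are the crudest form of tolerant grading; structured output schemas and LLM-as-judge evaluation (covered later) are the sturdier versions of the same idea.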
Hallucination: what it is and what it isn't
You'll hear "hallucination" constantly in AI discussions, often used loosely to mean "the model said something wrong." It's worth being more precise, because the cause determines the fix.
Hallucination means the model generates content that sounds confident and plausible but isn't supported by the evidence it was given, or fabricates details that don't exist. A model that invents a function name that isn't in the codebase, cites a paper that was never written, or states a configuration value that doesn't match the actual config is hallucinating.
What hallucination is not:
- It's not "any wrong answer." A model that misinterprets ambiguous instructions gave a bad answer, but it didn't hallucinate; it followed the prompt poorly. That's a prompt issue.
- It's not "the model is lying." The model has no concept of truth. It generates the most probable continuation of its input. When the input doesn't contain enough grounding evidence, the model fills gaps from its weights, and its weights don't always reflect reality.
Common causes:
- Weak or vague prompt: the model wasn't told to stick to provided evidence, so it generated freely
- Missing or insufficient context: the answer required information that wasn't in the context window
- Stale or conflicting context: the model had evidence, but it was outdated or contradictory (this is context rot showing up as hallucination)
- Model limitation: the task requires knowledge or reasoning the model genuinely can't do
- Retrieval failure: the system retrieved the wrong evidence, and the model faithfully grounded its answer in that wrong evidence. This is especially tricky because the answer looks grounded but is still wrong.
The last cause is important: retrieval doesn't "solve" hallucination. Bad retrieval can produce confident, well-cited, wrong answers. We'll revisit this in Retrieval Fundamentals, and the full engineering treatment (grounding, citation, abstention, faithfulness evaluation) will be discussed in Modules 5 and 6.
For now, the key takeaway is: when the model says something wrong, don't just say "it hallucinated." Use the diagnostic below to figure out why it said something wrong, because the fix depends on the cause.
The four-way diagnostic
When the model gives a bad answer, the cause is one of:
- Prompt issue: the instructions are ambiguous, incomplete, or conflicting
- Context issue: the model has the wrong evidence, too much evidence, or stale evidence
- Model limitation: the task exceeds the model's capabilities (e.g., complex math, very long reasoning chains)
- Evaluation issue: the answer is actually fine but your grading criteria are wrong
Learning to quickly identify which category you are in will save you from the most common debugging trap: changing the prompt when the problem is actually the context (or vice versa).
Exercises
1. Send the same prompt at three different temperatures (0, 0.5, 1.0) and compare outputs. Note what changes and what stays the same.
2. Count approximate token growth as you add long chat history to a request. At what point does the context window start to matter?
3. Create three versions of the same task with:
   - No system prompt
   - A weak, vague system prompt
   - A strict, specific system prompt
   Compare outputs and write a short note explaining what changed, what stayed stochastic, and what moved from "prompt issue" to "context issue."
4. Find one case where the model gives a wrong answer. Diagnose it using the four-way diagnostic: is it a prompt issue, context issue, model limitation, or evaluation issue? Write one sentence explaining your diagnosis.
Completion checkpoint
You can explain:
- Why longer context is not the same as better context
- The difference between weights (training) and context (inference)
- Why model outputs vary and what controls the variation
- How to tell the difference between a prompt problem and a context problem
Connecting to the project
The experiments you ran here are standalone scripts, and that's intentional. This lesson is focused on building mental models, not building a service. But everything you learned applies directly to the FastAPI project you started in lesson 1 and will extend in lesson 4:
- When your API endpoint calls a model, you'll construct messages with system, user, and tool roles
- When your service slows down, you'll check token counts to find the bottleneck
- When the model returns wrong answers, you'll use the four-way diagnostic to isolate the cause
- When we add retrieval in Module 4, context budgeting will become a daily concern
Keep the llm-experiments/ scripts around, as they'll be useful for quick tests throughout the curriculum.
What's next
Prompt Engineering. Now that the model feels mechanical instead of mystical, the next lesson shows how to shape its inputs with contracts, examples, and decomposition.
References
Start here
- OpenAI: How tokens work — practical introduction to tokenization with examples
Build with this
- OpenAI API docs — message structure, parameters, and response format
- Gemini API quickstart — direct Gemini API setup, auth, and first request
- Anthropic API docs — same concepts, different provider; compare the message formats
- Hugging Face chat completion docs — hosted chat-completion API on Hugging Face, including OpenAI-compatible usage
- Ollama API introduction — local vs cloud base URLs and the core API surface
Deep dive
- Gemini token counting guide — how Gemini counts tokens before and after a request
- Anthropic: Prompt engineering guide — goes deeper on how prompt structure affects model behavior
- Hugging Face Tokenizers docs — tokenizer internals and APIs across open-model ecosystems
- OpenAI: Reasoning model best practices — how reasoning models differ from workhorse models and when to use them
- Microsoft Learn: Local small language models — when SLMs are the right choice (privacy, offline, predictable cost, low latency)
- vLLM: OpenAI-compatible server docs — how self-hosted inference works behind the same API surface (infrastructure reference, not needed now)
- Why Language Models Hallucinate (Kalai et al., 2025) — explains why hallucinations are a statistical consequence of training and evaluation, not a bug to be patched. Useful for understanding the deeper "why" behind the hallucination section in this lesson