Glossary
This glossary defines the technical terms used throughout the learning path. Terms are grouped by when you'll first encounter them, so you can read ahead for the module you're working on without being overwhelmed by terms from later modules.
Use this as a lookup reference, not something you need to read cover-to-cover.
Foundational terms
These terms appear in Module 1 (Python, FastAPI, LLM mental models, APIs, prompt engineering, retrieval fundamentals, and security basics).
API (Application Programming Interface) — A structured way for programs to communicate. In this context, usually an HTTP endpoint you call to interact with an LLM.
AST (Abstract Syntax Tree) — A tree representation of source code structure. Used by parsers like Tree-sitter to understand code as a hierarchy of functions, classes, and statements. You'll encounter this more deeply in the Code Retrieval module, but the concept appears briefly in retrieval fundamentals.
BM25 (Best Match 25) — A classical ranking function for keyword search. Scores documents by term frequency and inverse document frequency. Often competitive with or complementary to vector search.
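The scoring idea can be sketched in a few lines of Python. This is a simplified rendering of the formula, not a production implementation: `k1` and `b` are the standard free parameters, and documents are assumed to be pre-tokenized lists of terms.

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Score one pre-tokenized document against query terms."""
    avg_len = sum(len(d) for d in corpus) / len(corpus)
    score = 0.0
    for term in query_terms:
        tf = doc.count(term)                      # term frequency in this doc
        df = sum(1 for d in corpus if term in d)  # document frequency in corpus
        idf = math.log((len(corpus) - df + 0.5) / (df + 0.5) + 1)
        length_norm = 1 - b + b * len(doc) / avg_len
        score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score

corpus = [
    ["vector", "search", "embedding"],
    ["keyword", "search", "bm25"],
    ["chunking", "strategy"],
]
```

Rare terms get a high IDF, so a document matching "bm25" outranks one that only matches the common term "search".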
Chunking — Splitting a document into smaller pieces for indexing and retrieval. Chunk boundaries significantly affect retrieval quality. Split at the wrong place and your retrieval will return half a function or the end of one paragraph glued to the start of another.
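A minimal character-based sliding-window chunker, for illustration only; code-aware chunkers split on AST boundaries instead, but the overlap idea is the same.

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlap."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # step forward, keeping `overlap` chars of context
    return chunks
```

The overlap means a sentence cut at one chunk boundary still appears whole in the neighboring chunk.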
Context engineering — The discipline of selecting, packaging, and budgeting the information a model sees at inference time. Prompts, retrieved evidence, tool results, memory, and state are all parts of context. Context engineering is arguably the core skill of AI engineering. Bigger context windows are not a substitute for better context selection.
Context rot — Degradation of output quality caused by stale, noisy, or accumulated context. Symptoms include stale memory facts, conflicting retrieved evidence, bloated prompt history, and accumulated instructions that contradict each other. A form of technical debt in AI systems.
Context window — The maximum number of tokens an LLM can process in a single request (input + output combined).
Embedding — A fixed-length numeric vector representing a piece of text. Used for similarity search: texts with similar meanings have nearby embeddings.
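"Nearby" is usually measured with cosine similarity, which is easy to sketch over plain Python lists:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """1.0 = same direction, 0.0 = orthogonal (unrelated), -1.0 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```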
Endpoint — A specific URL path that accepts requests and returns responses (e.g., POST /v1/chat/completions).
GGUF — A file format for quantized models used by llama.cpp and Ollama. When you see a model name like qwen2.5:7b-q4_K_M, the suffix indicates the quantization scheme. GGUF supports mixed quantization (different precision for different layers) and is the most common format for local inference.
Hallucination — When a model generates content that sounds confident but isn't supported by the evidence it was given, or fabricates details that don't exist. Not the same as "any wrong answer"; a model that misinterprets ambiguous instructions gave a bad answer but didn't hallucinate. Common causes: weak prompt, missing context, context rot, model limitation, or retrieval failure.
Inference — Running a trained model to generate output from input. What happens when you call an API. Most AI engineering work is inference-time work: building systems around models, not training them. Use "inference," not "inferencing."
JSON (JavaScript Object Notation) — A lightweight text format for structured data. The lingua franca of API communication.
Lexical search — Finding items by matching keywords or terms. Includes BM25, TF-IDF (Term Frequency–Inverse Document Frequency), and simple keyword matching. Returns exact term matches, not semantic similarity.
LLM (Large Language Model) — A neural network trained on large text corpora that generates text by predicting the next token. The core technology behind AI engineering; every tool, pattern, and pipeline in this curriculum runs on top of one.
Metadata — Structured information about a document or chunk (file path, language, author, date, symbol type). Used for filtering retrieval results.
Neural network — A computing system loosely inspired by biological neurons, built from layers of mathematical functions that transform inputs into outputs. LLMs are a specific type of neural network (transformers) trained on text. You don't need to understand neural network internals to do AI engineering, but knowing the term helps when reading external resources.
Reasoning model — A model optimized for complex multi-step planning, math, and logic (e.g., o3, o4-mini). Slower and more expensive but better on hard problems. Sometimes called "LRM" (large reasoning model), but "reasoning model" is the more consistent term across provider docs.
Reranking — A second-pass scoring step that re-orders retrieved results using a more expensive model. Improves precision after an initial broad retrieval.
Schema — A formal description of the shape and types of a data structure. Used to validate inputs and outputs.
SLM (small language model) — A compact model (typically 1-7B parameters) that runs on consumer hardware, offering lower cost, lower latency, and better privacy (e.g., Phi, small Llama variants, Gemma). The right choice when privacy, offline operation, predictable cost, or low latency matter more than peak capability.
System prompt — A special message that sets the model's behavior, role, and constraints for a conversation.
Temperature — A parameter controlling output randomness. Lower values produce more deterministic output; higher values produce more varied output. Does not affect the model's intelligence.
Token — The basic unit an LLM processes. Not a word. Tokens are sub-word fragments. "unhappiness" might be three tokens: "un", "happi", "ness". Token count determines cost and context window usage.
Top-k — The number of results returned from a retrieval query. "Top-5" means the five highest-scoring results.
Top-p (nucleus sampling) — An alternative to temperature for controlling output diversity. Selects from the smallest set of tokens whose cumulative probability exceeds p.
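Both sampling controls can be illustrated on a toy distribution. This is a sketch of the mechanism, not any provider's exact implementation:

```python
import math

def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for numeric stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs: list[float], p: float = 0.9) -> list[int]:
    """Keep the smallest set of token indices whose cumulative probability >= p."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for idx, prob in ranked:
        kept.append(idx)
        cumulative += prob
        if cumulative >= p:
            break
    return kept
```

Lowering the temperature concentrates probability on the top token; top-p instead truncates the tail of the distribution before sampling.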
Vector search — Finding items by proximity in embedding space (nearest neighbors). Returns "similar" results, not "exact match" results.
vLLM — An inference serving engine (not a model) that hosts open-weight models behind an OpenAI-compatible HTTP endpoint. Infrastructure layer, not model layer. Relevant when moving from hosted APIs to self-hosting.
Weights — The learned parameters inside a model. Changed during training, fixed during inference.
Workhorse model — A general-purpose LLM optimized for speed and broad capability (e.g., GPT-4o-mini, Claude Haiku, Gemini Flash). The default for most tasks. When someone says "LLM" without qualification, they usually mean this.
Benchmark and Harness terms
These terms appear in Module 2 (defining "better," benchmark design, run logs).
Baseline — The first measured performance of your system on a benchmark. Everything else is compared against this. Without a baseline, you can't tell whether a change helped.
Benchmark — A fixed set of questions or tasks with known-good answers, used to measure system performance over time.
Run log — A structured record (typically JSONL) of every system run: what input was given, what output was produced, what tools were called, how long it took, and what it cost. The raw data that evals, telemetry, and cost analysis are built from.
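Writing a run log is as simple as appending one JSON object per line. The field names below are illustrative, not a fixed standard; pick a schema and keep it stable across runs.

```python
import json
import os
import tempfile
import time

def append_run_log(path: str, record: dict) -> None:
    """Append one run record as a single JSON line (the JSONL convention)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

log_path = os.path.join(tempfile.mkdtemp(), "runs.jsonl")
append_run_log(log_path, {
    "ts": time.time(),
    "input": "What is BM25?",
    "output": "A keyword ranking function.",
    "tool_calls": [],
    "latency_ms": 412,
    "cost_usd": 0.0003,
})
```

Append-only JSONL is convenient because each run is one parseable line, and later tooling (evals, cost analysis) can stream the file without loading it all.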
Agent and Tool Building terms
These terms appear in Module 3 (raw tool loops, framework agents, MCP servers).
A2A (Agent-to-Agent protocol) — An open protocol for peer-to-peer agent collaboration. Agents discover each other's capabilities and delegate or negotiate tasks as equals. Different from MCP (which connects agents to tools, not to other agents) and from handoffs (which transfer control within one system).
Agent — A system where an LLM decides which tools to call, observes results, and iterates until a task is complete. Agent = model + tools + control loop.
Control loop — The code that manages the agent's cycle: send prompt, check for tool calls, execute tools, append results, repeat or finish.
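A minimal control loop, using a stubbed model function in place of a real LLM API call; the message shapes are simplified relative to any actual provider SDK.

```python
import json

def fake_model(messages: list[dict]) -> dict:
    """Stub: requests one tool call, then finishes. A real system calls an LLM API here."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "add", "arguments": {"a": 2, "b": 3}}}
    return {"content": "The sum is 5."}

TOOLS = {"add": lambda a, b: a + b}

def run_agent(user_input: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": user_input}]
    for _ in range(max_steps):
        reply = fake_model(messages)
        call = reply.get("tool_call")
        if call is None:
            return reply["content"]                  # no tool requested: we're done
        result = TOOLS[call["name"]](**call["arguments"])  # execute the tool
        messages.append({"role": "tool", "content": json.dumps(result)})
    raise RuntimeError("agent did not finish within max_steps")
```

The `max_steps` cap matters in practice: without it, a confused model can loop on tool calls forever.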
Handoff — Passing control from one agent or specialist to another within an orchestrated system.
MCP (Model Context Protocol) — An open protocol for exposing tools, resources, and prompts to AI applications in a standardized way. Connects agents to capabilities (tools and data), not to other agents.
Tool calling / function calling — The model's ability to request execution of a specific function with structured arguments, rather than just generating text.
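Tools are typically described to the model with a JSON Schema. The shape below follows the OpenAI-style format; the exact wrapper keys vary by provider.

```python
# A tool definition the model can choose to call with structured arguments.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'Berlin'"},
            },
            "required": ["city"],
        },
    },
}
```

The model never executes anything itself: it emits a request matching this schema, and your control loop runs the function and feeds the result back.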
Code Retrieval terms
These terms appear in Module 4 (retrieval tiers from naive to graph/hybrid to compiled context).
Context compilation / context packing — The process of selecting and assembling the smallest useful set of evidence for a specific task. Not "dump everything retrieved into the prompt."
Grounding — Tying model assertions to specific evidence. A grounded answer cites what it found; an ungrounded answer asserts without evidence.
Hybrid retrieval — Combining multiple retrieval methods (e.g., vector search + keyword search + metadata filters) and merging or reranking the results.
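One common way to merge ranked lists from different retrieval methods is reciprocal rank fusion (RRF), sketched here:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked result lists; items ranked high in any list rise to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # k dampens the influence of top ranks; 60 is the commonly used default
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF only needs ranks, not scores, so it works even when the underlying methods (BM25, vector search) produce incomparable score scales.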
Knowledge graph — A data structure that stores entities and their relationships explicitly (e.g., "function A calls function B," "module X imports module Y"). Useful for traversal and dependency reasoning. One retrieval strategy among several, often overused when simpler metadata or adjacency tables would suffice.
RAG (Retrieval-Augmented Generation) — A pattern where the model's response is grounded in retrieved external evidence rather than relying solely on its training data.
Symbol table — A mapping of code identifiers (functions, classes, variables) to their locations and metadata.
Tree-sitter — An incremental parsing library that builds ASTs for source code. Used in this curriculum for code-aware chunking and symbol extraction.
RAG and Grounded Answers terms
These terms appear in Module 5 (RAG pipeline, evidence bundles, retrieval routing).
Context pack — A structured bundle of evidence assembled for a specific task, with metadata about provenance, relevance, and token budget.
Evidence bundle — A collection of retrieved items grouped for a specific sub-task, with enough metadata to evaluate whether the evidence is relevant and sufficient.
Retrieval routing — Deciding which retrieval strategy or method to use for a given query. Different questions need different retrieval methods.
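A deliberately naive router sketch; production routers usually use a classifier or a small LLM call, but the shape is the same. The heuristics and route names below are made up for illustration.

```python
def route_query(query: str) -> str:
    """Pick a retrieval strategy for a query using crude surface heuristics."""
    q = query.lower()
    if any(tok in q for tok in ("def ", "class ", "function", "error:")):
        return "code_search"       # looks like code or a stack trace
    if len(q.split()) <= 3:
        return "keyword_search"    # short queries: lexical match works well
    return "vector_search"         # longer natural-language questions
```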
Observability and Evals terms
These terms appear in Module 6 (telemetry, cost tracking, harness, retrieval evals, tool/trace evals).
Eval — A structured test that measures system quality. Not the same as training. Evals measure, they don't change the model.
Harness (AI harness / eval harness) — The experiment and evaluation framework around your model or agent. It runs benchmark tasks, captures outputs, logs traces, grades results, and compares system versions. It turns ad hoc "try it and see" into repeatable, comparable experiments. Typically includes: input dataset, prompt and tool configuration, model/provider selection, execution loop, logging, grading, and artifact capture.
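The core of a harness is a loop over benchmark cases plus a grader. A minimal sketch with an exact-match grader; real harnesses add logging, tracing, and versioned configs.

```python
def run_eval(system, dataset: list[dict]) -> tuple[float, list[dict]]:
    """Run each benchmark case through the system and grade the output."""
    results = []
    for case in dataset:
        output = system(case["input"])
        results.append({
            "input": case["input"],
            "output": output,
            "expected": case["expected"],
            "passed": output.strip() == case["expected"].strip(),  # exact match
        })
    score = sum(r["passed"] for r in results) / len(results)
    return score, results
```

Because the dataset and grader are fixed, two runs of `run_eval` against different system versions are directly comparable, which is the whole point of a harness.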
LLM-as-judge — Using a language model to evaluate or grade the output of another model or system. Useful for scaling evaluation beyond manual review, but requires rubric quality, judge consistency checks, and human spot-checking. Not a replacement for exact-match checks where they apply.
OpenTelemetry (OTel) — An open standard for collecting and exporting telemetry data (traces, metrics, logs). Vendor-agnostic.
RAGAS — A specific eval framework for retrieval-augmented generation. Measures metrics like faithfulness, relevance, and context precision. One tool example, not a foundational concept. Learn the metrics first, then the tool.
Span — A single operation within a trace (e.g., one tool call, one retrieval query). Traces are made of spans.
Telemetry — Structured data about system behavior: what happened, when, how long it took, what it cost. Includes traces, metrics, and events.
Trace — A structured record of one complete run through the system, including all steps, tool calls, and decisions.
Orchestration and Memory terms
These terms appear in Module 7 (subagents, specialists, A2A interop, thread memory, long-term memory).
Long-term memory — Persistent facts that survive across conversations. Requires write policies to manage what gets stored, updated, or deleted.
Orchestration — Explicit control over how tasks are routed, delegated, and synthesized across multiple agents or specialists.
Router — A component that decides which specialist or workflow path to use for a given query.
Specialist — An agent or workflow tuned for a narrow task (e.g., "code search," "documentation lookup," "test generation"). Specialists are composed by an orchestrator.
Thread memory — Conversation state that persists within a single session or thread.
Workflow memory — Intermediate state that persists within a multi-step task but doesn't survive beyond the workflow's completion.
Optimization terms
These terms appear in Module 8 (optimization taxonomy, distillation, fine-tuning).
Catastrophic forgetting — When fine-tuning causes a model to lose capabilities it had before training. The model gets better at the fine-tuned task but worse at tasks it previously handled. PEFT methods like LoRA reduce this risk by freezing original weights.
Distillation — Training a smaller (student) model to reproduce the behavior of a larger (teacher) model on a specific task.
DPO (Direct Preference Optimization) — A method for preference-based model optimization that's simpler than RLHF, training the model directly on preference pairs without a separate reward model.
Fine-tuning — Updating a model's weights on task-specific data to change its behavior permanently. An umbrella term that includes SFT, instruction tuning, RLHF, DPO, and other techniques. See the fine-tuning landscape table in Lesson 8.3 for how these relate.
Full fine-tuning — Updating all of a model's parameters during training, as opposed to PEFT methods that update only a small subset. Requires significantly more GPU memory and compute. Produces the most thorough adaptation but carries higher risk of catastrophic forgetting.
Inference server — Software (like vLLM or Ollama) that hosts a model and serves inference requests.
Instruction tuning — A specific application of SFT where the training data consists of instruction-response pairs. This is how base models become chat models: the technique is SFT, the data format is instructions. Not a separate technique from SFT.
LoRA (Low-Rank Adaptation) — A parameter-efficient fine-tuning method that trains small adapter matrices instead of updating all model weights. Dramatically reduces GPU memory and compute requirements.
Overfitting — When a model memorizes training examples instead of learning generalizable patterns. The model performs well on training data but poorly on new inputs. Detected by monitoring validation loss alongside training loss.
Parameter count — The number of learned weights in a model, commonly expressed in billions (e.g., "7B" = 7 billion parameters). Determines memory requirements (roughly 2 bytes per parameter at FP16) and broadly correlates with capability, though training quality and architecture matter as much as size. See Model Selection and Serving for sizing guidance.
PEFT (Parameter-Efficient Fine-Tuning) — A family of methods (including LoRA) that fine-tune a small subset of parameters instead of the full model.
Preference optimization — Training methods (RLHF, DPO) that use human or automated preference signals to improve model behavior. "This output is better than that output" rather than "this is the correct output."
QLoRA (Quantized LoRA) — LoRA applied to a quantized (compressed) base model. Further reduces memory requirements, enabling fine-tuning on consumer hardware.
Quantization — Reducing the precision of model weights (e.g., FP16 → INT4) to shrink memory usage and increase inference speed at some quality cost. A 7B model at FP16 needs ~14 GB VRAM; quantized to 4-bit, it fits in ~4 GB. Common formats include GGUF (llama.cpp, Ollama) and GPTQ and AWQ (vLLM, Hugging Face). See Model Selection and Serving for format details and tradeoffs.
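The back-of-the-envelope memory math from the parameter count and quantization entries can be sketched directly; the figures are rough and exclude runtime overhead.

```python
def weights_memory_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """VRAM needed just to hold the weights (excludes KV cache and activations).

    bytes_per_param: 2.0 for FP16, 1.0 for INT8, 0.5 for 4-bit quantization.
    """
    # billions of params x bytes per param = gigabytes (1e9 cancels out)
    return params_billions * bytes_per_param

# 7B at FP16 -> ~14 GB; 7B at 4-bit -> ~3.5 GB plus runtime overhead.
```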
RLHF (Reinforcement Learning from Human Feedback) — A training method that uses human preference signals to improve model behavior through a reward model. More complex than DPO (requires training a separate reward model) but offers more control over the optimization objective.
SFT (Supervised Fine-Tuning) — Fine-tuning using input-output pairs where the desired output is known. The most common fine-tuning approach.
TRL (Transformer Reinforcement Learning) — A Hugging Face library for training language models with reinforcement learning, SFT, and other optimization methods.
Cross-cutting terms
These terms appear across multiple modules and don't belong to a single phase.
Consumer chat app — The browser or desktop product meant for human conversation (ChatGPT, Claude, HuggingChat). Useful for experimentation, but not the same as API access.
Developer platform — The provider's API, billing, API-key, and developer-docs surface. This is what you need for this learning path.
Hosted API — The provider runs the model for you and you call it over HTTP.
Local inference — You run the model on your own machine.
Prompt caching — Reusing computation from repeated prompt prefixes to reduce latency and cost on subsequent requests with the same prefix.
Provider — The company or service that hosts a model API you call from code.
Rate limiting — Constraints on how many API requests you can make per unit of time. An operational concern that affects system design and cost.
Token budget — The maximum number of tokens you allocate for a specific part of the context (e.g., "retrieval evidence gets at most 4K tokens"). A context engineering tool for preventing any single component from dominating the context window.
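A token budget is enforced by an explicit packing step. A sketch using a crude characters-per-token heuristic; real systems count tokens with the model's actual tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Crude heuristic: roughly 4 characters per token for English text."""
    return max(1, len(text) // 4)

def pack_to_budget(chunks: list[str], budget_tokens: int) -> list[str]:
    """Keep chunks (assumed sorted by priority) until the token budget is spent."""
    packed, used = [], 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget_tokens:
            continue  # skip chunks that don't fit; cheaper ones later may still fit
        packed.append(chunk)
        used += cost
    return packed
```

The skip-rather-than-stop choice is deliberate: a single oversized chunk shouldn't starve smaller, still-useful evidence further down the priority list.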