Module 7: Orchestration and Memory

Orchestration and Subagents

You now have a single-agent system with four eval families, a harness that runs in CI, and trace labels that tell you exactly where behavior breaks down. That measurement foundation changes what's possible. Before evals, adding a second agent was guesswork because you couldn't tell whether the split helped or just added complexity. Now you can.

This lesson covers why and when to add multi-agent coordination. The next lesson will cover how to design the specialists themselves. We're separating these because the decision to split is more important than the implementation. A well-measured single agent beats an unmeasured multi-agent system every time.

What you'll learn

  • Identify when a single-agent system has outgrown its architecture using eval evidence
  • Apply decision criteria for splitting work across multiple agents
  • Build a minimal orchestrator that routes tasks and synthesizes results using LangGraph
  • Trace multi-agent execution paths using the observability tools from Module 6
  • Compare orchestrated results against the single-agent baseline using your existing benchmark

Concepts

Orchestration — explicit control over routing, delegation, state, retries, and synthesis in a multi-agent system. Orchestration is the portable concept underneath every multi-agent framework. Whether you use LangGraph, the OpenAI Agents SDK, or Claude Agent SDK, you're making the same decisions: which agent handles this task, what state do they share, what happens when one fails, and how do their outputs combine. The framework is the implementation; orchestration is the design.

Subagent — an agent that handles a narrower task under the direction of an orchestrator. A subagent has its own system prompt, its own tool access, and (often) its own model configuration. The orchestrator decides when to invoke it, what context to pass, and how to use its output. Think of subagents as specialists called on demand, rather than autonomous peers.

Router — the decision logic that examines an incoming request and determines which agent or path should handle it. We built a simpler version of routing in the retrieval module (code vs. docs vs. hybrid). Multi-agent routing extends the same idea: instead of choosing a retrieval mode, the router chooses which specialist to activate.

Synthesis — the step where the orchestrator combines outputs from one or more subagents into a final response. Synthesis isn't just concatenation. The orchestrator may need to resolve conflicts, fill gaps, or decide that one specialist's answer supersedes another's.

Control flow graph — a directed graph where nodes represent processing steps (agent calls, tool invocations, transformations) and edges represent transitions. LangGraph makes this literal: you define nodes and edges, and the framework handles execution, state passing, and checkpointing. The graph makes control flow visible, which matters when you're debugging multi-agent traces.

Problem-to-Tool Map

| Problem class | Symptom | Cheapest thing to try first | Tool or approach |
| --- | --- | --- | --- |
| One agent does everything poorly | Routing, synthesis, retrieval, and debugging all blur together | Better prompting and tool descriptions | Add one orchestrator and a small number of specialists |
| Specialist prompts conflict | The main prompt grows huge and contradictory | Subtasks via plain functions | Subagents with isolated prompts |
| Long workflows need state | Multi-step tasks become brittle or lose context | Manual state dictionary | Graph-based stateful orchestration |
| Can't tell if split helps | You added agents but quality didn't visibly improve | Run the benchmark | Compare single-agent vs. multi-agent on the same eval suite |

Walkthrough

When to split: the eval-driven decision

The instinct to split a system into multiple agents usually comes from feeling that one agent is doing too much. That feeling is valid, but it's not sufficient. You need eval evidence.

Look at your trace label distribution from the previous lesson. Three signals suggest a split will help:

  1. Route-dependent quality gaps. If code questions score 85% but docs questions score 45%, the single agent's prompt may be trying to serve two different tasks with incompatible instructions. A specialist for each domain can carry a more focused prompt.

  2. Correct-but-wasteful traces. If 20%+ of traces are correct_but_wasteful, the system is doing unnecessary work. An orchestrator that routes to the right specialist can skip irrelevant retrieval and tool calls.

  3. Conflicting tool-use patterns. If the tool grader shows the system calling search_code for documentation questions (or vice versa), routing logic that activates the right tools for each task type will help more than prompt tweaks.

If none of these signals are present (if your single agent scores well across question types and the waste rate is low), you don't need orchestration yet. The complexity cost of multi-agent coordination is real, and it only pays off when the single-agent system has measurable limitations.
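
The first signal is easy to compute directly from a graded run log. A minimal sketch, assuming each graded record is one JSON object per line with a `question_type` field and a boolean `correct` field; rename these to match your harness's actual fields:

```python
import json
from collections import defaultdict


def per_route_accuracy(path: str) -> dict[str, float]:
    """Accuracy per question type from a graded JSONL run log."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            qtype = record.get("question_type", "unknown")
            totals[qtype] += 1
            hits[qtype] += bool(record.get("correct"))
    # Per-type accuracy: fraction of correct answers for each question type
    return {qtype: hits[qtype] / totals[qtype] for qtype in totals}
```

A result like `{'code': 0.85, 'docs': 0.45}` is exactly the route-dependent quality gap described above.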

What orchestration actually controls

Regardless of the framework, orchestration makes five things explicit:

  • Routing: which agent or path handles this request
  • Delegation: what context and instructions the subagent receives
  • State: what information persists across steps
  • Retries: what happens when a subagent fails or returns low-quality output
  • Synthesis: how outputs from multiple subagents combine into a final response

These five decisions exist whether you implement them in LangGraph, the OpenAI Agents SDK, plain Python functions, or any other framework. The framework gives you structure; the decisions are yours.
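
To see that the five decisions are independent of any framework, here is a toy orchestrator in plain Python. Everything in it (the keyword routing, the stand-in quality check) is illustrative, not part of the project:

```python
def orchestrate(question: str, specialists: dict, max_retries: int = 1) -> str:
    """Toy orchestrator that makes the five decisions explicit."""
    # State: what persists across steps
    state: dict = {"question": question, "outputs": []}

    # Routing: which path handles this request (toy keyword heuristic)
    route = "code" if "function" in question.lower() else "docs"

    # Delegation + retries: invoke the specialist, retry on empty output
    for _ in range(max_retries + 1):
        output = specialists[route](question)
        if output.strip():  # Crude quality check standing in for a grader
            state["outputs"].append(output)
            break

    # Synthesis: combine specialist output into the final response
    return f"[{route}] {' '.join(state['outputs'])}"
```

Swapping in LangGraph replaces the plumbing, not the decisions.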

Default: LangGraph

Why this is the default: You've already worked with LangGraph in the framework-agent lesson from Module 3. Now we'll use its graph structure to make multi-agent control flow explicit and inspectable. The graph representation maps naturally to the trace visualization you're already using, which means debugging multi-agent behavior uses the same tools.

Portable concept underneath: Orchestration is explicit control over routing, delegation, state, retries, and synthesis. Any framework that lets you define these five things clearly will work.

Closest alternatives and when to switch:

  • Claude Agent SDK: use when you're building primarily with Claude models and want first-party support for tool use, multi-turn conversations, and agent orchestration within Anthropic's ecosystem. See platform.claude.com/docs for the SDK documentation.
  • OpenAI Agents SDK: use when you want lightweight handoffs, sessions, and built-in tracing with a simpler abstraction than a full graph runtime. Works well for systems where agents pass control linearly rather than in complex branching patterns.
  • No multi-agent split: stay with the single agent if evals don't show real gains from specialization. This is always a valid choice.

Building the orchestrator

We'll start with the simplest useful orchestrator: a router node that classifies the incoming question, specialist nodes that handle each question type, and a synthesis node that assembles the final response.

project/
├── orchestration/
│   ├── __init__.py
│   ├── graph.py            # The orchestration graph
│   ├── router.py           # Routing logic
│   ├── specialists.py      # Specialist agent definitions
│   └── synthesis.py        # Output assembly
├── harness/                # Existing eval harness
├── retrieval/              # Existing retrieval layer
└── observability/          # Existing tracing

First, define the state that flows through the graph:

# orchestration/graph.py
"""Multi-agent orchestration graph.

Routes incoming questions to specialist agents and synthesizes
their outputs into a final response.
"""
from __future__ import annotations

from typing import TypedDict

from langgraph.graph import StateGraph, END


class OrchestratorState(TypedDict, total=False):
    """State passed between nodes in the orchestration graph.

    total=False: a run starts with just the question; the other
    keys are filled in as nodes execute. A TypedDict keeps dict-style
    access (state["question"]) consistent across all node functions.
    """
    question: str
    route: str                  # Which specialist handles this
    specialist_output: str      # Raw specialist response
    evidence: list[str]
    tools_called: list[str]
    final_answer: str
    confidence: float
    retry_count: int

Next, the router. This extends the routing logic you built in the retrieval module, but now it chooses a specialist rather than a retrieval mode:

# orchestration/router.py
"""Route incoming questions to the appropriate specialist.

Uses an LLM classifier to determine which specialist should
handle the question. Falls back to the general path for
ambiguous questions.
"""
from openai import OpenAI

client = OpenAI()

ROUTE_PROMPT = """Classify this question about a code repository.
Choose exactly one route:

- "code": questions about specific code, functions, classes, or implementation details
- "docs": questions about documentation, architecture, or high-level design
- "debug": questions about errors, failures, test output, or debugging
- "general": questions that don't clearly fit the above categories

Question: {question}

Route:"""


def route_question(state: dict) -> dict:
    """Classify the question and set the route."""
    response = client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[
            {"role": "user", "content": ROUTE_PROMPT.format(question=state["question"])}
        ],
        max_tokens=10,
    )
    raw_route = response.choices[0].message.content or ""
    route = raw_route.splitlines()[0] if raw_route else ""
    # Some models answer with `Route: debug` instead of just `debug`.
    if ":" in route:
        route = route.split(":", 1)[1]
    route = route.strip().lower().strip('"')

    valid_routes = {"code", "docs", "debug", "general"}
    if route not in valid_routes:
        route = "general"

    return {**state, "route": route}
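
Before wiring the router into a graph, it's worth measuring how often it agrees with your expected routes. A small helper written against any routing function, so you can check the harness itself with a stub before spending API calls (the labeled-pairs format is an assumption, not a harness requirement):

```python
def routing_accuracy(route_fn, labeled: list[tuple[str, str]]) -> float:
    """Fraction of questions routed to their expected specialist.

    route_fn: takes a state dict, returns a state dict with a "route" key.
    labeled: (question, expected_route) pairs.
    """
    hits = 0
    for question, expected in labeled:
        result = route_fn({"question": question})
        if result.get("route") == expected:
            hits += 1
    return hits / len(labeled) if labeled else 0.0
```

Pass `route_question` and ~20 labeled questions from your benchmark to get the routing accuracy exercise 2 asks for.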

Now wire the graph:

# orchestration/graph.py (continued)

from orchestration.router import route_question
from orchestration.specialists import code_specialist, docs_specialist, debug_specialist, general_specialist
from orchestration.synthesis import synthesize


def route_to_specialist(state: dict) -> str:
    """Edge function: route to the appropriate specialist node."""
    return state["route"]


def build_orchestration_graph():
    """Build and compile the orchestration graph."""
    graph = StateGraph(OrchestratorState)

    # Nodes
    graph.add_node("router", route_question)
    graph.add_node("code", code_specialist)
    graph.add_node("docs", docs_specialist)
    graph.add_node("debug", debug_specialist)
    graph.add_node("general", general_specialist)
    graph.add_node("synthesize", synthesize)

    # Edges
    graph.set_entry_point("router")
    graph.add_conditional_edges(
        "router",
        route_to_specialist,
        {"code": "code", "docs": "docs", "debug": "debug", "general": "general"},
    )

    # All specialists flow to synthesis
    for specialist in ["code", "docs", "debug", "general"]:
        graph.add_edge(specialist, "synthesize")

    graph.add_edge("synthesize", END)

    return graph.compile()

The specialist implementations go in the next lesson. For now, the key insight is the shape: a routing decision at the front, specialist processing in the middle, and synthesis at the end. This pattern is the same whether you have two specialists or ten.
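
If you want to run the graph end-to-end before the next lesson, placeholder specialists are enough. A hypothetical sketch of what `specialists.py` and `synthesis.py` could hold for now; each specialist just tags the question:

```python
# orchestration/specialists.py (placeholder versions)

def _placeholder(tag: str):
    """Build a stand-in specialist that tags the question instead of answering."""
    def specialist(state: dict) -> dict:
        question = state.get("question", "")
        return {"specialist_output": f"[{tag}] would answer: {question}"}
    return specialist

code_specialist = _placeholder("code")
docs_specialist = _placeholder("docs")
debug_specialist = _placeholder("debug")
general_specialist = _placeholder("general")


# orchestration/synthesis.py (placeholder version)

def synthesize(state: dict) -> dict:
    """Pass the specialist output through as the final answer, unchanged."""
    return {"final_answer": state.get("specialist_output", "")}
```

With these in place the graph compiles and runs, and every trace will show which route was chosen, which is all exercise 3 needs.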

Tracing the orchestrated system

Multi-agent systems generate more complex traces. Each specialist call is a sub-trace within the larger orchestration trace. We'll extend the tracing from Module 6 to capture this structure:

# observability/traced_orchestration.py
"""Traced orchestration pipeline.

Wraps the orchestration graph with Langfuse tracing so each
routing decision and specialist call is visible in the trace.
"""
from langfuse import Langfuse
from orchestration.graph import build_orchestration_graph

langfuse = Langfuse()

graph = build_orchestration_graph()


def traced_orchestrated_pipeline(
    question: str,
    run_id: str = "",
) -> dict:
    """Run the orchestration graph with full tracing."""
    trace = langfuse.trace(
        name="orchestrated-pipeline",
        metadata={"run_id": run_id},
        input={"question": question},
    )

    # Route
    route_span = trace.span(name="routing")
    state = {"question": question}
    # The graph handles routing internally, but we trace the decision
    result = graph.invoke(state)
    route_span.end(output={"route": result.get("route", "unknown")})

    # Log specialist call
    specialist_span = trace.span(
        name=f"specialist-{result.get('route', 'unknown')}",
        input={"question": question, "route": result.get("route")},
    )
    specialist_span.end(output={"specialist_output": result.get("specialist_output", "")[:200]})

    trace.update(
        output={"answer": result.get("final_answer", ""), "route": result.get("route", "")},
    )

    langfuse.flush()
    return result

When you inspect traces in Langfuse, you'll see the routing decision as a span, the specialist call as a child span, and the synthesis as the final span. This structure makes it straightforward to debug routing errors. You can see exactly which specialist was chosen and what it produced.

Measuring the split

This is the most important step, and the one most often skipped. Run your existing benchmark against the orchestrated system and compare:

# Run the benchmark through the orchestrated pipeline
python harness/run_harness.py --pipeline orchestrated

# Grade with the same graders
python harness/graders/answer_grader.py harness/runs/latest.jsonl
python harness/graders/tool_grader.py harness/runs/latest.jsonl
python harness/graders/trace_labeler.py harness/runs/latest-graded.jsonl

# Compare against the single-agent baseline
python harness/compare_runs.py \
    harness/runs/single-agent-baseline.jsonl \
    harness/runs/latest-graded-traced.jsonl

What to look for in the comparison:

  • Overall accuracy: Did it go up, stay flat, or drop? If it dropped, the split is hurting.
  • Per-route accuracy: Did code questions improve while docs questions stayed the same? That suggests the code specialist is working but the docs specialist needs tuning.
  • Waste rate: Did the correct_but_wasteful percentage drop? If routing works well, specialists should avoid the unnecessary work that inflated the single agent's traces.
  • Cost and latency: Orchestration adds overhead (the routing call, inter-agent communication). Is the quality improvement worth the cost?

If the orchestrated system doesn't beat the single-agent baseline on at least one eval dimension without degrading the others, reconsider the split. It's always valid to go back to a single agent.
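
That acceptance rule ("better on at least one dimension without degrading the others") can be encoded directly. A hypothetical helper, where both dicts map the same metric names to higher-is-better scores (invert waste rate before passing) and `tolerance` absorbs ordinary eval noise:

```python
def split_is_justified(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> bool:
    """True if the candidate beats the baseline on at least one metric
    without falling more than `tolerance` below it on any other."""
    improved = any(candidate[m] > baseline[m] + tolerance for m in baseline)
    degraded = any(candidate[m] < baseline[m] - tolerance for m in baseline)
    return improved and not degraded
```

The tolerance matters: with a small benchmark, a one-question swing can look like a regression, so pick a threshold informed by how much your scores vary between identical runs.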

Exercises

  1. Review your trace label distribution from the previous module. Identify which question types have the most wrong_route or correct_but_wasteful labels. Write down two or three categories that would benefit from specialist handling.

  2. Build the routing classifier using the pattern above. Test it on 20 questions from your benchmark and check whether it agrees with your expected routes. What's the routing accuracy?

  3. Wire the orchestration graph with placeholder specialists (each specialist can just return the question with a tag for now). Run it end-to-end and verify that traces show the routing decision and specialist selection correctly.

  4. Run 10 benchmark questions through both the single-agent pipeline and the orchestrated pipeline. Compare the trace structures side-by-side in Langfuse. Where does orchestration add visible overhead?

  5. Write a brief evaluation memo: based on the routing accuracy and trace comparison, does orchestration look like it will improve the system? What specialist would you build first, and why?

Completion checkpoint

You have:

  • A routing classifier that assigns incoming questions to specialist categories with >80% agreement against your manual labels
  • An orchestration graph with routing, specialist nodes (even if they're placeholders), and synthesis
  • Traced execution that shows routing decisions and specialist calls as separate spans
  • A side-by-side comparison of single-agent vs. orchestrated traces for at least 10 questions
  • A written evaluation of whether orchestration is justified for your system

Reflection prompts

  • Looking at your trace labels, which question types have the clearest case for specialist handling? Which types are ambiguous?
  • What's the minimum number of specialists that would improve your system? What's the risk of adding more than that?
  • If the orchestrated system scores the same as the single agent, what would you try before adding more specialists?

What's next

Specialists and Routers. The orchestration skeleton is in place; the next lesson fills in the specialists and router contracts that make it real.

References

Build with this

  • LangGraph documentation — the graph runtime we use for orchestration, with examples of multi-agent patterns
  • OpenAI Agents SDK — alternative orchestration framework with built-in handoffs and tracing

