Module 8: Optimization Distillation

Distillation: Compressing Stable Behaviors

The optimization ladder says distillation is Rung 4. You'll reach for it when prompt engineering, retrieval improvement, and context engineering have been tried and measured, and a bounded task is working well but costing too much or running too slowly. Distillation is how you compress that stable behavior into a smaller, cheaper model.

The key constraint is "stable." Distillation transfers behavior from a teacher to a student. If the teacher's behavior is still changing (you're still tuning prompts, adjusting retrieval, or refining the task definition), distillation will bake in whatever the teacher does today, including the parts you haven't finished fixing. This lesson covers the full workflow: identifying what to distill, collecting teacher outputs, training a student, and verifying that the student preserved what matters.

What you'll learn

  • What distillation actually is: supervised training on teacher-generated outputs, not magic compression
  • How to identify bounded tasks that are good distillation candidates
  • How to collect and filter training data from teacher model runs
  • How to train a student model using parameter-efficient fine-tuning (PEFT) with TRL
  • How to evaluate whether the student matches the teacher on your benchmark
  • The hardware requirements and where to run training

Concepts

Distillation — training a smaller model (the student) to reproduce the outputs of a larger model (the teacher) on a specific task. The student doesn't learn the teacher's general capabilities — it learns to mimic the teacher's behavior on the examples you provide. This is why task boundaries matter: a student trained on retrieval-routing examples won't suddenly become good at code generation.

Teacher model — the larger, more capable model whose behavior you want to compress. In our system, this is the workhorse LLM you've been using for the anchor project (GPT-4o, Claude Sonnet, or similar). The teacher's job during distillation is to produce high-quality outputs over a bounded task set. These outputs become the training data for the student.

Student model — the smaller, cheaper model you're training. Typically an open-weight model in the 1B-14B parameter range: Qwen 2.5, Llama 3, Phi-3, Gemma 2, or similar. The student will be faster and cheaper to run than the teacher, but only on the specific task you trained it for. On everything else, it'll perform like its base model.

Bounded task — a task with clear inputs, clear outputs, and limited scope. Good distillation candidates from our system:

  • Retrieval routing: given a question, classify which retrieval mode to use (code, docs, hybrid, skip)
  • Query rewriting: given a user question and conversation context, produce a standalone search query
  • Evidence summarization: given an evidence bundle and a question, produce a structured answer in a fixed schema
  • Bug summary formatting: given a code diff and error output, produce a structured bug report

Bad candidates: open-ended code generation, multi-step reasoning across many files, tasks where the output format is still changing.

SFT (Supervised Fine-Tuning) — training a model on input-output pairs where the inputs are task prompts and the outputs are teacher-generated completions. This is the simplest training paradigm and the right starting point for distillation. You're not teaching the model new knowledge — you're teaching it to produce outputs that look like the teacher's outputs for this task class.

PEFT (Parameter-Efficient Fine-Tuning) — training techniques that update only a small fraction of the model's parameters instead of all of them. This dramatically reduces the memory and compute required for training. The most common PEFT method is LoRA.

LoRA (Low-Rank Adaptation) — a PEFT technique that adds small trainable matrices to specific layers of the model while keeping the original weights frozen. Instead of updating billions of parameters, you update millions — typically 0.1-1% of the total. The trained LoRA weights are saved as a small adapter file that can be loaded on top of the base model.

QLoRA (Quantized LoRA) — LoRA applied to a model that's been quantized to 4-bit precision. This reduces VRAM requirements substantially: full fine-tuning of a 7B model needs tens of gigabytes of VRAM once gradients and optimizer states are counted (the fp16 weights alone occupy about 14GB), while QLoRA can train the same model in roughly 10GB. The tradeoff is slightly lower precision, but for most distillation tasks the quality loss is minimal. See the Hardware Guide for specific VRAM requirements.
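The arithmetic behind those numbers is worth seeing once. A back-of-envelope sketch (the bytes-per-parameter figures are the standard ones for fp16, 4-bit weights, and Adam's two fp32 moments; activation memory is ignored, so real usage is higher):

```python
# Rough VRAM floor for fine-tuning, in GB, given parameter count in billions.
# Ignores activations, KV cache, and framework overhead.

def full_ft_gb(params_b: float) -> float:
    """Weights (fp16) + gradients (fp16) + Adam states (2 x fp32)."""
    weights = params_b * 2   # 2 bytes per parameter
    grads = params_b * 2
    adam = params_b * 8      # two fp32 moments, 4 bytes each
    return weights + grads + adam

def qlora_gb(params_b: float, adapter_frac: float = 0.01) -> float:
    """4-bit frozen base weights + a small trainable adapter and its states."""
    base = params_b * 0.5    # 0.5 bytes per parameter at 4-bit
    adapter = params_b * adapter_frac * (2 + 2 + 8)
    return base + adapter

print(f"Full FT, 7B: ~{full_ft_gb(7):.0f} GB")
print(f"QLoRA, 7B:  ~{qlora_gb(7):.1f} GB (plus activations)")
```

Activations, KV cache, and framework overhead make up the difference between this floor and the roughly 10GB observed in practice for a 7B QLoRA run.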

Problem-to-Tool Map

| Problem class | Symptom | Cheapest thing to try first | Tool or approach |
| --- | --- | --- | --- |
| Stable task is too expensive | Per-request cost is high on a task that rarely fails | Model routing to a cheaper hosted model | Distillation to a local student model |
| Stable task is too slow | Latency is high on a task where speed matters | Prompt compression, shorter outputs | Distillation to a smaller, faster model |
| Teacher changes invalidate student | Retrained student drifts from current teacher behavior | Less frequent retraining | Only distill tasks that are genuinely stable |
| Student quality drops on edge cases | Student handles common cases but fails on rare ones | More diverse training examples | Larger student model or keep teacher for edge cases |
| Can't run training locally | No GPU or insufficient VRAM | Cloud GPU (see Hardware Guide) | Hosted fine-tuning API as alternative |

Walkthrough

Step 1: Choose a bounded task

Start with retrieval routing because it's the most bounded task in our system. The input is a user question plus optional conversation context. The output is one of four labels: code, docs, hybrid, or skip. The teacher model already does this well, and we have eval coverage from Module 6.

Why retrieval routing first? It has the tightest input-output contract. The output is a single classification label, not free-form text. This means evaluation is straightforward (exact match), and training data quality is easy to verify. Once you've done this once, the same workflow applies to more complex tasks like evidence summarization.

Step 2: Collect teacher outputs

Run the teacher model over your benchmark questions and any additional examples you have. The goal is to collect input-output pairs where the teacher's output is correct:

# optimization/collect_teacher_data.py
"""Collect teacher model outputs for distillation training data.

Runs the teacher model over a set of examples and filters for
correct outputs to use as training data.
"""

import json
from pathlib import Path


def load_examples(path: str) -> list[dict]:
    """Load examples from a JSONL file.

    Each line should have at minimum: question, and optionally
    conversation_context and expected_label for filtering.
    """
    examples = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                examples.append(json.loads(line))
    return examples


def format_routing_prompt(question: str, context: str = "") -> str:
    """Format a retrieval-routing classification prompt.

    This is the same prompt your pipeline uses for routing.
    """
    prompt = (
        "Classify this question into one of four retrieval modes.\n\n"
        "Modes:\n"
        "- code: question is about specific code, files, or symbols\n"
        "- docs: question is about documentation, README, or guides\n"
        "- hybrid: question spans both code and documentation\n"
        "- skip: question is general knowledge, no retrieval needed\n\n"
    )
    if context:
        prompt += f"Conversation context:\n{context}\n\n"
    prompt += f"Question: {question}\n\nMode:"
    return prompt


def collect_teacher_outputs(
    examples: list[dict],
    generate_fn,
) -> list[dict]:
    """Run the teacher model and collect outputs.

    Args:
        examples: List of dicts with 'question' and optional
            'conversation_context' and 'expected_label'.
        generate_fn: Function that takes a prompt string and
            returns the model's text output.

    Returns:
        List of training examples with teacher outputs.
    """
    results = []

    for ex in examples:
        prompt = format_routing_prompt(
            ex["question"],
            ex.get("conversation_context", ""),
        )
        teacher_output = generate_fn(prompt).strip().lower()

        result = {
            "prompt": prompt,
            "completion": teacher_output,
            "question": ex["question"],
        }

        # If we have expected labels, mark whether teacher was correct
        if "expected_label" in ex:
            result["expected"] = ex["expected_label"]
            result["teacher_correct"] = (
                teacher_output == ex["expected_label"]
            )

        results.append(result)

    return results


def filter_correct(results: list[dict]) -> list[dict]:
    """Keep examples where the teacher produced the correct output.

    Examples without an expected label have nothing to check against,
    so they are kept as-is.
    """
    return [r for r in results if r.get("teacher_correct", True)]


def save_training_data(results: list[dict], output_path: str) -> None:
    """Save filtered results as training data in JSONL format."""
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)

    with open(path, "w") as f:
        for r in results:
            training_example = {
                "prompt": r["prompt"],
                "completion": r["completion"],
            }
            f.write(json.dumps(training_example) + "\n")

    print(f"Saved {len(results)} training examples to {output_path}")

Run data collection with your teacher model:

# Example usage
from openai import OpenAI

client = OpenAI()


def teacher_generate(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=10,
        temperature=0,
    )
    return response.choices[0].message.content


examples = load_examples("benchmark/routing_examples.jsonl")
results = collect_teacher_outputs(examples, teacher_generate)
filtered = filter_correct(results)
save_training_data(filtered, "optimization/training_data/routing.jsonl")

How many examples? For a classification task like retrieval routing, 200-500 high-quality examples is a reasonable starting point. For more complex generation tasks like evidence summarization, you'll want 500-1,000+. Quality matters more than quantity — 300 correct examples beat 1,000 noisy ones.
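Since quality and coverage matter more than raw count, it's worth checking the label distribution of the collected data before training. A small sketch over the JSONL format produced above (the 10% threshold is an arbitrary illustration):

```python
import json
from collections import Counter

def label_distribution(lines) -> Counter:
    """Count completion labels across JSONL training lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line:
            counts[json.loads(line)["completion"]] += 1
    return counts

# In practice, pass the file directly:
#   counts = label_distribution(open("optimization/training_data/routing.jsonl"))
demo = [json.dumps({"completion": c}) for c in ["code", "code", "docs", "skip"]]
counts = label_distribution(demo)
total = sum(counts.values())
for label, n in counts.most_common():
    flag = "  <- underrepresented" if n / total < 0.10 else ""
    print(f"{label:>8}: {n}/{total}{flag}")
```

If one label dominates, the student will learn to over-predict it; collect more examples for the thin labels before training.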

Step 3: Train the student

We'll use PEFT with TRL (Transformer Reinforcement Learning library) and Unsloth for efficient training. Unsloth provides optimized training kernels that reduce memory usage and speed up training significantly.

First, install the training dependencies:

pip install peft trl unsloth transformers datasets

Hardware note

QLoRA training on a 3B model requires roughly 6-8GB of VRAM. A 7B model needs roughly 10-12GB. If you don't have a local GPU, see the Hardware Guide for cloud GPU options. You can also use a hosted fine-tuning API — the workflow concepts are the same, but you'll upload data instead of running training locally.

# optimization/train_student.py
"""Train a student model via QLoRA distillation.

Uses Unsloth + TRL for efficient parameter-efficient fine-tuning
on teacher-generated training data.
"""

from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset


def format_for_training(example: dict) -> dict:
    """Format a prompt-completion pair for SFT training."""
    text = (
        f"### Instruction:\n{example['prompt']}\n\n"
        f"### Response:\n{example['completion']}"
    )
    return {"text": text}


def train(
    model_name: str = "unsloth/Qwen2.5-3B-bnb-4bit",
    training_data_path: str = "optimization/training_data/routing.jsonl",
    output_dir: str = "optimization/models/routing-student",
    max_steps: int = 200,
    learning_rate: float = 2e-4,
    lora_rank: int = 16,
):
    """Run QLoRA training on teacher-generated data.

    Args:
        model_name: Hugging Face model ID. Unsloth provides
            pre-quantized 4-bit versions for efficient training.
        training_data_path: Path to JSONL training data.
        output_dir: Where to save the trained LoRA adapter.
        max_steps: Training steps. Start small and increase if
            validation loss is still decreasing.
        learning_rate: Learning rate. 2e-4 is a good default for
            QLoRA; lower if training is unstable.
        lora_rank: LoRA rank. Higher rank = more capacity but more
            memory. 16 is a good starting point.
    """
    # Load the base model with 4-bit quantization
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_name,
        max_seq_length=2048,
        load_in_4bit=True,
    )

    # Add LoRA adapters to the model
    model = FastLanguageModel.get_peft_model(
        model,
        r=lora_rank,
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
        lora_alpha=lora_rank,  # Common default: alpha == rank
        lora_dropout=0,
        use_gradient_checkpointing="unsloth",
    )

    # Load and format training data
    dataset = load_dataset("json", data_files=training_data_path, split="train")
    dataset = dataset.map(format_for_training)

    # Configure training
    training_config = SFTConfig(
        output_dir=output_dir,
        max_steps=max_steps,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=learning_rate,
        logging_steps=10,
        save_steps=50,
        warmup_steps=20,
        fp16=True,
        dataset_text_field="text",
        max_seq_length=2048,
    )

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        args=training_config,
    )

    # Train
    print(f"Training on {len(dataset)} examples for {max_steps} steps...")
    trainer.train()

    # Save the LoRA adapter (not the full model)
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
    print(f"Saved LoRA adapter to {output_dir}")


if __name__ == "__main__":
    train()

Run training:

python optimization/train_student.py

Training on 300 examples for 200 steps takes roughly 10-20 minutes on a consumer GPU (RTX 3060 or better). You'll see training loss logged every 10 steps — it should decrease and stabilize.
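"Decrease and stabilize" can also be verified after the fact: the Trainer writes a trainer_state.json into each checkpoint containing the logged history. A small reader (the checkpoint path is whatever your run's save_steps produced):

```python
import json
from pathlib import Path

def loss_curve(checkpoint_dir: str) -> list[tuple[int, float]]:
    """Read (step, loss) pairs from a Trainer checkpoint's state file."""
    state = json.loads(
        (Path(checkpoint_dir) / "trainer_state.json").read_text()
    )
    return [
        (entry["step"], entry["loss"])
        for entry in state["log_history"]
        if "loss" in entry  # skip eval-only and summary entries
    ]

# Example usage (path depends on your run):
#   curve = loss_curve("optimization/models/routing-student/checkpoint-200")
#   print(f"loss {curve[0][1]:.3f} -> {curve[-1][1]:.3f}")
```

If the final loss is still dropping steeply, increase max_steps; if it flatlined long ago, you can stop earlier next time.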

Step 4: Evaluate the student

The student needs to match the teacher on the eval suite, not just on the training data. Load the trained adapter and run it against your benchmark:

# optimization/eval_student.py
"""Evaluate student model against teacher on the benchmark.

Compares student accuracy, latency, and cost against the teacher
to determine whether distillation was worthwhile.
"""

import json
import time
from pathlib import Path

from unsloth import FastLanguageModel


def load_student(
    model_name: str = "unsloth/Qwen2.5-3B-bnb-4bit",
    adapter_path: str = "optimization/models/routing-student",
):
    """Load the base model with the trained LoRA adapter."""
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_name,
        max_seq_length=2048,
        load_in_4bit=True,
    )
    # Load LoRA weights on top of base model
    model.load_adapter(adapter_path)
    FastLanguageModel.for_inference(model)
    return model, tokenizer


def evaluate(
    model,
    tokenizer,
    eval_path: str = "benchmark/routing_examples.jsonl",
) -> dict:
    """Run student model on eval set and compute metrics."""
    examples = []
    with open(eval_path) as f:
        for line in f:
            line = line.strip()
            if line:
                examples.append(json.loads(line))

    correct = 0
    total = 0
    latencies = []
    errors = []

    for ex in examples:
        if "expected_label" not in ex:
            continue

        prompt = (
            f"### Instruction:\n"
            f"{format_routing_prompt(ex['question'], ex.get('conversation_context', ''))}\n\n"
            f"### Response:\n"
        )

        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

        start = time.perf_counter()
        outputs = model.generate(
            **inputs,
            max_new_tokens=10,
            do_sample=False,  # greedy decoding; temperature only applies when sampling
        )
        elapsed = time.perf_counter() - start

        response = tokenizer.decode(
            outputs[0][inputs["input_ids"].shape[1]:],
            skip_special_tokens=True,
        ).strip().lower()

        total += 1
        latencies.append(elapsed)

        if response == ex["expected_label"]:
            correct += 1
        else:
            errors.append({
                "question": ex["question"],
                "expected": ex["expected_label"],
                "got": response,
            })

    return {
        "accuracy": correct / total if total > 0 else 0,
        "total": total,
        "correct": correct,
        "avg_latency_ms": sum(latencies) / len(latencies) * 1000 if latencies else 0,
        "errors": errors[:10],  # First 10 errors for inspection
    }


def format_routing_prompt(question: str, context: str = "") -> str:
    """Same formatting as teacher data collection."""
    prompt = (
        "Classify this question into one of four retrieval modes.\n\n"
        "Modes:\n"
        "- code: question is about specific code, files, or symbols\n"
        "- docs: question is about documentation, README, or guides\n"
        "- hybrid: question spans both code and documentation\n"
        "- skip: question is general knowledge, no retrieval needed\n\n"
    )
    if context:
        prompt += f"Conversation context:\n{context}\n\n"
    prompt += f"Question: {question}\n\nMode:"
    return prompt


if __name__ == "__main__":
    print("Loading student model...")
    model, tokenizer = load_student()

    print("Evaluating...")
    results = evaluate(model, tokenizer)

    print(f"\nStudent Evaluation Results:")
    print(f"  Accuracy: {results['accuracy']:.1%}")
    print(f"  Correct: {results['correct']}/{results['total']}")
    print(f"  Avg latency: {results['avg_latency_ms']:.0f}ms")

    if results["errors"]:
        print(f"\nSample errors:")
        for err in results["errors"][:5]:
            print(f"  Q: {err['question'][:60]}...")
            print(f"    Expected: {err['expected']}, Got: {err['got']}")

Run the evaluation:

python optimization/eval_student.py

Expected output:

Loading student model...
Evaluating...

Student Evaluation Results:
  Accuracy: 91.3%
  Correct: 42/46
  Avg latency: 45ms

Sample errors:
  Q: What does the authentication middleware do and where is i...
    Expected: hybrid, Got: code
  Q: How do I set up the project for the first time?...
    Expected: docs, Got: hybrid

Step 5: Compare student vs teacher

The eval results only matter in comparison. Build a side-by-side report:

# optimization/compare_models.py
"""Compare student and teacher performance side by side."""


def compare(teacher_results: dict, student_results: dict) -> None:
    """Print a comparison table."""
    print(f"\n{'Metric':<25} {'Teacher':>12} {'Student':>12} {'Delta':>12}")
    print("-" * 65)

    t_acc = teacher_results["accuracy"]
    s_acc = student_results["accuracy"]
    print(f"{'Accuracy':<25} {t_acc:>11.1%} {s_acc:>11.1%} {s_acc - t_acc:>+11.1%}")

    t_lat = teacher_results["avg_latency_ms"]
    s_lat = student_results["avg_latency_ms"]
    print(f"{'Avg latency (ms)':<25} {t_lat:>11.0f} {s_lat:>11.0f} {s_lat - t_lat:>+11.0f}")

    # Cost comparison depends on your setup. For a hosted teacher vs a
    # local student, the student's per-request cost approaches zero at volume.
    print("\nVerdict")
    print("-" * 40)

    quality_drop = t_acc - s_acc
    speed_gain = t_lat / s_lat if s_lat > 0 else float("inf")

    if quality_drop <= 0.05 and speed_gain >= 2:
        print("  Student is viable: <=5% quality drop with significant speed gain.")
    elif quality_drop <= 0.05:
        print("  Quality holds but speed gain is small. Check whether cost alone justifies the switch.")
    elif quality_drop <= 0.10:
        print("  Student is marginal: 5-10% quality drop. Consider more training data.")
    else:
        print("  Student is not ready: >10% quality drop. Needs more data or a larger student.")

What to look for

When comparing student to teacher, track these three things:

  1. Accuracy drop. Under 5% is a clear win. Between 5-10%, consider whether the cost savings justify the quality loss. Over 10%, the student needs more data, a larger base model, or the task isn't as bounded as you thought.

  2. Error inheritance vs. compression errors. Some student errors will be the same ones the teacher makes (inherited errors). Others are new errors the student introduces through compression. Inherited errors tell you the teacher needs fixing. Compression errors tell you the student needs more training data on those edge cases.

  3. Latency and cost. A local 3B student running on a consumer GPU will be 10-50x cheaper per request than a hosted teacher model. The question is whether the quality-cost tradeoff is worth it for this specific task.
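The cost half of point 3 is simple arithmetic. A sketch with placeholder prices (the per-million-token rates and GPU rental rate are illustrative assumptions, not current list prices):

```python
def hosted_cost(prompt_tokens: int, output_tokens: int,
                in_price_per_m: float, out_price_per_m: float) -> float:
    """Dollar cost of one hosted-API request at per-million-token rates."""
    return (prompt_tokens * in_price_per_m
            + output_tokens * out_price_per_m) / 1_000_000

def local_cost(latency_s: float, gpu_dollars_per_hour: float) -> float:
    """Amortized GPU cost of one local request."""
    return latency_s / 3600 * gpu_dollars_per_hour

# Illustrative numbers: 400-token routing prompt, 5-token label output,
# 45ms student latency, assumed GPU amortization rate.
teacher = hosted_cost(400, 5, in_price_per_m=2.50, out_price_per_m=10.00)
student = local_cost(0.045, gpu_dollars_per_hour=2.00)
print(f"teacher: ${teacher:.6f}/req  student: ${student:.6f}/req  "
      f"ratio: {teacher / student:.0f}x")
```

The exact multiple depends entirely on your token counts, provider pricing, and how you amortize the GPU; the point is to compute it for your task rather than assume.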

Hosted alternatives

Not everyone has a local GPU, and not every team wants to manage model artifacts. Hosted fine-tuning APIs provide the same conceptual workflow with less infrastructure:

| Provider | Service | What you upload | What you get back |
| --- | --- | --- | --- |
| OpenAI | Fine-tuning API | JSONL training data | Fine-tuned model ID you call via the same API |
| Together AI | Fine-tuning API | JSONL training data | Fine-tuned model endpoint |
| Google Cloud | Vertex AI fine-tuning | JSONL training data | Fine-tuned model on Vertex AI |
| Hugging Face | Jobs + TRL | Dataset + base model on the Hub | Trained model on the Hub |

What about Anthropic? Anthropic does not currently offer fine-tuning through their direct API. Fine-tuning of Claude models is available through AWS Bedrock as a managed service, not through Anthropic's platform directly. This may change, so check Anthropic's support documentation for current availability.

The data collection and evaluation steps are identical. You're still collecting teacher outputs, filtering for quality, and evaluating the student. The training step is a hosted API call instead of local GPU work.
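One practical wrinkle: OpenAI's chat-model fine-tuning expects messages-format JSONL rather than the prompt/completion records collected earlier, so a conversion step is needed first. A sketch (the upload and job-creation calls are shown as comments; check your provider's docs for exact parameters):

```python
import json

def to_chat_records(lines):
    """Convert prompt/completion JSONL lines to chat-format records,
    the shape OpenAI's fine-tuning API expects."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        r = json.loads(line)
        records.append({
            "messages": [
                {"role": "user", "content": r["prompt"]},
                {"role": "assistant", "content": r["completion"]},
            ]
        })
    return records

# File-to-file usage:
#   with open("optimization/training_data/routing.jsonl") as f_in, \
#        open("optimization/training_data/routing_chat.jsonl", "w") as f_out:
#       for rec in to_chat_records(f_in):
#           f_out.write(json.dumps(rec) + "\n")
#
# Then upload and launch (sketch, not exact parameters):
#   upload = client.files.create(file=open(chat_path, "rb"), purpose="fine-tune")
#   job = client.fine_tuning.jobs.create(training_file=upload.id, model=...)

demo = [json.dumps({"prompt": "Question: ...\n\nMode:", "completion": "docs"})]
print(to_chat_records(demo)[0]["messages"][1])
```

Other providers accept slightly different schemas, but the conversion is always a mechanical mapping from the same collected records.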

Tradeoff

Hosted fine-tuning is simpler but less transparent. You can't inspect training dynamics, adjust hyperparameters mid-run, or iterate as quickly. For learning, local training teaches you more. For production, hosted training reduces operational burden.

Exercises

  1. Distill retrieval routing. Follow the full walkthrough: collect teacher outputs, filter, train, evaluate. What accuracy does your student achieve?

  2. Distill a harder task. Try evidence summarization: given an evidence bundle and a question, produce a structured JSON answer. How does the data collection change? How does evaluation change?

  3. Error analysis. For your student's errors, classify each as inherited (teacher also got it wrong) or compression-introduced (teacher got it right, student got it wrong). What does the ratio tell you?

  4. Vary the student size. Train both a 1.5B and a 7B student on the same data. How does the accuracy-latency tradeoff change? Where is the sweet spot for your task?

Completion checkpoint

You're done with this lesson when you can:

  • Identify bounded tasks in your system that are candidates for distillation
  • Collect and filter teacher outputs into clean training data
  • Train a student model using QLoRA with PEFT and TRL
  • Evaluate the student against the teacher on your benchmark
  • Articulate when distillation is worth the effort and when it's premature

What's next

Fine-Tuning. Distillation makes a good behavior cheaper; the next lesson is for behavior that still fails after the cheaper interventions have been exhausted.


Glossary
API (Application Programming Interface)Foundational terms
A structured way for programs to communicate. In this context, usually an HTTP endpoint you call to interact with an LLM.
AST (Abstract Syntax Tree)Foundational terms
A tree representation of source code structure. Used by parsers like Tree-sitter to understand code as a hierarchy of functions, classes, and statements. You'll encounter this more deeply in the Code Retrieval module, but the concept appears briefly in retrieval fundamentals.
BM25 (Best Match 25)Foundational terms
A classical ranking function for keyword search. Scores documents by term frequency and inverse document frequency. Often competitive with or complementary to vector search.
ChunkingFoundational terms
Splitting a document into smaller pieces for indexing and retrieval. Chunk boundaries significantly affect retrieval quality. Split at the wrong place and your retrieval will return half a function or the end of one paragraph glued to the start of another.
Context engineeringFoundational terms
The discipline of selecting, packaging, and budgeting the information a model sees at inference time. Prompts, retrieved evidence, tool results, memory, and state are all parts of context. Context engineering is arguably the core skill of AI engineering. Bigger context windows are not a substitute for better context selection.
Context rotFoundational terms
Degradation of output quality caused by stale, noisy, or accumulated context. Symptoms include stale memory facts, conflicting retrieved evidence, bloated prompt history, and accumulated instructions that contradict each other. A form of technical debt in AI systems.
Context windowFoundational terms
The maximum number of tokens an LLM can process in a single request (input + output combined).
EmbeddingFoundational terms
A fixed-length numeric vector representing a piece of text. Used for similarity search: texts with similar meanings have nearby embeddings.
EndpointFoundational terms
A specific URL path that accepts requests and returns responses (e.g., POST /v1/chat/completions).
GGUFFoundational terms
A file format for quantized models used by llama.cpp and Ollama. When you see a model name like qwen2.5:7b-q4_K_M, the suffix indicates the quantization scheme. GGUF supports mixed quantization (different precision for different layers) and is the most common format for local inference.
HallucinationFoundational terms
When a model generates content that sounds confident but isn't supported by the evidence it was given, or fabricates details that don't exist. Not the same as "any wrong answer"; a model that misinterprets ambiguous instructions gave a bad answer but didn't hallucinate. Common causes: weak prompt, missing context, context rot, model limitation, or retrieval failure.
InferenceFoundational terms
Running a trained model to generate output from input. What happens when you call an API. Most AI engineering work is inference-time work: building systems around models, not training them. Use "inference," not "inferencing."
JSON (JavaScript Object Notation)Foundational terms
A lightweight text format for structured data. The lingua franca of API communication.
Lexical searchFoundational terms
Finding items by matching keywords or terms. Includes BM25, TF-IDF (Term Frequency–Inverse Document Frequency), and simple keyword matching. Returns exact term matches, not semantic similarity.
LLM (Large Language Model)Foundational terms
A neural network trained on large text corpora that generates text by predicting the next token. The core technology behind AI engineering; every tool, pattern, and pipeline in this curriculum runs on top of one.
MetadataFoundational terms
Structured information about a document or chunk (file path, language, author, date, symbol type). Used for filtering retrieval results.
Neural networkFoundational terms
A computing system loosely inspired by biological neurons, built from layers of mathematical functions that transform inputs into outputs. LLMs are a specific type of neural network (transformers) trained on text. You don't need to understand neural network internals to do AI engineering, but knowing the term helps when reading external resources.
Reasoning modelFoundational terms
A model optimized for complex multi-step planning, math, and logic (e.g., o3, o4-mini). Slower and more expensive but better on hard problems. Sometimes called "LRM" (large reasoning model), but "reasoning model" is the more consistent term across provider docs.
RerankingFoundational terms
A second-pass scoring step that re-orders retrieved results using a more expensive model. Improves precision after an initial broad retrieval.
SchemaFoundational terms
A formal description of the shape and types of a data structure. Used to validate inputs and outputs.
SLM (small language model)Foundational terms
A compact model (typically 1-7B parameters) that runs on consumer hardware with lower cost, latency, and better privacy (e.g., Phi, small Llama variants, Gemma). The right choice when privacy, offline operation, predictable cost, or low latency matter more than peak capability.
System promptFoundational terms
A special message that sets the model's behavior, role, and constraints for a conversation.
TemperatureFoundational terms
A parameter controlling output randomness. Lower values produce more deterministic output; higher values produce more varied output. Does not affect the model's intelligence.
TokenFoundational terms
The basic unit an LLM processes. Not a word. Tokens are sub-word fragments. "unhappiness" might be three tokens: "un", "happi", "ness". Token count determines cost and context window usage.
Top-kFoundational terms
The number of results returned from a retrieval query. "Top-5" means the five highest-scoring results.
Top-p (nucleus sampling)Foundational terms
An alternative to temperature for controlling output diversity. Selects from the smallest set of tokens whose cumulative probability exceeds p.
Vector searchFoundational terms
Finding items by proximity in embedding space (nearest neighbors). Returns "similar" results, not "exact match" results.
vLLM (virtual LLM)Foundational terms
An inference serving engine (not a model) that hosts open-weight models behind an OpenAI-compatible HTTP endpoint. Infrastructure layer, not model layer. Relevant when moving from hosted APIs to self-hosting.
WeightsFoundational terms
The learned parameters inside a model. Changed during training, fixed during inference.
Workhorse modelFoundational terms
A general-purpose LLM optimized for speed and broad capability (e.g., GPT-4o-mini, Claude Haiku, Gemini Flash). The default for most tasks. When someone says "LLM" without qualification, they usually mean this.
BaselineBenchmark and Harness terms
The first measured performance of your system on a benchmark. Everything else is compared against this. Without a baseline, you can't tell whether a change helped.
BenchmarkBenchmark and Harness terms
A fixed set of questions or tasks with known-good answers, used to measure system performance over time.
Run logBenchmark and Harness terms
A structured record (typically JSONL) of every system run: what input was given, what output was produced, what tools were called, how long it took, and what it cost. The raw data that evals, telemetry, and cost analysis are built from.
A2A (Agent-to-Agent protocol)Agent and Tool Building terms
An open protocol for peer-to-peer agent collaboration. Agents discover each other's capabilities and delegate or negotiate tasks as equals. Different from MCP (which connects agents to tools, not to other agents) and from handoffs (which transfer control within one system).
AgentAgent and Tool Building terms
A system where an LLM decides which tools to call, observes results, and iterates until a task is complete. Agent = model + tools + control loop.
Control loopAgent and Tool Building terms
The code that manages the agent's cycle: send prompt, check for tool calls, execute tools, append results, repeat or finish.
**Handoff** *(Agent and Tool Building terms)*
Passing control from one agent or specialist to another within an orchestrated system.
**MCP (Model Context Protocol)** *(Agent and Tool Building terms)*
An open protocol for exposing tools, resources, and prompts to AI applications in a standardized way. Connects agents to capabilities (tools and data), not to other agents.
**Tool calling / function calling** *(Agent and Tool Building terms)*
The model's ability to request execution of a specific function with structured arguments, rather than just generating text.
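The "structured arguments" part is the key difference from plain text generation. A minimal sketch of receiving and dispatching such a request; `get_weather` and the payload shape are hypothetical (real providers each have their own wire format):

```python
import json

# Hypothetical structured request emitted by the model instead of prose.
model_output = json.dumps({"name": "get_weather", "arguments": {"city": "Oslo"}})

def get_weather(city):
    return f"Forecast for {city}: cloudy"   # stand-in for a real lookup

registry = {"get_weather": get_weather}     # tools the harness is willing to run

call = json.loads(model_output)
result = registry[call["name"]](**call["arguments"])
print(result)
```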
**Context compilation / context packing** *(Code Retrieval terms)*
The process of selecting and assembling the smallest useful set of evidence for a specific task. Not "dump everything retrieved into the prompt."
**Grounding** *(Code Retrieval terms)*
Tying model assertions to specific evidence. A grounded answer cites what it found; an ungrounded answer asserts without evidence.
**Hybrid retrieval** *(Code Retrieval terms)*
Combining multiple retrieval methods (e.g., vector search + keyword search + metadata filters) and merging or reranking the results.
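One common way to merge ranked lists is reciprocal rank fusion, sketched below with made-up document IDs. This is one merging strategy among several, not the definitive one:

```python
def rrf(rankings, k=60):
    # Reciprocal Rank Fusion: an item ranked near the top by any
    # method accumulates a high score; k damps the influence of rank.
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits  = ["doc_a", "doc_b", "doc_c"]   # semantic search results
keyword_hits = ["doc_b", "doc_d", "doc_a"]   # keyword search results
print(rrf([vector_hits, keyword_hits]))
```

`doc_b` wins because both methods rank it highly, even though neither ranked it first.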
**Knowledge graph** *(Code Retrieval terms)*
A data structure that stores entities and their relationships explicitly (e.g., "function A calls function B," "module X imports module Y"). Useful for traversal and dependency reasoning. One retrieval strategy among several, often overused when simpler metadata or adjacency tables would suffice.
**RAG (Retrieval-Augmented Generation)** *(Code Retrieval terms)*
A pattern where the model's response is grounded in retrieved external evidence rather than relying solely on its training data.
**Symbol table** *(Code Retrieval terms)*
A mapping of code identifiers (functions, classes, variables) to their locations and metadata.
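For Python source, the standard library's `ast` module is enough to build a tiny one; here the metadata is pared down to just line numbers:

```python
import ast

source = '''
class Cache:
    def get(self, key): ...

def warm_cache(): ...
'''

# Walk the syntax tree and record identifier -> line number.
symbols = {}
for node in ast.walk(ast.parse(source)):
    if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
        symbols[node.name] = node.lineno

print(symbols)
```

A real symbol table would also carry file paths, signatures, and docstrings; the mapping idea is the same.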
**Tree-sitter** *(Code Retrieval terms)*
An incremental parsing library that builds ASTs for source code. Used in this curriculum for code-aware chunking and symbol extraction.
**Context pack** *(RAG and Grounded Answers terms)*
A structured bundle of evidence assembled for a specific task, with metadata about provenance, relevance, and token budget.
**Evidence bundle** *(RAG and Grounded Answers terms)*
A collection of retrieved items grouped for a specific sub-task, with enough metadata to evaluate whether the evidence is relevant and sufficient.
**Retrieval routing** *(RAG and Grounded Answers terms)*
Deciding which retrieval strategy or method to use for a given query. Different questions need different retrieval methods.
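A router can start as a handful of heuristics before graduating to a classifier or an LLM call. The rules and method names below are toy assumptions, not a recommended policy:

```python
def route(query):
    # Toy heuristics; real routers are often a small classifier or an LLM call.
    if query.strip().startswith(("def ", "class ")) or "(" in query:
        return "symbol_lookup"      # looks like a code identifier
    if any(word in query.lower() for word in ("why", "how", "explain")):
        return "vector_search"      # conceptual question: semantic similarity
    return "keyword_search"         # default: literal term matching

print(route("parse_config()"))
print(route("why does retry backoff double?"))
```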
**Eval** *(Observability and Evals terms)*
A structured test that measures system quality. Not the same as training. Evals measure, they don't change the model.
**Harness (AI harness / eval harness)** *(Observability and Evals terms)*
The experiment and evaluation framework around your model or agent. It runs benchmark tasks, captures outputs, logs traces, grades results, and compares system versions. It turns ad hoc "try it and see" into repeatable, comparable experiments. Typically includes: input dataset, prompt and tool configuration, model/provider selection, execution loop, logging, grading, and artifact capture.
**LLM-as-judge** *(Observability and Evals terms)*
Using a language model to evaluate or grade the output of another model or system. Useful for scaling evaluation beyond manual review, but requires rubric quality, judge consistency checks, and human spot-checking. Not a replacement for exact-match checks where they apply.
**OpenTelemetry (OTel)** *(Observability and Evals terms)*
An open standard for collecting and exporting telemetry data (traces, metrics, logs). Vendor-agnostic.
**RAGAS** *(Observability and Evals terms)*
A specific eval framework for retrieval-augmented generation. Measures metrics like faithfulness, relevance, and context precision. One tool example, not a foundational concept. Learn the metrics first, then the tool.
**Span** *(Observability and Evals terms)*
A single operation within a trace (e.g., one tool call, one retrieval query). Traces are made of spans.
**Telemetry** *(Observability and Evals terms)*
Structured data about system behavior: what happened, when, how long it took, what it cost. Includes traces, metrics, and events.
**Trace** *(Observability and Evals terms)*
A structured record of one complete run through the system, including all steps, tool calls, and decisions.
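The trace/span relationship can be shown in a few lines: each timed operation appends a span, and the trace is just the ordered list. This is a bare sketch, not the OpenTelemetry API:

```python
import time
from contextlib import contextmanager

trace = []  # one trace = an ordered list of spans

@contextmanager
def span(name):
    start = time.perf_counter()
    try:
        yield
    finally:
        # Record the operation's name and duration as one span.
        trace.append({"name": name, "duration_s": time.perf_counter() - start})

with span("retrieval"):
    time.sleep(0.01)   # stand-in for a retrieval query
with span("generation"):
    time.sleep(0.01)   # stand-in for a model call

print([s["name"] for s in trace])
```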
**Long-term memory** *(Orchestration and Memory terms)*
Persistent facts that survive across conversations. Requires write policies to manage what gets stored, updated, or deleted.
**Orchestration** *(Orchestration and Memory terms)*
Explicit control over how tasks are routed, delegated, and synthesized across multiple agents or specialists.
**Router** *(Orchestration and Memory terms)*
A component that decides which specialist or workflow path to use for a given query.
**Specialist** *(Orchestration and Memory terms)*
An agent or workflow tuned for a narrow task (e.g., "code search," "documentation lookup," "test generation"). Specialists are composed by an orchestrator.
**Thread memory** *(Orchestration and Memory terms)*
Conversation state that persists within a single session or thread.
**Workflow memory** *(Orchestration and Memory terms)*
Intermediate state that persists within a multi-step task but doesn't survive beyond the workflow's completion.
**Catastrophic forgetting** *(Optimization terms)*
When fine-tuning causes a model to lose capabilities it had before training. The model gets better at the fine-tuned task but worse at tasks it previously handled. PEFT methods like LoRA reduce this risk by freezing original weights.
**Distillation** *(Optimization terms)*
Training a smaller (student) model to reproduce the behavior of a larger (teacher) model on a specific task.
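In practice this means turning filtered teacher runs into supervised pairs. A sketch of that conversion step, with a made-up routing task and an illustrative record schema (the `messages` shape follows the common chat-format convention):

```python
import json

# Hypothetical teacher runs; passed_filter marks runs that survived QA filtering.
teacher_runs = [
    {"prompt": "Route: 'where is auth configured?'", "output": "keyword_search",
     "passed_filter": True},
    {"prompt": "Route: 'why is login slow?'", "output": "vector_search",
     "passed_filter": True},
    {"prompt": "Route: 'asdf'", "output": "vector_search",
     "passed_filter": False},   # low-quality run: excluded
]

# Each surviving run becomes one supervised training pair for the student.
pairs = [
    {"messages": [{"role": "user", "content": r["prompt"]},
                  {"role": "assistant", "content": r["output"]}]}
    for r in teacher_runs if r["passed_filter"]
]
jsonl = "\n".join(json.dumps(p) for p in pairs)
print(len(pairs))
```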
**DPO (Direct Preference Optimization)** *(Optimization terms)*
A method for preference-based model optimization that's simpler than RLHF, training the model directly on preference pairs without a separate reward model.
**Fine-tuning** *(Optimization terms)*
Updating a model's weights on task-specific data to change its behavior permanently. An umbrella term that includes SFT, instruction tuning, RLHF, DPO, and other techniques. See the fine-tuning landscape table in Lesson 8.3 for how these relate.
**Full fine-tuning** *(Optimization terms)*
Updating all of a model's parameters during training, as opposed to PEFT methods that update only a small subset. Requires significantly more GPU memory and compute. Produces the most thorough adaptation but carries higher risk of catastrophic forgetting.
**Inference server** *(Optimization terms)*
Software (like vLLM or Ollama) that hosts a model and serves inference requests.
**Instruction tuning** *(Optimization terms)*
A specific application of SFT where the training data consists of instruction-response pairs. This is how base models become chat models: the technique is SFT, the data format is instructions. Not a separate technique from SFT.
**LoRA (Low-Rank Adaptation)** *(Optimization terms)*
A parameter-efficient fine-tuning method that trains small adapter matrices instead of updating all model weights. Dramatically reduces GPU memory and compute requirements.
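The savings come from the adapter's shape: for one d × d weight matrix, LoRA trains two rank-r factors instead of the full matrix. A back-of-envelope count with illustrative sizes:

```python
# Trainable parameters for one d x d weight matrix, full update vs LoRA.
d, r = 4096, 8                 # hidden size and LoRA rank (illustrative values)
full_params = d * d            # updating W directly
lora_params = 2 * d * r        # two low-rank factors: A (d x r) and B (r x d)
print(full_params, lora_params, full_params // lora_params)
```

With these numbers the adapter is 256× smaller than the matrix it adapts, which is why LoRA fits on hardware that full fine-tuning does not.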
**Overfitting** *(Optimization terms)*
When a model memorizes training examples instead of learning generalizable patterns. The model performs well on training data but poorly on new inputs. Detected by monitoring validation loss alongside training loss.
**Parameter count** *(Optimization terms)*
The number of learned weights in a model, commonly expressed in billions (e.g., "7B" = 7 billion parameters). Determines memory requirements (roughly 2 bytes per parameter at FP16) and broadly correlates with capability, though training quality and architecture matter as much as size. See Model Selection and Serving for sizing guidance.
**PEFT (Parameter-Efficient Fine-Tuning)** *(Optimization terms)*
A family of methods (including LoRA) that fine-tune a small subset of parameters instead of the full model.
**Preference optimization** *(Optimization terms)*
Training methods (RLHF, DPO) that use human or automated preference signals to improve model behavior. "This output is better than that output" rather than "this is the correct output."
**QLoRA (Quantized LoRA)** *(Optimization terms)*
LoRA applied to a quantized (compressed) base model. Further reduces memory requirements, enabling fine-tuning on consumer hardware.
**Quantization** *(Optimization terms)*
Reducing the precision of model weights (e.g., FP16 → INT4) to shrink memory usage and increase inference speed at some quality cost. A 7B model at FP16 needs ~14 GB VRAM; quantized to 4-bit, it fits in ~4 GB. Common formats include GGUF (llama.cpp/Ollama) and GPTQ and AWQ (vLLM/Hugging Face). See Model Selection and Serving for format details and tradeoffs.
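The VRAM arithmetic behind the quantization figures is just bytes per weight times parameter count. This sketch counts weights only, which is why its 4-bit result (3.5 GB) sits below the glossary's ~4 GB: activations, KV cache, and format overhead are ignored here:

```python
def vram_gb(params_billion, bits_per_weight):
    # Weight memory only; ignores activations, KV cache, and format overhead.
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9       # decimal GB

print(vram_gb(7, 16))  # FP16: 14.0 GB of weights
print(vram_gb(7, 4))   # 4-bit: 3.5 GB of weights
```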
**RLHF (Reinforcement Learning from Human Feedback)** *(Optimization terms)*
A training method that uses human preference signals to improve model behavior through a reward model. More complex than DPO (requires training a separate reward model) but offers more control over the optimization objective.
**SFT (Supervised Fine-Tuning)** *(Optimization terms)*
Fine-tuning using input-output pairs where the desired output is known. The most common fine-tuning approach.
**TRL (Transformer Reinforcement Learning)** *(Optimization terms)*
A Hugging Face library for training language models with reinforcement learning, SFT, and other optimization methods.
**Consumer chat app** *(Cross-cutting terms)*
The browser or desktop product meant for human conversation (ChatGPT, Claude, HuggingChat). Useful for experimentation, but not the same as API access.
**Developer platform** *(Cross-cutting terms)*
The provider's API, billing, API-key, and developer-docs surface. This is what you need for this learning path.
**Hosted API** *(Cross-cutting terms)*
The provider runs the model for you and you call it over HTTP.
**Local inference** *(Cross-cutting terms)*
You run the model on your own machine.
**Provider** *(Cross-cutting terms)*
The company or service that hosts a model API you call from code.
**Prompt caching** *(Cross-cutting terms)*
Reusing computation from repeated prompt prefixes to reduce latency and cost on subsequent requests with the same prefix.
**Rate limiting** *(Cross-cutting terms)*
Constraints on how many API requests you can make per unit of time. An operational concern that affects system design and cost.
**Token budget** *(Cross-cutting terms)*
The maximum number of tokens you allocate for a specific part of the context (e.g., "retrieval evidence gets at most 4K tokens"). A context engineering tool for preventing any single component from dominating the context window.
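Enforcing a budget can be as simple as greedy packing: add whole chunks in ranked order until the next one would exceed the allocation. The whitespace token counter below is a deliberate simplification; real systems use the model's tokenizer:

```python
def pack_evidence(chunks, budget_tokens,
                  count_tokens=lambda text: len(text.split())):
    # Greedy packing: whitespace counting stands in for a real tokenizer.
    packed, used = [], 0
    for chunk in chunks:             # chunks assumed pre-sorted by relevance
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            break                    # next chunk would blow the budget
        packed.append(chunk)
        used += cost
    return packed

chunks = ["def foo(): pass", "class Bar: ...", "x = 1"]
print(pack_evidence(chunks, budget_tokens=7))
```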