Module 8: Optimization Distillation

Distillation: Compressing Stable Behaviors

The optimization ladder says distillation is Rung 4. You'll reach for it when prompt engineering, retrieval improvement, and context engineering have been tried and measured, and a bounded task is working well but costing too much or running too slowly. Distillation is how you compress that stable behavior into a smaller, cheaper model.

The key constraint is "stable." Distillation transfers behavior from a teacher to a student. If the teacher's behavior is still changing (you're still tuning prompts, adjusting retrieval, or refining the task definition), distillation will bake in whatever the teacher does today, including the parts you haven't finished fixing. This lesson covers the full workflow: identifying what to distill, collecting teacher outputs, training a student, and verifying that the student preserved what matters.

What you'll learn

  • What distillation actually is: supervised training on teacher-generated outputs, not magic compression
  • How to identify bounded tasks that are good distillation candidates
  • How to collect and filter training data from teacher model runs
  • How to train a student model using parameter-efficient fine-tuning (PEFT) with TRL
  • How to evaluate whether the student matches the teacher on your benchmark
  • The hardware requirements and where to run training

Concepts

Distillation — training a smaller model (the student) to reproduce the outputs of a larger model (the teacher) on a specific task. The student doesn't learn the teacher's general capabilities — it learns to mimic the teacher's behavior on the examples you provide. This is why task boundaries matter: a student trained on retrieval-routing examples won't suddenly become good at code generation.

Teacher model — the larger, more capable model whose behavior you want to compress. In our system, this is the workhorse LLM you've been using for the anchor project (GPT-4o, Claude Sonnet, or similar). The teacher's job during distillation is to produce high-quality outputs over a bounded task set. These outputs become the training data for the student.

Student model — the smaller, cheaper model you're training. Typically an open-weight model in the 1B-14B parameter range: Qwen 2.5, Llama 3, Phi-3, Gemma 2, or similar. The student will be faster and cheaper to run than the teacher, but only on the specific task you trained it for. On everything else, it'll perform like its base model.

Bounded task — a task with clear inputs, clear outputs, and limited scope. Good distillation candidates from our system:

  • Retrieval routing: given a question, classify which retrieval mode to use (code, docs, hybrid, skip)
  • Query rewriting: given a user question and conversation context, produce a standalone search query
  • Evidence summarization: given an evidence bundle and a question, produce a structured answer in a fixed schema
  • Bug summary formatting: given a code diff and error output, produce a structured bug report

Bad candidates: open-ended code generation, multi-step reasoning across many files, tasks where the output format is still changing.

SFT (Supervised Fine-Tuning) — training a model on input-output pairs where the inputs are task prompts and the outputs are teacher-generated completions. This is the simplest training paradigm and the right starting point for distillation. You're not teaching the model new knowledge — you're teaching it to produce outputs that look like the teacher's outputs for this task class.

PEFT (Parameter-Efficient Fine-Tuning) — training techniques that update only a small fraction of the model's parameters instead of all of them. This dramatically reduces the memory and compute required for training. The most common PEFT method is LoRA.

LoRA (Low-Rank Adaptation) — a PEFT technique that adds small trainable matrices to specific layers of the model while keeping the original weights frozen. Instead of updating billions of parameters, you update millions — typically 0.1-1% of the total. The trained LoRA weights are saved as a small adapter file that can be loaded on top of the base model.

QLoRA (Quantized LoRA) — LoRA applied to a model that's been quantized to 4-bit precision. This reduces VRAM requirements substantially: full fine-tuning of a 7B model needs tens of gigabytes of VRAM once gradients and optimizer states are counted (the fp16 weights alone occupy about 14GB), while QLoRA can train the same model in roughly 10GB. The tradeoff is slightly lower precision, but for most distillation tasks the quality loss is minimal. See the Hardware Guide for specific VRAM requirements.
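The arithmetic behind those numbers is worth seeing once. A back-of-envelope sketch (the bytes-per-parameter figures are the standard ones for fp16, 4-bit weights, and Adam's two fp32 moments; activation memory is ignored, so real usage is higher):

```python
# Rough VRAM floor for fine-tuning, in GB, given parameter count in billions.
# Ignores activations, KV cache, and framework overhead.

def full_ft_gb(params_b: float) -> float:
    """Weights (fp16) + gradients (fp16) + Adam states (2 x fp32)."""
    weights = params_b * 2   # 2 bytes per parameter
    grads = params_b * 2
    adam = params_b * 8      # two fp32 moments, 4 bytes each
    return weights + grads + adam

def qlora_gb(params_b: float, adapter_frac: float = 0.01) -> float:
    """4-bit frozen base weights + a small trainable adapter and its states."""
    base = params_b * 0.5    # 0.5 bytes per parameter at 4-bit
    adapter = params_b * adapter_frac * (2 + 2 + 8)
    return base + adapter

print(f"Full FT, 7B: ~{full_ft_gb(7):.0f} GB")
print(f"QLoRA, 7B:  ~{qlora_gb(7):.1f} GB (plus activations)")
```

Activations, KV cache, and framework overhead make up the difference between this floor and the roughly 10GB observed in practice for a 7B QLoRA run.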

Problem-to-Tool Map

| Problem class | Symptom | Cheapest thing to try first | Tool or approach |
| --- | --- | --- | --- |
| Stable task is too expensive | Per-request cost is high on a task that rarely fails | Model routing to a cheaper hosted model | Distillation to a local student model |
| Stable task is too slow | Latency is high on a task where speed matters | Prompt compression, shorter outputs | Distillation to a smaller, faster model |
| Teacher changes invalidate student | Retrained student drifts from current teacher behavior | Less frequent retraining | Only distill tasks that are genuinely stable |
| Student quality drops on edge cases | Student handles common cases but fails on rare ones | More diverse training examples | Larger student model or keep teacher for edge cases |
| Can't run training locally | No GPU or insufficient VRAM | Cloud GPU (see Hardware Guide) | Hosted fine-tuning API as alternative |

Walkthrough

Step 1: Choose a bounded task

Start with retrieval routing because it's the most bounded task in our system. The input is a user question plus optional conversation context. The output is one of four labels: code, docs, hybrid, or skip. The teacher model already does this well, and we have eval coverage from Module 6.

Why retrieval routing first? It has the tightest input-output contract. The output is a single classification label, not free-form text. This means evaluation is straightforward (exact match), and training data quality is easy to verify. Once you've done this once, the same workflow applies to more complex tasks like evidence summarization.

Step 2: Collect teacher outputs

Run the teacher model over your benchmark questions and any additional examples you have. The goal is to collect input-output pairs where the teacher's output is correct:

# optimization/collect_teacher_data.py
"""Collect teacher model outputs for distillation training data.

Runs the teacher model over a set of examples and filters for
correct outputs to use as training data.
"""

import json
from pathlib import Path


def load_examples(path: str) -> list[dict]:
    """Load examples from a JSONL file.

    Each line should have at minimum: question, and optionally
    conversation_context and expected_label for filtering.
    """
    examples = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                examples.append(json.loads(line))
    return examples


def format_routing_prompt(question: str, context: str = "") -> str:
    """Format a retrieval-routing classification prompt.

    This is the same prompt your pipeline uses for routing.
    """
    prompt = (
        "Classify this question into one of four retrieval modes.\n\n"
        "Modes:\n"
        "- code: question is about specific code, files, or symbols\n"
        "- docs: question is about documentation, README, or guides\n"
        "- hybrid: question spans both code and documentation\n"
        "- skip: question is general knowledge, no retrieval needed\n\n"
    )
    if context:
        prompt += f"Conversation context:\n{context}\n\n"
    prompt += f"Question: {question}\n\nMode:"
    return prompt


def collect_teacher_outputs(
    examples: list[dict],
    generate_fn,
) -> list[dict]:
    """Run the teacher model and collect outputs.

    Args:
        examples: List of dicts with 'question' and optional
            'conversation_context' and 'expected_label'.
        generate_fn: Function that takes a prompt string and
            returns the model's text output.

    Returns:
        List of training examples with teacher outputs.
    """
    results = []

    for ex in examples:
        prompt = format_routing_prompt(
            ex["question"],
            ex.get("conversation_context", ""),
        )
        teacher_output = generate_fn(prompt).strip().lower()

        result = {
            "prompt": prompt,
            "completion": teacher_output,
            "question": ex["question"],
        }

        # If we have expected labels, mark whether teacher was correct
        if "expected_label" in ex:
            result["expected"] = ex["expected_label"]
            result["teacher_correct"] = (
                teacher_output == ex["expected_label"]
            )

        results.append(result)

    return results


def filter_correct(results: list[dict]) -> list[dict]:
    """Keep examples where the teacher produced the correct output.

    Examples without an expected label have nothing to check against,
    so they are kept as-is.
    """
    return [r for r in results if r.get("teacher_correct", True)]


def save_training_data(results: list[dict], output_path: str) -> None:
    """Save filtered results as training data in JSONL format."""
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)

    with open(path, "w") as f:
        for r in results:
            training_example = {
                "prompt": r["prompt"],
                "completion": r["completion"],
            }
            f.write(json.dumps(training_example) + "\n")

    print(f"Saved {len(results)} training examples to {output_path}")

Run data collection with your teacher model:

# Example usage
from openai import OpenAI

client = OpenAI()


def teacher_generate(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=10,
        temperature=0,
    )
    return response.choices[0].message.content


examples = load_examples("benchmark/routing_examples.jsonl")
results = collect_teacher_outputs(examples, teacher_generate)
filtered = filter_correct(results)
save_training_data(filtered, "optimization/training_data/routing.jsonl")

How many examples? For a classification task like retrieval routing, 200-500 high-quality examples is a reasonable starting point. For more complex generation tasks like evidence summarization, you'll want 500-1,000+. Quality matters more than quantity — 300 correct examples beat 1,000 noisy ones.
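Since quality and coverage matter more than raw count, it's worth checking the label distribution of the collected data before training. A small sketch over the JSONL format produced above (the 10% threshold is an arbitrary illustration):

```python
import json
from collections import Counter

def label_distribution(lines) -> Counter:
    """Count completion labels across JSONL training lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line:
            counts[json.loads(line)["completion"]] += 1
    return counts

# In practice, pass the file directly:
#   counts = label_distribution(open("optimization/training_data/routing.jsonl"))
demo = [json.dumps({"completion": c}) for c in ["code", "code", "docs", "skip"]]
counts = label_distribution(demo)
total = sum(counts.values())
for label, n in counts.most_common():
    flag = "  <- underrepresented" if n / total < 0.10 else ""
    print(f"{label:>8}: {n}/{total}{flag}")
```

If one label dominates, the student will learn to over-predict it; collect more examples for the thin labels before training.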

Step 3: Train the student

We'll use PEFT with TRL (Transformer Reinforcement Learning library) and Unsloth for efficient training. Unsloth provides optimized training kernels that reduce memory usage and speed up training significantly.

First, install the training dependencies:

pip install peft trl unsloth transformers datasets

Hardware note

QLoRA training on a 3B model requires roughly 6-8GB of VRAM. A 7B model needs roughly 10-12GB. If you don't have a local GPU, see the Hardware Guide for cloud GPU options. You can also use a hosted fine-tuning API — the workflow concepts are the same, but you'll upload data instead of running training locally.

# optimization/train_student.py
"""Train a student model via QLoRA distillation.

Uses Unsloth + TRL for efficient parameter-efficient fine-tuning
on teacher-generated training data.
"""

from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset


def format_for_training(example: dict) -> dict:
    """Format a prompt-completion pair for SFT training."""
    text = (
        f"### Instruction:\n{example['prompt']}\n\n"
        f"### Response:\n{example['completion']}"
    )
    return {"text": text}


def train(
    model_name: str = "unsloth/Qwen2.5-3B-bnb-4bit",
    training_data_path: str = "optimization/training_data/routing.jsonl",
    output_dir: str = "optimization/models/routing-student",
    max_steps: int = 200,
    learning_rate: float = 2e-4,
    lora_rank: int = 16,
):
    """Run QLoRA training on teacher-generated data.

    Args:
        model_name: Hugging Face model ID. Unsloth provides
            pre-quantized 4-bit versions for efficient training.
        training_data_path: Path to JSONL training data.
        output_dir: Where to save the trained LoRA adapter.
        max_steps: Training steps. Start small and increase if
            validation loss is still decreasing.
        learning_rate: Learning rate. 2e-4 is a good default for
            QLoRA; lower if training is unstable.
        lora_rank: LoRA rank. Higher rank = more capacity but more
            memory. 16 is a good starting point.
    """
    # Load the base model with 4-bit quantization
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_name,
        max_seq_length=2048,
        load_in_4bit=True,
    )

    # Add LoRA adapters to the model
    model = FastLanguageModel.get_peft_model(
        model,
        r=lora_rank,
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
        lora_alpha=lora_rank,  # Common default: alpha == rank
        lora_dropout=0,
        use_gradient_checkpointing="unsloth",
    )

    # Load and format training data
    dataset = load_dataset("json", data_files=training_data_path, split="train")
    dataset = dataset.map(format_for_training)

    # Configure training
    training_config = SFTConfig(
        output_dir=output_dir,
        max_steps=max_steps,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=learning_rate,
        logging_steps=10,
        save_steps=50,
        warmup_steps=20,
        fp16=True,
        dataset_text_field="text",
        max_seq_length=2048,
    )

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        args=training_config,
    )

    # Train
    print(f"Training on {len(dataset)} examples for {max_steps} steps...")
    trainer.train()

    # Save the LoRA adapter (not the full model)
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
    print(f"Saved LoRA adapter to {output_dir}")


if __name__ == "__main__":
    train()

Run training:

python optimization/train_student.py

Training on 300 examples for 200 steps takes roughly 10-20 minutes on a consumer GPU (RTX 3060 or better). You'll see training loss logged every 10 steps — it should decrease and stabilize.
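"Decrease and stabilize" can also be verified after the fact: the Trainer writes a trainer_state.json into each checkpoint containing the logged history. A small reader (the checkpoint path is whatever your run's save_steps produced):

```python
import json
from pathlib import Path

def loss_curve(checkpoint_dir: str) -> list[tuple[int, float]]:
    """Read (step, loss) pairs from a Trainer checkpoint's state file."""
    state = json.loads(
        (Path(checkpoint_dir) / "trainer_state.json").read_text()
    )
    return [
        (entry["step"], entry["loss"])
        for entry in state["log_history"]
        if "loss" in entry  # skip eval-only and summary entries
    ]

# Example usage (path depends on your run):
#   curve = loss_curve("optimization/models/routing-student/checkpoint-200")
#   print(f"loss {curve[0][1]:.3f} -> {curve[-1][1]:.3f}")
```

If the final loss is still dropping steeply, increase max_steps; if it flatlined long ago, you can stop earlier next time.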

Step 4: Evaluate the student

The student needs to match the teacher on the eval suite, not just on the training data. Load the trained adapter and run it against your benchmark:

# optimization/eval_student.py
"""Evaluate student model against teacher on the benchmark.

Compares student accuracy, latency, and cost against the teacher
to determine whether distillation was worthwhile.
"""

import json
import time
from pathlib import Path

from unsloth import FastLanguageModel


def load_student(
    model_name: str = "unsloth/Qwen2.5-3B-bnb-4bit",
    adapter_path: str = "optimization/models/routing-student",
):
    """Load the base model with the trained LoRA adapter."""
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_name,
        max_seq_length=2048,
        load_in_4bit=True,
    )
    # Load LoRA weights on top of base model
    model.load_adapter(adapter_path)
    FastLanguageModel.for_inference(model)
    return model, tokenizer


def evaluate(
    model,
    tokenizer,
    eval_path: str = "benchmark/routing_examples.jsonl",
) -> dict:
    """Run student model on eval set and compute metrics."""
    examples = []
    with open(eval_path) as f:
        for line in f:
            line = line.strip()
            if line:
                examples.append(json.loads(line))

    correct = 0
    total = 0
    latencies = []
    errors = []

    for ex in examples:
        if "expected_label" not in ex:
            continue

        prompt = (
            f"### Instruction:\n"
            f"{format_routing_prompt(ex['question'], ex.get('conversation_context', ''))}\n\n"
            f"### Response:\n"
        )

        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

        start = time.perf_counter()
        outputs = model.generate(
            **inputs,
            max_new_tokens=10,
            do_sample=False,  # greedy decoding; temperature only applies when sampling
        )
        elapsed = time.perf_counter() - start

        response = tokenizer.decode(
            outputs[0][inputs["input_ids"].shape[1]:],
            skip_special_tokens=True,
        ).strip().lower()

        total += 1
        latencies.append(elapsed)

        if response == ex["expected_label"]:
            correct += 1
        else:
            errors.append({
                "question": ex["question"],
                "expected": ex["expected_label"],
                "got": response,
            })

    return {
        "accuracy": correct / total if total > 0 else 0,
        "total": total,
        "correct": correct,
        "avg_latency_ms": sum(latencies) / len(latencies) * 1000 if latencies else 0,
        "errors": errors[:10],  # First 10 errors for inspection
    }


def format_routing_prompt(question: str, context: str = "") -> str:
    """Same formatting as teacher data collection."""
    prompt = (
        "Classify this question into one of four retrieval modes.\n\n"
        "Modes:\n"
        "- code: question is about specific code, files, or symbols\n"
        "- docs: question is about documentation, README, or guides\n"
        "- hybrid: question spans both code and documentation\n"
        "- skip: question is general knowledge, no retrieval needed\n\n"
    )
    if context:
        prompt += f"Conversation context:\n{context}\n\n"
    prompt += f"Question: {question}\n\nMode:"
    return prompt


if __name__ == "__main__":
    print("Loading student model...")
    model, tokenizer = load_student()

    print("Evaluating...")
    results = evaluate(model, tokenizer)

    print(f"\nStudent Evaluation Results:")
    print(f"  Accuracy: {results['accuracy']:.1%}")
    print(f"  Correct: {results['correct']}/{results['total']}")
    print(f"  Avg latency: {results['avg_latency_ms']:.0f}ms")

    if results["errors"]:
        print(f"\nSample errors:")
        for err in results["errors"][:5]:
            print(f"  Q: {err['question'][:60]}...")
            print(f"    Expected: {err['expected']}, Got: {err['got']}")

Run the evaluation:

python optimization/eval_student.py

Expected output:

Loading student model...
Evaluating...

Student Evaluation Results:
  Accuracy: 91.3%
  Correct: 42/46
  Avg latency: 45ms

Sample errors:
  Q: What does the authentication middleware do and where is i...
    Expected: hybrid, Got: code
  Q: How do I set up the project for the first time?...
    Expected: docs, Got: hybrid

Step 5: Compare student vs teacher

The eval results only matter in comparison. Build a side-by-side report:

# optimization/compare_models.py
"""Compare student and teacher performance side by side."""


def compare(teacher_results: dict, student_results: dict) -> None:
    """Print a comparison table."""
    print(f"\n{'Metric':<25} {'Teacher':>12} {'Student':>12} {'Delta':>12}")
    print("-" * 65)

    t_acc = teacher_results["accuracy"]
    s_acc = student_results["accuracy"]
    print(f"{'Accuracy':<25} {t_acc:>11.1%} {s_acc:>11.1%} {s_acc - t_acc:>+11.1%}")

    t_lat = teacher_results["avg_latency_ms"]
    s_lat = student_results["avg_latency_ms"]
    print(f"{'Avg latency (ms)':<25} {t_lat:>11.0f} {s_lat:>11.0f} {s_lat - t_lat:>+11.0f}")

    # Cost comparison depends on your setup. For a hosted teacher vs a
    # local student, the student's per-request cost approaches zero at volume.
    print("\nVerdict")
    print("-" * 40)

    quality_drop = t_acc - s_acc
    speed_gain = t_lat / s_lat if s_lat > 0 else float("inf")

    if quality_drop <= 0.05 and speed_gain >= 2:
        print("  Student is viable: <=5% quality drop with significant speed gain.")
    elif quality_drop <= 0.05:
        print("  Quality holds but speed gain is small. Check whether cost alone justifies the switch.")
    elif quality_drop <= 0.10:
        print("  Student is marginal: 5-10% quality drop. Consider more training data.")
    else:
        print("  Student is not ready: >10% quality drop. Needs more data or a larger student.")

What to look for

When comparing student to teacher, track these three things:

  1. Accuracy drop. Under 5% is a clear win. Between 5-10%, consider whether the cost savings justify the quality loss. Over 10%, the student needs more data, a larger base model, or the task isn't as bounded as you thought.

  2. Error inheritance vs. compression errors. Some student errors will be the same ones the teacher makes (inherited errors). Others are new errors the student introduces through compression. Inherited errors tell you the teacher needs fixing. Compression errors tell you the student needs more training data on those edge cases.

  3. Latency and cost. A local 3B student running on a consumer GPU will be 10-50x cheaper per request than a hosted teacher model. The question is whether the quality-cost tradeoff is worth it for this specific task.
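The cost half of point 3 is simple arithmetic. A sketch with placeholder prices (the per-million-token rates and GPU rental rate are illustrative assumptions, not current list prices):

```python
def hosted_cost(prompt_tokens: int, output_tokens: int,
                in_price_per_m: float, out_price_per_m: float) -> float:
    """Dollar cost of one hosted-API request at per-million-token rates."""
    return (prompt_tokens * in_price_per_m
            + output_tokens * out_price_per_m) / 1_000_000

def local_cost(latency_s: float, gpu_dollars_per_hour: float) -> float:
    """Amortized GPU cost of one local request."""
    return latency_s / 3600 * gpu_dollars_per_hour

# Illustrative numbers: 400-token routing prompt, 5-token label output,
# 45ms student latency, assumed GPU amortization rate.
teacher = hosted_cost(400, 5, in_price_per_m=2.50, out_price_per_m=10.00)
student = local_cost(0.045, gpu_dollars_per_hour=2.00)
print(f"teacher: ${teacher:.6f}/req  student: ${student:.6f}/req  "
      f"ratio: {teacher / student:.0f}x")
```

The exact multiple depends entirely on your token counts, provider pricing, and how you amortize the GPU; the point is to compute it for your task rather than assume.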

Hosted alternatives

Not everyone has a local GPU, and not every team wants to manage model artifacts. Hosted fine-tuning APIs provide the same conceptual workflow with less infrastructure:

| Provider | Service | What you upload | What you get back |
| --- | --- | --- | --- |
| OpenAI | Fine-tuning API | JSONL training data | Fine-tuned model ID you call via the same API |
| Together AI | Fine-tuning API | JSONL training data | Fine-tuned model endpoint |
| Google Cloud | Vertex AI fine-tuning | JSONL training data | Fine-tuned model on Vertex AI |
| Hugging Face | Jobs + TRL | Dataset + base model on the Hub | Trained model on the Hub |

What about Anthropic? Anthropic does not currently offer fine-tuning through their direct API. Fine-tuning of Claude models is available through AWS Bedrock as a managed service, not through Anthropic's platform directly. This may change, so check Anthropic's support documentation for current availability.

The data collection and evaluation steps are identical. You're still collecting teacher outputs, filtering for quality, and evaluating the student. The training step is a hosted API call instead of local GPU work.
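One practical wrinkle: OpenAI's chat-model fine-tuning expects messages-format JSONL rather than the prompt/completion records collected earlier, so a conversion step is needed first. A sketch (the upload and job-creation calls are shown as comments; check your provider's docs for exact parameters):

```python
import json

def to_chat_records(lines):
    """Convert prompt/completion JSONL lines to chat-format records,
    the shape OpenAI's fine-tuning API expects."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        r = json.loads(line)
        records.append({
            "messages": [
                {"role": "user", "content": r["prompt"]},
                {"role": "assistant", "content": r["completion"]},
            ]
        })
    return records

# File-to-file usage:
#   with open("optimization/training_data/routing.jsonl") as f_in, \
#        open("optimization/training_data/routing_chat.jsonl", "w") as f_out:
#       for rec in to_chat_records(f_in):
#           f_out.write(json.dumps(rec) + "\n")
#
# Then upload and launch (sketch, not exact parameters):
#   upload = client.files.create(file=open(chat_path, "rb"), purpose="fine-tune")
#   job = client.fine_tuning.jobs.create(training_file=upload.id, model=...)

demo = [json.dumps({"prompt": "Question: ...\n\nMode:", "completion": "docs"})]
print(to_chat_records(demo)[0]["messages"][1])
```

Other providers accept slightly different schemas, but the conversion is always a mechanical mapping from the same collected records.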

Tradeoff

Hosted fine-tuning is simpler but less transparent. You can't inspect training dynamics, adjust hyperparameters mid-run, or iterate as quickly. For learning, local training teaches you more. For production, hosted training reduces operational burden.

Exercises

  1. Distill retrieval routing. Follow the full walkthrough: collect teacher outputs, filter, train, evaluate. What accuracy does your student achieve?

  2. Distill a harder task. Try evidence summarization: given an evidence bundle and a question, produce a structured JSON answer. How does the data collection change? How does evaluation change?

  3. Error analysis. For your student's errors, classify each as inherited (teacher also got it wrong) or compression-introduced (teacher got it right, student got it wrong). What does the ratio tell you?

  4. Vary the student size. Train both a 1.5B and a 7B student on the same data. How does the accuracy-latency tradeoff change? Where is the sweet spot for your task?

Completion checkpoint

You're done with this lesson when you can:

  • Identify bounded tasks in your system that are candidates for distillation
  • Collect and filter teacher outputs into clean training data
  • Train a student model using QLoRA with PEFT and TRL
  • Evaluate the student against the teacher on your benchmark
  • Articulate when distillation is worth the effort and when it's premature

What's next

Fine-Tuning. Distillation makes a good behavior cheaper; the next lesson is for behavior that still fails after the cheaper interventions have been exhausted.


Glossary
API (Application Programming Interface)Foundational terms
A structured way for programs to communicate. In this context, usually an HTTP endpoint you call to interact with an LLM.
AST (Abstract Syntax Tree)Foundational terms
A tree representation of source code structure. Used by parsers like Tree-sitter to understand code as a hierarchy of functions, classes, and statements. You'll encounter this more deeply in the Code Retrieval module, but the concept appears briefly in retrieval fundamentals.
BM25 (Best Match 25)Foundational terms
A classical ranking function for keyword search. Scores documents by term frequency and inverse document frequency. Often competitive with or complementary to vector search.
ChunkingFoundational terms
Splitting a document into smaller pieces for indexing and retrieval. Chunk boundaries significantly affect retrieval quality. Split at the wrong place and your retrieval will return half a function or the end of one paragraph glued to the start of another.
Context engineeringFoundational terms
The discipline of selecting, packaging, and budgeting the information a model sees at inference time. Prompts, retrieved evidence, tool results, memory, and state are all parts of context. Context engineering is arguably the core skill of AI engineering. Bigger context windows are not a substitute for better context selection.
Context rotFoundational terms
Degradation of output quality caused by stale, noisy, or accumulated context. Symptoms include stale memory facts, conflicting retrieved evidence, bloated prompt history, and accumulated instructions that contradict each other. A form of technical debt in AI systems.
Context windowFoundational terms
The maximum number of tokens an LLM can process in a single request (input + output combined).
EmbeddingFoundational terms
A fixed-length numeric vector representing a piece of text. Used for similarity search: texts with similar meanings have nearby embeddings.
EndpointFoundational terms
A specific URL path that accepts requests and returns responses (e.g., POST /v1/chat/completions).
GGUFFoundational terms
A file format for quantized models used by llama.cpp and Ollama. When you see a model name like qwen2.5:7b-q4_K_M, the suffix indicates the quantization scheme. GGUF supports mixed quantization (different precision for different layers) and is the most common format for local inference.
HallucinationFoundational terms
When a model generates content that sounds confident but isn't supported by the evidence it was given, or fabricates details that don't exist. Not the same as "any wrong answer"; a model that misinterprets ambiguous instructions gave a bad answer but didn't hallucinate. Common causes: weak prompt, missing context, context rot, model limitation, or retrieval failure.
InferenceFoundational terms
Running a trained model to generate output from input. What happens when you call an API. Most AI engineering work is inference-time work: building systems around models, not training them. Use "inference," not "inferencing."
JSON (JavaScript Object Notation)Foundational terms
A lightweight text format for structured data. The lingua franca of API communication.
Lexical searchFoundational terms
Finding items by matching keywords or terms. Includes BM25, TF-IDF (Term Frequency–Inverse Document Frequency), and simple keyword matching. Returns exact term matches, not semantic similarity.
LLM (Large Language Model)Foundational terms
A neural network trained on large text corpora that generates text by predicting the next token. The core technology behind AI engineering; every tool, pattern, and pipeline in this curriculum runs on top of one.
MetadataFoundational terms
Structured information about a document or chunk (file path, language, author, date, symbol type). Used for filtering retrieval results.
Neural networkFoundational terms
A computing system loosely inspired by biological neurons, built from layers of mathematical functions that transform inputs into outputs. LLMs are a specific type of neural network (transformers) trained on text. You don't need to understand neural network internals to do AI engineering, but knowing the term helps when reading external resources.
Reasoning modelFoundational terms
A model optimized for complex multi-step planning, math, and logic (e.g., o3, o4-mini). Slower and more expensive but better on hard problems. Sometimes called "LRM" (large reasoning model), but "reasoning model" is the more consistent term across provider docs.
RerankingFoundational terms
A second-pass scoring step that re-orders retrieved results using a more expensive model. Improves precision after an initial broad retrieval.
SchemaFoundational terms
A formal description of the shape and types of a data structure. Used to validate inputs and outputs.
SLM (small language model)Foundational terms
A compact model (typically 1-7B parameters) that runs on consumer hardware with lower cost, latency, and better privacy (e.g., Phi, small Llama variants, Gemma). The right choice when privacy, offline operation, predictable cost, or low latency matter more than peak capability.
System promptFoundational terms
A special message that sets the model's behavior, role, and constraints for a conversation.
TemperatureFoundational terms
A parameter controlling output randomness. Lower values produce more deterministic output; higher values produce more varied output. Does not affect the model's intelligence.
TokenFoundational terms
The basic unit an LLM processes. Not a word. Tokens are sub-word fragments. "unhappiness" might be three tokens: "un", "happi", "ness". Token count determines cost and context window usage.
Top-kFoundational terms
The number of results returned from a retrieval query. "Top-5" means the five highest-scoring results.
Top-p (nucleus sampling)Foundational terms
An alternative to temperature for controlling output diversity. Selects from the smallest set of tokens whose cumulative probability exceeds p.
Vector searchFoundational terms
Finding items by proximity in embedding space (nearest neighbors). Returns "similar" results, not "exact match" results.
vLLM (virtual LLM)Foundational terms
An inference serving engine (not a model) that hosts open-weight models behind an OpenAI-compatible HTTP endpoint. Infrastructure layer, not model layer. Relevant when moving from hosted APIs to self-hosting.
WeightsFoundational terms
The learned parameters inside a model. Changed during training, fixed during inference.
Workhorse modelFoundational terms
A general-purpose LLM optimized for speed and broad capability (e.g., GPT-4o-mini, Claude Haiku, Gemini Flash). The default for most tasks. When someone says "LLM" without qualification, they usually mean this.
BaselineBenchmark and Harness terms
The first measured performance of your system on a benchmark. Everything else is compared against this. Without a baseline, you can't tell whether a change helped.
BenchmarkBenchmark and Harness terms
A fixed set of questions or tasks with known-good answers, used to measure system performance over time.
Run logBenchmark and Harness terms
A structured record (typically JSONL) of every system run: what input was given, what output was produced, what tools were called, how long it took, and what it cost. The raw data that evals, telemetry, and cost analysis are built from.
A2A (Agent-to-Agent protocol)Agent and Tool Building terms
An open protocol for peer-to-peer agent collaboration. Agents discover each other's capabilities and delegate or negotiate tasks as equals. Different from MCP (which connects agents to tools, not to other agents) and from handoffs (which transfer control within one system).
AgentAgent and Tool Building terms
A system where an LLM decides which tools to call, observes results, and iterates until a task is complete. Agent = model + tools + control loop.
Control loopAgent and Tool Building terms
The code that manages the agent's cycle: send prompt, check for tool calls, execute tools, append results, repeat or finish.
**Handoff** *(Agent and Tool Building terms)*
Passing control from one agent or specialist to another within an orchestrated system.
**MCP (Model Context Protocol)** *(Agent and Tool Building terms)*
An open protocol for exposing tools, resources, and prompts to AI applications in a standardized way. Connects agents to capabilities (tools and data), not to other agents.
**Tool calling / function calling** *(Agent and Tool Building terms)*
The model's ability to request execution of a specific function with structured arguments, rather than just generating text.
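The "structured arguments" part is the key difference from plain text generation. A minimal sketch of receiving and dispatching such a request; `get_weather` and the payload shape are hypothetical (real providers each have their own wire format):

```python
import json

# Hypothetical structured request emitted by the model instead of prose.
model_output = json.dumps({"name": "get_weather", "arguments": {"city": "Oslo"}})

def get_weather(city):
    return f"Forecast for {city}: cloudy"   # stand-in for a real lookup

registry = {"get_weather": get_weather}     # tools the harness is willing to run

call = json.loads(model_output)
result = registry[call["name"]](**call["arguments"])
print(result)
```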
**Context compilation / context packing** *(Code Retrieval terms)*
The process of selecting and assembling the smallest useful set of evidence for a specific task. Not "dump everything retrieved into the prompt."
**Grounding** *(Code Retrieval terms)*
Tying model assertions to specific evidence. A grounded answer cites what it found; an ungrounded answer asserts without evidence.
**Hybrid retrieval** *(Code Retrieval terms)*
Combining multiple retrieval methods (e.g., vector search + keyword search + metadata filters) and merging or reranking the results.
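One common way to merge ranked lists is reciprocal rank fusion, sketched below with made-up document IDs. This is one merging strategy among several, not the definitive one:

```python
def rrf(rankings, k=60):
    # Reciprocal Rank Fusion: an item ranked near the top by any
    # method accumulates a high score; k damps the influence of rank.
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits  = ["doc_a", "doc_b", "doc_c"]   # semantic search results
keyword_hits = ["doc_b", "doc_d", "doc_a"]   # keyword search results
print(rrf([vector_hits, keyword_hits]))
```

`doc_b` wins because both methods rank it highly, even though neither ranked it first.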
**Knowledge graph** *(Code Retrieval terms)*
A data structure that stores entities and their relationships explicitly (e.g., "function A calls function B," "module X imports module Y"). Useful for traversal and dependency reasoning. One retrieval strategy among several, often overused when simpler metadata or adjacency tables would suffice.
**RAG (Retrieval-Augmented Generation)** *(Code Retrieval terms)*
A pattern where the model's response is grounded in retrieved external evidence rather than relying solely on its training data.
**Symbol table** *(Code Retrieval terms)*
A mapping of code identifiers (functions, classes, variables) to their locations and metadata.
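For Python source, the standard library's `ast` module is enough to build a tiny one; here the metadata is pared down to just line numbers:

```python
import ast

source = '''
class Cache:
    def get(self, key): ...

def warm_cache(): ...
'''

# Walk the syntax tree and record identifier -> line number.
symbols = {}
for node in ast.walk(ast.parse(source)):
    if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
        symbols[node.name] = node.lineno

print(symbols)
```

A real symbol table would also carry file paths, signatures, and docstrings; the mapping idea is the same.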
**Tree-sitter** *(Code Retrieval terms)*
An incremental parsing library that builds ASTs for source code. Used in this curriculum for code-aware chunking and symbol extraction.
**Context pack** *(RAG and Grounded Answers terms)*
A structured bundle of evidence assembled for a specific task, with metadata about provenance, relevance, and token budget.
**Evidence bundle** *(RAG and Grounded Answers terms)*
A collection of retrieved items grouped for a specific sub-task, with enough metadata to evaluate whether the evidence is relevant and sufficient.
**Retrieval routing** *(RAG and Grounded Answers terms)*
Deciding which retrieval strategy or method to use for a given query. Different questions need different retrieval methods.
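A router can start as a handful of heuristics before graduating to a classifier or an LLM call. The rules and method names below are toy assumptions, not a recommended policy:

```python
def route(query):
    # Toy heuristics; real routers are often a small classifier or an LLM call.
    if query.strip().startswith(("def ", "class ")) or "(" in query:
        return "symbol_lookup"      # looks like a code identifier
    if any(word in query.lower() for word in ("why", "how", "explain")):
        return "vector_search"      # conceptual question: semantic similarity
    return "keyword_search"         # default: literal term matching

print(route("parse_config()"))
print(route("why does retry backoff double?"))
```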
**Eval** *(Observability and Evals terms)*
A structured test that measures system quality. Not the same as training. Evals measure, they don't change the model.
**Harness (AI harness / eval harness)** *(Observability and Evals terms)*
The experiment and evaluation framework around your model or agent. It runs benchmark tasks, captures outputs, logs traces, grades results, and compares system versions. It turns ad hoc "try it and see" into repeatable, comparable experiments. Typically includes: input dataset, prompt and tool configuration, model/provider selection, execution loop, logging, grading, and artifact capture.
**LLM-as-judge** *(Observability and Evals terms)*
Using a language model to evaluate or grade the output of another model or system. Useful for scaling evaluation beyond manual review, but requires rubric quality, judge consistency checks, and human spot-checking. Not a replacement for exact-match checks where they apply.
**OpenTelemetry (OTel)** *(Observability and Evals terms)*
An open standard for collecting and exporting telemetry data (traces, metrics, logs). Vendor-agnostic.
**RAGAS** *(Observability and Evals terms)*
A specific eval framework for retrieval-augmented generation. Measures metrics like faithfulness, relevance, and context precision. One tool example, not a foundational concept. Learn the metrics first, then the tool.
**Span** *(Observability and Evals terms)*
A single operation within a trace (e.g., one tool call, one retrieval query). Traces are made of spans.
**Telemetry** *(Observability and Evals terms)*
Structured data about system behavior: what happened, when, how long it took, what it cost. Includes traces, metrics, and events.
**Trace** *(Observability and Evals terms)*
A structured record of one complete run through the system, including all steps, tool calls, and decisions.
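The trace/span relationship can be shown in a few lines: each timed operation appends a span, and the trace is just the ordered list. This is a bare sketch, not the OpenTelemetry API:

```python
import time
from contextlib import contextmanager

trace = []  # one trace = an ordered list of spans

@contextmanager
def span(name):
    start = time.perf_counter()
    try:
        yield
    finally:
        # Record the operation's name and duration as one span.
        trace.append({"name": name, "duration_s": time.perf_counter() - start})

with span("retrieval"):
    time.sleep(0.01)   # stand-in for a retrieval query
with span("generation"):
    time.sleep(0.01)   # stand-in for a model call

print([s["name"] for s in trace])
```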
**Long-term memory** *(Orchestration and Memory terms)*
Persistent facts that survive across conversations. Requires write policies to manage what gets stored, updated, or deleted.
**Orchestration** *(Orchestration and Memory terms)*
Explicit control over how tasks are routed, delegated, and synthesized across multiple agents or specialists.
**Router** *(Orchestration and Memory terms)*
A component that decides which specialist or workflow path to use for a given query.
**Specialist** *(Orchestration and Memory terms)*
An agent or workflow tuned for a narrow task (e.g., "code search," "documentation lookup," "test generation"). Specialists are composed by an orchestrator.
**Thread memory** *(Orchestration and Memory terms)*
Conversation state that persists within a single session or thread.
**Workflow memory** *(Orchestration and Memory terms)*
Intermediate state that persists within a multi-step task but doesn't survive beyond the workflow's completion.
**Catastrophic forgetting** *(Optimization terms)*
When fine-tuning causes a model to lose capabilities it had before training. The model gets better at the fine-tuned task but worse at tasks it previously handled. PEFT methods like LoRA reduce this risk by freezing original weights.
**Distillation** *(Optimization terms)*
Training a smaller (student) model to reproduce the behavior of a larger (teacher) model on a specific task.
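In practice this means turning filtered teacher runs into supervised pairs. A sketch of that conversion step, with a made-up routing task and an illustrative record schema (the `messages` shape follows the common chat-format convention):

```python
import json

# Hypothetical teacher runs; passed_filter marks runs that survived QA filtering.
teacher_runs = [
    {"prompt": "Route: 'where is auth configured?'", "output": "keyword_search",
     "passed_filter": True},
    {"prompt": "Route: 'why is login slow?'", "output": "vector_search",
     "passed_filter": True},
    {"prompt": "Route: 'asdf'", "output": "vector_search",
     "passed_filter": False},   # low-quality run: excluded
]

# Each surviving run becomes one supervised training pair for the student.
pairs = [
    {"messages": [{"role": "user", "content": r["prompt"]},
                  {"role": "assistant", "content": r["output"]}]}
    for r in teacher_runs if r["passed_filter"]
]
jsonl = "\n".join(json.dumps(p) for p in pairs)
print(len(pairs))
```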
**DPO (Direct Preference Optimization)** *(Optimization terms)*
A method for preference-based model optimization that's simpler than RLHF, training the model directly on preference pairs without a separate reward model.
**Fine-tuning** *(Optimization terms)*
Updating a model's weights on task-specific data to change its behavior permanently. An umbrella term that includes SFT, instruction tuning, RLHF, DPO, and other techniques. See the fine-tuning landscape table in Lesson 8.3 for how these relate.
**Full fine-tuning** *(Optimization terms)*
Updating all of a model's parameters during training, as opposed to PEFT methods that update only a small subset. Requires significantly more GPU memory and compute. Produces the most thorough adaptation but carries higher risk of catastrophic forgetting.
**Inference server** *(Optimization terms)*
Software (like vLLM or Ollama) that hosts a model and serves inference requests.
**Instruction tuning** *(Optimization terms)*
A specific application of SFT where the training data consists of instruction-response pairs. This is how base models become chat models: the technique is SFT, the data format is instructions. Not a separate technique from SFT.
**LoRA (Low-Rank Adaptation)** *(Optimization terms)*
A parameter-efficient fine-tuning method that trains small adapter matrices instead of updating all model weights. Dramatically reduces GPU memory and compute requirements.
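The savings come from the adapter's shape: for one d × d weight matrix, LoRA trains two rank-r factors instead of the full matrix. A back-of-envelope count with illustrative sizes:

```python
# Trainable parameters for one d x d weight matrix, full update vs LoRA.
d, r = 4096, 8                 # hidden size and LoRA rank (illustrative values)
full_params = d * d            # updating W directly
lora_params = 2 * d * r        # two low-rank factors: A (d x r) and B (r x d)
print(full_params, lora_params, full_params // lora_params)
```

With these numbers the adapter is 256× smaller than the matrix it adapts, which is why LoRA fits on hardware that full fine-tuning does not.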
**Overfitting** *(Optimization terms)*
When a model memorizes training examples instead of learning generalizable patterns. The model performs well on training data but poorly on new inputs. Detected by monitoring validation loss alongside training loss.
**Parameter count** *(Optimization terms)*
The number of learned weights in a model, commonly expressed in billions (e.g., "7B" = 7 billion parameters). Determines memory requirements (roughly 2 bytes per parameter at FP16) and broadly correlates with capability, though training quality and architecture matter as much as size. See Model Selection and Serving for sizing guidance.
**PEFT (Parameter-Efficient Fine-Tuning)** *(Optimization terms)*
A family of methods (including LoRA) that fine-tune a small subset of parameters instead of the full model.
**Preference optimization** *(Optimization terms)*
Training methods (RLHF, DPO) that use human or automated preference signals to improve model behavior. "This output is better than that output" rather than "this is the correct output."
**QLoRA (Quantized LoRA)** *(Optimization terms)*
LoRA applied to a quantized (compressed) base model. Further reduces memory requirements, enabling fine-tuning on consumer hardware.
**Quantization** *(Optimization terms)*
Reducing the precision of model weights (e.g., FP16 → INT4) to shrink memory usage and increase inference speed at some quality cost. A 7B model at FP16 needs ~14 GB VRAM; quantized to 4-bit, it fits in ~4 GB. Common formats include GGUF (llama.cpp/Ollama) and GPTQ and AWQ (vLLM/Hugging Face). See Model Selection and Serving for format details and tradeoffs.
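The VRAM arithmetic behind the quantization figures is just bytes per weight times parameter count. This sketch counts weights only, which is why its 4-bit result (3.5 GB) sits below the glossary's ~4 GB: activations, KV cache, and format overhead are ignored here:

```python
def vram_gb(params_billion, bits_per_weight):
    # Weight memory only; ignores activations, KV cache, and format overhead.
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9       # decimal GB

print(vram_gb(7, 16))  # FP16: 14.0 GB of weights
print(vram_gb(7, 4))   # 4-bit: 3.5 GB of weights
```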
**RLHF (Reinforcement Learning from Human Feedback)** *(Optimization terms)*
A training method that uses human preference signals to improve model behavior through a reward model. More complex than DPO (requires training a separate reward model) but offers more control over the optimization objective.
**SFT (Supervised Fine-Tuning)** *(Optimization terms)*
Fine-tuning using input-output pairs where the desired output is known. The most common fine-tuning approach.
**TRL (Transformer Reinforcement Learning)** *(Optimization terms)*
A Hugging Face library for training language models with reinforcement learning, SFT, and other optimization methods.
**Consumer chat app** *(Cross-cutting terms)*
The browser or desktop product meant for human conversation (ChatGPT, Claude, HuggingChat). Useful for experimentation, but not the same as API access.
**Developer platform** *(Cross-cutting terms)*
The provider's API, billing, API-key, and developer-docs surface. This is what you need for this learning path.
**Hosted API** *(Cross-cutting terms)*
The provider runs the model for you and you call it over HTTP.
**Local inference** *(Cross-cutting terms)*
You run the model on your own machine.
**Provider** *(Cross-cutting terms)*
The company or service that hosts a model API you call from code.
**Prompt caching** *(Cross-cutting terms)*
Reusing computation from repeated prompt prefixes to reduce latency and cost on subsequent requests with the same prefix.
**Rate limiting** *(Cross-cutting terms)*
Constraints on how many API requests you can make per unit of time. An operational concern that affects system design and cost.
**Token budget** *(Cross-cutting terms)*
The maximum number of tokens you allocate for a specific part of the context (e.g., "retrieval evidence gets at most 4K tokens"). A context engineering tool for preventing any single component from dominating the context window.
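Enforcing a budget can be as simple as greedy packing: add whole chunks in ranked order until the next one would exceed the allocation. The whitespace token counter below is a deliberate simplification; real systems use the model's tokenizer:

```python
def pack_evidence(chunks, budget_tokens,
                  count_tokens=lambda text: len(text.split())):
    # Greedy packing: whitespace counting stands in for a real tokenizer.
    packed, used = [], 0
    for chunk in chunks:             # chunks assumed pre-sorted by relevance
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            break                    # next chunk would blow the budget
        packed.append(chunk)
        used += cost
    return packed

chunks = ["def foo(): pass", "class Bar: ...", "x = 1"]
print(pack_evidence(chunks, budget_tokens=7))
```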