Distillation: Compressing Stable Behaviors
The optimization ladder says distillation is Rung 4. You'll reach for it when prompt engineering, retrieval improvement, and context engineering have been tried and measured, and a bounded task is working well but costing too much or running too slowly. Distillation is how you compress that stable behavior into a smaller, cheaper model.
The key constraint is "stable." Distillation transfers behavior from a teacher to a student. If the teacher's behavior is still changing (you're still tuning prompts, adjusting retrieval, or refining the task definition), distillation will bake in whatever the teacher does today, including the parts you haven't finished fixing. This lesson covers the full workflow: identifying what to distill, collecting teacher outputs, training a student, and verifying that the student preserved what matters.
What you'll learn
- What distillation actually is: supervised training on teacher-generated outputs, not magic compression
- How to identify bounded tasks that make good distillation candidates
- How to collect and filter training data from teacher model runs
- How to train a student model using parameter-efficient fine-tuning (PEFT) with TRL
- How to evaluate whether the student matches the teacher on your benchmark
- What hardware training requires and where to run it
Concepts
Distillation — training a smaller model (the student) to reproduce the outputs of a larger model (the teacher) on a specific task. The student doesn't learn the teacher's general capabilities — it learns to mimic the teacher's behavior on the examples you provide. This is why task boundaries matter: a student trained on retrieval-routing examples won't suddenly become good at code generation.
Teacher model — the larger, more capable model whose behavior you want to compress. In our system, this is the workhorse LLM you've been using for the anchor project (GPT-4o, Claude Sonnet, or similar). The teacher's job during distillation is to produce high-quality outputs over a bounded task set. These outputs become the training data for the student.
Student model — the smaller, cheaper model you're training. Typically an open-weight model in the 1B-14B parameter range: Qwen 2.5, Llama 3, Phi-3, Gemma 2, or similar. The student will be faster and cheaper to run than the teacher, but only on the specific task you trained it for. On everything else, it'll perform like its base model.
Bounded task — a task with clear inputs, clear outputs, and limited scope. Good distillation candidates from our system:
- Retrieval routing: given a question, classify which retrieval mode to use (code, docs, hybrid, skip)
- Query rewriting: given a user question and conversation context, produce a standalone search query
- Evidence summarization: given an evidence bundle and a question, produce a structured answer in a fixed schema
- Bug summary formatting: given a code diff and error output, produce a structured bug report
Bad candidates: open-ended code generation, multi-step reasoning across many files, tasks where the output format is still changing.
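The "bounded" property is what makes validation cheap. As a minimal sketch (the `VALID_MODES` set and helper below are illustrative, not part of the walkthrough scripts):

```python
# The routing task's entire output space: four labels, nothing else.
VALID_MODES = {"code", "docs", "hybrid", "skip"}

def is_valid_routing_output(output: str) -> bool:
    """A bounded task makes output validation trivial: either the
    model emitted one of the known labels, or it failed outright."""
    return output.strip().lower() in VALID_MODES
```

Contrast this with open-ended code generation, where no such membership check exists — one reason it makes a poor distillation candidate.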
SFT (Supervised Fine-Tuning) — training a model on input-output pairs where the inputs are task prompts and the outputs are teacher-generated completions. This is the simplest training paradigm and the right starting point for distillation. You're not teaching the model new knowledge — you're teaching it to produce outputs that look like the teacher's outputs for this task class.
PEFT (Parameter-Efficient Fine-Tuning) — training techniques that update only a small fraction of the model's parameters instead of all of them. This dramatically reduces the memory and compute required for training. The most common PEFT method is LoRA.
LoRA (Low-Rank Adaptation) — a PEFT technique that adds small trainable matrices to specific layers of the model while keeping the original weights frozen. Instead of updating billions of parameters, you update millions — typically 0.1-1% of the total. The trained LoRA weights are saved as a small adapter file that can be loaded on top of the base model.
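The parameter savings are easy to verify with back-of-the-envelope arithmetic. A sketch (the 4096 dimension is an illustrative size for a single attention projection, not a specific model's config):

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    # LoRA learns a low-rank update A @ B, where A is (d_in x rank)
    # and B is (rank x d_out); only these two matrices are trained.
    return d_in * rank + rank * d_out

full = 4096 * 4096                            # one frozen projection matrix
lora = lora_trainable_params(4096, 4096, 16)  # rank-16 adapter for that matrix
print(lora, lora / full)                      # 131072 trainable params, ~0.8% of the matrix
```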
QLoRA (Quantized LoRA) — LoRA applied to a model that's been quantized to 4-bit precision. This reduces VRAM requirements substantially: a 7B model that normally needs 14GB of VRAM for full fine-tuning can be trained with QLoRA in roughly 10GB. The tradeoff is slightly lower precision, but for most distillation tasks the quality loss is minimal. See the Hardware Guide for specific VRAM requirements.
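You can sanity-check those VRAM numbers with a crude estimate. This is a rule of thumb, not a measurement — the flat overhead term (activations, LoRA gradients and optimizer state, CUDA overhead) is an assumed allowance:

```python
def qlora_vram_estimate_gb(params_billions: float, weight_bits: int = 4,
                           overhead_gb: float = 6.5) -> float:
    """Crude VRAM estimate: quantized weight memory plus a flat
    allowance for everything else. The overhead figure is a rough
    assumption, not measured."""
    weights_gb = params_billions * weight_bits / 8  # bytes per param = bits / 8
    return weights_gb + overhead_gb

print(qlora_vram_estimate_gb(7))  # 10.0 — in line with "roughly 10GB" for a 7B model
```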
Problem-to-Tool Map
| Problem class | Symptom | Cheapest thing to try first | Tool or approach |
|---|---|---|---|
| Stable task is too expensive | Per-request cost is high on a task that rarely fails | Model routing to a cheaper hosted model | Distillation to a local student model |
| Stable task is too slow | Latency is high on a task where speed matters | Prompt compression, shorter outputs | Distillation to a smaller, faster model |
| Teacher changes invalidate student | Retrained student drifts from current teacher behavior | Less frequent retraining | Only distill tasks that are genuinely stable |
| Student quality drops on edge cases | Student handles common cases but fails on rare ones | More diverse training examples | Larger student model or keep teacher for edge cases |
| Can't run training locally | No GPU or insufficient VRAM | Cloud GPU (see Hardware Guide) | Hosted fine-tuning API as alternative |
Walkthrough
Step 1: Choose a bounded task
Start with retrieval routing because it's the most bounded task in our system. The input is a user question plus optional conversation context. The output is one of four labels: code, docs, hybrid, or skip. The teacher model already does this well, and we have eval coverage from Module 6.
Why retrieval routing first? It has the tightest input-output contract. The output is a single classification label, not free-form text. This means evaluation is straightforward (exact match), and training data quality is easy to verify. Once you've run this workflow end to end, the same approach applies to more complex tasks like evidence summarization.
Step 2: Collect teacher outputs
Run the teacher model over your benchmark questions and any additional examples you have. The goal is to collect input-output pairs where the teacher's output is correct:
# optimization/collect_teacher_data.py
"""Collect teacher model outputs for distillation training data.

Runs the teacher model over a set of examples and filters for
correct outputs to use as training data.
"""

import json
from pathlib import Path


def load_examples(path: str) -> list[dict]:
    """Load examples from a JSONL file.

    Each line should have at minimum: question, and optionally
    conversation_context and expected_label for filtering.
    """
    examples = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                examples.append(json.loads(line))
    return examples


def format_routing_prompt(question: str, context: str = "") -> str:
    """Format a retrieval-routing classification prompt.

    This is the same prompt your pipeline uses for routing.
    """
    prompt = (
        "Classify this question into one of four retrieval modes.\n\n"
        "Modes:\n"
        "- code: question is about specific code, files, or symbols\n"
        "- docs: question is about documentation, README, or guides\n"
        "- hybrid: question spans both code and documentation\n"
        "- skip: question is general knowledge, no retrieval needed\n\n"
    )
    if context:
        prompt += f"Conversation context:\n{context}\n\n"
    prompt += f"Question: {question}\n\nMode:"
    return prompt


def collect_teacher_outputs(
    examples: list[dict],
    generate_fn,
) -> list[dict]:
    """Run the teacher model and collect outputs.

    Args:
        examples: List of dicts with 'question' and optional
            'conversation_context' and 'expected_label'.
        generate_fn: Function that takes a prompt string and
            returns the model's text output.

    Returns:
        List of training examples with teacher outputs.
    """
    results = []
    for ex in examples:
        prompt = format_routing_prompt(
            ex["question"],
            ex.get("conversation_context", ""),
        )
        teacher_output = generate_fn(prompt).strip().lower()
        result = {
            "prompt": prompt,
            "completion": teacher_output,
            "question": ex["question"],
        }
        # If we have expected labels, mark whether the teacher was correct
        if "expected_label" in ex:
            result["expected"] = ex["expected_label"]
            result["teacher_correct"] = (
                teacher_output == ex["expected_label"]
            )
        results.append(result)
    return results


def filter_correct(results: list[dict]) -> list[dict]:
    """Keep only examples where the teacher produced the correct output."""
    return [r for r in results if r.get("teacher_correct", True)]


def save_training_data(results: list[dict], output_path: str) -> None:
    """Save filtered results as training data in JSONL format."""
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w") as f:
        for r in results:
            training_example = {
                "prompt": r["prompt"],
                "completion": r["completion"],
            }
            f.write(json.dumps(training_example) + "\n")
    print(f"Saved {len(results)} training examples to {output_path}")

Run data collection with your teacher model:
# Example usage
from openai import OpenAI

client = OpenAI()


def teacher_generate(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=10,
        temperature=0,
    )
    return response.choices[0].message.content


examples = load_examples("benchmark/routing_examples.jsonl")
results = collect_teacher_outputs(examples, teacher_generate)
filtered = filter_correct(results)
save_training_data(filtered, "optimization/training_data/routing.jsonl")

How many examples? For a classification task like retrieval routing, 200-500 high-quality examples is a reasonable starting point. For more complex generation tasks like evidence summarization, you'll want 500-1,000+. Quality matters more than quantity — 300 correct examples beat 1,000 noisy ones.
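Before training, it's worth holding out a small validation slice so you can spot overfitting. A sketch — `split_train_val` is a hypothetical helper, not one of the walkthrough scripts:

```python
import json
import random

def split_train_val(path: str, val_fraction: float = 0.1, seed: int = 42):
    """Shuffle the collected JSONL examples and hold out a validation
    slice. The seed makes the split reproducible across runs."""
    with open(path) as f:
        rows = [json.loads(line) for line in f if line.strip()]
    random.Random(seed).shuffle(rows)
    n_val = max(1, int(len(rows) * val_fraction))
    return rows[n_val:], rows[:n_val]
```

Evaluate the student on the held-out slice as well as the benchmark; if training loss falls while held-out accuracy doesn't, the student is memorizing rather than generalizing.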
Step 3: Train the student
We'll use PEFT with TRL (Transformer Reinforcement Learning library) and Unsloth for efficient training. Unsloth provides optimized training kernels that reduce memory usage and speed up training significantly.
First, install the training dependencies:
pip install peft trl unsloth transformers datasets

QLoRA training on a 3B model requires roughly 6-8GB of VRAM. A 7B model needs roughly 10-12GB. If you don't have a local GPU, see the Hardware Guide for cloud GPU options. You can also use a hosted fine-tuning API — the workflow concepts are the same, but you'll upload data instead of running training locally.
# optimization/train_student.py
"""Train a student model via QLoRA distillation.

Uses Unsloth + TRL for efficient parameter-efficient fine-tuning
on teacher-generated training data.
"""

from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset


def format_for_training(example: dict) -> dict:
    """Format a prompt-completion pair for SFT training."""
    text = (
        f"### Instruction:\n{example['prompt']}\n\n"
        f"### Response:\n{example['completion']}"
    )
    return {"text": text}


def train(
    model_name: str = "unsloth/Qwen2.5-3B-bnb-4bit",
    training_data_path: str = "optimization/training_data/routing.jsonl",
    output_dir: str = "optimization/models/routing-student",
    max_steps: int = 200,
    learning_rate: float = 2e-4,
    lora_rank: int = 16,
):
    """Run QLoRA training on teacher-generated data.

    Args:
        model_name: Hugging Face model ID. Unsloth provides
            pre-quantized 4-bit versions for efficient training.
        training_data_path: Path to JSONL training data.
        output_dir: Where to save the trained LoRA adapter.
        max_steps: Training steps. Start small and increase if
            validation loss is still decreasing.
        learning_rate: Learning rate. 2e-4 is a good default for
            QLoRA; lower if training is unstable.
        lora_rank: LoRA rank. Higher rank = more capacity but more
            memory. 16 is a good starting point.
    """
    # Load the base model with 4-bit quantization
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_name,
        max_seq_length=2048,
        load_in_4bit=True,
    )

    # Add LoRA adapters to the model
    model = FastLanguageModel.get_peft_model(
        model,
        r=lora_rank,
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
        lora_alpha=lora_rank,  # Common default: alpha == rank
        lora_dropout=0,
        use_gradient_checkpointing="unsloth",
    )

    # Load and format training data
    dataset = load_dataset("json", data_files=training_data_path, split="train")
    dataset = dataset.map(format_for_training)

    # Configure training
    training_config = SFTConfig(
        output_dir=output_dir,
        max_steps=max_steps,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=learning_rate,
        logging_steps=10,
        save_steps=50,
        warmup_steps=20,
        fp16=True,
        dataset_text_field="text",
        max_seq_length=2048,
    )

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        args=training_config,
    )

    # Train
    print(f"Training on {len(dataset)} examples for {max_steps} steps...")
    trainer.train()

    # Save the LoRA adapter (not the full model)
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
    print(f"Saved LoRA adapter to {output_dir}")


if __name__ == "__main__":
    train()

Run training:
python optimization/train_student.py

Training on 300 examples for 200 steps takes roughly 10-20 minutes on a consumer GPU (RTX 3060 or better). You'll see training loss logged every 10 steps — it should decrease and stabilize.
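"Decrease and stabilize" can be checked mechanically rather than by eyeballing the log. A sketch of such a check — the helper and thresholds are illustrative assumptions, not part of TRL:

```python
def loss_is_stabilizing(losses: list[float], window: int = 3, tol: float = 0.05) -> bool:
    """Sanity-check logged training losses: the last few values should
    average below the first few, and their spread should be small
    relative to where training started."""
    if len(losses) < 2 * window:
        return False  # not enough log points to judge
    head_avg = sum(losses[:window]) / window
    tail = losses[-window:]
    tail_avg = sum(tail) / window
    return tail_avg < head_avg and (max(tail) - min(tail)) < tol * head_avg

print(loss_is_stabilizing([2.1, 1.6, 1.2, 0.7, 0.61, 0.60, 0.59]))  # True: falling, then flat
```

If the check fails because loss is still falling at the end, increase `max_steps`; if loss oscillates without settling, lower the learning rate.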
Step 4: Evaluate the student
The student needs to match the teacher on the eval suite, not just on the training data. Load the trained adapter and run it against your benchmark:
# optimization/eval_student.py
"""Evaluate student model against teacher on the benchmark.

Compares student accuracy, latency, and cost against the teacher
to determine whether distillation was worthwhile.
"""

import json
import time

from unsloth import FastLanguageModel


def load_student(
    model_name: str = "unsloth/Qwen2.5-3B-bnb-4bit",
    adapter_path: str = "optimization/models/routing-student",
):
    """Load the base model with the trained LoRA adapter."""
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_name,
        max_seq_length=2048,
        load_in_4bit=True,
    )
    # Load LoRA weights on top of the base model
    model.load_adapter(adapter_path)
    FastLanguageModel.for_inference(model)
    return model, tokenizer


def format_routing_prompt(question: str, context: str = "") -> str:
    """Same formatting as teacher data collection."""
    prompt = (
        "Classify this question into one of four retrieval modes.\n\n"
        "Modes:\n"
        "- code: question is about specific code, files, or symbols\n"
        "- docs: question is about documentation, README, or guides\n"
        "- hybrid: question spans both code and documentation\n"
        "- skip: question is general knowledge, no retrieval needed\n\n"
    )
    if context:
        prompt += f"Conversation context:\n{context}\n\n"
    prompt += f"Question: {question}\n\nMode:"
    return prompt


def evaluate(
    model,
    tokenizer,
    eval_path: str = "benchmark/routing_examples.jsonl",
) -> dict:
    """Run the student model on the eval set and compute metrics."""
    examples = []
    with open(eval_path) as f:
        for line in f:
            line = line.strip()
            if line:
                examples.append(json.loads(line))

    correct = 0
    total = 0
    latencies = []
    errors = []

    for ex in examples:
        if "expected_label" not in ex:
            continue
        prompt = (
            f"### Instruction:\n"
            f"{format_routing_prompt(ex['question'], ex.get('conversation_context', ''))}\n\n"
            f"### Response:\n"
        )
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

        start = time.perf_counter()
        outputs = model.generate(
            **inputs,
            max_new_tokens=10,
            do_sample=False,  # greedy decoding for a deterministic label
        )
        elapsed = time.perf_counter() - start

        response = tokenizer.decode(
            outputs[0][inputs["input_ids"].shape[1]:],
            skip_special_tokens=True,
        ).strip().lower()

        total += 1
        latencies.append(elapsed)
        if response == ex["expected_label"]:
            correct += 1
        else:
            errors.append({
                "question": ex["question"],
                "expected": ex["expected_label"],
                "got": response,
            })

    return {
        "accuracy": correct / total if total > 0 else 0,
        "total": total,
        "correct": correct,
        "avg_latency_ms": sum(latencies) / len(latencies) * 1000 if latencies else 0,
        "errors": errors[:10],  # First 10 errors for inspection
    }


if __name__ == "__main__":
    print("Loading student model...")
    model, tokenizer = load_student()
    print("Evaluating...")
    results = evaluate(model, tokenizer)
    print("\nStudent Evaluation Results:")
    print(f"  Accuracy: {results['accuracy']:.1%}")
    print(f"  Correct: {results['correct']}/{results['total']}")
    print(f"  Avg latency: {results['avg_latency_ms']:.0f}ms")
    if results["errors"]:
        print("\nSample errors:")
        for err in results["errors"][:5]:
            print(f"  Q: {err['question'][:60]}...")
            print(f"  Expected: {err['expected']}, Got: {err['got']}")

Run the evaluation:
python optimization/eval_student.py

Expected output:
Loading student model...
Evaluating...
Student Evaluation Results:
Accuracy: 91.3%
Correct: 42/46
Avg latency: 45ms
Sample errors:
Q: What does the authentication middleware do and where is i...
Expected: hybrid, Got: code
Q: How do I set up the project for the first time?...
Expected: docs, Got: hybrid
Step 5: Compare student vs teacher
The eval results only matter in comparison. Build a side-by-side report:
# optimization/compare_models.py
"""Compare student and teacher performance side by side."""


def compare(teacher_results: dict, student_results: dict) -> None:
    """Print a comparison table."""
    print(f"\n{'Metric':<25} {'Teacher':>12} {'Student':>12} {'Delta':>12}")
    print("-" * 65)

    t_acc = teacher_results["accuracy"]
    s_acc = student_results["accuracy"]
    print(f"{'Accuracy':<25} {t_acc:>11.1%} {s_acc:>11.1%} {s_acc - t_acc:>+11.1%}")

    t_lat = teacher_results["avg_latency_ms"]
    s_lat = student_results["avg_latency_ms"]
    print(f"{'Avg latency (ms)':<25} {t_lat:>11.0f} {s_lat:>11.0f} {s_lat - t_lat:>+11.0f}")

    # Cost comparison depends on your setup.
    # For a hosted teacher vs. a local student, the student's cost approaches zero.

    print("\nVerdict")
    print("-" * 40)
    quality_drop = t_acc - s_acc
    speed_gain = t_lat / s_lat if s_lat > 0 else float("inf")
    if quality_drop <= 0.05 and speed_gain >= 2:
        print("  Student is viable: <5% quality drop with significant speed gain.")
    elif quality_drop <= 0.10:
        print("  Student is marginal: 5-10% quality drop. Consider more training data.")
    else:
        print("  Student is not ready: >10% quality drop. Needs more data or a larger student.")

What to look for
When comparing student to teacher, track these three things:
- Accuracy drop. Under 5% is a clear win. Between 5% and 10%, consider whether the cost savings justify the quality loss. Over 10%, the student needs more data, a larger base model, or the task isn't as bounded as you thought.
- Error inheritance vs. compression errors. Some student errors will be the same ones the teacher makes (inherited errors). Others are new errors the student introduces through compression. Inherited errors tell you the teacher needs fixing. Compression errors tell you the student needs more training data on those edge cases.
- Latency and cost. A local 3B student running on a consumer GPU will be 10-50x cheaper per request than a hosted teacher model. The question is whether the quality-cost tradeoff is worth it for this specific task.
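The inherited-vs-compression split falls out of simple set arithmetic on the two error lists. A sketch, assuming errors are keyed by question text (a simplification — a stable example ID would be more robust):

```python
def classify_student_errors(teacher_errors: set[str],
                            student_errors: set[str]) -> dict[str, set[str]]:
    """Split student errors into inherited (teacher also failed on the
    example) and compression-introduced (teacher was right, student
    was not)."""
    return {
        "inherited": student_errors & teacher_errors,    # both models wrong
        "compression": student_errors - teacher_errors,  # only the student wrong
    }
```

A high inherited share points back at the teacher (and the cheaper rungs of the ladder); a high compression share points at training-data coverage for those cases.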
Hosted alternatives
Not everyone has a local GPU, and not every team wants to manage model artifacts. Hosted fine-tuning APIs provide the same conceptual workflow with less infrastructure:
| Provider | Service | What you upload | What you get back |
|---|---|---|---|
| OpenAI | Fine-tuning API | JSONL training data | Fine-tuned model ID you call via the same API |
| Together AI | Fine-tuning API | JSONL training data | Fine-tuned model endpoint |
| Google Cloud | Vertex AI fine-tuning | JSONL training data | Fine-tuned model on Vertex AI |
| Hugging Face | Jobs + TRL | Dataset + base model on the Hub | Trained model on the Hub |
What about Anthropic? Anthropic does not currently offer fine-tuning through their direct API. Fine-tuning of Claude models is available through AWS Bedrock as a managed service, not through Anthropic's platform directly. This may change, so check Anthropic's support documentation for current availability.
The data collection and evaluation steps are identical. You're still collecting teacher outputs, filtering for quality, and evaluating the student. The training step is a hosted API call instead of local GPU work.
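The main mechanical difference is data format. OpenAI's fine-tuning API, for example, expects chat-style JSONL rather than the raw prompt/completion pairs used for local training. A conversion sketch — check the provider's current documentation for the exact schema before uploading:

```python
import json

def to_openai_chat_line(example: dict) -> str:
    """Reshape one local prompt/completion training pair into a
    chat-style record, serialized as a single JSONL line."""
    record = {
        "messages": [
            {"role": "user", "content": example["prompt"]},
            {"role": "assistant", "content": example["completion"]},
        ]
    }
    return json.dumps(record)
```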
Hosted fine-tuning is simpler but less transparent. You can't inspect training dynamics, adjust hyperparameters mid-run, or iterate as quickly. For learning, local training teaches you more. For production, hosted training reduces operational burden.
Exercises
- Distill retrieval routing. Follow the full walkthrough: collect teacher outputs, filter, train, evaluate. What accuracy does your student achieve?
- Distill a harder task. Try evidence summarization: given an evidence bundle and a question, produce a structured JSON answer. How does the data collection change? How does evaluation change?
- Error analysis. For your student's errors, classify each as inherited (teacher also got it wrong) or compression-introduced (teacher got it right, student got it wrong). What does the ratio tell you?
- Vary the student size. Train both a 1.5B and a 7B student on the same data. How does the accuracy-latency tradeoff change? Where is the sweet spot for your task?
Completion checkpoint
You're done with this lesson when you can:
- Identify bounded tasks in your system that are candidates for distillation
- Collect and filter teacher outputs into clean training data
- Train a student model using QLoRA with PEFT and TRL
- Evaluate the student against the teacher on your benchmark
- Articulate when distillation is worth the effort and when it's premature
What's next
Fine-Tuning. Distillation makes a good behavior cheaper; the next lesson is for behavior that still fails after the cheaper interventions have been exhausted.
References
- PEFT documentation — parameter-efficient fine-tuning library (build with this)
- TRL documentation — training library for SFT, DPO, and more (build with this)
- Unsloth documentation — optimized training kernels for efficient fine-tuning (build with this)
- Hardware and Model Size Guide — VRAM requirements and GPU options (start here)
- Model Selection and Serving — choosing the right student model (deep dive)
- The Optimization Ladder — decision framework for when distillation is the right rung (start here)