Module 8: Optimization Fine-Tuning

Fine-Tuning: Persistent Model Adaptation

Fine-tuning is the last rung on the optimization ladder. You reach for it when a repeated failure cluster survives prompt engineering, retrieval improvement, context restructuring, and (if applicable) distillation. The model has the right evidence, clear instructions, well-structured context, and it still gets the answer wrong in the same way. That pattern needs to be baked into the weights.

This is not where most learners should spend most of their time. The curriculum has deliberately placed fine-tuning last because the overwhelming majority of AI engineering problems resolve at earlier rungs. But when the failure diagnostic from the previous lessons consistently points to model-attributed failures, fine-tuning is the tool that addresses them. We'll cover supervised fine-tuning (SFT) with LoRA/QLoRA, data preparation from your run logs, evaluation against the base model, and preference optimization (DPO) as an advanced technique.

What you'll learn

  • Decide when fine-tuning is the right choice and when it's premature
  • Prepare training data from your run logs and eval results
  • Run SFT with LoRA/QLoRA on a local model using PEFT and TRL
  • Evaluate the fine-tuned model against the base model on your benchmark
  • Understand preference optimization (DPO) and when it's appropriate
  • Use hosted fine-tuning APIs when local training isn't practical

Concepts

Fine-tuning vs. distillation — Distillation compresses a working behavior into a cheaper model. Fine-tuning fixes a broken behavior by training on corrected examples. The data sources are different: distillation uses teacher-generated outputs, fine-tuning uses human-curated or grader-verified correct outputs. The goals are different: distillation preserves quality at lower cost, fine-tuning improves quality on specific failure patterns.

SFT (Supervised Fine-Tuning) — the same technique from the distillation lesson, but applied to a different problem. In distillation, SFT inputs are teacher prompts and teacher outputs. In fine-tuning, SFT inputs are the prompts that caused failures and the corrected outputs you want the model to produce instead. The training mechanics are identical — the difference is where the training data comes from.
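
The shared mechanics are easy to see in the training records themselves. A minimal sketch (the field names and the `parse_config` question are illustrative, not a required schema):

```python
# Distillation and fine-tuning feed the same record shape to an SFT
# trainer; only where each field comes from differs.

distillation_example = {
    "prompt": "What does parse_config do?",               # prompt sent to the teacher
    "completion": "parse_config loads settings from...",  # teacher model's output
}

fine_tuning_example = {
    "prompt": "What does parse_config do?",               # prompt from a failing run log
    "completion": "parse_config loads settings from...",  # human-corrected output
}

# Structurally identical: the trainer cannot tell them apart.
assert distillation_example.keys() == fine_tuning_example.keys()
```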

Failure cluster — a group of benchmark questions or production queries that fail in the same way. Examples from our anchor project:

  • The model consistently puts the file path in the wrong field of the structured output
  • The model ignores evidence from test files when answering "what does this function do?"
  • The model produces vague summaries when the evidence contains multiple conflicting code patterns

A failure cluster is a fine-tuning candidate when: (1) the failures repeat across runs, (2) the pattern is consistent enough to label, and (3) you can produce correct examples for each failure case.
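
The first two criteria can be checked mechanically; the third (can you write correct outputs?) is a human judgment call. A sketch, assuming each log entry carries `run_id` and `failure_label` fields — adapt to your log schema:

```python
def is_fine_tuning_candidate(cluster: list[dict], min_size: int = 5) -> bool:
    """Check the mechanical candidacy criteria for a failure cluster.

    Criterion 1: failures repeat across runs (at least two distinct runs).
    Criterion 2: the pattern is consistent enough to share one label.
    Criterion 3 (can you produce correct examples?) is not checked here.
    """
    if len(cluster) < min_size:
        return False
    if len({e.get("run_id") for e in cluster}) < 2:
        return False
    return len({e.get("failure_label") for e in cluster}) == 1


repeated = [
    {"run_id": f"run-{i % 2}", "failure_label": "wrong_format"} for i in range(6)
]
one_off = [{"run_id": "run-0", "failure_label": "wrong_format"} for _ in range(6)]

assert is_fine_tuning_candidate(repeated)
assert not is_fine_tuning_candidate(one_off)  # never repeated across runs
```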

Training data curation — the process of turning failure clusters into training examples. This is the hardest part of fine-tuning. You need:

  • Inputs: the exact prompts that triggered the failure (from your run logs)
  • Correct outputs: what the model should have produced (written by you or verified by a grader)
  • Negative examples filtered out: prompts where the base model already succeeds (training on these wastes capacity and can cause regression)
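
The last filter is worth automating, since it is easy to forget. A sketch using the `grade` and `question_id` fields from the run log:

```python
def drop_already_passing(
    candidates: list[dict], run_log: list[dict]
) -> list[dict]:
    """Remove prompts the base model already handles correctly.

    Training on examples the model gets right wastes adapter capacity
    and risks regressing behavior that was already fine.
    """
    passing = {
        e["question_id"]
        for e in run_log
        if e.get("grade") in ("correct", "acceptable")
    }
    return [c for c in candidates if c["question_id"] not in passing]


log = [
    {"question_id": "q1", "grade": "correct"},
    {"question_id": "q2", "grade": "wrong"},
]
kept = drop_already_passing([{"question_id": "q1"}, {"question_id": "q2"}], log)
assert [c["question_id"] for c in kept] == ["q2"]
```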

Regression — when fine-tuning on one task degrades performance on other tasks. A model fine-tuned to always cite test files might start over-citing irrelevant test files on questions that don't need them. Regression is the primary risk of fine-tuning and the reason you need a broad eval suite, not just a targeted one.

DPO (Direct Preference Optimization) — a training technique that learns from pairs of outputs: one preferred, one rejected. Instead of showing the model "produce this output," DPO shows the model "output A is better than output B." This is useful when you can rank outputs but can't easily write the perfect gold output. DPO requires a reliable way to judge which output is better — either human annotators or an LLM-as-judge with a strong rubric.

Catastrophic forgetting — when fine-tuning causes a model to lose capabilities it had before training. The model gets better at the fine-tuned task but worse at other tasks it previously handled well. This is different from regression (which is a measurable performance drop on your benchmark) — catastrophic forgetting can affect general capabilities you weren't testing for. PEFT methods like LoRA reduce catastrophic forgetting because they freeze the original weights and only train small adapter layers, leaving most of the model's knowledge intact.

Overfitting — when the model memorizes the training examples instead of learning the underlying pattern. An overfitted model performs well on training data but poorly on new inputs. Signs: training loss drops to near-zero while validation loss stops improving or increases. Mitigation: use a validation split, stop training when validation loss plateaus (early stopping), and keep training data diverse.
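
A deterministic validation split is the cheapest guardrail. A minimal sketch — the 10% holdout fraction is a starting point, not a rule:

```python
import random


def train_val_split(
    examples: list[dict], val_fraction: float = 0.1, seed: int = 42
) -> tuple[list[dict], list[dict]]:
    """Shuffle once with a fixed seed, then hold out a validation slice.

    Watch validation loss during training: if it plateaus or rises
    while training loss keeps falling, you are overfitting — stop.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


data = [{"prompt": str(i), "completion": str(i)} for i in range(50)]
train, val = train_val_split(data)
assert len(train) == 45 and len(val) == 5
assert not {d["prompt"] for d in val} & {d["prompt"] for d in train}
```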

The fine-tuning landscape

"Fine-tuning" is an umbrella term that covers several distinct techniques. If you search for it, you'll find a confusing mix of approaches. Here's how they relate to each other:

| Technique | What it does | Data requirement | What we teach |
|---|---|---|---|
| Full fine-tuning | Updates all model parameters | Large dataset, large GPU | Mentioned for context; too expensive for most curriculum work |
| SFT (Supervised Fine-Tuning) | Trains on input-output pairs | Hundreds to thousands of labeled pairs | Yes — our primary technique (with LoRA/QLoRA) |
| Instruction tuning | SFT specifically on instruction-response pairs | Instruction-formatted dataset | Conceptually — SFT on instructions is how chat models are made |
| PEFT / LoRA / QLoRA | Updates a small fraction of parameters | Same as SFT, less compute | Yes — our default training method |
| DPO | Learns from preferred vs. rejected output pairs | Preference pairs | Yes — covered as an advanced option |
| RLHF | Trains a reward model, then optimizes against it | Human preference data + reward model | Mentioned for context; requires more infrastructure than DPO |
| ORPO / KTO / SimPO | Newer preference methods with simpler training | Similar to DPO | Not covered — the landscape is evolving fast; DPO teaches the core concept |

The techniques in this lesson — SFT with LoRA/QLoRA and DPO — cover the approaches that are practical for individual engineers and small teams. Full fine-tuning requires multi-GPU setups that are outside the scope of the curriculum's hardware assumptions. RLHF requires training a separate reward model, adding complexity that isn't justified until you're operating at scale. The newer preference methods (ORPO, KTO, SimPO) are worth watching but still stabilizing — DPO teaches the core concept of preference-based optimization, and the mechanics transfer.

What about "domain-specific fine-tuning" and "multi-task learning"? These are applications of the techniques above, not separate techniques. Domain-specific fine-tuning is SFT applied to domain data. Multi-task learning is SFT on multiple tasks simultaneously. The technique is the same — what changes is the data curation strategy.

Problem-to-Tool Map

| Problem class | Symptom | Cheapest thing to try first | Tool or approach |
|---|---|---|---|
| Repeated format failures | Model puts data in wrong fields despite schema constraints | Stricter output schema + few-shot examples | SFT on corrected format examples |
| Consistent evidence misuse | Model ignores relevant evidence from specific file types | Better context structure, evidence ordering | SFT on examples with correct evidence usage |
| Weak instruction following | Model doesn't follow constraints even with explicit rules | More explicit prompt, decompose instructions | SFT on instruction-correct pairs |
| Can grade but can't specify gold outputs | You know what's better but can't write the perfect answer | Human-curated examples | DPO with preferred/rejected pairs |
| Fine-tuned model regresses on other tasks | New failures appear on previously passing questions | Broader training data | Include passing examples in training set, reduce learning rate |

Walkthrough

Step 1: Identify a failure cluster

Use the failure diagnostic from the optimization ladder lesson to find model-attributed failures. Then group them by pattern:

# optimization/identify_clusters.py
"""Group model-attributed failures into clusters.

Reads a graded run log, filters to model-attributed failures,
and groups them by failure pattern for fine-tuning data curation.
"""

import json
from collections import defaultdict


def load_model_failures(run_log_path: str) -> list[dict]:
    """Load entries where the model failed despite good retrieval."""
    failures = []
    with open(run_log_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line)

            # Model failure: retrieval hit but answer is wrong
            if (
                entry.get("retrieval_hit") is True
                and entry.get("grade") not in ("correct", "acceptable")
            ):
                failures.append(entry)

    return failures


def cluster_by_label(failures: list[dict]) -> dict[str, list[dict]]:
    """Group failures by their failure label."""
    clusters = defaultdict(list)
    for f in failures:
        label = f.get("failure_label", "unlabeled")
        clusters[label].append(f)
    return dict(clusters)


def print_clusters(clusters: dict[str, list[dict]]) -> None:
    """Print cluster summary for fine-tuning candidate selection."""
    print("\nModel-Attributed Failure Clusters")
    print("=" * 50)

    for label, entries in sorted(
        clusters.items(), key=lambda x: -len(x[1])
    ):
        print(f"\n{label}: {len(entries)} failures")
        for entry in entries[:3]:
            q = entry.get("question", "")[:70]
            print(f"  - {q}...")

    total = sum(len(v) for v in clusters.values())
    print(f"\nTotal model-attributed failures: {total}")
    print(f"Clusters: {len(clusters)}")
    print(
        "\nFine-tuning candidates: clusters with 5+ failures "
        "that share a consistent pattern."
    )
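
Before pointing the script at a real log, the filtering logic can be smoke-tested on a few synthetic lines (the fields mirror the run-log schema used above):

```python
import json
from collections import defaultdict

lines = [
    '{"question_id": "q1", "retrieval_hit": true, "grade": "wrong", "failure_label": "wrong_format"}',
    '{"question_id": "q2", "retrieval_hit": true, "grade": "correct"}',
    '{"question_id": "q3", "retrieval_hit": false, "grade": "wrong"}',
    '{"question_id": "q4", "retrieval_hit": true, "grade": "wrong", "failure_label": "wrong_format"}',
]

# Same predicate as load_model_failures: retrieval hit, answer still wrong.
failures = [
    e
    for e in map(json.loads, lines)
    if e.get("retrieval_hit") is True
    and e.get("grade") not in ("correct", "acceptable")
]
clusters = defaultdict(list)
for e in failures:
    clusters[e.get("failure_label", "unlabeled")].append(e)

assert [e["question_id"] for e in failures] == ["q1", "q4"]
assert len(clusters["wrong_format"]) == 2  # q3 is a retrieval failure, not a model failure
```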

Step 2: Curate training data

For each failure in the cluster, you need the original prompt and a corrected output. The prompt comes from your run logs. The corrected output requires human judgment:

# optimization/curate_training_data.py
"""Build fine-tuning training data from failure clusters.

For each failure, extracts the original prompt from the run log
and pairs it with a human-corrected output.
"""

import json
from pathlib import Path


def extract_training_prompt(entry: dict) -> str:
    """Reconstruct the prompt that was sent to the model.

    Uses the question, evidence, and system prompt from the run log.
    Adapt this to match your pipeline's actual prompt structure.
    """
    system = entry.get("system_prompt", "You are a code assistant.")
    evidence = entry.get("evidence_text", "")
    question = entry.get("question", "")

    # Reconstruct the prompt the model actually saw
    prompt = f"{system}\n\nEvidence:\n{evidence}\n\nQuestion: {question}"
    return prompt


def create_training_example(
    entry: dict,
    corrected_output: str,
) -> dict:
    """Create a single training example from a failure + correction."""
    return {
        "prompt": extract_training_prompt(entry),
        "completion": corrected_output,
        "question_id": entry.get("question_id", "unknown"),
        "original_answer": entry.get("answer", ""),
        "failure_label": entry.get("failure_label", ""),
    }


def save_for_review(
    failures: list[dict],
    output_path: str,
) -> None:
    """Save failures in a format for human correction.

    Creates a JSONL file where each entry has the prompt and
    original (incorrect) answer. A human reviewer adds the
    corrected_output field.
    """
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)

    with open(path, "w") as f:
        for entry in failures:
            review_item = {
                "question_id": entry.get("question_id", "unknown"),
                "question": entry.get("question", ""),
                "evidence_files": entry.get("evidence_files", []),
                "original_answer": entry.get("answer", ""),
                "failure_label": entry.get("failure_label", ""),
                "corrected_output": "",  # Human fills this in
            }
            f.write(json.dumps(review_item) + "\n")

    print(f"Saved {len(failures)} items for human review at {output_path}")
    print("Edit corrected_output for each entry, then run build_dataset.py")


def build_dataset(
    reviewed_path: str,
    run_log_path: str,
    output_path: str,
) -> None:
    """Build the final training dataset from reviewed corrections.

    Merges human-corrected outputs with the original run log entries
    to create prompt-completion pairs for SFT.
    """
    # Load reviewed corrections
    corrections = {}
    with open(reviewed_path) as f:
        for line in f:
            item = json.loads(line.strip())
            if item.get("corrected_output"):
                corrections[item["question_id"]] = item["corrected_output"]

    # Load original run log for full prompt reconstruction
    entries = {}
    with open(run_log_path) as f:
        for line in f:
            entry = json.loads(line.strip())
            entries[entry.get("question_id", "")] = entry

    # Build training examples
    training_data = []
    for qid, corrected in corrections.items():
        if qid in entries:
            example = create_training_example(entries[qid], corrected)
            training_data.append(example)

    # Save
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w") as f:
        for ex in training_data:
            f.write(json.dumps({
                "prompt": ex["prompt"],
                "completion": ex["completion"],
            }) + "\n")

    print(f"Built {len(training_data)} training examples at {output_path}")
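
The workflow below calls the script with export and build subcommands; the argument wiring isn't shown above, so here is one minimal sketch (default paths are assumptions — match them to your repo layout, and dispatch to save_for_review / build_dataset in `__main__`):

```python
import argparse


def make_parser() -> argparse.ArgumentParser:
    """CLI wiring for the export/build curation workflow."""
    parser = argparse.ArgumentParser(description="Curate fine-tuning data")
    sub = parser.add_subparsers(dest="command", required=True)

    exp = sub.add_parser("export", help="save failures for human review")
    exp.add_argument("--run-log", default="runs/latest_graded.jsonl")
    exp.add_argument("--out", default="optimization/review/to_review.jsonl")

    build = sub.add_parser("build", help="build the training dataset")
    build.add_argument("--reviewed", default="optimization/review/to_review.jsonl")
    build.add_argument("--run-log", default="runs/latest_graded.jsonl")
    build.add_argument("--out", default="optimization/training_data/failure_corrections.jsonl")

    return parser


args = make_parser().parse_args(["export", "--run-log", "runs/r1.jsonl"])
assert args.command == "export" and args.run_log == "runs/r1.jsonl"
```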

The workflow:

# 1. Export failures for human review
python optimization/curate_training_data.py export

# 2. Human reviews and adds corrected_output to each entry
# (edit the JSONL file in your editor)

# 3. Build the training dataset from reviewed corrections
python optimization/curate_training_data.py build

Quality over quantity. 100 carefully corrected examples will outperform 1,000 sloppy ones. For each failure, make sure the corrected output is genuinely correct — uses the right evidence, follows the output schema, and answers the question accurately. If you're not sure what the correct output should be, skip that example.
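
A few mechanical checks catch the worst curation mistakes before they reach training. A sketch over the review-file fields produced by save_for_review (the length threshold is an arbitrary starting point):

```python
def check_correction(item: dict) -> list[str]:
    """Return a list of problems with one reviewed entry (empty = OK)."""
    problems = []
    corrected = item.get("corrected_output", "").strip()
    if not corrected:
        problems.append("empty correction")
    elif corrected == item.get("original_answer", "").strip():
        problems.append("correction identical to the failing answer")
    elif len(corrected) < 20:
        problems.append("suspiciously short correction")
    return problems


assert check_correction(
    {"corrected_output": "", "original_answer": "x"}
) == ["empty correction"]
assert check_correction(
    {"corrected_output": "same wrong text", "original_answer": "same wrong text"}
) == ["correction identical to the failing answer"]
assert check_correction(
    {"corrected_output": "validate_path resolves the path and checks containment.",
     "original_answer": "It validates paths."}
) == []
```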

Step 3: Train with SFT

The training code is nearly identical to the distillation lesson. The difference is the training data source (human-corrected failures vs. teacher outputs) and potentially the training configuration (you may want a lower learning rate and fewer steps for fine-tuning, since you're making targeted corrections rather than teaching a broad behavior):

# optimization/fine_tune.py
"""Fine-tune a model on curated failure corrections.

Uses the same PEFT + TRL + Unsloth stack as distillation,
but with corrected training data from failure clusters.
"""

from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset


def format_for_training(example: dict) -> dict:
    """Format a prompt-completion pair for SFT training."""
    text = (
        f"### Instruction:\n{example['prompt']}\n\n"
        f"### Response:\n{example['completion']}"
    )
    return {"text": text}


def fine_tune(
    model_name: str = "unsloth/Qwen2.5-7B-bnb-4bit",
    training_data_path: str = "optimization/training_data/failure_corrections.jsonl",
    output_dir: str = "optimization/models/fine-tuned",
    max_steps: int = 100,
    learning_rate: float = 1e-4,
    lora_rank: int = 32,
):
    """Run fine-tuning on corrected failure examples.

    Uses a lower learning rate and fewer steps than distillation
    because we're making targeted corrections, not teaching a
    broad task from scratch.
    """
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_name,
        max_seq_length=4096,  # Longer context for full prompts
        load_in_4bit=True,
    )

    model = FastLanguageModel.get_peft_model(
        model,
        r=lora_rank,
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
        lora_alpha=lora_rank,
        lora_dropout=0,
        use_gradient_checkpointing="unsloth",
    )

    dataset = load_dataset(
        "json", data_files=training_data_path, split="train"
    )
    dataset = dataset.map(format_for_training)

    training_config = SFTConfig(
        output_dir=output_dir,
        max_steps=max_steps,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=learning_rate,
        logging_steps=10,
        save_steps=25,
        warmup_steps=10,
        fp16=True,
        dataset_text_field="text",
        max_seq_length=4096,
    )

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        args=training_config,
    )

    print(f"Fine-tuning on {len(dataset)} corrected examples...")
    trainer.train()

    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
    print(f"Saved fine-tuned adapter to {output_dir}")


if __name__ == "__main__":
    fine_tune()

Run it once your corrected dataset is in place:

python optimization/fine_tune.py

Step 4: Evaluate for improvement AND regression

This is the critical step that separates fine-tuning from wishful thinking. You need to check two things:

  1. Did the failure cluster improve? Run the fine-tuned model on the specific questions that triggered the failures.
  2. Did anything else break? Run the fine-tuned model on the entire benchmark, including questions the base model already passes.

# optimization/eval_fine_tuned.py
"""Evaluate fine-tuned model for both improvement and regression.

Compares the fine-tuned model against the base model on:
1. The targeted failure cluster (did it improve?)
2. The full benchmark (did anything regress?)
"""

import json

from unsloth import FastLanguageModel


def evaluate_both(
    base_model_name: str,
    adapter_path: str,
    benchmark_path: str,
    failure_ids: set[str],
) -> dict:
    """Run base and fine-tuned models on the full benchmark.

    Returns comparison metrics for the failure cluster and
    the rest of the benchmark separately.
    """
    # Load base model
    base_model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=base_model_name,
        max_seq_length=4096,
        load_in_4bit=True,
    )
    FastLanguageModel.for_inference(base_model)

    # Load fine-tuned model (base + adapter)
    ft_model, _ = FastLanguageModel.from_pretrained(
        model_name=base_model_name,
        max_seq_length=4096,
        load_in_4bit=True,
    )
    ft_model.load_adapter(adapter_path)
    FastLanguageModel.for_inference(ft_model)

    # Load benchmark
    entries = []
    with open(benchmark_path) as f:
        for line in f:
            if line.strip():
                entries.append(json.loads(line.strip()))

    results = {
        "cluster_base_correct": 0,
        "cluster_ft_correct": 0,
        "cluster_total": 0,
        "other_base_correct": 0,
        "other_ft_correct": 0,
        "other_total": 0,
        "regressions": [],  # Questions base got right but FT got wrong
    }

    for entry in entries:
        qid = entry.get("question_id", "")
        expected = entry.get("expected_answer", "")
        is_cluster = qid in failure_ids

        # Run both models (simplified — adapt to your pipeline)
        base_answer = run_model(base_model, tokenizer, entry)
        ft_answer = run_model(ft_model, tokenizer, entry)

        base_correct = grade(base_answer, expected)
        ft_correct = grade(ft_answer, expected)

        if is_cluster:
            results["cluster_total"] += 1
            results["cluster_base_correct"] += int(base_correct)
            results["cluster_ft_correct"] += int(ft_correct)
        else:
            results["other_total"] += 1
            results["other_base_correct"] += int(base_correct)
            results["other_ft_correct"] += int(ft_correct)

        # Track regressions: base passed, fine-tuned failed
        if base_correct and not ft_correct:
            results["regressions"].append(qid)

    return results


def run_model(model, tokenizer, entry: dict) -> str:
    """Generate an answer from a model. Adapt to your pipeline."""
    # Placeholder — replace with your actual generation logic
    prompt = entry.get("prompt", entry.get("question", ""))
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)  # greedy decoding
    return tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:],
        skip_special_tokens=True,
    )


def grade(answer: str, expected: str) -> bool:
    """Simple grading. Replace with your actual grading logic."""
    return expected.lower() in answer.lower()


def print_comparison(results: dict) -> None:
    """Print the improvement/regression comparison."""
    print(f"\n{'='*60}")
    print("Fine-Tuning Evaluation: Improvement vs Regression")
    print(f"{'='*60}")

    ct = results["cluster_total"]
    if ct > 0:
        base_pct = results["cluster_base_correct"] / ct
        ft_pct = results["cluster_ft_correct"] / ct
        print(f"\nTargeted failure cluster ({ct} questions):")
        print(f"  Base model:       {base_pct:.1%}")
        print(f"  Fine-tuned model: {ft_pct:.1%}")
        print(f"  Improvement:      {ft_pct - base_pct:+.1%}")

    ot = results["other_total"]
    if ot > 0:
        base_pct = results["other_base_correct"] / ot
        ft_pct = results["other_ft_correct"] / ot
        print(f"\nRest of benchmark ({ot} questions):")
        print(f"  Base model:       {base_pct:.1%}")
        print(f"  Fine-tuned model: {ft_pct:.1%}")
        print(f"  Regression:       {base_pct - ft_pct:+.1%}")

    reg = results["regressions"]
    if reg:
        print(f"\nRegressions ({len(reg)} questions):")
        for qid in reg[:10]:
            print(f"  - {qid}")
    else:
        print("\nNo regressions detected.")

Reading the results

The evaluation tells you one of four things:

| Cluster improved? | Regressions? | Verdict |
|---|---|---|
| Yes | None | Ship it. The fine-tune solved the problem without side effects. |
| Yes | Some | Investigate regressions. Consider including passing examples in training data to anchor existing behavior. |
| No | None | Training data quality issue. Review corrections and add more examples. |
| No | Some | Stop. The fine-tune made things worse. Re-examine whether this is a model problem at all. |

Preference optimization with DPO

SFT requires gold outputs — the exact text the model should produce. For some tasks, you can't write the perfect answer, but you can reliably judge which of two answers is better. DPO (Direct Preference Optimization) handles this case.

# optimization/dpo_example.py
"""Example DPO training data format and training setup.

DPO learns from preference pairs: for each prompt, a preferred
response and a rejected response.
"""

# DPO training data format
dpo_example = {
    "prompt": "Given the evidence above, explain what the validate_path function does.",
    "chosen": (
        "The validate_path function in utils/security.py checks whether "
        "a given file path is contained within the allowed project directory. "
        "It resolves the path using pathlib.Path.resolve() and then calls "
        ".relative_to() to confirm containment. If the path escapes the "
        "project root, it raises a ValueError."
    ),
    "rejected": (
        "The validate_path function validates file paths. It checks if "
        "the path is valid and returns True or False."
    ),
}

# The "chosen" response is grounded in specific evidence (file path,
# method names, behavior). The "rejected" response is vague and
# doesn't reference the evidence. DPO teaches the model to prefer
# the grounded style.
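
One convenient source of preference pairs is the review file from the curation step: the original failing answer is a natural `rejected`, and the human correction a natural `chosen`. A sketch (field names match the review file above; write the result out as JSONL for the trainer):

```python
def build_dpo_pairs(reviewed: list[dict]) -> list[dict]:
    """Turn reviewed corrections into DPO preference pairs.

    Skips entries that lack either a correction or an original answer.
    """
    pairs = []
    for item in reviewed:
        if item.get("corrected_output") and item.get("original_answer"):
            pairs.append({
                "prompt": item.get("question", ""),
                "chosen": item["corrected_output"],
                "rejected": item["original_answer"],
            })
    return pairs


reviewed = [
    {"question": "What does validate_path do?",
     "original_answer": "It validates paths.",
     "corrected_output": "It resolves the path and checks containment."},
    {"question": "Unreviewed entry", "original_answer": "x", "corrected_output": ""},
]
pairs = build_dpo_pairs(reviewed)
assert len(pairs) == 1
assert pairs[0]["rejected"] == "It validates paths."
```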

DPO training with TRL:

from trl import DPOTrainer, DPOConfig

# DPO configuration — similar to SFT but with preference-specific params
dpo_config = DPOConfig(
    output_dir="optimization/models/dpo-tuned",
    max_steps=100,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,  # Lower than SFT — DPO is more sensitive
    beta=0.1,  # KL penalty against the reference model; lower = more freedom to deviate
    logging_steps=10,
    fp16=True,
    max_length=4096,
    max_prompt_length=2048,
)

When to use DPO over SFT

When you have a reliable grading signal (LLM-as-judge, human preferences) but writing exact gold outputs is impractical. For structured tasks like retrieval routing or format correction, SFT is simpler and more direct. For open-ended generation tasks like code explanations, DPO can capture nuanced quality distinctions that are hard to specify as exact outputs.

Hosted fine-tuning

Pick the provider that matches the path you've been using in the guide:

OpenAI is the most direct hosted fine-tuning path in this curriculum. The workflow is the same as the local SFT path above, except you upload curated failure-correction data and let OpenAI run the training job for you.

# Example: OpenAI fine-tuning API
from openai import OpenAI

client = OpenAI()

# Upload training data
with open("optimization/training_data/failure_corrections.jsonl", "rb") as f:
    training_file = client.files.create(file=f, purpose="fine-tune")

# Start fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={"n_epochs": 3},
)

print(f"Fine-tuning job: {job.id}")
print(f"Status: {job.status}")
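
Hosted jobs run asynchronously, so you poll until a terminal state. A provider-agnostic sketch — for OpenAI, `fetch_status` would be `lambda: client.fine_tuning.jobs.retrieve(job.id).status`:

```python
import time


def wait_for_job(fetch_status, poll_seconds: float = 30.0, max_polls: int = 480) -> str:
    """Poll a zero-arg status callable until the job reaches a terminal state."""
    terminal = {"succeeded", "failed", "cancelled"}
    for _ in range(max_polls):
        status = fetch_status()
        if status in terminal:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("fine-tuning job did not finish within the polling budget")


# Smoke test with a fake status sequence instead of a live API call.
statuses = iter(["queued", "running", "running", "succeeded"])
assert wait_for_job(lambda: next(statuses), poll_seconds=0) == "succeeded"
```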

The training data format for OpenAI's API uses messages instead of prompt/completion:

{"messages": [{"role": "system", "content": "You are a code assistant."}, {"role": "user", "content": "What does validate_path do?"}, {"role": "assistant", "content": "The corrected answer here..."}]}
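
If you built prompt/completion pairs for local SFT, converting them is mechanical. One caveat: extract_training_prompt above embeds the system prompt inside the prompt text, so split it back out rather than duplicating it. A sketch assuming that split has been done:

```python
def to_openai_messages(
    example: dict, system: str = "You are a code assistant."
) -> dict:
    """Convert a local prompt/completion pair to OpenAI's chat format.

    `example["prompt"]` here should be the user-visible part only
    (evidence + question), with the system prompt passed separately.
    """
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": example["prompt"]},
            {"role": "assistant", "content": example["completion"]},
        ]
    }


converted = to_openai_messages(
    {"prompt": "What does validate_path do?", "completion": "It checks containment."}
)
assert [m["role"] for m in converted["messages"]] == ["system", "user", "assistant"]
```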

Together AI and Google Cloud Vertex AI offer hosted fine-tuning options as well. See the Model Selection and Serving reference if you want a broader provider comparison after you've finished the core lesson.

The fine-tuning postmortem

After every fine-tuning run, write a brief postmortem answering these questions:

  1. What failure cluster did you target? Name it and quantify it (e.g., "wrong_format failures, 12 out of 46 benchmark questions").
  2. How many training examples did you use? And how many did you discard during curation?
  3. Did the cluster improve? By how much?
  4. Did anything regress? If yes, what pattern do the regressions share?
  5. Was fine-tuning the right rung? In hindsight, could a cheaper intervention have solved this? Sometimes running the experiment is what proves the problem was actually at a different layer.

In production engineering, postmortems turn incidents into institutional knowledge that prevents recurrence. Fine-tuning postmortems do the same thing, capturing what worked and what didn't so you can decide whether to keep the fine-tuned model or try a different approach. A fine-tune that improves the cluster by 30% with no regressions is a clear win. A fine-tune that improves the cluster by 5% with 3 regressions is a signal to try a different approach.

Exercises

  1. Identify a failure cluster. Run the cluster identification script on your most recent graded run log. Do you have a cluster with 5+ failures that share a consistent pattern?

  2. Curate training data. For your largest cluster, export the failures and write corrected outputs. How long does correction take per example? This time cost is part of the optimization tax.

  3. Train and evaluate. Run SFT on your corrected data. Did the cluster improve? Did anything regress?

  4. Write a postmortem. Answer the five postmortem questions. Was fine-tuning the right rung for this problem?

  5. Compare SFT vs. DPO (stretch). For the same failure cluster, create DPO preference pairs (use the original wrong answer as rejected, your correction as chosen). Train with DPO. Does it produce different results than SFT?

Completion checkpoint

You're done with this lesson when you can:

  • Explain when fine-tuning is justified and when it's premature
  • Curate training data from failure clusters in your run logs
  • Run SFT with LoRA/QLoRA on corrected examples
  • Evaluate for both improvement on the targeted cluster and regression on the broader benchmark
  • Articulate the difference between SFT and DPO and when each is appropriate
  • Write a fine-tuning postmortem that justifies keeping or discarding the result

What's next

Keep Decision Rules nearby as you keep building, and use Portfolio Milestones when you're ready to turn the work into something public.

You now have the full path behind you. Keep measuring before optimizing, and keep choosing the cheapest intervention that actually addresses the problem.


Glossary
API (Application Programming Interface)Foundational terms
A structured way for programs to communicate. In this context, usually an HTTP endpoint you call to interact with an LLM.
AST (Abstract Syntax Tree)Foundational terms
A tree representation of source code structure. Used by parsers like Tree-sitter to understand code as a hierarchy of functions, classes, and statements. You'll encounter this more deeply in the Code Retrieval module, but the concept appears briefly in retrieval fundamentals.
BM25 (Best Match 25)Foundational terms
A classical ranking function for keyword search. Scores documents by term frequency and inverse document frequency. Often competitive with or complementary to vector search.
ChunkingFoundational terms
Splitting a document into smaller pieces for indexing and retrieval. Chunk boundaries significantly affect retrieval quality. Split at the wrong place and your retrieval will return half a function or the end of one paragraph glued to the start of another.
Context engineeringFoundational terms
The discipline of selecting, packaging, and budgeting the information a model sees at inference time. Prompts, retrieved evidence, tool results, memory, and state are all parts of context. Context engineering is arguably the core skill of AI engineering. Bigger context windows are not a substitute for better context selection.
Context rotFoundational terms
Degradation of output quality caused by stale, noisy, or accumulated context. Symptoms include stale memory facts, conflicting retrieved evidence, bloated prompt history, and accumulated instructions that contradict each other. A form of technical debt in AI systems.
Context windowFoundational terms
The maximum number of tokens an LLM can process in a single request (input + output combined).
EmbeddingFoundational terms
A fixed-length numeric vector representing a piece of text. Used for similarity search: texts with similar meanings have nearby embeddings.
EndpointFoundational terms
A specific URL path that accepts requests and returns responses (e.g., POST /v1/chat/completions).
GGUFFoundational terms
A file format for quantized models used by llama.cpp and Ollama. When you see a model name like qwen2.5:7b-q4_K_M, the suffix indicates the quantization scheme. GGUF supports mixed quantization (different precision for different layers) and is the most common format for local inference.
HallucinationFoundational terms
When a model generates content that sounds confident but isn't supported by the evidence it was given, or fabricates details that don't exist. Not the same as "any wrong answer"; a model that misinterprets ambiguous instructions gave a bad answer but didn't hallucinate. Common causes: weak prompt, missing context, context rot, model limitation, or retrieval failure.
**Inference**
Running a trained model to generate output from input. What happens when you call an API. Most AI engineering work is inference-time work: building systems around models, not training them. Use "inference," not "inferencing."
**JSON (JavaScript Object Notation)**
A lightweight text format for structured data. The lingua franca of API communication.
**Lexical search**
Finding items by matching keywords or terms. Includes BM25, TF-IDF (Term Frequency–Inverse Document Frequency), and simple keyword matching. Returns exact term matches, not semantic similarity.
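A minimal sketch of the lexical idea: exact term overlap, no semantics. BM25 and TF-IDF refine this with term-frequency and document-length weighting; the scorer here is deliberately naive.

```python
def keyword_score(query: str, doc: str) -> int:
    """Count how many query terms appear verbatim in the document.
    This is the core of lexical search; BM25 adds weighting on top."""
    doc_terms = set(doc.lower().split())
    return sum(1 for term in query.lower().split() if term in doc_terms)

docs = ["retry logic for the payments client", "parsing GGUF model files"]
# Rank documents by how many query terms they contain.
ranked = sorted(docs, key=lambda d: keyword_score("payments retry", d), reverse=True)
```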
**LLM (Large Language Model)**
A neural network trained on large text corpora that generates text by predicting the next token. The core technology behind AI engineering; every tool, pattern, and pipeline in this curriculum runs on top of one.
**Metadata**
Structured information about a document or chunk (file path, language, author, date, symbol type). Used for filtering retrieval results.
**Neural network**
A computing system loosely inspired by biological neurons, built from layers of mathematical functions that transform inputs into outputs. LLMs are a specific type of neural network (transformers) trained on text. You don't need to understand neural network internals to do AI engineering, but knowing the term helps when reading external resources.
**Reasoning model**
A model optimized for complex multi-step planning, math, and logic (e.g., o3, o4-mini). Slower and more expensive but better on hard problems. Sometimes called "LRM" (large reasoning model), but "reasoning model" is the more consistent term across provider docs.
**Reranking**
A second-pass scoring step that re-orders retrieved results using a more expensive model. Improves precision after an initial broad retrieval.
**Schema**
A formal description of the shape and types of a data structure. Used to validate inputs and outputs.
**SLM (small language model)**
A compact model (typically 1-7B parameters) that runs on consumer hardware with lower cost, lower latency, and better privacy (e.g., Phi, small Llama variants, Gemma). The right choice when privacy, offline operation, predictable cost, or low latency matters more than peak capability.
**System prompt**
A special message that sets the model's behavior, role, and constraints for a conversation.
**Temperature**
A parameter controlling output randomness. Lower values produce more deterministic output; higher values produce more varied output. Does not affect the model's intelligence.
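What temperature does mechanically, sketched in plain Python: the model's raw scores (logits) are divided by the temperature before being turned into probabilities, so low values sharpen the distribution and high values flatten it. The logit values here are made up for illustration.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert raw scores into probabilities, scaled by temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
cold = softmax_with_temperature(logits, 0.2)  # near-deterministic
hot = softmax_with_temperature(logits, 2.0)   # more varied

# At low temperature, the top token takes almost all the probability mass.
assert cold[0] > hot[0]
```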
**Token**
The basic unit an LLM processes. Not a word. Tokens are sub-word fragments. "unhappiness" might be three tokens: "un", "happi", "ness". Token count determines cost and context window usage.
**Top-k**
The number of results returned from a retrieval query. "Top-5" means the five highest-scoring results.
**Top-p (nucleus sampling)**
An alternative to temperature for controlling output diversity. Selects from the smallest set of tokens whose cumulative probability exceeds p.
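A sketch of the nucleus-selection step with a made-up token distribution:

```python
def nucleus(probs: dict[str, float], p: float) -> list[str]:
    """Return the smallest set of tokens whose cumulative probability
    reaches p, taking tokens from most to least probable."""
    chosen, cumulative = [], 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        chosen.append(token)
        cumulative += prob
        if cumulative >= p:
            break
    return chosen

# Illustrative next-token distribution, not real model output.
probs = {"the": 0.5, "a": 0.3, "an": 0.15, "xylophone": 0.05}

# With p=0.9, the long-tail token "xylophone" is excluded from sampling.
assert nucleus(probs, 0.9) == ["the", "a", "an"]
```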
**Vector search**
Finding items by proximity in embedding space (nearest neighbors). Returns "similar" results, not "exact match" results.
**vLLM (virtual LLM)**
An inference serving engine (not a model) that hosts open-weight models behind an OpenAI-compatible HTTP endpoint. Infrastructure layer, not model layer. Relevant when moving from hosted APIs to self-hosting.
**Weights**
The learned parameters inside a model. Changed during training, fixed during inference.
**Workhorse model**
A general-purpose LLM optimized for speed and broad capability (e.g., GPT-4o-mini, Claude Haiku, Gemini Flash). The default for most tasks. When someone says "LLM" without qualification, they usually mean this.
### Benchmark and Harness terms

**Baseline**
The first measured performance of your system on a benchmark. Everything else is compared against this. Without a baseline, you can't tell whether a change helped.
**Benchmark**
A fixed set of questions or tasks with known-good answers, used to measure system performance over time.
**Run log**
A structured record (typically JSONL) of every system run: what input was given, what output was produced, what tools were called, how long it took, and what it cost. The raw data that evals, telemetry, and cost analysis are built from.
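A minimal run-log writer, one JSON object per line. The record fields here are illustrative examples, not a standard schema:

```python
import json
import pathlib
import tempfile

def append_run(log_path: pathlib.Path, record: dict) -> None:
    """Append one run record as a single JSON line (JSONL)."""
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

log = pathlib.Path(tempfile.mkdtemp()) / "runs.jsonl"
# Hypothetical field names for illustration.
append_run(log, {"input": "What is chunking?", "output": "...", "latency_ms": 412, "cost_usd": 0.0003})
append_run(log, {"input": "Define top-k", "output": "...", "latency_ms": 275, "cost_usd": 0.0002})

# Reading the log back is just parsing one JSON object per line.
records = [json.loads(line) for line in log.read_text().splitlines()]
```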
### Agent and Tool Building terms

**A2A (Agent-to-Agent protocol)**
An open protocol for peer-to-peer agent collaboration. Agents discover each other's capabilities and delegate or negotiate tasks as equals. Different from MCP (which connects agents to tools, not to other agents) and from handoffs (which transfer control within one system).
**Agent**
A system where an LLM decides which tools to call, observes results, and iterates until a task is complete. Agent = model + tools + control loop.
**Control loop**
The code that manages the agent's cycle: send prompt, check for tool calls, execute tools, append results, repeat or finish.
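A stripped-down control loop with a stubbed model, to show the shape of the cycle. The message format and tool-call convention here are simplified stand-ins for a real provider API:

```python
def run_agent(model, tools: dict, prompt: str, max_steps: int = 5) -> str:
    """Minimal control loop: ask the model, execute any requested tool,
    append the result, repeat until the model returns a final answer."""
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_steps):
        reply = model(messages)
        if "tool" not in reply:
            return reply["content"]                    # final answer: done
        result = tools[reply["tool"]](reply["args"])   # execute the tool call
        messages.append({"role": "tool", "content": str(result)})
    return "step limit reached"

# Stubbed "model": requests one tool call, then answers from its result.
def fake_model(messages):
    if messages[-1]["role"] == "tool":
        return {"content": f"The answer is {messages[-1]['content']}"}
    return {"tool": "add", "args": (2, 3)}

answer = run_agent(fake_model, {"add": lambda args: args[0] + args[1]}, "What is 2+3?")
```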
**Handoff**
Passing control from one agent or specialist to another within an orchestrated system.
**MCP (Model Context Protocol)**
An open protocol for exposing tools, resources, and prompts to AI applications in a standardized way. Connects agents to capabilities (tools and data), not to other agents.
**Tool calling / function calling**
The model's ability to request execution of a specific function with structured arguments, rather than just generating text.
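What a tool definition and a structured tool call look like. The field names follow the JSON-schema style used by OpenAI-compatible APIs; other providers use similar but not identical shapes, and `get_weather` is a hypothetical example:

```python
import json

# A tool definition the application registers with the model.
get_weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# The model replies with structured arguments, not prose:
model_tool_call = '{"name": "get_weather", "arguments": {"city": "Oslo"}}'
call = json.loads(model_tool_call)
```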
### Code Retrieval terms

**Context compilation / context packing**
The process of selecting and assembling the smallest useful set of evidence for a specific task. Not "dump everything retrieved into the prompt."
**Grounding**
Tying model assertions to specific evidence. A grounded answer cites what it found; an ungrounded answer asserts without evidence.
**Hybrid retrieval**
Combining multiple retrieval methods (e.g., vector search + keyword search + metadata filters) and merging or reranking the results.
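One common way to merge ranked lists from different retrievers is reciprocal rank fusion (RRF): each document scores 1/(k + rank) in every list it appears in, so items ranked well by multiple methods rise to the top. A sketch with made-up file names:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each item scores 1/(k + rank) per list it appears in."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["auth.py", "session.py", "tokens.py"]   # semantic retriever
keyword_hits = ["session.py", "config.py"]             # lexical retriever

# session.py appears in both lists, so it wins the merged ranking.
merged = reciprocal_rank_fusion([vector_hits, keyword_hits])
```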
**Knowledge graph**
A data structure that stores entities and their relationships explicitly (e.g., "function A calls function B," "module X imports module Y"). Useful for traversal and dependency reasoning. One retrieval strategy among several, often overused when simpler metadata or adjacency tables would suffice.
**RAG (Retrieval-Augmented Generation)**
A pattern where the model's response is grounded in retrieved external evidence rather than relying solely on its training data.
**Symbol table**
A mapping of code identifiers (functions, classes, variables) to their locations and metadata.
**Tree-sitter**
An incremental parsing library that builds ASTs for source code. Used in this curriculum for code-aware chunking and symbol extraction.
### RAG and Grounded Answers terms

**Context pack**
A structured bundle of evidence assembled for a specific task, with metadata about provenance, relevance, and token budget.
**Evidence bundle**
A collection of retrieved items grouped for a specific sub-task, with enough metadata to evaluate whether the evidence is relevant and sufficient.
**Retrieval routing**
Deciding which retrieval strategy or method to use for a given query. Different questions need different retrieval methods.
### Observability and Evals terms

**Eval**
A structured test that measures system quality. Not the same as training. Evals measure, they don't change the model.
**Harness (AI harness / eval harness)**
The experiment and evaluation framework around your model or agent. It runs benchmark tasks, captures outputs, logs traces, grades results, and compares system versions. It turns ad hoc "try it and see" into repeatable, comparable experiments. Typically includes: input dataset, prompt and tool configuration, model/provider selection, execution loop, logging, grading, and artifact capture.
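The core of a harness is a loop that runs tasks, grades outputs, and keeps records. A minimal sketch with an exact-match grader and a toy system standing in for your model or agent:

```python
def run_harness(system, benchmark: list[dict]) -> dict:
    """Run every benchmark task through the system, grade with exact match,
    and keep per-task records for later inspection."""
    records = []
    for task in benchmark:
        output = system(task["input"])
        records.append({
            "input": task["input"],
            "output": output,
            "expected": task["expected"],
            "pass": output == task["expected"],
        })
    score = sum(r["pass"] for r in records) / len(records)
    return {"score": score, "records": records}

benchmark = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]
# Toy "system" that gets one task right and one wrong.
report = run_harness(lambda q: "4" if q == "2+2" else "Lyon", benchmark)
```

The records, not just the score, are the point: they are what you inspect to find failure clusters.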
**LLM-as-judge**
Using a language model to evaluate or grade the output of another model or system. Useful for scaling evaluation beyond manual review, but requires rubric quality, judge consistency checks, and human spot-checking. Not a replacement for exact-match checks where they apply.
**OpenTelemetry (OTel)**
An open standard for collecting and exporting telemetry data (traces, metrics, logs). Vendor-agnostic.
**RAGAS**
A specific eval framework for retrieval-augmented generation. Measures metrics like faithfulness, relevance, and context precision. One tool example, not a foundational concept. Learn the metrics first, then the tool.
**Span**
A single operation within a trace (e.g., one tool call, one retrieval query). Traces are made of spans.
**Telemetry**
Structured data about system behavior: what happened, when, how long it took, what it cost. Includes traces, metrics, and events.
**Trace**
A structured record of one complete run through the system, including all steps, tool calls, and decisions.
### Orchestration and Memory terms

**Long-term memory**
Persistent facts that survive across conversations. Requires write policies to manage what gets stored, updated, or deleted.
**Orchestration**
Explicit control over how tasks are routed, delegated, and synthesized across multiple agents or specialists.
**Router**
A component that decides which specialist or workflow path to use for a given query.
**Specialist**
An agent or workflow tuned for a narrow task (e.g., "code search," "documentation lookup," "test generation"). Specialists are composed by an orchestrator.
**Thread memory**
Conversation state that persists within a single session or thread.
**Workflow memory**
Intermediate state that persists within a multi-step task but doesn't survive beyond the workflow's completion.
### Optimization terms

**Catastrophic forgetting**
When fine-tuning causes a model to lose capabilities it had before training. The model gets better at the fine-tuned task but worse at tasks it previously handled. PEFT methods like LoRA reduce this risk by freezing original weights.
**Distillation**
Training a smaller (student) model to reproduce the behavior of a larger (teacher) model on a specific task.
**DPO (Direct Preference Optimization)**
A method for preference-based model optimization that's simpler than RLHF, training the model directly on preference pairs without a separate reward model.
**Fine-tuning**
Updating a model's weights on task-specific data to change its behavior permanently. An umbrella term that includes SFT, instruction tuning, RLHF, DPO, and other techniques. See the fine-tuning landscape table in Lesson 8.3 for how these relate.
**Full fine-tuning**
Updating all of a model's parameters during training, as opposed to PEFT methods that update only a small subset. Requires significantly more GPU memory and compute. Produces the most thorough adaptation but carries higher risk of catastrophic forgetting.
**Inference server**
Software (like vLLM or Ollama) that hosts a model and serves inference requests.
**Instruction tuning**
A specific application of SFT where the training data consists of instruction-response pairs. This is how base models become chat models: the technique is SFT, the data format is instructions. Not a separate technique from SFT.
**LoRA (Low-Rank Adaptation)**
A parameter-efficient fine-tuning method that trains small adapter matrices instead of updating all model weights. Dramatically reduces GPU memory and compute requirements.
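The arithmetic behind the savings, for one hypothetical 4096 x 4096 projection matrix: LoRA freezes the original weights W and trains only two small matrices B (d x r) and A (r x d) whose product is added back at inference (W' = W + B @ A).

```python
# One weight matrix in a 7B-class model: 4096 x 4096 (illustrative size).
d = 4096
full_params = d * d                # parameters updated by full fine-tuning

r = 8                              # adapter rank; a commonly used small value
lora_params = d * r + r * d        # parameters the LoRA adapter trains

reduction = full_params / lora_params  # trainable-parameter reduction factor
```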
**Parameter count**
The number of learned weights in a model, commonly expressed in billions (e.g., "7B" = 7 billion parameters). Determines memory requirements (roughly 2 bytes per parameter at FP16) and broadly correlates with capability, though training quality and architecture matter as much as size. See Model Selection and Serving for sizing guidance.
**PEFT (Parameter-Efficient Fine-Tuning)**
A family of methods (including LoRA) that fine-tune a small subset of parameters instead of the full model.
**Preference optimization**
Training methods (RLHF, DPO) that use human or automated preference signals to improve model behavior. "This output is better than that output" rather than "this is the correct output."
**QLoRA (Quantized LoRA)**
LoRA applied to a quantized (compressed) base model. Further reduces memory requirements, enabling fine-tuning on consumer hardware.
**Quantization**
Reducing the precision of model weights (e.g., FP16 → INT4) to shrink memory usage and increase inference speed at some quality cost. A 7B model at FP16 needs ~14 GB VRAM; quantized to 4-bit, it fits in ~4 GB. Common formats include GGUF (llama.cpp/Ollama) and GPTQ/AWQ (vLLM/Hugging Face). See Model Selection and Serving for format details and tradeoffs.
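The memory arithmetic behind those numbers (weights only; activations, KV cache, and quantization-format overhead push real usage somewhat higher):

```python
def vram_gb(parameters: float, bits_per_weight: int) -> float:
    """Approximate weight memory in GB: parameters x bits per weight.
    Ignores activations and KV cache, so real usage is higher."""
    return parameters * bits_per_weight / 8 / 1e9

fp16 = vram_gb(7e9, 16)  # 7B model at 16 bits per weight
int4 = vram_gb(7e9, 4)   # same model quantized to 4 bits per weight
```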
**Overfitting**
When a model memorizes training examples instead of learning generalizable patterns. The model performs well on training data but poorly on new inputs. Detected by monitoring validation loss alongside training loss.
**RLHF (Reinforcement Learning from Human Feedback)**
A training method that uses human preference signals to improve model behavior through a reward model. More complex than DPO (requires training a separate reward model) but offers more control over the optimization objective.
**SFT (Supervised Fine-Tuning)**
Fine-tuning using input-output pairs where the desired output is known. The most common fine-tuning approach.
**TRL (Transformer Reinforcement Learning)**
A Hugging Face library for training language models with reinforcement learning, SFT, and other optimization methods.
### Cross-cutting terms

**Consumer chat app**
The browser or desktop product meant for human conversation (ChatGPT, Claude, HuggingChat). Useful for experimentation, but not the same as API access.
**Developer platform**
The provider's API, billing, API-key, and developer-docs surface. This is what you need for this learning path.
**Hosted API**
The provider runs the model for you and you call it over HTTP.
**Local inference**
You run the model on your own machine.
**Provider**
The company or service that hosts a model API you call from code.
**Prompt caching**
Reusing computation from repeated prompt prefixes to reduce latency and cost on subsequent requests with the same prefix.
**Rate limiting**
Constraints on how many API requests you can make per unit of time. An operational concern that affects system design and cost.
**Token budget**
The maximum number of tokens you allocate for a specific part of the context (e.g., "retrieval evidence gets at most 4K tokens"). A context engineering tool for preventing any single component from dominating the context window.
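A sketch of budget enforcement, using a deliberately crude character-based token estimate (real accounting should use the provider's tokenizer):

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate: roughly 4 characters per token for English text."""
    return max(1, len(text) // 4)

def pack_evidence(chunks: list[str], budget_tokens: int) -> list[str]:
    """Add chunks (assumed already sorted by relevance) until the budget
    is spent, so retrieval can't dominate the context window."""
    packed, used = [], 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget_tokens:
            break
        packed.append(chunk)
        used += cost
    return packed

evidence = ["a" * 400, "b" * 400, "c" * 400]  # ~100 estimated tokens each
packed = pack_evidence(evidence, budget_tokens=250)  # third chunk won't fit
```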