Module 8: Optimization Fine-Tuning

Fine-Tuning: Persistent Model Adaptation

Fine-tuning is the last rung on the optimization ladder. You reach for it when a repeated failure cluster survives prompt engineering, retrieval improvement, context restructuring, and (if applicable) distillation. The model has the right evidence, clear instructions, well-structured context, and it still gets the answer wrong in the same way. That pattern needs to be baked into the weights.

This is not where most learners should spend most of their time. The curriculum has deliberately placed fine-tuning last because the overwhelming majority of AI engineering problems resolve at earlier rungs. But when the failure diagnostic from the previous lessons consistently points to model-attributed failures, fine-tuning is the tool that addresses them. We'll cover supervised fine-tuning (SFT) with LoRA/QLoRA, data preparation from your run logs, evaluation against the base model, and preference optimization (DPO) as an advanced technique.

What you'll learn

  • Decide when fine-tuning is the right choice and when it's premature
  • Prepare training data from your run logs and eval results
  • Run SFT with LoRA/QLoRA on a local model using PEFT and TRL
  • Evaluate the fine-tuned model against the base model on your benchmark
  • Understand preference optimization (DPO) and when it's appropriate
  • Use hosted fine-tuning APIs when local training isn't practical

Concepts

Fine-tuning vs. distillation — Distillation compresses a working behavior into a cheaper model. Fine-tuning fixes a broken behavior by training on corrected examples. The data sources are different: distillation uses teacher-generated outputs, fine-tuning uses human-curated or grader-verified correct outputs. The goals are different: distillation preserves quality at lower cost, fine-tuning improves quality on specific failure patterns.

SFT (Supervised Fine-Tuning) — the same technique from the distillation lesson, but applied to a different problem. In distillation, SFT inputs are teacher prompts and teacher outputs. In fine-tuning, SFT inputs are the prompts that caused failures and the corrected outputs you want the model to produce instead. The training mechanics are identical — the difference is where the training data comes from.
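
The shared mechanics are easy to see in the training records themselves. A minimal sketch (the field names and the `parse_config` question are illustrative, not a required schema):

```python
# Distillation and fine-tuning feed the same record shape to an SFT
# trainer; only where each field comes from differs.

distillation_example = {
    "prompt": "What does parse_config do?",               # prompt sent to the teacher
    "completion": "parse_config loads settings from...",  # teacher model's output
}

fine_tuning_example = {
    "prompt": "What does parse_config do?",               # prompt from a failing run log
    "completion": "parse_config loads settings from...",  # human-corrected output
}

# Structurally identical: the trainer cannot tell them apart.
assert distillation_example.keys() == fine_tuning_example.keys()
```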

Failure cluster — a group of benchmark questions or production queries that fail in the same way. Examples from our anchor project:

  • The model consistently puts the file path in the wrong field of the structured output
  • The model ignores evidence from test files when answering "what does this function do?"
  • The model produces vague summaries when the evidence contains multiple conflicting code patterns

A failure cluster is a fine-tuning candidate when: (1) the failures repeat across runs, (2) the pattern is consistent enough to label, and (3) you can produce correct examples for each failure case.
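
The first two criteria can be checked mechanically; the third (can you write correct outputs?) is a human judgment call. A sketch, assuming each log entry carries `run_id` and `failure_label` fields — adapt to your log schema:

```python
def is_fine_tuning_candidate(cluster: list[dict], min_size: int = 5) -> bool:
    """Check the mechanical candidacy criteria for a failure cluster.

    Criterion 1: failures repeat across runs (at least two distinct runs).
    Criterion 2: the pattern is consistent enough to share one label.
    Criterion 3 (can you produce correct examples?) is not checked here.
    """
    if len(cluster) < min_size:
        return False
    if len({e.get("run_id") for e in cluster}) < 2:
        return False
    return len({e.get("failure_label") for e in cluster}) == 1


repeated = [
    {"run_id": f"run-{i % 2}", "failure_label": "wrong_format"} for i in range(6)
]
one_off = [{"run_id": "run-0", "failure_label": "wrong_format"} for _ in range(6)]

assert is_fine_tuning_candidate(repeated)
assert not is_fine_tuning_candidate(one_off)  # never repeated across runs
```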

Training data curation — the process of turning failure clusters into training examples. This is the hardest part of fine-tuning. You need:

  • Inputs: the exact prompts that triggered the failure (from your run logs)
  • Correct outputs: what the model should have produced (written by you or verified by a grader)
  • Negative examples filtered out: prompts where the base model already succeeds (training on these wastes capacity and can cause regression)
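
The last filter is worth automating, since it is easy to forget. A sketch using the `grade` and `question_id` fields from the run log:

```python
def drop_already_passing(
    candidates: list[dict], run_log: list[dict]
) -> list[dict]:
    """Remove prompts the base model already handles correctly.

    Training on examples the model gets right wastes adapter capacity
    and risks regressing behavior that was already fine.
    """
    passing = {
        e["question_id"]
        for e in run_log
        if e.get("grade") in ("correct", "acceptable")
    }
    return [c for c in candidates if c["question_id"] not in passing]


log = [
    {"question_id": "q1", "grade": "correct"},
    {"question_id": "q2", "grade": "wrong"},
]
kept = drop_already_passing([{"question_id": "q1"}, {"question_id": "q2"}], log)
assert [c["question_id"] for c in kept] == ["q2"]
```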

Regression — when fine-tuning on one task degrades performance on other tasks. A model fine-tuned to always cite test files might start over-citing irrelevant test files on questions that don't need them. Regression is the primary risk of fine-tuning and the reason you need a broad eval suite, not just a targeted one.

DPO (Direct Preference Optimization) — a training technique that learns from pairs of outputs: one preferred, one rejected. Instead of showing the model "produce this output," DPO shows the model "output A is better than output B." This is useful when you can rank outputs but can't easily write the perfect gold output. DPO requires a reliable way to judge which output is better — either human annotators or an LLM-as-judge with a strong rubric.

Catastrophic forgetting — when fine-tuning causes a model to lose capabilities it had before training. The model gets better at the fine-tuned task but worse at other tasks it previously handled well. This is different from regression (which is a measurable performance drop on your benchmark) — catastrophic forgetting can affect general capabilities you weren't testing for. PEFT methods like LoRA reduce catastrophic forgetting because they freeze the original weights and only train small adapter layers, leaving most of the model's knowledge intact.

Overfitting — when the model memorizes the training examples instead of learning the underlying pattern. An overfitted model performs well on training data but poorly on new inputs. Signs: training loss drops to near-zero while validation loss stops improving or increases. Mitigation: use a validation split, stop training when validation loss plateaus (early stopping), and keep training data diverse.
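
A deterministic validation split is the cheapest guardrail. A minimal sketch — the 10% holdout fraction is a starting point, not a rule:

```python
import random


def train_val_split(
    examples: list[dict], val_fraction: float = 0.1, seed: int = 42
) -> tuple[list[dict], list[dict]]:
    """Shuffle once with a fixed seed, then hold out a validation slice.

    Watch validation loss during training: if it plateaus or rises
    while training loss keeps falling, you are overfitting — stop.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


data = [{"prompt": str(i), "completion": str(i)} for i in range(50)]
train, val = train_val_split(data)
assert len(train) == 45 and len(val) == 5
assert not {d["prompt"] for d in val} & {d["prompt"] for d in train}
```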

The fine-tuning landscape

"Fine-tuning" is an umbrella term that covers several distinct techniques. If you search for it, you'll find a confusing mix of approaches. Here's how they relate to each other:

| Technique | What it does | Data requirement | What we teach |
|---|---|---|---|
| Full fine-tuning | Updates all model parameters | Large dataset, large GPU | Mentioned for context; too expensive for most curriculum work |
| SFT (Supervised Fine-Tuning) | Trains on input-output pairs | Hundreds to thousands of labeled pairs | Yes — our primary technique (with LoRA/QLoRA) |
| Instruction tuning | SFT specifically on instruction-response pairs | Instruction-formatted dataset | Conceptually — SFT on instructions is how chat models are made |
| PEFT / LoRA / QLoRA | Updates a small fraction of parameters | Same as SFT, less compute | Yes — our default training method |
| DPO | Learns from preferred vs. rejected output pairs | Preference pairs | Yes — covered as an advanced option |
| RLHF | Trains a reward model, then optimizes against it | Human preference data + reward model | Mentioned for context; requires more infrastructure than DPO |
| ORPO / KTO / SimPO | Newer preference methods with simpler training | Similar to DPO | Not covered — the landscape is evolving fast; DPO teaches the core concept |

The techniques in this lesson — SFT with LoRA/QLoRA and DPO — cover the approaches that are practical for individual engineers and small teams. Full fine-tuning requires multi-GPU setups that are outside the scope of the curriculum's hardware assumptions. RLHF requires training a separate reward model, adding complexity that isn't justified until you're operating at scale. The newer preference methods (ORPO, KTO, SimPO) are worth watching but still stabilizing — DPO teaches the core concept of preference-based optimization, and the mechanics transfer.

What about "domain-specific fine-tuning" and "multi-task learning"? These are applications of the techniques above, not separate techniques. Domain-specific fine-tuning is SFT applied to domain data. Multi-task learning is SFT on multiple tasks simultaneously. The technique is the same — what changes is the data curation strategy.

Problem-to-Tool Map

| Problem class | Symptom | Cheapest thing to try first | Tool or approach |
|---|---|---|---|
| Repeated format failures | Model puts data in wrong fields despite schema constraints | Stricter output schema + few-shot examples | SFT on corrected format examples |
| Consistent evidence misuse | Model ignores relevant evidence from specific file types | Better context structure, evidence ordering | SFT on examples with correct evidence usage |
| Weak instruction following | Model doesn't follow constraints even with explicit rules | More explicit prompt, decompose instructions | SFT on instruction-correct pairs |
| Can grade but can't specify gold outputs | You know what's better but can't write the perfect answer | Human-curated examples | DPO with preferred/rejected pairs |
| Fine-tuned model regresses on other tasks | New failures appear on previously passing questions | Broader training data | Include passing examples in training set, reduce learning rate |

Walkthrough

Step 1: Identify a failure cluster

Use the failure diagnostic from the optimization ladder lesson to find model-attributed failures. Then group them by pattern:

# optimization/identify_clusters.py
"""Group model-attributed failures into clusters.

Reads a graded run log, filters to model-attributed failures,
and groups them by failure pattern for fine-tuning data curation.
"""

import json
from collections import defaultdict


def load_model_failures(run_log_path: str) -> list[dict]:
    """Load entries where the model failed despite good retrieval."""
    failures = []
    with open(run_log_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line)

            # Model failure: retrieval hit but answer is wrong
            if (
                entry.get("retrieval_hit") is True
                and entry.get("grade") not in ("correct", "acceptable")
            ):
                failures.append(entry)

    return failures


def cluster_by_label(failures: list[dict]) -> dict[str, list[dict]]:
    """Group failures by their failure label."""
    clusters = defaultdict(list)
    for f in failures:
        label = f.get("failure_label", "unlabeled")
        clusters[label].append(f)
    return dict(clusters)


def print_clusters(clusters: dict[str, list[dict]]) -> None:
    """Print cluster summary for fine-tuning candidate selection."""
    print("\nModel-Attributed Failure Clusters")
    print("=" * 50)

    for label, entries in sorted(
        clusters.items(), key=lambda x: -len(x[1])
    ):
        print(f"\n{label}: {len(entries)} failures")
        for entry in entries[:3]:
            q = entry.get("question", "")[:70]
            print(f"  - {q}...")

    total = sum(len(v) for v in clusters.values())
    print(f"\nTotal model-attributed failures: {total}")
    print(f"Clusters: {len(clusters)}")
    print(
        "\nFine-tuning candidates: clusters with 5+ failures "
        "that share a consistent pattern."
    )
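
Before pointing the script at a real log, the filtering logic can be smoke-tested on a few synthetic lines (the fields mirror the run-log schema used above):

```python
import json
from collections import defaultdict

lines = [
    '{"question_id": "q1", "retrieval_hit": true, "grade": "wrong", "failure_label": "wrong_format"}',
    '{"question_id": "q2", "retrieval_hit": true, "grade": "correct"}',
    '{"question_id": "q3", "retrieval_hit": false, "grade": "wrong"}',
    '{"question_id": "q4", "retrieval_hit": true, "grade": "wrong", "failure_label": "wrong_format"}',
]

# Same predicate as load_model_failures: retrieval hit, answer still wrong.
failures = [
    e
    for e in map(json.loads, lines)
    if e.get("retrieval_hit") is True
    and e.get("grade") not in ("correct", "acceptable")
]
clusters = defaultdict(list)
for e in failures:
    clusters[e.get("failure_label", "unlabeled")].append(e)

assert [e["question_id"] for e in failures] == ["q1", "q4"]
assert len(clusters["wrong_format"]) == 2  # q3 is a retrieval failure, not a model failure
```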

Step 2: Curate training data

For each failure in the cluster, you need the original prompt and a corrected output. The prompt comes from your run logs. The corrected output requires human judgment:

# optimization/curate_training_data.py
"""Build fine-tuning training data from failure clusters.

For each failure, extracts the original prompt from the run log
and pairs it with a human-corrected output.
"""

import json
from pathlib import Path


def extract_training_prompt(entry: dict) -> str:
    """Reconstruct the prompt that was sent to the model.

    Uses the question, evidence, and system prompt from the run log.
    Adapt this to match your pipeline's actual prompt structure.
    """
    system = entry.get("system_prompt", "You are a code assistant.")
    evidence = entry.get("evidence_text", "")
    question = entry.get("question", "")

    # Reconstruct the prompt the model actually saw
    prompt = f"{system}\n\nEvidence:\n{evidence}\n\nQuestion: {question}"
    return prompt


def create_training_example(
    entry: dict,
    corrected_output: str,
) -> dict:
    """Create a single training example from a failure + correction."""
    return {
        "prompt": extract_training_prompt(entry),
        "completion": corrected_output,
        "question_id": entry.get("question_id", "unknown"),
        "original_answer": entry.get("answer", ""),
        "failure_label": entry.get("failure_label", ""),
    }


def save_for_review(
    failures: list[dict],
    output_path: str,
) -> None:
    """Save failures in a format for human correction.

    Creates a JSONL file where each entry has the prompt and
    original (incorrect) answer. A human reviewer adds the
    corrected_output field.
    """
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)

    with open(path, "w") as f:
        for entry in failures:
            review_item = {
                "question_id": entry.get("question_id", "unknown"),
                "question": entry.get("question", ""),
                "evidence_files": entry.get("evidence_files", []),
                "original_answer": entry.get("answer", ""),
                "failure_label": entry.get("failure_label", ""),
                "corrected_output": "",  # Human fills this in
            }
            f.write(json.dumps(review_item) + "\n")

    print(f"Saved {len(failures)} items for human review at {output_path}")
    print("Edit corrected_output for each entry, then run build_dataset.py")


def build_dataset(
    reviewed_path: str,
    run_log_path: str,
    output_path: str,
) -> None:
    """Build the final training dataset from reviewed corrections.

    Merges human-corrected outputs with the original run log entries
    to create prompt-completion pairs for SFT.
    """
    # Load reviewed corrections
    corrections = {}
    with open(reviewed_path) as f:
        for line in f:
            item = json.loads(line.strip())
            if item.get("corrected_output"):
                corrections[item["question_id"]] = item["corrected_output"]

    # Load original run log for full prompt reconstruction
    entries = {}
    with open(run_log_path) as f:
        for line in f:
            entry = json.loads(line.strip())
            entries[entry.get("question_id", "")] = entry

    # Build training examples
    training_data = []
    for qid, corrected in corrections.items():
        if qid in entries:
            example = create_training_example(entries[qid], corrected)
            training_data.append(example)

    # Save
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w") as f:
        for ex in training_data:
            f.write(json.dumps({
                "prompt": ex["prompt"],
                "completion": ex["completion"],
            }) + "\n")

    print(f"Built {len(training_data)} training examples at {output_path}")
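
The workflow below calls the script with export and build subcommands; the argument wiring isn't shown above, so here is one minimal sketch (default paths are assumptions — match them to your repo layout, and dispatch to save_for_review / build_dataset in `__main__`):

```python
import argparse


def make_parser() -> argparse.ArgumentParser:
    """CLI wiring for the export/build curation workflow."""
    parser = argparse.ArgumentParser(description="Curate fine-tuning data")
    sub = parser.add_subparsers(dest="command", required=True)

    exp = sub.add_parser("export", help="save failures for human review")
    exp.add_argument("--run-log", default="runs/latest_graded.jsonl")
    exp.add_argument("--out", default="optimization/review/to_review.jsonl")

    build = sub.add_parser("build", help="build the training dataset")
    build.add_argument("--reviewed", default="optimization/review/to_review.jsonl")
    build.add_argument("--run-log", default="runs/latest_graded.jsonl")
    build.add_argument("--out", default="optimization/training_data/failure_corrections.jsonl")

    return parser


args = make_parser().parse_args(["export", "--run-log", "runs/r1.jsonl"])
assert args.command == "export" and args.run_log == "runs/r1.jsonl"
```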

The workflow:

# 1. Export failures for human review
python optimization/curate_training_data.py export

# 2. Human reviews and adds corrected_output to each entry
# (edit the JSONL file in your editor)

# 3. Build the training dataset from reviewed corrections
python optimization/curate_training_data.py build

Quality over quantity. 100 carefully corrected examples will outperform 1,000 sloppy ones. For each failure, make sure the corrected output is genuinely correct — uses the right evidence, follows the output schema, and answers the question accurately. If you're not sure what the correct output should be, skip that example.
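
A few mechanical checks catch the worst curation mistakes before they reach training. A sketch over the review-file fields produced by save_for_review (the length threshold is an arbitrary starting point):

```python
def check_correction(item: dict) -> list[str]:
    """Return a list of problems with one reviewed entry (empty = OK)."""
    problems = []
    corrected = item.get("corrected_output", "").strip()
    if not corrected:
        problems.append("empty correction")
    elif corrected == item.get("original_answer", "").strip():
        problems.append("correction identical to the failing answer")
    elif len(corrected) < 20:
        problems.append("suspiciously short correction")
    return problems


assert check_correction(
    {"corrected_output": "", "original_answer": "x"}
) == ["empty correction"]
assert check_correction(
    {"corrected_output": "same wrong text", "original_answer": "same wrong text"}
) == ["correction identical to the failing answer"]
assert check_correction(
    {"corrected_output": "validate_path resolves the path and checks containment.",
     "original_answer": "It validates paths."}
) == []
```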

Step 3: Train with SFT

The training code is nearly identical to the distillation lesson. The difference is the training data source (human-corrected failures vs. teacher outputs) and potentially the training configuration (you may want a lower learning rate and fewer steps for fine-tuning, since you're making targeted corrections rather than teaching a broad behavior):

# optimization/fine_tune.py
"""Fine-tune a model on curated failure corrections.

Uses the same PEFT + TRL + Unsloth stack as distillation,
but with corrected training data from failure clusters.
"""

from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset


def format_for_training(example: dict) -> dict:
    """Format a prompt-completion pair for SFT training."""
    text = (
        f"### Instruction:\n{example['prompt']}\n\n"
        f"### Response:\n{example['completion']}"
    )
    return {"text": text}


def fine_tune(
    model_name: str = "unsloth/Qwen2.5-7B-bnb-4bit",
    training_data_path: str = "optimization/training_data/failure_corrections.jsonl",
    output_dir: str = "optimization/models/fine-tuned",
    max_steps: int = 100,
    learning_rate: float = 1e-4,
    lora_rank: int = 32,
):
    """Run fine-tuning on corrected failure examples.

    Uses a lower learning rate and fewer steps than distillation
    because we're making targeted corrections, not teaching a
    broad task from scratch.
    """
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_name,
        max_seq_length=4096,  # Longer context for full prompts
        load_in_4bit=True,
    )

    model = FastLanguageModel.get_peft_model(
        model,
        r=lora_rank,
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
        lora_alpha=lora_rank,
        lora_dropout=0,
        use_gradient_checkpointing="unsloth",
    )

    dataset = load_dataset(
        "json", data_files=training_data_path, split="train"
    )
    dataset = dataset.map(format_for_training)

    training_config = SFTConfig(
        output_dir=output_dir,
        max_steps=max_steps,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=learning_rate,
        logging_steps=10,
        save_steps=25,
        warmup_steps=10,
        fp16=True,
        dataset_text_field="text",
        max_seq_length=4096,
    )

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        args=training_config,
    )

    print(f"Fine-tuning on {len(dataset)} corrected examples...")
    trainer.train()

    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
    print(f"Saved fine-tuned adapter to {output_dir}")


if __name__ == "__main__":
    fine_tune()

Run it once your corrected dataset is in place:

python optimization/fine_tune.py

Step 4: Evaluate for improvement AND regression

This is the critical step that separates fine-tuning from wishful thinking. You need to check two things:

  1. Did the failure cluster improve? Run the fine-tuned model on the specific questions that triggered the failures.
  2. Did anything else break? Run the fine-tuned model on the entire benchmark, including questions the base model already passes.

# optimization/eval_fine_tuned.py
"""Evaluate fine-tuned model for both improvement and regression.

Compares the fine-tuned model against the base model on:
1. The targeted failure cluster (did it improve?)
2. The full benchmark (did anything regress?)
"""

import json

from unsloth import FastLanguageModel


def evaluate_both(
    base_model_name: str,
    adapter_path: str,
    benchmark_path: str,
    failure_ids: set[str],
) -> dict:
    """Run base and fine-tuned models on the full benchmark.

    Returns comparison metrics for the failure cluster and
    the rest of the benchmark separately.
    """
    # Load base model
    base_model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=base_model_name,
        max_seq_length=4096,
        load_in_4bit=True,
    )
    FastLanguageModel.for_inference(base_model)

    # Load fine-tuned model (base + adapter)
    ft_model, _ = FastLanguageModel.from_pretrained(
        model_name=base_model_name,
        max_seq_length=4096,
        load_in_4bit=True,
    )
    ft_model.load_adapter(adapter_path)
    FastLanguageModel.for_inference(ft_model)

    # Load benchmark
    entries = []
    with open(benchmark_path) as f:
        for line in f:
            if line.strip():
                entries.append(json.loads(line.strip()))

    results = {
        "cluster_base_correct": 0,
        "cluster_ft_correct": 0,
        "cluster_total": 0,
        "other_base_correct": 0,
        "other_ft_correct": 0,
        "other_total": 0,
        "regressions": [],  # Questions base got right but FT got wrong
    }

    for entry in entries:
        qid = entry.get("question_id", "")
        expected = entry.get("expected_answer", "")
        is_cluster = qid in failure_ids

        # Run both models (simplified — adapt to your pipeline)
        base_answer = run_model(base_model, tokenizer, entry)
        ft_answer = run_model(ft_model, tokenizer, entry)

        base_correct = grade(base_answer, expected)
        ft_correct = grade(ft_answer, expected)

        if is_cluster:
            results["cluster_total"] += 1
            results["cluster_base_correct"] += int(base_correct)
            results["cluster_ft_correct"] += int(ft_correct)
        else:
            results["other_total"] += 1
            results["other_base_correct"] += int(base_correct)
            results["other_ft_correct"] += int(ft_correct)

        # Track regressions: base passed, fine-tuned failed
        if base_correct and not ft_correct:
            results["regressions"].append(qid)

    return results


def run_model(model, tokenizer, entry: dict) -> str:
    """Generate an answer from a model. Adapt to your pipeline."""
    # Placeholder — replace with your actual generation logic
    prompt = entry.get("prompt", entry.get("question", ""))
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)  # greedy decoding
    return tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:],
        skip_special_tokens=True,
    )


def grade(answer: str, expected: str) -> bool:
    """Simple grading. Replace with your actual grading logic."""
    return expected.lower() in answer.lower()


def print_comparison(results: dict) -> None:
    """Print the improvement/regression comparison."""
    print(f"\n{'='*60}")
    print("Fine-Tuning Evaluation: Improvement vs Regression")
    print(f"{'='*60}")

    ct = results["cluster_total"]
    if ct > 0:
        base_pct = results["cluster_base_correct"] / ct
        ft_pct = results["cluster_ft_correct"] / ct
        print(f"\nTargeted failure cluster ({ct} questions):")
        print(f"  Base model:       {base_pct:.1%}")
        print(f"  Fine-tuned model: {ft_pct:.1%}")
        print(f"  Improvement:      {ft_pct - base_pct:+.1%}")

    ot = results["other_total"]
    if ot > 0:
        base_pct = results["other_base_correct"] / ot
        ft_pct = results["other_ft_correct"] / ot
        print(f"\nRest of benchmark ({ot} questions):")
        print(f"  Base model:       {base_pct:.1%}")
        print(f"  Fine-tuned model: {ft_pct:.1%}")
        print(f"  Regression:       {base_pct - ft_pct:+.1%}")

    reg = results["regressions"]
    if reg:
        print(f"\nRegressions ({len(reg)} questions):")
        for qid in reg[:10]:
            print(f"  - {qid}")
    else:
        print("\nNo regressions detected.")

Reading the results

The evaluation tells you one of four things:

| Cluster improved? | Regressions? | Verdict |
|---|---|---|
| Yes | None | Ship it. The fine-tune solved the problem without side effects. |
| Yes | Some | Investigate regressions. Consider including passing examples in training data to anchor existing behavior. |
| No | None | Training data quality issue. Review corrections and add more examples. |
| No | Some | Stop. The fine-tune made things worse. Re-examine whether this is a model problem at all. |

Preference optimization with DPO

SFT requires gold outputs — the exact text the model should produce. For some tasks, you can't write the perfect answer, but you can reliably judge which of two answers is better. DPO (Direct Preference Optimization) handles this case.

# optimization/dpo_example.py
"""Example DPO training data format and training setup.

DPO learns from preference pairs: for each prompt, a preferred
response and a rejected response.
"""

# DPO training data format
dpo_example = {
    "prompt": "Given the evidence above, explain what the validate_path function does.",
    "chosen": (
        "The validate_path function in utils/security.py checks whether "
        "a given file path is contained within the allowed project directory. "
        "It resolves the path using pathlib.Path.resolve() and then calls "
        ".relative_to() to confirm containment. If the path escapes the "
        "project root, it raises a ValueError."
    ),
    "rejected": (
        "The validate_path function validates file paths. It checks if "
        "the path is valid and returns True or False."
    ),
}

# The "chosen" response is grounded in specific evidence (file path,
# method names, behavior). The "rejected" response is vague and
# doesn't reference the evidence. DPO teaches the model to prefer
# the grounded style.
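
One convenient source of preference pairs is the review file from the curation step: the original failing answer is a natural `rejected`, and the human correction a natural `chosen`. A sketch (field names match the review file above; write the result out as JSONL for the trainer):

```python
def build_dpo_pairs(reviewed: list[dict]) -> list[dict]:
    """Turn reviewed corrections into DPO preference pairs.

    Skips entries that lack either a correction or an original answer.
    """
    pairs = []
    for item in reviewed:
        if item.get("corrected_output") and item.get("original_answer"):
            pairs.append({
                "prompt": item.get("question", ""),
                "chosen": item["corrected_output"],
                "rejected": item["original_answer"],
            })
    return pairs


reviewed = [
    {"question": "What does validate_path do?",
     "original_answer": "It validates paths.",
     "corrected_output": "It resolves the path and checks containment."},
    {"question": "Unreviewed entry", "original_answer": "x", "corrected_output": ""},
]
pairs = build_dpo_pairs(reviewed)
assert len(pairs) == 1
assert pairs[0]["rejected"] == "It validates paths."
```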

DPO training with TRL:

from trl import DPOTrainer, DPOConfig

# DPO configuration — similar to SFT but with preference-specific params
dpo_config = DPOConfig(
    output_dir="optimization/models/dpo-tuned",
    max_steps=100,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,  # Lower than SFT — DPO is more sensitive
    beta=0.1,  # KL penalty against the reference model; lower = more freedom to deviate
    logging_steps=10,
    fp16=True,
    max_length=4096,
    max_prompt_length=2048,
)

When to use DPO over SFT

When you have a reliable grading signal (LLM-as-judge, human preferences) but writing exact gold outputs is impractical. For structured tasks like retrieval routing or format correction, SFT is simpler and more direct. For open-ended generation tasks like code explanations, DPO can capture nuanced quality distinctions that are hard to specify as exact outputs.

Hosted fine-tuning

Pick the provider that matches the path you've been using in the guide:

OpenAI is the most direct hosted fine-tuning path in this curriculum. The workflow is the same as the local SFT path above, except you upload curated failure-correction data and let OpenAI run the training job for you.

# Example: OpenAI fine-tuning API
from openai import OpenAI

client = OpenAI()

# Upload training data
with open("optimization/training_data/failure_corrections.jsonl", "rb") as f:
    training_file = client.files.create(file=f, purpose="fine-tune")

# Start fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={"n_epochs": 3},
)

print(f"Fine-tuning job: {job.id}")
print(f"Status: {job.status}")
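
Hosted jobs run asynchronously, so you poll until a terminal state. A provider-agnostic sketch — for OpenAI, `fetch_status` would be `lambda: client.fine_tuning.jobs.retrieve(job.id).status`:

```python
import time


def wait_for_job(fetch_status, poll_seconds: float = 30.0, max_polls: int = 480) -> str:
    """Poll a zero-arg status callable until the job reaches a terminal state."""
    terminal = {"succeeded", "failed", "cancelled"}
    for _ in range(max_polls):
        status = fetch_status()
        if status in terminal:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("fine-tuning job did not finish within the polling budget")


# Smoke test with a fake status sequence instead of a live API call.
statuses = iter(["queued", "running", "running", "succeeded"])
assert wait_for_job(lambda: next(statuses), poll_seconds=0) == "succeeded"
```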

The training data format for OpenAI's API uses messages instead of prompt/completion:

{"messages": [{"role": "system", "content": "You are a code assistant."}, {"role": "user", "content": "What does validate_path do?"}, {"role": "assistant", "content": "The corrected answer here..."}]}
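
If you built prompt/completion pairs for local SFT, converting them is mechanical. One caveat: extract_training_prompt above embeds the system prompt inside the prompt text, so split it back out rather than duplicating it. A sketch assuming that split has been done:

```python
def to_openai_messages(
    example: dict, system: str = "You are a code assistant."
) -> dict:
    """Convert a local prompt/completion pair to OpenAI's chat format.

    `example["prompt"]` here should be the user-visible part only
    (evidence + question), with the system prompt passed separately.
    """
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": example["prompt"]},
            {"role": "assistant", "content": example["completion"]},
        ]
    }


converted = to_openai_messages(
    {"prompt": "What does validate_path do?", "completion": "It checks containment."}
)
assert [m["role"] for m in converted["messages"]] == ["system", "user", "assistant"]
```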

Together AI and Google Cloud Vertex AI offer hosted fine-tuning options as well. See the Model Selection and Serving reference if you want a broader provider comparison after you've finished the core lesson.

The fine-tuning postmortem

After every fine-tuning run, write a brief postmortem answering these questions:

  1. What failure cluster did you target? Name it and quantify it (e.g., "wrong_format failures, 12 out of 46 benchmark questions").
  2. How many training examples did you use? And how many did you discard during curation?
  3. Did the cluster improve? By how much?
  4. Did anything regress? If yes, what pattern do the regressions share?
  5. Was fine-tuning the right rung? In hindsight, could a cheaper intervention have solved this? Sometimes running the experiment is what proves the problem was actually at a different layer.

In production engineering, postmortems turn incidents into institutional knowledge that prevents recurrence. Fine-tuning postmortems do the same thing, capturing what worked and what didn't so you can decide whether to keep the fine-tuned model or try a different approach. A fine-tune that improves the cluster by 30% with no regressions is a clear win. A fine-tune that improves the cluster by 5% with 3 regressions is a signal to try a different approach.

Exercises

  1. Identify a failure cluster. Run the cluster identification script on your most recent graded run log. Do you have a cluster with 5+ failures that share a consistent pattern?

  2. Curate training data. For your largest cluster, export the failures and write corrected outputs. How long does correction take per example? This time cost is part of the optimization tax.

  3. Train and evaluate. Run SFT on your corrected data. Did the cluster improve? Did anything regress?

  4. Write a postmortem. Answer the five postmortem questions. Was fine-tuning the right rung for this problem?

  5. Compare SFT vs. DPO (stretch). For the same failure cluster, create DPO preference pairs (use the original wrong answer as rejected, your correction as chosen). Train with DPO. Does it produce different results than SFT?

Completion checkpoint

You're done with this lesson when you can:

  • Explain when fine-tuning is justified and when it's premature
  • Curate training data from failure clusters in your run logs
  • Run SFT with LoRA/QLoRA on corrected examples
  • Evaluate for both improvement on the targeted cluster and regression on the broader benchmark
  • Articulate the difference between SFT and DPO and when each is appropriate
  • Write a fine-tuning postmortem that justifies keeping or discarding the result

What's next

Keep Decision Rules nearby as you keep building, and use Portfolio Milestones when you're ready to turn the work into something public.

You now have the full path behind you. Keep measuring before optimizing, and keep choosing the cheapest intervention that actually addresses the problem.


Glossary
API (Application Programming Interface)Foundational terms
A structured way for programs to communicate. In this context, usually an HTTP endpoint you call to interact with an LLM.
AST (Abstract Syntax Tree)Foundational terms
A tree representation of source code structure. Used by parsers like Tree-sitter to understand code as a hierarchy of functions, classes, and statements. You'll encounter this more deeply in the Code Retrieval module, but the concept appears briefly in retrieval fundamentals.
BM25 (Best Match 25)Foundational terms
A classical ranking function for keyword search. Scores documents by term frequency and inverse document frequency. Often competitive with or complementary to vector search.
ChunkingFoundational terms
Splitting a document into smaller pieces for indexing and retrieval. Chunk boundaries significantly affect retrieval quality. Split at the wrong place and your retrieval will return half a function or the end of one paragraph glued to the start of another.
Context engineeringFoundational terms
The discipline of selecting, packaging, and budgeting the information a model sees at inference time. Prompts, retrieved evidence, tool results, memory, and state are all parts of context. Context engineering is arguably the core skill of AI engineering. Bigger context windows are not a substitute for better context selection.
Context rotFoundational terms
Degradation of output quality caused by stale, noisy, or accumulated context. Symptoms include stale memory facts, conflicting retrieved evidence, bloated prompt history, and accumulated instructions that contradict each other. A form of technical debt in AI systems.
Context windowFoundational terms
The maximum number of tokens an LLM can process in a single request (input + output combined).
EmbeddingFoundational terms
A fixed-length numeric vector representing a piece of text. Used for similarity search: texts with similar meanings have nearby embeddings.
EndpointFoundational terms
A specific URL path that accepts requests and returns responses (e.g., POST /v1/chat/completions).
GGUFFoundational terms
A file format for quantized models used by llama.cpp and Ollama. When you see a model name like qwen2.5:7b-q4_K_M, the suffix indicates the quantization scheme. GGUF supports mixed quantization (different precision for different layers) and is the most common format for local inference.
HallucinationFoundational terms
When a model generates content that sounds confident but isn't supported by the evidence it was given, or fabricates details that don't exist. Not the same as "any wrong answer"; a model that misinterprets ambiguous instructions gave a bad answer but didn't hallucinate. Common causes: weak prompt, missing context, context rot, model limitation, or retrieval failure.
**Inference**
Running a trained model to generate output from input. What happens when you call an API. Most AI engineering work is inference-time work: building systems around models, not training them. Use "inference," not "inferencing."
**JSON (JavaScript Object Notation)**
A lightweight text format for structured data. The lingua franca of API communication.
**Lexical search**
Finding items by matching keywords or terms. Includes BM25, TF-IDF (Term Frequency–Inverse Document Frequency), and simple keyword matching. Returns exact term matches, not semantic similarity.
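A minimal sketch of the lexical idea: exact term overlap, no semantics. BM25 and TF-IDF refine this with term-frequency and document-length weighting; the scorer here is deliberately naive.

```python
def keyword_score(query: str, doc: str) -> int:
    """Count how many query terms appear verbatim in the document.
    This is the core of lexical search; BM25 adds weighting on top."""
    doc_terms = set(doc.lower().split())
    return sum(1 for term in query.lower().split() if term in doc_terms)

docs = ["retry logic for the payments client", "parsing GGUF model files"]
# Rank documents by how many query terms they contain.
ranked = sorted(docs, key=lambda d: keyword_score("payments retry", d), reverse=True)
```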
**LLM (Large Language Model)**
A neural network trained on large text corpora that generates text by predicting the next token. The core technology behind AI engineering; every tool, pattern, and pipeline in this curriculum runs on top of one.
**Metadata**
Structured information about a document or chunk (file path, language, author, date, symbol type). Used for filtering retrieval results.
**Neural network**
A computing system loosely inspired by biological neurons, built from layers of mathematical functions that transform inputs into outputs. LLMs are a specific type of neural network (transformers) trained on text. You don't need to understand neural network internals to do AI engineering, but knowing the term helps when reading external resources.
**Reasoning model**
A model optimized for complex multi-step planning, math, and logic (e.g., o3, o4-mini). Slower and more expensive but better on hard problems. Sometimes called "LRM" (large reasoning model), but "reasoning model" is the more consistent term across provider docs.
**Reranking**
A second-pass scoring step that re-orders retrieved results using a more expensive model. Improves precision after an initial broad retrieval.
**Schema**
A formal description of the shape and types of a data structure. Used to validate inputs and outputs.
**SLM (small language model)**
A compact model (typically 1-7B parameters) that runs on consumer hardware with lower cost, lower latency, and better privacy (e.g., Phi, small Llama variants, Gemma). The right choice when privacy, offline operation, predictable cost, or low latency matters more than peak capability.
**System prompt**
A special message that sets the model's behavior, role, and constraints for a conversation.
**Temperature**
A parameter controlling output randomness. Lower values produce more deterministic output; higher values produce more varied output. Does not affect the model's intelligence.
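What temperature does mechanically, sketched in plain Python: the model's raw scores (logits) are divided by the temperature before being turned into probabilities, so low values sharpen the distribution and high values flatten it. The logit values here are made up for illustration.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert raw scores into probabilities, scaled by temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
cold = softmax_with_temperature(logits, 0.2)  # near-deterministic
hot = softmax_with_temperature(logits, 2.0)   # more varied

# At low temperature, the top token takes almost all the probability mass.
assert cold[0] > hot[0]
```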
**Token**
The basic unit an LLM processes. Not a word. Tokens are sub-word fragments. "unhappiness" might be three tokens: "un", "happi", "ness". Token count determines cost and context window usage.
**Top-k**
The number of results returned from a retrieval query. "Top-5" means the five highest-scoring results.
**Top-p (nucleus sampling)**
An alternative to temperature for controlling output diversity. Selects from the smallest set of tokens whose cumulative probability exceeds p.
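A sketch of the nucleus-selection step with a made-up token distribution:

```python
def nucleus(probs: dict[str, float], p: float) -> list[str]:
    """Return the smallest set of tokens whose cumulative probability
    reaches p, taking tokens from most to least probable."""
    chosen, cumulative = [], 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        chosen.append(token)
        cumulative += prob
        if cumulative >= p:
            break
    return chosen

# Illustrative next-token distribution, not real model output.
probs = {"the": 0.5, "a": 0.3, "an": 0.15, "xylophone": 0.05}

# With p=0.9, the long-tail token "xylophone" is excluded from sampling.
assert nucleus(probs, 0.9) == ["the", "a", "an"]
```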
**Vector search**
Finding items by proximity in embedding space (nearest neighbors). Returns "similar" results, not "exact match" results.
**vLLM (virtual LLM)**
An inference serving engine (not a model) that hosts open-weight models behind an OpenAI-compatible HTTP endpoint. Infrastructure layer, not model layer. Relevant when moving from hosted APIs to self-hosting.
**Weights**
The learned parameters inside a model. Changed during training, fixed during inference.
**Workhorse model**
A general-purpose LLM optimized for speed and broad capability (e.g., GPT-4o-mini, Claude Haiku, Gemini Flash). The default for most tasks. When someone says "LLM" without qualification, they usually mean this.
### Benchmark and Harness terms

**Baseline**
The first measured performance of your system on a benchmark. Everything else is compared against this. Without a baseline, you can't tell whether a change helped.
**Benchmark**
A fixed set of questions or tasks with known-good answers, used to measure system performance over time.
**Run log**
A structured record (typically JSONL) of every system run: what input was given, what output was produced, what tools were called, how long it took, and what it cost. The raw data that evals, telemetry, and cost analysis are built from.
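A minimal run-log writer, one JSON object per line. The record fields here are illustrative examples, not a standard schema:

```python
import json
import pathlib
import tempfile

def append_run(log_path: pathlib.Path, record: dict) -> None:
    """Append one run record as a single JSON line (JSONL)."""
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

log = pathlib.Path(tempfile.mkdtemp()) / "runs.jsonl"
# Hypothetical field names for illustration.
append_run(log, {"input": "What is chunking?", "output": "...", "latency_ms": 412, "cost_usd": 0.0003})
append_run(log, {"input": "Define top-k", "output": "...", "latency_ms": 275, "cost_usd": 0.0002})

# Reading the log back is just parsing one JSON object per line.
records = [json.loads(line) for line in log.read_text().splitlines()]
```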
### Agent and Tool Building terms

**A2A (Agent-to-Agent protocol)**
An open protocol for peer-to-peer agent collaboration. Agents discover each other's capabilities and delegate or negotiate tasks as equals. Different from MCP (which connects agents to tools, not to other agents) and from handoffs (which transfer control within one system).
**Agent**
A system where an LLM decides which tools to call, observes results, and iterates until a task is complete. Agent = model + tools + control loop.
**Control loop**
The code that manages the agent's cycle: send prompt, check for tool calls, execute tools, append results, repeat or finish.
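A stripped-down control loop with a stubbed model, to show the shape of the cycle. The message format and tool-call convention here are simplified stand-ins for a real provider API:

```python
def run_agent(model, tools: dict, prompt: str, max_steps: int = 5) -> str:
    """Minimal control loop: ask the model, execute any requested tool,
    append the result, repeat until the model returns a final answer."""
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_steps):
        reply = model(messages)
        if "tool" not in reply:
            return reply["content"]                    # final answer: done
        result = tools[reply["tool"]](reply["args"])   # execute the tool call
        messages.append({"role": "tool", "content": str(result)})
    return "step limit reached"

# Stubbed "model": requests one tool call, then answers from its result.
def fake_model(messages):
    if messages[-1]["role"] == "tool":
        return {"content": f"The answer is {messages[-1]['content']}"}
    return {"tool": "add", "args": (2, 3)}

answer = run_agent(fake_model, {"add": lambda args: args[0] + args[1]}, "What is 2+3?")
```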
**Handoff**
Passing control from one agent or specialist to another within an orchestrated system.
**MCP (Model Context Protocol)**
An open protocol for exposing tools, resources, and prompts to AI applications in a standardized way. Connects agents to capabilities (tools and data), not to other agents.
**Tool calling / function calling**
The model's ability to request execution of a specific function with structured arguments, rather than just generating text.
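What a tool definition and a structured tool call look like. The field names follow the JSON-schema style used by OpenAI-compatible APIs; other providers use similar but not identical shapes, and `get_weather` is a hypothetical example:

```python
import json

# A tool definition the application registers with the model.
get_weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# The model replies with structured arguments, not prose:
model_tool_call = '{"name": "get_weather", "arguments": {"city": "Oslo"}}'
call = json.loads(model_tool_call)
```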
### Code Retrieval terms

**Context compilation / context packing**
The process of selecting and assembling the smallest useful set of evidence for a specific task. Not "dump everything retrieved into the prompt."
**Grounding**
Tying model assertions to specific evidence. A grounded answer cites what it found; an ungrounded answer asserts without evidence.
**Hybrid retrieval**
Combining multiple retrieval methods (e.g., vector search + keyword search + metadata filters) and merging or reranking the results.
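One common way to merge ranked lists from different retrievers is reciprocal rank fusion (RRF): each document scores 1/(k + rank) in every list it appears in, so items ranked well by multiple methods rise to the top. A sketch with made-up file names:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each item scores 1/(k + rank) per list it appears in."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["auth.py", "session.py", "tokens.py"]   # semantic retriever
keyword_hits = ["session.py", "config.py"]             # lexical retriever

# session.py appears in both lists, so it wins the merged ranking.
merged = reciprocal_rank_fusion([vector_hits, keyword_hits])
```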
**Knowledge graph**
A data structure that stores entities and their relationships explicitly (e.g., "function A calls function B," "module X imports module Y"). Useful for traversal and dependency reasoning. One retrieval strategy among several, often overused when simpler metadata or adjacency tables would suffice.
**RAG (Retrieval-Augmented Generation)**
A pattern where the model's response is grounded in retrieved external evidence rather than relying solely on its training data.
**Symbol table**
A mapping of code identifiers (functions, classes, variables) to their locations and metadata.
**Tree-sitter**
An incremental parsing library that builds ASTs for source code. Used in this curriculum for code-aware chunking and symbol extraction.
### RAG and Grounded Answers terms

**Context pack**
A structured bundle of evidence assembled for a specific task, with metadata about provenance, relevance, and token budget.
**Evidence bundle**
A collection of retrieved items grouped for a specific sub-task, with enough metadata to evaluate whether the evidence is relevant and sufficient.
**Retrieval routing**
Deciding which retrieval strategy or method to use for a given query. Different questions need different retrieval methods.
### Observability and Evals terms

**Eval**
A structured test that measures system quality. Not the same as training. Evals measure, they don't change the model.
**Harness (AI harness / eval harness)**
The experiment and evaluation framework around your model or agent. It runs benchmark tasks, captures outputs, logs traces, grades results, and compares system versions. It turns ad hoc "try it and see" into repeatable, comparable experiments. Typically includes: input dataset, prompt and tool configuration, model/provider selection, execution loop, logging, grading, and artifact capture.
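The core of a harness is a loop that runs tasks, grades outputs, and keeps records. A minimal sketch with an exact-match grader and a toy system standing in for your model or agent:

```python
def run_harness(system, benchmark: list[dict]) -> dict:
    """Run every benchmark task through the system, grade with exact match,
    and keep per-task records for later inspection."""
    records = []
    for task in benchmark:
        output = system(task["input"])
        records.append({
            "input": task["input"],
            "output": output,
            "expected": task["expected"],
            "pass": output == task["expected"],
        })
    score = sum(r["pass"] for r in records) / len(records)
    return {"score": score, "records": records}

benchmark = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]
# Toy "system" that gets one task right and one wrong.
report = run_harness(lambda q: "4" if q == "2+2" else "Lyon", benchmark)
```

The records, not just the score, are the point: they are what you inspect to find failure clusters.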
**LLM-as-judge**
Using a language model to evaluate or grade the output of another model or system. Useful for scaling evaluation beyond manual review, but requires rubric quality, judge consistency checks, and human spot-checking. Not a replacement for exact-match checks where they apply.
**OpenTelemetry (OTel)**
An open standard for collecting and exporting telemetry data (traces, metrics, logs). Vendor-agnostic.
**RAGAS**
A specific eval framework for retrieval-augmented generation. Measures metrics like faithfulness, relevance, and context precision. One tool example, not a foundational concept. Learn the metrics first, then the tool.
**Span**
A single operation within a trace (e.g., one tool call, one retrieval query). Traces are made of spans.
**Telemetry**
Structured data about system behavior: what happened, when, how long it took, what it cost. Includes traces, metrics, and events.
**Trace**
A structured record of one complete run through the system, including all steps, tool calls, and decisions.
### Orchestration and Memory terms

**Long-term memory**
Persistent facts that survive across conversations. Requires write policies to manage what gets stored, updated, or deleted.
**Orchestration**
Explicit control over how tasks are routed, delegated, and synthesized across multiple agents or specialists.
**Router**
A component that decides which specialist or workflow path to use for a given query.
**Specialist**
An agent or workflow tuned for a narrow task (e.g., "code search," "documentation lookup," "test generation"). Specialists are composed by an orchestrator.
**Thread memory**
Conversation state that persists within a single session or thread.
**Workflow memory**
Intermediate state that persists within a multi-step task but doesn't survive beyond the workflow's completion.
### Optimization terms

**Catastrophic forgetting**
When fine-tuning causes a model to lose capabilities it had before training. The model gets better at the fine-tuned task but worse at tasks it previously handled. PEFT methods like LoRA reduce this risk by freezing original weights.
**Distillation**
Training a smaller (student) model to reproduce the behavior of a larger (teacher) model on a specific task.
**DPO (Direct Preference Optimization)**
A method for preference-based model optimization that's simpler than RLHF, training the model directly on preference pairs without a separate reward model.
**Fine-tuning**
Updating a model's weights on task-specific data to change its behavior permanently. An umbrella term that includes SFT, instruction tuning, RLHF, DPO, and other techniques. See the fine-tuning landscape table in Lesson 8.3 for how these relate.
**Full fine-tuning**
Updating all of a model's parameters during training, as opposed to PEFT methods that update only a small subset. Requires significantly more GPU memory and compute. Produces the most thorough adaptation but carries higher risk of catastrophic forgetting.
**Inference server**
Software (like vLLM or Ollama) that hosts a model and serves inference requests.
**Instruction tuning**
A specific application of SFT where the training data consists of instruction-response pairs. This is how base models become chat models: the technique is SFT, the data format is instructions. Not a separate technique from SFT.
**LoRA (Low-Rank Adaptation)**
A parameter-efficient fine-tuning method that trains small adapter matrices instead of updating all model weights. Dramatically reduces GPU memory and compute requirements.
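The arithmetic behind the savings, for one hypothetical 4096 x 4096 projection matrix: LoRA freezes the original weights W and trains only two small matrices B (d x r) and A (r x d) whose product is added back at inference (W' = W + B @ A).

```python
# One weight matrix in a 7B-class model: 4096 x 4096 (illustrative size).
d = 4096
full_params = d * d                # parameters updated by full fine-tuning

r = 8                              # adapter rank; a commonly used small value
lora_params = d * r + r * d        # parameters the LoRA adapter trains

reduction = full_params / lora_params  # trainable-parameter reduction factor
```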
**Parameter count**
The number of learned weights in a model, commonly expressed in billions (e.g., "7B" = 7 billion parameters). Determines memory requirements (roughly 2 bytes per parameter at FP16) and broadly correlates with capability, though training quality and architecture matter as much as size. See Model Selection and Serving for sizing guidance.
**PEFT (Parameter-Efficient Fine-Tuning)**
A family of methods (including LoRA) that fine-tune a small subset of parameters instead of the full model.
**Preference optimization**
Training methods (RLHF, DPO) that use human or automated preference signals to improve model behavior. "This output is better than that output" rather than "this is the correct output."
**QLoRA (Quantized LoRA)**
LoRA applied to a quantized (compressed) base model. Further reduces memory requirements, enabling fine-tuning on consumer hardware.
**Quantization**
Reducing the precision of model weights (e.g., FP16 → INT4) to shrink memory usage and increase inference speed at some quality cost. A 7B model at FP16 needs ~14 GB VRAM; quantized to 4-bit, it fits in ~4 GB. Common formats include GGUF (llama.cpp/Ollama) and GPTQ/AWQ (vLLM/Hugging Face). See Model Selection and Serving for format details and tradeoffs.
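The memory arithmetic behind those numbers (weights only; activations, KV cache, and quantization-format overhead push real usage somewhat higher):

```python
def vram_gb(parameters: float, bits_per_weight: int) -> float:
    """Approximate weight memory in GB: parameters x bits per weight.
    Ignores activations and KV cache, so real usage is higher."""
    return parameters * bits_per_weight / 8 / 1e9

fp16 = vram_gb(7e9, 16)  # 7B model at 16 bits per weight
int4 = vram_gb(7e9, 4)   # same model quantized to 4 bits per weight
```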
**Overfitting**
When a model memorizes training examples instead of learning generalizable patterns. The model performs well on training data but poorly on new inputs. Detected by monitoring validation loss alongside training loss.
**RLHF (Reinforcement Learning from Human Feedback)**
A training method that uses human preference signals to improve model behavior through a reward model. More complex than DPO (requires training a separate reward model) but offers more control over the optimization objective.
**SFT (Supervised Fine-Tuning)**
Fine-tuning using input-output pairs where the desired output is known. The most common fine-tuning approach.
**TRL (Transformer Reinforcement Learning)**
A Hugging Face library for training language models with reinforcement learning, SFT, and other optimization methods.
### Cross-cutting terms

**Consumer chat app**
The browser or desktop product meant for human conversation (ChatGPT, Claude, HuggingChat). Useful for experimentation, but not the same as API access.
**Developer platform**
The provider's API, billing, API-key, and developer-docs surface. This is what you need for this learning path.
**Hosted API**
The provider runs the model for you and you call it over HTTP.
**Local inference**
You run the model on your own machine.
**Provider**
The company or service that hosts a model API you call from code.
**Prompt caching**
Reusing computation from repeated prompt prefixes to reduce latency and cost on subsequent requests with the same prefix.
**Rate limiting**
Constraints on how many API requests you can make per unit of time. An operational concern that affects system design and cost.
**Token budget**
The maximum number of tokens you allocate for a specific part of the context (e.g., "retrieval evidence gets at most 4K tokens"). A context engineering tool for preventing any single component from dominating the context window.
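A sketch of budget enforcement, using a deliberately crude character-based token estimate (real accounting should use the provider's tokenizer):

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate: roughly 4 characters per token for English text."""
    return max(1, len(text) // 4)

def pack_evidence(chunks: list[str], budget_tokens: int) -> list[str]:
    """Add chunks (assumed already sorted by relevance) until the budget
    is spent, so retrieval can't dominate the context window."""
    packed, used = [], 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget_tokens:
            break
        packed.append(chunk)
        used += cost
    return packed

evidence = ["a" * 400, "b" * 400, "c" * 400]  # ~100 estimated tokens each
packed = pack_evidence(evidence, budget_tokens=250)  # third chunk won't fit
```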