Fine-Tuning: Persistent Model Adaptation
Fine-tuning is the last rung on the optimization ladder. You reach for it when a repeated failure cluster survives prompt engineering, retrieval improvement, context restructuring, and (if applicable) distillation. The model has the right evidence, clear instructions, well-structured context, and it still gets the answer wrong in the same way. That pattern needs to be baked into the weights.
This is not where most learners should spend most of their time. The curriculum has deliberately placed fine-tuning last because the overwhelming majority of AI engineering problems resolve at earlier rungs. But when the failure diagnostic from the previous lessons consistently points to model-attributed failures, fine-tuning is the tool that addresses them. We'll cover supervised fine-tuning (SFT) with LoRA/QLoRA, data preparation from your run logs, evaluation against the base model, and preference optimization (DPO) as an advanced technique.
What you'll learn
- When fine-tuning is the right choice and when it's premature
- How to prepare training data from your run logs and eval results
- How to run SFT with LoRA/QLoRA on a local model using PEFT and TRL
- How to evaluate the fine-tuned model against the base model on your benchmark
- How preference optimization (DPO) works and when it's appropriate
- How to use hosted fine-tuning APIs when local training isn't practical
Concepts
Fine-tuning vs. distillation — Distillation compresses a working behavior into a cheaper model. Fine-tuning fixes a broken behavior by training on corrected examples. The data sources are different: distillation uses teacher-generated outputs, fine-tuning uses human-curated or grader-verified correct outputs. The goals are different: distillation preserves quality at lower cost, fine-tuning improves quality on specific failure patterns.
SFT (Supervised Fine-Tuning) — the same technique from the distillation lesson, but applied to a different problem. In distillation, SFT inputs are teacher prompts and teacher outputs. In fine-tuning, SFT inputs are the prompts that caused failures and the corrected outputs you want the model to produce instead. The training mechanics are identical — the difference is where the training data comes from.
Failure cluster — a group of benchmark questions or production queries that fail in the same way. Examples from our anchor project:
- The model consistently puts the file path in the wrong field of the structured output
- The model ignores evidence from test files when answering "what does this function do?"
- The model produces vague summaries when the evidence contains multiple conflicting code patterns
A failure cluster is a fine-tuning candidate when: (1) the failures repeat across runs, (2) the pattern is consistent enough to label, and (3) you can produce correct examples for each failure case.
Training data curation — the process of turning failure clusters into training examples. This is the hardest part of fine-tuning. You need:
- Inputs: the exact prompts that triggered the failure (from your run logs)
- Correct outputs: what the model should have produced (written by you or verified by a grader)
- Negative examples filtered out: prompts where the base model already succeeds (training on these wastes capacity and can cause regression)
Regression — when fine-tuning on one task degrades performance on other tasks. A model fine-tuned to always cite test files might start over-citing irrelevant test files on questions that don't need them. Regression is the primary risk of fine-tuning and the reason you need a broad eval suite, not just a targeted one.
DPO (Direct Preference Optimization) — a training technique that learns from pairs of outputs: one preferred, one rejected. Instead of showing the model "produce this output," DPO shows the model "output A is better than output B." This is useful when you can rank outputs but can't easily write the perfect gold output. DPO requires a reliable way to judge which output is better — either human annotators or an LLM-as-judge with a strong rubric.
Catastrophic forgetting — when fine-tuning causes a model to lose capabilities it had before training. The model gets better at the fine-tuned task but worse at other tasks it previously handled well. This is different from regression (which is a measurable performance drop on your benchmark) — catastrophic forgetting can affect general capabilities you weren't testing for. PEFT methods like LoRA reduce catastrophic forgetting because they freeze the original weights and only train small adapter layers, leaving most of the model's knowledge intact.
Overfitting — when the model memorizes the training examples instead of learning the underlying pattern. An overfitted model performs well on training data but poorly on new inputs. Signs: training loss drops to near-zero while validation loss stops improving or increases. Mitigation: use a validation split, stop training when validation loss plateaus (early stopping), and keep training data diverse.
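The validation split mentioned above can be carved out of the same JSONL training file before you start a run. A minimal sketch, assuming the prompt-completion JSONL format used later in this lesson (the file path and fraction are illustrative):

```python
import json
import random

def split_dataset(
    path: str, val_fraction: float = 0.1, seed: int = 42
) -> tuple[list[dict], list[dict]]:
    """Split a prompt-completion JSONL file into train and validation lists.

    A fixed seed makes the split reproducible across runs, so validation
    loss is comparable between training configurations.
    """
    with open(path) as f:
        examples = [json.loads(line) for line in f if line.strip()]
    random.Random(seed).shuffle(examples)  # deterministic shuffle
    n_val = max(1, int(len(examples) * val_fraction))
    return examples[n_val:], examples[:n_val]
```

Pass the validation list to your trainer's eval dataset and stop training when its loss plateaus while training loss keeps falling.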
The fine-tuning landscape
"Fine-tuning" is an umbrella term that covers several distinct techniques. If you search for it, you'll find a confusing mix of approaches. Here's how they relate to each other:
| Technique | What it does | Data requirement | What we teach |
|---|---|---|---|
| Full fine-tuning | Updates all model parameters | Large dataset, large GPU | Mentioned for context; too expensive for most curriculum work |
| SFT (Supervised Fine-Tuning) | Trains on input-output pairs | Hundreds to thousands of labeled pairs | Yes — our primary technique (with LoRA/QLoRA) |
| Instruction tuning | SFT specifically on instruction-response pairs | Instruction-formatted dataset | Conceptually — SFT on instructions is how chat models are made |
| PEFT / LoRA / QLoRA | Updates a small fraction of parameters | Same as SFT, less compute | Yes — our default training method |
| DPO | Learns from preferred vs. rejected output pairs | Preference pairs | Yes — covered as an advanced option |
| RLHF | Trains a reward model, then optimizes against it | Human preference data + reward model | Mentioned for context; requires more infrastructure than DPO |
| ORPO / KTO / SimPO | Newer preference methods with simpler training | Similar to DPO | Not covered — the landscape is evolving fast; DPO teaches the core concept |
The techniques in this lesson — SFT with LoRA/QLoRA and DPO — cover the approaches that are practical for individual engineers and small teams. Full fine-tuning requires multi-GPU setups that are outside the scope of the curriculum's hardware assumptions. RLHF requires training a separate reward model, adding complexity that isn't justified until you're operating at scale. The newer preference methods (ORPO, KTO, SimPO) are worth watching but still stabilizing — DPO teaches the core concept of preference-based optimization, and the mechanics transfer.
What about "domain-specific fine-tuning" and "multi-task learning"? These are applications of the techniques above, not separate techniques. Domain-specific fine-tuning is SFT applied to domain data. Multi-task learning is SFT on multiple tasks simultaneously. The technique is the same — what changes is the data curation strategy.
Problem-to-Tool Map
| Problem class | Symptom | Cheapest thing to try first | Tool or approach |
|---|---|---|---|
| Repeated format failures | Model puts data in wrong fields despite schema constraints | Stricter output schema + few-shot examples | SFT on corrected format examples |
| Consistent evidence misuse | Model ignores relevant evidence from specific file types | Better context structure, evidence ordering | SFT on examples with correct evidence usage |
| Weak instruction following | Model doesn't follow constraints even with explicit rules | More explicit prompt, decompose instructions | SFT on instruction-correct pairs |
| Can grade but can't specify gold outputs | You know what's better but can't write the perfect answer | Human-curated examples | DPO with preferred/rejected pairs |
| Fine-tuned model regresses on other tasks | New failures appear on previously passing questions | Broader training data | Include passing examples in training set, reduce learning rate |
Walkthrough
Step 1: Identify a failure cluster
Use the failure diagnostic from the optimization ladder lesson to find model-attributed failures. Then group them by pattern:
# optimization/identify_clusters.py
"""Group model-attributed failures into clusters.
Reads a graded run log, filters to model-attributed failures,
and groups them by failure pattern for fine-tuning data curation.
"""
import json
from collections import defaultdict
def load_model_failures(run_log_path: str) -> list[dict]:
"""Load entries where the model failed despite good retrieval."""
failures = []
with open(run_log_path) as f:
for line in f:
line = line.strip()
if not line:
continue
entry = json.loads(line)
# Model failure: retrieval hit but answer is wrong
if (
entry.get("retrieval_hit") is True
and entry.get("grade") not in ("correct", "acceptable")
):
failures.append(entry)
return failures
def cluster_by_label(failures: list[dict]) -> dict[str, list[dict]]:
"""Group failures by their failure label."""
clusters = defaultdict(list)
for f in failures:
label = f.get("failure_label", "unlabeled")
clusters[label].append(f)
return dict(clusters)
def print_clusters(clusters: dict[str, list[dict]]) -> None:
"""Print cluster summary for fine-tuning candidate selection."""
print(f"\nModel-Attributed Failure Clusters")
print("=" * 50)
for label, entries in sorted(
clusters.items(), key=lambda x: -len(x[1])
):
print(f"\n{label}: {len(entries)} failures")
for entry in entries[:3]:
q = entry.get("question", "")[:70]
print(f" - {q}...")
total = sum(len(v) for v in clusters.values())
print(f"\nTotal model-attributed failures: {total}")
print(f"Clusters: {len(clusters)}")
print(
"\nFine-tuning candidates: clusters with 5+ failures "
"that share a consistent pattern."
)
Step 2: Curate training data
For each failure in the cluster, you need the original prompt and a corrected output. The prompt comes from your run logs. The corrected output requires human judgment:
# optimization/curate_training_data.py
"""Build fine-tuning training data from failure clusters.
For each failure, extracts the original prompt from the run log
and pairs it with a human-corrected output.
"""
import json
from pathlib import Path
def extract_training_prompt(entry: dict) -> str:
"""Reconstruct the prompt that was sent to the model.
Uses the question, evidence, and system prompt from the run log.
Adapt this to match your pipeline's actual prompt structure.
"""
system = entry.get("system_prompt", "You are a code assistant.")
evidence = entry.get("evidence_text", "")
question = entry.get("question", "")
# Reconstruct the prompt the model actually saw
prompt = f"{system}\n\nEvidence:\n{evidence}\n\nQuestion: {question}"
return prompt
def create_training_example(
entry: dict,
corrected_output: str,
) -> dict:
"""Create a single training example from a failure + correction."""
return {
"prompt": extract_training_prompt(entry),
"completion": corrected_output,
"question_id": entry.get("question_id", "unknown"),
"original_answer": entry.get("answer", ""),
"failure_label": entry.get("failure_label", ""),
}
def save_for_review(
failures: list[dict],
output_path: str,
) -> None:
"""Save failures in a format for human correction.
Creates a JSONL file where each entry has the prompt and
original (incorrect) answer. A human reviewer adds the
corrected_output field.
"""
path = Path(output_path)
path.parent.mkdir(parents=True, exist_ok=True)
with open(path, "w") as f:
for entry in failures:
review_item = {
"question_id": entry.get("question_id", "unknown"),
"question": entry.get("question", ""),
"evidence_files": entry.get("evidence_files", []),
"original_answer": entry.get("answer", ""),
"failure_label": entry.get("failure_label", ""),
"corrected_output": "", # Human fills this in
}
f.write(json.dumps(review_item) + "\n")
print(f"Saved {len(failures)} items for human review at {output_path}")
print("Edit corrected_output for each entry, then run build_dataset.py")
def build_dataset(
reviewed_path: str,
run_log_path: str,
output_path: str,
) -> None:
"""Build the final training dataset from reviewed corrections.
Merges human-corrected outputs with the original run log entries
to create prompt-completion pairs for SFT.
"""
# Load reviewed corrections
corrections = {}
with open(reviewed_path) as f:
for line in f:
item = json.loads(line.strip())
if item.get("corrected_output"):
corrections[item["question_id"]] = item["corrected_output"]
# Load original run log for full prompt reconstruction
entries = {}
with open(run_log_path) as f:
for line in f:
entry = json.loads(line.strip())
entries[entry.get("question_id", "")] = entry
# Build training examples
training_data = []
for qid, corrected in corrections.items():
if qid in entries:
example = create_training_example(entries[qid], corrected)
training_data.append(example)
# Save
path = Path(output_path)
path.parent.mkdir(parents=True, exist_ok=True)
with open(path, "w") as f:
for ex in training_data:
f.write(json.dumps({
"prompt": ex["prompt"],
"completion": ex["completion"],
}) + "\n")
print(f"Built {len(training_data)} training examples at {output_path}")
The workflow:
# 1. Export failures for human review
python optimization/curate_training_data.py export
# 2. Human reviews and adds corrected_output to each entry
# (edit the JSONL file in your editor)
# 3. Build the training dataset from reviewed corrections
python optimization/curate_training_data.py build
Quality over quantity. 100 carefully corrected examples will outperform 1,000 sloppy ones. For each failure, make sure the corrected output is genuinely correct — uses the right evidence, follows the output schema, and answers the question accurately. If you're not sure what the correct output should be, skip that example.
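A quick quality gate before training helps enforce this. The sketch below checks the reviewed JSONL from the curation script for the two most common problems — empty corrections and corrections that merely copy the wrong answer — using the field names from the review format above:

```python
import json

def validate_corrections(reviewed_path: str) -> tuple[list[dict], list[str]]:
    """Filter reviewed corrections to entries that are safe to train on.

    Drops entries whose corrected_output is empty or identical to the
    original (wrong) answer, and reports why each entry was dropped.
    """
    kept, problems = [], []
    with open(reviewed_path) as f:
        for line in f:
            if not line.strip():
                continue
            item = json.loads(line)
            qid = item.get("question_id", "unknown")
            corrected = item.get("corrected_output", "").strip()
            if not corrected:
                problems.append(f"{qid}: corrected_output is empty")
            elif corrected == item.get("original_answer", "").strip():
                problems.append(f"{qid}: correction identical to original answer")
            else:
                kept.append(item)
    return kept, problems
```

Run it after the human review pass and before build_dataset.py; anything in the problems list goes back to the reviewer.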
Step 3: Train with SFT
The training code is nearly identical to the distillation lesson. The difference is the training data source (human-corrected failures vs. teacher outputs) and potentially the training configuration (you may want a lower learning rate and fewer steps for fine-tuning, since you're making targeted corrections rather than teaching a broad behavior):
# optimization/fine_tune.py
"""Fine-tune a model on curated failure corrections.
Uses the same PEFT + TRL + Unsloth stack as distillation,
but with corrected training data from failure clusters.
"""
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
def format_for_training(example: dict) -> dict:
"""Format a prompt-completion pair for SFT training."""
text = (
f"### Instruction:\n{example['prompt']}\n\n"
f"### Response:\n{example['completion']}"
)
return {"text": text}
def fine_tune(
model_name: str = "unsloth/Qwen2.5-7B-bnb-4bit",
training_data_path: str = "optimization/training_data/failure_corrections.jsonl",
output_dir: str = "optimization/models/fine-tuned",
max_steps: int = 100,
learning_rate: float = 1e-4,
lora_rank: int = 32,
):
"""Run fine-tuning on corrected failure examples.
Uses a lower learning rate and fewer steps than distillation
because we're making targeted corrections, not teaching a
broad task from scratch.
"""
model, tokenizer = FastLanguageModel.from_pretrained(
model_name=model_name,
max_seq_length=4096, # Longer context for full prompts
load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
model,
r=lora_rank,
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
lora_alpha=lora_rank,
lora_dropout=0,
use_gradient_checkpointing="unsloth",
)
dataset = load_dataset(
"json", data_files=training_data_path, split="train"
)
dataset = dataset.map(format_for_training)
training_config = SFTConfig(
output_dir=output_dir,
max_steps=max_steps,
per_device_train_batch_size=2,
gradient_accumulation_steps=8,
learning_rate=learning_rate,
logging_steps=10,
save_steps=25,
warmup_steps=10,
fp16=True,
dataset_text_field="text",
max_seq_length=4096,
)
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
args=training_config,
)
print(f"Fine-tuning on {len(dataset)} corrected examples...")
trainer.train()
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
print(f"Saved fine-tuned adapter to {output_dir}")
if __name__ == "__main__":
fine_tune()
python optimization/fine_tune.py
Step 4: Evaluate for improvement AND regression
This is the critical step that separates fine-tuning from wishful thinking. You need to check two things:
- Did the failure cluster improve? Run the fine-tuned model on the specific questions that triggered the failures.
- Did anything else break? Run the fine-tuned model on the entire benchmark, including questions the base model already passes.
# optimization/eval_fine_tuned.py
"""Evaluate fine-tuned model for both improvement and regression.
Compares the fine-tuned model against the base model on:
1. The targeted failure cluster (did it improve?)
2. The full benchmark (did anything regress?)
"""
import json
from unsloth import FastLanguageModel
def evaluate_both(
base_model_name: str,
adapter_path: str,
benchmark_path: str,
failure_ids: set[str],
) -> dict:
"""Run base and fine-tuned models on the full benchmark.
Returns comparison metrics for the failure cluster and
the rest of the benchmark separately.
"""
# Load base model
base_model, tokenizer = FastLanguageModel.from_pretrained(
model_name=base_model_name,
max_seq_length=4096,
load_in_4bit=True,
)
FastLanguageModel.for_inference(base_model)
# Load fine-tuned model (base + adapter)
ft_model, _ = FastLanguageModel.from_pretrained(
model_name=base_model_name,
max_seq_length=4096,
load_in_4bit=True,
)
ft_model.load_adapter(adapter_path)
FastLanguageModel.for_inference(ft_model)
# Load benchmark
entries = []
with open(benchmark_path) as f:
for line in f:
if line.strip():
entries.append(json.loads(line.strip()))
results = {
"cluster_base_correct": 0,
"cluster_ft_correct": 0,
"cluster_total": 0,
"other_base_correct": 0,
"other_ft_correct": 0,
"other_total": 0,
"regressions": [], # Questions base got right but FT got wrong
}
for entry in entries:
qid = entry.get("question_id", "")
expected = entry.get("expected_answer", "")
is_cluster = qid in failure_ids
# Run both models (simplified — adapt to your pipeline)
base_answer = run_model(base_model, tokenizer, entry)
ft_answer = run_model(ft_model, tokenizer, entry)
base_correct = grade(base_answer, expected)
ft_correct = grade(ft_answer, expected)
if is_cluster:
results["cluster_total"] += 1
results["cluster_base_correct"] += int(base_correct)
results["cluster_ft_correct"] += int(ft_correct)
else:
results["other_total"] += 1
results["other_base_correct"] += int(base_correct)
results["other_ft_correct"] += int(ft_correct)
# Track regressions: base passed, fine-tuned failed
if base_correct and not ft_correct:
results["regressions"].append(qid)
return results
def run_model(model, tokenizer, entry: dict) -> str:
"""Generate an answer from a model. Adapt to your pipeline."""
# Placeholder — replace with your actual generation logic
prompt = entry.get("prompt", entry.get("question", ""))
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)  # greedy decoding for reproducible evals
return tokenizer.decode(
outputs[0][inputs["input_ids"].shape[1]:],
skip_special_tokens=True,
)
def grade(answer: str, expected: str) -> bool:
"""Simple grading. Replace with your actual grading logic."""
return expected.lower() in answer.lower()
def print_comparison(results: dict) -> None:
"""Print the improvement/regression comparison."""
print(f"\n{'='*60}")
print("Fine-Tuning Evaluation: Improvement vs Regression")
print(f"{'='*60}")
ct = results["cluster_total"]
if ct > 0:
base_pct = results["cluster_base_correct"] / ct
ft_pct = results["cluster_ft_correct"] / ct
print(f"\nTargeted failure cluster ({ct} questions):")
print(f" Base model: {base_pct:.1%}")
print(f" Fine-tuned model: {ft_pct:.1%}")
print(f" Improvement: {ft_pct - base_pct:+.1%}")
ot = results["other_total"]
if ot > 0:
base_pct = results["other_base_correct"] / ot
ft_pct = results["other_ft_correct"] / ot
print(f"\nRest of benchmark ({ot} questions):")
print(f" Base model: {base_pct:.1%}")
print(f" Fine-tuned model: {ft_pct:.1%}")
print(f" Regression: {base_pct - ft_pct:+.1%}")
reg = results["regressions"]
if reg:
print(f"\nRegressions ({len(reg)} questions):")
for qid in reg[:10]:
print(f" - {qid}")
else:
print("\nNo regressions detected.")
Reading the results
The evaluation tells you one of four things:
| Cluster improved? | Regressions? | Verdict |
|---|---|---|
| Yes | None | Ship it. The fine-tune solved the problem without side effects. |
| Yes | Some | Investigate regressions. Consider including passing examples in training data to anchor existing behavior. |
| No | None | Training data quality issue. Review corrections and add more examples. |
| No | Some | Stop. The fine-tune made things worse. Re-examine whether this is a model problem at all. |
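The decision table above can be encoded as a small helper so the verdict is applied consistently across runs (the function and its return strings are illustrative, not part of any library):

```python
def fine_tune_verdict(cluster_delta: float, num_regressions: int) -> str:
    """Map eval results to the decision table's verdict.

    cluster_delta: fine-tuned accuracy minus base accuracy on the
    targeted cluster (e.g. 0.30 for a 30-point improvement).
    """
    improved = cluster_delta > 0
    if improved and num_regressions == 0:
        return "ship"
    if improved:
        return "investigate regressions"
    if num_regressions == 0:
        return "review training data quality"
    return "stop: re-examine whether this is a model problem"
```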
Preference optimization with DPO
SFT requires gold outputs — the exact text the model should produce. For some tasks, you can't write the perfect answer, but you can reliably judge which of two answers is better. DPO (Direct Preference Optimization) handles this case.
# optimization/dpo_example.py
"""Example DPO training data format and training setup.
DPO learns from preference pairs: for each prompt, a preferred
response and a rejected response.
"""
# DPO training data format
dpo_example = {
"prompt": "Given the evidence above, explain what the validate_path function does.",
"chosen": (
"The validate_path function in utils/security.py checks whether "
"a given file path is contained within the allowed project directory. "
"It resolves the path using pathlib.Path.resolve() and then calls "
".relative_to() to confirm containment. If the path escapes the "
"project root, it raises a ValueError."
),
"rejected": (
"The validate_path function validates file paths. It checks if "
"the path is valid and returns True or False."
),
}
# The "chosen" response is grounded in specific evidence (file path,
# method names, behavior). The "rejected" response is vague and
# doesn't reference the evidence. DPO teaches the model to prefer
# the grounded style.
DPO training with TRL:
from trl import DPOTrainer, DPOConfig
# DPO configuration — similar to SFT but with preference-specific params
dpo_config = DPOConfig(
output_dir="optimization/models/dpo-tuned",
max_steps=100,
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
learning_rate=5e-5, # Lower than SFT — DPO is more sensitive
beta=0.1,  # KL penalty strength; lower beta lets the policy drift further from the reference model
logging_steps=10,
fp16=True,
max_length=4096,
max_prompt_length=2048,
)
Use DPO when you have a reliable grading signal (LLM-as-judge, human preferences) but writing exact gold outputs is impractical. For structured tasks like retrieval routing or format correction, SFT is simpler and more direct. For open-ended generation tasks like code explanations, DPO can capture nuanced quality distinctions that are hard to specify as exact outputs.
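For the anchor project, preference pairs can be built directly from the curated corrections: the model's original wrong answer becomes the rejected response and the human correction becomes the chosen one. A sketch, assuming the review-file field names used earlier in this lesson:

```python
import json

def build_dpo_pairs(reviewed_path: str, output_path: str) -> int:
    """Convert reviewed corrections into DPO preference pairs.

    chosen = the human-corrected output; rejected = the model's
    original wrong answer. Returns the number of pairs written.
    """
    count = 0
    with open(reviewed_path) as f, open(output_path, "w") as out:
        for line in f:
            if not line.strip():
                continue
            item = json.loads(line)
            chosen = item.get("corrected_output", "").strip()
            rejected = item.get("original_answer", "").strip()
            if not chosen or chosen == rejected:
                continue  # skip entries with no usable preference signal
            out.write(json.dumps({
                "prompt": item.get("question", ""),
                "chosen": chosen,
                "rejected": rejected,
            }) + "\n")
            count += 1
    return count
```

This is also exactly the setup for exercise 5 below: the same failure cluster, expressed as preferences instead of gold outputs.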
Hosted fine-tuning
Use the provider that matches the path you've been following in the guide:
OpenAI is the most direct hosted fine-tuning path in this curriculum. The workflow is the same as the local SFT path above, except you upload curated failure-correction data and let OpenAI run the training job for you.
# Example: OpenAI fine-tuning API
from openai import OpenAI
client = OpenAI()
# Upload training data
with open("optimization/training_data/failure_corrections.jsonl", "rb") as f:
training_file = client.files.create(file=f, purpose="fine-tune")
# Start fine-tuning job
job = client.fine_tuning.jobs.create(
training_file=training_file.id,
model="gpt-4o-mini-2024-07-18",
hyperparameters={"n_epochs": 3},
)
print(f"Fine-tuning job: {job.id}")
print(f"Status: {job.status}")
The training data format for OpenAI's API uses messages instead of prompt/completion:
{"messages": [{"role": "system", "content": "You are a code assistant."}, {"role": "user", "content": "What does validate_path do?"}, {"role": "assistant", "content": "The corrected answer here..."}]}
Together AI and Google Cloud Vertex AI offer hosted fine-tuning options as well. See the Model Selection and Serving reference if you want a broader provider comparison after you've finished the core lesson.
The fine-tuning postmortem
After every fine-tuning run, write a brief postmortem answering these questions:
- What failure cluster did you target? Name it and quantify it (e.g., "wrong_format failures, 12 out of 46 benchmark questions").
- How many training examples did you use? And how many did you discard during curation?
- Did the cluster improve? By how much?
- Did anything regress? If yes, what pattern do the regressions share?
- Was fine-tuning the right rung? In hindsight, could a cheaper intervention have solved this? Sometimes running the experiment is what proves the problem was actually at a different layer.
In production engineering, postmortems turn incidents into institutional knowledge that prevents recurrence. Fine-tuning postmortems do the same thing, capturing what worked and what didn't so you can decide whether to keep the fine-tuned model or try a different approach. A fine-tune that improves the cluster by 30% with no regressions is a clear win. A fine-tune that improves the cluster by 5% with 3 regressions is a signal to try a different approach.
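A minimal postmortem template you can copy for each run — every value is a placeholder to fill in from your own eval results:

```
Fine-tuning postmortem: <cluster label>, <date>

1. Target cluster: <label>, <N> of <M> benchmark questions
2. Training examples: <used> used, <discarded> discarded during curation
3. Cluster result: base <x>% -> fine-tuned <y>% (<delta>)
4. Regressions: <count>; shared pattern: <description or "none">
5. Right rung? <which cheaper interventions were tried first, and why they failed>
```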
Exercises
-
Identify a failure cluster. Run the cluster identification script on your most recent graded run log. Do you have a cluster with 5+ failures that share a consistent pattern?
-
Curate training data. For your largest cluster, export the failures and write corrected outputs. How long does correction take per example? This time cost is part of the optimization tax.
-
Train and evaluate. Run SFT on your corrected data. Did the cluster improve? Did anything regress?
-
Write a postmortem. Answer the five postmortem questions. Was fine-tuning the right rung for this problem?
-
Compare SFT vs. DPO (stretch). For the same failure cluster, create DPO preference pairs (use the original wrong answer as rejected, your correction as chosen). Train with DPO. Does it produce different results than SFT?
Completion checkpoint
You're done with this lesson when you can:
- Explain when fine-tuning is justified and when it's premature
- Curate training data from failure clusters in your run logs
- Run SFT with LoRA/QLoRA on corrected examples
- Evaluate for both improvement on the targeted cluster and regression on the broader benchmark
- Articulate the difference between SFT and DPO and when each is appropriate
- Write a fine-tuning postmortem that justifies keeping or discarding the result
What's next
Keep Decision Rules nearby as you keep building, and use Portfolio Milestones when you're ready to turn the work into something public.
You now have the full path behind you. Keep measuring before optimizing, and keep choosing the cheapest intervention that actually addresses the problem.
References
- PEFT documentation — parameter-efficient fine-tuning library (build with this)
- TRL documentation — SFT and DPO training (build with this)
- Unsloth documentation — optimized training kernels (build with this)
- Hardware and Model Size Guide — VRAM requirements for fine-tuning (start here)
- Model Selection and Serving — choosing models and hosting options (deep dive)
- Decision Rules — the full optimization decision framework (start here)
- OpenAI Fine-tuning Guide — hosted fine-tuning API (deep dive)
- Google Cloud: Fine-tuning LLMs — broad overview of fine-tuning types, tradeoffs, and best practices (deep dive)