The Optimization Ladder
You've built a system that retrieves code, generates grounded answers, tracks costs, runs evals, orchestrates specialists, and maintains memory across sessions. It works. But "works" and "works efficiently" aren't the same thing. Some answers take longer than they should. Some cost more than they need to. Some failure patterns repeat across runs even after prompt tweaks.
This lesson introduces a decision framework for improving your system's behavior without reaching for the most expensive tool first: the optimization ladder. It has five rungs, ordered by cost and reversibility. Most problems resolve on the first two. Distillation and fine-tuning are powerful, but they're the last rungs, not the first. We'll walk through each level, establish decision rules for when to advance, and build a diagnostic that tells you which rung to try next.
What you'll learn
- The five rungs of the optimization ladder and why ordering matters
- Decision rules for when to move from one rung to the next
- How to diagnose whether a failure is a prompt problem, a retrieval problem, a context problem, or a model problem
- The cost, reversibility, and data requirements at each level
- When distillation and fine-tuning are justified, and when they're premature
Concepts
The optimization ladder — five intervention levels for improving AI system behavior, ordered from cheapest and most reversible to most expensive and most permanent:
- Prompt engineering — rewrite prompts, add constraints, improve output schemas
- Retrieval improvement — better chunking, indexing, routing, or method selection
- Context engineering — restructure what goes into the context window and how
- Distillation — train a smaller model to reproduce a larger model's behavior on bounded tasks
- Fine-tuning — update model weights on task-specific data for persistent adaptation
Each rung has different cost, reversibility, and data requirements. The ladder exists because engineers routinely skip to fine-tuning when the real problem was retrieving the wrong evidence or stuffing too much context into the window.
Reversibility — how easily you can undo an optimization. Prompt changes are instantly reversible: swap the old prompt back. Retrieval changes require re-indexing but don't touch the model. Context engineering changes are structural but still code-level. Distillation produces a new model artifact that you can discard but can't partially undo. Fine-tuning modifies weights in ways that may interact unpredictably with other behaviors. The less reversible the intervention, the more confidence you need before applying it.
Failure attribution — diagnosing which system component is responsible for a bad outcome. A wrong answer could be caused by:
- Prompt failure: the model had the right evidence but was poorly instructed
- Retrieval failure: the right evidence wasn't in the context window
- Context failure: too much evidence diluted the signal, or evidence was poorly structured
- Model failure: the model lacks the capability for this task class even with perfect context
Your eval data from Module 6 already contains the signals you need. Retrieval evals tell you whether the right files appeared. Answer evals tell you whether the model used the evidence well. The gap between these two signals is your failure attribution.
Optimization tax — the ongoing cost of maintaining an optimization. Prompt changes have near-zero tax: they live in your code and deploy with the application. A fine-tuned model has high tax: you need to retrain when the base model updates, manage model artifacts, and monitor for drift. Every rung of the ladder adds maintenance cost. The optimization tax should be proportional to the value gained.
Problem-to-Tool Map
| Problem class | Symptom | Cheapest rung to try | When to escalate |
|---|---|---|---|
| Output format inconsistency | Model occasionally ignores schema constraints | Prompt engineering: tighter output schema | Format failures persist across prompt variants |
| Wrong evidence retrieved | Answer is wrong because key files are missing | Retrieval improvement: better indexing or routing | Retrieval evals show ceiling with current methods |
| Right evidence, wrong answer | Files are in context but answer doesn't use them | Context engineering: restructure evidence presentation | Multiple context structures produce the same failure |
| Expensive correct answers | Answers are right but cost too much | Distillation: compress stable behavior to cheaper model | The task is stable and bounded with eval coverage |
| Persistent failure cluster | Same error pattern survives prompt, retrieval, and context fixes | Fine-tuning: bake correct behavior into weights | You have quality training data and the task is stable |
The five rungs
Rung 1: Prompt engineering
Cost: Near zero. You're editing text. Reversibility: Instant. Swap the prompt back. Data required: Your existing benchmark questions and a few failure examples.
This is where you've been working since Module 1. Prompt engineering covers:
- Rewriting instructions for clarity
- Adding or tightening output schemas
- Decomposing complex prompts into multi-step chains
- Adding few-shot examples
- Constraining the model's behavior with explicit rules
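To make the first two bullets concrete, here is a minimal sketch of tightening an output schema. The prompt text and `build_prompt` helper are hypothetical illustrations, not part of any framework from earlier modules:

```python
# Hypothetical Rung 1 example: a loose prompt leaves output format to
# chance; a tight one pins down structure and a fallback behavior.

LOOSE_PROMPT = "Summarize what this function does."

TIGHT_PROMPT = """Summarize what this function does.

Respond with exactly this JSON structure and nothing else:
{
  "summary": "<one sentence>",
  "side_effects": ["<side effects, empty list if none>"],
  "cited_lines": ["<file:line references used as evidence>"]
}
If the evidence does not support an answer, set "summary" to "insufficient evidence".
"""

def build_prompt(question: str, tight: bool = True) -> str:
    """Assemble the instruction block plus the user question."""
    instructions = TIGHT_PROMPT if tight else LOOSE_PROMPT
    return f"{instructions}\n\nQuestion: {question}"

print(build_prompt("What does load_run_log do?", tight=True))
```

The tight variant costs a few extra input tokens but eliminates most format-class failures before you ever consider climbing a rung.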
When it's enough: The model has the right information and just needs clearer instructions. Format issues, minor behavior drift, and instruction-following problems usually resolve here.
When to climb: The model follows instructions correctly but the instructions can't compensate for missing information, or the same failure repeats despite multiple prompt variants.
Rung 2: Retrieval improvement
Cost: Low to moderate. Re-indexing, new chunking strategies, or adding a retrieval method. Reversibility: High. You're changing the retrieval pipeline, not the model. Data required: Your benchmark questions with expected-file labels from Module 2.
Retrieval improvement covers everything from Module 4:
- Changing chunk size or overlap
- Adding a retrieval method (grep, AST index, graph)
- Improving the embedding model
- Adding a reranker
- Adjusting retrieval routing between substrates
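As a sketch of the "adding a retrieval method" bullet, here is one way to merge exact-match grep hits with semantic hits. Both retriever functions are hypothetical stand-ins (the semantic one is a naive word-overlap proxy, not a real embedding search), so treat this as shape, not implementation:

```python
# Minimal Rung 2 sketch: union two retrieval methods, grep hits first,
# since exact symbol matches tend to be high precision.

def grep_hits(query: str, corpus: dict[str, str]) -> list[str]:
    """Exact substring match over file contents."""
    return [path for path, text in corpus.items() if query in text]

def semantic_hits(query: str, corpus: dict[str, str]) -> list[str]:
    """Stand-in for embedding retrieval: naive word overlap scoring."""
    words = set(query.lower().split())
    scored = [
        (len(words & set(text.lower().split())), path)
        for path, text in corpus.items()
    ]
    return [path for score, path in sorted(scored, reverse=True) if score > 0]

def hybrid_retrieve(query: str, corpus: dict[str, str], k: int = 5) -> list[str]:
    """Order-preserving de-duplicated union of both methods."""
    seen: dict[str, None] = {}
    for path in grep_hits(query, corpus) + semantic_hits(query, corpus):
        seen.setdefault(path, None)
    return list(seen)[:k]

corpus = {
    "auth.py": "def login(user): check_token(user)",
    "billing.py": "def charge(user, amount): ...",
}
print(hybrid_retrieve("check_token", corpus))  # → ['auth.py']
```

The design choice worth noting: de-duplication preserves the grep-first ordering, so exact matches stay at the top even when the semantic method also finds them.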
When it's enough: The model produces good answers when given the right evidence, but retrieval misses key files or returns too much noise.
When to climb: Retrieval evals show a ceiling. You've tried multiple retrieval methods and routing strategies, and the right evidence still doesn't appear consistently. Or retrieval is good but the model still fails.
Rung 3: Context engineering
Cost: Moderate. Structural changes to your pipeline. Reversibility: High. These are code changes, not model changes. Data required: Trace data showing what's in the context window when failures happen.
Context engineering covers the work from Module 4's context compilation and Module 5's evidence bundles:
- Restructuring how evidence is presented in the prompt
- Compressing or summarizing evidence to reduce noise
- Ordering evidence by relevance
- Splitting complex questions into sub-questions with focused context
- Adjusting token budgets between evidence sections
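Two of those moves, ordering by relevance and enforcing a token budget, can be sketched in one small function. The evidence items, scores, and the word-count token proxy are all hypothetical simplifications:

```python
# Minimal Rung 3 sketch: most relevant evidence first, trimmed to a
# token budget. Word count stands in for a real tokenizer.

def compile_context(evidence: list[dict], budget_tokens: int) -> str:
    """Order evidence by score and include items until the budget is spent."""
    ordered = sorted(evidence, key=lambda e: e["score"], reverse=True)
    sections, used = [], 0
    for item in ordered:
        cost = len(item["text"].split())  # crude token estimate
        if used + cost > budget_tokens:
            continue  # skip oversized items, keep scanning smaller ones
        sections.append(f"### {item['path']}\n{item['text']}")
        used += cost
    return "\n\n".join(sections)

evidence = [
    {"path": "utils.py", "score": 0.4, "text": "helper functions " * 50},
    {"path": "core.py", "score": 0.9, "text": "the function under question"},
]
ctx = compile_context(evidence, budget_tokens=40)
print(ctx.splitlines()[0])  # → ### core.py
```

Note the failure mode this prevents: without the budget check, the low-relevance 100-word blob from `utils.py` would dilute the high-relevance evidence, which is exactly the "too much noise" symptom described above.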
When it's enough: Retrieval finds the right evidence but the model gets confused by how it's presented, like too much context, poor ordering, or competing signals from different evidence types.
When to climb: You've restructured context multiple ways and the failure pattern persists. The model consistently fails on a task class even with well-structured, relevant evidence.
Rung 4: Distillation
Cost: Significant. Requires teacher data collection, training infrastructure, and model management. Reversibility: Medium. You can discard the student model, but training time and compute are sunk costs. Data required: High-quality teacher outputs on a bounded task set.
Distillation trains a smaller, cheaper model to reproduce the behavior of a larger model on specific tasks. The next lesson covers the full workflow.
When it's justified:
- A task is stable and bounded (not changing weekly)
- The teacher model produces consistently good outputs
- You need to reduce cost or latency for that task
- You have eval coverage to verify the student matches the teacher
When it's premature:
- The teacher behavior is still being tuned
- You don't have evals to measure whether distillation preserved quality
- The cost savings don't justify the training and maintenance overhead
Rung 5: Fine-tuning
Cost: High. Training data curation, training infrastructure, ongoing model management. Reversibility: Low. Weight changes can have unpredictable effects on other behaviors. Data required: Curated task-specific examples, ideally from your run logs and eval results.
Fine-tuning modifies the model's weights to bake in task-specific behavior. The fine-tuning lesson covers the mechanics.
When it's justified:
- A repeated failure cluster survives prompt, retrieval, and context fixes
- The task is stable enough that retraining won't be needed frequently
- You have enough high-quality examples (hundreds to thousands)
- Your evals can verify improvement without regression
When it's premature:
- You haven't tried the cheaper rungs thoroughly
- Your eval suite is weak (you can't measure whether the fine-tune helped)
- The task is still changing shape
- Your training data is noisy or small
Walkthrough
Building a failure diagnostic
Your eval data from Module 6 already contains the signals you need to attribute failures to the right rung. Here's a diagnostic that reads your run logs and tells you where to focus:
# optimization/failure_diagnostic.py
"""Analyze run-log results to recommend which optimization rung to try.
Reads a graded run log and categorizes failures by likely cause:
prompt, retrieval, context, or model capability.
"""
import json
from pathlib import Path
def load_run_log(path: str) -> list[dict]:
"""Load a JSONL run log."""
entries = []
with open(path) as f:
for line in f:
line = line.strip()
if line:
entries.append(json.loads(line))
return entries
def diagnose_failures(entries: list[dict]) -> dict:
"""Categorize failures by optimization rung.
Uses retrieval eval labels and answer eval labels to attribute
each failure to the most likely cause.
"""
categories = {
"prompt": [], # Rung 1: right evidence, format/instruction issue
"retrieval": [], # Rung 2: wrong or missing evidence
"context": [], # Rung 3: right evidence, wrong presentation
"model": [], # Rung 4-5: persistent failure with good context
"passing": [], # No failure
}
for entry in entries:
grade = entry.get("grade", "")
question_id = entry.get("question_id", "unknown")
if grade in ("correct", "acceptable"):
categories["passing"].append(question_id)
continue
retrieval_hit = entry.get("retrieval_hit", None)
failure_label = entry.get("failure_label", "")
# Retrieval missed the target files entirely
if retrieval_hit is False or failure_label == "missing_evidence":
categories["retrieval"].append(question_id)
# Evidence was present but answer had format/instruction issues
elif failure_label in ("wrong_format", "partial_answer"):
categories["prompt"].append(question_id)
# Evidence was present, model cited some but missed key parts
elif failure_label == "incomplete_evidence_use":
categories["context"].append(question_id)
# Evidence was present, instructions were clear, model still failed
elif retrieval_hit is True:
categories["model"].append(question_id)
# Can't attribute — default to prompt (cheapest to try)
else:
categories["prompt"].append(question_id)
return categories
def recommend(categories: dict) -> list[str]:
"""Produce ordered recommendations based on failure distribution."""
recommendations = []
total_failures = sum(
len(v) for k, v in categories.items() if k != "passing"
)
if total_failures == 0:
return ["No failures detected. System is performing well."]
for rung, label in [
("retrieval", "Rung 2 — Retrieval improvement"),
("prompt", "Rung 1 — Prompt engineering"),
("context", "Rung 3 — Context engineering"),
("model", "Rung 4/5 — Distillation or fine-tuning"),
]:
count = len(categories[rung])
if count > 0:
pct = count / total_failures * 100
recommendations.append(
f"{label}: {count} failures ({pct:.0f}%) — "
f"question IDs: {categories[rung][:5]}"
+ (" ..." if count > 5 else "")
)
return recommendations
if __name__ == "__main__":
import sys
if len(sys.argv) < 2:
print("Usage: python failure_diagnostic.py <run_log.jsonl>")
sys.exit(1)
log_path = sys.argv[1]
entries = load_run_log(log_path)
categories = diagnose_failures(entries)
print(f"\n{'='*60}")
print("Failure Diagnostic — Optimization Ladder")
print(f"{'='*60}")
print(f"Total entries: {len(entries)}")
print(f"Passing: {len(categories['passing'])}")
print(f"Failures: {sum(len(v) for k, v in categories.items() if k != 'passing')}")
print(f"\nRecommendations:")
print("-" * 40)
for rec in recommend(categories):
print(f" {rec}")
    print()

Run it against a graded run log:
python optimization/failure_diagnostic.py runs/baseline_graded.jsonl

Expected output (your numbers will vary):
============================================================
Failure Diagnostic — Optimization Ladder
============================================================
Total entries: 15
Passing: 9
Failures: 6
Recommendations:
----------------------------------------
Rung 2 — Retrieval improvement: 3 failures (50%) — question IDs: ['q3', 'q7', 'q11']
Rung 1 — Prompt engineering: 2 failures (33%) — question IDs: ['q5', 'q14']
Rung 4/5 — Distillation or fine-tuning: 1 failures (17%) — question IDs: ['q9']
This tells you where to focus: fix retrieval first (3 failures), then prompt engineering (2 failures). The single model-attributed failure isn't worth distillation or fine-tuning just yet. Revisit it after the cheaper fixes.
Reading the diagnostic
The diagnostic uses a simple attribution hierarchy:
- Missing evidence → retrieval problem. If the target files aren't in the context, no prompt or model change will help.
- Evidence present but format/instruction failure → prompt problem. The model had what it needed but wasn't instructed well enough.
- Evidence present but poorly used → context problem. The evidence was there but the model couldn't navigate it, likely due to too much noise or poor structure.
- Evidence present, instructions clear, still wrong → model problem. This is the only category where distillation or fine-tuning is a reasonable next step.
Most systems in early development have failures concentrated in the first two categories. That's normal and good news because those are the cheapest to fix.
Decision rules in practice
Here's how the diagnostic maps to action:
| Diagnostic result | Action | Lesson reference |
|---|---|---|
| >30% retrieval failures | Improve indexing, add retrieval methods, or tune routing | Module 4: Code Retrieval |
| >30% prompt failures | Rewrite prompts, add schemas, or decompose into steps | Module 1: Prompt Engineering |
| >20% context failures | Restructure evidence presentation or adjust token budgets | Module 4: Context Compilation |
| >20% model failures after fixing above | Consider distillation for cost, fine-tuning for capability | Lesson 8.2 and Lesson 8.3 |
Treat those thresholds as starting points instead of hard rules. The principle here is to fix the cheapest category first, re-run the benchmark, and see if the model-attributed failures shrink. Often they do, because what looked like a model problem was actually retrieval noise that made the task harder than it needed to be.
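The fix-then-re-run loop can be checked numerically: compare the failure distribution before and after a fix and see whether the model-attributed count shrank. The category counts below are hypothetical:

```python
# Compare two diagnostic runs; negative values mean that category improved.

def distribution_shift(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Per-category change in failure counts between two benchmark runs."""
    return {k: after.get(k, 0) - before.get(k, 0) for k in set(before) | set(after)}

before = {"retrieval": 3, "prompt": 2, "model": 1}
after = {"retrieval": 0, "prompt": 2, "model": 0}  # after a retrieval fix

shift = distribution_shift(before, after)
print(shift["model"])  # → -1
```

In this (made-up) run, the lone "model problem" disappeared along with the retrieval failures, which is the pattern the paragraph above predicts.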
The cost-reversibility tradeoff
| Rung | One-time cost | Ongoing tax | Reversibility | Data needed |
|---|---|---|---|---|
| Prompt engineering | Minutes to hours | Near zero | Instant | Benchmark + failure examples |
| Retrieval improvement | Hours to days | Re-index on corpus changes | High (code changes) | Benchmark with expected-file labels |
| Context engineering | Hours to days | Moderate (pipeline changes) | High (code changes) | Trace data from failing runs |
| Distillation | Days to weeks | Retrain when teacher changes | Medium (discard student) | Hundreds of teacher outputs |
| Fine-tuning | Days to weeks | Retrain on base model updates | Low (weight changes are opaque) | Hundreds to thousands of curated examples |
The table makes the ordering concrete: each rung down costs more, takes longer, and is harder to undo. Distillation and fine-tuning should be justified by evidence that the cheaper interventions have been tried and measured.
Exercises
- Run the failure diagnostic on your most recent graded run log. What does the distribution look like? Does it match your intuition about where the system is weakest?
- Simulate a rung climb. Pick the category with the most failures. Apply a fix at that rung (better prompt, better retrieval, etc.). Re-run the benchmark and the diagnostic. Did the distribution shift?
- Cost estimation. For your top failure category, estimate the cost (in time and compute) of fixing it at the recommended rung versus skipping ahead to fine-tuning. When is the skip justified?
- Optimization tax audit. List every optimization currently in your system (prompt caching, model routing, token budgets, etc.). For each one, note the maintenance cost. Are any optimizations costing more to maintain than they save?
Completion checkpoint
You're done with this lesson when you can:
- Name the five rungs in order and explain why the ordering matters
- Run the failure diagnostic on a graded run log and interpret the results
- Attribute a failure to the correct rung using retrieval and answer eval signals
- Explain why fine-tuning is the last rung, not the first
- Articulate the optimization tax for each rung
What's next
Distillation. Most failures should still be fixed earlier in the ladder, but when a bounded workflow already works and is too expensive, the next lesson shows how to compress it.
References
- Decision Rules — the quick-lookup version of the optimization ladder (start here)
- Cost and Reliability Patterns — operational patterns that reduce cost before you reach for distillation (build with this)
- Eval Taxonomy — the four eval families that power failure attribution (build with this)
- Debugging Heuristics — symptom-to-fix mapping when things go wrong (deep dive)