The Optimization Ladder
You've built a system that retrieves code, generates grounded answers, tracks costs, runs evals, orchestrates specialists, and maintains memory across sessions. It works. But "works" and "works efficiently" aren't the same thing. Some answers take longer than they should. Some cost more than they need to. Some failure patterns repeat across runs even after prompt tweaks.
This lesson introduces a decision framework for improving your system's behavior without reaching for the most expensive tool first: the optimization ladder. It has five rungs, ordered by cost and reversibility. Most problems resolve on the first two. Distillation and fine-tuning are powerful, but they're the last rungs, not the first. We'll walk through each level, establish decision rules for when to advance, and build a diagnostic that tells you which rung to try next.
What you'll learn
- The five rungs of the optimization ladder and why ordering matters
- Decision rules for when to move from one rung to the next
- How to diagnose whether a failure is a prompt problem, a retrieval problem, a context problem, or a model problem
- The cost, reversibility, and data requirements at each level
- When distillation and fine-tuning are justified, and when they're premature
Concepts
The optimization ladder — five intervention levels for improving AI system behavior, ordered from cheapest and most reversible to most expensive and most permanent:
- Prompt engineering — rewrite prompts, add constraints, improve output schemas
- Retrieval improvement — better chunking, indexing, routing, or method selection
- Context engineering — restructure what goes into the context window and how
- Distillation — train a smaller model to reproduce a larger model's behavior on bounded tasks
- Fine-tuning — update model weights on task-specific data for persistent adaptation
Each rung has different cost, reversibility, and data requirements. The ladder exists because engineers routinely skip to fine-tuning when the real problem was retrieving the wrong evidence or stuffing too much context into the window.
Reversibility — how easily you can undo an optimization. Prompt changes are instantly reversible: swap the old prompt back. Retrieval changes require re-indexing but don't touch the model. Context engineering changes are structural but still code-level. Distillation produces a new model artifact that you can discard but can't partially undo. Fine-tuning modifies weights in ways that may interact unpredictably with other behaviors. The less reversible the intervention, the more confidence you need before applying it.
Failure attribution — diagnosing which system component is responsible for a bad outcome. A wrong answer could be caused by:
- Prompt failure: the model had the right evidence but was poorly instructed
- Retrieval failure: the right evidence wasn't in the context window
- Context failure: too much evidence diluted the signal, or evidence was poorly structured
- Model failure: the model lacks the capability for this task class even with perfect context
Your eval data from Module 6 already contains the signals you need. Retrieval evals tell you whether the right files appeared. Answer evals tell you whether the model used the evidence well. The gap between these two signals is your failure attribution.
Optimization tax — the ongoing cost of maintaining an optimization. Prompt changes have near-zero tax: they live in your code and deploy with the application. A fine-tuned model has high tax: you need to retrain when the base model updates, manage model artifacts, and monitor for drift. Every rung of the ladder adds maintenance cost. The optimization tax should be proportional to the value gained.
Problem-to-Tool Map
| Problem class | Symptom | Cheapest rung to try | When to escalate |
|---|---|---|---|
| Output format inconsistency | Model occasionally ignores schema constraints | Prompt engineering: tighter output schema | Format failures persist across prompt variants |
| Wrong evidence retrieved | Answer is wrong because key files are missing | Retrieval improvement: better indexing or routing | Retrieval evals show ceiling with current methods |
| Right evidence, wrong answer | Files are in context but answer doesn't use them | Context engineering: restructure evidence presentation | Multiple context structures produce the same failure |
| Expensive correct answers | Answers are right but cost too much | Distillation: compress stable behavior to cheaper model | The task is stable and bounded with eval coverage |
| Persistent failure cluster | Same error pattern survives prompt, retrieval, and context fixes | Fine-tuning: bake correct behavior into weights | You have quality training data and the task is stable |
The five rungs
Rung 1: Prompt engineering
Cost: Near zero. You're editing text. Reversibility: Instant. Swap the prompt back. Data required: Your existing benchmark questions and a few failure examples.
This is where you've been working since Module 1. Prompt engineering covers:
- Rewriting instructions for clarity
- Adding or tightening output schemas
- Decomposing complex prompts into multi-step chains
- Adding few-shot examples
- Constraining the model's behavior with explicit rules
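To make the first two bullets concrete, here is a minimal sketch of tightening an output schema. The prompt text and `build_prompt` helper are hypothetical illustrations, not part of any framework from earlier modules:

```python
# Hypothetical Rung 1 example: a loose prompt leaves output format to
# chance; a tight one pins down structure and a fallback behavior.

LOOSE_PROMPT = "Summarize what this function does."

TIGHT_PROMPT = """Summarize what this function does.

Respond with exactly this JSON structure and nothing else:
{
  "summary": "<one sentence>",
  "side_effects": ["<side effects, empty list if none>"],
  "cited_lines": ["<file:line references used as evidence>"]
}
If the evidence does not support an answer, set "summary" to "insufficient evidence".
"""

def build_prompt(question: str, tight: bool = True) -> str:
    """Assemble the instruction block plus the user question."""
    instructions = TIGHT_PROMPT if tight else LOOSE_PROMPT
    return f"{instructions}\n\nQuestion: {question}"

print(build_prompt("What does load_run_log do?", tight=True))
```

The tight variant costs a few extra input tokens but eliminates most format-class failures before you ever consider climbing a rung.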
When it's enough: The model has the right information and just needs clearer instructions. Format issues, minor behavior drift, and instruction-following problems usually resolve here.
When to climb: The model follows instructions correctly but the instructions can't compensate for missing information, or the same failure repeats despite multiple prompt variants.
Rung 2: Retrieval improvement
Cost: Low to moderate. Re-indexing, new chunking strategies, or adding a retrieval method. Reversibility: High. You're changing the retrieval pipeline, not the model. Data required: Your benchmark questions with expected-file labels from Module 2.
Retrieval improvement covers everything from Module 4:
- Changing chunk size or overlap
- Adding a retrieval method (grep, AST index, graph)
- Improving the embedding model
- Adding a reranker
- Adjusting retrieval routing between substrates
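As a sketch of the "adding a retrieval method" bullet, here is one way to merge exact-match grep hits with semantic hits. Both retriever functions are hypothetical stand-ins (the semantic one is a naive word-overlap proxy, not a real embedding search), so treat this as shape, not implementation:

```python
# Minimal Rung 2 sketch: union two retrieval methods, grep hits first,
# since exact symbol matches tend to be high precision.

def grep_hits(query: str, corpus: dict[str, str]) -> list[str]:
    """Exact substring match over file contents."""
    return [path for path, text in corpus.items() if query in text]

def semantic_hits(query: str, corpus: dict[str, str]) -> list[str]:
    """Stand-in for embedding retrieval: naive word overlap scoring."""
    words = set(query.lower().split())
    scored = [
        (len(words & set(text.lower().split())), path)
        for path, text in corpus.items()
    ]
    return [path for score, path in sorted(scored, reverse=True) if score > 0]

def hybrid_retrieve(query: str, corpus: dict[str, str], k: int = 5) -> list[str]:
    """Order-preserving de-duplicated union of both methods."""
    seen: dict[str, None] = {}
    for path in grep_hits(query, corpus) + semantic_hits(query, corpus):
        seen.setdefault(path, None)
    return list(seen)[:k]

corpus = {
    "auth.py": "def login(user): check_token(user)",
    "billing.py": "def charge(user, amount): ...",
}
print(hybrid_retrieve("check_token", corpus))  # → ['auth.py']
```

The design choice worth noting: de-duplication preserves the grep-first ordering, so exact matches stay at the top even when the semantic method also finds them.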
When it's enough: The model produces good answers when given the right evidence, but retrieval misses key files or returns too much noise.
When to climb: Retrieval evals show a ceiling. You've tried multiple retrieval methods and routing strategies, and the right evidence still doesn't appear consistently. Or retrieval is good but the model still fails.
Rung 3: Context engineering
Cost: Moderate. Structural changes to your pipeline. Reversibility: High. These are code changes, not model changes. Data required: Trace data showing what's in the context window when failures happen.
Context engineering covers the work from Module 4's context compilation and Module 5's evidence bundles:
- Restructuring how evidence is presented in the prompt
- Compressing or summarizing evidence to reduce noise
- Ordering evidence by relevance
- Splitting complex questions into sub-questions with focused context
- Adjusting token budgets between evidence sections
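Two of those moves, ordering by relevance and enforcing a token budget, can be sketched in one small function. The evidence items, scores, and the word-count token proxy are all hypothetical simplifications:

```python
# Minimal Rung 3 sketch: most relevant evidence first, trimmed to a
# token budget. Word count stands in for a real tokenizer.

def compile_context(evidence: list[dict], budget_tokens: int) -> str:
    """Order evidence by score and include items until the budget is spent."""
    ordered = sorted(evidence, key=lambda e: e["score"], reverse=True)
    sections, used = [], 0
    for item in ordered:
        cost = len(item["text"].split())  # crude token estimate
        if used + cost > budget_tokens:
            continue  # skip oversized items, keep scanning smaller ones
        sections.append(f"### {item['path']}\n{item['text']}")
        used += cost
    return "\n\n".join(sections)

evidence = [
    {"path": "utils.py", "score": 0.4, "text": "helper functions " * 50},
    {"path": "core.py", "score": 0.9, "text": "the function under question"},
]
ctx = compile_context(evidence, budget_tokens=40)
print(ctx.splitlines()[0])  # → ### core.py
```

Note the failure mode this prevents: without the budget check, the low-relevance 100-word blob from `utils.py` would dilute the high-relevance evidence, which is exactly the "too much noise" symptom described above.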
When it's enough: Retrieval finds the right evidence but the model gets confused by how it's presented, like too much context, poor ordering, or competing signals from different evidence types.
When to climb: You've restructured context multiple ways and the failure pattern persists. The model consistently fails on a task class even with well-structured, relevant evidence.
Rung 4: Distillation
Cost: Significant. Requires teacher data collection, training infrastructure, and model management. Reversibility: Medium. You can discard the student model, but training time and compute are sunk costs. Data required: High-quality teacher outputs on a bounded task set.
Distillation trains a smaller, cheaper model to reproduce the behavior of a larger model on specific tasks. The next lesson covers the full workflow.
When it's justified:
- A task is stable and bounded (not changing weekly)
- The teacher model produces consistently good outputs
- You need to reduce cost or latency for that task
- You have eval coverage to verify the student matches the teacher
When it's premature:
- The teacher behavior is still being tuned
- You don't have evals to measure whether distillation preserved quality
- The cost savings don't justify the training and maintenance overhead
Rung 5: Fine-tuning
Cost: High. Training data curation, training infrastructure, ongoing model management. Reversibility: Low. Weight changes can have unpredictable effects on other behaviors. Data required: Curated task-specific examples, ideally from your run logs and eval results.
Fine-tuning modifies the model's weights to bake in task-specific behavior. The fine-tuning lesson covers the mechanics.
When it's justified:
- A repeated failure cluster survives prompt, retrieval, and context fixes
- The task is stable enough that retraining won't be needed frequently
- You have enough high-quality examples (hundreds to thousands)
- Your evals can verify improvement without regression
When it's premature:
- You haven't tried the cheaper rungs thoroughly
- Your eval suite is weak (you can't measure whether the fine-tune helped)
- The task is still changing shape
- Your training data is noisy or small
Walkthrough
Building a failure diagnostic
Your eval data from Module 6 already contains the signals you need to attribute failures to the right rung. Here's a diagnostic that reads your run logs and tells you where to focus:
# optimization/failure_diagnostic.py
"""Analyze run-log results to recommend which optimization rung to try.
Reads a graded run log and categorizes failures by likely cause:
prompt, retrieval, context, or model capability.
"""
import json
from pathlib import Path
def load_run_log(path: str) -> list[dict]:
"""Load a JSONL run log."""
entries = []
with open(path) as f:
for line in f:
line = line.strip()
if line:
entries.append(json.loads(line))
return entries
def diagnose_failures(entries: list[dict]) -> dict:
"""Categorize failures by optimization rung.
Uses retrieval eval labels and answer eval labels to attribute
each failure to the most likely cause.
"""
categories = {
"prompt": [], # Rung 1: right evidence, format/instruction issue
"retrieval": [], # Rung 2: wrong or missing evidence
"context": [], # Rung 3: right evidence, wrong presentation
"model": [], # Rung 4-5: persistent failure with good context
"passing": [], # No failure
}
for entry in entries:
grade = entry.get("grade", "")
question_id = entry.get("question_id", "unknown")
if grade in ("correct", "acceptable"):
categories["passing"].append(question_id)
continue
retrieval_hit = entry.get("retrieval_hit", None)
failure_label = entry.get("failure_label", "")
# Retrieval missed the target files entirely
if retrieval_hit is False or failure_label == "missing_evidence":
categories["retrieval"].append(question_id)
# Evidence was present but answer had format/instruction issues
elif failure_label in ("wrong_format", "partial_answer"):
categories["prompt"].append(question_id)
# Evidence was present, model cited some but missed key parts
elif failure_label == "incomplete_evidence_use":
categories["context"].append(question_id)
# Evidence was present, instructions were clear, model still failed
elif retrieval_hit is True:
categories["model"].append(question_id)
# Can't attribute — default to prompt (cheapest to try)
else:
categories["prompt"].append(question_id)
return categories
def recommend(categories: dict) -> list[str]:
"""Produce ordered recommendations based on failure distribution."""
recommendations = []
total_failures = sum(
len(v) for k, v in categories.items() if k != "passing"
)
if total_failures == 0:
return ["No failures detected. System is performing well."]
for rung, label in [
("retrieval", "Rung 2 — Retrieval improvement"),
("prompt", "Rung 1 — Prompt engineering"),
("context", "Rung 3 — Context engineering"),
("model", "Rung 4/5 — Distillation or fine-tuning"),
]:
count = len(categories[rung])
if count > 0:
pct = count / total_failures * 100
recommendations.append(
f"{label}: {count} failures ({pct:.0f}%) — "
f"question IDs: {categories[rung][:5]}"
+ (" ..." if count > 5 else "")
)
return recommendations
if __name__ == "__main__":
import sys
if len(sys.argv) < 2:
print("Usage: python failure_diagnostic.py <run_log.jsonl>")
sys.exit(1)
log_path = sys.argv[1]
entries = load_run_log(log_path)
categories = diagnose_failures(entries)
print(f"\n{'='*60}")
print("Failure Diagnostic — Optimization Ladder")
print(f"{'='*60}")
print(f"Total entries: {len(entries)}")
print(f"Passing: {len(categories['passing'])}")
print(f"Failures: {sum(len(v) for k, v in categories.items() if k != 'passing')}")
print(f"\nRecommendations:")
print("-" * 40)
for rec in recommend(categories):
print(f" {rec}")
    print()

Run it against a graded run log:
python optimization/failure_diagnostic.py runs/baseline_graded.jsonl

Expected output (your numbers will vary):
============================================================
Failure Diagnostic — Optimization Ladder
============================================================
Total entries: 15
Passing: 9
Failures: 6
Recommendations:
----------------------------------------
Rung 2 — Retrieval improvement: 3 failures (50%) — question IDs: ['q3', 'q7', 'q11']
Rung 1 — Prompt engineering: 2 failures (33%) — question IDs: ['q5', 'q14']
Rung 4/5 — Distillation or fine-tuning: 1 failures (17%) — question IDs: ['q9']
This tells you where to focus: fix retrieval first (3 failures), then prompt engineering (2 failures). The single model-attributed failure isn't worth distillation or fine-tuning just yet. Revisit it after the cheaper fixes.
Reading the diagnostic
The diagnostic uses a simple attribution hierarchy:
- Missing evidence → retrieval problem. If the target files aren't in the context, no prompt or model change will help.
- Evidence present but format/instruction failure → prompt problem. The model had what it needed but wasn't instructed well enough.
- Evidence present but poorly used → context problem. The evidence was there but the model couldn't navigate it, likely due to too much noise or poor structure.
- Evidence present, instructions clear, still wrong → model problem. This is the only category where distillation or fine-tuning is a reasonable next step.
Most systems in early development have failures concentrated in the first two categories. That's normal and good news because those are the cheapest to fix.
Decision rules in practice
Here's how the diagnostic maps to action:
| Diagnostic result | Action | Lesson reference |
|---|---|---|
| >30% retrieval failures | Improve indexing, add retrieval methods, or tune routing | Module 4: Code Retrieval |
| >30% prompt failures | Rewrite prompts, add schemas, or decompose into steps | Module 1: Prompt Engineering |
| >20% context failures | Restructure evidence presentation or adjust token budgets | Module 4: Context Compilation |
| >20% model failures after fixing above | Consider distillation for cost, fine-tuning for capability | Lesson 8.2 and Lesson 8.3 |
Treat those thresholds as starting points instead of hard rules. The principle here is to fix the cheapest category first, re-run the benchmark, and see if the model-attributed failures shrink. Often they do, because what looked like a model problem was actually retrieval noise that made the task harder than it needed to be.
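The fix-then-re-run loop can be checked numerically: compare the failure distribution before and after a fix and see whether the model-attributed count shrank. The category counts below are hypothetical:

```python
# Compare two diagnostic runs; negative values mean that category improved.

def distribution_shift(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Per-category change in failure counts between two benchmark runs."""
    return {k: after.get(k, 0) - before.get(k, 0) for k in set(before) | set(after)}

before = {"retrieval": 3, "prompt": 2, "model": 1}
after = {"retrieval": 0, "prompt": 2, "model": 0}  # after a retrieval fix

shift = distribution_shift(before, after)
print(shift["model"])  # → -1
```

In this (made-up) run, the lone "model problem" disappeared along with the retrieval failures, which is the pattern the paragraph above predicts.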
The cost-reversibility tradeoff
| Rung | One-time cost | Ongoing tax | Reversibility | Data needed |
|---|---|---|---|---|
| Prompt engineering | Minutes to hours | Near zero | Instant | Benchmark + failure examples |
| Retrieval improvement | Hours to days | Re-index on corpus changes | High (code changes) | Benchmark with expected-file labels |
| Context engineering | Hours to days | Moderate (pipeline changes) | High (code changes) | Trace data from failing runs |
| Distillation | Days to weeks | Retrain when teacher changes | Medium (discard student) | Hundreds of teacher outputs |
| Fine-tuning | Days to weeks | Retrain on base model updates | Low (weight changes are opaque) | Hundreds to thousands of curated examples |
The table makes the ordering concrete: each rung down costs more, takes longer, and is harder to undo. Distillation and fine-tuning should be justified by evidence that the cheaper interventions have been tried and measured.
Exercises
- Run the failure diagnostic on your most recent graded run log. What does the distribution look like? Does it match your intuition about where the system is weakest?
- Simulate a rung climb. Pick the category with the most failures. Apply a fix at that rung (better prompt, better retrieval, etc.). Re-run the benchmark and the diagnostic. Did the distribution shift?
- Cost estimation. For your top failure category, estimate the cost (in time and compute) of fixing it at the recommended rung versus skipping ahead to fine-tuning. When is the skip justified?
- Optimization tax audit. List every optimization currently in your system (prompt caching, model routing, token budgets, etc.). For each one, note the maintenance cost. Are any optimizations costing more to maintain than they save?
Completion checkpoint
You're done with this lesson when you can:
- Name the five rungs in order and explain why the ordering matters
- Run the failure diagnostic on a graded run log and interpret the results
- Attribute a failure to the correct rung using retrieval and answer eval signals
- Explain why fine-tuning is the last rung, not the first
- Articulate the optimization tax for each rung
What's next
Distillation. Most failures should still be fixed earlier in the ladder, but when a bounded workflow already works and is too expensive, the next lesson shows how to compress it.
References
- Decision Rules — the quick-lookup version of the optimization ladder (start here)
- Cost and Reliability Patterns — operational patterns that reduce cost before you reach for distillation (build with this)
- Eval Taxonomy — the four eval families that power failure attribution (build with this)
- Debugging Heuristics — symptom-to-fix mapping when things go wrong (deep dive)