Module 8: Optimization — The Optimization Ladder

The Optimization Ladder

You've built a system that retrieves code, generates grounded answers, tracks costs, runs evals, orchestrates specialists, and maintains memory across sessions. It works. But "works" and "works efficiently" aren't the same thing. Some answers take longer than they should. Some cost more than they need to. Some failure patterns repeat across runs even after prompt tweaks.

This lesson introduces a decision framework for improving your system's behavior without reaching for the most expensive tool first: the optimization ladder. The ladder has five rungs, ordered by cost and reversibility. Most problems resolve on the first two rungs; distillation and fine-tuning are powerful, but they're the last rungs, not the first. We'll walk through each level, establish the decision rules for when to advance, and build a diagnostic that tells you which rung to try next.

What you'll learn

  • The five rungs of the optimization ladder and why ordering matters
  • Decision rules for when to move from one rung to the next
  • How to diagnose whether a failure is a prompt problem, a retrieval problem, a context problem, or a model problem
  • The cost, reversibility, and data requirements at each level
  • When distillation and fine-tuning are justified, and when they're premature

Concepts

The optimization ladder — five intervention levels for improving AI system behavior, ordered from cheapest and most reversible to most expensive and most permanent:

  1. Prompt engineering — rewrite prompts, add constraints, improve output schemas
  2. Retrieval improvement — better chunking, indexing, routing, or method selection
  3. Context engineering — restructure what goes into the context window and how
  4. Distillation — train a smaller model to reproduce a larger model's behavior on bounded tasks
  5. Fine-tuning — update model weights on task-specific data for persistent adaptation

Each rung has different cost, reversibility, and data requirements. The ladder exists because engineers routinely skip to fine-tuning when the real problem was retrieving the wrong evidence or stuffing too much context into the window.

Reversibility — how easily you can undo an optimization. Prompt changes are instantly reversible: swap the old prompt back. Retrieval changes require re-indexing but don't touch the model. Context engineering changes are structural but still code-level. Distillation produces a new model artifact that you can discard but can't partially undo. Fine-tuning modifies weights in ways that may interact unpredictably with other behaviors. The less reversible the intervention, the more confidence you need before applying it.

Failure attribution — diagnosing which system component is responsible for a bad outcome. A wrong answer could be caused by:

  • Prompt failure: the model had the right evidence but was poorly instructed
  • Retrieval failure: the right evidence wasn't in the context window
  • Context failure: too much evidence diluted the signal, or evidence was poorly structured
  • Model failure: the model lacks the capability for this task class even with perfect context

Your eval data from Module 6 already contains the signals you need. Retrieval evals tell you whether the right files appeared. Answer evals tell you whether the model used the evidence well. The gap between these two signals is your failure attribution.

Optimization tax — the ongoing cost of maintaining an optimization. Prompt changes have near-zero tax: they live in your code and deploy with the application. A fine-tuned model has high tax: you need to retrain when the base model updates, manage model artifacts, and monitor for drift. Every rung of the ladder adds maintenance cost. The optimization tax should be proportional to the value gained.

Problem-to-Tool Map

| Problem class | Symptom | Cheapest rung to try | When to escalate |
| --- | --- | --- | --- |
| Output format inconsistency | Model occasionally ignores schema constraints | Prompt engineering: tighter output schema | Format failures persist across prompt variants |
| Wrong evidence retrieved | Answer is wrong because key files are missing | Retrieval improvement: better indexing or routing | Retrieval evals show ceiling with current methods |
| Right evidence, wrong answer | Files are in context but answer doesn't use them | Context engineering: restructure evidence presentation | Multiple context structures produce the same failure |
| Expensive correct answers | Answers are right but cost too much | Distillation: compress stable behavior to cheaper model | The task is stable and bounded with eval coverage |
| Persistent failure cluster | Same error pattern survives prompt, retrieval, and context fixes | Fine-tuning: bake correct behavior into weights | You have quality training data and the task is stable |

The five rungs

Rung 1: Prompt engineering

Cost: Near zero. You're editing text. Reversibility: Instant. Swap the prompt back. Data required: Your existing benchmark questions and a few failure examples.

This is where you've been working since Module 1. Prompt engineering covers:

  • Rewriting instructions for clarity
  • Adding or tightening output schemas
  • Decomposing complex prompts into multi-step chains
  • Adding few-shot examples
  • Constraining the model's behavior with explicit rules
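As a concrete rung-1 move, here's a minimal sketch of tightening an output schema. The prompt text and the `build_prompt` helper are illustrative assumptions, not the module's actual prompts:

```python
# A hypothetical rung-1 fix: the loose prompt leaves output format to
# chance; the tight version pins an explicit JSON schema and a refusal rule.

LOOSE_PROMPT = "Answer the question about the codebase and cite files."

TIGHT_PROMPT = """Answer the question about the codebase.

Rules:
- Respond with JSON only, matching this schema exactly:
  {"answer": "<string>", "citations": ["<file path>", ...]}
- Every claim must cite at least one file from the evidence.
- If the evidence does not contain the answer, return:
  {"answer": "not found in evidence", "citations": []}
"""

def build_prompt(question: str, evidence: str) -> str:
    """Assemble the final prompt from instructions, evidence, and question."""
    return f"{TIGHT_PROMPT}\n\nEvidence:\n{evidence}\n\nQuestion: {question}"
```

The change costs minutes and reverts by swapping the constant back, which is exactly why this rung comes first.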

When it's enough: The model has the right information and just needs clearer instructions. Format issues, minor behavior drift, and instruction-following problems usually resolve here.

When to climb: The model follows instructions correctly but the instructions can't compensate for missing information, or the same failure repeats despite multiple prompt variants.

Rung 2: Retrieval improvement

Cost: Low to moderate. Re-indexing, new chunking strategies, or adding a retrieval method. Reversibility: High. You're changing the retrieval pipeline, not the model. Data required: Your benchmark questions with expected-file labels from Module 2.

Retrieval improvement covers everything from Module 4:

  • Changing chunk size or overlap
  • Adding a retrieval method (grep, AST index, graph)
  • Improving the embedding model
  • Adding a reranker
  • Adjusting retrieval routing between substrates
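A rung-2 change can be as small as merging two retrieval methods instead of relying on one. Here's a minimal sketch using reciprocal rank fusion; the file paths and the specific pairing of grep hits with vector hits are illustrative assumptions:

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked result lists from different retrieval methods.

    Each list is ordered best-first. A document's fused score is the sum
    of 1 / (k + rank) across every list it appears in, so documents that
    rank well in multiple methods float to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Example: grep results and vector-search results for the same query.
grep_hits = ["auth/session.py", "auth/token.py", "api/login.py"]
vector_hits = ["auth/token.py", "auth/session.py", "models/user.py"]
merged = reciprocal_rank_fusion([grep_hits, vector_hits])
```

Score fusion like this needs no model changes and no re-embedding, which keeps it cheap and fully reversible.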

When it's enough: The model produces good answers when given the right evidence, but retrieval misses key files or returns too much noise.

When to climb: Retrieval evals show a ceiling. You've tried multiple retrieval methods and routing strategies, and the right evidence still doesn't appear consistently. Or retrieval is good but the model still fails.

Rung 3: Context engineering

Cost: Moderate. Structural changes to your pipeline. Reversibility: High. These are code changes, not model changes. Data required: Trace data showing what's in the context window when failures happen.

Context engineering covers the work from Module 4's context compilation and Module 5's evidence bundles:

  • Restructuring how evidence is presented in the prompt
  • Compressing or summarizing evidence to reduce noise
  • Ordering evidence by relevance
  • Splitting complex questions into sub-questions with focused context
  • Adjusting token budgets between evidence sections
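Two of these moves combine naturally: order evidence by relevance, then keep what fits the budget. A minimal sketch; the chunk format and the 4-characters-per-token approximation are assumptions, not the module's actual pipeline:

```python
def pack_evidence(chunks: list[dict], token_budget: int) -> list[dict]:
    """Order evidence by relevance score and keep what fits the token budget.

    Each chunk is {"text": str, "score": float}. Token cost is approximated
    as len(text) // 4; swap in a real tokenizer if you have one.
    """
    packed, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        cost = len(chunk["text"]) // 4
        if used + cost > token_budget:
            continue  # skip chunks that would exceed the budget
        packed.append(chunk)
        used += cost
    return packed
```

Dropping the lowest-scoring chunks often helps more than adding context, because every low-relevance chunk is noise competing with the signal.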

When it's enough: Retrieval finds the right evidence but the model gets confused by how it's presented — too much context, poor ordering, or competing signals from different evidence types.

When to climb: You've restructured context multiple ways and the failure pattern persists. The model consistently fails on a task class even with well-structured, relevant evidence.

Rung 4: Distillation

Cost: Significant. Requires teacher data collection, training infrastructure, and model management. Reversibility: Medium. You can discard the student model, but training time and compute are sunk costs. Data required: High-quality teacher outputs on a bounded task set.

Distillation trains a smaller, cheaper model to reproduce the behavior of a larger model on specific tasks. The next lesson covers the full workflow.

When it's justified:

  • A task is stable and bounded (not changing weekly)
  • The teacher model produces consistently good outputs
  • You need to reduce cost or latency for that task
  • You have eval coverage to verify the student matches the teacher

When it's premature:

  • The teacher behavior is still being tuned
  • You don't have evals to measure whether distillation preserved quality
  • The cost savings don't justify the training and maintenance overhead
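If you do reach this rung, the first artifact you need is a teacher dataset. A minimal sketch, assuming a hypothetical `teacher` callable that wraps your large-model pipeline:

```python
import json

def collect_teacher_data(questions: list[str], teacher, out_path: str) -> int:
    """Write prompt/completion pairs from a teacher model to a JSONL file.

    `teacher` is any callable mapping a question to an answer string; in
    practice it wraps your large-model pipeline. Returns the number of
    examples written. Only run this once the teacher's behavior is stable.
    """
    count = 0
    with open(out_path, "w") as f:
        for q in questions:
            record = {"prompt": q, "completion": teacher(q)}
            f.write(json.dumps(record) + "\n")
            count += 1
    return count
```

Collecting teacher outputs before the teacher is stable wastes the collection cost, which is one reason distillation sits this far down the ladder.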

Rung 5: Fine-tuning

Cost: High. Training data curation, training infrastructure, ongoing model management. Reversibility: Low. Weight changes can have unpredictable effects on other behaviors. Data required: Curated task-specific examples, ideally from your run logs and eval results.

Fine-tuning modifies the model's weights to bake in task-specific behavior. The fine-tuning lesson covers the mechanics.

When it's justified:

  • A repeated failure cluster survives prompt, retrieval, and context fixes
  • The task is stable enough that retraining won't be needed frequently
  • You have enough high-quality examples (hundreds to thousands)
  • Your evals can verify improvement without regression

When it's premature:

  • You haven't tried the cheaper rungs thoroughly
  • Your eval suite is weak (you can't measure whether the fine-tune helped)
  • The task is still changing shape
  • Your training data is noisy or small
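When fine-tuning is on the table, training-data curation comes first. A minimal sketch that filters a graded run log into candidate examples; the field names (`grade`, `question_id`, `question`, `answer`) follow this module's run-log format, and the `min_count` guard encodes the "hundreds of examples" rule of thumb:

```python
def curate_training_examples(entries: list[dict], min_count: int = 200) -> list[dict]:
    """Turn graded run-log entries into fine-tuning candidates.

    Keeps only entries graded "correct", deduplicates by question_id, and
    returns an empty list when the pool is smaller than min_count — a sign
    fine-tuning is premature.
    """
    seen: set = set()
    examples = []
    for entry in entries:
        if entry.get("grade") != "correct":
            continue
        qid = entry.get("question_id")
        if qid in seen:
            continue
        seen.add(qid)
        examples.append({
            "prompt": entry.get("question", ""),
            "completion": entry.get("answer", ""),
        })
    return examples if len(examples) >= min_count else []
```

An empty result here is a useful answer: it tells you to go collect more passing runs before spending anything on training.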

Walkthrough

Building a failure diagnostic

Your eval data from Module 6 already contains the signals you need to attribute failures to the right rung. Here's a diagnostic that reads your run logs and tells you where to focus:

# optimization/failure_diagnostic.py
"""Analyze run-log results to recommend which optimization rung to try.

Reads a graded run log and categorizes failures by likely cause:
prompt, retrieval, context, or model capability.
"""

import json
from pathlib import Path


def load_run_log(path: str) -> list[dict]:
    """Load a JSONL run log."""
    entries = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                entries.append(json.loads(line))
    return entries


def diagnose_failures(entries: list[dict]) -> dict:
    """Categorize failures by optimization rung.

    Uses retrieval eval labels and answer eval labels to attribute
    each failure to the most likely cause.
    """
    categories = {
        "prompt": [],       # Rung 1: right evidence, format/instruction issue
        "retrieval": [],    # Rung 2: wrong or missing evidence
        "context": [],      # Rung 3: right evidence, wrong presentation
        "model": [],        # Rung 4-5: persistent failure with good context
        "passing": [],      # No failure
    }

    for entry in entries:
        grade = entry.get("grade", "")
        question_id = entry.get("question_id", "unknown")

        if grade in ("correct", "acceptable"):
            categories["passing"].append(question_id)
            continue

        retrieval_hit = entry.get("retrieval_hit", None)
        failure_label = entry.get("failure_label", "")

        # Retrieval missed the target files entirely
        if retrieval_hit is False or failure_label == "missing_evidence":
            categories["retrieval"].append(question_id)

        # Evidence was present but answer had format/instruction issues
        elif failure_label in ("wrong_format", "partial_answer"):
            categories["prompt"].append(question_id)

        # Evidence was present, model cited some but missed key parts
        elif failure_label == "incomplete_evidence_use":
            categories["context"].append(question_id)

        # Evidence was present, instructions were clear, model still failed
        elif retrieval_hit is True:
            categories["model"].append(question_id)

        # Can't attribute — default to prompt (cheapest to try)
        else:
            categories["prompt"].append(question_id)

    return categories


def recommend(categories: dict) -> list[str]:
    """Produce ordered recommendations based on failure distribution."""
    recommendations = []
    total_failures = sum(
        len(v) for k, v in categories.items() if k != "passing"
    )

    if total_failures == 0:
        return ["No failures detected. System is performing well."]

    for rung, label in [
        ("retrieval", "Rung 2 — Retrieval improvement"),
        ("prompt", "Rung 1 — Prompt engineering"),
        ("context", "Rung 3 — Context engineering"),
        ("model", "Rung 4/5 — Distillation or fine-tuning"),
    ]:
        count = len(categories[rung])
        if count > 0:
            pct = count / total_failures * 100
            recommendations.append(
                f"{label}: {count} failures ({pct:.0f}%) — "
                f"question IDs: {categories[rung][:5]}"
                + (" ..." if count > 5 else "")
            )

    return recommendations


if __name__ == "__main__":
    import sys

    if len(sys.argv) < 2:
        print("Usage: python failure_diagnostic.py <run_log.jsonl>")
        sys.exit(1)

    log_path = sys.argv[1]
    entries = load_run_log(log_path)
    categories = diagnose_failures(entries)

    print(f"\n{'='*60}")
    print("Failure Diagnostic — Optimization Ladder")
    print(f"{'='*60}")
    print(f"Total entries: {len(entries)}")
    print(f"Passing: {len(categories['passing'])}")
    print(f"Failures: {sum(len(v) for k, v in categories.items() if k != 'passing')}")
    print(f"\nRecommendations:")
    print("-" * 40)
    for rec in recommend(categories):
        print(f"  {rec}")
    print()

Run it against a graded run log:

python optimization/failure_diagnostic.py runs/baseline_graded.jsonl

Expected output (your numbers will vary):

============================================================
Failure Diagnostic — Optimization Ladder
============================================================
Total entries: 15
Passing: 9
Failures: 6

Recommendations:
----------------------------------------
  Rung 2 — Retrieval improvement: 3 failures (50%) — question IDs: ['q3', 'q7', 'q11']
  Rung 1 — Prompt engineering: 2 failures (33%) — question IDs: ['q5', 'q14']
  Rung 4/5 — Distillation or fine-tuning: 1 failures (17%) — question IDs: ['q9']

This tells you two things: fix retrieval first (3 failures), then prompt engineering (2 failures). The single model-attributed failure isn't worth distillation or fine-tuning just yet. Revisit it after the cheaper fixes.

Reading the diagnostic

The diagnostic uses a simple attribution hierarchy:

  1. Missing evidence → retrieval problem. If the target files aren't in the context, no prompt or model change will help.
  2. Evidence present but format/instruction failure → prompt problem. The model had what it needed but wasn't instructed well enough.
  3. Evidence present but poorly used → context problem. The evidence was there but the model couldn't navigate it, likely due to too much noise or poor structure.
  4. Evidence present, instructions clear, still wrong → model problem. This is the only category where distillation or fine-tuning is a reasonable next step.

Most systems in early development have failures concentrated in the first two categories. That's normal and good news because those are the cheapest to fix.

Decision rules in practice

Here's how the diagnostic maps to action:

| Diagnostic result | Action | Lesson reference |
| --- | --- | --- |
| >30% retrieval failures | Improve indexing, add retrieval methods, or tune routing | Module 4: Code Retrieval |
| >30% prompt failures | Rewrite prompts, add schemas, or decompose into steps | Module 1: Prompt Engineering |
| >20% context failures | Restructure evidence presentation or adjust token budgets | Module 4: Context Compilation |
| >20% model failures after fixing above | Consider distillation for cost, fine-tuning for capability | Lesson 8.2 and Lesson 8.3 |

Treat those thresholds as starting points instead of hard rules. The principle here is to fix the cheapest category first, re-run the benchmark, and see if the model-attributed failures shrink. Often they do, because what looked like a model problem was actually retrieval noise that made the task harder than it needed to be.
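Those thresholds can be wired directly onto the diagnostic's output. A minimal sketch; `next_rung` is a hypothetical helper that consumes the `categories` dict produced by `diagnose_failures`:

```python
def next_rung(categories: dict[str, list]) -> str:
    """Apply the threshold table to a diagnostic result.

    Thresholds mirror the table above (30% for retrieval and prompt, 20%
    for context and model) and are starting points, not hard rules.
    Checks categories in the table's order, model last.
    """
    failures = {k: len(v) for k, v in categories.items() if k != "passing"}
    total = sum(failures.values())
    if total == 0:
        return "no failures"
    for rung, threshold in [("retrieval", 0.30), ("prompt", 0.30),
                            ("context", 0.20), ("model", 0.20)]:
        if failures.get(rung, 0) / total > threshold:
            return rung
    return "prompt"  # nothing dominant: default to the cheapest rung
```

Re-run this after every fix; the recommended rung should shift as each category of failures shrinks.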

The cost-reversibility tradeoff

| Rung | One-time cost | Ongoing tax | Reversibility | Data needed |
| --- | --- | --- | --- | --- |
| Prompt engineering | Minutes to hours | Near zero | Instant | Benchmark + failure examples |
| Retrieval improvement | Hours to days | Re-index on corpus changes | High (code changes) | Benchmark with expected-file labels |
| Context engineering | Hours to days | Moderate (pipeline changes) | High (code changes) | Trace data from failing runs |
| Distillation | Days to weeks | Retrain when teacher changes | Medium (discard student) | Hundreds of teacher outputs |
| Fine-tuning | Days to weeks | Retrain on base model updates | Low (weight changes are opaque) | Hundreds to thousands of curated examples |

Each rung down costs more, takes longer, and is harder to undo. That is why distillation and fine-tuning must be justified by evidence that the cheaper interventions have been tried and measured.

Exercises

  1. Run the failure diagnostic on your most recent graded run log. What does the distribution look like? Does it match your intuition about where the system is weakest?

  2. Simulate a rung climb. Pick the category with the most failures. Apply a fix at that rung (better prompt, better retrieval, etc.). Re-run the benchmark and the diagnostic. Did the distribution shift?

  3. Cost estimation. For your top failure category, estimate the cost (in time and compute) of fixing it at the recommended rung versus skipping ahead to fine-tuning. When is the skip justified?

  4. Optimization tax audit. List every optimization currently in your system (prompt caching, model routing, token budgets, etc.). For each one, note the maintenance cost. Are any optimizations costing more to maintain than they save?

Completion checkpoint

You're done with this lesson when you can:

  • Name the five rungs in order and explain why the ordering matters
  • Run the failure diagnostic on a graded run log and interpret the results
  • Attribute a failure to the correct rung using retrieval and answer eval signals
  • Explain why fine-tuning is the last rung, not the first
  • Articulate the optimization tax for each rung

What's next

Distillation. Most failures should still be fixed lower on the ladder, but when a bounded workflow already works and simply costs too much, the next lesson shows how to compress it into a smaller model.


Glossary

Foundational terms

API (Application Programming Interface) — A structured way for programs to communicate. In this context, usually an HTTP endpoint you call to interact with an LLM.

AST (Abstract Syntax Tree) — A tree representation of source code structure. Used by parsers like Tree-sitter to understand code as a hierarchy of functions, classes, and statements. You'll encounter this more deeply in the Code Retrieval module, but the concept appears briefly in retrieval fundamentals.

BM25 (Best Match 25) — A classical ranking function for keyword search. Scores documents by term frequency and inverse document frequency. Often competitive with or complementary to vector search.

Chunking — Splitting a document into smaller pieces for indexing and retrieval. Chunk boundaries significantly affect retrieval quality. Split at the wrong place and your retrieval will return half a function or the end of one paragraph glued to the start of another.

Context engineering — The discipline of selecting, packaging, and budgeting the information a model sees at inference time. Prompts, retrieved evidence, tool results, memory, and state are all parts of context. Context engineering is arguably the core skill of AI engineering. Bigger context windows are not a substitute for better context selection.

Context rot — Degradation of output quality caused by stale, noisy, or accumulated context. Symptoms include stale memory facts, conflicting retrieved evidence, bloated prompt history, and accumulated instructions that contradict each other. A form of technical debt in AI systems.

Context window — The maximum number of tokens an LLM can process in a single request (input + output combined).

Embedding — A fixed-length numeric vector representing a piece of text. Used for similarity search: texts with similar meanings have nearby embeddings.

Endpoint — A specific URL path that accepts requests and returns responses (e.g., POST /v1/chat/completions).

GGUF — A file format for quantized models used by llama.cpp and Ollama. When you see a model name like qwen2.5:7b-q4_K_M, the suffix indicates the quantization scheme. GGUF supports mixed quantization (different precision for different layers) and is the most common format for local inference.

Hallucination — When a model generates content that sounds confident but isn't supported by the evidence it was given, or fabricates details that don't exist. Not the same as "any wrong answer"; a model that misinterprets ambiguous instructions gave a bad answer but didn't hallucinate. Common causes: weak prompt, missing context, context rot, model limitation, or retrieval failure.

Inference — Running a trained model to generate output from input. What happens when you call an API. Most AI engineering work is inference-time work: building systems around models, not training them. Use "inference," not "inferencing."

JSON (JavaScript Object Notation) — A lightweight text format for structured data. The lingua franca of API communication.

Lexical search — Finding items by matching keywords or terms. Includes BM25, TF-IDF (Term Frequency–Inverse Document Frequency), and simple keyword matching. Returns exact term matches, not semantic similarity.

LLM (Large Language Model) — A neural network trained on large text corpora that generates text by predicting the next token. The core technology behind AI engineering; every tool, pattern, and pipeline in this curriculum runs on top of one.

Metadata — Structured information about a document or chunk (file path, language, author, date, symbol type). Used for filtering retrieval results.

Neural network — A computing system loosely inspired by biological neurons, built from layers of mathematical functions that transform inputs into outputs. LLMs are a specific type of neural network (transformers) trained on text. You don't need to understand neural network internals to do AI engineering, but knowing the term helps when reading external resources.

Reasoning model — A model optimized for complex multi-step planning, math, and logic (e.g., o3, o4-mini). Slower and more expensive but better on hard problems. Sometimes called "LRM" (large reasoning model), but "reasoning model" is the more consistent term across provider docs.

Reranking — A second-pass scoring step that re-orders retrieved results using a more expensive model. Improves precision after an initial broad retrieval.

Schema — A formal description of the shape and types of a data structure. Used to validate inputs and outputs.

SLM (small language model) — A compact model (typically 1-7B parameters) that runs on consumer hardware with lower cost, latency, and better privacy (e.g., Phi, small Llama variants, Gemma). The right choice when privacy, offline operation, predictable cost, or low latency matter more than peak capability.

System prompt — A special message that sets the model's behavior, role, and constraints for a conversation.

Temperature — A parameter controlling output randomness. Lower values produce more deterministic output; higher values produce more varied output. Does not affect the model's intelligence.

Token — The basic unit an LLM processes. Not a word. Tokens are sub-word fragments. "unhappiness" might be three tokens: "un", "happi", "ness". Token count determines cost and context window usage.

Top-k — The number of results returned from a retrieval query. "Top-5" means the five highest-scoring results.

Top-p (nucleus sampling) — An alternative to temperature for controlling output diversity. Selects from the smallest set of tokens whose cumulative probability exceeds p.

Vector search — Finding items by proximity in embedding space (nearest neighbors). Returns "similar" results, not "exact match" results.

vLLM (virtual LLM) — An inference serving engine (not a model) that hosts open-weight models behind an OpenAI-compatible HTTP endpoint. Infrastructure layer, not model layer. Relevant when moving from hosted APIs to self-hosting.

Weights — The learned parameters inside a model. Changed during training, fixed during inference.

Workhorse model — A general-purpose LLM optimized for speed and broad capability (e.g., GPT-4o-mini, Claude Haiku, Gemini Flash). The default for most tasks. When someone says "LLM" without qualification, they usually mean this.
Benchmark and Harness terms

Baseline — The first measured performance of your system on a benchmark. Everything else is compared against this. Without a baseline, you can't tell whether a change helped.

Benchmark — A fixed set of questions or tasks with known-good answers, used to measure system performance over time.

Run log — A structured record (typically JSONL) of every system run: what input was given, what output was produced, what tools were called, how long it took, and what it cost. The raw data that evals, telemetry, and cost analysis are built from.
Agent and Tool Building terms

A2A (Agent-to-Agent protocol) — An open protocol for peer-to-peer agent collaboration. Agents discover each other's capabilities and delegate or negotiate tasks as equals. Different from MCP (which connects agents to tools, not to other agents) and from handoffs (which transfer control within one system).

Agent — A system where an LLM decides which tools to call, observes results, and iterates until a task is complete. Agent = model + tools + control loop.

Control loop — The code that manages the agent's cycle: send prompt, check for tool calls, execute tools, append results, repeat or finish.

Handoff — Passing control from one agent or specialist to another within an orchestrated system.

MCP (Model Context Protocol) — An open protocol for exposing tools, resources, and prompts to AI applications in a standardized way. Connects agents to capabilities (tools and data), not to other agents.

Tool calling / function calling — The model's ability to request execution of a specific function with structured arguments, rather than just generating text.
Code Retrieval terms

Context compilation / context packing — The process of selecting and assembling the smallest useful set of evidence for a specific task. Not "dump everything retrieved into the prompt."

Grounding — Tying model assertions to specific evidence. A grounded answer cites what it found; an ungrounded answer asserts without evidence.

Hybrid retrieval — Combining multiple retrieval methods (e.g., vector search + keyword search + metadata filters) and merging or reranking the results.

Knowledge graph — A data structure that stores entities and their relationships explicitly (e.g., "function A calls function B," "module X imports module Y"). Useful for traversal and dependency reasoning. One retrieval strategy among several, often overused when simpler metadata or adjacency tables would suffice.

RAG (Retrieval-Augmented Generation) — A pattern where the model's response is grounded in retrieved external evidence rather than relying solely on its training data.

Symbol table — A mapping of code identifiers (functions, classes, variables) to their locations and metadata.

Tree-sitter — An incremental parsing library that builds ASTs for source code. Used in this curriculum for code-aware chunking and symbol extraction.
RAG and Grounded Answers terms

Context pack — A structured bundle of evidence assembled for a specific task, with metadata about provenance, relevance, and token budget.

Evidence bundle — A collection of retrieved items grouped for a specific sub-task, with enough metadata to evaluate whether the evidence is relevant and sufficient.

Retrieval routing — Deciding which retrieval strategy or method to use for a given query. Different questions need different retrieval methods.
Observability and Evals terms

Eval — A structured test that measures system quality. Not the same as training. Evals measure, they don't change the model.

Harness (AI harness / eval harness) — The experiment and evaluation framework around your model or agent. It runs benchmark tasks, captures outputs, logs traces, grades results, and compares system versions. It turns ad hoc "try it and see" into repeatable, comparable experiments. Typically includes: input dataset, prompt and tool configuration, model/provider selection, execution loop, logging, grading, and artifact capture.

LLM-as-judge — Using a language model to evaluate or grade the output of another model or system. Useful for scaling evaluation beyond manual review, but requires rubric quality, judge consistency checks, and human spot-checking. Not a replacement for exact-match checks where they apply.

OpenTelemetry (OTel) — An open standard for collecting and exporting telemetry data (traces, metrics, logs). Vendor-agnostic.

RAGAS — A specific eval framework for retrieval-augmented generation. Measures metrics like faithfulness, relevance, and context precision. One tool example, not a foundational concept. Learn the metrics first, then the tool.

Span — A single operation within a trace (e.g., one tool call, one retrieval query). Traces are made of spans.

Telemetry — Structured data about system behavior: what happened, when, how long it took, what it cost. Includes traces, metrics, and events.

Trace — A structured record of one complete run through the system, including all steps, tool calls, and decisions.
Orchestration and Memory terms

Long-term memory — Persistent facts that survive across conversations. Requires write policies to manage what gets stored, updated, or deleted.

Orchestration — Explicit control over how tasks are routed, delegated, and synthesized across multiple agents or specialists.

Router — A component that decides which specialist or workflow path to use for a given query.

Specialist — An agent or workflow tuned for a narrow task (e.g., "code search," "documentation lookup," "test generation"). Specialists are composed by an orchestrator.

Thread memory — Conversation state that persists within a single session or thread.

Workflow memory — Intermediate state that persists within a multi-step task but doesn't survive beyond the workflow's completion.
Optimization terms

Catastrophic forgetting
When fine-tuning causes a model to lose capabilities it had before training. The model gets better at the fine-tuned task but worse at tasks it previously handled. PEFT methods like LoRA reduce this risk by freezing original weights.

Distillation
Training a smaller (student) model to reproduce the behavior of a larger (teacher) model on a specific task.

DPO (Direct Preference Optimization)
A method for preference-based model optimization that's simpler than RLHF, training the model directly on preference pairs without a separate reward model.

Fine-tuning
Updating a model's weights on task-specific data to change its behavior permanently. An umbrella term that includes SFT, instruction tuning, RLHF, DPO, and other techniques. See the fine-tuning landscape table in Lesson 8.3 for how these relate.

Full fine-tuning
Updating all of a model's parameters during training, as opposed to PEFT methods that update only a small subset. Requires significantly more GPU memory and compute. Produces the most thorough adaptation but carries higher risk of catastrophic forgetting.

Inference server
Software (like vLLM or Ollama) that hosts a model and serves inference requests.

Instruction tuning
A specific application of SFT where the training data consists of instruction-response pairs. This is how base models become chat models: the technique is SFT, the data format is instructions. Not a separate technique from SFT.

LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning method that trains small adapter matrices instead of updating all model weights. Dramatically reduces GPU memory and compute requirements.
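The savings are easy to quantify: a LoRA adapter for a weight matrix of shape d × k trains two small matrices of shapes d × r and r × k, so the trainable count drops from d·k to r·(d + k). A quick sketch, with hypothetical layer dimensions typical of a 7B-class attention projection:

```python
def lora_trainable_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for one LoRA adapter: A is d x r, B is r x k."""
    return r * (d + k)

d, k, r = 4096, 4096, 8                     # illustrative dims and rank
full = d * k                                # 16,777,216 params if trained fully
adapter = lora_trainable_params(d, k, r)    # 65,536 params
print(f"adapter trains {adapter / full:.2%} of the full matrix")  # 0.39%
```

That ratio, repeated across every adapted layer, is why LoRA fits on hardware that full fine-tuning cannot.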
Parameter count
The number of learned weights in a model, commonly expressed in billions (e.g., "7B" = 7 billion parameters). Determines memory requirements (roughly 2 bytes per parameter at FP16) and broadly correlates with capability, though training quality and architecture matter as much as size. See Model Selection and Serving for sizing guidance.

PEFT (Parameter-Efficient Fine-Tuning)
A family of methods (including LoRA) that fine-tune a small subset of parameters instead of the full model.

Preference optimization
Training methods (RLHF, DPO) that use human or automated preference signals to improve model behavior. "This output is better than that output" rather than "this is the correct output."

QLoRA (Quantized LoRA)
LoRA applied to a quantized (compressed) base model. Further reduces memory requirements, enabling fine-tuning on consumer hardware.

Quantization
Reducing the precision of model weights (e.g., FP16 → INT4) to shrink memory usage and increase inference speed at some quality cost. A 7B model at FP16 needs ~14 GB VRAM; quantized to 4-bit, it fits in ~4 GB. Common formats include GGUF (llama.cpp, Ollama) and GPTQ and AWQ (vLLM, Hugging Face). See Model Selection and Serving for format details and tradeoffs.
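The VRAM figures above come from simple arithmetic: weight memory is parameter count times bits per parameter, divided by eight. A sketch of the estimate (weights only; activations and KV cache add overhead, which is why 3.5 GB of 4-bit weights is quoted as "~4 GB" in practice):

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Estimated weight memory for a model at a given precision, in GB."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

for bits, label in [(16, "FP16"), (8, "INT8"), (4, "INT4")]:
    print(f"7B at {label}: ~{weight_memory_gb(7, bits):.1f} GB weights")
```

Halving the bits halves the weight memory, which is the entire appeal of quantization when the quality loss is acceptable.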
Overfitting
When a model memorizes training examples instead of learning generalizable patterns. The model performs well on training data but poorly on new inputs. Detected by monitoring validation loss alongside training loss.
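"Monitoring validation loss alongside training loss" is usually operationalized as early stopping: halt training when validation loss hasn't improved for some number of evaluations. A minimal sketch, with an illustrative patience value:

```python
def should_stop(val_losses: list, patience: int = 3) -> bool:
    """True when validation loss hasn't improved for `patience` evals.

    Training loss can keep falling while validation loss rises; that
    widening gap is the signature of overfitting.
    """
    if not val_losses:
        return False
    best_epoch = val_losses.index(min(val_losses))
    return len(val_losses) - 1 - best_epoch >= patience

# Validation loss bottoms out at epoch 1, then climbs for 3 evals.
print(should_stop([1.00, 0.90, 0.95, 0.97, 1.02]))  # True
```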
RLHF (Reinforcement Learning from Human Feedback)
A training method that uses human preference signals to improve model behavior through a reward model. More complex than DPO (requires training a separate reward model) but offers more control over the optimization objective.

SFT (Supervised Fine-Tuning)
Fine-tuning using input-output pairs where the desired output is known. The most common fine-tuning approach.

TRL (Transformer Reinforcement Learning)
A Hugging Face library for training language models with reinforcement learning, SFT, and other optimization methods.
Cross-cutting terms

Consumer chat app
The browser or desktop product meant for human conversation (ChatGPT, Claude, HuggingChat). Useful for experimentation, but not the same as API access.

Developer platform
The provider's API, billing, API-key, and developer-docs surface. This is what you need for this learning path.

Hosted API
The provider runs the model for you, and you call it over HTTP.

Local inference
You run the model on your own machine.

Provider
The company or service that hosts a model API you call from code.

Prompt caching
Reusing computation from repeated prompt prefixes to reduce latency and cost on subsequent requests with the same prefix.

Rate limiting
Constraints on how many API requests you can make per unit of time. An operational concern that affects system design and cost.
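A common client-side way to stay under a provider's limit is a token bucket: each request spends a token, and tokens refill at the allowed rate, permitting short bursts up to a capacity. A minimal sketch; the rate and capacity values are illustrative, not any provider's actual limits:

```python
import time

class TokenBucket:
    """Allow `rate` requests/sec on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate=2.0, capacity=5)  # ~2 requests/sec, burst of 5
```

A caller checks `bucket.allow()` before each request and sleeps or queues when it returns False, instead of burning requests into 429 responses.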
Token budget
The maximum number of tokens you allocate for a specific part of the context (e.g., "retrieval evidence gets at most 4K tokens"). A context engineering tool for preventing any single component from dominating the context window.
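Enforcing token budgets can be as simple as trimming each component to its allocation before assembling the context. An illustrative sketch; a real system would count tokens with the model's tokenizer rather than whitespace splitting:

```python
def enforce_budgets(components: dict, budgets: dict) -> dict:
    """Trim each context component to its token budget.

    components: {name: text}; budgets: {name: max_tokens}.
    Whitespace splitting stands in for a real tokenizer here.
    """
    trimmed = {}
    for name, text in components.items():
        tokens = text.split()
        limit = budgets.get(name, len(tokens))
        trimmed[name] = " ".join(tokens[:limit])
    return trimmed

context = enforce_budgets(
    {"system": "You are a code assistant", "evidence": "chunk " * 6000},
    {"evidence": 4000},  # retrieval evidence gets at most 4K tokens
)
```

Components without an explicit budget pass through untouched; the budgeted ones can never crowd out the rest of the window.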