Module 6 (Observability and Evals): Cost, Caching, and Rate Limits

Cost, Caching, and Operational Telemetry

The previous lesson made the pipeline visible. You can trace any request and see what happened. But traces alone don't answer the operational questions that matter in a real system: How much does each answer cost? Are we wasting tokens on evidence the model ignores? Could we serve 80% of requests with a cheaper model? When a user hits a rate limit, do we know about it before they complain?

This lesson adds the cost, caching, and operational layers that turn raw telemetry into actionable metrics. We'll build token budgeting, prompt caching awareness, cost-per-successful-task tracking, and rate-limit telemetry, all instrumented into the traced pipeline. By the end, you'll be able to answer "is this system affordable to operate?" with data instead of hope.

What you'll learn

  • Calculate and track cost per request using token counts and provider pricing
  • Understand prompt caching: when it works well, when it doesn't, and how to measure cache hit rates
  • Build model routing by task complexity to match cost to difficulty
  • Set up token budgets that prevent runaway spending
  • Add rate-limit telemetry that connects back to the rate limiting concepts from Module 1's security-basics
  • Track cost per successful task, not just cost per request

Concepts

Cost per request — the total token cost for one end-to-end pipeline execution. This includes input tokens (system prompt + evidence + question), output tokens (the model's response), and any intermediate calls (routing classification, grounding checks). Provider pricing varies by model, so cost tracking requires knowing both the token count and the per-token price for each model used.

Cost per successful task — a more useful metric than cost per request. Not every request succeeds. Some abstain, some produce wrong answers, some hit rate limits. Cost per successful task divides total spending by the number of requests that actually helped the user. If your system costs $0.01 per request but only 60% of requests succeed, the real cost is $0.017 per successful task. This metric catches the trap of optimizing for cheap requests that aren't useful.
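The arithmetic behind that example is worth internalizing; here it is as two lines of Python:

```python
# Cost per successful task: spread total spend over only the requests
# that actually helped the user.
cost_per_request = 0.01
success_rate = 0.60

cost_per_successful_task = cost_per_request / success_rate
print(f"${cost_per_successful_task:.4f}")  # → $0.0167, the ~$0.017 above
```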

Prompt caching — a provider-level optimization where repeated prompt prefixes are cached and charged at a reduced rate (or not charged at all). Anthropic, OpenAI, and other providers offer variants of this. Caching works well when your system prompt is stable and appears in many requests: the first request pays full price, and subsequent requests with the same prefix get a discount. It works poorly when prompts vary significantly between requests (different evidence bundles, personalized instructions) because each unique prefix is a cache miss.

Cache hit rate — the percentage of requests where the prompt prefix was served from cache rather than processed from scratch. A high cache hit rate (70%+) means your prompt structure is cache-friendly. A low rate means either your prompts vary too much, or your request volume is too low for the cache to stay warm. Tracking this metric tells you whether caching is actually saving money or just adding complexity.

Token budget — a hard limit on the number of tokens a single request can consume, including both input and output. Token budgets prevent runaway costs from pathological queries (questions that trigger large evidence bundles or verbose responses). They're a safety net, not an optimization. The goal is to catch outliers, not to constrain normal operation.

Model routing by task complexity — using a cheaper, faster model for simple tasks and reserving expensive models for complex ones. In our pipeline, skip-mode questions (general knowledge) can use a smaller model, while complex hybrid-retrieval questions might need a larger one. This is a cost optimization that trades a small amount of routing complexity for significant savings on easy questions.

Rate-limit telemetry — structured tracking of rate-limit events: when a request is throttled, which limit was hit (per-user, per-route, provider-side), and what happened to the request (queued, retried, rejected). In Module 1's security-basics lesson, we set up rate limiting as a protective measure. Here we'll instrument it so you can see rate limiting in action and tune thresholds based on real usage patterns.

Problem-to-Tool Map

| Problem class | Symptom | Cheapest thing to try first | Tool or approach |
| --- | --- | --- | --- |
| Unknown costs | You don't know what your system costs per request | Manual token estimates | Per-generation cost tracking in traces |
| Wasted tokens | Evidence is retrieved but never cited in the answer | Review traces manually | Token-waste metric comparing evidence tokens to citation count |
| Expensive easy questions | Simple questions use the same expensive model as complex ones | Single model for everything | Model routing by task complexity |
| Cache misses | Prompt caching is enabled but savings are minimal | Check prompt structure | Cache hit rate metric with prefix stability analysis |
| Silent rate limiting | Users hit limits but you don't know until they complain | Wait for complaints | Rate-limit event telemetry with alerts |
| Runaway costs | One pathological query costs 10x the average | No budget enforcement | Token budgets with hard cutoffs |

Walkthrough

Tracking cost per request

Provider pricing changes, so we'll build a simple pricing table that's easy to update. The goal isn't precision to the penny; it's having any cost visibility at all:

# observability/cost_tracker.py
"""Token cost tracking for traced pipeline runs.

Pricing is approximate and will need periodic updates.
Check your provider's pricing page for current rates.
"""

# Prices in USD per 1M tokens — update when pricing changes
# Last updated: 2026-03-25
MODEL_PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00, "cached_input": 1.25},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60, "cached_input": 0.075},
    "gpt-4.1-mini": {"input": 0.40, "output": 1.60, "cached_input": 0.10},
    "gpt-4.1-nano": {"input": 0.10, "output": 0.40, "cached_input": 0.025},
    "claude-sonnet-4-6": {"input": 3.00, "output": 15.00, "cached_input": 0.30},
    "claude-haiku-3-5": {"input": 0.80, "output": 4.00, "cached_input": 0.08},
    "Qwen/Qwen2.5-7B-Instruct": {"input": 0.27, "output": 0.27, "cached_input": 0.27},
    "gpt-oss:20b": {"input": 0.70, "output": 0.70, "cached_input": 0.70},
}

# Fallback for unknown models. Local Ollama models do not have a hosted
# per-token price, so track them separately using latency, GPU time, or
# cloud instance cost instead of forcing a fake token rate.
DEFAULT_PRICING = {"input": 1.00, "output": 3.00, "cached_input": 0.50}


def estimate_cost(
    model: str,
    input_tokens: int,
    output_tokens: int,
    cached_input_tokens: int = 0,
) -> dict:
    """Estimate the cost of a single LLM call.

    Returns a dict with input_cost, output_cost, cache_savings,
    and total_cost in USD.
    """
    pricing = MODEL_PRICING.get(model, DEFAULT_PRICING)

    # Non-cached input tokens
    non_cached_input = input_tokens - cached_input_tokens
    input_cost = (non_cached_input / 1_000_000) * pricing["input"]
    cached_cost = (cached_input_tokens / 1_000_000) * pricing["cached_input"]
    output_cost = (output_tokens / 1_000_000) * pricing["output"]

    # What caching saved
    full_input_cost = (input_tokens / 1_000_000) * pricing["input"]
    cache_savings = full_input_cost - (input_cost + cached_cost)

    return {
        "input_cost": round(input_cost + cached_cost, 6),
        "output_cost": round(output_cost, 6),
        "cache_savings": round(cache_savings, 6),
        "total_cost": round(input_cost + cached_cost + output_cost, 6),
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cached_input_tokens": cached_input_tokens,
    }


if __name__ == "__main__":
    # Example: a typical RAG request
    cost = estimate_cost(
        model="gpt-4o-mini",
        input_tokens=3500,  # system prompt + evidence + question
        output_tokens=400,
        cached_input_tokens=800,  # system prompt was cached
    )

    print("Cost estimate for one RAG request:")
    for k, v in cost.items():
        if isinstance(v, float):
            print(f"  {k}: ${v:.6f}")
        else:
            print(f"  {k}: {v}")

Run it:

python observability/cost_tracker.py

Expected output:

Cost estimate for one RAG request:
  input_cost: $0.000465
  output_cost: $0.000240
  cache_savings: $0.000060
  total_cost: $0.000705
  model: gpt-4o-mini
  input_tokens: 3500
  output_tokens: 400
  cached_input_tokens: 800

Prompt caching: when it works and when it doesn't

Prompt caching reduces input token costs by caching the stable prefix of your prompt, the part that doesn't change between requests. Here's how to think about it:

Caching works well when:

  • Your system prompt is stable across requests (same instructions, same format)
  • Evidence bundles have common prefixes (e.g., the same repository README appears in many requests)
  • Request volume is high enough to keep the cache warm (provider caches expire after inactivity)
  • You're using a provider that supports caching (Anthropic's prompt caching, OpenAI's cached input pricing)

Caching works poorly when:

  • Every request has a unique prompt (different evidence, different instructions)
  • Request volume is low (cache expires between requests)
  • You're frequently updating system prompts during development
  • Evidence bundles vary significantly between question types

The practical implication for our pipeline: the system prompt and grounding instructions are stable, so they'll cache well. The evidence bundle changes every request, so it won't cache. This means caching saves money on the fixed portion of the prompt, but the variable portion (which is usually the largest part in RAG) won't benefit.

# observability/cache_metrics.py
"""Track prompt caching effectiveness across runs."""
import json
from pathlib import Path


def analyze_cache_rates(run_traces: list[dict]) -> dict:
    """Analyze cache hit rates from traced generation spans.

    Expects a list of trace dicts with 'cached_input_tokens'
    and 'input_tokens' fields from the generation spans.
    """
    total_input_tokens = 0
    total_cached_tokens = 0
    generations = 0

    for trace in run_traces:
        if "input_tokens" in trace and "cached_input_tokens" in trace:
            total_input_tokens += trace["input_tokens"]
            total_cached_tokens += trace["cached_input_tokens"]
            generations += 1

    if generations == 0:
        return {"cache_hit_rate": 0, "generations": 0}

    hit_rate = total_cached_tokens / total_input_tokens if total_input_tokens > 0 else 0

    return {
        "cache_hit_rate": round(hit_rate * 100, 1),
        "total_input_tokens": total_input_tokens,
        "total_cached_tokens": total_cached_tokens,
        "generations": generations,
        "estimated_savings_pct": round(hit_rate * 50, 1),  # rough: cached is ~50% cheaper
    }


if __name__ == "__main__":
    # Simulated data from a traced benchmark run
    sample_traces = [
        {"input_tokens": 3500, "cached_input_tokens": 800},
        {"input_tokens": 4200, "cached_input_tokens": 800},
        {"input_tokens": 2100, "cached_input_tokens": 800},
        {"input_tokens": 3800, "cached_input_tokens": 800},
        {"input_tokens": 500, "cached_input_tokens": 200},  # skip-mode, smaller prompt
    ]

    metrics = analyze_cache_rates(sample_traces)
    print("Cache analysis:")
    for k, v in metrics.items():
        print(f"  {k}: {v}")
Run it:

python observability/cache_metrics.py

Expected output:

Cache analysis:
  cache_hit_rate: 24.1
  total_input_tokens: 14100
  total_cached_tokens: 3400
  generations: 5
  estimated_savings_pct: 12.1

A 24% cache hit rate tells you that about a quarter of your input tokens are coming from cached prefixes, likely the system prompt. To improve this, you'd need to stabilize more of the prompt prefix (e.g., by putting frequently used context before the variable evidence). But there's a tradeoff: restructuring prompts for cacheability can hurt clarity. I've found it's usually better to optimize prompt structure for quality first and accept whatever caching you get naturally.
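To estimate the theoretical ceiling for your own prompts, compare the stable and variable portions. A rough sketch, where the prompt text and the chars/4 token heuristic are illustrative (not our pipeline's real prompt builder or a real tokenizer):

```python
# Rough sketch: the cache-friendly ceiling is the share of the prompt
# that is stable across requests. Prompt text and the chars/4 token
# estimate below are illustrative assumptions.
SYSTEM_PROMPT = "You are a code assistant. Answer only from the provided evidence."
GROUNDING_RULES = "Cite file and line for every claim. Abstain if the evidence is insufficient."

def max_cache_hit_rate(evidence: str, question: str) -> float:
    stable = SYSTEM_PROMPT + "\n" + GROUNDING_RULES              # identical every request
    variable = f"\nEvidence:\n{evidence}\nQuestion: {question}"  # unique every request
    stable_tokens = len(stable) // 4
    variable_tokens = len(variable) // 4
    return stable_tokens / (stable_tokens + variable_tokens)

rate = max_cache_hit_rate("def read_file(path): ..." * 40, "What does read_file do?")
print(f"Theoretical max cache hit rate: {rate:.0%}")
```

The larger the evidence bundle, the lower this ceiling, which is exactly the RAG situation described above.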

Model routing by task complexity

Not every question needs the same model. Skip-mode questions (general knowledge) are simple, and a smaller, cheaper model handles them fine. Complex hybrid-retrieval questions with nuanced evidence might benefit from a larger model. Model routing matches cost to difficulty:

# observability/model_router.py
"""Route questions to appropriate models based on task complexity.

This extends the retrieval router from Module 5 with a model
selection step, so simple questions use cheap models and complex
questions use capable ones.
"""
import sys
sys.path.insert(0, ".")

from rag.retrieval_service import (
    classify_question, RetrievalPolicy, RetrievalMode,
)


# Model tiers — adjust based on your provider and budget
MODEL_TIERS = {
    "simple": "gpt-4.1-nano",
    "standard": "gpt-4o-mini",
    "complex": "gpt-4o",
}


def select_model(
    question: str,
    retrieval_mode: RetrievalMode,
    confidence: float,
) -> str:
    """Select a model tier based on question complexity signals.

    Simple questions (skip mode, high confidence) get cheap models.
    Complex questions (hybrid mode, low confidence) get capable models.
    """
    if retrieval_mode == RetrievalMode.SKIP:
        return MODEL_TIERS["simple"]

    if retrieval_mode == RetrievalMode.HYBRID and confidence < 0.5:
        # Ambiguous question that needed hybrid retrieval —
        # the answer will require careful reasoning over mixed evidence
        return MODEL_TIERS["complex"]

    return MODEL_TIERS["standard"]


if __name__ == "__main__":
    policy = RetrievalPolicy()

    test_questions = [
        "What is a Python list?",
        "What does validate_path return?",
        "What functions call read_file and how does the caching interact with the auth module?",
    ]

    for q in test_questions:
        classification = classify_question(q, policy)
        model = select_model(q, classification.mode, classification.confidence)
        print(f"Q: {q[:60]}...")
        print(f"  Route: {classification.mode.value} (confidence: {classification.confidence})")
        print(f"  Model: {model}")
        print()
Run it:

python observability/model_router.py

Model routing is a cost optimization, not a quality optimization. The goal is to avoid paying for capability you don't need, not to find the "best" model for each question. Start with a single model for everything (as we've been doing), measure costs, and only add model routing if the cost savings justify the added complexity.
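To decide whether the savings justify the complexity, you can compute a blended cost from your traffic mix before building anything. The prices mirror the MODEL_PRICING table above; the 60/30/10 traffic split is a made-up assumption — substitute the route distribution from your own traces:

```python
# Blended cost per request under routing. Traffic mix is hypothetical.
PRICE = {  # USD per 1M tokens: (input, output), from the pricing table above
    "gpt-4.1-nano": (0.10, 0.40),
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
}
MIX = {"gpt-4.1-nano": 0.6, "gpt-4o-mini": 0.3, "gpt-4o": 0.1}  # assumed split
IN_TOK, OUT_TOK = 3500, 400  # the typical request from the walkthrough

def per_request(model: str) -> float:
    inp, out = PRICE[model]
    return (IN_TOK / 1e6) * inp + (OUT_TOK / 1e6) * out

blended = sum(share * per_request(m) for m, share in MIX.items())
single = per_request("gpt-4o")  # everything on the most capable model
print(f"blended ${blended:.6f} vs single-model ${single:.6f} "
      f"({1 - blended / single:.0%} cheaper)")
```

Under this assumed mix, routing cuts per-request cost by roughly 86% relative to running everything on gpt-4o — but only if the cheap models actually succeed on the questions routed to them, which is what the eval lessons measure.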

Token budgets

Token budgets are a safety net against pathological queries. A question that triggers a massive evidence bundle or an unusually verbose response can cost 10-50x the average. Budgets catch these outliers:

# observability/token_budget.py
"""Token budget enforcement for the RAG pipeline.

Prevents runaway costs from pathological queries by enforcing
hard limits on input and output tokens.
"""
from dataclasses import dataclass


@dataclass
class TokenBudget:
    """Per-request token budget configuration."""
    max_input_tokens: int = 8000     # evidence + system prompt + question
    max_output_tokens: int = 2000    # model response
    max_total_tokens: int = 10000    # combined limit
    warn_threshold: float = 0.8      # log a warning at 80% of budget

    def check(self, input_tokens: int, output_tokens: int = 0) -> dict:
        """Check whether a request is within budget.

        Returns a dict with 'allowed', 'warnings', and usage details.
        """
        total = input_tokens + output_tokens
        warnings = []

        if input_tokens > self.max_input_tokens:
            return {
                "allowed": False,
                "reason": f"Input tokens ({input_tokens}) exceed budget ({self.max_input_tokens})",
                "warnings": warnings,
            }

        if total > self.max_total_tokens:
            return {
                "allowed": False,
                "reason": f"Total tokens ({total}) exceed budget ({self.max_total_tokens})",
                "warnings": warnings,
            }

        # Warnings for approaching limits
        if input_tokens > self.max_input_tokens * self.warn_threshold:
            warnings.append(
                f"Input tokens ({input_tokens}) at "
                f"{input_tokens/self.max_input_tokens:.0%} of budget"
            )

        if total > self.max_total_tokens * self.warn_threshold:
            warnings.append(
                f"Total tokens ({total}) at "
                f"{total/self.max_total_tokens:.0%} of budget"
            )

        return {
            "allowed": True,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "total_tokens": total,
            "input_pct": round(input_tokens / self.max_input_tokens * 100, 1),
            "total_pct": round(total / self.max_total_tokens * 100, 1),
            "warnings": warnings,
        }


if __name__ == "__main__":
    budget = TokenBudget()

    test_cases = [
        (3500, 400, "Normal RAG request"),
        (7500, 500, "Large evidence bundle"),
        (9000, 1500, "Pathological query"),
    ]

    for input_t, output_t, label in test_cases:
        result = budget.check(input_t, output_t)
        status = "ALLOWED" if result["allowed"] else "BLOCKED"
        print(f"{label}: {status}")
        if result.get("warnings"):
            for w in result["warnings"]:
                print(f"  WARNING: {w}")
        if not result["allowed"]:
            print(f"  REASON: {result['reason']}")
        print()
Run it:

python observability/token_budget.py

Expected output:

Normal RAG request: ALLOWED

Large evidence bundle: ALLOWED
  WARNING: Input tokens (7500) at 94% of budget

Pathological query: BLOCKED
  REASON: Input tokens (9000) exceed budget (8000)

Rate-limit telemetry

In Module 1's security-basics lesson, we set up rate limiting to protect the system. Here we'll add telemetry to those limits so you can see when they fire and whether the thresholds are right:

# observability/rate_limit_telemetry.py
"""Rate-limit event tracking.

Instruments rate-limit decisions so you can see throttling
in your traces and tune thresholds from real usage data.
"""
import time
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class RateLimitEvent:
    """A single rate-limit event for telemetry."""
    timestamp: str
    user_id: str
    route: str
    action: str           # "allowed", "throttled", "rejected"
    current_count: int
    limit: int
    window_seconds: int


@dataclass
class RateLimitTracker:
    """Track rate-limit events for telemetry and analysis.

    This wraps your existing rate limiter (from Module 1) and
    records every decision as a telemetry event.
    """
    events: list = field(default_factory=list)
    counters: dict = field(default_factory=lambda: defaultdict(list))

    # Configurable limits per route
    limits: dict = field(default_factory=lambda: {
        "rag-pipeline": {"max_requests": 30, "window_seconds": 60},
        "benchmark-run": {"max_requests": 100, "window_seconds": 300},
        "default": {"max_requests": 60, "window_seconds": 60},
    })

    def check(self, user_id: str, route: str = "default") -> RateLimitEvent:
        """Check whether a request is within rate limits.

        Records the decision as a telemetry event regardless of outcome.
        """
        now = time.time()
        config = self.limits.get(route, self.limits["default"])
        window = config["window_seconds"]
        max_req = config["max_requests"]

        # Sliding window: count requests in the last N seconds
        key = f"{user_id}:{route}"
        self.counters[key] = [
            t for t in self.counters[key] if now - t < window
        ]
        current = len(self.counters[key])

        if current >= max_req:
            action = "throttled"
        else:
            action = "allowed"
            self.counters[key].append(now)

        event = RateLimitEvent(
            timestamp=datetime.now(timezone.utc).isoformat(),
            user_id=user_id,
            route=route,
            action=action,
            current_count=current,
            limit=max_req,
            window_seconds=window,
        )
        self.events.append(event)
        return event

    def summary(self) -> dict:
        """Summarize rate-limit events for a telemetry report."""
        total = len(self.events)
        throttled = sum(1 for e in self.events if e.action == "throttled")
        by_route = defaultdict(lambda: {"total": 0, "throttled": 0})

        for e in self.events:
            by_route[e.route]["total"] += 1
            if e.action == "throttled":
                by_route[e.route]["throttled"] += 1

        return {
            "total_events": total,
            "throttled_events": throttled,
            "throttle_rate": round(throttled / total * 100, 1) if total > 0 else 0,
            "by_route": dict(by_route),
        }


if __name__ == "__main__":
    tracker = RateLimitTracker(
        limits={
            "rag-pipeline": {"max_requests": 5, "window_seconds": 10},
            "default": {"max_requests": 10, "window_seconds": 10},
        }
    )

    # Simulate a burst of requests
    print("Simulating 8 rapid requests to rag-pipeline:\n")
    for i in range(8):
        event = tracker.check("user-123", "rag-pipeline")
        print(f"  Request {i+1}: {event.action} ({event.current_count}/{event.limit})")

    print()
    summary = tracker.summary()
    print(f"Summary:")
    print(f"  Total events: {summary['total_events']}")
    print(f"  Throttled: {summary['throttled_events']}")
    print(f"  Throttle rate: {summary['throttle_rate']}%")
Run it:

python observability/rate_limit_telemetry.py

Expected output:

Simulating 8 rapid requests to rag-pipeline:

  Request 1: allowed (0/5)
  Request 2: allowed (1/5)
  Request 3: allowed (2/5)
  Request 4: allowed (3/5)
  Request 5: allowed (4/5)
  Request 6: throttled (5/5)
  Request 7: throttled (5/5)
  Request 8: throttled (5/5)

Summary:
  Total events: 8
  Throttled: 3
  Throttle rate: 37.5%

The throttle rate tells you whether your limits are too tight (high throttle rate during normal use) or too loose (zero throttling even during bursts). In production, you'd feed these events into your Langfuse traces so you can see rate limiting alongside the request traces.

Cost per successful task

This is the metric that actually matters. Cost per request is easy to calculate but misleading because it treats failed requests the same as successful ones. Cost per successful task gives you the true unit economics:

# observability/success_cost.py
"""Calculate cost per successful task from a graded run log.

Combines the cost tracker with graded run results to give you
the metric that actually matters for operational decisions.
"""
import json
import sys

sys.path.insert(0, ".")
from observability.cost_tracker import estimate_cost


def cost_per_success(run_file: str, model: str = "gpt-4o-mini") -> dict:
    """Calculate cost per successful task from a graded run log."""
    entries = []
    with open(run_file) as f:
        for line in f:
            if line.strip():
                entries.append(json.loads(line))

    graded = [e for e in entries if e.get("grade") is not None]
    if not graded:
        return {"error": "No graded entries found"}

    # Estimate costs (using average token counts since run logs
    # don't yet have per-request token data — that's coming in
    # the harness lesson)
    total_cost = 0.0
    for e in graded:
        cost = estimate_cost(
            model=model,
            input_tokens=3500,   # average estimate
            output_tokens=400,   # average estimate
        )
        total_cost += cost["total_cost"]

    successful = [
        e for e in graded
        if e["grade"] in ("fully_correct", "partially_correct")
    ]

    cost_per_request = total_cost / len(graded)
    cost_per_success_val = total_cost / len(successful) if successful else float("inf")

    return {
        "total_requests": len(graded),
        "successful_requests": len(successful),
        "success_rate": round(len(successful) / len(graded) * 100, 1),
        "total_cost": round(total_cost, 4),
        "cost_per_request": round(cost_per_request, 6),
        "cost_per_successful_task": round(cost_per_success_val, 6),
        "cost_overhead_from_failures": round(
            (cost_per_success_val - cost_per_request) / cost_per_request * 100, 1
        ) if successful else None,
    }


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python observability/success_cost.py <graded-run-file.jsonl>")
        print("Example: python observability/success_cost.py harness/runs/baseline-2026-03-24-graded.jsonl")
        sys.exit(1)

    metrics = cost_per_success(sys.argv[1])
    print("Cost per successful task:")
    for k, v in metrics.items():
        if isinstance(v, float):
            if "rate" in k or "overhead" in k:
                print(f"  {k}: {v}%")
            elif "cost" in k:
                print(f"  {k}: ${v}")
            else:
                print(f"  {k}: {v}")
        else:
            print(f"  {k}: {v}")
Run it:

python observability/success_cost.py harness/runs/baseline-2026-03-24-143022-graded.jsonl

Expected output (based on the typical baseline from Module 2):

Cost per successful task:
  total_requests: 30
  successful_requests: 11
  success_rate: 36.7%
  total_cost: $0.023
  cost_per_request: $0.000765
  cost_per_successful_task: $0.002086
  cost_overhead_from_failures: 172.7%

That's a 173% overhead, meaning failures nearly triple your effective cost. This is why improving accuracy is a cost concern just as much as it is a quality concern. Every failed request is money spent with no return. We'll see this metric improve as we add retrieval and better grading in the eval lessons.
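Because cost per successful task is just cost per request divided by the success rate, the overhead is pure algebra, which makes "what if accuracy improved?" projections trivial:

```python
# Failure overhead depends only on the success rate:
# cost_per_success = cost_per_request / success_rate, so the overhead
# relative to cost_per_request is 1 / success_rate - 1.
def failure_overhead_pct(success_rate: float) -> float:
    return (1 / success_rate - 1) * 100

print(f"{failure_overhead_pct(11 / 30):.0f}%")  # → 173%, the baseline above
print(f"{failure_overhead_pct(0.50):.0f}%")     # → 100% at a 50% success rate
```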

Exercises

  1. Run observability/cost_tracker.py with your actual model and typical token counts. Compare costs across at least two models (e.g., gpt-4o-mini vs. gpt-4o or claude-haiku-3-5 vs. claude-sonnet-4-6). What's the cost ratio between the cheapest and most expensive option?
  2. Analyze prompt caching potential for your pipeline. Count the stable tokens (system prompt, grounding instructions) vs. variable tokens (evidence, question) in a typical request. What's the theoretical maximum cache hit rate?
  3. Add cost tracking to your traced pipeline from the previous lesson. After each generation span, calculate the cost and attach it as metadata. Run 10 benchmark questions and check the per-question cost distribution in Langfuse.
  4. Calculate cost per successful task for your latest graded benchmark run using observability/success_cost.py. Then estimate what the metric would be if accuracy improved by 20 percentage points.
  5. Set up the rate-limit tracker and simulate your benchmark run's request pattern. Are the default limits appropriate, or would a benchmark run get throttled?

Completion checkpoint

You should now have:

  • A cost tracking module that estimates per-request cost from token counts and model pricing
  • An understanding of prompt caching: when it helps, when it doesn't, and how to measure cache hit rates
  • A model routing strategy that matches model cost to task complexity
  • Token budgets that catch pathological queries before they become expensive
  • Rate-limit telemetry that records throttling decisions as structured events
  • Cost per successful task calculated for at least one graded benchmark run

Reflection prompts

  • What surprised you about the cost breakdown? Was the model (input vs. output) or the evidence (retrieval tokens) the bigger factor?
  • If you had to cut costs by 50% without reducing accuracy, which optimization would you try first: model routing, prompt caching, or retrieval skipping?
  • How does the cost per successful task metric change your thinking about accuracy improvements compared to cost optimizations?

Connecting to the project

We now have two layers of operational telemetry: traces that show what happened in each request, and cost metrics that show what each request cost. Together, they answer the question "is this system affordable and visible enough to operate?"

But we've been building these pieces (the benchmark, the run logs, the telemetry, the cost tracking) as separate scripts. The next lesson pulls them together into a single harness: one command to run the benchmark, trace every question, calculate costs, and produce a structured run log ready for grading. That harness will become the foundation for every evaluation we build afterward.

What's next

Building Your AI Harness. You have the pieces of an experiment system now, scattered across modules; the next lesson assembles them into one repeatable loop.

References

Start here

  • Anthropic: Prompt caching — Anthropic's prompt caching documentation, including when caching activates and pricing discounts


Glossary
API (Application Programming Interface)Foundational terms
A structured way for programs to communicate. In this context, usually an HTTP endpoint you call to interact with an LLM.
AST (Abstract Syntax Tree)Foundational terms
A tree representation of source code structure. Used by parsers like Tree-sitter to understand code as a hierarchy of functions, classes, and statements. You'll encounter this more deeply in the Code Retrieval module, but the concept appears briefly in retrieval fundamentals.
BM25 (Best Match 25)Foundational terms
A classical ranking function for keyword search. Scores documents by term frequency and inverse document frequency. Often competitive with or complementary to vector search.
ChunkingFoundational terms
Splitting a document into smaller pieces for indexing and retrieval. Chunk boundaries significantly affect retrieval quality. Split at the wrong place and your retrieval will return half a function or the end of one paragraph glued to the start of another.
**Context engineering** — The discipline of selecting, packaging, and budgeting the information a model sees at inference time. Prompts, retrieved evidence, tool results, memory, and state are all parts of context. Context engineering is arguably the core skill of AI engineering. Bigger context windows are not a substitute for better context selection.

**Context rot** — Degradation of output quality caused by stale, noisy, or accumulated context. Symptoms include stale memory facts, conflicting retrieved evidence, bloated prompt history, and accumulated instructions that contradict each other. A form of technical debt in AI systems.

**Context window** — The maximum number of tokens an LLM can process in a single request (input + output combined).

**Embedding** — A fixed-length numeric vector representing a piece of text. Used for similarity search: texts with similar meanings have nearby embeddings.

**Endpoint** — A specific URL path that accepts requests and returns responses (e.g., POST /v1/chat/completions).

**GGUF** — A file format for quantized models used by llama.cpp and Ollama. When you see a model name like qwen2.5:7b-q4_K_M, the suffix indicates the quantization scheme. GGUF supports mixed quantization (different precision for different layers) and is the most common format for local inference.

**Hallucination** — When a model generates content that sounds confident but isn't supported by the evidence it was given, or fabricates details that don't exist. Not the same as "any wrong answer"; a model that misinterprets ambiguous instructions gave a bad answer but didn't hallucinate. Common causes: weak prompt, missing context, context rot, model limitation, or retrieval failure.

**Inference** — Running a trained model to generate output from input. What happens when you call an API. Most AI engineering work is inference-time work: building systems around models, not training them. Use "inference," not "inferencing."

**JSON (JavaScript Object Notation)** — A lightweight text format for structured data. The lingua franca of API communication.

**Lexical search** — Finding items by matching keywords or terms. Includes BM25, TF-IDF (Term Frequency–Inverse Document Frequency), and simple keyword matching. Returns exact term matches, not semantic similarity.

**LLM (Large Language Model)** — A neural network trained on large text corpora that generates text by predicting the next token. The core technology behind AI engineering; every tool, pattern, and pipeline in this curriculum runs on top of one.

**Metadata** — Structured information about a document or chunk (file path, language, author, date, symbol type). Used for filtering retrieval results.

**Neural network** — A computing system loosely inspired by biological neurons, built from layers of mathematical functions that transform inputs into outputs. LLMs are a specific type of neural network (transformers) trained on text. You don't need to understand neural network internals to do AI engineering, but knowing the term helps when reading external resources.

**Reasoning model** — A model optimized for complex multi-step planning, math, and logic (e.g., o3, o4-mini). Slower and more expensive but better on hard problems. Sometimes called "LRM" (large reasoning model), but "reasoning model" is the more consistent term across provider docs.

**Reranking** — A second-pass scoring step that re-orders retrieved results using a more expensive model. Improves precision after an initial broad retrieval.

**Schema** — A formal description of the shape and types of a data structure. Used to validate inputs and outputs.

**SLM (small language model)** — A compact model (typically 1-7B parameters) that runs on consumer hardware with lower cost and latency and better privacy (e.g., Phi, small Llama variants, Gemma). The right choice when privacy, offline operation, predictable cost, or low latency matter more than peak capability.

**System prompt** — A special message that sets the model's behavior, role, and constraints for a conversation.

**Temperature** — A parameter controlling output randomness. Lower values produce more deterministic output; higher values produce more varied output. Does not affect the model's intelligence.

**Token** — The basic unit an LLM processes. Not a word. Tokens are sub-word fragments: "unhappiness" might be three tokens: "un", "happi", "ness". Token count determines cost and context window usage.

**Top-k** — The number of results returned from a retrieval query. "Top-5" means the five highest-scoring results.

**Top-p (nucleus sampling)** — An alternative to temperature for controlling output diversity. Selects from the smallest set of tokens whose cumulative probability exceeds p.

**Vector search** — Finding items by proximity in embedding space (nearest neighbors). Returns "similar" results, not "exact match" results.

**vLLM (virtual LLM)** — An inference serving engine (not a model) that hosts open-weight models behind an OpenAI-compatible HTTP endpoint. Infrastructure layer, not model layer. Relevant when moving from hosted APIs to self-hosting.

**Weights** — The learned parameters inside a model. Changed during training, fixed during inference.

**Workhorse model** — A general-purpose LLM optimized for speed and broad capability (e.g., GPT-4o-mini, Claude Haiku, Gemini Flash). The default for most tasks. When someone says "LLM" without qualification, they usually mean this.
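Several of these terms (token, context window, workhorse vs. reasoning model) meet in cost accounting: token counts times per-token prices. A minimal sketch of per-request cost calculation; the model names and per-million-token prices below are hypothetical placeholders, not real provider pricing:

```python
# Hypothetical model names and per-million-token prices (not real provider pricing).
PRICES = {
    "workhorse": {"input": 0.15, "output": 0.60},   # $ per 1M tokens
    "reasoning": {"input": 2.00, "output": 8.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request, from token counts and a price table."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A request with 3,000 input tokens and 500 output tokens on the cheap model:
cost = request_cost("workhorse", 3_000, 500)
```

The same function, pointed at a real price table and real token counts from API responses, is the seed of the cost tracking this lesson builds.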
### Benchmark and Harness terms

**Baseline** — The first measured performance of your system on a benchmark. Everything else is compared against this. Without a baseline, you can't tell whether a change helped.

**Benchmark** — A fixed set of questions or tasks with known-good answers, used to measure system performance over time.

**Run log** — A structured record (typically JSONL) of every system run: what input was given, what output was produced, what tools were called, how long it took, and what it cost. The raw data that evals, telemetry, and cost analysis are built from.
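A run log like the one described above can be as simple as one JSON object appended per run. A minimal sketch, assuming illustrative field names rather than any standard schema:

```python
import json
import time

def log_run(path: str, record: dict) -> None:
    """Append one run record to a JSONL run log."""
    record = {"ts": time.time(), **record}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

log_run("runs.jsonl", {
    "input": "How do I rotate an API key?",
    "output": "See docs/security.md ...",
    "tool_calls": ["search_docs"],
    "latency_ms": 840,
    "cost_usd": 0.00075,
})

# Reading the log back is just parsing one line at a time.
last = json.loads(open("runs.jsonl", encoding="utf-8").readlines()[-1])
```

One line per run, append-only, trivially greppable: that simplicity is why JSONL is the default format for run logs.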
### Agent and Tool Building terms

**A2A (Agent-to-Agent protocol)** — An open protocol for peer-to-peer agent collaboration. Agents discover each other's capabilities and delegate or negotiate tasks as equals. Different from MCP (which connects agents to tools, not to other agents) and from handoffs (which transfer control within one system).

**Agent** — A system where an LLM decides which tools to call, observes results, and iterates until a task is complete. Agent = model + tools + control loop.

**Control loop** — The code that manages the agent's cycle: send prompt, check for tool calls, execute tools, append results, repeat or finish.

**Handoff** — Passing control from one agent or specialist to another within an orchestrated system.

**MCP (Model Context Protocol)** — An open protocol for exposing tools, resources, and prompts to AI applications in a standardized way. Connects agents to capabilities (tools and data), not to other agents.

**Tool calling / function calling** — The model's ability to request execution of a specific function with structured arguments, rather than just generating text.
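The agent and control-loop entries above fit in one small loop. A sketch with a fake model function and one fake tool standing in for a real provider's tool-calling API; all names here are illustrative:

```python
def fake_model(messages):
    """Stand-in for an LLM call: requests a tool once, then answers."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "add", "args": {"a": 2, "b": 3}}}
    return {"tool_call": None, "content": "The sum is 5."}

TOOLS = {"add": lambda a, b: a + b}

def run_agent(user_input, model=fake_model, max_steps=5):
    messages = [{"role": "user", "content": user_input}]
    for _ in range(max_steps):                # the control loop
        reply = model(messages)
        call = reply["tool_call"]
        if call is None:                      # no tool requested: task complete
            return reply["content"]
        result = TOOLS[call["name"]](**call["args"])   # execute the tool
        messages.append({"role": "tool", "content": str(result)})
    raise RuntimeError("agent did not finish within max_steps")

answer = run_agent("What is 2 + 3?")
```

The `max_steps` cap is not optional decoration: without it, a model that keeps requesting tools loops (and bills) forever.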
### Code Retrieval terms

**Context compilation / context packing** — The process of selecting and assembling the smallest useful set of evidence for a specific task. Not "dump everything retrieved into the prompt."

**Grounding** — Tying model assertions to specific evidence. A grounded answer cites what it found; an ungrounded answer asserts without evidence.

**Hybrid retrieval** — Combining multiple retrieval methods (e.g., vector search + keyword search + metadata filters) and merging or reranking the results.

**Knowledge graph** — A data structure that stores entities and their relationships explicitly (e.g., "function A calls function B," "module X imports module Y"). Useful for traversal and dependency reasoning. One retrieval strategy among several, often overused when simpler metadata or adjacency tables would suffice.

**RAG (Retrieval-Augmented Generation)** — A pattern where the model's response is grounded in retrieved external evidence rather than relying solely on its training data.

**Symbol table** — A mapping of code identifiers (functions, classes, variables) to their locations and metadata.

**Tree-sitter** — An incremental parsing library that builds ASTs for source code. Used in this curriculum for code-aware chunking and symbol extraction.
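Context compilation, the first entry in this group, often reduces to greedy selection under a token budget. A minimal sketch, assuming chunks arrive pre-scored for relevance and using word count as a crude stand-in for a real tokenizer:

```python
def compile_context(chunks, budget_tokens):
    """Pick highest-scoring chunks until the token budget is exhausted."""
    picked, used = [], 0
    for text, score in sorted(chunks, key=lambda c: c[1], reverse=True):
        cost = len(text.split())        # crude proxy; use a real tokenizer in practice
        if used + cost <= budget_tokens:
            picked.append(text)
            used += cost
    return picked

chunks = [
    ("def foo(): ...", 0.9),
    ("unrelated readme text " * 50, 0.4),   # relevant-ish but huge
    ("class Bar: ...", 0.8),
]
context = compile_context(chunks, budget_tokens=20)
```

Note what the budget does: the bloated low-scoring chunk is dropped entirely rather than squeezing out the two small high-scoring ones.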
### RAG and Grounded Answers terms

**Context pack** — A structured bundle of evidence assembled for a specific task, with metadata about provenance, relevance, and token budget.

**Evidence bundle** — A collection of retrieved items grouped for a specific sub-task, with enough metadata to evaluate whether the evidence is relevant and sufficient.

**Retrieval routing** — Deciding which retrieval strategy or method to use for a given query. Different questions need different retrieval methods.
### Observability and Evals terms

**Eval** — A structured test that measures system quality. Not the same as training. Evals measure, they don't change the model.

**Harness (AI harness / eval harness)** — The experiment and evaluation framework around your model or agent. It runs benchmark tasks, captures outputs, logs traces, grades results, and compares system versions. It turns ad hoc "try it and see" into repeatable, comparable experiments. Typically includes: input dataset, prompt and tool configuration, model/provider selection, execution loop, logging, grading, and artifact capture.

**LLM-as-judge** — Using a language model to evaluate or grade the output of another model or system. Useful for scaling evaluation beyond manual review, but requires rubric quality, judge consistency checks, and human spot-checking. Not a replacement for exact-match checks where they apply.

**OpenTelemetry (OTel)** — An open standard for collecting and exporting telemetry data (traces, metrics, logs). Vendor-agnostic.

**RAGAS** — A specific eval framework for retrieval-augmented generation. Measures metrics like faithfulness, relevance, and context precision. One tool example, not a foundational concept. Learn the metrics first, then the tool.

**Span** — A single operation within a trace (e.g., one tool call, one retrieval query). Traces are made of spans.

**Telemetry** — Structured data about system behavior: what happened, when, how long it took, what it cost. Includes traces, metrics, and events.

**Trace** — A structured record of one complete run through the system, including all steps, tool calls, and decisions.
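A trace made of spans, as defined above, can be prototyped in a few lines before reaching for a full SDK. This sketch shows the shape of the data only; it is not the OpenTelemetry API, which real systems would use instead:

```python
import time
from contextlib import contextmanager

trace = {"name": "answer_question", "spans": []}

@contextmanager
def span(name, **attrs):
    """Record one operation (a span) inside the current trace."""
    start = time.perf_counter()
    try:
        yield
    finally:
        trace["spans"].append({
            "name": name,
            "duration_ms": (time.perf_counter() - start) * 1000,
            **attrs,
        })

with span("retrieve", query="rotate api key"):
    evidence = ["docs/security.md#L12"]
with span("generate", model="workhorse"):
    answer = "Rotate keys via the dashboard."
```

Each `with` block becomes one span with a name, a duration, and arbitrary attributes, which is exactly the structure cost and latency metrics are aggregated from.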
### Orchestration and Memory terms

**Long-term memory** — Persistent facts that survive across conversations. Requires write policies to manage what gets stored, updated, or deleted.

**Orchestration** — Explicit control over how tasks are routed, delegated, and synthesized across multiple agents or specialists.

**Router** — A component that decides which specialist or workflow path to use for a given query.

**Specialist** — An agent or workflow tuned for a narrow task (e.g., "code search," "documentation lookup," "test generation"). Specialists are composed by an orchestrator.

**Thread memory** — Conversation state that persists within a single session or thread.

**Workflow memory** — Intermediate state that persists within a multi-step task but doesn't survive beyond the workflow's completion.
### Optimization terms

**Catastrophic forgetting** — When fine-tuning causes a model to lose capabilities it had before training. The model gets better at the fine-tuned task but worse at tasks it previously handled. PEFT methods like LoRA reduce this risk by freezing original weights.

**Distillation** — Training a smaller (student) model to reproduce the behavior of a larger (teacher) model on a specific task.

**DPO (Direct Preference Optimization)** — A method for preference-based model optimization that's simpler than RLHF, training the model directly on preference pairs without a separate reward model.

**Fine-tuning** — Updating a model's weights on task-specific data to change its behavior permanently. An umbrella term that includes SFT, instruction tuning, RLHF, DPO, and other techniques. See the fine-tuning landscape table in Lesson 8.3 for how these relate.

**Full fine-tuning** — Updating all of a model's parameters during training, as opposed to PEFT methods that update only a small subset. Requires significantly more GPU memory and compute. Produces the most thorough adaptation but carries higher risk of catastrophic forgetting.

**Inference server** — Software (like vLLM or Ollama) that hosts a model and serves inference requests.

**Instruction tuning** — A specific application of SFT where the training data consists of instruction-response pairs. This is how base models become chat models: the technique is SFT, the data format is instructions. Not a separate technique from SFT.

**LoRA (Low-Rank Adaptation)** — A parameter-efficient fine-tuning method that trains small adapter matrices instead of updating all model weights. Dramatically reduces GPU memory and compute requirements.

**Overfitting** — When a model memorizes training examples instead of learning generalizable patterns. The model performs well on training data but poorly on new inputs. Detected by monitoring validation loss alongside training loss.

**Parameter count** — The number of learned weights in a model, commonly expressed in billions (e.g., "7B" = 7 billion parameters). Determines memory requirements (roughly 2 bytes per parameter at FP16) and broadly correlates with capability, though training quality and architecture matter as much as size. See Model Selection and Serving for sizing guidance.

**PEFT (Parameter-Efficient Fine-Tuning)** — A family of methods (including LoRA) that fine-tune a small subset of parameters instead of the full model.

**Preference optimization** — Training methods (RLHF, DPO) that use human or automated preference signals to improve model behavior. "This output is better than that output" rather than "this is the correct output."

**QLoRA (Quantized LoRA)** — LoRA applied to a quantized (compressed) base model. Further reduces memory requirements, enabling fine-tuning on consumer hardware.

**Quantization** — Reducing the precision of model weights (e.g., FP16 → INT4) to shrink memory usage and increase inference speed at some quality cost. A 7B model at FP16 needs ~14 GB VRAM; quantized to 4-bit, it fits in ~4 GB. Common formats include GGUF (llama.cpp/Ollama), GPTQ and AWQ (vLLM/HuggingFace). See Model Selection and Serving for format details and tradeoffs.

**RLHF (Reinforcement Learning from Human Feedback)** — A training method that uses human preference signals to improve model behavior through a reward model. More complex than DPO (requires training a separate reward model) but offers more control over the optimization objective.

**SFT (Supervised Fine-Tuning)** — Fine-tuning using input-output pairs where the desired output is known. The most common fine-tuning approach.

**TRL (Transformer Reinforcement Learning)** — A Hugging Face library for training language models with reinforcement learning, SFT, and other optimization methods.
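The VRAM figures in the quantization and parameter count entries follow from one multiplication: parameters times bytes per parameter. A quick sketch of that arithmetic, ignoring activation and KV-cache overhead:

```python
def model_memory_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate weight memory: parameter count x bytes per parameter."""
    bytes_total = params_billions * 1e9 * (bits_per_param / 8)
    return bytes_total / 1e9

fp16 = model_memory_gb(7, 16)   # 7B at FP16: ~14 GB, matching the entry above
int4 = model_memory_gb(7, 4)    # 7B at 4-bit: ~3.5 GB of weights
```

In practice 4-bit GGUF files land nearer 4 GB than 3.5 GB, because mixed-precision layers and format overhead add a few hundred megabytes on top of the raw weight math.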
### Cross-cutting terms

**Consumer chat app** — The browser or desktop product meant for human conversation (ChatGPT, Claude, HuggingChat). Useful for experimentation, but not the same as API access.

**Developer platform** — The provider's API, billing, API-key, and developer-docs surface. This is what you need for this learning path.

**Hosted API** — The provider runs the model for you and you call it over HTTP.

**Local inference** — You run the model on your own machine.

**Provider** — The company or service that hosts a model API you call from code.

**Prompt caching** — Reusing computation from repeated prompt prefixes to reduce latency and cost on subsequent requests with the same prefix.

**Rate limiting** — Constraints on how many API requests you can make per unit of time. An operational concern that affects system design and cost.

**Token budget** — The maximum number of tokens you allocate for a specific part of the context (e.g., "retrieval evidence gets at most 4K tokens"). A context engineering tool for preventing any single component from dominating the context window.
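Prompt caching, defined above, changes the input side of the cost equation: a repeated prefix is billed at a discount on cache hits. A minimal sketch of that arithmetic; the price and discount numbers are hypothetical, and real providers differ in how they bill cached tokens:

```python
def cached_input_cost(prefix_tokens, suffix_tokens, price_per_m, cache_discount, cache_hit):
    """Input cost when a shared prompt prefix may be served from the cache."""
    prefix_price = price_per_m * cache_discount if cache_hit else price_per_m
    return (prefix_tokens * prefix_price + suffix_tokens * price_per_m) / 1_000_000

# Hypothetical numbers: 4,000-token system prompt + evidence prefix,
# 300-token question, $0.15 per 1M input tokens, cached tokens at 10% of list price.
miss = cached_input_cost(4_000, 300, 0.15, 0.1, cache_hit=False)
hit = cached_input_cost(4_000, 300, 0.15, 0.1, cache_hit=True)
```

The gap between `miss` and `hit` is why cache hit rate is worth measuring: with a large shared prefix, most of the input bill rides on whether the prefix stays stable between requests.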