Module 6 (Observability and Evals): Cost, Caching, and Rate Limits

Cost, Caching, and Operational Telemetry

The previous lesson made the pipeline visible. You can trace any request and see what happened. But traces alone don't answer the operational questions that matter in a real system: How much does each answer cost? Are we wasting tokens on evidence the model ignores? Could we serve 80% of requests with a cheaper model? When a user hits a rate limit, do we know about it before they complain?

This lesson adds the cost, caching, and operational layers that turn raw telemetry into actionable metrics. We'll build token budgeting, prompt caching awareness, cost-per-successful-task tracking, and rate-limit telemetry, all instrumented into the traced pipeline. By the end, you'll be able to answer "is this system affordable to operate?" with data instead of hope.

What you'll learn

  • Calculate and track cost per request using token counts and provider pricing
  • Understand prompt caching: when it works well, when it doesn't, and how to measure cache hit rates
  • Build model routing by task complexity to match cost to difficulty
  • Set up token budgets that prevent runaway spending
  • Add rate-limit telemetry that connects back to the rate limiting concepts from Module 1's security-basics
  • Track cost per successful task, not just cost per request

Concepts

Cost per request — the total token cost for one end-to-end pipeline execution. This includes input tokens (system prompt + evidence + question), output tokens (the model's response), and any intermediate calls (routing classification, grounding checks). Provider pricing varies by model, so cost tracking requires knowing both the token count and the per-token price for each model used.

Cost per successful task — a more useful metric than cost per request. Not every request succeeds. Some abstain, some produce wrong answers, some hit rate limits. Cost per successful task divides total spending by the number of requests that actually helped the user. If your system costs $0.01 per request but only 60% of requests succeed, the real cost is $0.017 per successful task. This metric catches the trap of optimizing for cheap requests that aren't useful.
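The arithmetic behind that example is worth internalizing; here it is as two lines of Python:

```python
# Cost per successful task: spread total spend over only the requests
# that actually helped the user.
cost_per_request = 0.01
success_rate = 0.60

cost_per_successful_task = cost_per_request / success_rate
print(f"${cost_per_successful_task:.4f}")  # → $0.0167, the ~$0.017 above
```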

Prompt caching — a provider-level optimization where repeated prompt prefixes are cached and charged at a reduced rate (or not charged at all). Anthropic, OpenAI, and other providers offer variants of this. Caching works well when your system prompt is stable and appears in many requests: the first request pays full price, and subsequent requests with the same prefix get a discount. It works poorly when prompts vary significantly between requests (different evidence bundles, personalized instructions) because each unique prefix is a cache miss.

Cache hit rate — the percentage of requests where the prompt prefix was served from cache rather than processed from scratch. A high cache hit rate (70%+) means your prompt structure is cache-friendly. A low rate means either your prompts vary too much, or your request volume is too low for the cache to stay warm. Tracking this metric tells you whether caching is actually saving money or just adding complexity.

Token budget — a hard limit on the number of tokens a single request can consume, including both input and output. Token budgets prevent runaway costs from pathological queries (questions that trigger large evidence bundles or verbose responses). They're a safety net, not an optimization. The goal is to catch outliers, not to constrain normal operation.

Model routing by task complexity — using a cheaper, faster model for simple tasks and reserving expensive models for complex ones. In our pipeline, skip-mode questions (general knowledge) can use a smaller model, while complex hybrid-retrieval questions might need a larger one. This is a cost optimization that trades a small amount of routing complexity for significant savings on easy questions.

Rate-limit telemetry — structured tracking of rate-limit events: when a request is throttled, which limit was hit (per-user, per-route, provider-side), and what happened to the request (queued, retried, rejected). In Module 1's security-basics lesson, we set up rate limiting as a protective measure. Here we'll instrument it so you can see rate limiting in action and tune thresholds based on real usage patterns.

Problem-to-Tool Map

| Problem class | Symptom | Cheapest thing to try first | Tool or approach |
| --- | --- | --- | --- |
| Unknown costs | You don't know what your system costs per request | Manual token estimates | Per-generation cost tracking in traces |
| Wasted tokens | Evidence is retrieved but never cited in the answer | Review traces manually | Token-waste metric comparing evidence tokens to citation count |
| Expensive easy questions | Simple questions use the same expensive model as complex ones | Single model for everything | Model routing by task complexity |
| Cache misses | Prompt caching is enabled but savings are minimal | Check prompt structure | Cache hit rate metric with prefix stability analysis |
| Silent rate limiting | Users hit limits but you don't know until they complain | Wait for complaints | Rate-limit event telemetry with alerts |
| Runaway costs | One pathological query costs 10x the average | No budget enforcement | Token budgets with hard cutoffs |

Walkthrough

Tracking cost per request

Provider pricing changes, so we'll build a simple pricing table that's easy to update. The goal isn't precision to the penny; it's having any cost visibility at all:

# observability/cost_tracker.py
"""Token cost tracking for traced pipeline runs.

Pricing is approximate and will need periodic updates.
Check your provider's pricing page for current rates.
"""

# Prices in USD per 1M tokens — update when pricing changes
# Last updated: 2026-03-25
MODEL_PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00, "cached_input": 1.25},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60, "cached_input": 0.075},
    "gpt-4.1-mini": {"input": 0.40, "output": 1.60, "cached_input": 0.10},
    "gpt-4.1-nano": {"input": 0.10, "output": 0.40, "cached_input": 0.025},
    "claude-sonnet-4-6": {"input": 3.00, "output": 15.00, "cached_input": 0.30},
    "claude-haiku-3-5": {"input": 0.80, "output": 4.00, "cached_input": 0.08},
    "Qwen/Qwen2.5-7B-Instruct": {"input": 0.27, "output": 0.27, "cached_input": 0.27},
    "gpt-oss:20b": {"input": 0.70, "output": 0.70, "cached_input": 0.70},
}

# Fallback for unknown models. Local Ollama models do not have a hosted
# per-token price, so track them separately using latency, GPU time, or
# cloud instance cost instead of forcing a fake token rate.
DEFAULT_PRICING = {"input": 1.00, "output": 3.00, "cached_input": 0.50}


def estimate_cost(
    model: str,
    input_tokens: int,
    output_tokens: int,
    cached_input_tokens: int = 0,
) -> dict:
    """Estimate the cost of a single LLM call.

    Returns a dict with input_cost, output_cost, cache_savings,
    and total_cost in USD.
    """
    pricing = MODEL_PRICING.get(model, DEFAULT_PRICING)

    # Non-cached input tokens
    non_cached_input = input_tokens - cached_input_tokens
    input_cost = (non_cached_input / 1_000_000) * pricing["input"]
    cached_cost = (cached_input_tokens / 1_000_000) * pricing["cached_input"]
    output_cost = (output_tokens / 1_000_000) * pricing["output"]

    # What caching saved
    full_input_cost = (input_tokens / 1_000_000) * pricing["input"]
    cache_savings = full_input_cost - (input_cost + cached_cost)

    return {
        "input_cost": round(input_cost + cached_cost, 6),
        "output_cost": round(output_cost, 6),
        "cache_savings": round(cache_savings, 6),
        "total_cost": round(input_cost + cached_cost + output_cost, 6),
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cached_input_tokens": cached_input_tokens,
    }


if __name__ == "__main__":
    # Example: a typical RAG request
    cost = estimate_cost(
        model="gpt-4o-mini",
        input_tokens=3500,  # system prompt + evidence + question
        output_tokens=400,
        cached_input_tokens=800,  # system prompt was cached
    )

    print("Cost estimate for one RAG request:")
    for k, v in cost.items():
        if isinstance(v, float):
            print(f"  {k}: ${v:.6f}")
        else:
            print(f"  {k}: {v}")

Run it:

python observability/cost_tracker.py

Expected output:

Cost estimate for one RAG request:
  input_cost: $0.000465
  output_cost: $0.000240
  cache_savings: $0.000060
  total_cost: $0.000705
  model: gpt-4o-mini
  input_tokens: 3500
  output_tokens: 400
  cached_input_tokens: 800

Prompt caching: when it works and when it doesn't

Prompt caching reduces input token costs by caching the stable prefix of your prompt, the part that doesn't change between requests. Here's how to think about it:

Caching works well when:

  • Your system prompt is stable across requests (same instructions, same format)
  • Evidence bundles have common prefixes (e.g., the same repository README appears in many requests)
  • Request volume is high enough to keep the cache warm (provider caches expire after inactivity)
  • You're using a provider that supports caching (Anthropic's prompt caching, OpenAI's cached input pricing)

Caching works poorly when:

  • Every request has a unique prompt (different evidence, different instructions)
  • Request volume is low (cache expires between requests)
  • You're frequently updating system prompts during development
  • Evidence bundles vary significantly between question types

The practical implication for our pipeline: the system prompt and grounding instructions are stable, so they'll cache well. The evidence bundle changes every request, so it won't cache. This means caching saves money on the fixed portion of the prompt, but the variable portion (which is usually the largest part in RAG) won't benefit.

# observability/cache_metrics.py
"""Track prompt caching effectiveness across runs."""
import json
from pathlib import Path


def analyze_cache_rates(run_traces: list[dict]) -> dict:
    """Analyze cache hit rates from traced generation spans.

    Expects a list of trace dicts with 'cached_input_tokens'
    and 'input_tokens' fields from the generation spans.
    """
    total_input_tokens = 0
    total_cached_tokens = 0
    generations = 0

    for trace in run_traces:
        if "input_tokens" in trace and "cached_input_tokens" in trace:
            total_input_tokens += trace["input_tokens"]
            total_cached_tokens += trace["cached_input_tokens"]
            generations += 1

    if generations == 0:
        return {"cache_hit_rate": 0, "generations": 0}

    hit_rate = total_cached_tokens / total_input_tokens if total_input_tokens > 0 else 0

    return {
        "cache_hit_rate": round(hit_rate * 100, 1),
        "total_input_tokens": total_input_tokens,
        "total_cached_tokens": total_cached_tokens,
        "generations": generations,
        "estimated_savings_pct": round(hit_rate * 50, 1),  # rough: cached is ~50% cheaper
    }


if __name__ == "__main__":
    # Simulated data from a traced benchmark run
    sample_traces = [
        {"input_tokens": 3500, "cached_input_tokens": 800},
        {"input_tokens": 4200, "cached_input_tokens": 800},
        {"input_tokens": 2100, "cached_input_tokens": 800},
        {"input_tokens": 3800, "cached_input_tokens": 800},
        {"input_tokens": 500, "cached_input_tokens": 200},  # skip-mode, smaller prompt
    ]

    metrics = analyze_cache_rates(sample_traces)
    print("Cache analysis:")
    for k, v in metrics.items():
        print(f"  {k}: {v}")
Run it:

python observability/cache_metrics.py

Expected output:

Cache analysis:
  cache_hit_rate: 24.1
  total_input_tokens: 14100
  total_cached_tokens: 3400
  generations: 5
  estimated_savings_pct: 12.1

A 24% cache hit rate tells you that about a quarter of your input tokens are coming from cached prefixes, likely the system prompt. To improve this, you'd need to stabilize more of the prompt prefix (e.g., by putting frequently used context before the variable evidence). But there's a tradeoff: restructuring prompts for cacheability can hurt clarity. I've found it's usually better to optimize prompt structure for quality first and accept whatever caching you get naturally.
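To estimate the theoretical ceiling for your own prompts, compare the stable and variable portions. A rough sketch, where the prompt text and the chars/4 token heuristic are illustrative (not our pipeline's real prompt builder or a real tokenizer):

```python
# Rough sketch: the cache-friendly ceiling is the share of the prompt
# that is stable across requests. Prompt text and the chars/4 token
# estimate below are illustrative assumptions.
SYSTEM_PROMPT = "You are a code assistant. Answer only from the provided evidence."
GROUNDING_RULES = "Cite file and line for every claim. Abstain if the evidence is insufficient."

def max_cache_hit_rate(evidence: str, question: str) -> float:
    stable = SYSTEM_PROMPT + "\n" + GROUNDING_RULES              # identical every request
    variable = f"\nEvidence:\n{evidence}\nQuestion: {question}"  # unique every request
    stable_tokens = len(stable) // 4
    variable_tokens = len(variable) // 4
    return stable_tokens / (stable_tokens + variable_tokens)

rate = max_cache_hit_rate("def read_file(path): ..." * 40, "What does read_file do?")
print(f"Theoretical max cache hit rate: {rate:.0%}")
```

The larger the evidence bundle, the lower this ceiling, which is exactly the RAG situation described above.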

Model routing by task complexity

Not every question needs the same model. Skip-mode questions (general knowledge) are simple, and a smaller, cheaper model handles them fine. Complex hybrid-retrieval questions with nuanced evidence might benefit from a larger model. Model routing matches cost to difficulty:

# observability/model_router.py
"""Route questions to appropriate models based on task complexity.

This extends the retrieval router from Module 5 with a model
selection step, so simple questions use cheap models and complex
questions use capable ones.
"""
import sys
sys.path.insert(0, ".")

from rag.retrieval_service import (
    classify_question, RetrievalPolicy, RetrievalMode,
)


# Model tiers — adjust based on your provider and budget
MODEL_TIERS = {
    "simple": "gpt-4.1-nano",
    "standard": "gpt-4o-mini",
    "complex": "gpt-4o",
}


def select_model(
    question: str,
    retrieval_mode: RetrievalMode,
    confidence: float,
) -> str:
    """Select a model tier based on question complexity signals.

    Simple questions (skip mode, high confidence) get cheap models.
    Complex questions (hybrid mode, low confidence) get capable models.
    """
    if retrieval_mode == RetrievalMode.SKIP:
        return MODEL_TIERS["simple"]

    if retrieval_mode == RetrievalMode.HYBRID and confidence < 0.5:
        # Ambiguous question that needed hybrid retrieval —
        # the answer will require careful reasoning over mixed evidence
        return MODEL_TIERS["complex"]

    return MODEL_TIERS["standard"]


if __name__ == "__main__":
    policy = RetrievalPolicy()

    test_questions = [
        "What is a Python list?",
        "What does validate_path return?",
        "What functions call read_file and how does the caching interact with the auth module?",
    ]

    for q in test_questions:
        classification = classify_question(q, policy)
        model = select_model(q, classification.mode, classification.confidence)
        print(f"Q: {q[:60]}...")
        print(f"  Route: {classification.mode.value} (confidence: {classification.confidence})")
        print(f"  Model: {model}")
        print()
Run it:

python observability/model_router.py

Model routing is a cost optimization, not a quality optimization. The goal is to avoid paying for capability you don't need, not to find the "best" model for each question. Start with a single model for everything (as we've been doing), measure costs, and only add model routing if the cost savings justify the added complexity.
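To decide whether the savings justify the complexity, you can compute a blended cost from your traffic mix before building anything. The prices mirror the MODEL_PRICING table above; the 60/30/10 traffic split is a made-up assumption — substitute the route distribution from your own traces:

```python
# Blended cost per request under routing. Traffic mix is hypothetical.
PRICE = {  # USD per 1M tokens: (input, output), from the pricing table above
    "gpt-4.1-nano": (0.10, 0.40),
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
}
MIX = {"gpt-4.1-nano": 0.6, "gpt-4o-mini": 0.3, "gpt-4o": 0.1}  # assumed split
IN_TOK, OUT_TOK = 3500, 400  # the typical request from the walkthrough

def per_request(model: str) -> float:
    inp, out = PRICE[model]
    return (IN_TOK / 1e6) * inp + (OUT_TOK / 1e6) * out

blended = sum(share * per_request(m) for m, share in MIX.items())
single = per_request("gpt-4o")  # everything on the most capable model
print(f"blended ${blended:.6f} vs single-model ${single:.6f} "
      f"({1 - blended / single:.0%} cheaper)")
```

Under this assumed mix, routing cuts per-request cost by roughly 86% relative to running everything on gpt-4o — but only if the cheap models actually succeed on the questions routed to them, which is what the eval lessons measure.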

Token budgets

Token budgets are a safety net against pathological queries. A question that triggers a massive evidence bundle or an unusually verbose response can cost 10-50x the average. Budgets catch these outliers:

# observability/token_budget.py
"""Token budget enforcement for the RAG pipeline.

Prevents runaway costs from pathological queries by enforcing
hard limits on input and output tokens.
"""
from dataclasses import dataclass


@dataclass
class TokenBudget:
    """Per-request token budget configuration."""
    max_input_tokens: int = 8000     # evidence + system prompt + question
    max_output_tokens: int = 2000    # model response
    max_total_tokens: int = 10000    # combined limit
    warn_threshold: float = 0.8      # log a warning at 80% of budget

    def check(self, input_tokens: int, output_tokens: int = 0) -> dict:
        """Check whether a request is within budget.

        Returns a dict with 'allowed', 'warnings', and usage details.
        """
        total = input_tokens + output_tokens
        warnings = []

        if input_tokens > self.max_input_tokens:
            return {
                "allowed": False,
                "reason": f"Input tokens ({input_tokens}) exceed budget ({self.max_input_tokens})",
                "warnings": warnings,
            }

        if total > self.max_total_tokens:
            return {
                "allowed": False,
                "reason": f"Total tokens ({total}) exceed budget ({self.max_total_tokens})",
                "warnings": warnings,
            }

        # Warnings for approaching limits
        if input_tokens > self.max_input_tokens * self.warn_threshold:
            warnings.append(
                f"Input tokens ({input_tokens}) at "
                f"{input_tokens/self.max_input_tokens:.0%} of budget"
            )

        if total > self.max_total_tokens * self.warn_threshold:
            warnings.append(
                f"Total tokens ({total}) at "
                f"{total/self.max_total_tokens:.0%} of budget"
            )

        return {
            "allowed": True,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "total_tokens": total,
            "input_pct": round(input_tokens / self.max_input_tokens * 100, 1),
            "total_pct": round(total / self.max_total_tokens * 100, 1),
            "warnings": warnings,
        }


if __name__ == "__main__":
    budget = TokenBudget()

    test_cases = [
        (3500, 400, "Normal RAG request"),
        (7500, 500, "Large evidence bundle"),
        (9000, 1500, "Pathological query"),
    ]

    for input_t, output_t, label in test_cases:
        result = budget.check(input_t, output_t)
        status = "ALLOWED" if result["allowed"] else "BLOCKED"
        print(f"{label}: {status}")
        if result.get("warnings"):
            for w in result["warnings"]:
                print(f"  WARNING: {w}")
        if not result["allowed"]:
            print(f"  REASON: {result['reason']}")
        print()
Run it:

python observability/token_budget.py

Expected output:

Normal RAG request: ALLOWED

Large evidence bundle: ALLOWED
  WARNING: Input tokens (7500) at 94% of budget

Pathological query: BLOCKED
  REASON: Input tokens (9000) exceed budget (8000)

Rate-limit telemetry

In Module 1's security-basics lesson, we set up rate limiting to protect the system. Here we'll add telemetry to those limits so you can see when they fire and whether the thresholds are right:

# observability/rate_limit_telemetry.py
"""Rate-limit event tracking.

Instruments rate-limit decisions so you can see throttling
in your traces and tune thresholds from real usage data.
"""
import time
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class RateLimitEvent:
    """A single rate-limit event for telemetry."""
    timestamp: str
    user_id: str
    route: str
    action: str           # "allowed", "throttled", "rejected"
    current_count: int
    limit: int
    window_seconds: int


@dataclass
class RateLimitTracker:
    """Track rate-limit events for telemetry and analysis.

    This wraps your existing rate limiter (from Module 1) and
    records every decision as a telemetry event.
    """
    events: list = field(default_factory=list)
    counters: dict = field(default_factory=lambda: defaultdict(list))

    # Configurable limits per route
    limits: dict = field(default_factory=lambda: {
        "rag-pipeline": {"max_requests": 30, "window_seconds": 60},
        "benchmark-run": {"max_requests": 100, "window_seconds": 300},
        "default": {"max_requests": 60, "window_seconds": 60},
    })

    def check(self, user_id: str, route: str = "default") -> RateLimitEvent:
        """Check whether a request is within rate limits.

        Records the decision as a telemetry event regardless of outcome.
        """
        now = time.time()
        config = self.limits.get(route, self.limits["default"])
        window = config["window_seconds"]
        max_req = config["max_requests"]

        # Sliding window: count requests in the last N seconds
        key = f"{user_id}:{route}"
        self.counters[key] = [
            t for t in self.counters[key] if now - t < window
        ]
        current = len(self.counters[key])

        if current >= max_req:
            action = "throttled"
        else:
            action = "allowed"
            self.counters[key].append(now)

        event = RateLimitEvent(
            timestamp=datetime.now(timezone.utc).isoformat(),
            user_id=user_id,
            route=route,
            action=action,
            current_count=current,
            limit=max_req,
            window_seconds=window,
        )
        self.events.append(event)
        return event

    def summary(self) -> dict:
        """Summarize rate-limit events for a telemetry report."""
        total = len(self.events)
        throttled = sum(1 for e in self.events if e.action == "throttled")
        by_route = defaultdict(lambda: {"total": 0, "throttled": 0})

        for e in self.events:
            by_route[e.route]["total"] += 1
            if e.action == "throttled":
                by_route[e.route]["throttled"] += 1

        return {
            "total_events": total,
            "throttled_events": throttled,
            "throttle_rate": round(throttled / total * 100, 1) if total > 0 else 0,
            "by_route": dict(by_route),
        }


if __name__ == "__main__":
    tracker = RateLimitTracker(
        limits={
            "rag-pipeline": {"max_requests": 5, "window_seconds": 10},
            "default": {"max_requests": 10, "window_seconds": 10},
        }
    )

    # Simulate a burst of requests
    print("Simulating 8 rapid requests to rag-pipeline:\n")
    for i in range(8):
        event = tracker.check("user-123", "rag-pipeline")
        print(f"  Request {i+1}: {event.action} ({event.current_count}/{event.limit})")

    print()
    summary = tracker.summary()
    print(f"Summary:")
    print(f"  Total events: {summary['total_events']}")
    print(f"  Throttled: {summary['throttled_events']}")
    print(f"  Throttle rate: {summary['throttle_rate']}%")
Run it:

python observability/rate_limit_telemetry.py

Expected output:

Simulating 8 rapid requests to rag-pipeline:

  Request 1: allowed (0/5)
  Request 2: allowed (1/5)
  Request 3: allowed (2/5)
  Request 4: allowed (3/5)
  Request 5: allowed (4/5)
  Request 6: throttled (5/5)
  Request 7: throttled (5/5)
  Request 8: throttled (5/5)

Summary:
  Total events: 8
  Throttled: 3
  Throttle rate: 37.5%

The throttle rate tells you whether your limits are too tight (high throttle rate during normal use) or too loose (zero throttling even during bursts). In production, you'd feed these events into your Langfuse traces so you can see rate limiting alongside the request traces.

Cost per successful task

This is the metric that actually matters. Cost per request is easy to calculate but misleading because it treats failed requests the same as successful ones. Cost per successful task gives you the true unit economics:

# observability/success_cost.py
"""Calculate cost per successful task from a graded run log.

Combines the cost tracker with graded run results to give you
the metric that actually matters for operational decisions.
"""
import json
import sys

sys.path.insert(0, ".")
from observability.cost_tracker import estimate_cost


def cost_per_success(run_file: str, model: str = "gpt-4o-mini") -> dict:
    """Calculate cost per successful task from a graded run log."""
    entries = []
    with open(run_file) as f:
        for line in f:
            if line.strip():
                entries.append(json.loads(line))

    graded = [e for e in entries if e.get("grade") is not None]
    if not graded:
        return {"error": "No graded entries found"}

    # Estimate costs (using average token counts since run logs
    # don't yet have per-request token data — that's coming in
    # the harness lesson)
    total_cost = 0.0
    for e in graded:
        cost = estimate_cost(
            model=model,
            input_tokens=3500,   # average estimate
            output_tokens=400,   # average estimate
        )
        total_cost += cost["total_cost"]

    successful = [
        e for e in graded
        if e["grade"] in ("fully_correct", "partially_correct")
    ]

    cost_per_request = total_cost / len(graded)
    cost_per_success_val = total_cost / len(successful) if successful else float("inf")

    return {
        "total_requests": len(graded),
        "successful_requests": len(successful),
        "success_rate": round(len(successful) / len(graded) * 100, 1),
        "total_cost": round(total_cost, 4),
        "cost_per_request": round(cost_per_request, 6),
        "cost_per_successful_task": round(cost_per_success_val, 6),
        "cost_overhead_from_failures": round(
            (cost_per_success_val - cost_per_request) / cost_per_request * 100, 1
        ) if successful else None,
    }


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python observability/success_cost.py <graded-run-file.jsonl>")
        print("Example: python observability/success_cost.py harness/runs/baseline-2026-03-24-graded.jsonl")
        sys.exit(1)

    metrics = cost_per_success(sys.argv[1])
    print("Cost per successful task:")
    for k, v in metrics.items():
        if isinstance(v, float):
            if "rate" in k or "overhead" in k:
                print(f"  {k}: {v}%")
            elif "cost" in k:
                print(f"  {k}: ${v}")
            else:
                print(f"  {k}: {v}")
        else:
            print(f"  {k}: {v}")
Run it:

python observability/success_cost.py harness/runs/baseline-2026-03-24-143022-graded.jsonl

Expected output (based on the typical baseline from Module 2):

Cost per successful task:
  total_requests: 30
  successful_requests: 11
  success_rate: 36.7%
  total_cost: $0.023
  cost_per_request: $0.000765
  cost_per_successful_task: $0.002086
  cost_overhead_from_failures: 172.7%

That's a 173% overhead, meaning failures nearly triple your effective cost. This is why improving accuracy is a cost concern just as much as it is a quality concern. Every failed request is money spent with no return. We'll see this metric improve as we add retrieval and better grading in the eval lessons.
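Because cost per successful task is just cost per request divided by the success rate, the overhead is pure algebra, which makes "what if accuracy improved?" projections trivial:

```python
# Failure overhead depends only on the success rate:
# cost_per_success = cost_per_request / success_rate, so the overhead
# relative to cost_per_request is 1 / success_rate - 1.
def failure_overhead_pct(success_rate: float) -> float:
    return (1 / success_rate - 1) * 100

print(f"{failure_overhead_pct(11 / 30):.0f}%")  # → 173%, the baseline above
print(f"{failure_overhead_pct(0.50):.0f}%")     # → 100% at a 50% success rate
```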

Exercises

  1. Run observability/cost_tracker.py with your actual model and typical token counts. Compare costs across at least two models (e.g., gpt-4o-mini vs. gpt-4o or claude-haiku-3-5 vs. claude-sonnet-4-6). What's the cost ratio between the cheapest and most expensive option?
  2. Analyze prompt caching potential for your pipeline. Count the stable tokens (system prompt, grounding instructions) vs. variable tokens (evidence, question) in a typical request. What's the theoretical maximum cache hit rate?
  3. Add cost tracking to your traced pipeline from the previous lesson. After each generation span, calculate the cost and attach it as metadata. Run 10 benchmark questions and check the per-question cost distribution in Langfuse.
  4. Calculate cost per successful task for your latest graded benchmark run using observability/success_cost.py. Then estimate what the metric would be if accuracy improved by 20 percentage points.
  5. Set up the rate-limit tracker and simulate your benchmark run's request pattern. Are the default limits appropriate, or would a benchmark run get throttled?

Completion checkpoint

You should now have:

  • A cost tracking module that estimates per-request cost from token counts and model pricing
  • An understanding of prompt caching: when it helps, when it doesn't, and how to measure cache hit rates
  • A model routing strategy that matches model cost to task complexity
  • Token budgets that catch pathological queries before they become expensive
  • Rate-limit telemetry that records throttling decisions as structured events
  • Cost per successful task calculated for at least one graded benchmark run

Reflection prompts

  • What surprised you about the cost breakdown? Was the model (input vs. output) or the evidence (retrieval tokens) the bigger factor?
  • If you had to cut costs by 50% without reducing accuracy, which optimization would you try first: model routing, prompt caching, or retrieval skipping?
  • How does the cost per successful task metric change your thinking about accuracy improvements compared to cost optimizations?

Connecting to the project

We now have two layers of operational telemetry: traces that show what happened in each request, and cost metrics that show what each request cost. Together, they answer the question "is this system affordable and visible enough to operate?"

But we've been building these pieces (the benchmark, the run logs, the telemetry, the cost tracking) as separate scripts. The next lesson pulls them together into a single harness: one command to run the benchmark, trace every question, calculate costs, and produce a structured run log ready for grading. That harness will become the foundation for every evaluation we build afterward.

What's next

Building Your AI Harness. You have the pieces of an experiment system now, scattered across modules; the next lesson assembles them into one repeatable loop.

References

Start here

  • Anthropic: Prompt caching — Anthropic's prompt caching documentation, including when caching activates and pricing discounts


Glossary
API (Application Programming Interface)Foundational terms
A structured way for programs to communicate. In this context, usually an HTTP endpoint you call to interact with an LLM.
AST (Abstract Syntax Tree)Foundational terms
A tree representation of source code structure. Used by parsers like Tree-sitter to understand code as a hierarchy of functions, classes, and statements. You'll encounter this more deeply in the Code Retrieval module, but the concept appears briefly in retrieval fundamentals.
BM25 (Best Match 25)Foundational terms
A classical ranking function for keyword search. Scores documents by term frequency and inverse document frequency. Often competitive with or complementary to vector search.
ChunkingFoundational terms
Splitting a document into smaller pieces for indexing and retrieval. Chunk boundaries significantly affect retrieval quality. Split at the wrong place and your retrieval will return half a function or the end of one paragraph glued to the start of another.
**Context engineering** — The discipline of selecting, packaging, and budgeting the information a model sees at inference time. Prompts, retrieved evidence, tool results, memory, and state are all parts of context. Context engineering is arguably the core skill of AI engineering. Bigger context windows are not a substitute for better context selection.

**Context rot** — Degradation of output quality caused by stale, noisy, or accumulated context. Symptoms include stale memory facts, conflicting retrieved evidence, bloated prompt history, and accumulated instructions that contradict each other. A form of technical debt in AI systems.

**Context window** — The maximum number of tokens an LLM can process in a single request (input + output combined).

**Embedding** — A fixed-length numeric vector representing a piece of text. Used for similarity search: texts with similar meanings have nearby embeddings.

**Endpoint** — A specific URL path that accepts requests and returns responses (e.g., POST /v1/chat/completions).

**GGUF** — A file format for quantized models used by llama.cpp and Ollama. When you see a model name like qwen2.5:7b-q4_K_M, the suffix indicates the quantization scheme. GGUF supports mixed quantization (different precision for different layers) and is the most common format for local inference.

**Hallucination** — When a model generates content that sounds confident but isn't supported by the evidence it was given, or fabricates details that don't exist. Not the same as "any wrong answer"; a model that misinterprets ambiguous instructions gave a bad answer but didn't hallucinate. Common causes: weak prompt, missing context, context rot, model limitation, or retrieval failure.

**Inference** — Running a trained model to generate output from input. What happens when you call an API. Most AI engineering work is inference-time work: building systems around models, not training them. Use "inference," not "inferencing."

**JSON (JavaScript Object Notation)** — A lightweight text format for structured data. The lingua franca of API communication.

**Lexical search** — Finding items by matching keywords or terms. Includes BM25, TF-IDF (Term Frequency–Inverse Document Frequency), and simple keyword matching. Returns exact term matches, not semantic similarity.

**LLM (Large Language Model)** — A neural network trained on large text corpora that generates text by predicting the next token. The core technology behind AI engineering; every tool, pattern, and pipeline in this curriculum runs on top of one.

**Metadata** — Structured information about a document or chunk (file path, language, author, date, symbol type). Used for filtering retrieval results.

**Neural network** — A computing system loosely inspired by biological neurons, built from layers of mathematical functions that transform inputs into outputs. LLMs are a specific type of neural network (transformers) trained on text. You don't need to understand neural network internals to do AI engineering, but knowing the term helps when reading external resources.

**Reasoning model** — A model optimized for complex multi-step planning, math, and logic (e.g., o3, o4-mini). Slower and more expensive but better on hard problems. Sometimes called "LRM" (large reasoning model), but "reasoning model" is the more consistent term across provider docs.

**Reranking** — A second-pass scoring step that re-orders retrieved results using a more expensive model. Improves precision after an initial broad retrieval.

**Schema** — A formal description of the shape and types of a data structure. Used to validate inputs and outputs.

**SLM (small language model)** — A compact model (typically 1-7B parameters) that runs on consumer hardware with lower cost and latency and better privacy (e.g., Phi, small Llama variants, Gemma). The right choice when privacy, offline operation, predictable cost, or low latency matter more than peak capability.

**System prompt** — A special message that sets the model's behavior, role, and constraints for a conversation.

**Temperature** — A parameter controlling output randomness. Lower values produce more deterministic output; higher values produce more varied output. Does not affect the model's intelligence.

**Token** — The basic unit an LLM processes. Not a word. Tokens are sub-word fragments: "unhappiness" might be three tokens: "un", "happi", "ness". Token count determines cost and context window usage.

**Top-k** — The number of results returned from a retrieval query. "Top-5" means the five highest-scoring results.

**Top-p (nucleus sampling)** — An alternative to temperature for controlling output diversity. Selects from the smallest set of tokens whose cumulative probability exceeds p.

**Vector search** — Finding items by proximity in embedding space (nearest neighbors). Returns "similar" results, not "exact match" results.

**vLLM (virtual LLM)** — An inference serving engine (not a model) that hosts open-weight models behind an OpenAI-compatible HTTP endpoint. Infrastructure layer, not model layer. Relevant when moving from hosted APIs to self-hosting.

**Weights** — The learned parameters inside a model. Changed during training, fixed during inference.

**Workhorse model** — A general-purpose LLM optimized for speed and broad capability (e.g., GPT-4o-mini, Claude Haiku, Gemini Flash). The default for most tasks. When someone says "LLM" without qualification, they usually mean this.
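Several of these terms (token, context window, workhorse vs. reasoning model) meet in cost accounting: token counts times per-token prices. A minimal sketch of per-request cost calculation; the model names and per-million-token prices below are hypothetical placeholders, not real provider pricing:

```python
# Hypothetical model names and per-million-token prices (not real provider pricing).
PRICES = {
    "workhorse": {"input": 0.15, "output": 0.60},   # $ per 1M tokens
    "reasoning": {"input": 2.00, "output": 8.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request, from token counts and a price table."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A request with 3,000 input tokens and 500 output tokens on the cheap model:
cost = request_cost("workhorse", 3_000, 500)
```

The same function, pointed at a real price table and real token counts from API responses, is the seed of the cost tracking this lesson builds.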
### Benchmark and Harness terms

**Baseline** — The first measured performance of your system on a benchmark. Everything else is compared against this. Without a baseline, you can't tell whether a change helped.

**Benchmark** — A fixed set of questions or tasks with known-good answers, used to measure system performance over time.

**Run log** — A structured record (typically JSONL) of every system run: what input was given, what output was produced, what tools were called, how long it took, and what it cost. The raw data that evals, telemetry, and cost analysis are built from.
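A run log like the one described above can be as simple as one JSON object appended per run. A minimal sketch, assuming illustrative field names rather than any standard schema:

```python
import json
import time

def log_run(path: str, record: dict) -> None:
    """Append one run record to a JSONL run log."""
    record = {"ts": time.time(), **record}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

log_run("runs.jsonl", {
    "input": "How do I rotate an API key?",
    "output": "See docs/security.md ...",
    "tool_calls": ["search_docs"],
    "latency_ms": 840,
    "cost_usd": 0.00075,
})

# Reading the log back is just parsing one line at a time.
last = json.loads(open("runs.jsonl", encoding="utf-8").readlines()[-1])
```

One line per run, append-only, trivially greppable: that simplicity is why JSONL is the default format for run logs.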
### Agent and Tool Building terms

**A2A (Agent-to-Agent protocol)** — An open protocol for peer-to-peer agent collaboration. Agents discover each other's capabilities and delegate or negotiate tasks as equals. Different from MCP (which connects agents to tools, not to other agents) and from handoffs (which transfer control within one system).

**Agent** — A system where an LLM decides which tools to call, observes results, and iterates until a task is complete. Agent = model + tools + control loop.

**Control loop** — The code that manages the agent's cycle: send prompt, check for tool calls, execute tools, append results, repeat or finish.

**Handoff** — Passing control from one agent or specialist to another within an orchestrated system.

**MCP (Model Context Protocol)** — An open protocol for exposing tools, resources, and prompts to AI applications in a standardized way. Connects agents to capabilities (tools and data), not to other agents.

**Tool calling / function calling** — The model's ability to request execution of a specific function with structured arguments, rather than just generating text.
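The agent and control-loop entries above fit in one small loop. A sketch with a fake model function and one fake tool standing in for a real provider's tool-calling API; all names here are illustrative:

```python
def fake_model(messages):
    """Stand-in for an LLM call: requests a tool once, then answers."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "add", "args": {"a": 2, "b": 3}}}
    return {"tool_call": None, "content": "The sum is 5."}

TOOLS = {"add": lambda a, b: a + b}

def run_agent(user_input, model=fake_model, max_steps=5):
    messages = [{"role": "user", "content": user_input}]
    for _ in range(max_steps):                # the control loop
        reply = model(messages)
        call = reply["tool_call"]
        if call is None:                      # no tool requested: task complete
            return reply["content"]
        result = TOOLS[call["name"]](**call["args"])   # execute the tool
        messages.append({"role": "tool", "content": str(result)})
    raise RuntimeError("agent did not finish within max_steps")

answer = run_agent("What is 2 + 3?")
```

The `max_steps` cap is not optional decoration: without it, a model that keeps requesting tools loops (and bills) forever.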
### Code Retrieval terms

**Context compilation / context packing** — The process of selecting and assembling the smallest useful set of evidence for a specific task. Not "dump everything retrieved into the prompt."

**Grounding** — Tying model assertions to specific evidence. A grounded answer cites what it found; an ungrounded answer asserts without evidence.

**Hybrid retrieval** — Combining multiple retrieval methods (e.g., vector search + keyword search + metadata filters) and merging or reranking the results.

**Knowledge graph** — A data structure that stores entities and their relationships explicitly (e.g., "function A calls function B," "module X imports module Y"). Useful for traversal and dependency reasoning. One retrieval strategy among several, often overused when simpler metadata or adjacency tables would suffice.

**RAG (Retrieval-Augmented Generation)** — A pattern where the model's response is grounded in retrieved external evidence rather than relying solely on its training data.

**Symbol table** — A mapping of code identifiers (functions, classes, variables) to their locations and metadata.

**Tree-sitter** — An incremental parsing library that builds ASTs for source code. Used in this curriculum for code-aware chunking and symbol extraction.
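Context compilation, the first entry in this group, often reduces to greedy selection under a token budget. A minimal sketch, assuming chunks arrive pre-scored for relevance and using word count as a crude stand-in for a real tokenizer:

```python
def compile_context(chunks, budget_tokens):
    """Pick highest-scoring chunks until the token budget is exhausted."""
    picked, used = [], 0
    for text, score in sorted(chunks, key=lambda c: c[1], reverse=True):
        cost = len(text.split())        # crude proxy; use a real tokenizer in practice
        if used + cost <= budget_tokens:
            picked.append(text)
            used += cost
    return picked

chunks = [
    ("def foo(): ...", 0.9),
    ("unrelated readme text " * 50, 0.4),   # relevant-ish but huge
    ("class Bar: ...", 0.8),
]
context = compile_context(chunks, budget_tokens=20)
```

Note what the budget does: the bloated low-scoring chunk is dropped entirely rather than squeezing out the two small high-scoring ones.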
### RAG and Grounded Answers terms

**Context pack** — A structured bundle of evidence assembled for a specific task, with metadata about provenance, relevance, and token budget.

**Evidence bundle** — A collection of retrieved items grouped for a specific sub-task, with enough metadata to evaluate whether the evidence is relevant and sufficient.

**Retrieval routing** — Deciding which retrieval strategy or method to use for a given query. Different questions need different retrieval methods.
### Observability and Evals terms

**Eval** — A structured test that measures system quality. Not the same as training. Evals measure, they don't change the model.

**Harness (AI harness / eval harness)** — The experiment and evaluation framework around your model or agent. It runs benchmark tasks, captures outputs, logs traces, grades results, and compares system versions. It turns ad hoc "try it and see" into repeatable, comparable experiments. Typically includes: input dataset, prompt and tool configuration, model/provider selection, execution loop, logging, grading, and artifact capture.

**LLM-as-judge** — Using a language model to evaluate or grade the output of another model or system. Useful for scaling evaluation beyond manual review, but requires rubric quality, judge consistency checks, and human spot-checking. Not a replacement for exact-match checks where they apply.

**OpenTelemetry (OTel)** — An open standard for collecting and exporting telemetry data (traces, metrics, logs). Vendor-agnostic.

**RAGAS** — A specific eval framework for retrieval-augmented generation. Measures metrics like faithfulness, relevance, and context precision. One tool example, not a foundational concept. Learn the metrics first, then the tool.

**Span** — A single operation within a trace (e.g., one tool call, one retrieval query). Traces are made of spans.

**Telemetry** — Structured data about system behavior: what happened, when, how long it took, what it cost. Includes traces, metrics, and events.

**Trace** — A structured record of one complete run through the system, including all steps, tool calls, and decisions.
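A trace made of spans, as defined above, can be prototyped in a few lines before reaching for a full SDK. This sketch shows the shape of the data only; it is not the OpenTelemetry API, which real systems would use instead:

```python
import time
from contextlib import contextmanager

trace = {"name": "answer_question", "spans": []}

@contextmanager
def span(name, **attrs):
    """Record one operation (a span) inside the current trace."""
    start = time.perf_counter()
    try:
        yield
    finally:
        trace["spans"].append({
            "name": name,
            "duration_ms": (time.perf_counter() - start) * 1000,
            **attrs,
        })

with span("retrieve", query="rotate api key"):
    evidence = ["docs/security.md#L12"]
with span("generate", model="workhorse"):
    answer = "Rotate keys via the dashboard."
```

Each `with` block becomes one span with a name, a duration, and arbitrary attributes, which is exactly the structure cost and latency metrics are aggregated from.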
### Orchestration and Memory terms

**Long-term memory** — Persistent facts that survive across conversations. Requires write policies to manage what gets stored, updated, or deleted.

**Orchestration** — Explicit control over how tasks are routed, delegated, and synthesized across multiple agents or specialists.

**Router** — A component that decides which specialist or workflow path to use for a given query.

**Specialist** — An agent or workflow tuned for a narrow task (e.g., "code search," "documentation lookup," "test generation"). Specialists are composed by an orchestrator.

**Thread memory** — Conversation state that persists within a single session or thread.

**Workflow memory** — Intermediate state that persists within a multi-step task but doesn't survive beyond the workflow's completion.
### Optimization terms

**Catastrophic forgetting** — When fine-tuning causes a model to lose capabilities it had before training. The model gets better at the fine-tuned task but worse at tasks it previously handled. PEFT methods like LoRA reduce this risk by freezing original weights.

**Distillation** — Training a smaller (student) model to reproduce the behavior of a larger (teacher) model on a specific task.

**DPO (Direct Preference Optimization)** — A method for preference-based model optimization that's simpler than RLHF, training the model directly on preference pairs without a separate reward model.

**Fine-tuning** — Updating a model's weights on task-specific data to change its behavior permanently. An umbrella term that includes SFT, instruction tuning, RLHF, DPO, and other techniques. See the fine-tuning landscape table in Lesson 8.3 for how these relate.

**Full fine-tuning** — Updating all of a model's parameters during training, as opposed to PEFT methods that update only a small subset. Requires significantly more GPU memory and compute. Produces the most thorough adaptation but carries higher risk of catastrophic forgetting.

**Inference server** — Software (like vLLM or Ollama) that hosts a model and serves inference requests.

**Instruction tuning** — A specific application of SFT where the training data consists of instruction-response pairs. This is how base models become chat models: the technique is SFT, the data format is instructions. Not a separate technique from SFT.

**LoRA (Low-Rank Adaptation)** — A parameter-efficient fine-tuning method that trains small adapter matrices instead of updating all model weights. Dramatically reduces GPU memory and compute requirements.

**Overfitting** — When a model memorizes training examples instead of learning generalizable patterns. The model performs well on training data but poorly on new inputs. Detected by monitoring validation loss alongside training loss.

**Parameter count** — The number of learned weights in a model, commonly expressed in billions (e.g., "7B" = 7 billion parameters). Determines memory requirements (roughly 2 bytes per parameter at FP16) and broadly correlates with capability, though training quality and architecture matter as much as size. See Model Selection and Serving for sizing guidance.

**PEFT (Parameter-Efficient Fine-Tuning)** — A family of methods (including LoRA) that fine-tune a small subset of parameters instead of the full model.

**Preference optimization** — Training methods (RLHF, DPO) that use human or automated preference signals to improve model behavior. "This output is better than that output" rather than "this is the correct output."

**QLoRA (Quantized LoRA)** — LoRA applied to a quantized (compressed) base model. Further reduces memory requirements, enabling fine-tuning on consumer hardware.

**Quantization** — Reducing the precision of model weights (e.g., FP16 → INT4) to shrink memory usage and increase inference speed at some quality cost. A 7B model at FP16 needs ~14 GB VRAM; quantized to 4-bit, it fits in ~4 GB. Common formats include GGUF (llama.cpp/Ollama), GPTQ and AWQ (vLLM/HuggingFace). See Model Selection and Serving for format details and tradeoffs.

**RLHF (Reinforcement Learning from Human Feedback)** — A training method that uses human preference signals to improve model behavior through a reward model. More complex than DPO (requires training a separate reward model) but offers more control over the optimization objective.

**SFT (Supervised Fine-Tuning)** — Fine-tuning using input-output pairs where the desired output is known. The most common fine-tuning approach.

**TRL (Transformer Reinforcement Learning)** — A Hugging Face library for training language models with reinforcement learning, SFT, and other optimization methods.
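The VRAM figures in the quantization and parameter count entries follow from one multiplication: parameters times bytes per parameter. A quick sketch of that arithmetic, ignoring activation and KV-cache overhead:

```python
def model_memory_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate weight memory: parameter count x bytes per parameter."""
    bytes_total = params_billions * 1e9 * (bits_per_param / 8)
    return bytes_total / 1e9

fp16 = model_memory_gb(7, 16)   # 7B at FP16: ~14 GB, matching the entry above
int4 = model_memory_gb(7, 4)    # 7B at 4-bit: ~3.5 GB of weights
```

In practice 4-bit GGUF files land nearer 4 GB than 3.5 GB, because mixed-precision layers and format overhead add a few hundred megabytes on top of the raw weight math.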
### Cross-cutting terms

**Consumer chat app** — The browser or desktop product meant for human conversation (ChatGPT, Claude, HuggingChat). Useful for experimentation, but not the same as API access.

**Developer platform** — The provider's API, billing, API-key, and developer-docs surface. This is what you need for this learning path.

**Hosted API** — The provider runs the model for you and you call it over HTTP.

**Local inference** — You run the model on your own machine.

**Provider** — The company or service that hosts a model API you call from code.

**Prompt caching** — Reusing computation from repeated prompt prefixes to reduce latency and cost on subsequent requests with the same prefix.

**Rate limiting** — Constraints on how many API requests you can make per unit of time. An operational concern that affects system design and cost.

**Token budget** — The maximum number of tokens you allocate for a specific part of the context (e.g., "retrieval evidence gets at most 4K tokens"). A context engineering tool for preventing any single component from dominating the context window.
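Prompt caching, defined above, changes the input side of the cost equation: a repeated prefix is billed at a discount on cache hits. A minimal sketch of that arithmetic; the price and discount numbers are hypothetical, and real providers differ in how they bill cached tokens:

```python
def cached_input_cost(prefix_tokens, suffix_tokens, price_per_m, cache_discount, cache_hit):
    """Input cost when a shared prompt prefix may be served from the cache."""
    prefix_price = price_per_m * cache_discount if cache_hit else price_per_m
    return (prefix_tokens * prefix_price + suffix_tokens * price_per_m) / 1_000_000

# Hypothetical numbers: 4,000-token system prompt + evidence prefix,
# 300-token question, $0.15 per 1M input tokens, cached tokens at 10% of list price.
miss = cached_input_cost(4_000, 300, 0.15, 0.1, cache_hit=False)
hit = cached_input_cost(4_000, 300, 0.15, 0.1, cache_hit=True)
```

The gap between `miss` and `hit` is why cache hit rate is worth measuring: with a large shared prefix, most of the input bill rides on whether the prefix stays stable between requests.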