Cost, Caching, and Operational Telemetry
The previous lesson made the pipeline visible. You can trace any request and see what happened. But traces alone don't answer the operational questions that matter in a real system: How much does each answer cost? Are we wasting tokens on evidence the model ignores? Could we serve 80% of requests with a cheaper model? When a user hits a rate limit, do we know about it before they complain?
This lesson adds the cost, caching, and operational layers that turn raw telemetry into actionable metrics. We'll build token budgeting, prompt caching awareness, cost-per-successful-task tracking, and rate-limit telemetry, all instrumented into the traced pipeline. By the end, you'll be able to answer "is this system affordable to operate?" with data instead of hope.
What you'll learn
- Calculate and track cost per request using token counts and provider pricing
- Understand prompt caching: when it works well, when it doesn't, and how to measure cache hit rates
- Build model routing by task complexity to match cost to difficulty
- Set up token budgets that prevent runaway spending
- Add rate-limit telemetry that connects back to the rate limiting concepts from Module 1's security-basics
- Track cost per successful task, not just cost per request
Concepts
Cost per request — the total token cost for one end-to-end pipeline execution. This includes input tokens (system prompt + evidence + question), output tokens (the model's response), and any intermediate calls (routing classification, grounding checks). Provider pricing varies by model, so cost tracking requires knowing both the token count and the per-token price for each model used.
Cost per successful task — a more useful metric than cost per request. Not every request succeeds. Some abstain, some produce wrong answers, some hit rate limits. Cost per successful task divides total spending by the number of requests that actually helped the user. If your system costs $0.01 per request but only 60% of requests succeed, the real cost is $0.017 per successful task. This metric catches the trap of optimizing for cheap requests that aren't useful.
Prompt caching — a provider-level optimization where repeated prompt prefixes are cached and charged at a reduced rate (or not charged at all). Anthropic, OpenAI, and other providers offer variants of this. Caching works well when your system prompt is stable and appears in many requests, where the first request pays full price and subsequent requests with the same prefix get a discount. It works poorly when prompts vary significantly between requests (different evidence bundles, personalized instructions) because each unique prefix is a cache miss.
Cache hit rate — the percentage of requests where the prompt prefix was served from cache rather than processed from scratch. A high cache hit rate (70%+) means your prompt structure is cache-friendly. A low rate means either your prompts vary too much, or your request volume is too low for the cache to stay warm. Tracking this metric tells you whether caching is actually saving money or just adding complexity.
Token budget — a hard limit on the number of tokens a single request can consume, including both input and output. Token budgets prevent runaway costs from pathological queries (questions that trigger large evidence bundles or verbose responses). They're a safety net, not an optimization. The goal is to catch outliers, not to constrain normal operation.
Model routing by task complexity — using a cheaper, faster model for simple tasks and reserving expensive models for complex ones. In our pipeline, skip-mode questions (general knowledge) can use a smaller model, while complex hybrid-retrieval questions might need a larger one. This is a cost optimization that trades a small amount of routing complexity for significant savings on easy questions.
Rate-limit telemetry — structured tracking of rate-limit events: when a request is throttled, which limit was hit (per-user, per-route, provider-side), and what happened to the request (queued, retried, rejected). In Module 1's security-basics lesson, we set up rate limiting as a protective measure. Here we'll instrument it so you can see rate limiting in action and tune thresholds based on real usage patterns.
Problem-to-Tool Map
| Problem class | Symptom | Cheapest thing to try first | Tool or approach |
|---|---|---|---|
| Unknown costs | You don't know what your system costs per request | Manual token estimates | Per-generation cost tracking in traces |
| Wasted tokens | Evidence is retrieved but never cited in the answer | Review traces manually | Token-waste metric comparing evidence tokens to citation count |
| Expensive easy questions | Simple questions use the same expensive model as complex ones | Single model for everything | Model routing by task complexity |
| Cache misses | Prompt caching is enabled but savings are minimal | Check prompt structure | Cache hit rate metric with prefix stability analysis |
| Silent rate limiting | Users hit limits but you don't know until they complain | Wait for complaints | Rate-limit event telemetry with alerts |
| Runaway costs | One pathological query costs 10x the average | No budget enforcement | Token budgets with hard cutoffs |
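One row in this table, the token-waste metric, doesn't get its own walkthrough section below, so here is a minimal sketch of the idea. It assumes trace entries that carry a per-chunk token count and the set of chunk ids the answer actually cited; the field names (id, tokens) are hypothetical, not from our pipeline:

```python
def token_waste(evidence: list[dict], cited_ids: set[str]) -> dict:
    """Fraction of evidence tokens that were retrieved but never cited."""
    total = sum(c["tokens"] for c in evidence)
    used = sum(c["tokens"] for c in evidence if c["id"] in cited_ids)
    wasted = total - used
    return {
        "evidence_tokens": total,
        "cited_tokens": used,
        "wasted_tokens": wasted,
        "waste_pct": round(wasted / total * 100, 1) if total else 0.0,
    }

# Three retrieved chunks; the answer cited only the first.
chunks = [{"id": "a", "tokens": 900}, {"id": "b", "tokens": 700}, {"id": "c", "tokens": 400}]
print(token_waste(chunks, {"a"}))
```

In practice you'd derive cited_ids from the citation markers in the generated answer and average across a benchmark run; a consistently high waste percentage is a signal to retrieve fewer or smaller chunks.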
Walkthrough
Tracking cost per request
Provider pricing changes, so we'll build a simple pricing table that's easy to update. The goal isn't precision to the penny; it's having any cost visibility at all:
# observability/cost_tracker.py
"""Token cost tracking for traced pipeline runs.

Pricing is approximate and will need periodic updates.
Check your provider's pricing page for current rates.
"""

# Prices in USD per 1M tokens — update when pricing changes
# Last updated: 2026-03-25
MODEL_PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00, "cached_input": 1.25},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60, "cached_input": 0.075},
    "gpt-4.1-mini": {"input": 0.40, "output": 1.60, "cached_input": 0.10},
    "gpt-4.1-nano": {"input": 0.10, "output": 0.40, "cached_input": 0.025},
    "claude-sonnet-4-6": {"input": 3.00, "output": 15.00, "cached_input": 0.30},
    "claude-haiku-3-5": {"input": 0.80, "output": 4.00, "cached_input": 0.08},
    "Qwen/Qwen2.5-7B-Instruct": {"input": 0.27, "output": 0.27, "cached_input": 0.27},
    "gpt-oss:20b": {"input": 0.70, "output": 0.70, "cached_input": 0.70},
}

# Fallback for unknown models. Local Ollama models do not have a hosted
# per-token price, so track them separately using latency, GPU time, or
# cloud instance cost instead of forcing a fake token rate.
DEFAULT_PRICING = {"input": 1.00, "output": 3.00, "cached_input": 0.50}


def estimate_cost(
    model: str,
    input_tokens: int,
    output_tokens: int,
    cached_input_tokens: int = 0,
) -> dict:
    """Estimate the cost of a single LLM call.

    Returns a dict with input_cost, output_cost, cache_savings,
    and total_cost in USD.
    """
    pricing = MODEL_PRICING.get(model, DEFAULT_PRICING)

    # Non-cached input tokens
    non_cached_input = input_tokens - cached_input_tokens
    input_cost = (non_cached_input / 1_000_000) * pricing["input"]
    cached_cost = (cached_input_tokens / 1_000_000) * pricing["cached_input"]
    output_cost = (output_tokens / 1_000_000) * pricing["output"]

    # What caching saved
    full_input_cost = (input_tokens / 1_000_000) * pricing["input"]
    cache_savings = full_input_cost - (input_cost + cached_cost)

    return {
        "input_cost": round(input_cost + cached_cost, 6),
        "output_cost": round(output_cost, 6),
        "cache_savings": round(cache_savings, 6),
        "total_cost": round(input_cost + cached_cost + output_cost, 6),
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cached_input_tokens": cached_input_tokens,
    }


if __name__ == "__main__":
    # Example: a typical RAG request on gpt-4o-mini
    cost = estimate_cost(
        model="gpt-4o-mini",
        input_tokens=3500,  # system prompt + evidence + question
        output_tokens=400,
        cached_input_tokens=800,  # system prompt was cached
    )
    print("Cost estimate for one RAG request:")
    for k, v in cost.items():
        if isinstance(v, float):
            print(f"  {k}: ${v:.6f}")
        else:
            print(f"  {k}: {v}")

python observability/cost_tracker.py

Expected output:
Cost estimate for one RAG request:
  input_cost: $0.000465
  output_cost: $0.000240
  cache_savings: $0.000060
  total_cost: $0.000705
  model: gpt-4o-mini
  input_tokens: 3500
  output_tokens: 400
  cached_input_tokens: 800
Prompt caching: when it works and when it doesn't
Prompt caching reduces input token costs by caching the stable prefix of your prompt, the part that doesn't change between requests. Here's how to think about it:
Caching works well when:
- Your system prompt is stable across requests (same instructions, same format)
- Evidence bundles have common prefixes (e.g., the same repository README appears in many requests)
- Request volume is high enough to keep the cache warm (provider caches expire after inactivity)
- You're using a provider that supports caching (Anthropic's prompt caching, OpenAI's cached input pricing)
Caching works poorly when:
- Every request has a unique prompt (different evidence, different instructions)
- Request volume is low (cache expires between requests)
- You're frequently updating system prompts during development
- Evidence bundles vary significantly between question types
The practical implication for our pipeline: the system prompt and grounding instructions are stable, so they'll cache well. The evidence bundle changes every request, so it won't cache. This means caching saves money on the fixed portion of the prompt, but the variable portion (which is usually the largest part in RAG) won't benefit.
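You can put a number on that intuition before enabling caching at all: the stable prefix is the most you can ever cache. A sketch, using illustrative token counts (the 800/2700 split is an assumption, not measured from our pipeline):

```python
def max_cache_hit_rate(stable_tokens: int, variable_tokens: int) -> float:
    """Upper bound on the cache hit rate: only the stable prefix can be cached."""
    total = stable_tokens + variable_tokens
    return stable_tokens / total if total else 0.0

# Stable prefix: system prompt + grounding instructions (~800 tokens).
# Variable suffix: evidence bundle + question (~2700 tokens).
ceiling = max_cache_hit_rate(800, 2700)
print(f"theoretical max cache hit rate: {ceiling:.1%}")  # 22.9%
```

If the ceiling is low, caching will never save much no matter how warm the cache stays, which is worth knowing before you restructure prompts to chase it.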
# observability/cache_metrics.py
"""Track prompt caching effectiveness across runs."""


def analyze_cache_rates(run_traces: list[dict]) -> dict:
    """Analyze cache hit rates from traced generation spans.

    Expects a list of trace dicts with 'cached_input_tokens'
    and 'input_tokens' fields from the generation spans.
    """
    total_input_tokens = 0
    total_cached_tokens = 0
    generations = 0

    for trace in run_traces:
        if "input_tokens" in trace and "cached_input_tokens" in trace:
            total_input_tokens += trace["input_tokens"]
            total_cached_tokens += trace["cached_input_tokens"]
            generations += 1

    if generations == 0:
        return {"cache_hit_rate": 0, "generations": 0}

    hit_rate = total_cached_tokens / total_input_tokens if total_input_tokens > 0 else 0
    return {
        "cache_hit_rate": round(hit_rate * 100, 1),
        "total_input_tokens": total_input_tokens,
        "total_cached_tokens": total_cached_tokens,
        "generations": generations,
        "estimated_savings_pct": round(hit_rate * 50, 1),  # rough: cached input is ~50% cheaper
    }


if __name__ == "__main__":
    # Simulated data from a traced benchmark run
    sample_traces = [
        {"input_tokens": 3500, "cached_input_tokens": 800},
        {"input_tokens": 4200, "cached_input_tokens": 800},
        {"input_tokens": 2100, "cached_input_tokens": 800},
        {"input_tokens": 3800, "cached_input_tokens": 800},
        {"input_tokens": 500, "cached_input_tokens": 200},  # skip-mode, smaller prompt
    ]
    metrics = analyze_cache_rates(sample_traces)
    print("Cache analysis:")
    for k, v in metrics.items():
        print(f"  {k}: {v}")

python observability/cache_metrics.py

Expected output:
Cache analysis:
  cache_hit_rate: 24.1
  total_input_tokens: 14100
  total_cached_tokens: 3400
  generations: 5
  estimated_savings_pct: 12.1
A 24% cache hit rate tells you that about a quarter of your input tokens are coming from cached prefixes, likely the system prompt. To improve this, you'd need to stabilize more of the prompt prefix (e.g., by putting frequently-used context before the variable evidence). But there's a tradeoff: restructuring prompts for cacheability can hurt clarity. I've found it's usually better to optimize prompt structure for quality first and accept whatever caching you get naturally.
Model routing by task complexity
Not every question needs the same model. Skip-mode questions (general knowledge) are simple, and a smaller, cheaper model handles them fine. Complex hybrid-retrieval questions with nuanced evidence might benefit from a larger model. Model routing matches cost to difficulty:
# observability/model_router.py
"""Route questions to appropriate models based on task complexity.

This extends the retrieval router from Module 5 with a model
selection step, so simple questions use cheap models and complex
questions use capable ones.
"""
import sys

sys.path.insert(0, ".")

from rag.retrieval_service import (
    classify_question, RetrievalPolicy, RetrievalMode,
)

# Model tiers — adjust based on your provider and budget
MODEL_TIERS = {
    "simple": "gpt-4.1-nano",
    "standard": "gpt-4o-mini",
    "complex": "gpt-4o",
}


def select_model(
    question: str,
    retrieval_mode: RetrievalMode,
    confidence: float,
) -> str:
    """Select a model tier based on question complexity signals.

    Simple questions (skip mode, high confidence) get cheap models.
    Complex questions (hybrid mode, low confidence) get capable models.
    """
    if retrieval_mode == RetrievalMode.SKIP:
        return MODEL_TIERS["simple"]
    if retrieval_mode == RetrievalMode.HYBRID and confidence < 0.5:
        # Ambiguous question that needed hybrid retrieval —
        # the answer will require careful reasoning over mixed evidence
        return MODEL_TIERS["complex"]
    return MODEL_TIERS["standard"]


if __name__ == "__main__":
    policy = RetrievalPolicy()
    test_questions = [
        "What is a Python list?",
        "What does validate_path return?",
        "What functions call read_file and how does the caching interact with the auth module?",
    ]
    for q in test_questions:
        classification = classify_question(q, policy)
        model = select_model(q, classification.mode, classification.confidence)
        print(f"Q: {q[:60]}...")
        print(f"  Route: {classification.mode.value} (confidence: {classification.confidence})")
        print(f"  Model: {model}")
        print()

python observability/model_router.py

Model routing is a cost optimization, not a quality optimization. The goal is to avoid paying for capability you don't need, not to find the "best" model for each question. Start with a single model for everything (as we've been doing), measure costs, and only add model routing if the cost savings justify the added complexity.
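One way to sanity-check whether routing is worth the complexity is to estimate the blended per-request cost from your traffic mix before writing any routing code. The per-request costs and traffic fractions below are made-up placeholders; substitute your measured numbers:

```python
# Hypothetical average cost per request (USD) for each tier, and the
# fraction of traffic each tier would serve — both are placeholder numbers.
COST = {"gpt-4.1-nano": 0.0001, "gpt-4o-mini": 0.0008, "gpt-4o": 0.006}
MIX = {"gpt-4.1-nano": 0.4, "gpt-4o-mini": 0.5, "gpt-4o": 0.1}

blended = sum(COST[m] * share for m, share in MIX.items())
everything_on_big = COST["gpt-4o"]  # baseline: every request on the capable model

print(f"blended cost/request: ${blended:.6f}")
print(f"all-gpt-4o baseline:  ${everything_on_big:.6f}")
print(f"savings from routing: {1 - blended / everything_on_big:.0%}")  # 83%
```

Note the comparison is against running everything on the model your hardest questions need; compared to running everything on a mid-tier model, routing can actually cost more, because the complex tier is expensive.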
Token budgets
Token budgets are a safety net against pathological queries. A question that triggers a massive evidence bundle or an unusually verbose response can cost 10-50x the average. Budgets catch these outliers:
# observability/token_budget.py
"""Token budget enforcement for the RAG pipeline.

Prevents runaway costs from pathological queries by enforcing
hard limits on input and output tokens.
"""
from dataclasses import dataclass


@dataclass
class TokenBudget:
    """Per-request token budget configuration."""
    max_input_tokens: int = 8000    # evidence + system prompt + question
    max_output_tokens: int = 2000   # model response
    max_total_tokens: int = 10000   # combined limit
    warn_threshold: float = 0.8     # log a warning at 80% of budget

    def check(self, input_tokens: int, output_tokens: int = 0) -> dict:
        """Check whether a request is within budget.

        Returns a dict with 'allowed', 'warnings', and usage details.
        """
        total = input_tokens + output_tokens
        warnings = []

        if input_tokens > self.max_input_tokens:
            return {
                "allowed": False,
                "reason": f"Input tokens ({input_tokens}) exceed budget ({self.max_input_tokens})",
                "warnings": warnings,
            }
        if total > self.max_total_tokens:
            return {
                "allowed": False,
                "reason": f"Total tokens ({total}) exceed budget ({self.max_total_tokens})",
                "warnings": warnings,
            }

        # Warnings for approaching limits
        if input_tokens > self.max_input_tokens * self.warn_threshold:
            warnings.append(
                f"Input tokens ({input_tokens}) at "
                f"{input_tokens / self.max_input_tokens:.0%} of budget"
            )
        if total > self.max_total_tokens * self.warn_threshold:
            warnings.append(
                f"Total tokens ({total}) at "
                f"{total / self.max_total_tokens:.0%} of budget"
            )

        return {
            "allowed": True,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "total_tokens": total,
            "input_pct": round(input_tokens / self.max_input_tokens * 100, 1),
            "total_pct": round(total / self.max_total_tokens * 100, 1),
            "warnings": warnings,
        }


if __name__ == "__main__":
    budget = TokenBudget()
    test_cases = [
        (3500, 400, "Normal RAG request"),
        (7500, 500, "Large evidence bundle"),
        (9000, 1500, "Pathological query"),
    ]
    for input_t, output_t, label in test_cases:
        result = budget.check(input_t, output_t)
        status = "ALLOWED" if result["allowed"] else "BLOCKED"
        print(f"{label}: {status}")
        if result.get("warnings"):
            for w in result["warnings"]:
                print(f"  WARNING: {w}")
        if not result["allowed"]:
            print(f"  REASON: {result['reason']}")
        print()

python observability/token_budget.py

Expected output:
Normal RAG request: ALLOWED

Large evidence bundle: ALLOWED
  WARNING: Input tokens (7500) at 94% of budget

Pathological query: BLOCKED
  REASON: Input tokens (9000) exceed budget (8000)
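Enforcing max_input_tokens before the call requires a token count you don't have yet, since the provider only reports usage after the request. A rough heuristic (about four characters per token for English text) is enough for budget screening; use your provider's tokenizer when you need accuracy. The helper below is a sketch, not part of TokenBudget:

```python
def rough_token_count(text: str) -> int:
    """Crude pre-call token estimate: ~4 characters per token for English.
    Good enough to reject obviously over-budget prompts; not billing-accurate."""
    return max(1, len(text) // 4)

# A suspiciously large evidence bundle — the kind a budget should catch.
evidence = "def validate_path(path): ...\n" * 300
prompt = "You are a grounded assistant.\n" + evidence + "What does validate_path return?"
print(f"estimated input tokens: {rough_token_count(prompt)}")
```

Pass this estimate to budget.check() before the model call; the provider's reported usage can then confirm or correct it after the fact.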
Rate-limit telemetry
In Module 1's security-basics lesson, we set up rate limiting to protect the system. Here we'll add telemetry to those limits so you can see when they fire and whether the thresholds are right:
# observability/rate_limit_telemetry.py
"""Rate-limit event tracking.

Instruments rate-limit decisions so you can see throttling
in your traces and tune thresholds from real usage data.
"""
import time
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class RateLimitEvent:
    """A single rate-limit event for telemetry."""
    timestamp: str
    user_id: str
    route: str
    action: str  # "allowed", "throttled", "rejected"
    current_count: int
    limit: int
    window_seconds: int


@dataclass
class RateLimitTracker:
    """Track rate-limit events for telemetry and analysis.

    This wraps your existing rate limiter (from Module 1) and
    records every decision as a telemetry event.
    """
    events: list = field(default_factory=list)
    counters: dict = field(default_factory=lambda: defaultdict(list))
    # Configurable limits per route
    limits: dict = field(default_factory=lambda: {
        "rag-pipeline": {"max_requests": 30, "window_seconds": 60},
        "benchmark-run": {"max_requests": 100, "window_seconds": 300},
        "default": {"max_requests": 60, "window_seconds": 60},
    })

    def check(self, user_id: str, route: str = "default") -> RateLimitEvent:
        """Check whether a request is within rate limits.

        Records the decision as a telemetry event regardless of outcome.
        """
        now = time.time()
        config = self.limits.get(route, self.limits["default"])
        window = config["window_seconds"]
        max_req = config["max_requests"]

        # Sliding window: count requests in the last N seconds
        key = f"{user_id}:{route}"
        self.counters[key] = [
            t for t in self.counters[key] if now - t < window
        ]
        current = len(self.counters[key])

        if current >= max_req:
            action = "throttled"
        else:
            action = "allowed"
            self.counters[key].append(now)

        event = RateLimitEvent(
            timestamp=datetime.now(timezone.utc).isoformat(),
            user_id=user_id,
            route=route,
            action=action,
            current_count=current,
            limit=max_req,
            window_seconds=window,
        )
        self.events.append(event)
        return event

    def summary(self) -> dict:
        """Summarize rate-limit events for a telemetry report."""
        total = len(self.events)
        throttled = sum(1 for e in self.events if e.action == "throttled")
        by_route = defaultdict(lambda: {"total": 0, "throttled": 0})
        for e in self.events:
            by_route[e.route]["total"] += 1
            if e.action == "throttled":
                by_route[e.route]["throttled"] += 1
        return {
            "total_events": total,
            "throttled_events": throttled,
            "throttle_rate": round(throttled / total * 100, 1) if total > 0 else 0,
            "by_route": dict(by_route),
        }


if __name__ == "__main__":
    tracker = RateLimitTracker(
        limits={
            "rag-pipeline": {"max_requests": 5, "window_seconds": 10},
            "default": {"max_requests": 10, "window_seconds": 10},
        }
    )
    # Simulate a burst of requests
    print("Simulating 8 rapid requests to rag-pipeline:\n")
    for i in range(8):
        event = tracker.check("user-123", "rag-pipeline")
        print(f"  Request {i + 1}: {event.action} ({event.current_count}/{event.limit})")
    print()
    summary = tracker.summary()
    print("Summary:")
    print(f"  Total events: {summary['total_events']}")
    print(f"  Throttled: {summary['throttled_events']}")
    print(f"  Throttle rate: {summary['throttle_rate']}%")

python observability/rate_limit_telemetry.py

Expected output:
Simulating 8 rapid requests to rag-pipeline:

  Request 1: allowed (0/5)
  Request 2: allowed (1/5)
  Request 3: allowed (2/5)
  Request 4: allowed (3/5)
  Request 5: allowed (4/5)
  Request 6: throttled (5/5)
  Request 7: throttled (5/5)
  Request 8: throttled (5/5)

Summary:
  Total events: 8
  Throttled: 3
  Throttle rate: 37.5%
The throttle rate tells you whether your limits are too tight (high throttle rate during normal use) or too loose (zero throttling even during bursts). In production, you'd feed these events into your Langfuse traces so you can see rate limiting alongside the request traces.
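What should a caller do with a "throttled" event? A common answer is retry with exponential backoff rather than failing the request immediately. The sketch below only computes the schedule; wiring it into RateLimitTracker (sleeping, re-checking the limiter before each attempt, and recording a final "rejected" event on exhaustion) is an integration assumption, not something the tracker above does:

```python
def backoff_schedule(base: float = 1.0, factor: float = 2.0, retries: int = 4) -> list[float]:
    """Delays in seconds to wait between retries of a throttled request."""
    return [base * factor ** i for i in range(retries)]

print(backoff_schedule())  # [1.0, 2.0, 4.0, 8.0]
```

Each retry attempt should also be recorded as an event, so the summary distinguishes requests that eventually succeeded from requests that were rejected outright.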
Cost per successful task
This is the metric that actually matters. Cost per request is easy to calculate but misleading because it treats failed requests the same as successful ones. Cost per successful task gives you the true unit economics:
# observability/success_cost.py
"""Calculate cost per successful task from a graded run log.

Combines the cost tracker with graded run results to give you
the metric that actually matters for operational decisions.
"""
import json
import sys

sys.path.insert(0, ".")

from observability.cost_tracker import estimate_cost


def cost_per_success(run_file: str, model: str = "gpt-4o-mini") -> dict:
    """Calculate cost per successful task from a graded run log."""
    entries = []
    with open(run_file) as f:
        for line in f:
            if line.strip():
                entries.append(json.loads(line))

    graded = [e for e in entries if e.get("grade") is not None]
    if not graded:
        return {"error": "No graded entries found"}

    # Estimate costs (using average token counts since run logs
    # don't yet have per-request token data — that's coming in
    # the harness lesson)
    total_cost = 0.0
    for e in graded:
        cost = estimate_cost(
            model=model,
            input_tokens=3500,  # average estimate
            output_tokens=400,  # average estimate
            cached_input_tokens=800,  # assume the system prompt prefix is cached
        )
        total_cost += cost["total_cost"]

    successful = [
        e for e in graded
        if e["grade"] in ("fully_correct", "partially_correct")
    ]

    cost_per_request = total_cost / len(graded)
    cost_per_success_val = total_cost / len(successful) if successful else float("inf")

    return {
        "total_requests": len(graded),
        "successful_requests": len(successful),
        "success_rate": round(len(successful) / len(graded) * 100, 1),
        "total_cost": round(total_cost, 4),
        "cost_per_request": round(cost_per_request, 6),
        "cost_per_successful_task": round(cost_per_success_val, 6),
        "cost_overhead_from_failures": round(
            (cost_per_success_val - cost_per_request) / cost_per_request * 100, 1
        ) if successful else None,
    }


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python observability/success_cost.py <graded-run-file.jsonl>")
        print("Example: python observability/success_cost.py harness/runs/baseline-2026-03-24-graded.jsonl")
        sys.exit(1)
    metrics = cost_per_success(sys.argv[1])
    print("Cost per successful task:")
    for k, v in metrics.items():
        if isinstance(v, float):
            if "overhead" in k or "rate" in k:
                print(f"  {k}: {v}%")
            elif "cost" in k:
                print(f"  {k}: ${v}")
            else:
                print(f"  {k}: {v}")
        else:
            print(f"  {k}: {v}")

python observability/success_cost.py harness/runs/baseline-2026-03-24-143022-graded.jsonl

Expected output (based on the typical baseline from Module 2):
Cost per successful task:
  total_requests: 30
  successful_requests: 11
  success_rate: 36.7%
  total_cost: $0.0212
  cost_per_request: $0.000705
  cost_per_successful_task: $0.001923
  cost_overhead_from_failures: 172.7%
That's roughly a 173% overhead, meaning failures nearly triple your effective cost. This is why improving accuracy is a cost concern just as much as it is a quality concern. Every failed request is money spent with no return. We'll see this metric improve as we add retrieval and better grading in the eval lessons.
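Before investing in accuracy work, you can project what it's worth in cost terms. A sketch using the baseline numbers above; the 20-point improvement is hypothetical:

```python
def effective_cost(cost_per_request: float, success_rate: float) -> float:
    """Cost of one useful answer: per-request cost divided by success rate."""
    return cost_per_request / success_rate

current = effective_cost(0.000705, 0.367)   # baseline: 36.7% success
improved = effective_cost(0.000705, 0.567)  # hypothetical +20 percentage points
print(f"current:  ${current:.6f} per successful task")
print(f"improved: ${improved:.6f} per successful task")
print(f"cost reduction: {1 - improved / current:.0%}")  # 35%
```

A 20-point accuracy gain cuts the effective cost by about a third without touching the per-request price, which is exactly why this metric belongs next to the cost optimizations.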
Exercises
- Run observability/cost_tracker.py with your actual model and typical token counts. Compare costs across at least two models (e.g., gpt-4o-mini vs. gpt-4o, or claude-haiku-3-5 vs. claude-sonnet-4-6). What's the cost ratio between the cheapest and most expensive option?
- Analyze prompt caching potential for your pipeline. Count the stable tokens (system prompt, grounding instructions) vs. variable tokens (evidence, question) in a typical request. What's the theoretical maximum cache hit rate?
- Add cost tracking to your traced pipeline from the previous lesson. After each generation span, calculate the cost and attach it as metadata. Run 10 benchmark questions and check the per-question cost distribution in Langfuse.
- Calculate cost per successful task for your latest graded benchmark run using observability/success_cost.py. Then estimate what the metric would be if accuracy improved by 20 percentage points.
- Set up the rate-limit tracker and simulate your benchmark run's request pattern. Are the default limits appropriate, or would a benchmark run get throttled?
Completion checkpoint
You should now have:
- A cost tracking module that estimates per-request cost from token counts and model pricing
- An understanding of prompt caching: when it helps, when it doesn't, and how to measure cache hit rates
- A model routing strategy that matches model cost to task complexity
- Token budgets that catch pathological queries before they become expensive
- Rate-limit telemetry that records throttling decisions as structured events
- Cost per successful task calculated for at least one graded benchmark run
Reflection prompts
- What surprised you about the cost breakdown? Was the model (input vs. output) or the evidence (retrieval tokens) the bigger factor?
- If you had to cut costs by 50% without reducing accuracy, which optimization would you try first: model routing, prompt caching, or retrieval skipping?
- How does the cost per successful task metric change your thinking about accuracy improvements compared to cost optimizations?
Connecting to the project
We now have two layers of operational telemetry: traces that show what happened in each request, and cost metrics that show what each request cost. Together, they answer the question "is this system affordable and visible enough to operate?"
But we've been building these pieces (the benchmark, the run logs, the telemetry, the cost tracking) as separate scripts. The next lesson pulls them together into a single harness: one command to run the benchmark, trace every question, calculate costs, and produce a structured run log ready for grading. That harness will become the foundation for every evaluation we build afterward.
What's next
Building Your AI Harness. You have the pieces of an experiment system now, scattered across modules; the next lesson assembles them into one repeatable loop.
References
Start here
- Anthropic: Prompt caching — Anthropic's prompt caching documentation, including when caching activates and pricing discounts
Build with this
- OpenAI: Prompt caching FAQ — OpenAI's approach to prompt caching with details on cache behavior and pricing
- Langfuse: Cost tracking — how Langfuse calculates and displays per-generation costs from token usage
Deep dive
- Anthropic: Rate limits — rate-limit behavior, headers, and retry strategies for the Anthropic API
- Cost and Reliability Patterns — the full decision framework for prompt caching, rate limiting, model routing, token budgeting, and circuit breakers