Telemetry Event Model

Quick-reference for the structured telemetry events instrumented across the RAG pipeline. Each event becomes a span (or generation) in the trace backend.

For the full teaching context, see "Telemetry: Making Your System Visible" and "Cost, Caching and Rate Limits".

The five instrumented events

These are the key points in the pipeline where we create spans. Together, they produce a complete trace for any request.

Event	Span name	What it captures	When it fires
Run start	`rag-pipeline` (trace root)	Question text, retrieval mode, run ID, repo SHA, model name	A request arrives at the pipeline entry point
Route chosen	`routing`	Selected retrieval mode, classification confidence, reasoning, skipped flag	After the retrieval router classifies the question
Tool call	`retrieval`	Question, retrieval mode, list of files retrieved	When any retrieval tool is invoked (vector, lexical, graph, hybrid)
Retrieval return	(end of `retrieval` span)	Snippet count, total token count, file paths of retrieved evidence	When the evidence bundle comes back from retrieval
Response completion	`generate-grounded-answer` (generation)	Model name, input token count (exact), output token count (approximate in run-log, exact in Langfuse when using the OpenAI wrapper), answer text, citation count, duration	After the model generates its final response
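To make the table concrete, here is a minimal sketch of how the five events could map onto nested spans. It uses an in-memory `Recorder` class rather than a real trace backend; `Recorder`, `run_pipeline`, and all hard-coded values are hypothetical stand-ins, not the pipeline's actual code or the Langfuse SDK.

```python
import time
import uuid
from contextlib import contextmanager


class Recorder:
    """In-memory stand-in for a trace backend (hypothetical, for illustration)."""

    def __init__(self):
        self.spans = []

    @contextmanager
    def span(self, name, kind="span", **attributes):
        record = {"name": name, "kind": kind, "attributes": dict(attributes)}
        start = time.perf_counter()
        try:
            # Yield the attribute dict so callers can add fields known only at the end.
            yield record["attributes"]
        finally:
            record["duration_ms"] = (time.perf_counter() - start) * 1000
            self.spans.append(record)  # spans are stored in completion order


def run_pipeline(recorder, question):
    # Run start: the trace root carries request-level context.
    with recorder.span("rag-pipeline", kind="trace", question=question,
                       run_id=str(uuid.uuid4()), model="gpt-4o-mini"):
        # Route chosen: fires after the router classifies the question.
        with recorder.span("routing") as attrs:
            attrs.update(mode="hybrid", confidence=0.72, skipped=False)
        # Tool call opens the span; retrieval return closes it.
        with recorder.span("retrieval", question=question, mode="hybrid") as attrs:
            snippets = ["snippet one", "snippet two"]  # placeholder evidence
            attrs.update(snippet_count=len(snippets))
        # Response completion: recorded as a generation rather than a plain span.
        with recorder.span("generate-grounded-answer", kind="generation") as attrs:
            attrs.update(input=2400, output=380, citation_count=2)
```

Because each span is appended when it closes, the root `rag-pipeline` record lands last in `recorder.spans`. A real backend would track parent and child IDs instead of relying on completion order.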

Additional spans

Span name	What it captures	When it fires
`grounding-check`	Whether evidence is sufficient, reason for insufficiency	After retrieval, before generation
`generate-no-retrieval`	Model, tokens, answer (for skip-mode questions)	When routing decides no retrieval is needed
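As an illustration of what a `grounding-check` span might record, the sketch below returns a sufficiency flag and a reason. The function name, the thresholds, and the whitespace-based token estimate are all assumptions made for this example; the pipeline's actual check may use different criteria.

```python
def grounding_check(snippets, min_snippets=1, min_tokens=50):
    """Return attributes a `grounding-check` span might carry (hypothetical rule)."""
    # Crude token estimate: whitespace-separated words, not a real tokenizer.
    total_tokens = sum(len(s.split()) for s in snippets)
    if len(snippets) < min_snippets:
        return {"sufficient": False, "reason": "no snippets retrieved"}
    if total_tokens < min_tokens:
        return {"sufficient": False,
                "reason": f"only {total_tokens} tokens of evidence"}
    return {"sufficient": True, "reason": None}
```

Recording the reason alongside the flag is what lets a later reader tell "retrieval returned nothing" apart from "retrieval returned too little".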

The three telemetry layers

AI systems need telemetry at three distinct levels. Confusing them leads to blind spots.

Layer	What it measures	Scope	Example metrics
1. Agent traces	What the AI system did for a single request	Per-request	Route chosen, tools called, evidence retrieved, tokens consumed, answer generated
2. Application observability	How the service performs across many requests	Aggregate	Latency p50/p95/p99, error rate, throughput, resource utilization
3. Product outcome events	Whether the system actually helps users	Business	Task completion rate, answer acceptance, escalation to human, user feedback

Most teams start with layer 1, bolt on layer 2, and add layer 3 only once they realize they cannot tell whether "fast and cheap" also means "useful."
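The relationship between the layers can be sketched as a rollup: per-request records (layer 1) are aggregated into latency percentiles and an error rate (layer 2) and an acceptance rate (layer 3). The record fields `duration_ms`, `error`, and `accepted` are hypothetical names chosen for this example.

```python
from statistics import quantiles


def summarize(requests):
    """Aggregate per-request trace records into service and product metrics."""
    durations = [r["duration_ms"] for r in requests]
    # n=100 yields 99 cut points; index 49 is p50, 94 is p95, 98 is p99.
    cuts = quantiles(durations, n=100, method="inclusive")
    return {
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        "error_rate": sum(1 for r in requests if r["error"]) / len(requests),
        "acceptance_rate": sum(1 for r in requests if r["accepted"]) / len(requests),
    }
```

Note that `statistics.quantiles` needs at least two data points on older Python versions, and percentile estimates from small samples are unstable. In practice a metrics backend would compute these over a time window.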

Langfuse-to-OpenTelemetry concept mapping

The concepts are portable even though SDK calls differ.

Langfuse concept	OpenTelemetry equivalent	What it represents
Trace	Trace	One end-to-end request
Span	Span	One step within the request
Generation	Span with `gen_ai.*` attributes	An LLM call specifically
Metadata	Span attributes	Key-value data attached to a span
Score	-- (custom)	Eval result attached to a trace

The OpenTelemetry Semantic Conventions for GenAI define a standard set of `gen_ai.*` attributes for LLM calls, such as `gen_ai.request.model` and `gen_ai.usage.input_tokens`, that many observability tools recognize.
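As a rough illustration of the mapping, the sketch below converts a generation record into a flat attribute dictionary using `gen_ai.*` attribute names. The input dictionary shape and the function name are assumptions for this example. The GenAI conventions are still evolving, so check attribute names against the current specification before relying on them.

```python
def generation_to_otel_attributes(generation):
    """Map a generation record (hypothetical dict shape) to gen_ai.* span attributes."""
    return {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": generation["model"],
        "gen_ai.usage.input_tokens": generation["input"],
        "gen_ai.usage.output_tokens": generation["output"],
    }


attributes = generation_to_otel_attributes(
    {"model": "gpt-4o-mini", "input": 2400, "output": 380}
)
```

With the OpenTelemetry SDK, each key and value in a dictionary like this would typically be set on the span that wraps the LLM call.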

Trace structure example

A typical traced RAG request produces this span hierarchy:

rag-pipeline (trace)
├── routing (span)
│     mode: "hybrid", confidence: 0.72
├── retrieval (span)
│     snippet_count: 4, total_tokens: 1823
├── grounding-check (span)
│     sufficient: true
└── generate-grounded-answer (generation)
      model: "gpt-4o-mini", input: 2400, output: 380, duration_ms: 1240

For skip-mode questions (general knowledge, no retrieval needed):

rag-pipeline (trace)
├── routing (span)
│     mode: "skip", confidence: 0.91
└── generate-no-retrieval (generation)
      model: "gpt-4o-mini", input: 520, output: 280, duration_ms: 640
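The two hierarchies above can be represented as nested dictionaries and rendered back into a text tree, which is handy for debugging without a trace UI. This is a minimal sketch; the `name`, `kind`, and `children` keys and the `render` helper are hypothetical and not part of any SDK.

```python
def render(node, prefix="", is_last=True, is_root=True):
    """Return the lines of a text tree for a nested span dictionary."""
    label = f"{node['name']} ({node['kind']})"
    if is_root:
        lines = [label]
        child_prefix = ""
    else:
        lines = [prefix + ("└── " if is_last else "├── ") + label]
        child_prefix = prefix + ("    " if is_last else "│   ")
    children = node.get("children", [])
    for i, child in enumerate(children):
        lines.extend(render(child, child_prefix, i == len(children) - 1, is_root=False))
    return lines


skip_trace = {
    "name": "rag-pipeline", "kind": "trace",
    "children": [
        {"name": "routing", "kind": "span"},
        {"name": "generate-no-retrieval", "kind": "generation"},
    ],
}
print("\n".join(render(skip_trace)))
```

The same function handles the grounded path: add `retrieval` and `grounding-check` children and replace the final node with `generate-grounded-answer`.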

Key data captured per generation

Field	Source	Purpose
`model`	Pipeline config	Model identification for cost tracking
`input` (tokens)	API response `usage`	Input token count for cost calculation (exact when available from API)
`output` (tokens)	API response `usage`	Output token count (exact via Langfuse OpenAI wrapper; approximate in run-log fallback; the taught implementation records `0` and infers from input)
`total` (tokens)	Computed	Input + output for budget checks (approximate when output is inferred)
`duration_ms`	`time.perf_counter()`	Latency measurement
`citation_count`	Answer post-processing	Grounding quality signal
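A minimal sketch of how these fields could be assembled around a model call follows. It assumes the call returns the answer text plus a usage dictionary with `prompt_tokens` and `completion_tokens` keys, and that citations appear in the answer as bracketed numbers like `[1]`. Both are assumptions for illustration, as are the names `record_generation` and `call_model`.

```python
import re
import time


def record_generation(model, call_model):
    """Time a model call and build the per-generation record."""
    start = time.perf_counter()
    answer, usage = call_model()  # hypothetical callable: returns (text, usage dict)
    duration_ms = (time.perf_counter() - start) * 1000

    input_tokens = usage.get("prompt_tokens", 0)
    output_tokens = usage.get("completion_tokens", 0)  # 0 when the API omits it
    return {
        "model": model,
        "input": input_tokens,
        "output": output_tokens,
        "total": input_tokens + output_tokens,
        "duration_ms": duration_ms,
        # Count distinct cited sources, assuming [n]-style citation markers.
        "citation_count": len(set(re.findall(r"\[(\d+)\]", answer))),
    }
```

`time.perf_counter()` is used rather than `time.time()` because it is monotonic and intended for measuring short intervals.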

What traces reveal that logs don't

Timing relationships: retrieval took 340 ms but generation took 1200 ms, so generation is the bottleneck
Causal structure: the routing span shows why hybrid mode was chosen; the retrieval span shows what evidence came back
Token accounting: how many input tokens were evidence vs. system prompt
Failure isolation: trace backward from a wrong answer to see whether evidence was bad (retrieval problem) or evidence was good but the model ignored it (generation problem)
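Two of these questions can be sketched as small functions over span data: finding the slowest span, and triaging a wrong answer. The span dictionary shape and the triage rule (expected file missing means a retrieval problem; evidence present but uncited means a generation problem) are simplifying assumptions for illustration, not the pipeline's actual logic.

```python
def bottleneck(spans):
    """Return the name of the span with the largest duration."""
    return max(spans, key=lambda s: s["duration_ms"])["name"]


def isolate_failure(retrieved_files, expected_file, citation_count):
    """Rough triage for a wrong answer, working backward through the trace."""
    if expected_file not in retrieved_files:
        return "retrieval"   # the right evidence never reached the model
    if citation_count == 0:
        return "generation"  # evidence was present but the answer did not cite it
    return "inconclusive"    # needs manual inspection of the trace


spans = [
    {"name": "retrieval", "duration_ms": 340},
    {"name": "generate-grounded-answer", "duration_ms": 1200},
]
print(bottleneck(spans))
```

Flat log lines make both checks harder, because the durations and the retrieved file list are not tied to a single request the way spans in one trace are.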