Tool-Use and Trace Evals
The previous lesson evaluated outputs: did retrieval find the right evidence, and was the answer correct? This lesson evaluates behavior: did the system call the right tools, and did the overall execution path make sense?
These eval families catch a different class of failure. The answer might be correct, but the system got there by calling three unnecessary tools, retrieving from the wrong index, and burning 5x the normal token budget. Or the answer might be wrong, and the retrieval eval says "evidence was fine," which means the failure happened somewhere in the execution path between retrieval and generation. Tool-use evals and trace evals find these behavioral failures that output evals can't see.
What you'll learn
- Build tool-use evals that check whether the right tools were called with the right arguments
- Apply a trace labeling taxonomy to classify execution paths as optimal, wasteful, or broken
- Label traces with specific failure modes: wrong route, missed retrieval, unnecessary tool call, and others
- Build CI-friendly eval patterns that run automatically and flag regressions
- Connect all four eval families into the harness for a complete evaluation pass
Concepts
Tool-use eval — an evaluation of whether the system called the right tools with the right arguments and avoided calling tools it didn't need. Tool-use evals sit between retrieval evals and trace evals in the evaluation hierarchy. A retrieval eval asks "did we find the right evidence?" A tool-use eval asks "did we call the right functions to find that evidence?" The distinction matters because a correct answer can come from an incorrect tool path — the system might have called search_code when read_file would have been cheaper, or called both when only one was needed.
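The distinction is easy to make concrete. A minimal sketch of the precision/recall arithmetic over tool names (the specific tools and values here are illustrative):

```python
# Minimal sketch: tool precision and recall for a single request,
# computed with set arithmetic over tool names.
expected = {"search_code"}                # tools the benchmark expects
actual = {"search_code", "search_docs"}   # tools the system actually called

correct = expected & actual
precision = len(correct) / len(actual)    # fraction of calls that were needed
recall = len(correct) / len(expected)     # fraction of needed tools that were called

print(precision, recall)  # 0.5 1.0 — one wasted call, nothing missed
```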
Trace eval — an evaluation of the full execution path for a single request, from routing decision through tool calls through generation. Trace evals are the most holistic eval family because they evaluate the system's strategy, not just its output or individual steps. A trace eval might flag a request where the system used hybrid retrieval (expensive) for a question that should have been code-only (cheap), even though the final answer was correct.
Trace labeling taxonomy — a structured set of labels for classifying trace-level failures. Where the answer grading rubric has four grades (fully correct, partially correct, unsupported, wrong), the trace taxonomy has labels for how the system behaved:
- wrong_route — the router picked the wrong retrieval mode (e.g., docs mode for a code question)
- missed_retrieval — the system should have retrieved but didn't, or retrieved from the wrong source
- bad_citation — the answer claims to cite evidence but the citation is incorrect or fabricated
- unnecessary_tool_call — the system called a tool that wasn't needed for this question
- correct_but_wasteful — the answer is right, but the execution path used more resources than necessary
- correct_and_optimal — the answer is right and the path was efficient
This taxonomy turns "the system feels slow" into "23% of traces are correct_but_wasteful, and the waste comes from unnecessary hybrid retrieval." That's actionable.
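As a concrete illustration, here's what one labeled trace might look like. The field names mirror the ones used later in this lesson; the values are hypothetical:

```python
# Hypothetical labeled trace: the answer was correct, but the router
# chose hybrid retrieval where code-only would have sufficed.
trace = {
    "question_id": "q007",
    "grade": "fully_correct",
    "expected_route": "code",
    "retrieval_method": "hybrid",
    "trace_labels": ["wrong_route", "correct_but_wasteful"],
    "primary_trace_label": "wrong_route",
}

# A trace can carry multiple labels; the primary label drives triage.
assert trace["primary_trace_label"] in trace["trace_labels"]
```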
CI-friendly eval — an evaluation that can run in a continuous integration pipeline and produce a pass/fail result. CI-friendly evals are fast (minutes, not hours), deterministic enough to avoid flaky failures, and produce structured output that CI systems can parse. They're the eval equivalent of unit tests: not comprehensive, but fast enough to run on every commit.
Problem-to-Tool Map
| Problem class | Symptom | Cheapest thing to try first | Tool or approach |
|---|---|---|---|
| Unnecessary tool calls | System calls tools for questions it could answer directly | Review traces manually | Tool-call count eval with expected tool lists |
| Wrong tool arguments | Tools are called correctly but with suboptimal arguments | Read tool call logs | Argument validation checks |
| Wasteful execution paths | Correct answers that cost 5x the average | Sort by cost and inspect | Trace labeling with the wasteful taxonomy |
| Routing regressions | Pipeline change broke routing for a class of questions | Re-run benchmark | CI-friendly routing accuracy check |
| Invisible path failures | The answer is fine but the execution path was fragile | No trace-level evals | Trace labeling pass on benchmark runs |
Walkthrough
Tool-use evals
Tool-use evals check three things:
- Were the right tools called? For a code lookup question, we expect search_code or read_file, not search_docs.
- Were the arguments correct? If search_code was called, was the query reasonable?
- Were unnecessary tools avoided? For a general knowledge question (skip mode), no tools should be called at all.
To make this work, we'll add expected tool information to benchmark questions:
{"id": "q001", "question": "What does validate_path return?", "expected_tools": ["search_code"], "expected_route": "code"}
{"id": "q003", "question": "What is a Python list?", "expected_tools": [], "expected_route": "skip"}
{"id": "q004", "question": "What calls read_file and why?", "expected_tools": ["search_code", "search_docs"], "expected_route": "hybrid"}

Then build the grader:
# harness/graders/tool_grader.py
"""Tool-use evaluation.
Checks whether the system called the expected tools and avoided
unnecessary ones. Works with the tools_called field in run logs.
"""
import json
import sys
from collections import Counter
def grade_tool_use(entry: dict, benchmark_question: dict) -> dict:
"""Grade tool use for a single entry.
Returns metrics for tool precision, recall, and waste.
"""
expected_tools = set(benchmark_question.get("expected_tools", []))
expected_route = benchmark_question.get("expected_route")
actual_tools = set(entry.get("tools_called", []))
actual_route = entry.get("retrieval_method", "unknown")
# Tool-level metrics
if expected_tools or actual_tools:
correct_tools = expected_tools & actual_tools
unnecessary = actual_tools - expected_tools
missing = expected_tools - actual_tools
tool_precision = (
len(correct_tools) / len(actual_tools)
if actual_tools else (1.0 if not expected_tools else 0.0)
)
tool_recall = (
len(correct_tools) / len(expected_tools)
if expected_tools else (1.0 if not actual_tools else 0.0)
)
else:
# No tools expected, no tools called — correct
tool_precision = 1.0
tool_recall = 1.0
unnecessary = set()
missing = set()
# Route accuracy
route_correct = None
if expected_route:
route_correct = expected_route == actual_route
return {
"tool_precision": round(tool_precision, 3),
"tool_recall": round(tool_recall, 3),
"unnecessary_tools": list(unnecessary),
"missing_tools": list(missing),
"route_correct": route_correct,
"expected_route": expected_route,
"actual_route": actual_route,
}
def grade_run_tools(
run_file: str,
benchmark_file: str = "benchmark-questions.jsonl",
) -> list[dict]:
"""Grade tool use for an entire run."""
benchmark = {}
with open(benchmark_file) as f:
for line in f:
if line.strip():
q = json.loads(line)
benchmark[q["id"]] = q
entries = []
with open(run_file) as f:
for line in f:
if line.strip():
entries.append(json.loads(line))
results = []
for entry in entries:
q_id = entry["question_id"]
if q_id in benchmark:
bq = benchmark[q_id]
if bq.get("expected_tools") is not None or bq.get("expected_route"):
grade = grade_tool_use(entry, bq)
grade["question_id"] = q_id
grade["question"] = entry["question"][:60]
results.append(grade)
return results
if __name__ == "__main__":
if len(sys.argv) < 2:
print("Usage: python harness/graders/tool_grader.py <run-file.jsonl>")
sys.exit(1)
results = grade_run_tools(sys.argv[1])
if not results:
print("No questions with expected_tools or expected_route found.")
print("Add these fields to your benchmark questions to enable tool-use evals.")
sys.exit(0)
# Summary
precisions = [r["tool_precision"] for r in results]
recalls = [r["tool_recall"] for r in results]
route_checks = [r for r in results if r["route_correct"] is not None]
print(f"Tool-use eval: {len(results)} questions\n")
print(f" Avg tool precision: {sum(precisions)/len(precisions):.1%}")
print(f" Avg tool recall: {sum(recalls)/len(recalls):.1%}")
if route_checks:
route_accuracy = sum(1 for r in route_checks if r["route_correct"]) / len(route_checks)
print(f" Route accuracy: {route_accuracy:.1%}")
# Show unnecessary tool calls
wasteful = [r for r in results if r["unnecessary_tools"]]
if wasteful:
print(f"\nUnnecessary tool calls ({len(wasteful)} questions):")
for r in wasteful:
print(f" {r['question']}")
for t in r["unnecessary_tools"]:
print(f" UNNECESSARY: {t}")
# Show missing tools
missing = [r for r in results if r["missing_tools"]]
if missing:
print(f"\nMissing tool calls ({len(missing)} questions):")
for r in missing:
print(f" {r['question']}")
for t in r["missing_tools"]:
                print(f"      MISSING: {t}")

python harness/graders/tool_grader.py harness/runs/harness-2026-03-25-142233.jsonl

Trace labeling
Trace labeling applies the taxonomy to each trace in a run. Unlike the tool grader, which checks individual steps, trace labeling evaluates the full execution path: the combination of routing, retrieval, tool calls, and generation:
# harness/graders/trace_labeler.py
"""Trace-level evaluation using the trace labeling taxonomy.
Labels each trace as:
- correct_and_optimal: right answer, efficient path
- correct_but_wasteful: right answer, inefficient path
- wrong_route: routing error that affected the outcome
- missed_retrieval: retrieval should have happened but didn't
- bad_citation: answer cites evidence that doesn't exist or is wrong
- unnecessary_tool_call: tools called that weren't needed
"""
import json
import sys
# Trace labeling taxonomy
TRACE_LABELS = [
"correct_and_optimal",
"correct_but_wasteful",
"wrong_route",
"missed_retrieval",
"bad_citation",
"unnecessary_tool_call",
]
def label_trace(entry: dict, benchmark_question: dict) -> dict:
"""Apply trace labels to a single entry.
Uses the grade, tool use, and route information to classify
the overall execution path.
"""
grade = entry.get("grade", "unknown")
failure_label = entry.get("failure_label")
expected_route = benchmark_question.get("expected_route")
actual_route = entry.get("retrieval_method", "unknown")
expected_tools = set(benchmark_question.get("expected_tools", []))
actual_tools = set(entry.get("tools_called", []))
duration = entry.get("duration_seconds", 0)
context_tokens = entry.get("context_tokens", 0)
labels = []
reasoning = []
# Check routing
if expected_route and expected_route != actual_route:
labels.append("wrong_route")
reasoning.append(
f"Expected route '{expected_route}', got '{actual_route}'"
)
# Check for unnecessary tools
unnecessary = actual_tools - expected_tools
if unnecessary:
labels.append("unnecessary_tool_call")
reasoning.append(f"Unnecessary tools: {unnecessary}")
# Check for missed retrieval
if failure_label in ("missing_evidence", "retrieval_miss"):
labels.append("missed_retrieval")
reasoning.append(f"Failure label indicates retrieval problem: {failure_label}")
# Check for citation issues
if failure_label == "hallucination" and context_tokens > 0:
labels.append("bad_citation")
reasoning.append("Evidence was retrieved but answer contained hallucinations")
# Determine overall path quality
if grade == "fully_correct":
# Check if the path was wasteful
# Heuristic: if context tokens are >2x the median, it's wasteful
# (in practice, you'd compute the median from the full run)
if context_tokens > 6000 or duration > 5.0:
labels.append("correct_but_wasteful")
reasoning.append(
f"Correct but used {context_tokens} context tokens / {duration}s"
)
elif not labels: # no other issues found
labels.append("correct_and_optimal")
reasoning.append("Correct answer via efficient path")
    # Fallback: every trace gets at least one label, even when none of the
    # heuristics above fired
    if not labels:
        labels.append("correct_but_wasteful" if grade in ("fully_correct", "partially_correct") else "wrong_route")
return {
"trace_labels": labels,
"primary_label": labels[0],
"reasoning": "; ".join(reasoning),
}
def label_run_traces(
run_file: str,
benchmark_file: str = "benchmark-questions.jsonl",
) -> str:
"""Apply trace labels to an entire graded run.
Writes a new file with trace labels added to each entry.
"""
benchmark = {}
with open(benchmark_file) as f:
for line in f:
if line.strip():
q = json.loads(line)
benchmark[q["id"]] = q
entries = []
with open(run_file) as f:
for line in f:
if line.strip():
entries.append(json.loads(line))
label_counts = {}
for entry in entries:
q_id = entry["question_id"]
bq = benchmark.get(q_id, {})
result = label_trace(entry, bq)
entry["trace_labels"] = result["trace_labels"]
entry["primary_trace_label"] = result["primary_label"]
entry["trace_reasoning"] = result["reasoning"]
for label in result["trace_labels"]:
label_counts[label] = label_counts.get(label, 0) + 1
# Save labeled version
output = run_file.replace(".jsonl", "-traced.jsonl")
with open(output, "w") as f:
for entry in entries:
f.write(json.dumps(entry) + "\n")
# Print summary
total = len(entries)
print(f"Trace labeling: {total} entries\n")
print("Label distribution:")
for label in TRACE_LABELS:
count = label_counts.get(label, 0)
pct = count / total * 100 if total > 0 else 0
bar = "#" * int(pct / 2)
print(f" {label:30s}: {count:3d} ({pct:5.1f}%) {bar}")
print(f"\nLabeled results saved to {output}")
return output
if __name__ == "__main__":
if len(sys.argv) < 2:
print("Usage: python harness/graders/trace_labeler.py <graded-run-file.jsonl>")
print("Note: run this on a GRADED file (after answer_grader.py)")
sys.exit(1)
    label_run_traces(sys.argv[1])

python harness/graders/trace_labeler.py harness/runs/harness-2026-03-25-142233-graded.jsonl

Expected output:
Trace labeling: 30 entries
Label distribution:
correct_and_optimal : 12 ( 40.0%) ####################
correct_but_wasteful : 6 ( 20.0%) ##########
wrong_route : 4 ( 13.3%) ######
missed_retrieval : 5 ( 16.7%) ########
bad_citation : 2 ( 6.7%) ###
unnecessary_tool_call : 1 ( 3.3%) #
Labeled results saved to harness/runs/harness-2026-03-25-142233-graded-traced.jsonl
This distribution is what turns vague complaints into engineering priorities. If 20% of traces are correct_but_wasteful, you know cost optimization will help. If 13% have wrong_route, routing accuracy is the bottleneck. The labels tell you where to focus.
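One refinement worth making: the labeler's wasteful check uses a hard-coded 6,000-token cutoff, and its comments suggest deriving the threshold from the run itself. A sketch of that refinement, assuming the entries carry the context_tokens field from the run log:

```python
import statistics

def wasteful_threshold(entries: list[dict], multiplier: float = 2.0) -> float:
    """Run-relative waste cutoff: anything above 2x the median context tokens.

    Sketch only — a replacement for the hard-coded 6000-token check
    in label_trace.
    """
    tokens = [e.get("context_tokens", 0) for e in entries]
    return multiplier * statistics.median(tokens)

# Five entries; the median is 2500, so the cutoff lands at 5000 tokens
entries = [{"context_tokens": t} for t in (1800, 2200, 2500, 3100, 9000)]
print(wasteful_threshold(entries))  # 5000.0
```

Computing the threshold once per run keeps the labeling stable as your corpus and prompts grow, rather than drifting against a fixed constant.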
CI-friendly eval patterns
For CI pipelines, you need evals that run fast and produce pass/fail results. Here's a pattern that runs a small eval suite and fails if quality drops below a threshold:
# harness/ci_eval.py
"""CI-friendly eval runner.
Runs a small benchmark subset, auto-grades, and exits with
a non-zero code if quality is below threshold.
Usage:
python harness/ci_eval.py [--threshold 0.6]
"""
import argparse
import json
import os
import sys
import time
sys.path.insert(0, ".")
# CI eval uses a smaller benchmark subset for speed
CI_BENCHMARK = "benchmark-questions-ci.jsonl"
THRESHOLD_DEFAULT = 0.6 # 60% fully_correct minimum
def run_ci_eval(threshold: float = THRESHOLD_DEFAULT) -> bool:
"""Run CI eval and return True if quality meets threshold."""
from datetime import datetime, timezone
from observability.traced_pipeline import traced_rag_pipeline, langfuse
from retrieval.hybrid_retrieve import hybrid_retrieve
from harness.graders.answer_grader import grade_answer
# Load CI benchmark (smaller subset)
benchmark_file = CI_BENCHMARK
if not os.path.exists(benchmark_file):
benchmark_file = "benchmark-questions.jsonl"
        print("CI benchmark not found, using full benchmark (slower)")
questions = []
benchmark = {}
with open(benchmark_file) as f:
for line in f:
if line.strip():
q = json.loads(line)
questions.append(q)
benchmark[q["id"]] = q
# Limit to 10 questions for CI speed
questions = questions[:10]
run_id = f"ci-{datetime.now(timezone.utc).strftime('%Y%m%d-%H%M%S')}"
print(f"CI eval: {len(questions)} questions, threshold: {threshold:.0%}\n")
# Run and grade
results = []
start = time.perf_counter()
for i, q in enumerate(questions):
print(f" [{i+1}/{len(questions)}] {q['question'][:50]}...")
answer = traced_rag_pipeline(
question=q["question"],
hybrid_retrieve_fn=hybrid_retrieve,
run_id=run_id,
)
# Auto-grade if gold answer exists
gold = q.get("gold_answer", "")
if gold:
grade_result = grade_answer(
question=q["question"],
gold_answer=gold,
system_answer=answer.answer,
)
results.append(grade_result)
langfuse.flush()
duration = time.perf_counter() - start
# Calculate pass/fail
if not results:
print("\nNo gradable results. Add gold_answer to benchmark questions.")
return False
correct = sum(1 for r in results if r["grade"] == "fully_correct")
accuracy = correct / len(results)
print(f"\n{'='*40}")
print(f"CI eval complete in {duration:.1f}s")
print(f" Accuracy: {accuracy:.0%} ({correct}/{len(results)})")
print(f" Threshold: {threshold:.0%}")
if accuracy >= threshold:
print(f" Result: PASS")
return True
else:
print(f" Result: FAIL")
# Show failures for debugging
for r in results:
if r["grade"] != "fully_correct":
print(f" {r['grade']}: {r.get('reasoning', '')[:80]}")
return False
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="CI eval runner")
parser.add_argument(
"--threshold", type=float, default=THRESHOLD_DEFAULT,
help=f"Minimum fully_correct rate (default: {THRESHOLD_DEFAULT})",
)
args = parser.parse_args()
passed = run_ci_eval(args.threshold)
    sys.exit(0 if passed else 1)

# Run CI eval with default threshold
python harness/ci_eval.py
# Run with a custom threshold
python harness/ci_eval.py --threshold 0.5

To integrate with your CI system, add this to your pipeline configuration. In GitHub Actions:
# .github/workflows/eval.yml
name: AI Eval
on: [pull_request]
jobs:
eval:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: "3.12"
- run: pip install -r requirements.txt
- run: python harness/ci_eval.py --threshold 0.6
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
LANGFUSE_PUBLIC_KEY: ${{ secrets.LANGFUSE_PUBLIC_KEY }}
          LANGFUSE_SECRET_KEY: ${{ secrets.LANGFUSE_SECRET_KEY }}

A few practical notes on CI evals:
- Keep the CI benchmark small (10-15 questions) for speed. Use the full benchmark for milestone evaluations.
- Set the threshold conservatively at first. A threshold that's too high will cause flaky failures from LLM judge variance.
- Cache the baseline grade so you can detect regressions. If accuracy drops more than 10% from the baseline, that's worth investigating even if it's still above the absolute threshold.
- Log the full results, not just pass/fail. When CI fails, you'll need the per-question details to debug.
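The baseline-caching note can be sketched in a few lines. Everything here is an assumption rather than part of ci_eval.py: the ci_baseline.json file name and the 10-point tolerance are hypothetical, and the file would need to be cached between CI runs:

```python
import json
import os

BASELINE_FILE = "ci_baseline.json"  # hypothetical cached artifact
MAX_DROP = 0.10                     # investigate drops of more than 10 points

def check_regression(accuracy: float) -> bool:
    """Return True if accuracy fell more than MAX_DROP below the baseline.

    On the first run (no baseline yet), records the baseline and passes.
    """
    if not os.path.exists(BASELINE_FILE):
        with open(BASELINE_FILE, "w") as f:
            json.dump({"accuracy": accuracy}, f)
        return False
    with open(BASELINE_FILE) as f:
        baseline = json.load(f)["accuracy"]
    return accuracy < baseline - MAX_DROP
```

Hooked in after accuracy is computed, this flags a run that clears the absolute threshold but drops sharply from its own baseline.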
The complete eval flow
Here's the full evaluation pipeline, from harness run to CI check:
# Full evaluation pass (milestone)
python harness/run_harness.py # Run benchmark
python harness/graders/answer_grader.py harness/runs/latest.jsonl # Grade answers
python harness/graders/retrieval_grader.py harness/runs/latest.jsonl # Check retrieval
python harness/graders/tool_grader.py harness/runs/latest.jsonl # Check tool use
python harness/graders/trace_labeler.py harness/runs/latest-graded.jsonl # Label traces
python harness/summarize_run.py harness/runs/latest-graded-traced.jsonl # Summarize
# Quick CI check (every PR)
python harness/ci_eval.py --threshold 0.6

The milestone pass gives you a complete picture: answer quality, retrieval quality, tool-use efficiency, and trace-level behavior. The CI check gives you a fast regression gate. Together, they're the evaluation layer of the harness we've been building across this module.
Exercises
- Add expected_tools and expected_route to at least 10 benchmark questions. Run the tool grader and identify the most common tool-use issues.
- Run the trace labeler on your latest graded run. What percentage of traces are correct_and_optimal? What's the most common non-optimal label?
- Create a CI benchmark file (benchmark-questions-ci.jsonl) with 10 representative questions (at least 2 from each category). Run harness/ci_eval.py and verify it completes in under 2 minutes.
- Make a deliberate change to your pipeline (e.g., reduce the token budget, change the routing thresholds) and run the CI eval. Does it catch the regression?
- Review 5 trace labels by hand. Does the automatic labeling match your assessment? If not, adjust the thresholds in trace_labeler.py.
Completion checkpoint
You have:
- A tool-use grader that checks tool precision, recall, and route accuracy against expected values
- A trace labeler that classifies execution paths using the six-label taxonomy
- A CI-friendly eval runner that produces pass/fail results in under 2 minutes
- All four eval families (retrieval, answer, tool-use, trace) connected to the harness
- At least one run that's been fully evaluated: answer-graded, retrieval-checked, tool-graded, and trace-labeled
Reflection prompts
- Looking at your trace label distribution, what's the biggest source of waste in your pipeline? What would you change to reduce it?
- How much confidence do you have in the automated grades vs. your manual grades? What would increase your confidence?
- The CI eval uses a small benchmark subset. What risks does that create? How would you choose which questions to include in the CI set?
Connecting to the project
We can now measure the single-agent system across all four eval families. We know how well retrieval works, whether answers are correct and grounded, whether tools are used efficiently, and whether execution paths are optimal. The harness produces these measurements automatically and can run in CI to catch regressions.
This is the foundation that makes everything from here forward accountable. Only now are we ready to add multi-agent coordination, because when we split work across specialists, we'll need to measure whether the coordination actually helps or just adds complexity. Without the eval framework we built in this module, adding orchestration would be flying blind. Module 7 will add the orchestration layer, and every architectural decision will be measured against the baseline we've established here.
What's next
Orchestration. Only now do you have the measurement foundation to tell whether adding more agents helps or just adds motion; the next lesson uses that foundation to decide when a split is justified.
References
Start here
- Anthropic: Building effective agents — the eval-driven development loop and trace analysis patterns that underpin this lesson's approach
Build with this
- promptfoo documentation — test-case-driven regression testing for prompts and pipelines, useful for the CI eval pattern
- Langfuse: Datasets and experiments — Langfuse's approach to managing benchmark datasets and running experiments
Deep dive
- DeepEval documentation — Python-centric LLM eval framework with built-in metrics for faithfulness, relevance, and tool use
- OpenAI: Evaluation getting started — OpenAI's structured approach to evaluation, including trace-level analysis