Building Your AI Harness
Over the last several modules, you've built pieces of an experiment system without calling it one. In Module 2, you created benchmark questions and a run-log schema. In the last two lessons, you added telemetry and cost tracking. In Module 5, you built the RAG pipeline that actually answers questions. Each piece works, but they're separate scripts with separate entry points. Running an experiment means executing multiple commands, copying file paths between scripts, and manually connecting the results.
This lesson pulls everything together into a single harness. One command to run the benchmark, trace every question, calculate costs, and produce a structured run log ready for grading. The harness isn't a new idea; it's the name for what you've been building all along. After this lesson, every evaluation in the rest of the module will run through this harness, and you'll be able to compare any two versions of your system with a single command.
What you'll learn
- Understand the AI harness concept: what it is, what it contains, and why it matters
- Connect the existing artifacts (benchmark questions, run-log schema, telemetry, cost tracker) into a unified experiment runner
- Run a complete benchmark pass with one command that produces traced, costed, gradable results
- Compare two runs side by side using the summarize script
- Set up the harness as the foundation for the eval lessons that follow
Concepts
AI harness — the experiment framework that makes iteration on an AI system repeatable and comparable. A harness has four components:
- Benchmark set — the questions and gold answers that define what "correct" means (you built this in Module 2's benchmark-design lesson)
- Runner — the code that sends benchmark questions through the system and captures outputs (you built a basic version in run-log-and-baseline)
- Telemetry — the traces and cost data that show what happened inside each request (you built this in the telemetry lesson and cost lesson)
- Graders — the evaluation logic that assigns grades and failure labels to each output (you've been doing this manually; we'll automate it in the next two lessons)
The harness is what turns "try it and see" into "run the experiment and compare." Without it, every change to the system requires manual testing and subjective judgment. With it, you can make a change, run the harness, and see exactly what improved, what regressed, and what stayed the same.
Experiment run — one complete pass of the benchmark through the system. Each run produces a JSONL log, a set of traces, and cost data. Runs are identified by a unique run ID and tagged with the repo SHA, model version, and harness version so you can reproduce them.
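Those identifying tags are cheap to generate. A minimal sketch of the idea, assuming a git checkout — `run_metadata` is a hypothetical helper, not part of the harness (the walkthrough's runner builds the same fields inline):

```python
import subprocess
from datetime import datetime, timezone

def run_metadata(model: str, harness_version: str) -> dict:
    """Build the identifying tags for one experiment run so it can be reproduced."""
    try:
        repo_sha = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        repo_sha = "unknown"  # not inside a git repository
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d-%H%M%S")
    return {
        "run_id": f"harness-{stamp}",
        "repo_sha": repo_sha,
        "model": model,
        "harness_version": harness_version,
    }
```

Stamping every log entry with these four fields is what lets you answer "which code, which model, which harness produced this number?" months later.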
Run comparison — a side-by-side analysis of two experiment runs. The simplest comparison is overall accuracy, but the more useful comparison is at the category and failure-label level. If Run B has higher accuracy than Run A, the comparison tells you where it improved (which categories?) and how (which failure modes disappeared?).
Walkthrough
What you've already built
Let's take inventory. Here's what exists in your project from previous lessons:
anchor-repo/
├── benchmark-questions.jsonl        # Module 2: 30+ questions with gold answers
├── harness/
│   ├── schema.py                    # Module 2: run-log schema definition
│   ├── run_baseline.py              # Module 2: basic benchmark runner
│   ├── grade_baseline.py            # Module 2: interactive grading tool
│   ├── summarize_run.py             # Module 2: run summary metrics
│   └── runs/                        # Module 2: saved run logs
│       ├── baseline-2026-03-24-*.jsonl
│       └── traced-2026-03-25-*.jsonl    # Module 6 L1: traced run
├── observability/
│   ├── traced_pipeline.py           # Module 6 L1: instrumented RAG pipeline
│   ├── traced_benchmark.py          # Module 6 L1: traced benchmark runner
│   ├── cost_tracker.py              # Module 6 L2: cost estimation
│   ├── cache_metrics.py             # Module 6 L2: cache analysis
│   ├── model_router.py              # Module 6 L2: model routing
│   ├── token_budget.py              # Module 6 L2: budget enforcement
│   ├── rate_limit_telemetry.py      # Module 6 L2: rate-limit tracking
│   └── success_cost.py              # Module 6 L2: cost-per-success metric
└── rag/
    ├── retrieval_service.py         # Module 5: routed retrieval
    ├── rag_with_routing.py          # Module 5: full RAG pipeline
    ├── grounded_answer.py           # Module 5: grounded answer generation
    └── evidence_bundle.py           # Module 5: evidence bundle format
The harness we're building now will unify the runner, telemetry, and cost tracking into a single entry point. The graders (which we'll build in the next two lessons) will plug into this same harness.
The unified harness runner
This script replaces harness/run_baseline.py and observability/traced_benchmark.py with a single runner that does everything:
# harness/run_harness.py
"""Unified AI harness runner.

One command to:
1. Load benchmark questions
2. Run each through the traced, costed RAG pipeline
3. Save a structured JSONL run log
4. Print a summary

Usage:
    python harness/run_harness.py [--run-id RUN_ID] [--limit N]
"""
import argparse
import json
import os
import sys
import time
from datetime import datetime, timezone

sys.path.insert(0, ".")

from langfuse import Langfuse

from observability.traced_pipeline import traced_rag_pipeline
from observability.cost_tracker import estimate_cost
from retrieval.hybrid_retrieve import hybrid_retrieve

# --- Argument parsing ---
parser = argparse.ArgumentParser(description="Run the AI harness benchmark")
parser.add_argument("--run-id", default=None, help="Custom run ID")
parser.add_argument("--limit", type=int, default=None, help="Limit to N questions")
parser.add_argument(
    "--benchmark", default="benchmark-questions.jsonl",
    help="Path to benchmark questions file",
)
args = parser.parse_args()

# --- Configuration ---
RUN_ID = args.run_id or f"harness-{datetime.now(timezone.utc).strftime('%Y-%m-%d-%H%M%S')}"
MODEL = "gpt-4o-mini"
PROVIDER = "openai"
BENCHMARK_FILE = args.benchmark
OUTPUT_FILE = f"harness/runs/{RUN_ID}.jsonl"
REPO_SHA = os.popen("git rev-parse --short HEAD").read().strip()

langfuse = Langfuse()

# --- Load benchmark questions ---
questions = []
with open(BENCHMARK_FILE) as f:
    for line in f:
        if line.strip():
            questions.append(json.loads(line))
if args.limit:
    questions = questions[:args.limit]

print("AI Harness Run")
print(f"  Run ID:    {RUN_ID}")
print(f"  Provider:  {PROVIDER}")
print(f"  Model:     {MODEL}")
print(f"  Repo SHA:  {REPO_SHA}")
print(f"  Questions: {len(questions)}")
print()

# --- Run each question ---
os.makedirs("harness/runs", exist_ok=True)
results = []
total_cost = 0.0
run_start = time.perf_counter()

for i, q in enumerate(questions):
    q_start = time.perf_counter()
    print(f"[{i+1}/{len(questions)}] {q['category']}: {q['question'][:55]}...")
    try:
        answer = traced_rag_pipeline(
            question=q["question"],
            hybrid_retrieve_fn=hybrid_retrieve,
            model=MODEL,
            run_id=RUN_ID,
        )
        q_duration = time.perf_counter() - q_start
        # Estimate cost using rough token counts. Langfuse will have
        # the exact usage once the traces are flushed.
        cost = estimate_cost(
            model=MODEL,
            input_tokens=answer.context_tokens or 3500,
            output_tokens=400,
        )
        total_cost += cost["total_cost"]
        entry = {
            "run_id": RUN_ID,
            "question_id": q["id"],
            "question": q["question"],
            "category": q["category"],
            "answer": answer.answer,
            "abstained": answer.abstained,
            "abstention_reason": answer.abstention_reason if answer.abstained else None,
            "model": MODEL,
            "provider": PROVIDER,
            "evidence_files": list(set(
                c.get("file_path", "") for c in answer.citations
            )) if answer.citations else [],
            "tools_called": [],  # Populated if using the agent loop
            "retrieval_method": getattr(answer, "route", "routed"),
            "citation_count": len(answer.citations) if answer.citations else 0,
            "context_tokens": answer.context_tokens or 0,
            "estimated_cost": cost["total_cost"],
            "duration_seconds": round(q_duration, 2),
            "grade": None,
            "failure_label": None,
            "grading_notes": "",
            "repo_sha": REPO_SHA,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "harness_version": "v0.3",
        }
        results.append(entry)
    except Exception as e:
        print(f"  ERROR: {e}")
        entry = {
            "run_id": RUN_ID,
            "question_id": q["id"],
            "question": q["question"],
            "category": q["category"],
            "answer": f"ERROR: {e}",
            "abstained": True,
            "abstention_reason": f"Pipeline error: {e}",
            "model": MODEL,
            "provider": PROVIDER,
            "evidence_files": [],
            "tools_called": [],
            "retrieval_method": "error",
            "citation_count": 0,
            "context_tokens": 0,
            "estimated_cost": 0.0,
            "duration_seconds": round(time.perf_counter() - q_start, 2),
            "grade": "wrong",
            "failure_label": "pipeline_error",
            "grading_notes": str(e),
            "repo_sha": REPO_SHA,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "harness_version": "v0.3",
        }
        results.append(entry)

total_duration = time.perf_counter() - run_start

# --- Save results ---
with open(OUTPUT_FILE, "w") as f:
    for entry in results:
        f.write(json.dumps(entry) + "\n")
langfuse.flush()

# --- Print summary ---
print(f"\n{'=' * 50}")
print(f"Run complete: {RUN_ID}")
print(f"  Duration:  {total_duration:.1f}s")
print(f"  Questions: {len(results)}")
print(f"  Abstained: {sum(1 for r in results if r['abstained'])}")
print(f"  Est. cost: ${total_cost:.4f}")
print(f"  Saved to:  {OUTPUT_FILE}")
print(f"  Traces:    Langfuse (filter by run_id: {RUN_ID})")
print("\nNext steps:")
print(f"  Grade:     python harness/grade_baseline.py {OUTPUT_FILE}")
print(f"  Summarize: python harness/summarize_run.py {OUTPUT_FILE}")

Run it with a small limit first:

python harness/run_harness.py --limit 5

Expected output looks like this, with the provider and model lines matching your configuration:
AI Harness Run
Run ID: harness-2026-03-25-142233
Provider: openai
Model: gpt-4o-mini
Repo SHA: a1b2c3d
Questions: 5
[1/5] symbol_lookup: What does validate_path return?...
[2/5] architecture: What is the architecture of the retrieval system?...
[3/5] debugging: Why does the auth middleware reject valid tokens?...
[4/5] onboarding: How do I set up the development environment?...
[5/5] change_impact: What changed in the caching module last week?...
==================================================
Run complete: harness-2026-03-25-142233
Duration: 12.3s
Questions: 5
Abstained: 1
Est. cost: $0.0035
Saved to: harness/runs/harness-2026-03-25-142233.jsonl
Traces: Langfuse (filter by run_id: harness-2026-03-25-142233)
Next steps:
Grade: python harness/grade_baseline.py harness/runs/harness-2026-03-25-142233.jsonl
  Summarize: python harness/summarize_run.py harness/runs/harness-2026-03-25-142233.jsonl

That's the whole flow: one command, and you get a traced, costed, gradable run log. The --limit flag is useful during development so you can run 5 questions to verify things work, then do a full pass when you're ready.
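Because the run log is plain JSONL, it's scriptable before you ever grade it. A quick sanity check along these lines can catch a misconfigured run early; `quick_stats` is a hypothetical helper for illustration, not part of the harness:

```python
import json

def quick_stats(path: str) -> dict:
    """Sanity-check a fresh (ungraded) run log produced by run_harness.py."""
    with open(path) as f:
        entries = [json.loads(line) for line in f if line.strip()]
    return {
        "questions": len(entries),
        "abstained": sum(1 for e in entries if e["abstained"]),
        # Sum of per-question cost estimates, rounded for display
        "est_cost": round(sum(e["estimated_cost"] for e in entries), 4),
        # Entries the runner marked as pipeline errors
        "errors": sum(1 for e in entries if e.get("failure_label") == "pipeline_error"),
    }
```

If `errors` is nonzero or `abstained` is unexpectedly high, inspect the traces before spending time on grading.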
Comparing two runs
The summarize script from Module 2 works on any run log. To compare two runs, you'll run it on both and look at the differences:
# harness/compare_runs.py
"""Compare two graded run logs side by side.

Shows where accuracy improved, regressed, or stayed the same.
"""
import json
import sys
from collections import Counter

def load_run(path: str) -> list[dict]:
    entries = []
    with open(path) as f:
        for line in f:
            if line.strip():
                entries.append(json.loads(line))
    return entries

def summarize(entries: list[dict]) -> dict:
    graded = [e for e in entries if e.get("grade") is not None]
    if not graded:
        return {"total": 0, "grades": {}, "failures": {}, "by_category": {}}
    grades = Counter(e["grade"] for e in graded)
    failures = Counter(
        e["failure_label"] for e in graded
        if e.get("failure_label") and e["failure_label"] != "none"
    )
    by_cat = {}
    for cat in sorted(set(e["category"] for e in graded)):
        cat_entries = [e for e in graded if e["category"] == cat]
        correct = sum(1 for e in cat_entries if e["grade"] == "fully_correct")
        by_cat[cat] = {"correct": correct, "total": len(cat_entries)}
    return {
        "total": len(graded),
        "grades": dict(grades),
        "failures": dict(failures),
        "by_category": by_cat,
    }

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Usage: python harness/compare_runs.py <run-a.jsonl> <run-b.jsonl>")
        sys.exit(1)
    run_a = load_run(sys.argv[1])
    run_b = load_run(sys.argv[2])
    summary_a = summarize(run_a)
    summary_b = summarize(run_b)
    name_a = run_a[0]["run_id"] if run_a else "Run A"
    name_b = run_b[0]["run_id"] if run_b else "Run B"
    print(f"Comparing: {name_a} vs {name_b}\n")

    # Overall accuracy
    for grade in ["fully_correct", "partially_correct", "unsupported", "wrong"]:
        a_count = summary_a["grades"].get(grade, 0)
        b_count = summary_b["grades"].get(grade, 0)
        a_pct = a_count / summary_a["total"] * 100 if summary_a["total"] else 0
        b_pct = b_count / summary_b["total"] * 100 if summary_b["total"] else 0
        delta = b_pct - a_pct
        arrow = "+" if delta > 0 else ""
        print(f"  {grade:20s}: {a_pct:5.1f}% -> {b_pct:5.1f}% ({arrow}{delta:.1f}%)")

    # Per-category comparison
    all_cats = sorted(set(summary_a["by_category"]) | set(summary_b["by_category"]))
    print("\nPer-category accuracy:")
    for cat in all_cats:
        a = summary_a["by_category"].get(cat, {"correct": 0, "total": 0})
        b = summary_b["by_category"].get(cat, {"correct": 0, "total": 0})
        print(f"  {cat:20s}: {a['correct']}/{a['total']} -> {b['correct']}/{b['total']}")

    # Failure label changes
    all_labels = sorted(set(summary_a["failures"]) | set(summary_b["failures"]))
    if all_labels:
        print("\nFailure label changes:")
        for label in all_labels:
            a_count = summary_a["failures"].get(label, 0)
            b_count = summary_b["failures"].get(label, 0)
            delta = b_count - a_count
            arrow = "+" if delta > 0 else ""
            print(f"  {label:20s}: {a_count} -> {b_count} ({arrow}{delta})")

Run it against two graded logs:

python harness/compare_runs.py \
  harness/runs/baseline-2026-03-24-graded.jsonl \
  harness/runs/harness-2026-03-25-graded.jsonl

The harness as an experiment framework
Here's the mental model for how the harness works going forward:
Every experiment follows the same flow: run the harness, grade the results, summarize, and compare. When you change the pipeline (better retrieval, different model, new prompt), you run the harness again and compare the new run to the previous one. The graders (which we'll build in the next two lessons) plug into this same flow.
This is what the harness concept means in practice: a repeatable experiment framework that separates the system under test from the evaluation logic. You can change the pipeline without changing the harness, and you can change the graders without re-running the pipeline. That separation makes iteration fast and comparisons trustworthy.
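To make that separation concrete: re-grading a saved run never requires re-running the pipeline. A minimal sketch of the idea — `regrade_run` and `abstention_grader` are hypothetical helpers, not the graders the next two lessons will build:

```python
import json
from typing import Callable

# A grader maps one run-log entry to (grade, failure_label).
Grader = Callable[[dict], tuple[str, str]]

def regrade_run(path: str, grader: Grader) -> list[dict]:
    """Apply a grader to an existing JSONL run log without touching the pipeline."""
    graded = []
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            entry = json.loads(line)
            grade, label = grader(entry)
            entry["grade"] = grade
            entry["failure_label"] = label
            graded.append(entry)
    return graded

# Example: a trivially simple grader that only flags abstentions.
def abstention_grader(entry: dict) -> tuple[str, str]:
    if entry.get("abstained"):
        return "wrong", "abstained"
    return "fully_correct", "none"
```

Any grader with this shape plugs in later: swap the callable, re-read the same logs, and the pipeline's outputs stay fixed while the evaluation logic evolves.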
Harness versioning
Notice the harness_version field in the run log. We've been incrementing it:
- v0.1 — Module 2: manual prompting, no retrieval, hand-graded
- v0.2 — Module 6 L1: traced pipeline with retrieval routing
- v0.3 — Module 6 L3 (this lesson): unified runner with cost tracking
When you compare two runs, the harness version tells you whether the runs were produced by the same infrastructure. If they were, differences in results are due to pipeline changes. If they weren't, you'll need to account for harness improvements too.
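If you'd rather enforce that rule mechanically than remember it, a small guard can compare the harness versions of two runs before reporting deltas. This is a sketch of one possible addition to compare_runs.py, not something the script above already does:

```python
def same_harness_version(run_a: list[dict], run_b: list[dict]) -> bool:
    """Return True when both runs were produced by identical harness infrastructure."""
    versions_a = {e.get("harness_version") for e in run_a}
    versions_b = {e.get("harness_version") for e in run_b}
    if versions_a != versions_b:
        # Differences in results may reflect harness changes, not pipeline changes.
        print(f"WARNING: harness versions differ: {versions_a} vs {versions_b}")
        return False
    return True
```

Calling it at the top of the comparison keeps an apples-to-oranges run pair from being read as a pipeline improvement.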
Exercises
- Run python harness/run_harness.py against your full benchmark set. Verify that the output file contains all expected fields from the schema, including the new estimated_cost and duration_seconds fields.
- Grade the run using harness/grade_baseline.py, then compare it to your Module 2 baseline using harness/compare_runs.py. Which categories improved the most from adding retrieval?
- Add a --model flag to run_harness.py, then run the harness twice with different models (e.g., --model gpt-4o-mini and --model gpt-4o). Compare costs and accuracy. Is the more expensive model worth it?
- Run the harness with --limit 5 and inspect the traces in Langfuse. Verify that each trace has the correct run ID and that you can filter by run ID to see all five traces together.
- Add a --dry-run flag to run_harness.py that prints what would happen without making any API calls. This is useful for verifying configuration before spending money.
Completion checkpoint
You have:
- A unified harness runner (harness/run_harness.py) that produces traced, costed, gradable run logs with one command
- A run comparison script (harness/compare_runs.py) that shows accuracy changes at the category and failure-label level
- At least one complete harness run against your full benchmark
- A comparison between your Module 2 baseline and your current system showing where retrieval improved results
- An understanding of the four-component harness structure (benchmark, runner, telemetry, graders) and how the next two lessons will complete it
Reflection prompts
- How does having a one-command harness change the way you think about making changes to the pipeline? Would you be more or less likely to experiment?
- Looking at your run comparison, are the improvements from retrieval distributed evenly across categories, or concentrated in specific ones? What does that tell you about where to focus next?
- The harness currently uses rough token estimates for cost. What would you need to capture exact costs, and is the precision worth the implementation effort?
Connecting to the project
The harness is now a single-command experiment framework. You can run the benchmark, trace every question, estimate costs, grade results, and compare runs. What's missing is automated grading. Right now, you still grade by hand, which is slow and doesn't scale.
The next two lessons add automated graders for the four eval families: retrieval evals, answer evals, tool-use evals, and trace evals. These graders will plug directly into the harness, so by the end of the module you'll be able to run a fully automated experiment pass: harness run, auto-grade, summarize, and compare without any manual intervention.
What's next
Retrieval and Answer Evals. The harness makes runs repeatable; the next lesson turns grading into something repeatable too, starting with evidence quality and answer quality.
References
Start here
- Hamel Husain: Your AI product needs evals — the strongest practical argument for building a harness before iterating on your system
Build with this
- JSONL specification — the format underlying all run logs; simple, appendable, and diffable
- Langfuse: Sessions — how to group traces by session (run ID) for viewing all traces from a single harness run
Deep dive
- Anthropic: Building effective agents — discusses the eval-first development loop that the harness enables