Building Your AI Harness
Over the last several modules, you've built pieces of an experiment system without calling it one. In Module 2, you created benchmark questions and a run-log schema. In the last two lessons, you added telemetry and cost tracking. In Module 5, you built the RAG pipeline that actually answers questions. Each piece works, but they're separate scripts with separate entry points. Running an experiment means executing multiple commands, copying file paths between scripts, and manually connecting the results.
This lesson pulls everything together into a single harness. One command to run the benchmark, trace every question, calculate costs, and produce a structured run log ready for grading. The harness isn't a new idea; it's the name for what you've been building all along. After this lesson, every evaluation in the rest of the module will run through this harness, and you'll be able to compare any two versions of your system with a single command.
What you'll learn
- Understand the AI harness concept: what it is, what it contains, and why it matters
- Connect the existing artifacts (benchmark questions, run-log schema, telemetry, cost tracker) into a unified experiment runner
- Run a complete benchmark pass with one command that produces traced, costed, gradable results
- Compare two runs side by side using the summarize script
- Set up the harness as the foundation for the eval lessons that follow
Concepts
AI harness — the experiment framework that makes iteration on an AI system repeatable and comparable. A harness has four components:
- Benchmark set — the questions and gold answers that define what "correct" means (you built this in Module 2's benchmark-design lesson)
- Runner — the code that sends benchmark questions through the system and captures outputs (you built a basic version in run-log-and-baseline)
- Telemetry — the traces and cost data that show what happened inside each request (you built this in the telemetry lesson and cost lesson)
- Graders — the evaluation logic that assigns grades and failure labels to each output (you've been doing this manually; we'll automate it in the next two lessons)
The harness is what turns "try it and see" into "run the experiment and compare." Without it, every change to the system requires manual testing and subjective judgment. With it, you can make a change, run the harness, and see exactly what improved, what regressed, and what stayed the same.
Experiment run — one complete pass of the benchmark through the system. Each run produces a JSONL log, a set of traces, and cost data. Runs are identified by a unique run ID and tagged with the repo SHA, model version, and harness version so you can reproduce them.
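Those identifying tags are cheap to generate. A minimal sketch of the idea, assuming a git checkout — `run_metadata` is a hypothetical helper, not part of the harness (the walkthrough's runner builds the same fields inline):

```python
import subprocess
from datetime import datetime, timezone

def run_metadata(model: str, harness_version: str) -> dict:
    """Build the identifying tags for one experiment run so it can be reproduced."""
    try:
        repo_sha = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        repo_sha = "unknown"  # not inside a git repository
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d-%H%M%S")
    return {
        "run_id": f"harness-{stamp}",
        "repo_sha": repo_sha,
        "model": model,
        "harness_version": harness_version,
    }
```

Stamping every log entry with these four fields is what lets you answer "which code, which model, which harness produced this number?" months later.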
Run comparison — a side-by-side analysis of two experiment runs. The simplest comparison is overall accuracy, but the more useful comparison is at the category and failure-label level. If Run B has higher accuracy than Run A, the comparison tells you where it improved (which categories?) and how (which failure modes disappeared?).
Walkthrough
What you've already built
Let's take inventory. Here's what exists in your project from previous lessons:
anchor-repo/
├── benchmark-questions.jsonl        # Module 2: 30+ questions with gold answers
├── harness/
│   ├── schema.py                    # Module 2: run-log schema definition
│   ├── run_baseline.py              # Module 2: basic benchmark runner
│   ├── grade_baseline.py            # Module 2: interactive grading tool
│   ├── summarize_run.py             # Module 2: run summary metrics
│   └── runs/                        # Module 2: saved run logs
│       ├── baseline-2026-03-24-*.jsonl
│       └── traced-2026-03-25-*.jsonl    # Module 6 L1: traced run
├── observability/
│   ├── traced_pipeline.py           # Module 6 L1: instrumented RAG pipeline
│   ├── traced_benchmark.py          # Module 6 L1: traced benchmark runner
│   ├── cost_tracker.py              # Module 6 L2: cost estimation
│   ├── cache_metrics.py             # Module 6 L2: cache analysis
│   ├── model_router.py              # Module 6 L2: model routing
│   ├── token_budget.py              # Module 6 L2: budget enforcement
│   ├── rate_limit_telemetry.py      # Module 6 L2: rate-limit tracking
│   └── success_cost.py              # Module 6 L2: cost-per-success metric
└── rag/
    ├── retrieval_service.py         # Module 5: routed retrieval
    ├── rag_with_routing.py          # Module 5: full RAG pipeline
    ├── grounded_answer.py           # Module 5: grounded answer generation
    └── evidence_bundle.py           # Module 5: evidence bundle format
The harness we're building now will unify the runner, telemetry, and cost tracking into a single entry point. The graders (which we'll build in the next two lessons) will plug into this same harness.
The unified harness runner
This script replaces harness/run_baseline.py and observability/traced_benchmark.py with a single runner that does everything:
# harness/run_harness.py
"""Unified AI harness runner.

One command to:
1. Load benchmark questions
2. Run each through the traced, costed RAG pipeline
3. Save a structured JSONL run log
4. Print a summary

Usage:
    python harness/run_harness.py [--run-id RUN_ID] [--limit N]
"""
import argparse
import json
import os
import sys
import time
from datetime import datetime, timezone

sys.path.insert(0, ".")

from langfuse import Langfuse

from observability.traced_pipeline import traced_rag_pipeline
from observability.cost_tracker import estimate_cost
from retrieval.hybrid_retrieve import hybrid_retrieve

# --- Argument parsing ---
parser = argparse.ArgumentParser(description="Run the AI harness benchmark")
parser.add_argument("--run-id", default=None, help="Custom run ID")
parser.add_argument("--limit", type=int, default=None, help="Limit to N questions")
parser.add_argument(
    "--benchmark", default="benchmark-questions.jsonl",
    help="Path to benchmark questions file",
)
args = parser.parse_args()

# --- Configuration ---
RUN_ID = args.run_id or f"harness-{datetime.now(timezone.utc).strftime('%Y-%m-%d-%H%M%S')}"
MODEL = "gpt-4o-mini"
PROVIDER = "openai"
BENCHMARK_FILE = args.benchmark
OUTPUT_FILE = f"harness/runs/{RUN_ID}.jsonl"
REPO_SHA = os.popen("git rev-parse --short HEAD").read().strip()

langfuse = Langfuse()

# --- Load benchmark questions ---
questions = []
with open(BENCHMARK_FILE) as f:
    for line in f:
        if line.strip():
            questions.append(json.loads(line))
if args.limit:
    questions = questions[:args.limit]

print("AI Harness Run")
print(f"  Run ID:    {RUN_ID}")
print(f"  Provider:  {PROVIDER}")
print(f"  Model:     {MODEL}")
print(f"  Repo SHA:  {REPO_SHA}")
print(f"  Questions: {len(questions)}")
print()

# --- Run each question ---
os.makedirs("harness/runs", exist_ok=True)
results = []
total_cost = 0.0
run_start = time.perf_counter()

for i, q in enumerate(questions):
    q_start = time.perf_counter()
    print(f"[{i+1}/{len(questions)}] {q['category']}: {q['question'][:55]}...")
    try:
        answer = traced_rag_pipeline(
            question=q["question"],
            hybrid_retrieve_fn=hybrid_retrieve,
            model=MODEL,
            run_id=RUN_ID,
        )
        q_duration = time.perf_counter() - q_start
        # Estimate cost using rough token counts. Langfuse will have
        # the exact usage once the traces are flushed.
        cost = estimate_cost(
            model=MODEL,
            input_tokens=answer.context_tokens or 3500,
            output_tokens=400,
        )
        total_cost += cost["total_cost"]
        entry = {
            "run_id": RUN_ID,
            "question_id": q["id"],
            "question": q["question"],
            "category": q["category"],
            "answer": answer.answer,
            "abstained": answer.abstained,
            "abstention_reason": answer.abstention_reason if answer.abstained else None,
            "model": MODEL,
            "provider": PROVIDER,
            "evidence_files": list(set(
                c.get("file_path", "") for c in answer.citations
            )) if answer.citations else [],
            "tools_called": [],  # Populated if using the agent loop
            "retrieval_method": getattr(answer, "route", "routed"),
            "citation_count": len(answer.citations) if answer.citations else 0,
            "context_tokens": answer.context_tokens or 0,
            "estimated_cost": cost["total_cost"],
            "duration_seconds": round(q_duration, 2),
            "grade": None,
            "failure_label": None,
            "grading_notes": "",
            "repo_sha": REPO_SHA,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "harness_version": "v0.3",
        }
        results.append(entry)
    except Exception as e:
        print(f"  ERROR: {e}")
        entry = {
            "run_id": RUN_ID,
            "question_id": q["id"],
            "question": q["question"],
            "category": q["category"],
            "answer": f"ERROR: {e}",
            "abstained": True,
            "abstention_reason": f"Pipeline error: {e}",
            "model": MODEL,
            "provider": PROVIDER,
            "evidence_files": [],
            "tools_called": [],
            "retrieval_method": "error",
            "citation_count": 0,
            "context_tokens": 0,
            "estimated_cost": 0.0,
            "duration_seconds": round(time.perf_counter() - q_start, 2),
            "grade": "wrong",
            "failure_label": "pipeline_error",
            "grading_notes": str(e),
            "repo_sha": REPO_SHA,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "harness_version": "v0.3",
        }
        results.append(entry)

total_duration = time.perf_counter() - run_start

# --- Save results ---
with open(OUTPUT_FILE, "w") as f:
    for entry in results:
        f.write(json.dumps(entry) + "\n")
langfuse.flush()

# --- Print summary ---
print(f"\n{'=' * 50}")
print(f"Run complete: {RUN_ID}")
print(f"  Duration:  {total_duration:.1f}s")
print(f"  Questions: {len(results)}")
print(f"  Abstained: {sum(1 for r in results if r['abstained'])}")
print(f"  Est. cost: ${total_cost:.4f}")
print(f"  Saved to:  {OUTPUT_FILE}")
print(f"  Traces:    Langfuse (filter by run_id: {RUN_ID})")
print("\nNext steps:")
print(f"  Grade:     python harness/grade_baseline.py {OUTPUT_FILE}")
print(f"  Summarize: python harness/summarize_run.py {OUTPUT_FILE}")

Run it with a small limit first:

python harness/run_harness.py --limit 5

Expected output looks like this, with the provider and model lines matching your configuration:
AI Harness Run
Run ID: harness-2026-03-25-142233
Provider: openai
Model: gpt-4o-mini
Repo SHA: a1b2c3d
Questions: 5
[1/5] symbol_lookup: What does validate_path return?...
[2/5] architecture: What is the architecture of the retrieval system?...
[3/5] debugging: Why does the auth middleware reject valid tokens?...
[4/5] onboarding: How do I set up the development environment?...
[5/5] change_impact: What changed in the caching module last week?...
==================================================
Run complete: harness-2026-03-25-142233
Duration: 12.3s
Questions: 5
Abstained: 1
Est. cost: $0.0035
Saved to: harness/runs/harness-2026-03-25-142233.jsonl
Traces: Langfuse (filter by run_id: harness-2026-03-25-142233)
Next steps:
Grade: python harness/grade_baseline.py harness/runs/harness-2026-03-25-142233.jsonl
  Summarize: python harness/summarize_run.py harness/runs/harness-2026-03-25-142233.jsonl

That's the whole flow: one command, and you get a traced, costed, gradable run log. The --limit flag is useful during development so you can run 5 questions to verify things work, then do a full pass when you're ready.
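Because the run log is plain JSONL, it's scriptable before you ever grade it. A quick sanity check along these lines can catch a misconfigured run early; `quick_stats` is a hypothetical helper for illustration, not part of the harness:

```python
import json

def quick_stats(path: str) -> dict:
    """Sanity-check a fresh (ungraded) run log produced by run_harness.py."""
    with open(path) as f:
        entries = [json.loads(line) for line in f if line.strip()]
    return {
        "questions": len(entries),
        "abstained": sum(1 for e in entries if e["abstained"]),
        # Sum of per-question cost estimates, rounded for display
        "est_cost": round(sum(e["estimated_cost"] for e in entries), 4),
        # Entries the runner marked as pipeline errors
        "errors": sum(1 for e in entries if e.get("failure_label") == "pipeline_error"),
    }
```

If `errors` is nonzero or `abstained` is unexpectedly high, inspect the traces before spending time on grading.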
Comparing two runs
The summarize script from Module 2 works on any run log. To compare two runs, you'll run it on both and look at the differences:
# harness/compare_runs.py
"""Compare two graded run logs side by side.

Shows where accuracy improved, regressed, or stayed the same.
"""
import json
import sys
from collections import Counter

def load_run(path: str) -> list[dict]:
    entries = []
    with open(path) as f:
        for line in f:
            if line.strip():
                entries.append(json.loads(line))
    return entries

def summarize(entries: list[dict]) -> dict:
    graded = [e for e in entries if e.get("grade") is not None]
    if not graded:
        return {"total": 0, "grades": {}, "failures": {}, "by_category": {}}
    grades = Counter(e["grade"] for e in graded)
    failures = Counter(
        e["failure_label"] for e in graded
        if e.get("failure_label") and e["failure_label"] != "none"
    )
    by_cat = {}
    for cat in sorted(set(e["category"] for e in graded)):
        cat_entries = [e for e in graded if e["category"] == cat]
        correct = sum(1 for e in cat_entries if e["grade"] == "fully_correct")
        by_cat[cat] = {"correct": correct, "total": len(cat_entries)}
    return {
        "total": len(graded),
        "grades": dict(grades),
        "failures": dict(failures),
        "by_category": by_cat,
    }

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Usage: python harness/compare_runs.py <run-a.jsonl> <run-b.jsonl>")
        sys.exit(1)
    run_a = load_run(sys.argv[1])
    run_b = load_run(sys.argv[2])
    summary_a = summarize(run_a)
    summary_b = summarize(run_b)
    name_a = run_a[0]["run_id"] if run_a else "Run A"
    name_b = run_b[0]["run_id"] if run_b else "Run B"
    print(f"Comparing: {name_a} vs {name_b}\n")

    # Overall accuracy
    for grade in ["fully_correct", "partially_correct", "unsupported", "wrong"]:
        a_count = summary_a["grades"].get(grade, 0)
        b_count = summary_b["grades"].get(grade, 0)
        a_pct = a_count / summary_a["total"] * 100 if summary_a["total"] else 0
        b_pct = b_count / summary_b["total"] * 100 if summary_b["total"] else 0
        delta = b_pct - a_pct
        arrow = "+" if delta > 0 else ""
        print(f"  {grade:20s}: {a_pct:5.1f}% -> {b_pct:5.1f}% ({arrow}{delta:.1f}%)")

    # Per-category comparison
    all_cats = sorted(set(summary_a["by_category"]) | set(summary_b["by_category"]))
    print("\nPer-category accuracy:")
    for cat in all_cats:
        a = summary_a["by_category"].get(cat, {"correct": 0, "total": 0})
        b = summary_b["by_category"].get(cat, {"correct": 0, "total": 0})
        print(f"  {cat:20s}: {a['correct']}/{a['total']} -> {b['correct']}/{b['total']}")

    # Failure label changes
    all_labels = sorted(set(summary_a["failures"]) | set(summary_b["failures"]))
    if all_labels:
        print("\nFailure label changes:")
        for label in all_labels:
            a_count = summary_a["failures"].get(label, 0)
            b_count = summary_b["failures"].get(label, 0)
            delta = b_count - a_count
            arrow = "+" if delta > 0 else ""
            print(f"  {label:20s}: {a_count} -> {b_count} ({arrow}{delta})")

Run it against two graded logs:

python harness/compare_runs.py \
  harness/runs/baseline-2026-03-24-graded.jsonl \
  harness/runs/harness-2026-03-25-graded.jsonl

The harness as an experiment framework
Here's the mental model for how the harness works going forward:
Every experiment follows the same flow: run the harness, grade the results, summarize, and compare. When you change the pipeline (better retrieval, different model, new prompt), you run the harness again and compare the new run to the previous one. The graders (which we'll build in the next two lessons) plug into this same flow.
This is what the harness concept means in practice: a repeatable experiment framework that separates the system under test from the evaluation logic. You can change the pipeline without changing the harness, and you can change the graders without re-running the pipeline. That separation makes iteration fast and comparisons trustworthy.
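To make that separation concrete: re-grading a saved run never requires re-running the pipeline. A minimal sketch of the idea — `regrade_run` and `abstention_grader` are hypothetical helpers, not the graders the next two lessons will build:

```python
import json
from typing import Callable

# A grader maps one run-log entry to (grade, failure_label).
Grader = Callable[[dict], tuple[str, str]]

def regrade_run(path: str, grader: Grader) -> list[dict]:
    """Apply a grader to an existing JSONL run log without touching the pipeline."""
    graded = []
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            entry = json.loads(line)
            grade, label = grader(entry)
            entry["grade"] = grade
            entry["failure_label"] = label
            graded.append(entry)
    return graded

# Example: a trivially simple grader that only flags abstentions.
def abstention_grader(entry: dict) -> tuple[str, str]:
    if entry.get("abstained"):
        return "wrong", "abstained"
    return "fully_correct", "none"
```

Any grader with this shape plugs in later: swap the callable, re-read the same logs, and the pipeline's outputs stay fixed while the evaluation logic evolves.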
Harness versioning
Notice the harness_version field in the run log. We've been incrementing it:
- v0.1 — Module 2: manual prompting, no retrieval, hand-graded
- v0.2 — Module 6 L1: traced pipeline with retrieval routing
- v0.3 — Module 6 L3 (this lesson): unified runner with cost tracking
When you compare two runs, the harness version tells you whether the runs were produced by the same infrastructure. If they were, differences in results are due to pipeline changes. If they weren't, you'll need to account for harness improvements too.
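If you'd rather enforce that rule mechanically than remember it, a small guard can compare the harness versions of two runs before reporting deltas. This is a sketch of one possible addition to compare_runs.py, not something the script above already does:

```python
def same_harness_version(run_a: list[dict], run_b: list[dict]) -> bool:
    """Return True when both runs were produced by identical harness infrastructure."""
    versions_a = {e.get("harness_version") for e in run_a}
    versions_b = {e.get("harness_version") for e in run_b}
    if versions_a != versions_b:
        # Differences in results may reflect harness changes, not pipeline changes.
        print(f"WARNING: harness versions differ: {versions_a} vs {versions_b}")
        return False
    return True
```

Calling it at the top of the comparison keeps an apples-to-oranges run pair from being read as a pipeline improvement.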
Exercises
- Run python harness/run_harness.py against your full benchmark set. Verify that the output file contains all expected fields from the schema, including the new estimated_cost and duration_seconds fields.
- Grade the run using harness/grade_baseline.py, then compare it to your Module 2 baseline using harness/compare_runs.py. Which categories improved the most from adding retrieval?
- Add a --model flag to run_harness.py, then run the harness twice with different models (e.g., --model gpt-4o-mini and --model gpt-4o). Compare costs and accuracy. Is the more expensive model worth it?
- Run the harness with --limit 5 and inspect the traces in Langfuse. Verify that each trace has the correct run ID and that you can filter by run ID to see all five traces together.
- Add a --dry-run flag to run_harness.py that prints what would happen without making any API calls. This is useful for verifying configuration before spending money.
Completion checkpoint
You have:
- A unified harness runner (harness/run_harness.py) that produces traced, costed, gradable run logs with one command
- A run comparison script (harness/compare_runs.py) that shows accuracy changes at the category and failure-label level
- At least one complete harness run against your full benchmark
- A comparison between your Module 2 baseline and your current system showing where retrieval improved results
- An understanding of the four-component harness structure (benchmark, runner, telemetry, graders) and how the next two lessons will complete it
Reflection prompts
- How does having a one-command harness change the way you think about making changes to the pipeline? Would you be more or less likely to experiment?
- Looking at your run comparison, are the improvements from retrieval distributed evenly across categories, or concentrated in specific ones? What does that tell you about where to focus next?
- The harness currently uses rough token estimates for cost. What would you need to capture exact costs, and is the precision worth the implementation effort?
Connecting to the project
The harness is now a single-command experiment framework. You can run the benchmark, trace every question, estimate costs, grade results, and compare runs. What's missing is automated grading. Right now, you still grade by hand, which is slow and doesn't scale.
The next two lessons add automated graders for the four eval families: retrieval evals, answer evals, tool-use evals, and trace evals. These graders will plug directly into the harness, so by the end of the module you'll be able to run a fully automated experiment pass: harness run, auto-grade, summarize, and compare without any manual intervention.
What's next
Retrieval and Answer Evals. The harness makes runs repeatable; the next lesson turns grading into something repeatable too, starting with evidence quality and answer quality.
References
Start here
- Hamel Husain: Your AI product needs evals — the strongest practical argument for building a harness before iterating on your system
Build with this
- JSONL specification — the format underlying all run logs; simple, appendable, and diffable
- Langfuse: Sessions — how to group traces by session (run ID) for viewing all traces from a single harness run
Deep dive
- Anthropic: Building effective agents — discusses the eval-first development loop that the harness enables