Run Logs and Your First Baseline
You should now have a benchmark set with 30+ questions and gold answers. In this lesson, we'll define the structured format for recording experiments, run your first end-to-end baseline, and grade it. By the end, you'll have a concrete number: your baseline accuracy. That's the number every future improvement gets measured against.
This is the second component of your AI harness. The benchmark set (from the previous lessons) defines what to test. The run log defines how to record what happened. Together they give you reproducible, comparable experiments.
What you'll learn
- Design a run-log schema that captures inputs, outputs, tool traces, and grading results
- Run a complete benchmark pass against your anchor repository using manual prompting
- Grade each answer using the four-level rubric and apply failure labels
- Calculate your baseline accuracy and identify the most common failure modes
- Save your first run log in structured JSONL format
Concepts
Run-log schema: the structured format for recording a single benchmark run. Each entry captures: the question asked, the system's response, any tools called or evidence retrieved, the grade assigned, the failure label (if applicable), and metadata like timestamps and model version. A well-designed schema makes runs comparable. You can diff two logs and see exactly what changed.
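The "diff two logs" claim can be made concrete with a small sketch. Assuming each entry carries the `question_id` and `grade` fields from the schema, a hypothetical helper (`diff_runs` is our own name, not part of any library) can report exactly which questions changed grade between two runs:

```python
# Compare two graded run logs entry by entry.
# Assumes each entry is a dict with "question_id" and "grade" keys,
# as in the run-log schema defined in this lesson.

def diff_runs(old_entries, new_entries):
    """Return (question_id, old_grade, new_grade) for every question whose grade changed."""
    old_by_id = {e["question_id"]: e for e in old_entries}
    changes = []
    for entry in new_entries:
        prev = old_by_id.get(entry["question_id"])
        if prev and prev["grade"] != entry["grade"]:
            changes.append((entry["question_id"], prev["grade"], entry["grade"]))
    return changes
```

Because every run uses the same shape, this kind of comparison is a few lines of code rather than a forensic exercise.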
Baseline: your first graded benchmark run. It doesn't matter how bad the baseline is. What matters is that you have a number. "40% fully correct, 30% partially correct, 30% wrong" is infinitely more useful than "the system seems okay." Every improvement you make in later modules gets measured against this baseline.
Failure distribution: the pattern of how your system fails, not just how often. If 80% of your failures are retrieval_miss (the system couldn't find the right code), you know retrieval is the bottleneck. If failures are evenly split between retrieval and reasoning, the fix is different. The failure distribution tells you where to invest effort.
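In code, a failure distribution is just a tally of labels. A toy sketch (the labels here are made up for illustration, matching the 80% retrieval_miss scenario above):

```python
from collections import Counter

# Toy failure labels from a hypothetical graded run.
labels = ["retrieval_miss"] * 8 + ["reasoning_error"] * 2
dist = Counter(labels)

# The most common label points at the bottleneck: retrieval, in this case.
print(dist.most_common(1))  # -> [('retrieval_miss', 8)]
```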
Walkthrough
Define your run-log schema
Create a file that defines the shape of each run-log entry. We'll use JSONL (one JSON object per line) because it's easy to append to, easy to parse, and easy to diff.
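Those three properties of JSONL are easy to demonstrate. A minimal sketch (using a throwaway temp file, not your real log) showing that appending and parsing are each one line per entry:

```python
import json
import tempfile

# Append two entries, one JSON object per line (that's all JSONL is).
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    path = f.name
    for entry in [{"question_id": "q1", "grade": None},
                  {"question_id": "q2", "grade": "wrong"}]:
        f.write(json.dumps(entry) + "\n")

# Parse it back: each non-blank line is an independent JSON object.
with open(path) as f:
    entries = [json.loads(line) for line in f if line.strip()]

print(entries[1]["grade"])  # -> wrong
```

Opening the file in append mode (`"a"`) means later runs can add entries without rewriting anything, and line-oriented diff tools work on it directly.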
We'll work inside your anchor repository from now on. If you haven't already, set up a Python environment there and install the provider SDK you chose in Module 1:
cd anchor-repo # or wherever you cloned your anchor repository
python -m venv .venv && source .venv/bin/activate

Install your provider SDK and set your API key:
pip install openai
export OPENAI_API_KEY="sk-..."

Then create the harness directory:
mkdir -p harness

Here's the schema we'll use. Each line in the log file will be one of these objects:
# harness/schema.py
"""Run-log schema for benchmark experiments.
Each entry in a .jsonl run log follows this structure.
"""
SCHEMA_DESCRIPTION = {
# --- Identity ---
"run_id": "Unique identifier for this run (e.g., 'baseline-2026-03-24')",
"question_id": "Matches the 'id' field in benchmark-questions.jsonl",
# --- Input ---
"question": "The benchmark question text",
"category": "symbol_lookup | architecture | change_impact | debugging | onboarding",
# --- System response ---
"answer": "The full text of the system's response",
"model": "Model used (e.g., 'gpt-4o-mini', 'claude-sonnet-4-6')",
"provider": "Provider used (e.g., 'openai', 'gemini', 'anthropic', 'github-models', 'huggingface', 'ollama-local', 'ollama-cloud')",
# --- Evidence and tools ---
"evidence_files": "List of file paths the system cited or retrieved",
"tools_called": "List of tool names invoked (empty for manual prompting)",
"retrieval_method": "How evidence was found (e.g., 'manual', 'vector', 'bm25', 'none')",
# --- Grading ---
"grade": "fully_correct | partially_correct | unsupported | wrong",
"failure_label": "missing_evidence | retrieval_miss | wrong_chunk | hallucination | reasoning_error | scope_confusion | null",
"grading_notes": "Brief explanation of why this grade was assigned",
# --- Metadata ---
"repo_sha": "Git SHA of the anchor repo at time of run",
"timestamp": "ISO 8601 timestamp",
"harness_version": "Version of your harness (start with 'v0.1')",
}

This schema is intentionally simple. You don't need a database. A JSONL file is enough. The important thing is that every run uses the same shape so you can compare them.
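If you want shape drift to surface before grading, a tiny validator helps. This is an optional addition of ours, not a required lesson file, and the module name `harness/validate_log.py` is hypothetical:

```python
# harness/validate_log.py (optional helper)
# Checks that a run-log entry carries exactly the schema's keys,
# so malformed entries surface before you start grading.

EXPECTED_KEYS = {
    "run_id", "question_id", "question", "category", "answer", "model",
    "provider", "evidence_files", "tools_called", "retrieval_method",
    "grade", "failure_label", "grading_notes", "repo_sha", "timestamp",
    "harness_version",
}

def validate_entry(entry):
    """Return a list of problems for one entry (an empty list means valid)."""
    problems = []
    missing = EXPECTED_KEYS - entry.keys()
    extra = entry.keys() - EXPECTED_KEYS
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if extra:
        problems.append(f"unexpected keys: {sorted(extra)}")
    return problems
```

Run it over a log with a short loop before grading; a single bad entry is much cheaper to fix now than after you've computed metrics from it.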
Create the baseline runner
For your first baseline, we'll keep it simple: manually prompt a model with each benchmark question and record the results. Pick your provider and paste the complete script into harness/run_baseline.py:
# harness/run_baseline.py
"""Run a manual baseline against your benchmark questions."""
import json
import os
from datetime import datetime, timezone
from openai import OpenAI
client = OpenAI()
MODEL = "gpt-4o-mini"
PROVIDER = "openai"
RUN_ID = "baseline-" + datetime.now(timezone.utc).strftime("%Y-%m-%d-%H%M%S")
BENCHMARK_FILE = "benchmark-questions.jsonl"
OUTPUT_FILE = f"harness/runs/{RUN_ID}.jsonl"
REPO_SHA = os.popen("git rev-parse --short HEAD").read().strip()
SYSTEM_PROMPT = (
"You are a code assistant for a software project. "
"Answer questions about the codebase based on your knowledge. "
"If you're not sure, say so."
)
questions = []
with open(BENCHMARK_FILE) as f:
for line in f:
if line.strip():
questions.append(json.loads(line))
print(f"Loaded {len(questions)} benchmark questions")
print(f"Run ID: {RUN_ID} | Repo SHA: {REPO_SHA} | Model: {MODEL}\n")
os.makedirs("harness/runs", exist_ok=True)
results = []
for i, q in enumerate(questions):
print(f"[{i+1}/{len(questions)}] {q['category']}: {q['question'][:60]}...")
response = client.chat.completions.create(
model=MODEL,
messages=[
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": q["question"]},
],
temperature=0,
)
answer = response.choices[0].message.content
results.append({
"run_id": RUN_ID,
"question_id": q["id"],
"question": q["question"],
"category": q["category"],
"answer": answer,
"model": MODEL,
"provider": PROVIDER,
"evidence_files": [],
"tools_called": [],
"retrieval_method": "none",
"grade": None,
"failure_label": None,
"grading_notes": "",
"repo_sha": REPO_SHA,
"timestamp": datetime.now(timezone.utc).isoformat(),
"harness_version": "v0.1",
})
with open(OUTPUT_FILE, "w") as f:
for entry in results:
f.write(json.dumps(entry) + "\n")
print(f"\nDone. {len(results)} results saved to {OUTPUT_FILE}")
print("Next step: open the file and grade each answer by hand.")

Run it:
python harness/run_baseline.py

This will take a minute or two depending on how many questions you have. The model is answering without any retrieval or tools. It's working from its training data only. Expect most answers to be wrong or unsupported for repo-specific questions. That's the point.
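One practical caveat: a 30-question run makes 30 sequential API calls, and a single rate-limit or network error would kill it partway through. A generic retry wrapper (our own sketch, not part of the openai SDK) keeps the run alive:

```python
import time

def with_retries(fn, attempts=3, backoff_seconds=2.0):
    """Call fn(); on exception, wait and retry, doubling the delay each attempt."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; let the caller see the error
            time.sleep(backoff_seconds * (2 ** attempt))
```

In run_baseline.py you could then wrap the completion call as `response = with_retries(lambda: client.chat.completions.create(...))` without changing anything else.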
Grade your baseline by hand
Open the output file and grade each entry. This is the most important exercise in this module. You're building your grading instincts.
# harness/grade_baseline.py
"""Interactive grading tool for baseline run logs."""
import json
import sys
GRADES = ["fully_correct", "partially_correct", "unsupported", "wrong"]
FAILURE_LABELS = [
"missing_evidence", "retrieval_miss", "wrong_chunk",
"hallucination", "reasoning_error", "scope_confusion", "none",
]
if len(sys.argv) < 2:
print("Usage: python harness/grade_baseline.py <run-file.jsonl>")
print("Example: python harness/grade_baseline.py harness/runs/baseline-2026-03-24-143022.jsonl")
sys.exit(1)
run_file = sys.argv[1]
# Load entries
entries = []
with open(run_file) as f:
for line in f:
if line.strip():
entries.append(json.loads(line))
print(f"Grading {len(entries)} entries from {run_file}\n")
for i, entry in enumerate(entries):
if entry["grade"] is not None:
print(f"[{i+1}] Already graded: {entry['grade']}")
continue
print(f"\n{'='*60}")
print(f"[{i+1}/{len(entries)}] {entry['category']}: {entry['question']}")
print(f"{'='*60}")
print(f"\nAnswer:\n{entry['answer']}\n")
# Grade
print(f"Grades: {', '.join(f'{j}={g}' for j, g in enumerate(GRADES))}")
grade_idx = input("Grade (0-3): ").strip()
if grade_idx.isdigit() and 0 <= int(grade_idx) < len(GRADES):
entry["grade"] = GRADES[int(grade_idx)]
else:
print("Skipping...")
continue
# Failure label (only if not fully correct)
if entry["grade"] != "fully_correct":
print(f"Labels: {', '.join(f'{j}={l}' for j, l in enumerate(FAILURE_LABELS))}")
label_idx = input(f"Failure label (0-{len(FAILURE_LABELS)-1}): ").strip()
        if label_idx.isdigit() and 0 <= int(label_idx) < len(FAILURE_LABELS):
            label = FAILURE_LABELS[int(label_idx)]
            # Store a real null for "none" so summaries don't count it as a failure label
            entry["failure_label"] = None if label == "none" else label
# Notes
entry["grading_notes"] = input("Brief note (or Enter to skip): ").strip()
# Save graded version
output = run_file.replace(".jsonl", "-graded.jsonl")
with open(output, "w") as f:
for entry in entries:
f.write(json.dumps(entry) + "\n")
print(f"\nGraded results saved to {output}")

# Use the exact filename from your baseline run
python harness/grade_baseline.py harness/runs/baseline-2026-03-24-143022.jsonl

For each question, you'll compare the model's answer against the actual code in your anchor repository, assign a grade, and (for non-correct answers) label the failure type.
Calculate your baseline metrics
After grading, calculate your baseline numbers:
# harness/summarize_run.py
"""Summarize a graded run log."""
import json
import sys
from collections import Counter
run_file = sys.argv[1]
entries = []
with open(run_file) as f:
for line in f:
if line.strip():
entries.append(json.loads(line))
graded = [e for e in entries if e["grade"] is not None]
total = len(graded)
if total == 0:
print("No graded entries found.")
sys.exit(1)
# Grade distribution
grade_counts = Counter(e["grade"] for e in graded)
print(f"Run: {graded[0]['run_id']}")
print(f"Model: {graded[0]['model']}")
print(f"Total graded: {total}\n")
print("Grade distribution:")
for grade in ["fully_correct", "partially_correct", "unsupported", "wrong"]:
count = grade_counts.get(grade, 0)
pct = count / total * 100
print(f" {grade:20s}: {count:3d} ({pct:.0f}%)")
# Failure label distribution (non-correct only)
failures = [e for e in graded if e["grade"] != "fully_correct"]
if failures:
label_counts = Counter(e["failure_label"] for e in failures if e["failure_label"])
print(f"\nFailure labels ({len(failures)} non-correct answers):")
for label, count in label_counts.most_common():
print(f" {label:20s}: {count:3d}")
# Per-category breakdown
print("\nPer-category accuracy:")
categories = sorted(set(e["category"] for e in graded))
for cat in categories:
cat_entries = [e for e in graded if e["category"] == cat]
correct = sum(1 for e in cat_entries if e["grade"] == "fully_correct")
    print(f" {cat:20s}: {correct}/{len(cat_entries)} fully correct")

# Use the exact graded filename from the previous step
python harness/summarize_run.py harness/runs/baseline-2026-03-24-143022-graded.jsonl

Expected output (your numbers will differ):
Run: baseline-2026-03-24-143022
Model: gpt-4o-mini
Total graded: 30
Grade distribution:
fully_correct : 3 (10%)
partially_correct : 8 (27%)
unsupported : 11 (37%)
wrong : 8 (27%)
Failure labels (27 non-correct answers):
hallucination : 12
missing_evidence : 8
reasoning_error : 5
scope_confusion : 2
Per-category accuracy:
architecture : 0/6 fully correct
change_impact : 0/6 fully correct
debugging : 1/6 fully correct
onboarding : 1/6 fully correct
symbol_lookup : 1/6 fully correct
This baseline is a model answering from training data alone: no retrieval, no tools, no context about your specific codebase. Low accuracy is exactly what we'd expect at this stage. Look at the failure labels rather than the overall score. If missing_evidence and hallucination dominate, that tells you the system's main gap is access to your codebase, not reasoning ability. We'll build retrieval in Modules 3-5 and see how these numbers change.
You'll notice missing_evidence as a label here. Once we add retrieval, that label will split into more specific categories like retrieval_miss (the system searched but didn't find the right code) and wrong_chunk (it found related but wrong code). For now, missing_evidence captures the situation honestly: we haven't given the system any way to look at the code yet.
Exercises
- Run the baseline script against your 30+ benchmark questions and save the results.
- Grade at least 15 answers by hand using the four-level rubric and failure labels.
- Run the summary script and record your baseline numbers.
- Write a one-paragraph "baseline memo" answering: What's the overall accuracy? Which category is strongest/weakest? What's the most common failure mode? What would help most?
Completion checkpoint
You have:
- A run-log schema defined in
harness/schema.py - At least one complete baseline run saved as JSONL in
harness/runs/ - At least 15 entries graded with the four-level rubric and failure labels
- Baseline metrics calculated (overall accuracy, per-category breakdown, failure distribution)
- A one-paragraph baseline memo identifying the biggest opportunity for improvement
Reflection prompts
- What types of questions did the model handle best with no retrieval? Why?
- What's the most common failure label? What does that tell you about what the system needs next?
- Did any answers surprise you, either better or worse than expected?
Connecting to the project
This is the last lesson before we start building the actual code assistant. Everything from here forward (tool calling in Module 3, retrieval in Module 4, RAG in Module 5, evals in Module 6) will be measured against the baseline you just created.
Your harness/ directory is the beginning of your AI harness. Right now it has a benchmark set, a run log, and a summary script. In Module 5 we'll add telemetry, and in Module 6 we'll add automated grading. By then, you'll be able to compare any two system versions with one command.
What's next
Building a Raw Tool Loop. With a measured baseline in place, you can start building without losing the plot; the next lesson gives the model direct ways to inspect the repo instead of guessing.
References
Start here
- Anthropic: Building effective agents — evaluation-first development, practical harness patterns
Build with this
- JSONL specification — the format we use for run logs; simple, appendable, diffable
- OpenAI: Evaluation getting started — structured evaluation concepts and setup
Deep dive
- Hamel Husain: Your AI product needs evals — strong argument for evaluation discipline with practical examples