Designing Good Benchmark Questions
In the previous lesson, you wrote your first 10 benchmark questions. Some of them are probably good. Some probably aren't, and right now you might not be able to tell which is which. That's perfectly normal: writing diagnostic questions is a skill in its own right, and it takes practice. In this lesson, we'll learn to improve those questions.
By the end, you'll have a better idea of what makes a question useful for measuring your system, what makes a question misleading, and how to grade the answers consistently.
What you'll learn
- Distinguish diagnostic questions from vague or untestable ones
- Apply a grading rubric (fully correct / partially correct / unsupported / wrong) consistently
- Identify and fix common benchmark design mistakes
- Write gold answers that are specific enough to grade against
- Expand your benchmark set to 30+ questions with confidence
Concepts
Diagnostic question: a benchmark question where the answer clearly reveals whether the system has a specific capability. "Tell me about the codebase" is not diagnostic. Any answer could be partially right. "What function handles JWT token verification, and what happens when the token is expired?" is diagnostic. The answer is either correct and specific, or it isn't.
Grading rubric: a consistent set of criteria for evaluating answers. Without a rubric, grading is subjective and varies between sessions. The rubric we'll use has four levels:
| Grade | Meaning |
|---|---|
| Fully correct | The answer addresses the question accurately, is supported by the code, and doesn't contain fabricated details |
| Partially correct | The answer is on the right track but misses key details, includes minor inaccuracies, or is incomplete |
| Unsupported | The answer sounds plausible but can't be verified from the codebase. It may be hallucinated |
| Wrong | The answer is factually incorrect based on what the code actually does |
Failure label: a tag that explains why a wrong or unsupported answer failed. Common labels: missing_evidence (the system lacked the information it needed; use this for the baseline before retrieval exists), retrieval_miss (the retrieval system didn't find the right code), wrong_chunk (it found related but wrong code), hallucination (it fabricated details), reasoning_error (it found the right code but drew the wrong conclusion), scope_confusion (it answered about the wrong part of the codebase).
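Putting the rubric and failure labels together, a single graded answer might be logged like this. The dict layout and field names are illustrative, not a required schema:

```python
# One graded benchmark record, sketched as a plain dict.
VALID_GRADES = {"fully_correct", "partially_correct", "unsupported", "wrong"}
VALID_LABELS = {
    "missing_evidence", "retrieval_miss", "wrong_chunk",
    "hallucination", "reasoning_error", "scope_confusion",
}

record = {
    "id": "q07",
    "grade": "unsupported",            # one of the four rubric levels
    "failure_label": "hallucination",  # why the answer failed
    "notes": "Cited a helper function that does not exist in the repo.",
}

# Sanity checks you'd run before saving the record
assert record["grade"] in VALID_GRADES
assert record["failure_label"] in VALID_LABELS
```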
Walkthrough
What makes a question diagnostic
A diagnostic question has three characteristics:
- It has a verifiable answer. You can check the answer against the actual code. "Is this a well-designed system?" is not verifiable. "Does the `Router` class inherit from `APIRouter`?" is verifiable.
- It tests one capability at a time. "Find the auth middleware and explain how it interacts with the rate limiter and suggest improvements" tests three things at once. If the system gets it wrong, you can't tell which capability failed. Split it into three questions.
- The difficulty is intentional. You should know whether a question is easy (symbol lookup) or hard (change impact) when you write it. If you don't know how hard it is, you can't interpret the results.
10 worked examples
Here are 10 example benchmark questions for a fictional CRUD-style FastAPI application. This is not the real FastAPI framework repo. It's a simplified template app we're using to illustrate the patterns. When you write your own questions, translate the structure (category, difficulty, what it tests) to your actual anchor repo rather than copying these filenames.
Symbol lookup (easy)
What does the get_current_user function in auth/dependencies.py do?
Gold answer: It's a FastAPI dependency that extracts a JWT token from the Authorization header, decodes it, looks up the user in the database, and returns the User model. If the token is missing or invalid, it raises an HTTPException with status 401.
Design note: Good diagnostic question. One function, one file, verifiable answer. Tests whether the system can find and describe a specific symbol.
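To make the gold answer concrete, here is a dependency-free sketch of the behavior it describes. The fictional app would use FastAPI's `Depends` and a real JWT library; `decode_token`, the `USERS` store, and the `jwt-<username>` token format are stand-ins for illustration:

```python
USERS = {"alice": {"username": "alice", "active": True}}

def decode_token(token):
    # Stand-in for real JWT decoding: "jwt-<username>" counts as valid.
    if token.startswith("jwt-"):
        return token.removeprefix("jwt-")
    return None

class HTTPException(Exception):
    # Minimal stand-in for fastapi.HTTPException
    def __init__(self, status_code, detail):
        super().__init__(detail)
        self.status_code = status_code

def get_current_user(headers):
    # Extract the token from the Authorization header
    auth = headers.get("Authorization", "")
    if not auth.startswith("Bearer "):
        raise HTTPException(401, "Missing or malformed Authorization header")
    # Decode it and look up the user
    username = decode_token(auth.removeprefix("Bearer "))
    if username is None or username not in USERS:
        raise HTTPException(401, "Invalid token")
    return USERS[username]
```

A system answer that matches this shape (extract, decode, look up, 401 on failure) would grade fully correct against the gold answer.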
Symbol lookup (medium)
What environment variables does the application require to start?
Gold answer: DATABASE_URL (PostgreSQL connection string), SECRET_KEY (JWT signing key, required, no default), REDIS_URL (optional, defaults to localhost:6379).
Design note: Tests retrieval across multiple files. Env vars are typically scattered across config modules, not in one place.
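One common place such requirements converge is a settings loader. A minimal stdlib-only sketch, using the variable names from the gold answer (the loader itself is hypothetical):

```python
import os

def load_settings(env=os.environ):
    # DATABASE_URL and SECRET_KEY are required with no defaults;
    # REDIS_URL falls back to localhost, mirroring the gold answer.
    missing = [k for k in ("DATABASE_URL", "SECRET_KEY") if k not in env]
    if missing:
        raise RuntimeError(f"Missing required env vars: {missing}")
    return {
        "database_url": env["DATABASE_URL"],
        "secret_key": env["SECRET_KEY"],
        "redis_url": env.get("REDIS_URL", "redis://localhost:6379"),
    }
```

In practice the required names may also appear in docker-compose files, CI configs, and README setup steps, which is why this question exercises multi-file retrieval.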
Architecture explanation
How does a request flow from the HTTP endpoint to the database for the POST /items route?
Gold answer: The request hits the FastAPI router in routers/items.py, passes through the get_current_user dependency for auth, validates the request body with the ItemCreate Pydantic model, calls crud.items.create_item() which uses SQLAlchemy to insert a row, and returns the created item as an ItemResponse model.
Design note: Tests multi-file, multi-layer understanding. The gold answer traces a specific path through the code.
Architecture explanation
What is the relationship between models.py, schemas.py, and crud.py in this project?
Gold answer: models.py defines SQLAlchemy ORM models (database tables). schemas.py defines Pydantic models for API request/response validation. crud.py contains database operations that accept Pydantic schemas and return ORM models. The separation keeps database concerns out of the API layer.
Design note: Tests structural understanding, not just symbol lookup.
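The separation the gold answer describes can be sketched in a few lines with plain dataclasses. The field names are illustrative, not taken from a real repo:

```python
from dataclasses import dataclass

# models.py equivalent: the database shape (ORM model stand-in)
@dataclass
class ItemModel:
    id: int
    name: str
    owner_id: int  # internal field the API layer never exposes

# schemas.py equivalent: the API request/response shapes
@dataclass
class ItemCreate:
    name: str

@dataclass
class ItemResponse:
    id: int
    name: str

# crud.py equivalent: accepts a schema, returns a model
_DB = []

def create_item(data: ItemCreate, owner_id: int) -> ItemModel:
    item = ItemModel(id=len(_DB) + 1, name=data.name, owner_id=owner_id)
    _DB.append(item)
    return item
```

Note how `owner_id` lives only in the model: the boundary keeps internal database fields out of API responses, which is exactly the relationship the question probes.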
Change impact
What would break if I removed the verify_token middleware from main.py?
Gold answer: All authenticated endpoints would stop requiring authentication. Any route that depends on get_current_user would raise an error because the token extraction dependency would no longer find a valid token in the request. The /health and /docs endpoints would be unaffected since they don't use auth dependencies.
Design note: This is harder. It requires reasoning about dependencies and side effects, not just reading code.
Change impact
If I changed the Item model to add a category field, what other files would I need to update?
Gold answer: You'd need to update schemas.py (add category to ItemCreate and ItemResponse), create a migration (alembic revision --autogenerate), and optionally update crud.py if there are category-specific queries. Tests that reference items would also need updating.
Design note: Tests multi-file impact reasoning.
Debugging
Users report getting 422 errors when creating items. The request body looks correct. What's likely wrong?
Gold answer: A 422 from FastAPI means Pydantic validation failed. Check whether the ItemCreate schema has required fields the client isn't sending, whether field types mismatch (e.g., sending a string for an integer field), or whether a recent schema change added required fields without updating the client.
Design note: Tests diagnostic reasoning. The gold answer explains the mechanism, not just the fix.
Debugging
The /search endpoint returns empty results even though there are matching items in the database. Where would you look first?
Gold answer: Check the search query in crud/search.py. Likely a SQL LIKE pattern issue (missing % wildcards), a case sensitivity mismatch, or a filter that's too restrictive. Also check whether the search index is up to date if the project uses full-text search.
Design note: Open-ended enough to test reasoning, specific enough to grade.
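The first failure mode in the gold answer, a LIKE pattern with missing `%` wildcards, is easy to reproduce with an in-memory SQLite table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT)")
conn.execute("INSERT INTO items VALUES ('Blue Widget'), ('Red Widget')")

term = "widget"
# Buggy pattern: no wildcards, so only exact matches are found -> empty
buggy = conn.execute(
    "SELECT name FROM items WHERE name LIKE ?", (term,)
).fetchall()
# Fixed pattern: wrap the term in % wildcards (SQLite LIKE is
# case-insensitive for ASCII by default, which hides the casing issue here)
fixed = conn.execute(
    "SELECT name FROM items WHERE name LIKE ?", (f"%{term}%",)
).fetchall()
```

A fully correct answer names the mechanism (pattern, casing, or over-restrictive filter), not just "the query is wrong".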
Onboarding
I'm new to this project. Where should I start reading to understand the main API structure?
Gold answer: Start with main.py to see how the app is assembled and which routers are included. Then look at routers/ to see the available endpoints. Read models.py and schemas.py to understand the data shapes. The README.md has setup instructions and the docs/ directory has architecture notes.
Design note: Tests the system's ability to give useful orientation, not just list files.
Onboarding
How do I run the tests for this project?
Gold answer: Run pytest from the project root. Tests are in the tests/ directory. You need a test database configured. Check conftest.py for the test database URL and fixtures. Run pytest -v for verbose output or pytest tests/test_items.py for a specific module.
Design note: Practical onboarding question. The answer should be actionable, not theoretical.
A bad question, rewritten
Bad: "Is the code well-organized?"
This question is not diagnostic. Any answer could be argued as partially correct. It doesn't test a specific capability and can't be graded consistently.
Better: "What design pattern does the project use to separate database models from API schemas, and where are the boundaries?"
This is verifiable (the pattern either exists or it doesn't), tests architectural understanding, and has a concrete answer you can grade.
Building your grading rubric
For each benchmark question, grade the system's answer using this four-level rubric:
| Grade | Criteria | Action |
|---|---|---|
| Fully correct | All key facts match the gold answer. No fabricated details. Evidence is from the right files. | Record as pass. |
| Partially correct | Core idea is right but missing details, or includes minor inaccuracies alongside correct content. | Record which parts are correct and which are missing. Apply a failure label to the missing parts. |
| Unsupported | The answer sounds plausible but references code, functions, or patterns that don't exist in the repo. | Record as unsupported. Apply hallucination or wrong_chunk failure label. |
| Wrong | The answer contradicts what the code actually does. | Record as wrong. Apply the most specific failure label. |
When you grade, always check against the actual code. Don't grade from memory. The gold answer is your reference, but the code is the ground truth.
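If you log grades programmatically, a small guard can enforce the rubric's rules, such as requiring a failure label on every non-passing grade. A minimal sketch with illustrative names:

```python
GRADES = {"fully_correct", "partially_correct", "unsupported", "wrong"}

def grade_to_record(grade, failure_label=None):
    """Turn a rubric grade into a log entry, enforcing the rubric's rules."""
    if grade not in GRADES:
        raise ValueError(f"Unknown grade: {grade}")
    if grade == "fully_correct":
        return {"result": "pass"}
    # Every partially correct, unsupported, or wrong answer needs a label
    if failure_label is None:
        raise ValueError("Non-passing grades require a failure label")
    return {"result": grade, "failure_label": failure_label}
```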
Exercises
- Review your 10 questions from the previous lesson. For each one, ask: is it diagnostic? Does it test one capability? Can I verify the answer against the code? Rewrite any that fail these tests.
- Expand your benchmark set to 30 questions. Use the five categories as a guide. Aim for at least 5 per category. Use the 10 worked examples above as templates.
- Write gold answers for at least 8 of your 30 questions. Keep them concise (2-5 sentences) but specific enough to grade against.
- Take 3 of your benchmark questions and try answering them by manually prompting a model (using the quickstart from Module 1). Grade the answers using the four-level rubric. Apply failure labels to any non-fully-correct answers.
- Save your benchmark set as a structured file:
```bash
# Create in your anchor repo directory
touch benchmark-questions.jsonl
```

Each line is one JSON object:

```jsonl
{"id": "q01", "category": "symbol_lookup", "question": "What does get_current_user do?", "gold_answer": "It extracts a JWT...", "difficulty": "easy"}
{"id": "q02", "category": "architecture", "question": "How does a request flow from...", "gold_answer": "The request hits...", "difficulty": "medium"}
```

Completion checkpoint
You should now have:
- 30+ benchmark questions covering all five categories
- Gold answers for at least 8 questions
- At least 3 questions hand-graded using the four-level rubric with failure labels
- A `benchmark-questions.jsonl` file with your structured benchmark set
- Confidence that your questions are diagnostic, verifiable, and categorized by difficulty
- "Golden" by Huntrix queued in your playlist
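Before you rely on the benchmark file, a few lines of validation catch malformed entries early. A minimal loader sketch, assuming the field names used in the exercise above:

```python
import json

REQUIRED_FIELDS = {"id", "category", "question", "gold_answer", "difficulty"}

def load_benchmark(path):
    """Load and validate a benchmark-questions.jsonl file."""
    questions = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            q = json.loads(line)
            missing = REQUIRED_FIELDS - q.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
            questions.append(q)
    return questions
```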
What's next
Run Logs and Baseline. Once the benchmark is solid, the next lesson gives you the schema and process for recording runs, grading them, and establishing the number you will spend the rest of the path improving.
References
Start here
- Anthropic: Building effective agents — evaluation-first development with practical patterns
Build with this
- OpenAI: Evaluation getting started — structured evaluation concepts and setup
- Braintrust: Evals guide — practical eval suite design for AI applications
Deep dive
- LMSYS: Chatbot Arena — large-scale model evaluation; useful for understanding how evaluation methodology affects conclusions