Build a Raw Tool-Calling Loop
Up to now, the model has been answering questions from training data alone, and your baseline showed exactly how limited that is. In this lesson, we'll give the model tools it can use to actually read your codebase: list files, search text, and read file contents. The model will decide which tools to call, in what order, and when it has enough information to answer.
We're building this from scratch. No framework, no library. Just a Python loop that sends a message, checks whether the model wants to call a tool, executes it, sends the result back, and repeats. Understanding these raw mechanics is important because every agent framework is an abstraction over this same loop. If you understand the loop, you can debug any framework.
What you'll learn
- Build a tool-calling control loop that lets a model interact with your codebase
- Define tool schemas that tell the model what tools are available and what arguments they accept
- Implement three core tools: `list_files`, `search_text`, and `read_file`
- Apply the tool argument validation patterns from Security Basics
- Run benchmark questions through your agent and compare results against your Module 2 baseline
Concepts
Agent: a system where a model decides which tools to call, observes the results, and iterates until a task is complete. The minimal definition: agent = model + tools + control loop. Everything else (state management, error handling, orchestration) builds on top of this.
Control loop: the code that manages the agent's cycle: send prompt → check for tool calls → execute tools → append results → repeat or finish. You own this loop. The model makes requests; your code decides whether and how to fulfill them.
Tool schema: a JSON description of a tool that tells the model what it does, what arguments it accepts, and what types those arguments are. The model uses this schema to decide when to call the tool and what arguments to provide. A clear schema reduces bad tool calls; a vague schema produces garbage arguments.
Tool result: the output your code returns to the model after executing a tool call. The model treats this as new context and decides what to do next: call another tool, call the same tool with different arguments, or produce a final answer.
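Stripped of provider details, these four concepts compose into one small skeleton. Here is a sketch with placeholder `call_model` and `execute_tool` functions (not real API calls) just to show the control flow you'll implement below:

```python
def agent_loop(messages, call_model, execute_tool, max_rounds=10):
    """Minimal agent = model + tools + control loop (illustrative sketch)."""
    for _ in range(max_rounds):
        reply = call_model(messages)             # send conversation + tool schemas
        tool_calls = reply.get("tool_calls", [])
        if not tool_calls:                       # no tool request: this is the answer
            return reply["content"]
        messages.append(reply)
        for call in tool_calls:                  # your code fulfills each request
            messages.append({"role": "tool", "content": execute_tool(call)})
    return "Reached maximum tool rounds"         # hard stop so we never loop forever
```

Note that the model only ever *requests* tool calls; `execute_tool` is your code, and the hard stop is yours too.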
Problem-to-Tool Map
| Problem class | Symptom | Cheapest thing to try first | Tool or approach |
|---|---|---|---|
| Model can't inspect your repo | Answers are generic or hallucinated because the model has no access to your code | Manual copy-paste of code into the prompt | Give the model tools to list, search, and read files |
| Tool calls have bad arguments | Model requests files that don't exist or passes wrong argument types | Fix the tool schema descriptions | Add argument validation with allowlists |
| Agent loops forever | Model keeps calling tools without converging on an answer | Set a maximum iteration count | Add a hard stop after N tool rounds |
Walkthrough
Project setup
We'll work in your anchor repository. Make sure your environment from Module 2 is active:
cd anchor-repo
source .venv/bin/activate

Create an `agent/` directory for this module's code:

mkdir -p agent

By the end of this lesson, you'll have:
anchor-repo/
├── agent/
│ ├── tools.py # Tool implementations
│ ├── schemas.py # Tool schema definitions
│ ├── loop.py # The raw control loop
│ └── run_benchmark.py # Benchmark runner using the agent
├── harness/ # From Module 2
│ ├── runs/
│ └── ...
└── benchmark-questions.jsonl
Define your tools
Create three tools that let the model explore your codebase. Each tool is a plain Python function with input validation.
# agent/tools.py
"""Tools that let the model interact with the anchor repository."""
import os
import subprocess
from pathlib import Path
# All tool operations are restricted to this directory
REPO_ROOT = Path(".").resolve()
# Directories to exclude from search and listing — .venv, .git, __pycache__, etc.
EXCLUDED_DIRS = {".venv", ".git", "__pycache__", "node_modules", ".tox", ".mypy_cache"}
def _is_excluded(path: Path) -> bool:
"""Check whether a path falls inside an excluded directory.
Args:
path: Path relative to the repository root.
Returns:
bool: True when the path should be hidden from tool access.
"""
return any(part in EXCLUDED_DIRS for part in path.parts)
def validate_path(path_str: str) -> Path:
"""Resolve and validate a repo-relative path for safe tool use.
Args:
path_str: Path provided by the model, relative to ``REPO_ROOT``.
Returns:
Path: A fully resolved path that stays inside the repository.
Raises:
ValueError: If the path escapes the repository or enters an excluded directory.
"""
requested = (REPO_ROOT / path_str).resolve()
try:
requested.relative_to(REPO_ROOT)
except ValueError:
raise ValueError(f"Path '{path_str}' is outside the repository")
if _is_excluded(requested.relative_to(REPO_ROOT)):
raise ValueError(f"Path '{path_str}' is in an excluded directory")
return requested
def list_files(glob_pattern: str = "**/*") -> str:
"""List repository files that match a glob pattern.
Args:
glob_pattern: Glob pattern to evaluate against files under ``REPO_ROOT``.
Returns:
str: A newline-delimited listing of matching files or a short status message.
"""
matches = sorted(
str(p.relative_to(REPO_ROOT))
for p in REPO_ROOT.glob(glob_pattern)
if p.is_file() and not _is_excluded(p.relative_to(REPO_ROOT))
)
if not matches:
return f"No files matching '{glob_pattern}'"
# Limit output to avoid flooding the context
if len(matches) > 50:
return "\n".join(matches[:50]) + f"\n... and {len(matches) - 50} more files"
return "\n".join(matches)
def search_text(query: str, glob_pattern: str | None = None) -> str:
"""Search repository files for a text pattern.
Args:
query: Text pattern to pass to ``grep``.
glob_pattern: Optional file glob to narrow the search scope.
Returns:
str: Matching lines with file and line context, or a status message.
"""
exclude_args = []
for d in EXCLUDED_DIRS:
exclude_args.extend(["--exclude-dir", d])
    include = f"--include={glob_pattern}" if glob_pattern else "--include=*.py"
    # "--" stops grep option parsing, so a query starting with "-" can't inject flags
    cmd = ["grep", "-rn", include] + exclude_args + ["--", query, "."]
try:
result = subprocess.run(cmd, capture_output=True, text=True, timeout=10, cwd=REPO_ROOT)
lines = result.stdout.strip().split("\n")
if not lines or lines == [""]:
return f"No matches for '{query}'"
if len(lines) > 30:
return "\n".join(lines[:30]) + f"\n... and {len(lines) - 30} more matches"
return "\n".join(lines)
except subprocess.TimeoutExpired:
return "Search timed out"
def read_file(path: str, start_line: int | None = None, end_line: int | None = None) -> str:
"""Read a repository file, optionally trimming to a line range.
Args:
path: Repo-relative file path to read.
start_line: Optional 1-based line number to start from.
end_line: Optional inclusive line number to stop at.
Returns:
str: File contents or a short error/status message safe for the model to inspect.
"""
file_path = validate_path(path)
if not file_path.exists():
return f"File not found: {path}"
if not file_path.is_file():
return f"Not a file: {path}"
text = file_path.read_text()
lines = text.split("\n")
if start_line is not None or end_line is not None:
start = max(0, (start_line or 1) - 1)
end = end_line or len(lines)
lines = lines[start:end]
# Limit output size
if len(lines) > 200:
return "\n".join(lines[:200]) + f"\n... truncated ({len(lines)} total lines)"
return "\n".join(lines)
# Registry for dispatching tool calls
TOOL_FUNCTIONS = {
"list_files": list_files,
"search_text": search_text,
"read_file": read_file,
}

Notice the security patterns from Module 1: `validate_path` uses `relative_to` to prevent path traversal, outputs are size-limited to avoid flooding the model's context, and there's a timeout on the subprocess call.
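If you want to convince yourself the traversal guard works before wiring anything to a model, you can exercise it in isolation. This sketch uses a temporary directory as a stand-in repo root so it runs anywhere; the file names are made up:

```python
import tempfile
from pathlib import Path

REPO_ROOT = Path(tempfile.mkdtemp()).resolve()   # stand-in for the real repo root
EXCLUDED_DIRS = {".venv", ".git", "__pycache__"}

def validate_path(path_str: str) -> Path:
    requested = (REPO_ROOT / path_str).resolve()
    requested.relative_to(REPO_ROOT)             # raises ValueError if outside the repo
    if any(part in EXCLUDED_DIRS for part in requested.relative_to(REPO_ROOT).parts):
        raise ValueError(f"Path '{path_str}' is in an excluded directory")
    return requested

# Inside the repo: resolves fine (the path doesn't need to exist to be validated)
assert validate_path("src/main.py").name == "main.py"
# A traversal attempt escapes REPO_ROOT after resolve(), and an excluded
# directory is caught by the parts check; both raise ValueError
for bad in ("../outside.txt", ".venv/lib/secrets.py"):
    try:
        validate_path(bad)
        raise AssertionError(f"{bad} should have been rejected")
    except ValueError:
        pass
```

The key detail is calling `resolve()` *before* checking containment, so `..` segments and symlinks can't smuggle a path outside the root.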
Define tool schemas
Create the JSON schemas that tell the model what tools are available:
# agent/schemas.py
"""Tool schemas for the model API."""
TOOL_SCHEMAS = [
{
"type": "function",
"function": {
"name": "list_files",
"description": "List files in the repository matching a glob pattern. Use this to explore the repo structure before reading specific files.",
"parameters": {
"type": "object",
"properties": {
"glob_pattern": {
"type": "string",
"description": "Glob pattern to match files, e.g. '**/*.py', 'src/**/*.ts', 'tests/*'. Defaults to all files.",
}
},
"required": [],
},
},
},
{
"type": "function",
"function": {
"name": "search_text",
"description": "Search for a text pattern in repository files. Returns matching lines with file path and line number. Use this to find where something is defined or used.",
"parameters": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "Text to search for (passed to grep)",
},
"glob_pattern": {
"type": "string",
"description": "Optional file pattern to restrict search, e.g. '*.py', '*.md'",
},
},
"required": ["query"],
},
},
},
{
"type": "function",
"function": {
"name": "read_file",
"description": "Read the contents of a file in the repository. Optionally specify a line range to read a specific section.",
"parameters": {
"type": "object",
"properties": {
"path": {
"type": "string",
"description": "File path relative to the repository root, e.g. 'src/main.py'",
},
"start_line": {
"type": "integer",
"description": "First line to read (1-based). Omit to start from the beginning.",
},
"end_line": {
"type": "integer",
"description": "Last line to read. Omit to read to the end.",
},
},
"required": ["path"],
},
},
},
]

Good tool descriptions matter. The model uses them to decide when to call a tool and what arguments to provide. "List files in the repository" is more useful than "List files" because it tells the model what the tool operates on.
Build the control loop
This is the core of the agent. It's a loop that:
- Sends the conversation to the model with tool schemas
- Checks if the model wants to call a tool
- If yes: executes the tool, appends the result, and loops back to step 1
- If no: returns the model's final answer
Pick your provider for the complete agent/loop.py:
# agent/loop.py
"""Raw tool-calling control loop."""
import json
import sys
from openai import OpenAI
from agent.schemas import TOOL_SCHEMAS
from agent.tools import TOOL_FUNCTIONS
client = OpenAI()
SYSTEM_PROMPT = """You are a code assistant for this repository. Answer questions by using the available tools to explore the codebase.
Rules:
- Use list_files to understand the repo structure before diving into specific files.
- Use search_text to find where things are defined or used.
- Use read_file to examine specific code.
- Base your answers on what you find in the code, not on assumptions.
- If you can't find enough evidence, say so rather than guessing.
- When you have enough information, provide your answer with file references."""
MAX_TOOL_ROUNDS = 10
def run_agent(question: str, model: str = "gpt-4o-mini", verbose: bool = True) -> dict:
"""Run one question through the raw tool-calling control loop.
Args:
question: User question the agent should answer about the repository.
model: Provider-specific model identifier to call.
verbose: Whether to print each tool invocation while the loop runs.
Returns:
dict: Final answer text plus tool-call metadata and loop outcome details.
"""
messages = [
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": question},
]
tool_calls_made = []
for round_num in range(MAX_TOOL_ROUNDS):
response = client.chat.completions.create(
model=model,
messages=messages,
tools=TOOL_SCHEMAS,
temperature=0,
)
msg = response.choices[0].message
if not msg.tool_calls:
if verbose:
print(f" [{round_num + 1} rounds, {len(tool_calls_made)} tool calls]")
return {
"answer": msg.content,
"tool_calls": tool_calls_made,
"rounds": round_num + 1,
"finish_reason": response.choices[0].finish_reason,
}
messages.append(msg)
for call in msg.tool_calls:
fn_name = call.function.name
try:
args = json.loads(call.function.arguments)
except json.JSONDecodeError:
args = {}
if verbose:
print(f" Tool: {fn_name}({', '.join(f'{k}={v!r}' for k, v in args.items())})")
if fn_name not in TOOL_FUNCTIONS:
result = f"Rejected: '{fn_name}' is not a registered tool."
else:
try:
result = TOOL_FUNCTIONS[fn_name](**args)
except (ValueError, TypeError) as e:
result = f"Validation error: {e}"
except Exception as e:
result = f"Error: {e}"
tool_calls_made.append({
"tool": fn_name,
"args": args,
"result_preview": result[:200] if len(result) > 200 else result,
})
messages.append({
"role": "tool",
"tool_call_id": call.id,
"content": result,
})
return {
"answer": "Reached maximum tool rounds without a final answer.",
"tool_calls": tool_calls_made,
"rounds": MAX_TOOL_ROUNDS,
"finish_reason": "max_rounds",
}
if __name__ == "__main__":
question = sys.argv[1] if len(sys.argv) > 1 else "What are the main modules in this repository?"
print(f"Question: {question}\n")
result = run_agent(question)
    print(f"\nAnswer:\n{result['answer']}")

Run it:
python -m agent.loop "What are the main modules in this repository?"

You should see the model calling `list_files` to explore the structure, possibly `search_text` to find specific patterns, and then producing an answer based on what it found. Watch the tool calls. This is the agent reasoning in real time.
Note that the Anthropic version includes its own tool schemas inline because Anthropic uses a different schema format (input_schema instead of parameters, no type: "function" wrapper). The OpenAI, Hugging Face, and Ollama versions all share agent/schemas.py.
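For comparison, here is one of the schemas in Anthropic's shape, plus a small converter. The converter is a convenience sketch, not part of the lesson's required code:

```python
# Anthropic tools are flat objects: no {"type": "function"} wrapper, and the
# JSON Schema lives under "input_schema" instead of "parameters".
READ_FILE_ANTHROPIC = {
    "name": "read_file",
    "description": "Read the contents of a file in the repository. "
                   "Optionally specify a line range to read a specific section.",
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "File path relative to the repository root"},
            "start_line": {"type": "integer", "description": "First line to read (1-based)"},
            "end_line": {"type": "integer", "description": "Last line to read"},
        },
        "required": ["path"],
    },
}

def to_anthropic(openai_tool: dict) -> dict:
    """Convert one OpenAI-format tool schema to Anthropic's format."""
    fn = openai_tool["function"]
    return {
        "name": fn["name"],
        "description": fn["description"],
        "input_schema": fn["parameters"],   # same JSON Schema, different key
    }
```

Because the inner JSON Schema is identical, you could keep `agent/schemas.py` as the single source of truth and map it through `to_anthropic` for the Anthropic client.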
Run your benchmark through the agent
Now the real test. Run your benchmark questions through the agent and compare against your Module 2 baseline:
# agent/run_benchmark.py
"""Run benchmark questions through the tool-calling agent."""
import json
import os
from datetime import datetime, timezone
from agent.loop import run_agent
RUN_ID = "agent-v1-" + datetime.now(timezone.utc).strftime("%Y-%m-%d-%H%M%S")
MODEL = "gpt-4o-mini"
PROVIDER = "openai"
BENCHMARK_FILE = "benchmark-questions.jsonl"
OUTPUT_FILE = f"harness/runs/{RUN_ID}.jsonl"
REPO_SHA = os.popen("git rev-parse --short HEAD").read().strip()
questions = []
with open(BENCHMARK_FILE) as f:
for line in f:
if line.strip():
questions.append(json.loads(line))
print(f"Running {len(questions)} benchmark questions")
print(f"Run ID: {RUN_ID}")
print(f"Model: {MODEL}\n")
results = []
for i, q in enumerate(questions):
print(f"[{i+1}/{len(questions)}] {q['category']}: {q['question'][:60]}...")
result = run_agent(q["question"], model=MODEL, verbose=True)
entry = {
"run_id": RUN_ID,
"question_id": q["id"],
"question": q["question"],
"category": q["category"],
"answer": result["answer"],
"model": MODEL,
"provider": PROVIDER,
"evidence_files": list(set(
tc["args"].get("path", "") for tc in result["tool_calls"]
if tc["tool"] == "read_file"
)),
"tools_called": [tc["tool"] for tc in result["tool_calls"]],
"retrieval_method": "tool_calling",
"grade": None,
"failure_label": None,
"grading_notes": "",
"repo_sha": REPO_SHA,
"timestamp": datetime.now(timezone.utc).isoformat(),
"harness_version": "v0.2",
}
results.append(entry)
print()
os.makedirs("harness/runs", exist_ok=True)
with open(OUTPUT_FILE, "w") as f:
for entry in results:
f.write(json.dumps(entry) + "\n")
print(f"Done. {len(results)} results saved to {OUTPUT_FILE}")
print("Next: grade these answers and compare against your baseline.")

Run it:

python -m agent.run_benchmark

After grading (using the same grade_baseline.py from Module 2), compare the numbers:
python harness/summarize_run.py harness/runs/agent-v1-*-graded.jsonl

This is the first real comparison your harness enables. How much did tool access improve over the training-data-only baseline? Which categories improved most? Which failure labels shifted from missing_evidence to something more specific?
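If you'd rather diff the two runs as numbers than eyeball two summaries, a small helper works. This is a sketch: it assumes each graded entry carries the `category` and `grade` fields your harness writes, and that a passing grade is recorded as `"correct"` (adjust `passing_grade` if your rubric uses different labels). The file paths in the usage comment are illustrative:

```python
import json
from collections import defaultdict

def pass_rate_by_category(path: str, passing_grade: str = "correct") -> dict:
    """Per-category pass rate from a graded JSONL run file."""
    tally = defaultdict(lambda: [0, 0])          # category -> [passes, total]
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            entry = json.loads(line)
            tally[entry["category"]][1] += 1
            if entry["grade"] == passing_grade:
                tally[entry["category"]][0] += 1
    return {cat: passed / total for cat, (passed, total) in tally.items()}

# Usage sketch: diff the agent run against the Module 2 baseline
# baseline = pass_rate_by_category("harness/runs/baseline-graded.jsonl")
# agent = pass_rate_by_category("harness/runs/agent-v1-graded.jsonl")
# for cat in agent:
#     print(f"{cat}: {baseline.get(cat, 0):.0%} -> {agent[cat]:.0%}")
```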
Exercises
- Build the three tools in `agent/tools.py`. Test each one independently before wiring them into the loop.
- Build the control loop in `agent/loop.py`. Run it against 3-5 questions manually and observe the tool call sequences.
- Run your full benchmark through the agent using `run_benchmark.py`.
- Grade at least 15 answers and compare against your Module 2 baseline. Which categories improved? What's the new failure distribution?
- Identify one question where the agent called the right tools but still got the wrong answer. What went wrong? Was it a tool output problem, a reasoning problem, or a context problem?
Reflection prompts
- Which errors came from the model's reasoning (it had the right evidence but drew the wrong conclusion)?
- Which came from the tool interface (bad arguments, missing tools, truncated output)?
- Which came from oversized tool outputs flooding the context?
- Which came from missing retrieval (the tools didn't surface the right code)?
Completion checkpoint
You have:
- Three working tools (`list_files`, `search_text`, `read_file`) with input validation
- A working control loop that runs to completion without infinite looping
- A benchmark run graded and compared against your Module 2 baseline
- An understanding of where tool access helps and where it's still insufficient
Connecting to the project
This raw loop is the mechanical foundation for everything in this module. In the next lesson, you'll rebuild it using a framework and see what the framework gives you (state management, easier composition) and what it hides (the loop mechanics you now understand).
The tools you built here will also evolve. In Module 4 they'll be joined by retrieval tools, and in the next two lessons they'll become portable MCP capabilities that any client can use.
What's next
Rebuilding with a Framework. Build it once by hand first; then the next lesson makes the framework tradeoffs legible because you know what the abstraction is carrying for you.
References
Start here
- OpenAI: Function calling guide — the primary reference for tool calling with the OpenAI API
Build with this
- Anthropic: Tool use guide — Anthropic's equivalent for tool calling; compare the schema format
- OpenAI: Building agents track — a structured walkthrough of agent patterns
Deep dive
- Anthropic: Building effective agents — excellent engineering guide on agent design patterns and when tools aren't enough
- Ollama: Tool calling — Ollama's tool-calling implementation for local and cloud models