Designing Specialists and Routers
The previous lesson established when to split and built the orchestration skeleton. Now we'll fill in the specialists. A good specialist is narrow, measurable, and independently testable. A bad specialist is a vague prompt with broad tool access, essentially the single agent again with extra plumbing.
This lesson walks through the first specialist split, shows how to design prompts and tool access for each role, and, critically, benchmarks the specialist system against the single-agent baseline to verify that the split actually helps.
What you'll learn
- Design specialist agents with narrow scope, constrained tool access, and clear input/output contracts
- Build a router that classifies queries and activates the right specialist with appropriate context
- Add a human approval point for side-effecting actions like file writes or external API calls
- Test specialists in isolation before composing them through the orchestrator
- Benchmark the specialist split against the single-agent system and interpret the results
Concepts
Specialist agent — an agent designed for a narrow task with a focused system prompt, limited tool access, and a defined output format. The key property of a good specialist is that it can be tested in isolation: you can pass it an input, get an output, and grade that output without running the full orchestration pipeline. This makes specialists easier to debug and improve than a single agent that handles everything.
Prompt isolation — the practice of giving each specialist its own system prompt rather than sharing one large prompt. Prompt isolation prevents the problem we identified in the previous lesson: conflicting instructions that try to serve multiple tasks simultaneously. The code explainer's prompt says "be precise about line numbers and function signatures." The docs specialist's prompt says "synthesize high-level architecture decisions." These instructions would conflict in a single prompt.
Tool scoping — restricting each specialist's tool access to only the tools it needs. The code explainer gets search_code and read_file. The docs specialist gets search_docs. The debug specialist gets search_code, read_file, and run_tests. Tool scoping reduces unnecessary tool calls (a problem we measured with the tool grader) and limits the blast radius when a specialist misbehaves.
Human approval gate — a checkpoint where the system pauses and asks for human confirmation before executing a side-effecting action. We add approval gates for actions that modify the repository, call external APIs, or could affect production systems. The approval gate is a node in the orchestration graph — the system routes to it, pauses, and resumes when the human confirms or rejects the action.
Specialist benchmark — an eval that tests a specialist in isolation, outside the orchestration pipeline. Specialist benchmarks use a subset of your existing benchmark questions filtered to the specialist's domain. The code specialist runs against code questions only, the docs specialist against docs questions only. This lets you improve each specialist independently before measuring the composed system.
Problem-to-Tool Map
| Problem class | Symptom | Cheapest thing to try first | Tool or approach |
|---|---|---|---|
| Specialist prompt too broad | Specialist behaves like the old single agent | Narrow the system prompt | Remove irrelevant instructions, constrain to one task type |
| Wrong specialist activated | Router sends questions to the wrong specialist | Check routing accuracy | Improve classification prompt or add few-shot examples |
| Specialist calls unnecessary tools | Tool grader shows tools outside the specialist's scope | Review tool access | Remove tools the specialist doesn't need |
| Side-effecting actions run unsupervised | System writes files or calls APIs without confirmation | Add manual check before deploy | Human approval gate in the graph |
| Can't tell if split helped | No clear quality difference between single-agent and specialist | Run the benchmark on both | Side-by-side comparison on the same eval suite |
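The "check routing accuracy" fix from the table can be automated with a small script. A minimal sketch, assuming each logged run row carries the benchmark's expected_route label and a hypothetical actual_route field recording what the router chose (both field names are assumptions, not part of the lesson's harness):

```python
import json
from collections import Counter

def routing_accuracy(run_file: str) -> dict:
    """Compare the router's chosen route against the expected_route label
    on each benchmark question, tallying misroutes by (expected, actual) pair."""
    total, correct = 0, 0
    misroutes = Counter()
    with open(run_file) as f:
        for line in f:
            if not line.strip():
                continue
            row = json.loads(line)
            expected = row.get("expected_route")
            actual = row.get("actual_route")
            if expected is None or actual is None:
                continue  # skip rows without routing labels
            total += 1
            if expected == actual:
                correct += 1
            else:
                misroutes[(expected, actual)] += 1
    return {
        "accuracy": correct / total if total else 0.0,
        "misroutes": dict(misroutes),  # which category pairs the router confuses
    }
```

The misroute pairs are the actionable part: a cluster of (docs, code) confusions tells you which few-shot examples to add to the classification prompt, or which categories to merge.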
Walkthrough
The first specialist split
Based on the routing categories from the previous lesson, here's the first specialist split that works well for a code assistant:
Orchestrator / Router — classifies the incoming question, selects the specialist, passes context, and synthesizes the final response. The orchestrator doesn't answer questions itself. Its job is to make good routing decisions and combine specialist outputs.
Retriever / Evidence assembler — handles questions that need evidence from the codebase or documentation. This specialist runs retrieval, ranks evidence, and returns a structured evidence package. It doesn't generate the final answer — that happens in synthesis.
Code explainer — handles questions about specific code: what a function does, how a class works, why a pattern was chosen. Gets search_code and read_file tools with a prompt tuned for precise, citation-heavy explanations.
Test / Debug specialist — handles questions about failures, test output, and debugging. Gets search_code, read_file, and run_tests tools with a prompt tuned for diagnostic reasoning: identify the failure, locate the relevant code, explain the root cause.
These four specialists aren't a universal recipe. They're a starting point based on the question types that showed the most variation in our eval results. Your split might look different depending on where your single-agent system struggles.
Designing specialist prompts
Each specialist gets its own system prompt. The prompt should be short, specific, and impossible to confuse with another specialist's job:
# orchestration/specialists.py
"""Specialist agent implementations.
Each specialist has:
- A focused system prompt
- Scoped tool access
- A defined output format
"""
# --- System prompts ---
CODE_EXPLAINER_PROMPT = """You are a code explanation specialist for a repository assistant.
Your job: explain specific code — functions, classes, patterns, and implementation details.
Rules:
- Always cite file paths and line numbers when referencing code.
- Quote the relevant code snippet before explaining it.
- If you need to see more context, use the read_file tool.
- If you can't find the code the user is asking about, say so explicitly.
- Do NOT answer questions about documentation, architecture, or debugging.
Those go to other specialists.
Output format:
- Start with a one-sentence summary of what the code does.
- Then provide the detailed explanation with code references.
- End with any caveats or edge cases you noticed."""
DOCS_SPECIALIST_PROMPT = """You are a documentation specialist for a repository assistant.
Your job: answer questions about documentation, architecture, design decisions,
and high-level system behavior.
Rules:
- Cite specific documentation files when answering.
- If the answer requires code-level detail, say so — the code specialist
will handle it.
- Synthesize information across multiple docs when the question spans topics.
- Do NOT answer questions about specific function implementations or debugging.
Output format:
- Start with a direct answer to the question.
- Then provide supporting evidence from the documentation.
- Note any gaps where documentation is missing or unclear."""
DEBUG_SPECIALIST_PROMPT = """You are a debugging specialist for a repository assistant.
Your job: diagnose failures, explain test output, and help debug issues.
Rules:
- Start with the error message or failure symptom.
- Locate the relevant code using search and file reading.
- Provide a root cause analysis with evidence.
- If you can suggest a fix, do so — but flag any fix that modifies files
as requiring human approval.
- Do NOT answer general code questions or documentation questions.
Output format:
- Error/symptom summary
- Relevant code location(s)
- Root cause analysis
- Suggested fix (if applicable, flagged for approval if side-effecting)"""
EVIDENCE_ASSEMBLER_PROMPT = """You are an evidence assembler for a repository assistant.
Your job: retrieve and organize evidence from the codebase and documentation
to support answering a question. You do NOT generate the final answer.
Rules:
- Use search tools to find relevant code and documentation.
- Rank evidence by relevance to the question.
- Return structured evidence with file paths, snippets, and relevance notes.
- If retrieval finds nothing relevant, say so explicitly.
Output format:
Return a JSON object with:
- "evidence": list of {file, snippet, relevance} objects
- "confidence": float 0-1 indicating how well the evidence covers the question
- "gaps": list of aspects the evidence doesn't cover"""Notice the pattern: each prompt defines the specialist's scope ("your job"), its boundaries ("do NOT answer..."), and its output format. This structure makes specialists independently testable and prevents them from drifting into each other's territory.
Scoping tool access
Each specialist should only have access to the tools it needs:
# orchestration/specialists.py (continued)
# --- Tool definitions ---
CODE_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search_code",
            "description": "Search the codebase for functions, classes, or patterns",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"},
                    "file_pattern": {"type": "string", "description": "Optional glob pattern to filter files"},
                },
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read a specific file or range of lines",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "File path"},
                    "start_line": {"type": "integer", "description": "Optional start line"},
                    "end_line": {"type": "integer", "description": "Optional end line"},
                },
                "required": ["path"],
            },
        },
    },
]

DOCS_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search_docs",
            "description": "Search project documentation",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"},
                },
                "required": ["query"],
            },
        },
    },
]
DEBUG_TOOLS = CODE_TOOLS + [
    {
        "type": "function",
        "function": {
            "name": "run_tests",
            "description": "Run a specific test file or test function",
            "parameters": {
                "type": "object",
                "properties": {
                    "test_path": {"type": "string", "description": "Test file or function path"},
                    "verbose": {"type": "boolean", "description": "Show detailed output"},
                },
                "required": ["test_path"],
            },
        },
    },
]

The code explainer gets search_code and read_file. The docs specialist gets only search_docs. The debug specialist gets the code tools plus run_tests. If a specialist somehow calls a tool outside its scope, the tool grader will catch it, and that failure is a routing error, not a tool-use error.
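That scope check can also run mechanically after each benchmark run. A minimal sketch that compares a specialist's recorded tool calls against its allowed tool list, reading the same tool-definition shape used above (the helper names are my own):

```python
def allowed_tool_names(tools: list[dict]) -> set[str]:
    """Extract the function names a specialist is allowed to call."""
    return {t["function"]["name"] for t in tools}

def out_of_scope_calls(tools_called: list[str], tools: list[dict]) -> list[str]:
    """Return any recorded tool calls that fall outside the specialist's scope.
    A non-empty result usually points at a routing error: the wrong question
    reached this specialist, so it reached for a tool it doesn't have."""
    allowed = allowed_tool_names(tools)
    return [name for name in tools_called if name not in allowed]
```

Wiring this into the tool grader turns "the debug specialist keeps wanting search_docs" from an anecdote into a counted failure mode.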
Calling specialists from the graph
Each specialist call follows the same pattern: build messages with the specialist prompt, call the API with scoped tools, handle tool calls, and return the result:
# orchestration/specialists.py (continued)
from openai import OpenAI
client = OpenAI()
MODEL = "gpt-4.1-mini"
def call_specialist(
    question: str,
    system_prompt: str,
    tools: list[dict],
    tool_executor: callable,
    thread_context: list[dict] | None = None,
    long_term_context: str = "",
) -> dict:
    """Call a specialist with its scoped prompt, tools, and memory context."""
    effective_prompt = system_prompt
    if long_term_context:
        effective_prompt += f"\n\n{long_term_context}"

    messages = [{"role": "system", "content": effective_prompt}]
    if thread_context:
        messages.extend(thread_context)
    messages.append({"role": "user", "content": question})

    tools_called = []
    max_turns = 5
    for _ in range(max_turns):
        # Only pass the tools parameter when this specialist actually has tools.
        kwargs = {"model": MODEL, "messages": messages}
        if tools:
            kwargs["tools"] = tools
        response = client.chat.completions.create(**kwargs)
        choice = response.choices[0]
        if choice.finish_reason == "tool_calls":
            messages.append(choice.message)
            for tool_call in choice.message.tool_calls:
                tools_called.append(tool_call.function.name)
                result = tool_executor(
                    tool_call.function.name,
                    tool_call.function.arguments,
                )
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": str(result),
                })
        else:
            return {
                "specialist_output": choice.message.content,
                "tools_called": tools_called,
                "model": MODEL,
            }
    # Turn budget exhausted: the last message is a tool-result dict.
    return {
        "specialist_output": messages[-1].get("content", "Max turns reached"),
        "tools_called": tools_called,
        "model": MODEL,
    }


def code_specialist(state: dict) -> dict:
    """Code explainer specialist node."""
    from tools.executor import execute_tool  # Your existing tool executor

    result = call_specialist(
        question=state["question"],
        system_prompt=CODE_EXPLAINER_PROMPT,
        tools=CODE_TOOLS,
        tool_executor=execute_tool,
        # Memory wiring: pass thread history and long-term context from state.
        # These fields are populated by the memory lessons (thread-and-workflow-memory,
        # long-term-memory). Until you build those layers, these will be empty.
        thread_context=state.get("thread_messages"),
        long_term_context=state.get("long_term_context", ""),
    )
    return {**state, **result}
def docs_specialist(state: dict) -> dict:
    """Documentation specialist node."""
    from tools.executor import execute_tool

    result = call_specialist(
        question=state["question"],
        system_prompt=DOCS_SPECIALIST_PROMPT,
        tools=DOCS_TOOLS,
        tool_executor=execute_tool,
        thread_context=state.get("thread_messages"),
        long_term_context=state.get("long_term_context", ""),
    )
    return {**state, **result}


def debug_specialist(state: dict) -> dict:
    """Debug specialist node."""
    from tools.executor import execute_tool

    result = call_specialist(
        question=state["question"],
        system_prompt=DEBUG_SPECIALIST_PROMPT,
        tools=DEBUG_TOOLS,
        tool_executor=execute_tool,
        thread_context=state.get("thread_messages"),
        long_term_context=state.get("long_term_context", ""),
    )
    return {**state, **result}


def general_specialist(state: dict) -> dict:
    """General-purpose fallback specialist."""
    from tools.executor import execute_tool

    result = call_specialist(
        question=state["question"],
        system_prompt="You are a general-purpose code repository assistant. Answer the question using available tools.",
        tools=CODE_TOOLS + DOCS_TOOLS,
        tool_executor=execute_tool,
        thread_context=state.get("thread_messages"),
        long_term_context=state.get("long_term_context", ""),
    )
    return {**state, **result}

Adding a human approval gate
Some actions have side effects, such as writing files, running commands that modify state, or calling external APIs. For these, we'll add an approval point in the orchestration graph. The system pauses, presents the proposed action to a human, and only proceeds with explicit confirmation:
# orchestration/approval.py
"""Human approval gate for side-effecting actions.
Pauses the orchestration pipeline and waits for human confirmation
before executing actions that modify the repository or external state.
"""
SIDE_EFFECTING_TOOLS = {"write_file", "run_command", "create_pr", "deploy"}
def needs_approval(state: dict) -> bool:
    """Check whether any proposed action requires human approval."""
    proposed_tools = set(state.get("proposed_tools", []))
    return bool(proposed_tools & SIDE_EFFECTING_TOOLS)


def request_approval(state: dict) -> dict:
    """Present the proposed action and wait for human confirmation.

    In a production system, this would send a notification (Slack, email,
    UI prompt) and wait for a response. For development, we use stdin.
    """
    proposed = state.get("proposed_action", "Unknown action")
    tools = state.get("proposed_tools", [])

    print("\n" + "=" * 50)
    print("APPROVAL REQUIRED")
    print("=" * 50)
    print(f"Action: {proposed}")
    print(f"Tools: {', '.join(tools)}")
    print(f"Reason: {state.get('specialist_output', 'No explanation provided')[:200]}")
    print()

    response = input("Approve this action? [y/N]: ").strip().lower()
    approved = response in ("y", "yes")
    return {
        **state,
        "approved": approved,
        "approval_response": "approved" if approved else "rejected",
    }

To wire this into the graph, add a conditional edge after the debug specialist (since debugging is the most likely path to suggest side-effecting actions):
# In orchestration/graph.py, extend the graph:
def check_approval_needed(state: dict) -> str:
    """Route to the approval gate if side-effecting tools are proposed."""
    if needs_approval(state):
        return "approval"
    return "synthesize"

# Register the approval node, then replace the direct debug -> synthesize edge:
# graph.add_edge("debug", "synthesize")  # Remove this
graph.add_node("approval", request_approval)
graph.add_conditional_edges(
    "debug",
    check_approval_needed,
    {"approval": "approval", "synthesize": "synthesize"},
)
graph.add_edge("approval", "synthesize")

The approval gate is a safety mechanism, not a bottleneck. It only activates for side-effecting actions, and the system continues normally for read-only questions.
Teaching simplification vs. production implementation. The input() gate above demonstrates the concept: human gates before side-effecting actions. In production, LangGraph's interrupt() function with a checkpointer and thread_id provides durable pause/resume: the graph serializes its state, the process can shut down, and execution resumes from the same point when the human responds (via a UI callback, Slack action, etc.). The portable idea is that side-effecting actions require explicit human approval. The input() version works for local development and testing; interrupt() with a checkpointer is what you'll use when deploying.
Benchmarking the specialist split
Now the critical step: measure whether specialists actually improve the system.
# 1. Run the single-agent baseline (if you haven't recently)
python harness/run_harness.py --pipeline single-agent
python harness/graders/answer_grader.py harness/runs/single-agent-latest.jsonl
python harness/graders/trace_labeler.py harness/runs/single-agent-latest-graded.jsonl
# 2. Run the specialist system
python harness/run_harness.py --pipeline orchestrated
python harness/graders/answer_grader.py harness/runs/orchestrated-latest.jsonl
python harness/graders/trace_labeler.py harness/runs/orchestrated-latest-graded.jsonl
# 3. Compare
python harness/compare_runs.py \
harness/runs/single-agent-latest-graded-traced.jsonl \
harness/runs/orchestrated-latest-graded-traced.jsonl

Look at the comparison across several dimensions:
- Per-route accuracy. Do code questions score higher with the code specialist than with the single agent? What about docs and debug questions?
- Tool precision. Does tool scoping reduce unnecessary tool calls?
- Waste rate. Does routing reduce correct_but_wasteful traces?
- Cost. What's the overhead of routing + specialist calls vs. the single agent?
- Latency. Does the routing step add noticeable delay?
If the specialist system doesn't beat the single-agent baseline on at least one dimension without degrading the others, simplify. Remove specialists that don't help. Merge categories that the router can't reliably distinguish. The eval results will tell you what to do.
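The keep/remove/merge decision is easier with per-route numbers side by side. A minimal sketch of how a compare_runs.py might compute them, assuming each graded row carries a route label and a boolean correct field (both field names are assumptions about your harness output):

```python
import json
from collections import defaultdict

def per_route_accuracy(path: str) -> dict[str, float]:
    """Accuracy broken down by route, from a graded JSONL run file."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            row = json.loads(line)
            route = row.get("route", "unknown")
            totals[route] += 1
            correct[route] += int(bool(row.get("correct")))
    return {r: correct[r] / totals[r] for r in totals}

def compare_per_route(baseline_path: str, orchestrated_path: str) -> dict[str, float]:
    """Accuracy delta per route: positive means the specialist system is better."""
    base = per_route_accuracy(baseline_path)
    orch = per_route_accuracy(orchestrated_path)
    return {r: orch.get(r, 0.0) - base.get(r, 0.0) for r in set(base) | set(orch)}
```

A route with a near-zero or negative delta is a candidate for removal or merging; a route with a large positive delta is where the split earned its complexity.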
Testing specialists in isolation
Before tuning the composed system, test each specialist on its own domain:
# harness/test_specialist.py
"""Test a single specialist against its domain-specific benchmark subset."""
import json
import sys
from orchestration.specialists import call_specialist, CODE_EXPLAINER_PROMPT, CODE_TOOLS
from tools.executor import execute_tool
def test_specialist(
    specialist_prompt: str,
    specialist_tools: list,
    benchmark_file: str,
    route_filter: str,
):
    """Run benchmark questions for a single specialist."""
    questions = []
    with open(benchmark_file) as f:
        for line in f:
            if line.strip():
                q = json.loads(line)
                if q.get("expected_route") == route_filter:
                    questions.append(q)

    print(f"Testing specialist ({route_filter}): {len(questions)} questions\n")
    for q in questions:
        result = call_specialist(
            question=q["question"],
            system_prompt=specialist_prompt,
            tools=specialist_tools,
            tool_executor=execute_tool,
        )
        print(f"  Q: {q['question'][:60]}")
        print(f"  Tools: {result['tools_called']}")
        print(f"  Answer: {result['specialist_output'][:100]}...")
        print()


if __name__ == "__main__":
    test_specialist(
        specialist_prompt=CODE_EXPLAINER_PROMPT,
        specialist_tools=CODE_TOOLS,
        benchmark_file="benchmark-questions.jsonl",
        route_filter="code",
    )

Isolated testing catches problems early. If the code specialist scores poorly on code questions in isolation, the problem is the specialist's prompt or tool access, not the router, orchestrator, or synthesis step. That narrows our debugging considerably.
Exercises
- Implement all four specialist prompts from this lesson. Test each one in isolation against its domain subset of your benchmark. Record the per-specialist accuracy.
- Add tool scoping to each specialist so they can only access their designated tools. Run the tool grader on the orchestrated system and compare unnecessary tool calls vs. the single-agent baseline.
- Add the human approval gate for side-effecting actions. Test it with a debug question that suggests modifying a file. Verify that the system pauses for approval and handles both approval and rejection correctly.
- Run the full benchmark through both the single-agent and orchestrated pipelines. Build a comparison table showing accuracy, tool precision, waste rate, cost, and latency for each. Does the specialist split help?
- Based on your comparison results, remove any specialist that doesn't improve its domain subset. Rerun the benchmark with the simplified system and verify that quality holds.
Completion checkpoint
You have:
- Four specialist agents with isolated prompts and scoped tool access
- Each specialist tested in isolation against its domain-specific benchmark subset
- A human approval gate that activates for side-effecting actions
- A side-by-side benchmark comparison of single-agent vs. specialist system
- Evidence-based decisions about which specialists to keep, remove, or merge
Reflection prompts
- Which specialist showed the biggest improvement over the single-agent baseline? What about the specialist's design made the difference?
- Did any specialist perform worse than the single agent on its domain? What would you change?
- The human approval gate adds friction. In a production system, how would you decide which actions require approval and which can proceed automatically?
What's next
Agent-to-Agent Interop. You have coordination inside one system now; the next lesson looks at what changes when agents have to talk across system boundaries.
References
Start here
- Anthropic: Building effective agents — specialist design patterns and the orchestrator-worker architecture
Build with this
- LangGraph: Multi-agent patterns — LangGraph examples for routing to specialist agents and composing their outputs
- OpenAI Agents SDK: Handoffs — alternative pattern for transferring control between specialists
Deep dive
- Anthropic: Tool use best practices — patterns for scoping tool access and validating tool calls
- Human-in-the-loop patterns — LangGraph's approach to human approval gates in agent workflows