Module 7: Orchestration and Memory

Designing Specialists and Routers

The previous lesson established when to split and built the orchestration skeleton. Now we'll fill in the specialists. A good specialist is narrow, measurable, and independently testable. A bad specialist is a vague prompt with broad tool access, essentially the single agent again with extra plumbing.

This lesson walks through the first specialist split, shows how to design prompts and tool access for each role, and, critically, benchmarks the specialist system against the single-agent baseline to verify that the split actually helps.

What you'll learn

  • Design specialist agents with narrow scope, constrained tool access, and clear input/output contracts
  • Build a router that classifies queries and activates the right specialist with appropriate context
  • Add a human approval point for side-effecting actions like file writes or external API calls
  • Test specialists in isolation before composing them through the orchestrator
  • Benchmark the specialist split against the single-agent system and interpret the results

Concepts

Specialist agent — an agent designed for a narrow task with a focused system prompt, limited tool access, and a defined output format. The key property of a good specialist is that it can be tested in isolation: you can pass it an input, get an output, and grade that output without running the full orchestration pipeline. This makes specialists easier to debug and improve than a single agent that handles everything.

Prompt isolation — the practice of giving each specialist its own system prompt rather than sharing one large prompt. Prompt isolation prevents the problem we identified in the previous lesson: conflicting instructions that try to serve multiple tasks simultaneously. The code explainer's prompt says "be precise about line numbers and function signatures." The docs specialist's prompt says "synthesize high-level architecture decisions." These instructions would conflict in a single prompt.

Tool scoping — restricting each specialist's tool access to only the tools it needs. The code explainer gets search_code and read_file. The docs specialist gets search_docs. The debug specialist gets search_code, read_file, and run_tests. Tool scoping reduces unnecessary tool calls (a problem we measured with the tool grader) and limits the blast radius when a specialist misbehaves.

Human approval gate — a checkpoint where the system pauses and asks for human confirmation before executing a side-effecting action. We add approval gates for actions that modify the repository, call external APIs, or could affect production systems. The approval gate is a node in the orchestration graph — the system routes to it, pauses, and resumes when the human confirms or rejects the action.

Specialist benchmark — an eval that tests a specialist in isolation, outside the orchestration pipeline. Specialist benchmarks use a subset of your existing benchmark questions filtered to the specialist's domain. The code specialist runs against code questions only, the docs specialist against docs questions only. This lets you improve each specialist independently before measuring the composed system.

Problem-to-Tool Map

For each problem class: the symptom, the cheapest thing to try first, and the tool or approach to use.

  • Specialist prompt too broad. Symptom: specialist behaves like the old single agent. Try first: narrow the system prompt. Approach: remove irrelevant instructions; constrain to one task type.
  • Wrong specialist activated. Symptom: router sends questions to the wrong specialist. Try first: check routing accuracy. Approach: improve the classification prompt or add few-shot examples.
  • Specialist calls unnecessary tools. Symptom: tool grader shows tools outside the specialist's scope. Try first: review tool access. Approach: remove tools the specialist doesn't need.
  • Side-effecting actions run unsupervised. Symptom: system writes files or calls APIs without confirmation. Try first: add a manual check before deploy. Approach: human approval gate in the graph.
  • Can't tell if the split helped. Symptom: no clear quality difference between single-agent and specialist. Try first: run the benchmark on both. Approach: side-by-side comparison on the same eval suite.

Walkthrough

The first specialist split

Based on the routing categories from the previous lesson, here's the first specialist split that works well for a code assistant:

Orchestrator / Router — classifies the incoming question, selects the specialist, passes context, and synthesizes the final response. The orchestrator doesn't answer questions itself. Its job is to make good routing decisions and combine specialist outputs.

Retriever / Evidence assembler — handles questions that need evidence from the codebase or documentation. This specialist runs retrieval, ranks evidence, and returns a structured evidence package. It doesn't generate the final answer — that happens in synthesis.

Code explainer — handles questions about specific code: what a function does, how a class works, why a pattern was chosen. Gets search_code and read_file tools with a prompt tuned for precise, citation-heavy explanations.

Test / Debug specialist — handles questions about failures, test output, and debugging. Gets search_code, read_file, and run_tests tools with a prompt tuned for diagnostic reasoning: identify the failure, locate the relevant code, explain the root cause.

These four specialists aren't a universal recipe. They're a starting point based on the question types that showed the most variation in our eval results. Your split might look different depending on where your single-agent system struggles.
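
The router itself was built in the previous lesson. For exercising the split locally without API calls, a keyword-based fallback classifier can stand in for the LLM router. This is a hypothetical development shim, not the production routing approach, and the keyword lists are illustrative:

```python
# Hypothetical keyword-based fallback router for local testing only.
# The production router from the previous lesson uses an LLM classifier;
# this shim just lets you exercise the specialists offline.

ROUTE_KEYWORDS = {
    "debug": ("error", "traceback", "failing", "test fails", "exception"),
    "docs": ("architecture", "design decision", "documentation", "why does the system"),
    "code": ("function", "class", "implementation", "what does", "how does"),
}


def fallback_route(question: str) -> str:
    """Return the first route whose keywords match; default to 'general'."""
    q = question.lower()
    for route, keywords in ROUTE_KEYWORDS.items():
        if any(k in q for k in keywords):
            return route
    return "general"
```

Because routes are checked in order, debug symptoms win over generic code words, which matches how an LLM router should prioritize failure-related questions.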

Designing specialist prompts

Each specialist gets its own system prompt. The prompt should be short, specific, and impossible to confuse with another specialist's job:

# orchestration/specialists.py
"""Specialist agent implementations.

Each specialist has:
- A focused system prompt
- Scoped tool access
- A defined output format
"""


# --- System prompts ---

CODE_EXPLAINER_PROMPT = """You are a code explanation specialist for a repository assistant.

Your job: explain specific code — functions, classes, patterns, and implementation details.

Rules:
- Always cite file paths and line numbers when referencing code.
- Quote the relevant code snippet before explaining it.
- If you need to see more context, use the read_file tool.
- If you can't find the code the user is asking about, say so explicitly.
- Do NOT answer questions about documentation, architecture, or debugging.
  Those go to other specialists.

Output format:
- Start with a one-sentence summary of what the code does.
- Then provide the detailed explanation with code references.
- End with any caveats or edge cases you noticed."""

DOCS_SPECIALIST_PROMPT = """You are a documentation specialist for a repository assistant.

Your job: answer questions about documentation, architecture, design decisions,
and high-level system behavior.

Rules:
- Cite specific documentation files when answering.
- If the answer requires code-level detail, say so — the code specialist
  will handle it.
- Synthesize information across multiple docs when the question spans topics.
- Do NOT answer questions about specific function implementations or debugging.

Output format:
- Start with a direct answer to the question.
- Then provide supporting evidence from the documentation.
- Note any gaps where documentation is missing or unclear."""

DEBUG_SPECIALIST_PROMPT = """You are a debugging specialist for a repository assistant.

Your job: diagnose failures, explain test output, and help debug issues.

Rules:
- Start with the error message or failure symptom.
- Locate the relevant code using search and file reading.
- Provide a root cause analysis with evidence.
- If you can suggest a fix, do so — but flag any fix that modifies files
  as requiring human approval.
- Do NOT answer general code questions or documentation questions.

Output format:
- Error/symptom summary
- Relevant code location(s)
- Root cause analysis
- Suggested fix (if applicable, flagged for approval if side-effecting)"""

EVIDENCE_ASSEMBLER_PROMPT = """You are an evidence assembler for a repository assistant.

Your job: retrieve and organize evidence from the codebase and documentation
to support answering a question. You do NOT generate the final answer.

Rules:
- Use search tools to find relevant code and documentation.
- Rank evidence by relevance to the question.
- Return structured evidence with file paths, snippets, and relevance notes.
- If retrieval finds nothing relevant, say so explicitly.

Output format:
Return a JSON object with:
- "evidence": list of {file, snippet, relevance} objects
- "confidence": float 0-1 indicating how well the evidence covers the question
- "gaps": list of aspects the evidence doesn't cover"""

Notice the pattern: each prompt defines the specialist's scope ("your job"), its boundaries ("do NOT answer..."), and its output format. This structure makes specialists independently testable and prevents them from drifting into each other's territory.
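
One convenient way to keep that pattern honest is a single registry tying each route to its prompt and tool names, so a specialist's prompt and tool scope can't drift apart. This is an organizational suggestion, not part of the lesson's required code; short placeholder strings stand in for the full prompts above:

```python
# Hypothetical registry pattern: one mapping from route name to the
# specialist's prompt and allowed tool names. Placeholder prompt strings
# stand in for the full prompts defined above.

CODE_EXPLAINER_PROMPT = "You are a code explanation specialist..."  # placeholder
DOCS_SPECIALIST_PROMPT = "You are a documentation specialist..."    # placeholder
DEBUG_SPECIALIST_PROMPT = "You are a debugging specialist..."       # placeholder

SPECIALISTS = {
    "code": {"prompt": CODE_EXPLAINER_PROMPT, "tools": ["search_code", "read_file"]},
    "docs": {"prompt": DOCS_SPECIALIST_PROMPT, "tools": ["search_docs"]},
    "debug": {"prompt": DEBUG_SPECIALIST_PROMPT, "tools": ["search_code", "read_file", "run_tests"]},
}


def tools_for(route: str) -> list[str]:
    """Look up the tool names a route's specialist is allowed to use."""
    return SPECIALISTS[route]["tools"]
```

A registry like this also gives the router and the isolated-testing harness one source of truth for which tools each specialist should be exercising.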

Scoping tool access

Each specialist should only have access to the tools it needs:

# orchestration/specialists.py (continued)

# --- Tool definitions ---

CODE_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search_code",
            "description": "Search the codebase for functions, classes, or patterns",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"},
                    "file_pattern": {"type": "string", "description": "Optional glob pattern to filter files"},
                },
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read a specific file or range of lines",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "File path"},
                    "start_line": {"type": "integer", "description": "Optional start line"},
                    "end_line": {"type": "integer", "description": "Optional end line"},
                },
                "required": ["path"],
            },
        },
    },
]

DOCS_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search_docs",
            "description": "Search project documentation",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"},
                },
                "required": ["query"],
            },
        },
    },
]

DEBUG_TOOLS = CODE_TOOLS + [
    {
        "type": "function",
        "function": {
            "name": "run_tests",
            "description": "Run a specific test file or test function",
            "parameters": {
                "type": "object",
                "properties": {
                    "test_path": {"type": "string", "description": "Test file or function path"},
                    "verbose": {"type": "boolean", "description": "Show detailed output"},
                },
                "required": ["test_path"],
            },
        },
    },
]

The code explainer gets search_code and read_file. The docs specialist gets only search_docs. The debug specialist gets the code tools plus run_tests. With scoping in place, a specialist can't call tools outside its set at all, so if the tool grader still flags out-of-scope tool use, the question was routed to the wrong specialist: a routing error, not a tool-use error.
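
As a belt-and-braces check during development, you can also wrap the tool executor so that any out-of-scope call fails loudly instead of silently executing. This `make_scoped_executor` helper is a hypothetical addition, not part of the lesson's codebase:

```python
# Hypothetical guard: wrap a tool executor so any call outside the
# specialist's allowed set raises instead of executing. Surfaces routing
# and scoping bugs immediately during development.

def make_scoped_executor(executor, allowed_tools: set[str]):
    def scoped(tool_name: str, arguments: str):
        if tool_name not in allowed_tools:
            raise PermissionError(
                f"Tool '{tool_name}' is outside this specialist's scope "
                f"(allowed: {sorted(allowed_tools)})"
            )
        return executor(tool_name, arguments)
    return scoped
```

Usage would look like `docs_executor = make_scoped_executor(execute_tool, {"search_docs"})`, passed to the docs specialist in place of the raw executor.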

Calling specialists from the graph

Each specialist call follows the same pattern: build messages with the specialist prompt, call the API with scoped tools, handle tool calls, and return the result:

# orchestration/specialists.py (continued)
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4.1-mini"


def call_specialist(
    question: str,
    system_prompt: str,
    tools: list[dict],
    tool_executor: callable,
    thread_context: list[dict] | None = None,
    long_term_context: str = "",
) -> dict:
    """Call a specialist with its scoped prompt, tools, and memory context."""
    effective_prompt = system_prompt
    if long_term_context:
        effective_prompt += f"\n\n{long_term_context}"

    messages = [{"role": "system", "content": effective_prompt}]
    if thread_context:
        messages.extend(thread_context)
    messages.append({"role": "user", "content": question})

    tools_called = []
    max_turns = 5

    for _ in range(max_turns):
        response = client.chat.completions.create(
            model=MODEL,
            messages=messages,
            tools=tools if tools else None,
        )

        choice = response.choices[0]
        if choice.finish_reason == "tool_calls":
            messages.append(choice.message)
            for tool_call in choice.message.tool_calls:
                tools_called.append(tool_call.function.name)
                result = tool_executor(
                    tool_call.function.name,
                    tool_call.function.arguments,
                )
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": str(result),
                })
        else:
            return {
                "specialist_output": choice.message.content,
                "tools_called": tools_called,
                "model": MODEL,
            }

    # Fallback: the loop exhausted max_turns without a final text response.
    return {
        "specialist_output": "Specialist reached the tool-call turn limit without a final answer.",
        "tools_called": tools_called,
        "model": MODEL,
    }


def code_specialist(state: dict) -> dict:
    """Code explainer specialist node."""
    from tools.executor import execute_tool  # Your existing tool executor
    result = call_specialist(
        question=state["question"],
        system_prompt=CODE_EXPLAINER_PROMPT,
        tools=CODE_TOOLS,
        tool_executor=execute_tool,
        # Memory wiring: pass thread history and long-term context from state.
        # These fields are populated by the memory lessons (thread-and-workflow-memory,
        # long-term-memory). Until you build those layers, these will be empty.
        thread_context=state.get("thread_messages"),
        long_term_context=state.get("long_term_context", ""),
    )
    return {**state, **result}


def docs_specialist(state: dict) -> dict:
    """Documentation specialist node."""
    from tools.executor import execute_tool
    result = call_specialist(
        question=state["question"],
        system_prompt=DOCS_SPECIALIST_PROMPT,
        tools=DOCS_TOOLS,
        tool_executor=execute_tool,
        thread_context=state.get("thread_messages"),
        long_term_context=state.get("long_term_context", ""),
    )
    return {**state, **result}


def debug_specialist(state: dict) -> dict:
    """Debug specialist node."""
    from tools.executor import execute_tool
    result = call_specialist(
        question=state["question"],
        system_prompt=DEBUG_SPECIALIST_PROMPT,
        tools=DEBUG_TOOLS,
        tool_executor=execute_tool,
        thread_context=state.get("thread_messages"),
        long_term_context=state.get("long_term_context", ""),
    )
    return {**state, **result}


def general_specialist(state: dict) -> dict:
    """General-purpose fallback specialist."""
    from tools.executor import execute_tool
    result = call_specialist(
        question=state["question"],
        system_prompt="You are a general-purpose code repository assistant. Answer the question using available tools.",
        tools=CODE_TOOLS + DOCS_TOOLS,
        tool_executor=execute_tool,
        thread_context=state.get("thread_messages"),
        long_term_context=state.get("long_term_context", ""),
    )
    return {**state, **result}
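
Every node above follows the same contract: take the state dict in, return the state with the specialist's result merged on top. A stubbed example (no API call, purely to illustrate the contract the orchestrator relies on):

```python
# Demonstrates the state-in / state-out contract the specialist nodes
# follow, using a stub in place of the real API-backed specialist.

def stub_specialist(state: dict) -> dict:
    result = {
        "specialist_output": f"(stub answer to: {state['question']})",
        "tools_called": [],
        "model": "stub",
    }
    # Merge: original keys survive, specialist keys are added.
    return {**state, **result}


state = {"question": "What does parse_config do?", "route": "code"}
new_state = stub_specialist(state)
```

Because `{**state, **result}` builds a new dict, the incoming state is never mutated, which keeps nodes safe to retry and easy to test.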

Adding a human approval gate

Some actions have side effects, such as writing files, running commands that modify state, or calling external APIs. For these, we'll add an approval point in the orchestration graph. The system pauses, presents the proposed action to a human, and only proceeds with explicit confirmation:

# orchestration/approval.py
"""Human approval gate for side-effecting actions.

Pauses the orchestration pipeline and waits for human confirmation
before executing actions that modify the repository or external state.
"""

SIDE_EFFECTING_TOOLS = {"write_file", "run_command", "create_pr", "deploy"}


def needs_approval(state: dict) -> bool:
    """Check whether any proposed action requires human approval."""
    proposed_tools = set(state.get("proposed_tools", []))
    return bool(proposed_tools & SIDE_EFFECTING_TOOLS)


def request_approval(state: dict) -> dict:
    """Present the proposed action and wait for human confirmation.

    In a production system, this would send a notification (Slack, email,
    UI prompt) and wait for a response. For development, we use stdin.
    """
    proposed = state.get("proposed_action", "Unknown action")
    tools = state.get("proposed_tools", [])

    print("\n" + "=" * 50)
    print("APPROVAL REQUIRED")
    print("=" * 50)
    print(f"Action: {proposed}")
    print(f"Tools:  {', '.join(tools)}")
    print(f"Reason: {state.get('specialist_output', 'No explanation provided')[:200]}")
    print()

    response = input("Approve this action? [y/N]: ").strip().lower()
    approved = response in ("y", "yes")

    return {
        **state,
        "approved": approved,
        "approval_response": "approved" if approved else "rejected",
    }

To wire this into the graph, add a conditional edge after the debug specialist (since debugging is the most likely path to suggest side-effecting actions):

# In orchestration/graph.py, extend the graph:
from orchestration.approval import needs_approval, request_approval


def check_approval_needed(state: dict) -> str:
    """Route to approval gate if side-effecting tools are proposed."""
    if needs_approval(state):
        return "approval"
    return "synthesize"

# Register the approval node, then replace the direct debug -> synthesize edge:
# graph.add_edge("debug", "synthesize")  # Remove this
graph.add_node("approval", request_approval)
graph.add_conditional_edges(
    "debug",
    check_approval_needed,
    {"approval": "approval", "synthesize": "synthesize"},
)
graph.add_edge("approval", "synthesize")

The approval gate is a safety mechanism, not a bottleneck. It only activates for side-effecting actions, and the system continues normally for read-only questions.
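
A quick sanity check of the gate logic makes this concrete: read-only traces pass straight through, side-effecting ones stop. The function is restated from orchestration/approval.py above so the snippet runs standalone:

```python
# Sanity check of the approval routing logic (restated from
# orchestration/approval.py so this snippet is self-contained).

SIDE_EFFECTING_TOOLS = {"write_file", "run_command", "create_pr", "deploy"}


def needs_approval(state: dict) -> bool:
    proposed_tools = set(state.get("proposed_tools", []))
    return bool(proposed_tools & SIDE_EFFECTING_TOOLS)


read_only = {"proposed_tools": ["search_code", "read_file"]}   # no gate
proposed_fix = {"proposed_tools": ["read_file", "write_file"]}  # gate fires
```

A single side-effecting tool in the proposed set is enough to trigger the gate; an empty or missing list never does.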

Teaching simplification vs. production implementation. The input() gate above demonstrates the concept — human gates before side-effecting actions. In production, LangGraph's interrupt() function with a checkpointer and thread_id provides durable pause/resume: the graph serializes its state, the process can shut down, and execution resumes from the same point when the human responds (via a UI callback, Slack action, etc.). The portable idea is that side-effecting actions require explicit human approval. The input() version works for local development and testing; interrupt() with a checkpointer is what you'll use when deploying.

Benchmarking the specialist split

Now the critical step: measure whether specialists actually improve the system.

# 1. Run the single-agent baseline (if you haven't recently)
python harness/run_harness.py --pipeline single-agent
python harness/graders/answer_grader.py harness/runs/single-agent-latest.jsonl
python harness/graders/trace_labeler.py harness/runs/single-agent-latest-graded.jsonl

# 2. Run the specialist system
python harness/run_harness.py --pipeline orchestrated
python harness/graders/answer_grader.py harness/runs/orchestrated-latest.jsonl
python harness/graders/trace_labeler.py harness/runs/orchestrated-latest-graded.jsonl

# 3. Compare
python harness/compare_runs.py \
    harness/runs/single-agent-latest-graded-traced.jsonl \
    harness/runs/orchestrated-latest-graded-traced.jsonl

Look at the comparison across several dimensions:

  • Per-route accuracy. Do code questions score higher with the code specialist than with the single agent? What about docs and debug questions?
  • Tool precision. Does tool scoping reduce unnecessary tool calls?
  • Waste rate. Does routing reduce correct_but_wasteful traces?
  • Cost. What's the overhead of routing + specialist calls vs. the single agent?
  • Latency. Does the routing step add noticeable delay?

If the specialist system doesn't beat the single-agent baseline on at least one dimension without degrading the others, simplify. Remove specialists that don't help. Merge categories that the router can't reliably distinguish. The eval results will tell you what to do.
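
The per-route accuracy comparison can be computed directly from the graded run logs. A minimal sketch, assuming each JSONL line carries an `expected_route` and a boolean `correct` field; your grader's actual field names may differ:

```python
# Minimal per-route accuracy from a graded JSONL run log.
# Assumes each line has 'expected_route' and a boolean 'correct';
# adjust the field names to match your grader's output.
import json
from collections import defaultdict


def per_route_accuracy(path: str) -> dict[str, float]:
    totals = defaultdict(int)
    hits = defaultdict(int)
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            row = json.loads(line)
            route = row.get("expected_route", "unknown")
            totals[route] += 1
            hits[route] += bool(row.get("correct"))
    return {route: hits[route] / totals[route] for route in totals}
```

Run it on both graded files and compare the resulting dicts route by route; a split that helps should show a clear win on at least one route without losses elsewhere.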

Testing specialists in isolation

Before tuning the composed system, test each specialist on its own domain:

# harness/test_specialist.py
"""Test a single specialist against its domain-specific benchmark subset."""
import json
import sys

from orchestration.specialists import call_specialist, CODE_EXPLAINER_PROMPT, CODE_TOOLS
from tools.executor import execute_tool


def test_specialist(
    specialist_prompt: str,
    specialist_tools: list,
    benchmark_file: str,
    route_filter: str,
):
    """Run benchmark questions for a single specialist."""
    questions = []
    with open(benchmark_file) as f:
        for line in f:
            if line.strip():
                q = json.loads(line)
                if q.get("expected_route") == route_filter:
                    questions.append(q)

    print(f"Testing specialist ({route_filter}): {len(questions)} questions\n")

    for q in questions:
        result = call_specialist(
            question=q["question"],
            system_prompt=specialist_prompt,
            tools=specialist_tools,
            tool_executor=execute_tool,
        )
        print(f"  Q: {q['question'][:60]}")
        print(f"  Tools: {result['tools_called']}")
        print(f"  Answer: {result['specialist_output'][:100]}...")
        print()


if __name__ == "__main__":
    test_specialist(
        specialist_prompt=CODE_EXPLAINER_PROMPT,
        specialist_tools=CODE_TOOLS,
        benchmark_file="benchmark-questions.jsonl",
        route_filter="code",
    )

Isolated testing catches problems early. If the code specialist scores poorly on code questions in isolation, the problem is the specialist's prompt or tool access, not the router, orchestrator, or synthesis step. That narrows our debugging considerably.

Exercises

  1. Implement all four specialist prompts from this lesson. Test each one in isolation against its domain subset of your benchmark. Record the per-specialist accuracy.

  2. Add tool scoping to each specialist so they can only access their designated tools. Run the tool grader on the orchestrated system and compare unnecessary tool calls vs. the single-agent baseline.

  3. Add the human approval gate for side-effecting actions. Test it with a debug question that suggests modifying a file. Verify that the system pauses for approval and handles both approval and rejection correctly.

  4. Run the full benchmark through both the single-agent and orchestrated pipelines. Build a comparison table showing accuracy, tool precision, waste rate, cost, and latency for each. Does the specialist split help?

  5. Based on your comparison results, remove any specialist that doesn't improve its domain subset. Rerun the benchmark with the simplified system and verify that quality holds.

Completion checkpoint

You have:

  • Four specialist agents with isolated prompts and scoped tool access
  • Each specialist tested in isolation against its domain-specific benchmark subset
  • A human approval gate that activates for side-effecting actions
  • A side-by-side benchmark comparison of single-agent vs. specialist system
  • Evidence-based decisions about which specialists to keep, remove, or merge

Reflection prompts

  • Which specialist showed the biggest improvement over the single-agent baseline? What about the specialist's design made the difference?
  • Did any specialist perform worse than the single agent on its domain? What would you change?
  • The human approval gate adds friction. In a production system, how would you decide which actions require approval and which can proceed automatically?

What's next

Agent-to-Agent Interop. You have coordination inside one system now; the next lesson looks at what changes when agents have to talk across system boundaries.



Glossary
API (Application Programming Interface)Foundational terms
A structured way for programs to communicate. In this context, usually an HTTP endpoint you call to interact with an LLM.
AST (Abstract Syntax Tree)Foundational terms
A tree representation of source code structure. Used by parsers like Tree-sitter to understand code as a hierarchy of functions, classes, and statements. You'll encounter this more deeply in the Code Retrieval module, but the concept appears briefly in retrieval fundamentals.
BM25 (Best Match 25)Foundational terms
A classical ranking function for keyword search. Scores documents by term frequency and inverse document frequency. Often competitive with or complementary to vector search.
ChunkingFoundational terms
Splitting a document into smaller pieces for indexing and retrieval. Chunk boundaries significantly affect retrieval quality. Split at the wrong place and your retrieval will return half a function or the end of one paragraph glued to the start of another.
Context engineeringFoundational terms
The discipline of selecting, packaging, and budgeting the information a model sees at inference time. Prompts, retrieved evidence, tool results, memory, and state are all parts of context. Context engineering is arguably the core skill of AI engineering. Bigger context windows are not a substitute for better context selection.
Context rotFoundational terms
Degradation of output quality caused by stale, noisy, or accumulated context. Symptoms include stale memory facts, conflicting retrieved evidence, bloated prompt history, and accumulated instructions that contradict each other. A form of technical debt in AI systems.
Context windowFoundational terms
The maximum number of tokens an LLM can process in a single request (input + output combined).
EmbeddingFoundational terms
A fixed-length numeric vector representing a piece of text. Used for similarity search: texts with similar meanings have nearby embeddings.
EndpointFoundational terms
A specific URL path that accepts requests and returns responses (e.g., POST /v1/chat/completions).
GGUFFoundational terms
A file format for quantized models used by llama.cpp and Ollama. When you see a model name like qwen2.5:7b-q4_K_M, the suffix indicates the quantization scheme. GGUF supports mixed quantization (different precision for different layers) and is the most common format for local inference.
HallucinationFoundational terms
When a model generates content that sounds confident but isn't supported by the evidence it was given, or fabricates details that don't exist. Not the same as "any wrong answer"; a model that misinterprets ambiguous instructions gave a bad answer but didn't hallucinate. Common causes: weak prompt, missing context, context rot, model limitation, or retrieval failure.
InferenceFoundational terms
Running a trained model to generate output from input. What happens when you call an API. Most AI engineering work is inference-time work: building systems around models, not training them. Use "inference," not "inferencing."
JSON (JavaScript Object Notation)Foundational terms
A lightweight text format for structured data. The lingua franca of API communication.
Lexical searchFoundational terms
Finding items by matching keywords or terms. Includes BM25, TF-IDF (Term Frequency–Inverse Document Frequency), and simple keyword matching. Returns exact term matches, not semantic similarity.
LLM (Large Language Model)Foundational terms
A neural network trained on large text corpora that generates text by predicting the next token. The core technology behind AI engineering; every tool, pattern, and pipeline in this curriculum runs on top of one.
MetadataFoundational terms
Structured information about a document or chunk (file path, language, author, date, symbol type). Used for filtering retrieval results.
Neural networkFoundational terms
A computing system loosely inspired by biological neurons, built from layers of mathematical functions that transform inputs into outputs. LLMs are a specific type of neural network (transformers) trained on text. You don't need to understand neural network internals to do AI engineering, but knowing the term helps when reading external resources.
Reasoning modelFoundational terms
A model optimized for complex multi-step planning, math, and logic (e.g., o3, o4-mini). Slower and more expensive but better on hard problems. Sometimes called "LRM" (large reasoning model), but "reasoning model" is the more consistent term across provider docs.
RerankingFoundational terms
A second-pass scoring step that re-orders retrieved results using a more expensive model. Improves precision after an initial broad retrieval.
SchemaFoundational terms
A formal description of the shape and types of a data structure. Used to validate inputs and outputs.
SLM (small language model)Foundational terms
A compact model (typically 1-7B parameters) that runs on consumer hardware with lower cost, latency, and better privacy (e.g., Phi, small Llama variants, Gemma). The right choice when privacy, offline operation, predictable cost, or low latency matter more than peak capability.
System promptFoundational terms
A special message that sets the model's behavior, role, and constraints for a conversation.
TemperatureFoundational terms
A parameter controlling output randomness. Lower values produce more deterministic output; higher values produce more varied output. Does not affect the model's intelligence.
TokenFoundational terms
The basic unit an LLM processes. Not a word. Tokens are sub-word fragments. "unhappiness" might be three tokens: "un", "happi", "ness". Token count determines cost and context window usage.
Top-kFoundational terms
The number of results returned from a retrieval query. "Top-5" means the five highest-scoring results.
Top-p (nucleus sampling)Foundational terms
An alternative to temperature for controlling output diversity. Selects from the smallest set of tokens whose cumulative probability exceeds p.
Vector searchFoundational terms
Finding items by proximity in embedding space (nearest neighbors). Returns "similar" results, not "exact match" results.
vLLM (virtual LLM)Foundational terms
An inference serving engine (not a model) that hosts open-weight models behind an OpenAI-compatible HTTP endpoint. Infrastructure layer, not model layer. Relevant when moving from hosted APIs to self-hosting.
WeightsFoundational terms
The learned parameters inside a model. Changed during training, fixed during inference.
Workhorse modelFoundational terms
A general-purpose LLM optimized for speed and broad capability (e.g., GPT-4o-mini, Claude Haiku, Gemini Flash). The default for most tasks. When someone says "LLM" without qualification, they usually mean this.
BaselineBenchmark and Harness terms
The first measured performance of your system on a benchmark. Everything else is compared against this. Without a baseline, you can't tell whether a change helped.
BenchmarkBenchmark and Harness terms
A fixed set of questions or tasks with known-good answers, used to measure system performance over time.
Run logBenchmark and Harness terms
A structured record (typically JSONL) of every system run: what input was given, what output was produced, what tools were called, how long it took, and what it cost. The raw data that evals, telemetry, and cost analysis are built from.
Agent and Tool Building terms

A2A (Agent-to-Agent protocol) — An open protocol for peer-to-peer agent collaboration. Agents discover each other's capabilities and delegate or negotiate tasks as equals. Different from MCP (which connects agents to tools, not to other agents) and from handoffs (which transfer control within one system).

Agent — A system where an LLM decides which tools to call, observes results, and iterates until a task is complete. Agent = model + tools + control loop.

Control loop — The code that manages the agent's cycle: send prompt, check for tool calls, execute tools, append results, repeat or finish.

Handoff — Passing control from one agent or specialist to another within an orchestrated system.

MCP (Model Context Protocol) — An open protocol for exposing tools, resources, and prompts to AI applications in a standardized way. Connects agents to capabilities (tools and data), not to other agents.

Tool calling / function calling — The model's ability to request execution of a specific function with structured arguments, rather than just generating text.
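The control loop and tool calling fit together as sketched below, with the model call stubbed out. `fake_model` and the `add` tool are invented stand-ins for a real provider API; only the loop shape matters.

```python
# Skeleton of an agent control loop: send prompt, check for tool
# calls, execute tools, append results, repeat or finish.

def fake_model(messages):
    # Stand-in for the LLM: ask for a tool on the first turn,
    # then answer once a tool result is in the conversation.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "add", "args": {"a": 2, "b": 3}}}
    return {"content": "The sum is 5."}

TOOLS = {"add": lambda a, b: a + b}

def run_agent(user_input: str) -> str:
    messages = [{"role": "user", "content": user_input}]
    while True:
        reply = fake_model(messages)
        call = reply.get("tool_call")
        if call is None:                      # no tool call: finished
            return reply["content"]
        result = TOOLS[call["name"]](**call["args"])  # execute the tool
        messages.append({"role": "tool", "content": str(result)})

print(run_agent("What is 2 + 3?"))   # -> The sum is 5.
```

Swapping `fake_model` for a real API call (and adding iteration limits and error handling) turns this skeleton into a working agent.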
Code Retrieval terms

Context compilation / context packing — The process of selecting and assembling the smallest useful set of evidence for a specific task. Not "dump everything retrieved into the prompt."

Grounding — Tying model assertions to specific evidence. A grounded answer cites what it found; an ungrounded answer asserts without evidence.

Hybrid retrieval — Combining multiple retrieval methods (e.g., vector search + keyword search + metadata filters) and merging or reranking the results.
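One common way to merge the result lists is reciprocal rank fusion; the document IDs below are illustrative:

```python
# Reciprocal rank fusion (RRF): score each document by the sum of
# 1/(k + rank) across rankings, so items that appear high in
# multiple lists rise to the top.

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_b", "doc_d", "doc_a"]
print(rrf([vector_hits, keyword_hits]))   # doc_b and doc_a rise to the top
```

RRF needs no score normalization across methods, which is why it is a popular default for hybrid setups.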
Knowledge graph — A data structure that stores entities and their relationships explicitly (e.g., "function A calls function B," "module X imports module Y"). Useful for traversal and dependency reasoning. One retrieval strategy among several, often overused when simpler metadata or adjacency tables would suffice.

RAG (Retrieval-Augmented Generation) — A pattern where the model's response is grounded in retrieved external evidence rather than relying solely on its training data.

Symbol table — A mapping of code identifiers (functions, classes, variables) to their locations and metadata.
Tree-sitter — An incremental parsing library that builds ASTs for source code. Used in this curriculum for code-aware chunking and symbol extraction.
RAG and Grounded Answers terms

Context pack — A structured bundle of evidence assembled for a specific task, with metadata about provenance, relevance, and token budget.

Evidence bundle — A collection of retrieved items grouped for a specific sub-task, with enough metadata to evaluate whether the evidence is relevant and sufficient.

Retrieval routing — Deciding which retrieval strategy or method to use for a given query. Different questions need different retrieval methods.
Observability and Evals terms

Eval — A structured test that measures system quality. Not the same as training. Evals measure, they don't change the model.

Harness (AI harness / eval harness) — The experiment and evaluation framework around your model or agent. It runs benchmark tasks, captures outputs, logs traces, grades results, and compares system versions. It turns ad hoc "try it and see" into repeatable, comparable experiments. Typically includes: input dataset, prompt and tool configuration, model/provider selection, execution loop, logging, grading, and artifact capture.

LLM-as-judge — Using a language model to evaluate or grade the output of another model or system. Useful for scaling evaluation beyond manual review, but requires rubric quality, judge consistency checks, and human spot-checking. Not a replacement for exact-match checks where they apply.

OpenTelemetry (OTel) — An open standard for collecting and exporting telemetry data (traces, metrics, logs). Vendor-agnostic.

RAGAS — A specific eval framework for retrieval-augmented generation. Measures metrics like faithfulness, relevance, and context precision. One tool example, not a foundational concept. Learn the metrics first, then the tool.

Span — A single operation within a trace (e.g., one tool call, one retrieval query). Traces are made of spans.

Telemetry — Structured data about system behavior: what happened, when, how long it took, what it cost. Includes traces, metrics, and events.

Trace — A structured record of one complete run through the system, including all steps, tool calls, and decisions.
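A toy illustration of the trace/span relationship; real systems would use OpenTelemetry, and this sketch only shows the shape:

```python
# A trace as a list of spans: each span records one timed operation.
import time, uuid
from contextlib import contextmanager

trace: list[dict] = []

@contextmanager
def span(name: str):
    start = time.time()
    try:
        yield
    finally:
        # Record the span even if the operation raised.
        trace.append({"id": uuid.uuid4().hex, "name": name,
                      "duration_s": time.time() - start})

with span("retrieve"):
    time.sleep(0.01)           # stand-in for a retrieval query
with span("generate"):
    time.sleep(0.01)           # stand-in for the model call

print([s["name"] for s in trace])   # -> ['retrieve', 'generate']
```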
Orchestration and Memory terms

Long-term memory — Persistent facts that survive across conversations. Requires write policies to manage what gets stored, updated, or deleted.

Orchestration — Explicit control over how tasks are routed, delegated, and synthesized across multiple agents or specialists.

Router — A component that decides which specialist or workflow path to use for a given query.
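A minimal router might classify with keywords before escalating to an LLM classifier; the specialists and keywords below are invented for illustration:

```python
# Keyword-based router: classify the query, return the specialist
# that should handle it. Real routers often use an LLM call for
# classification; this shows only the routing contract.

SPECIALISTS = {
    "code": "code-search specialist",
    "docs": "documentation specialist",
    "default": "generalist",
}

def route(query: str) -> str:
    q = query.lower()
    if any(w in q for w in ("function", "class", "bug", "stack trace")):
        return SPECIALISTS["code"]
    if any(w in q for w in ("architecture", "design doc", "readme")):
        return SPECIALISTS["docs"]
    return SPECIALISTS["default"]   # fall through to the generalist

print(route("Why does this function throw?"))   # -> code-search specialist
```

Whatever the classification method, the contract is the same: query in, one specialist out, with a default path for queries that match nothing.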
Specialist — An agent or workflow tuned for a narrow task (e.g., "code search," "documentation lookup," "test generation"). Specialists are composed by an orchestrator.

Thread memory — Conversation state that persists within a single session or thread.

Workflow memory — Intermediate state that persists within a multi-step task but doesn't survive beyond the workflow's completion.
Optimization terms

Catastrophic forgetting — When fine-tuning causes a model to lose capabilities it had before training. The model gets better at the fine-tuned task but worse at tasks it previously handled. PEFT methods like LoRA reduce this risk by freezing original weights.

Distillation — Training a smaller (student) model to reproduce the behavior of a larger (teacher) model on a specific task.

DPO (Direct Preference Optimization) — A method for preference-based model optimization that's simpler than RLHF, training the model directly on preference pairs without a separate reward model.

Fine-tuning — Updating a model's weights on task-specific data to change its behavior permanently. An umbrella term that includes SFT, instruction tuning, RLHF, DPO, and other techniques. See the fine-tuning landscape table in Lesson 8.3 for how these relate.

Full fine-tuning — Updating all of a model's parameters during training, as opposed to PEFT methods that update only a small subset. Requires significantly more GPU memory and compute. Produces the most thorough adaptation but carries higher risk of catastrophic forgetting.

Inference server — Software (like vLLM or Ollama) that hosts a model and serves inference requests.

Instruction tuning — A specific application of SFT where the training data consists of instruction-response pairs. This is how base models become chat models: the technique is SFT, the data format is instructions. Not a separate technique from SFT.

LoRA (Low-Rank Adaptation) — A parameter-efficient fine-tuning method that trains small adapter matrices instead of updating all model weights. Dramatically reduces GPU memory and compute requirements.

Parameter count — The number of learned weights in a model, commonly expressed in billions (e.g., "7B" = 7 billion parameters). Determines memory requirements (roughly 2 bytes per parameter at FP16) and broadly correlates with capability, though training quality and architecture matter as much as size. See Model Selection and Serving for sizing guidance.

PEFT (Parameter-Efficient Fine-Tuning) — A family of methods (including LoRA) that fine-tune a small subset of parameters instead of the full model.

Preference optimization — Training methods (RLHF, DPO) that use human or automated preference signals to improve model behavior. "This output is better than that output" rather than "this is the correct output."

QLoRA (Quantized LoRA) — LoRA applied to a quantized (compressed) base model. Further reduces memory requirements, enabling fine-tuning on consumer hardware.

Quantization — Reducing the precision of model weights (e.g., FP16 → INT4) to shrink memory usage and increase inference speed at some quality cost. A 7B model at FP16 needs ~14 GB VRAM; quantized to 4-bit, it fits in ~4 GB. Common formats include GGUF (llama.cpp/Ollama), GPTQ and AWQ (vLLM/HuggingFace). See Model Selection and Serving for format details and tradeoffs.
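The arithmetic behind those numbers is simple; this back-of-envelope estimate covers weights only, so treat it as a floor (activations and KV cache add more):

```python
# Estimate the memory footprint of model weights at a given precision:
# parameters * bits-per-parameter / 8 bytes, expressed in GB.

def weight_memory_gb(params_billions: float, bits: int) -> float:
    bytes_total = params_billions * 1e9 * bits / 8
    return bytes_total / 1e9

print(weight_memory_gb(7, 16))   # FP16: 14.0 GB
print(weight_memory_gb(7, 4))    # INT4:  3.5 GB
```

The gap between 3.5 GB and the ~4 GB quoted above is quantization overhead (scales, zero points, and mixed-precision layers that stay at higher precision).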
Overfitting — When a model memorizes training examples instead of learning generalizable patterns. The model performs well on training data but poorly on new inputs. Detected by monitoring validation loss alongside training loss.

RLHF (Reinforcement Learning from Human Feedback) — A training method that uses human preference signals to improve model behavior through a reward model. More complex than DPO (requires training a separate reward model) but offers more control over the optimization objective.

SFT (Supervised Fine-Tuning) — Fine-tuning using input-output pairs where the desired output is known. The most common fine-tuning approach.

TRL (Transformer Reinforcement Learning) — A Hugging Face library for training language models with reinforcement learning, SFT, and other optimization methods.
Cross-cutting terms

Consumer chat app — The browser or desktop product meant for human conversation (ChatGPT, Claude, HuggingChat). Useful for experimentation, but not the same as API access.

Developer platform — The provider's API, billing, API-key, and developer-docs surface. This is what you need for this learning path.

Hosted API — The provider runs the model for you and you call it over HTTP.

Local inference — You run the model on your own machine.

Provider — The company or service that hosts a model API you call from code.

Prompt caching — Reusing computation from repeated prompt prefixes to reduce latency and cost on subsequent requests with the same prefix.

Rate limiting — Constraints on how many API requests you can make per unit of time. An operational concern that affects system design and cost.

Token budget — The maximum number of tokens you allocate for a specific part of the context (e.g., "retrieval evidence gets at most 4K tokens"). A context engineering tool for preventing any single component from dominating the context window.
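Enforcing a budget can be as simple as packing relevance-ordered evidence until the allocation is spent; the token counts below are illustrative, and real code would use the provider's tokenizer to count:

```python
# Pack evidence items into a fixed token budget: walk the list in
# relevance order, skipping anything that would overflow the budget.

def pack_to_budget(items: list[tuple[str, int]], budget: int) -> list[str]:
    """items: (text, token_count) pairs, already sorted by relevance."""
    packed, used = [], 0
    for text, tokens in items:
        if used + tokens > budget:
            continue            # skip anything that would overflow
        packed.append(text)
        used += tokens
    return packed

evidence = [("top hit", 3000), ("second hit", 1000), ("long tail", 2000)]
print(pack_to_budget(evidence, budget=4000))   # -> ['top hit', 'second hit']
```

Skipping (rather than stopping at) an oversized item lets a smaller later item still use the remaining budget; truncating oversized items is a common alternative.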