Module 7: Orchestration and Memory

Orchestration and Subagents

You now have a single-agent system with four eval families, a harness that runs in CI, and trace labels that tell you exactly where behavior breaks down. That measurement foundation changes what's possible. Before evals, adding a second agent was guesswork because you couldn't tell whether the split helped or just added complexity. Now you can.

This lesson covers why and when to add multi-agent coordination. The next lesson will cover how to design the specialists themselves. We're separating these because the decision to split is more important than the implementation. A well-measured single agent beats an unmeasured multi-agent system every time.

What you'll learn

  • Identify when a single-agent system has outgrown its architecture using eval evidence
  • Apply decision criteria for splitting work across multiple agents
  • Build a minimal orchestrator that routes tasks and synthesizes results using LangGraph
  • Trace multi-agent execution paths using the observability tools from Module 6
  • Compare orchestrated results against the single-agent baseline using your existing benchmark

Concepts

Orchestration — explicit control over routing, delegation, state, retries, and synthesis in a multi-agent system. Orchestration is the portable concept underneath every multi-agent framework. Whether you use LangGraph, the OpenAI Agents SDK, or Claude Agent SDK, you're making the same decisions: which agent handles this task, what state do they share, what happens when one fails, and how do their outputs combine. The framework is the implementation; orchestration is the design.

Subagent — an agent that handles a narrower task under the direction of an orchestrator. A subagent has its own system prompt, its own tool access, and (often) its own model configuration. The orchestrator decides when to invoke it, what context to pass, and how to use its output. Think of subagents as specialists called on demand, rather than autonomous peers.

Router — the decision logic that examines an incoming request and determines which agent or path should handle it. We built a simpler version of routing in the retrieval module (code vs. docs vs. hybrid). Multi-agent routing extends the same idea: instead of choosing a retrieval mode, the router chooses which specialist to activate.

Synthesis — the step where the orchestrator combines outputs from one or more subagents into a final response. Synthesis isn't just concatenation. The orchestrator may need to resolve conflicts, fill gaps, or decide that one specialist's answer supersedes another's.

Control flow graph — a directed graph where nodes represent processing steps (agent calls, tool invocations, transformations) and edges represent transitions. LangGraph makes this literal: you define nodes and edges, and the framework handles execution, state passing, and checkpointing. The graph makes control flow visible, which matters when you're debugging multi-agent traces.

Problem-to-Tool Map

| Problem class | Symptom | Cheapest thing to try first | Tool or approach |
| --- | --- | --- | --- |
| One agent does everything poorly | Routing, synthesis, retrieval, and debugging all blur together | Better prompting and tool descriptions | Add one orchestrator and a small number of specialists |
| Specialist prompts conflict | The main prompt grows huge and contradictory | Subtasks via plain functions | Subagents with isolated prompts |
| Long workflows need state | Multi-step tasks become brittle or lose context | Manual state dictionary | Graph-based stateful orchestration |
| Can't tell if split helps | You added agents but quality didn't visibly improve | Run the benchmark | Compare single-agent vs. multi-agent on the same eval suite |

Walkthrough

When to split: the eval-driven decision

The instinct to split a system into multiple agents usually comes from feeling that one agent is doing too much. That feeling is valid, but it's not sufficient. You need eval evidence.

Look at your trace label distribution from the previous lesson. Three signals suggest a split will help:

  1. Route-dependent quality gaps. If code questions score 85% but docs questions score 45%, the single agent's prompt may be trying to serve two different tasks with incompatible instructions. A specialist for each domain can carry a more focused prompt.

  2. Correct-but-wasteful traces. If 20%+ of traces are correct_but_wasteful, the system is doing unnecessary work. An orchestrator that routes to the right specialist can skip irrelevant retrieval and tool calls.

  3. Conflicting tool-use patterns. If the tool grader shows the system calling search_code for documentation questions (or vice versa), routing logic that activates the right tools for each task type will help more than prompt tweaks.

If none of these signals are present (if your single agent scores well across question types and the waste rate is low), you don't need orchestration yet. The complexity cost of multi-agent coordination is real, and it only pays off when the single-agent system has measurable limitations.
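
The first signal is easy to compute directly from a graded run log. A minimal sketch, assuming each graded record is one JSON object per line with a `question_type` field and a boolean `correct` field; rename these to match your harness's actual fields:

```python
import json
from collections import defaultdict


def per_route_accuracy(path: str) -> dict[str, float]:
    """Accuracy per question type from a graded JSONL run log."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            qtype = record.get("question_type", "unknown")
            totals[qtype] += 1
            hits[qtype] += bool(record.get("correct"))
    # Per-type accuracy: fraction of correct answers for each question type
    return {qtype: hits[qtype] / totals[qtype] for qtype in totals}
```

A result like `{'code': 0.85, 'docs': 0.45}` is exactly the route-dependent quality gap described above.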

What orchestration actually controls

Regardless of the framework, orchestration makes five things explicit:

  • Routing: which agent or path handles this request
  • Delegation: what context and instructions the subagent receives
  • State: what information persists across steps
  • Retries: what happens when a subagent fails or returns low-quality output
  • Synthesis: how outputs from multiple subagents combine into a final response

These five decisions exist whether you implement them in LangGraph, the OpenAI Agents SDK, plain Python functions, or any other framework. The framework gives you structure; the decisions are yours.
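
To see that the five decisions are independent of any framework, here is a toy orchestrator in plain Python. Everything in it (the keyword routing, the stand-in quality check) is illustrative, not part of the project:

```python
def orchestrate(question: str, specialists: dict, max_retries: int = 1) -> str:
    """Toy orchestrator that makes the five decisions explicit."""
    # State: what persists across steps
    state: dict = {"question": question, "outputs": []}

    # Routing: which path handles this request (toy keyword heuristic)
    route = "code" if "function" in question.lower() else "docs"

    # Delegation + retries: invoke the specialist, retry on empty output
    for _ in range(max_retries + 1):
        output = specialists[route](question)
        if output.strip():  # Crude quality check standing in for a grader
            state["outputs"].append(output)
            break

    # Synthesis: combine specialist output into the final response
    return f"[{route}] {' '.join(state['outputs'])}"
```

Swapping in LangGraph replaces the plumbing, not the decisions.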

Default: LangGraph

Why this is the default: You've already worked with LangGraph in the framework-agent lesson from Module 3. Now we'll use its graph structure to make multi-agent control flow explicit and inspectable. The graph representation maps naturally to the trace visualization you're already using, which means debugging multi-agent behavior uses the same tools.

Portable concept underneath: Orchestration is explicit control over routing, delegation, state, retries, and synthesis. Any framework that lets you define these five things clearly will work.

Closest alternatives and when to switch:

  • Claude Agent SDK: use when you're building primarily with Claude models and want first-party support for tool use, multi-turn conversations, and agent orchestration within Anthropic's ecosystem. See platform.claude.com/docs for the SDK documentation.
  • OpenAI Agents SDK: use when you want lightweight handoffs, sessions, and built-in tracing with a simpler abstraction than a full graph runtime. Works well for systems where agents pass control linearly rather than in complex branching patterns.
  • No multi-agent split: stay with the single agent if evals don't show real gains from specialization. This is always a valid choice.

Building the orchestrator

We'll start with the simplest useful orchestrator: a router node that classifies the incoming question, specialist nodes that handle each question type, and a synthesis node that assembles the final response.

project/
├── orchestration/
│   ├── __init__.py
│   ├── graph.py            # The orchestration graph
│   ├── router.py           # Routing logic
│   ├── specialists.py      # Specialist agent definitions
│   └── synthesis.py        # Output assembly
├── harness/                # Existing eval harness
├── retrieval/              # Existing retrieval layer
└── observability/          # Existing tracing

First, define the state that flows through the graph:

# orchestration/graph.py
"""Multi-agent orchestration graph.

Routes incoming questions to specialist agents and synthesizes
their outputs into a final response.
"""
from __future__ import annotations

from typing import TypedDict

from langgraph.graph import StateGraph, END


class OrchestratorState(TypedDict, total=False):
    """State passed between nodes in the orchestration graph.

    total=False: a run starts with just the question; the other
    keys are filled in as nodes execute. A TypedDict keeps dict-style
    access (state["question"]) consistent across all node functions.
    """
    question: str
    route: str                  # Which specialist handles this
    specialist_output: str      # Raw specialist response
    evidence: list[str]
    tools_called: list[str]
    final_answer: str
    confidence: float
    retry_count: int

Next, the router. This extends the routing logic you built in the retrieval module, but now it chooses a specialist rather than a retrieval mode:

# orchestration/router.py
"""Route incoming questions to the appropriate specialist.

Uses an LLM classifier to determine which specialist should
handle the question. Falls back to the general path for
ambiguous questions.
"""
from openai import OpenAI

client = OpenAI()

ROUTE_PROMPT = """Classify this question about a code repository.
Choose exactly one route:

- "code": questions about specific code, functions, classes, or implementation details
- "docs": questions about documentation, architecture, or high-level design
- "debug": questions about errors, failures, test output, or debugging
- "general": questions that don't clearly fit the above categories

Question: {question}

Route:"""


def route_question(state: dict) -> dict:
    """Classify the question and set the route."""
    response = client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[
            {"role": "user", "content": ROUTE_PROMPT.format(question=state["question"])}
        ],
        max_tokens=10,
    )
    raw_route = response.choices[0].message.content or ""
    route = raw_route.splitlines()[0] if raw_route else ""
    # Some models answer with `Route: debug` instead of just `debug`.
    if ":" in route:
        route = route.split(":", 1)[1]
    route = route.strip().lower().strip('"')

    valid_routes = {"code", "docs", "debug", "general"}
    if route not in valid_routes:
        route = "general"

    return {**state, "route": route}
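
Before wiring the router into a graph, it's worth measuring how often it agrees with your expected routes. A small helper written against any routing function, so you can check the harness itself with a stub before spending API calls (the labeled-pairs format is an assumption, not a harness requirement):

```python
def routing_accuracy(route_fn, labeled: list[tuple[str, str]]) -> float:
    """Fraction of questions routed to their expected specialist.

    route_fn: takes a state dict, returns a state dict with a "route" key.
    labeled: (question, expected_route) pairs.
    """
    hits = 0
    for question, expected in labeled:
        result = route_fn({"question": question})
        if result.get("route") == expected:
            hits += 1
    return hits / len(labeled) if labeled else 0.0
```

Pass `route_question` and ~20 labeled questions from your benchmark to get the routing accuracy exercise 2 asks for.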

Now wire the graph:

# orchestration/graph.py (continued)

from orchestration.router import route_question
from orchestration.specialists import code_specialist, docs_specialist, debug_specialist, general_specialist
from orchestration.synthesis import synthesize


def route_to_specialist(state: dict) -> str:
    """Edge function: route to the appropriate specialist node."""
    return state["route"]


def build_orchestration_graph():
    """Build and compile the orchestration graph."""
    graph = StateGraph(OrchestratorState)

    # Nodes
    graph.add_node("router", route_question)
    graph.add_node("code", code_specialist)
    graph.add_node("docs", docs_specialist)
    graph.add_node("debug", debug_specialist)
    graph.add_node("general", general_specialist)
    graph.add_node("synthesize", synthesize)

    # Edges
    graph.set_entry_point("router")
    graph.add_conditional_edges(
        "router",
        route_to_specialist,
        {"code": "code", "docs": "docs", "debug": "debug", "general": "general"},
    )

    # All specialists flow to synthesis
    for specialist in ["code", "docs", "debug", "general"]:
        graph.add_edge(specialist, "synthesize")

    graph.add_edge("synthesize", END)

    return graph.compile()

The specialist implementations go in the next lesson. For now, the key insight is the shape: a routing decision at the front, specialist processing in the middle, and synthesis at the end. This pattern is the same whether you have two specialists or ten.
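
If you want to run the graph end-to-end before the next lesson, placeholder specialists are enough. A hypothetical sketch of what `specialists.py` and `synthesis.py` could hold for now; each specialist just tags the question:

```python
# orchestration/specialists.py (placeholder versions)

def _placeholder(tag: str):
    """Build a stand-in specialist that tags the question instead of answering."""
    def specialist(state: dict) -> dict:
        question = state.get("question", "")
        return {"specialist_output": f"[{tag}] would answer: {question}"}
    return specialist

code_specialist = _placeholder("code")
docs_specialist = _placeholder("docs")
debug_specialist = _placeholder("debug")
general_specialist = _placeholder("general")


# orchestration/synthesis.py (placeholder version)

def synthesize(state: dict) -> dict:
    """Pass the specialist output through as the final answer, unchanged."""
    return {"final_answer": state.get("specialist_output", "")}
```

With these in place the graph compiles and runs, and every trace will show which route was chosen, which is all exercise 3 needs.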

Tracing the orchestrated system

Multi-agent systems generate more complex traces. Each specialist call is a sub-trace within the larger orchestration trace. We'll extend the tracing from Module 6 to capture this structure:

# observability/traced_orchestration.py
"""Traced orchestration pipeline.

Wraps the orchestration graph with Langfuse tracing so each
routing decision and specialist call is visible in the trace.
"""
from langfuse import Langfuse
from orchestration.graph import build_orchestration_graph

langfuse = Langfuse()

graph = build_orchestration_graph()


def traced_orchestrated_pipeline(
    question: str,
    run_id: str = "",
) -> dict:
    """Run the orchestration graph with full tracing."""
    trace = langfuse.trace(
        name="orchestrated-pipeline",
        metadata={"run_id": run_id},
        input={"question": question},
    )

    # Route
    route_span = trace.span(name="routing")
    state = {"question": question}
    # The graph handles routing internally, but we trace the decision
    result = graph.invoke(state)
    route_span.end(output={"route": result.get("route", "unknown")})

    # Log specialist call
    specialist_span = trace.span(
        name=f"specialist-{result.get('route', 'unknown')}",
        input={"question": question, "route": result.get("route")},
    )
    specialist_span.end(output={"specialist_output": result.get("specialist_output", "")[:200]})

    trace.update(
        output={"answer": result.get("final_answer", ""), "route": result.get("route", "")},
    )

    langfuse.flush()
    return result

When you inspect traces in Langfuse, you'll see the routing decision as a span, the specialist call as a child span, and the synthesis as the final span. This structure makes it straightforward to debug routing errors. You can see exactly which specialist was chosen and what it produced.

Measuring the split

This is the most important step, and the one most often skipped. Run your existing benchmark against the orchestrated system and compare:

# Run the benchmark through the orchestrated pipeline
python harness/run_harness.py --pipeline orchestrated

# Grade with the same graders
python harness/graders/answer_grader.py harness/runs/latest.jsonl
python harness/graders/tool_grader.py harness/runs/latest.jsonl
python harness/graders/trace_labeler.py harness/runs/latest-graded.jsonl

# Compare against the single-agent baseline
python harness/compare_runs.py \
    harness/runs/single-agent-baseline.jsonl \
    harness/runs/latest-graded-traced.jsonl

What to look for in the comparison:

  • Overall accuracy: Did it go up, stay flat, or drop? If it dropped, the split is hurting.
  • Per-route accuracy: Did code questions improve while docs questions stayed the same? That suggests the code specialist is working but the docs specialist needs tuning.
  • Waste rate: Did the correct_but_wasteful percentage drop? If routing works well, specialists should avoid the unnecessary work that inflated the single agent's traces.
  • Cost and latency: Orchestration adds overhead (the routing call, inter-agent communication). Is the quality improvement worth the cost?

If the orchestrated system doesn't beat the single-agent baseline on at least one eval dimension without degrading the others, reconsider the split. It's always valid to go back to a single agent.
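
That acceptance rule ("better on at least one dimension without degrading the others") can be encoded directly. A hypothetical helper, where both dicts map the same metric names to higher-is-better scores (invert waste rate before passing) and `tolerance` absorbs ordinary eval noise:

```python
def split_is_justified(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> bool:
    """True if the candidate beats the baseline on at least one metric
    without falling more than `tolerance` below it on any other."""
    improved = any(candidate[m] > baseline[m] + tolerance for m in baseline)
    degraded = any(candidate[m] < baseline[m] - tolerance for m in baseline)
    return improved and not degraded
```

The tolerance matters: with a small benchmark, a one-question swing can look like a regression, so pick a threshold informed by how much your scores vary between identical runs.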

Exercises

  1. Review your trace label distribution from the previous module. Identify which question types have the most wrong_route or correct_but_wasteful labels. Write down two or three categories that would benefit from specialist handling.

  2. Build the routing classifier using the pattern above. Test it on 20 questions from your benchmark and check whether it agrees with your expected routes. What's the routing accuracy?

  3. Wire the orchestration graph with placeholder specialists (each specialist can just return the question with a tag for now). Run it end-to-end and verify that traces show the routing decision and specialist selection correctly.

  4. Run 10 benchmark questions through both the single-agent pipeline and the orchestrated pipeline. Compare the trace structures side-by-side in Langfuse. Where does orchestration add visible overhead?

  5. Write a brief evaluation memo: based on the routing accuracy and trace comparison, does orchestration look like it will improve the system? What specialist would you build first, and why?

Completion checkpoint

You have:

  • A routing classifier that assigns incoming questions to specialist categories with >80% agreement against your manual labels
  • An orchestration graph with routing, specialist nodes (even if they're placeholders), and synthesis
  • Traced execution that shows routing decisions and specialist calls as separate spans
  • A side-by-side comparison of single-agent vs. orchestrated traces for at least 10 questions
  • A written evaluation of whether orchestration is justified for your system

Reflection prompts

  • Looking at your trace labels, which question types have the clearest case for specialist handling? Which types are ambiguous?
  • What's the minimum number of specialists that would improve your system? What's the risk of adding more than that?
  • If the orchestrated system scores the same as the single agent, what would you try before adding more specialists?

What's next

Specialists and Routers. The orchestration skeleton is in place; the next lesson fills in the specialists and router contracts that make it real.

References

Build with this

  • LangGraph documentation — the graph runtime we use for orchestration, with examples of multi-agent patterns
  • OpenAI Agents SDK — alternative orchestration framework with built-in handoffs and tracing

