Module 7: Orchestration and Memory

Designing Specialists and Routers

The previous lesson established when to split and built the orchestration skeleton. Now we'll fill in the specialists. A good specialist is narrow, measurable, and independently testable. A bad specialist is a vague prompt with broad tool access, essentially the single agent again with extra plumbing.

This lesson walks through the first specialist split, shows how to design prompts and tool access for each role, and, critically, benchmarks the specialist system against the single-agent baseline to verify that the split actually helps.

What you'll learn

  • Design specialist agents with narrow scope, constrained tool access, and clear input/output contracts
  • Build a router that classifies queries and activates the right specialist with appropriate context
  • Add a human approval point for side-effecting actions like file writes or external API calls
  • Test specialists in isolation before composing them through the orchestrator
  • Benchmark the specialist split against the single-agent system and interpret the results

Concepts

Specialist agent — an agent designed for a narrow task with a focused system prompt, limited tool access, and a defined output format. The key property of a good specialist is that it can be tested in isolation: you can pass it an input, get an output, and grade that output without running the full orchestration pipeline. This makes specialists easier to debug and improve than a single agent that handles everything.

Prompt isolation — the practice of giving each specialist its own system prompt rather than sharing one large prompt. Prompt isolation prevents the problem we identified in the previous lesson: conflicting instructions that try to serve multiple tasks simultaneously. The code explainer's prompt says "be precise about line numbers and function signatures." The docs specialist's prompt says "synthesize high-level architecture decisions." These instructions would conflict in a single prompt.

Tool scoping — restricting each specialist's tool access to only the tools it needs. The code explainer gets search_code and read_file. The docs specialist gets search_docs. The debug specialist gets search_code, read_file, and run_tests. Tool scoping reduces unnecessary tool calls (a problem we measured with the tool grader) and limits the blast radius when a specialist misbehaves.

Human approval gate — a checkpoint where the system pauses and asks for human confirmation before executing a side-effecting action. We add approval gates for actions that modify the repository, call external APIs, or could affect production systems. The approval gate is a node in the orchestration graph — the system routes to it, pauses, and resumes when the human confirms or rejects the action.

Specialist benchmark — an eval that tests a specialist in isolation, outside the orchestration pipeline. Specialist benchmarks use a subset of your existing benchmark questions filtered to the specialist's domain. The code specialist runs against code questions only, the docs specialist against docs questions only. This lets you improve each specialist independently before measuring the composed system.

Problem-to-Tool Map

For each problem class: the symptom, the cheapest thing to try first, and the tool or approach to use.

  • Specialist prompt too broad. Symptom: specialist behaves like the old single agent. Try first: narrow the system prompt. Approach: remove irrelevant instructions; constrain to one task type.
  • Wrong specialist activated. Symptom: router sends questions to the wrong specialist. Try first: check routing accuracy. Approach: improve the classification prompt or add few-shot examples.
  • Specialist calls unnecessary tools. Symptom: tool grader shows tools outside the specialist's scope. Try first: review tool access. Approach: remove tools the specialist doesn't need.
  • Side-effecting actions run unsupervised. Symptom: system writes files or calls APIs without confirmation. Try first: add a manual check before deploy. Approach: human approval gate in the graph.
  • Can't tell if the split helped. Symptom: no clear quality difference between single-agent and specialist. Try first: run the benchmark on both. Approach: side-by-side comparison on the same eval suite.

Walkthrough

The first specialist split

Based on the routing categories from the previous lesson, here's the first specialist split that works well for a code assistant:

Orchestrator / Router — classifies the incoming question, selects the specialist, passes context, and synthesizes the final response. The orchestrator doesn't answer questions itself. Its job is to make good routing decisions and combine specialist outputs.

Retriever / Evidence assembler — handles questions that need evidence from the codebase or documentation. This specialist runs retrieval, ranks evidence, and returns a structured evidence package. It doesn't generate the final answer — that happens in synthesis.

Code explainer — handles questions about specific code: what a function does, how a class works, why a pattern was chosen. Gets search_code and read_file tools with a prompt tuned for precise, citation-heavy explanations.

Test / Debug specialist — handles questions about failures, test output, and debugging. Gets search_code, read_file, and run_tests tools with a prompt tuned for diagnostic reasoning: identify the failure, locate the relevant code, explain the root cause.

These four specialists aren't a universal recipe. They're a starting point based on the question types that showed the most variation in our eval results. Your split might look different depending on where your single-agent system struggles.
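
The router itself was built in the previous lesson. For exercising the split locally without API calls, a keyword-based fallback classifier can stand in for the LLM router. This is a hypothetical development shim, not the production routing approach, and the keyword lists are illustrative:

```python
# Hypothetical keyword-based fallback router for local testing only.
# The production router from the previous lesson uses an LLM classifier;
# this shim just lets you exercise the specialists offline.

ROUTE_KEYWORDS = {
    "debug": ("error", "traceback", "failing", "test fails", "exception"),
    "docs": ("architecture", "design decision", "documentation", "why does the system"),
    "code": ("function", "class", "implementation", "what does", "how does"),
}


def fallback_route(question: str) -> str:
    """Return the first route whose keywords match; default to 'general'."""
    q = question.lower()
    for route, keywords in ROUTE_KEYWORDS.items():
        if any(k in q for k in keywords):
            return route
    return "general"
```

Because routes are checked in order, debug symptoms win over generic code words, which matches how an LLM router should prioritize failure-related questions.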

Designing specialist prompts

Each specialist gets its own system prompt. The prompt should be short, specific, and impossible to confuse with another specialist's job:

# orchestration/specialists.py
"""Specialist agent implementations.

Each specialist has:
- A focused system prompt
- Scoped tool access
- A defined output format
"""


# --- System prompts ---

CODE_EXPLAINER_PROMPT = """You are a code explanation specialist for a repository assistant.

Your job: explain specific code — functions, classes, patterns, and implementation details.

Rules:
- Always cite file paths and line numbers when referencing code.
- Quote the relevant code snippet before explaining it.
- If you need to see more context, use the read_file tool.
- If you can't find the code the user is asking about, say so explicitly.
- Do NOT answer questions about documentation, architecture, or debugging.
  Those go to other specialists.

Output format:
- Start with a one-sentence summary of what the code does.
- Then provide the detailed explanation with code references.
- End with any caveats or edge cases you noticed."""

DOCS_SPECIALIST_PROMPT = """You are a documentation specialist for a repository assistant.

Your job: answer questions about documentation, architecture, design decisions,
and high-level system behavior.

Rules:
- Cite specific documentation files when answering.
- If the answer requires code-level detail, say so — the code specialist
  will handle it.
- Synthesize information across multiple docs when the question spans topics.
- Do NOT answer questions about specific function implementations or debugging.

Output format:
- Start with a direct answer to the question.
- Then provide supporting evidence from the documentation.
- Note any gaps where documentation is missing or unclear."""

DEBUG_SPECIALIST_PROMPT = """You are a debugging specialist for a repository assistant.

Your job: diagnose failures, explain test output, and help debug issues.

Rules:
- Start with the error message or failure symptom.
- Locate the relevant code using search and file reading.
- Provide a root cause analysis with evidence.
- If you can suggest a fix, do so — but flag any fix that modifies files
  as requiring human approval.
- Do NOT answer general code questions or documentation questions.

Output format:
- Error/symptom summary
- Relevant code location(s)
- Root cause analysis
- Suggested fix (if applicable, flagged for approval if side-effecting)"""

EVIDENCE_ASSEMBLER_PROMPT = """You are an evidence assembler for a repository assistant.

Your job: retrieve and organize evidence from the codebase and documentation
to support answering a question. You do NOT generate the final answer.

Rules:
- Use search tools to find relevant code and documentation.
- Rank evidence by relevance to the question.
- Return structured evidence with file paths, snippets, and relevance notes.
- If retrieval finds nothing relevant, say so explicitly.

Output format:
Return a JSON object with:
- "evidence": list of {file, snippet, relevance} objects
- "confidence": float 0-1 indicating how well the evidence covers the question
- "gaps": list of aspects the evidence doesn't cover"""

Notice the pattern: each prompt defines the specialist's scope ("your job"), its boundaries ("do NOT answer..."), and its output format. This structure makes specialists independently testable and prevents them from drifting into each other's territory.
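
One convenient way to keep that pattern honest is a single registry tying each route to its prompt and tool names, so a specialist's prompt and tool scope can't drift apart. This is an organizational suggestion, not part of the lesson's required code; short placeholder strings stand in for the full prompts above:

```python
# Hypothetical registry pattern: one mapping from route name to the
# specialist's prompt and allowed tool names. Placeholder prompt strings
# stand in for the full prompts defined above.

CODE_EXPLAINER_PROMPT = "You are a code explanation specialist..."  # placeholder
DOCS_SPECIALIST_PROMPT = "You are a documentation specialist..."    # placeholder
DEBUG_SPECIALIST_PROMPT = "You are a debugging specialist..."       # placeholder

SPECIALISTS = {
    "code": {"prompt": CODE_EXPLAINER_PROMPT, "tools": ["search_code", "read_file"]},
    "docs": {"prompt": DOCS_SPECIALIST_PROMPT, "tools": ["search_docs"]},
    "debug": {"prompt": DEBUG_SPECIALIST_PROMPT, "tools": ["search_code", "read_file", "run_tests"]},
}


def tools_for(route: str) -> list[str]:
    """Look up the tool names a route's specialist is allowed to use."""
    return SPECIALISTS[route]["tools"]
```

A registry like this also gives the router and the isolated-testing harness one source of truth for which tools each specialist should be exercising.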

Scoping tool access

Each specialist should only have access to the tools it needs:

# orchestration/specialists.py (continued)

# --- Tool definitions ---

CODE_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search_code",
            "description": "Search the codebase for functions, classes, or patterns",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"},
                    "file_pattern": {"type": "string", "description": "Optional glob pattern to filter files"},
                },
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read a specific file or range of lines",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "File path"},
                    "start_line": {"type": "integer", "description": "Optional start line"},
                    "end_line": {"type": "integer", "description": "Optional end line"},
                },
                "required": ["path"],
            },
        },
    },
]

DOCS_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search_docs",
            "description": "Search project documentation",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"},
                },
                "required": ["query"],
            },
        },
    },
]

DEBUG_TOOLS = CODE_TOOLS + [
    {
        "type": "function",
        "function": {
            "name": "run_tests",
            "description": "Run a specific test file or test function",
            "parameters": {
                "type": "object",
                "properties": {
                    "test_path": {"type": "string", "description": "Test file or function path"},
                    "verbose": {"type": "boolean", "description": "Show detailed output"},
                },
                "required": ["test_path"],
            },
        },
    },
]

The code explainer gets search_code and read_file. The docs specialist gets only search_docs. The debug specialist gets the code tools plus run_tests. With scoping in place, a specialist can't call tools outside its set at all, so if the tool grader still flags out-of-scope tool use, the question was routed to the wrong specialist: a routing error, not a tool-use error.
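
As a belt-and-braces check during development, you can also wrap the tool executor so that any out-of-scope call fails loudly instead of silently executing. This `make_scoped_executor` helper is a hypothetical addition, not part of the lesson's codebase:

```python
# Hypothetical guard: wrap a tool executor so any call outside the
# specialist's allowed set raises instead of executing. Surfaces routing
# and scoping bugs immediately during development.

def make_scoped_executor(executor, allowed_tools: set[str]):
    def scoped(tool_name: str, arguments: str):
        if tool_name not in allowed_tools:
            raise PermissionError(
                f"Tool '{tool_name}' is outside this specialist's scope "
                f"(allowed: {sorted(allowed_tools)})"
            )
        return executor(tool_name, arguments)
    return scoped
```

Usage would look like `docs_executor = make_scoped_executor(execute_tool, {"search_docs"})`, passed to the docs specialist in place of the raw executor.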

Calling specialists from the graph

Each specialist call follows the same pattern: build messages with the specialist prompt, call the API with scoped tools, handle tool calls, and return the result:

# orchestration/specialists.py (continued)
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4.1-mini"


def call_specialist(
    question: str,
    system_prompt: str,
    tools: list[dict],
    tool_executor: callable,
    thread_context: list[dict] | None = None,
    long_term_context: str = "",
) -> dict:
    """Call a specialist with its scoped prompt, tools, and memory context."""
    effective_prompt = system_prompt
    if long_term_context:
        effective_prompt += f"\n\n{long_term_context}"

    messages = [{"role": "system", "content": effective_prompt}]
    if thread_context:
        messages.extend(thread_context)
    messages.append({"role": "user", "content": question})

    tools_called = []
    max_turns = 5

    for _ in range(max_turns):
        response = client.chat.completions.create(
            model=MODEL,
            messages=messages,
            tools=tools if tools else None,
        )

        choice = response.choices[0]
        if choice.finish_reason == "tool_calls":
            messages.append(choice.message)
            for tool_call in choice.message.tool_calls:
                tools_called.append(tool_call.function.name)
                result = tool_executor(
                    tool_call.function.name,
                    tool_call.function.arguments,
                )
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": str(result),
                })
        else:
            return {
                "specialist_output": choice.message.content,
                "tools_called": tools_called,
                "model": MODEL,
            }

    # Fallback: the loop exhausted max_turns without a final text response.
    return {
        "specialist_output": "Specialist reached the tool-call turn limit without a final answer.",
        "tools_called": tools_called,
        "model": MODEL,
    }


def code_specialist(state: dict) -> dict:
    """Code explainer specialist node."""
    from tools.executor import execute_tool  # Your existing tool executor
    result = call_specialist(
        question=state["question"],
        system_prompt=CODE_EXPLAINER_PROMPT,
        tools=CODE_TOOLS,
        tool_executor=execute_tool,
        # Memory wiring: pass thread history and long-term context from state.
        # These fields are populated by the memory lessons (thread-and-workflow-memory,
        # long-term-memory). Until you build those layers, these will be empty.
        thread_context=state.get("thread_messages"),
        long_term_context=state.get("long_term_context", ""),
    )
    return {**state, **result}


def docs_specialist(state: dict) -> dict:
    """Documentation specialist node."""
    from tools.executor import execute_tool
    result = call_specialist(
        question=state["question"],
        system_prompt=DOCS_SPECIALIST_PROMPT,
        tools=DOCS_TOOLS,
        tool_executor=execute_tool,
        thread_context=state.get("thread_messages"),
        long_term_context=state.get("long_term_context", ""),
    )
    return {**state, **result}


def debug_specialist(state: dict) -> dict:
    """Debug specialist node."""
    from tools.executor import execute_tool
    result = call_specialist(
        question=state["question"],
        system_prompt=DEBUG_SPECIALIST_PROMPT,
        tools=DEBUG_TOOLS,
        tool_executor=execute_tool,
        thread_context=state.get("thread_messages"),
        long_term_context=state.get("long_term_context", ""),
    )
    return {**state, **result}


def general_specialist(state: dict) -> dict:
    """General-purpose fallback specialist."""
    from tools.executor import execute_tool
    result = call_specialist(
        question=state["question"],
        system_prompt="You are a general-purpose code repository assistant. Answer the question using available tools.",
        tools=CODE_TOOLS + DOCS_TOOLS,
        tool_executor=execute_tool,
        thread_context=state.get("thread_messages"),
        long_term_context=state.get("long_term_context", ""),
    )
    return {**state, **result}
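
Every node above follows the same contract: take the state dict in, return the state with the specialist's result merged on top. A stubbed example (no API call, purely to illustrate the contract the orchestrator relies on):

```python
# Demonstrates the state-in / state-out contract the specialist nodes
# follow, using a stub in place of the real API-backed specialist.

def stub_specialist(state: dict) -> dict:
    result = {
        "specialist_output": f"(stub answer to: {state['question']})",
        "tools_called": [],
        "model": "stub",
    }
    # Merge: original keys survive, specialist keys are added.
    return {**state, **result}


state = {"question": "What does parse_config do?", "route": "code"}
new_state = stub_specialist(state)
```

Because `{**state, **result}` builds a new dict, the incoming state is never mutated, which keeps nodes safe to retry and easy to test.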

Adding a human approval gate

Some actions have side effects, such as writing files, running commands that modify state, or calling external APIs. For these, we'll add an approval point in the orchestration graph. The system pauses, presents the proposed action to a human, and only proceeds with explicit confirmation:

# orchestration/approval.py
"""Human approval gate for side-effecting actions.

Pauses the orchestration pipeline and waits for human confirmation
before executing actions that modify the repository or external state.
"""

SIDE_EFFECTING_TOOLS = {"write_file", "run_command", "create_pr", "deploy"}


def needs_approval(state: dict) -> bool:
    """Check whether any proposed action requires human approval."""
    proposed_tools = set(state.get("proposed_tools", []))
    return bool(proposed_tools & SIDE_EFFECTING_TOOLS)


def request_approval(state: dict) -> dict:
    """Present the proposed action and wait for human confirmation.

    In a production system, this would send a notification (Slack, email,
    UI prompt) and wait for a response. For development, we use stdin.
    """
    proposed = state.get("proposed_action", "Unknown action")
    tools = state.get("proposed_tools", [])

    print("\n" + "=" * 50)
    print("APPROVAL REQUIRED")
    print("=" * 50)
    print(f"Action: {proposed}")
    print(f"Tools:  {', '.join(tools)}")
    print(f"Reason: {state.get('specialist_output', 'No explanation provided')[:200]}")
    print()

    response = input("Approve this action? [y/N]: ").strip().lower()
    approved = response in ("y", "yes")

    return {
        **state,
        "approved": approved,
        "approval_response": "approved" if approved else "rejected",
    }

To wire this into the graph, add a conditional edge after the debug specialist (since debugging is the most likely path to suggest side-effecting actions):

# In orchestration/graph.py, extend the graph:
from orchestration.approval import needs_approval, request_approval


def check_approval_needed(state: dict) -> str:
    """Route to approval gate if side-effecting tools are proposed."""
    if needs_approval(state):
        return "approval"
    return "synthesize"

# Register the approval node, then replace the direct debug -> synthesize edge:
# graph.add_edge("debug", "synthesize")  # Remove this
graph.add_node("approval", request_approval)
graph.add_conditional_edges(
    "debug",
    check_approval_needed,
    {"approval": "approval", "synthesize": "synthesize"},
)
graph.add_edge("approval", "synthesize")

The approval gate is a safety mechanism, not a bottleneck. It only activates for side-effecting actions, and the system continues normally for read-only questions.
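
A quick sanity check of the gate logic makes this concrete: read-only traces pass straight through, side-effecting ones stop. The function is restated from orchestration/approval.py above so the snippet runs standalone:

```python
# Sanity check of the approval routing logic (restated from
# orchestration/approval.py so this snippet is self-contained).

SIDE_EFFECTING_TOOLS = {"write_file", "run_command", "create_pr", "deploy"}


def needs_approval(state: dict) -> bool:
    proposed_tools = set(state.get("proposed_tools", []))
    return bool(proposed_tools & SIDE_EFFECTING_TOOLS)


read_only = {"proposed_tools": ["search_code", "read_file"]}   # no gate
proposed_fix = {"proposed_tools": ["read_file", "write_file"]}  # gate fires
```

A single side-effecting tool in the proposed set is enough to trigger the gate; an empty or missing list never does.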

Teaching simplification vs. production implementation. The input() gate above demonstrates the concept — human gates before side-effecting actions. In production, LangGraph's interrupt() function with a checkpointer and thread_id provides durable pause/resume: the graph serializes its state, the process can shut down, and execution resumes from the same point when the human responds (via a UI callback, Slack action, etc.). The portable idea is that side-effecting actions require explicit human approval. The input() version works for local development and testing; interrupt() with a checkpointer is what you'll use when deploying.

Benchmarking the specialist split

Now the critical step: measure whether specialists actually improve the system.

# 1. Run the single-agent baseline (if you haven't recently)
python harness/run_harness.py --pipeline single-agent
python harness/graders/answer_grader.py harness/runs/single-agent-latest.jsonl
python harness/graders/trace_labeler.py harness/runs/single-agent-latest-graded.jsonl

# 2. Run the specialist system
python harness/run_harness.py --pipeline orchestrated
python harness/graders/answer_grader.py harness/runs/orchestrated-latest.jsonl
python harness/graders/trace_labeler.py harness/runs/orchestrated-latest-graded.jsonl

# 3. Compare
python harness/compare_runs.py \
    harness/runs/single-agent-latest-graded-traced.jsonl \
    harness/runs/orchestrated-latest-graded-traced.jsonl

Look at the comparison across several dimensions:

  • Per-route accuracy. Do code questions score higher with the code specialist than with the single agent? What about docs and debug questions?
  • Tool precision. Does tool scoping reduce unnecessary tool calls?
  • Waste rate. Does routing reduce correct_but_wasteful traces?
  • Cost. What's the overhead of routing + specialist calls vs. the single agent?
  • Latency. Does the routing step add noticeable delay?

If the specialist system doesn't beat the single-agent baseline on at least one dimension without degrading the others, simplify. Remove specialists that don't help. Merge categories that the router can't reliably distinguish. The eval results will tell you what to do.
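
The per-route accuracy comparison can be computed directly from the graded run logs. A minimal sketch, assuming each JSONL line carries an `expected_route` and a boolean `correct` field; your grader's actual field names may differ:

```python
# Minimal per-route accuracy from a graded JSONL run log.
# Assumes each line has 'expected_route' and a boolean 'correct';
# adjust the field names to match your grader's output.
import json
from collections import defaultdict


def per_route_accuracy(path: str) -> dict[str, float]:
    totals = defaultdict(int)
    hits = defaultdict(int)
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            row = json.loads(line)
            route = row.get("expected_route", "unknown")
            totals[route] += 1
            hits[route] += bool(row.get("correct"))
    return {route: hits[route] / totals[route] for route in totals}
```

Run it on both graded files and compare the resulting dicts route by route; a split that helps should show a clear win on at least one route without losses elsewhere.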

Testing specialists in isolation

Before tuning the composed system, test each specialist on its own domain:

# harness/test_specialist.py
"""Test a single specialist against its domain-specific benchmark subset."""
import json
import sys

from orchestration.specialists import call_specialist, CODE_EXPLAINER_PROMPT, CODE_TOOLS
from tools.executor import execute_tool


def test_specialist(
    specialist_prompt: str,
    specialist_tools: list,
    benchmark_file: str,
    route_filter: str,
):
    """Run benchmark questions for a single specialist."""
    questions = []
    with open(benchmark_file) as f:
        for line in f:
            if line.strip():
                q = json.loads(line)
                if q.get("expected_route") == route_filter:
                    questions.append(q)

    print(f"Testing specialist ({route_filter}): {len(questions)} questions\n")

    for q in questions:
        result = call_specialist(
            question=q["question"],
            system_prompt=specialist_prompt,
            tools=specialist_tools,
            tool_executor=execute_tool,
        )
        print(f"  Q: {q['question'][:60]}")
        print(f"  Tools: {result['tools_called']}")
        print(f"  Answer: {result['specialist_output'][:100]}...")
        print()


if __name__ == "__main__":
    test_specialist(
        specialist_prompt=CODE_EXPLAINER_PROMPT,
        specialist_tools=CODE_TOOLS,
        benchmark_file="benchmark-questions.jsonl",
        route_filter="code",
    )

Isolated testing catches problems early. If the code specialist scores poorly on code questions in isolation, the problem is the specialist's prompt or tool access, not the router, orchestrator, or synthesis step. That narrows our debugging considerably.

Exercises

  1. Implement all four specialist prompts from this lesson. Test each one in isolation against its domain subset of your benchmark. Record the per-specialist accuracy.

  2. Add tool scoping to each specialist so they can only access their designated tools. Run the tool grader on the orchestrated system and compare unnecessary tool calls vs. the single-agent baseline.

  3. Add the human approval gate for side-effecting actions. Test it with a debug question that suggests modifying a file. Verify that the system pauses for approval and handles both approval and rejection correctly.

  4. Run the full benchmark through both the single-agent and orchestrated pipelines. Build a comparison table showing accuracy, tool precision, waste rate, cost, and latency for each. Does the specialist split help?

  5. Based on your comparison results, remove any specialist that doesn't improve its domain subset. Rerun the benchmark with the simplified system and verify that quality holds.

Completion checkpoint

You have:

  • Four specialist agents with isolated prompts and scoped tool access
  • Each specialist tested in isolation against its domain-specific benchmark subset
  • A human approval gate that activates for side-effecting actions
  • A side-by-side benchmark comparison of single-agent vs. specialist system
  • Evidence-based decisions about which specialists to keep, remove, or merge

Reflection prompts

  • Which specialist showed the biggest improvement over the single-agent baseline? What about the specialist's design made the difference?
  • Did any specialist perform worse than the single agent on its domain? What would you change?
  • The human approval gate adds friction. In a production system, how would you decide which actions require approval and which can proceed automatically?

What's next

Agent-to-Agent Interop. You have coordination inside one system now; the next lesson looks at what changes when agents have to talk across system boundaries.



Glossary
API (Application Programming Interface)Foundational terms
A structured way for programs to communicate. In this context, usually an HTTP endpoint you call to interact with an LLM.
AST (Abstract Syntax Tree)Foundational terms
A tree representation of source code structure. Used by parsers like Tree-sitter to understand code as a hierarchy of functions, classes, and statements. You'll encounter this more deeply in the Code Retrieval module, but the concept appears briefly in retrieval fundamentals.
BM25 (Best Match 25)Foundational terms
A classical ranking function for keyword search. Scores documents by term frequency and inverse document frequency. Often competitive with or complementary to vector search.
ChunkingFoundational terms
Splitting a document into smaller pieces for indexing and retrieval. Chunk boundaries significantly affect retrieval quality. Split at the wrong place and your retrieval will return half a function or the end of one paragraph glued to the start of another.
Context engineeringFoundational terms
The discipline of selecting, packaging, and budgeting the information a model sees at inference time. Prompts, retrieved evidence, tool results, memory, and state are all parts of context. Context engineering is arguably the core skill of AI engineering. Bigger context windows are not a substitute for better context selection.
Context rotFoundational terms
Degradation of output quality caused by stale, noisy, or accumulated context. Symptoms include stale memory facts, conflicting retrieved evidence, bloated prompt history, and accumulated instructions that contradict each other. A form of technical debt in AI systems.
Context windowFoundational terms
The maximum number of tokens an LLM can process in a single request (input + output combined).
EmbeddingFoundational terms
A fixed-length numeric vector representing a piece of text. Used for similarity search: texts with similar meanings have nearby embeddings.
EndpointFoundational terms
A specific URL path that accepts requests and returns responses (e.g., POST /v1/chat/completions).
GGUFFoundational terms
A file format for quantized models used by llama.cpp and Ollama. When you see a model name like qwen2.5:7b-q4_K_M, the suffix indicates the quantization scheme. GGUF supports mixed quantization (different precision for different layers) and is the most common format for local inference.
HallucinationFoundational terms
When a model generates content that sounds confident but isn't supported by the evidence it was given, or fabricates details that don't exist. Not the same as "any wrong answer"; a model that misinterprets ambiguous instructions gave a bad answer but didn't hallucinate. Common causes: weak prompt, missing context, context rot, model limitation, or retrieval failure.
InferenceFoundational terms
Running a trained model to generate output from input. What happens when you call an API. Most AI engineering work is inference-time work: building systems around models, not training them. Use "inference," not "inferencing."
JSON (JavaScript Object Notation)Foundational terms
A lightweight text format for structured data. The lingua franca of API communication.
Lexical searchFoundational terms
Finding items by matching keywords or terms. Includes BM25, TF-IDF (Term Frequency–Inverse Document Frequency), and simple keyword matching. Returns exact term matches, not semantic similarity.
LLM (Large Language Model)Foundational terms
A neural network trained on large text corpora that generates text by predicting the next token. The core technology behind AI engineering; every tool, pattern, and pipeline in this curriculum runs on top of one.
MetadataFoundational terms
Structured information about a document or chunk (file path, language, author, date, symbol type). Used for filtering retrieval results.
Neural networkFoundational terms
A computing system loosely inspired by biological neurons, built from layers of mathematical functions that transform inputs into outputs. LLMs are a specific type of neural network (transformers) trained on text. You don't need to understand neural network internals to do AI engineering, but knowing the term helps when reading external resources.
Reasoning modelFoundational terms
A model optimized for complex multi-step planning, math, and logic (e.g., o3, o4-mini). Slower and more expensive but better on hard problems. Sometimes called "LRM" (large reasoning model), but "reasoning model" is the more consistent term across provider docs.
RerankingFoundational terms
A second-pass scoring step that re-orders retrieved results using a more expensive model. Improves precision after an initial broad retrieval.
SchemaFoundational terms
A formal description of the shape and types of a data structure. Used to validate inputs and outputs.
SLM (small language model)Foundational terms
A compact model (typically 1-7B parameters) that runs on consumer hardware with lower cost, latency, and better privacy (e.g., Phi, small Llama variants, Gemma). The right choice when privacy, offline operation, predictable cost, or low latency matter more than peak capability.
System promptFoundational terms
A special message that sets the model's behavior, role, and constraints for a conversation.
TemperatureFoundational terms
A parameter controlling output randomness. Lower values produce more deterministic output; higher values produce more varied output. Does not affect the model's intelligence.
TokenFoundational terms
The basic unit an LLM processes. Not a word. Tokens are sub-word fragments. "unhappiness" might be three tokens: "un", "happi", "ness". Token count determines cost and context window usage.
Top-kFoundational terms
The number of results returned from a retrieval query. "Top-5" means the five highest-scoring results.
Top-p (nucleus sampling)Foundational terms
An alternative to temperature for controlling output diversity. Selects from the smallest set of tokens whose cumulative probability exceeds p.
Vector searchFoundational terms
Finding items by proximity in embedding space (nearest neighbors). Returns "similar" results, not "exact match" results.
vLLM (virtual LLM)Foundational terms
An inference serving engine (not a model) that hosts open-weight models behind an OpenAI-compatible HTTP endpoint. Infrastructure layer, not model layer. Relevant when moving from hosted APIs to self-hosting.
WeightsFoundational terms
The learned parameters inside a model. Changed during training, fixed during inference.
Workhorse modelFoundational terms
A general-purpose LLM optimized for speed and broad capability (e.g., GPT-4o-mini, Claude Haiku, Gemini Flash). The default for most tasks. When someone says "LLM" without qualification, they usually mean this.
BaselineBenchmark and Harness terms
The first measured performance of your system on a benchmark. Everything else is compared against this. Without a baseline, you can't tell whether a change helped.
BenchmarkBenchmark and Harness terms
A fixed set of questions or tasks with known-good answers, used to measure system performance over time.
Run logBenchmark and Harness terms
A structured record (typically JSONL) of every system run: what input was given, what output was produced, what tools were called, how long it took, and what it cost. The raw data that evals, telemetry, and cost analysis are built from.
Agent and Tool Building terms

A2A (Agent-to-Agent protocol) — An open protocol for peer-to-peer agent collaboration. Agents discover each other's capabilities and delegate or negotiate tasks as equals. Different from MCP (which connects agents to tools, not to other agents) and from handoffs (which transfer control within one system).

Agent — A system where an LLM decides which tools to call, observes results, and iterates until a task is complete. Agent = model + tools + control loop.

Control loop — The code that manages the agent's cycle: send prompt, check for tool calls, execute tools, append results, repeat or finish.

Handoff — Passing control from one agent or specialist to another within an orchestrated system.

MCP (Model Context Protocol) — An open protocol for exposing tools, resources, and prompts to AI applications in a standardized way. Connects agents to capabilities (tools and data), not to other agents.

Tool calling / function calling — The model's ability to request execution of a specific function with structured arguments, rather than just generating text.
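The control loop and tool calling fit together as sketched below, with the model call stubbed out. `fake_model` and the `add` tool are invented stand-ins for a real provider API; only the loop shape matters.

```python
# Skeleton of an agent control loop: send prompt, check for tool
# calls, execute tools, append results, repeat or finish.

def fake_model(messages):
    # Stand-in for the LLM: ask for a tool on the first turn,
    # then answer once a tool result is in the conversation.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "add", "args": {"a": 2, "b": 3}}}
    return {"content": "The sum is 5."}

TOOLS = {"add": lambda a, b: a + b}

def run_agent(user_input: str) -> str:
    messages = [{"role": "user", "content": user_input}]
    while True:
        reply = fake_model(messages)
        call = reply.get("tool_call")
        if call is None:                      # no tool call: finished
            return reply["content"]
        result = TOOLS[call["name"]](**call["args"])  # execute the tool
        messages.append({"role": "tool", "content": str(result)})

print(run_agent("What is 2 + 3?"))   # -> The sum is 5.
```

Swapping `fake_model` for a real API call (and adding iteration limits and error handling) turns this skeleton into a working agent.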
Code Retrieval terms

Context compilation / context packing — The process of selecting and assembling the smallest useful set of evidence for a specific task. Not "dump everything retrieved into the prompt."

Grounding — Tying model assertions to specific evidence. A grounded answer cites what it found; an ungrounded answer asserts without evidence.

Hybrid retrieval — Combining multiple retrieval methods (e.g., vector search + keyword search + metadata filters) and merging or reranking the results.
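One common way to merge the result lists is reciprocal rank fusion; the document IDs below are illustrative:

```python
# Reciprocal rank fusion (RRF): score each document by the sum of
# 1/(k + rank) across rankings, so items that appear high in
# multiple lists rise to the top.

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_b", "doc_d", "doc_a"]
print(rrf([vector_hits, keyword_hits]))   # doc_b and doc_a rise to the top
```

RRF needs no score normalization across methods, which is why it is a popular default for hybrid setups.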
Knowledge graph — A data structure that stores entities and their relationships explicitly (e.g., "function A calls function B," "module X imports module Y"). Useful for traversal and dependency reasoning. One retrieval strategy among several, often overused when simpler metadata or adjacency tables would suffice.

RAG (Retrieval-Augmented Generation) — A pattern where the model's response is grounded in retrieved external evidence rather than relying solely on its training data.

Symbol table — A mapping of code identifiers (functions, classes, variables) to their locations and metadata.
Tree-sitter — An incremental parsing library that builds ASTs for source code. Used in this curriculum for code-aware chunking and symbol extraction.
RAG and Grounded Answers terms

Context pack — A structured bundle of evidence assembled for a specific task, with metadata about provenance, relevance, and token budget.

Evidence bundle — A collection of retrieved items grouped for a specific sub-task, with enough metadata to evaluate whether the evidence is relevant and sufficient.

Retrieval routing — Deciding which retrieval strategy or method to use for a given query. Different questions need different retrieval methods.
Observability and Evals terms

Eval — A structured test that measures system quality. Not the same as training. Evals measure, they don't change the model.

Harness (AI harness / eval harness) — The experiment and evaluation framework around your model or agent. It runs benchmark tasks, captures outputs, logs traces, grades results, and compares system versions. It turns ad hoc "try it and see" into repeatable, comparable experiments. Typically includes: input dataset, prompt and tool configuration, model/provider selection, execution loop, logging, grading, and artifact capture.

LLM-as-judge — Using a language model to evaluate or grade the output of another model or system. Useful for scaling evaluation beyond manual review, but requires rubric quality, judge consistency checks, and human spot-checking. Not a replacement for exact-match checks where they apply.

OpenTelemetry (OTel) — An open standard for collecting and exporting telemetry data (traces, metrics, logs). Vendor-agnostic.

RAGAS — A specific eval framework for retrieval-augmented generation. Measures metrics like faithfulness, relevance, and context precision. One tool example, not a foundational concept. Learn the metrics first, then the tool.

Span — A single operation within a trace (e.g., one tool call, one retrieval query). Traces are made of spans.

Telemetry — Structured data about system behavior: what happened, when, how long it took, what it cost. Includes traces, metrics, and events.

Trace — A structured record of one complete run through the system, including all steps, tool calls, and decisions.
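A toy illustration of the trace/span relationship; real systems would use OpenTelemetry, and this sketch only shows the shape:

```python
# A trace as a list of spans: each span records one timed operation.
import time, uuid
from contextlib import contextmanager

trace: list[dict] = []

@contextmanager
def span(name: str):
    start = time.time()
    try:
        yield
    finally:
        # Record the span even if the operation raised.
        trace.append({"id": uuid.uuid4().hex, "name": name,
                      "duration_s": time.time() - start})

with span("retrieve"):
    time.sleep(0.01)           # stand-in for a retrieval query
with span("generate"):
    time.sleep(0.01)           # stand-in for the model call

print([s["name"] for s in trace])   # -> ['retrieve', 'generate']
```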
Orchestration and Memory terms

Long-term memory — Persistent facts that survive across conversations. Requires write policies to manage what gets stored, updated, or deleted.

Orchestration — Explicit control over how tasks are routed, delegated, and synthesized across multiple agents or specialists.

Router — A component that decides which specialist or workflow path to use for a given query.
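A minimal router might classify with keywords before escalating to an LLM classifier; the specialists and keywords below are invented for illustration:

```python
# Keyword-based router: classify the query, return the specialist
# that should handle it. Real routers often use an LLM call for
# classification; this shows only the routing contract.

SPECIALISTS = {
    "code": "code-search specialist",
    "docs": "documentation specialist",
    "default": "generalist",
}

def route(query: str) -> str:
    q = query.lower()
    if any(w in q for w in ("function", "class", "bug", "stack trace")):
        return SPECIALISTS["code"]
    if any(w in q for w in ("architecture", "design doc", "readme")):
        return SPECIALISTS["docs"]
    return SPECIALISTS["default"]   # fall through to the generalist

print(route("Why does this function throw?"))   # -> code-search specialist
```

Whatever the classification method, the contract is the same: query in, one specialist out, with a default path for queries that match nothing.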
Specialist — An agent or workflow tuned for a narrow task (e.g., "code search," "documentation lookup," "test generation"). Specialists are composed by an orchestrator.

Thread memory — Conversation state that persists within a single session or thread.

Workflow memory — Intermediate state that persists within a multi-step task but doesn't survive beyond the workflow's completion.
Optimization terms

Catastrophic forgetting — When fine-tuning causes a model to lose capabilities it had before training. The model gets better at the fine-tuned task but worse at tasks it previously handled. PEFT methods like LoRA reduce this risk by freezing original weights.

Distillation — Training a smaller (student) model to reproduce the behavior of a larger (teacher) model on a specific task.

DPO (Direct Preference Optimization) — A method for preference-based model optimization that's simpler than RLHF, training the model directly on preference pairs without a separate reward model.

Fine-tuning — Updating a model's weights on task-specific data to change its behavior permanently. An umbrella term that includes SFT, instruction tuning, RLHF, DPO, and other techniques. See the fine-tuning landscape table in Lesson 8.3 for how these relate.

Full fine-tuning — Updating all of a model's parameters during training, as opposed to PEFT methods that update only a small subset. Requires significantly more GPU memory and compute. Produces the most thorough adaptation but carries higher risk of catastrophic forgetting.

Inference server — Software (like vLLM or Ollama) that hosts a model and serves inference requests.

Instruction tuning — A specific application of SFT where the training data consists of instruction-response pairs. This is how base models become chat models: the technique is SFT, the data format is instructions. Not a separate technique from SFT.

LoRA (Low-Rank Adaptation) — A parameter-efficient fine-tuning method that trains small adapter matrices instead of updating all model weights. Dramatically reduces GPU memory and compute requirements.

Parameter count — The number of learned weights in a model, commonly expressed in billions (e.g., "7B" = 7 billion parameters). Determines memory requirements (roughly 2 bytes per parameter at FP16) and broadly correlates with capability, though training quality and architecture matter as much as size. See Model Selection and Serving for sizing guidance.

PEFT (Parameter-Efficient Fine-Tuning) — A family of methods (including LoRA) that fine-tune a small subset of parameters instead of the full model.

Preference optimization — Training methods (RLHF, DPO) that use human or automated preference signals to improve model behavior. "This output is better than that output" rather than "this is the correct output."

QLoRA (Quantized LoRA) — LoRA applied to a quantized (compressed) base model. Further reduces memory requirements, enabling fine-tuning on consumer hardware.

Quantization — Reducing the precision of model weights (e.g., FP16 → INT4) to shrink memory usage and increase inference speed at some quality cost. A 7B model at FP16 needs ~14 GB VRAM; quantized to 4-bit, it fits in ~4 GB. Common formats include GGUF (llama.cpp/Ollama), GPTQ and AWQ (vLLM/HuggingFace). See Model Selection and Serving for format details and tradeoffs.
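The arithmetic behind those numbers is simple; this back-of-envelope estimate covers weights only, so treat it as a floor (activations and KV cache add more):

```python
# Estimate the memory footprint of model weights at a given precision:
# parameters * bits-per-parameter / 8 bytes, expressed in GB.

def weight_memory_gb(params_billions: float, bits: int) -> float:
    bytes_total = params_billions * 1e9 * bits / 8
    return bytes_total / 1e9

print(weight_memory_gb(7, 16))   # FP16: 14.0 GB
print(weight_memory_gb(7, 4))    # INT4:  3.5 GB
```

The gap between 3.5 GB and the ~4 GB quoted above is quantization overhead (scales, zero points, and mixed-precision layers that stay at higher precision).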
Overfitting — When a model memorizes training examples instead of learning generalizable patterns. The model performs well on training data but poorly on new inputs. Detected by monitoring validation loss alongside training loss.

RLHF (Reinforcement Learning from Human Feedback) — A training method that uses human preference signals to improve model behavior through a reward model. More complex than DPO (requires training a separate reward model) but offers more control over the optimization objective.

SFT (Supervised Fine-Tuning) — Fine-tuning using input-output pairs where the desired output is known. The most common fine-tuning approach.

TRL (Transformer Reinforcement Learning) — A Hugging Face library for training language models with reinforcement learning, SFT, and other optimization methods.
Cross-cutting terms

Consumer chat app — The browser or desktop product meant for human conversation (ChatGPT, Claude, HuggingChat). Useful for experimentation, but not the same as API access.

Developer platform — The provider's API, billing, API-key, and developer-docs surface. This is what you need for this learning path.

Hosted API — The provider runs the model for you and you call it over HTTP.

Local inference — You run the model on your own machine.

Provider — The company or service that hosts a model API you call from code.

Prompt caching — Reusing computation from repeated prompt prefixes to reduce latency and cost on subsequent requests with the same prefix.

Rate limiting — Constraints on how many API requests you can make per unit of time. An operational concern that affects system design and cost.

Token budget — The maximum number of tokens you allocate for a specific part of the context (e.g., "retrieval evidence gets at most 4K tokens"). A context engineering tool for preventing any single component from dominating the context window.
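Enforcing a budget can be as simple as packing relevance-ordered evidence until the allocation is spent; the token counts below are illustrative, and real code would use the provider's tokenizer to count:

```python
# Pack evidence items into a fixed token budget: walk the list in
# relevance order, skipping anything that would overflow the budget.

def pack_to_budget(items: list[tuple[str, int]], budget: int) -> list[str]:
    """items: (text, token_count) pairs, already sorted by relevance."""
    packed, used = [], 0
    for text, tokens in items:
        if used + tokens > budget:
            continue            # skip anything that would overflow
        packed.append(text)
        used += tokens
    return packed

evidence = [("top hit", 3000), ("second hit", 1000), ("long tail", 2000)]
print(pack_to_budget(evidence, budget=4000))   # -> ['top hit', 'second hit']
```

Skipping (rather than stopping at) an oversized item lets a smaller later item still use the remaining budget; truncating oversized items is a common alternative.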