Module 7: Orchestration and Memory

Thread and Workflow Memory

Up to this point, every request to our system starts from scratch. The orchestrator receives a question, routes it, gets a specialist response, synthesizes an answer, and then forgets everything. Ask a follow-up question, and the system has no idea what you just discussed. Start a multi-step debugging session, and each step loses the context from the previous one.

Memory fixes this, but memory isn't one thing. It's at least three layers, and mixing them up creates problems. This lesson covers the first two: thread memory (the current conversation) and workflow state (the current multi-step task). These are the simpler, safer layers. We'll build them first, measure the improvement, and save long-term memory (the riskier layer) for the next lesson.

What you'll learn

  • Implement thread memory that maintains conversation context within a session
  • Build workflow state that persists intermediate results across steps of a multi-step task
  • Manage context window pressure as accumulated memory grows
  • Apply summarization and truncation strategies to keep memory useful without flooding the context
  • Measure whether memory improves system quality on a benchmark subset

Concepts

Thread memory — the record of the current conversation session. Thread memory includes the messages exchanged so far, the questions asked, the answers given, and any clarifications. When a user asks "what about the error handling?" after discussing a function, thread memory is what connects "the error handling" to the function from three turns ago. Thread memory is scoped to a single session and disappears when the session ends.

Workflow state — the structured record of a multi-step task's progress. Where thread memory is conversational (a sequence of messages), workflow state is operational (a data structure tracking what's been done, what's pending, and what intermediate results exist). A debugging workflow might track: symptom identified, relevant files located, root cause hypothesized, fix proposed, tests run. Workflow state makes multi-step tasks resumable and debuggable.

Context window pressure — the tension between adding memory to the context and keeping room for retrieval, tool results, and generation. Every token of memory is a token that can't be used for evidence or reasoning. As conversations grow, naive thread memory (just append every message) will eventually crowd out the context window, degrading quality. Managing context window pressure is a core context engineering challenge — one we first encountered in Module 1 and will keep encountering as the system grows.

Memory summarization — the practice of compressing older memory into a shorter summary to reduce context window pressure. Instead of keeping all 40 messages from a long conversation, summarize turns 1-30 into a paragraph and keep turns 31-40 verbatim. The summary preserves the key decisions and facts while freeing tokens for the current work.

Memory truncation — dropping the oldest memory entries when the context window fills up. Truncation is simpler than summarization but lossy — early context that might matter (the original question, key constraints) can be lost. A common pattern is to combine both: summarize old turns and truncate the summaries when even those grow too large.

Problem-to-Tool Map

| Problem class | Symptom | Cheapest thing to try first | Tool or approach |
|---|---|---|---|
| System forgets the current conversation | Follow-up questions lose context | Include recent messages in the prompt | Thread memory with a message window |
| Multi-step tasks lose progress | Each step starts from scratch | Pass state manually | Explicit workflow state in the graph |
| Long conversations degrade quality | Answers get worse as conversations grow | Limit message count | Summarize older turns, keep recent ones verbatim |
| Memory crowds out retrieval | Less room for evidence in long threads | Shorter history window | Adaptive memory budget based on task type |

Walkthrough

Thread memory: maintaining conversation context

The simplest form of thread memory is including recent messages in the prompt. LangGraph's checkpointing can persist graph state across invocations within a session; here we implement the memory layer explicitly so its mechanics stay visible:

# orchestration/memory.py
"""Thread and workflow memory for the orchestration pipeline.

Thread memory: conversation history within a session.
Workflow state: structured progress tracking for multi-step tasks.
"""
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone
from openai import OpenAI


client = OpenAI()


@dataclass
class ThreadMemory:
    """Conversation history for a single session.

    Maintains a window of recent messages and a summary of older ones.
    """
    messages: list[dict] = field(default_factory=list)
    summary: str = ""
    session_id: str = ""
    created_at: str = ""
    max_recent_messages: int = 20  # Keep this many recent messages verbatim
    summary_threshold: int = 30    # Summarize when total exceeds this

    def add_message(self, role: str, content: str) -> None:
        """Add a message to the thread."""
        self.messages.append({
            "role": role,
            "content": content,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

        if len(self.messages) > self.summary_threshold:
            self._summarize_old_messages()

    def get_context_messages(self) -> list[dict]:
        """Get messages formatted for the LLM context."""
        context = []

        if self.summary:
            context.append({
                "role": "system",
                "content": f"Summary of earlier conversation:\n{self.summary}",
            })

        recent = self.messages[-self.max_recent_messages:]
        for msg in recent:
            context.append({"role": msg["role"], "content": msg["content"]})

        return context

    def _summarize_old_messages(self) -> None:
        """Compress older messages into a summary."""
        if len(self.messages) <= self.max_recent_messages:
            return

        old_messages = self.messages[:-self.max_recent_messages]

        summary_prompt = "Summarize this conversation history into key facts, decisions, and context. Be concise but preserve important details:\n\n"
        for msg in old_messages:
            summary_prompt += f"{msg['role']}: {msg['content'][:200]}\n"

        if self.summary:
            summary_prompt = f"Previous summary:\n{self.summary}\n\nNew messages to incorporate:\n" + summary_prompt

        response = client.chat.completions.create(
            model="gpt-4.1-nano",
            messages=[{"role": "user", "content": summary_prompt}],
            max_tokens=300,
        )

        self.summary = response.choices[0].message.content
        self.messages = self.messages[-self.max_recent_messages:]

    def token_estimate(self) -> int:
        """Rough estimate of tokens consumed by this memory."""
        total_chars = len(self.summary)
        for msg in self.messages[-self.max_recent_messages:]:
            total_chars += len(msg["content"])
        return total_chars // 4
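
The summary-plus-window shape that get_context_messages produces can be exercised in isolation. Here is a minimal standalone sketch of the same windowing logic (window_messages is a hypothetical helper for illustration; the real class above adds timestamps and the LLM-backed summarizer):

```python
def window_messages(messages: list[dict], summary: str, max_recent: int = 20) -> list[dict]:
    """Build LLM context: summary first (if any), then the most recent messages."""
    context = []
    if summary:
        context.append({
            "role": "system",
            "content": f"Summary of earlier conversation:\n{summary}",
        })
    for m in messages[-max_recent:]:
        context.append({"role": m["role"], "content": m["content"]})
    return context


msgs = [{"role": "user", "content": f"turn {i}"} for i in range(30)]
ctx = window_messages(msgs, summary="User is debugging validate_path.")
# 30 stored messages collapse to 21 context entries: 1 summary + 20 recent.
```

The point to internalize: the model never sees the full history, only the compressed summary plus a verbatim tail.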

To integrate thread memory with the orchestration graph, pass the memory through the graph state:

# orchestration/graph.py (extend OrchestratorState)

@dataclass
class OrchestratorState:
    """State passed between nodes in the orchestration graph."""
    question: str = ""
    route: str = ""
    specialist_output: str = ""
    evidence: list[str] = field(default_factory=list)
    tools_called: list[str] = field(default_factory=list)
    final_answer: str = ""
    confidence: float = 0.0
    retry_count: int = 0
    # Memory fields
    thread_messages: list[dict] = field(default_factory=list)
    thread_summary: str = ""
    workflow_state: dict = field(default_factory=dict)

Then update each specialist to receive conversation context:

# orchestration/specialists.py (update call_specialist)

def call_specialist(
    question: str,
    system_prompt: str,
    tools: list[dict],
    tool_executor: callable,
    thread_context: list[dict] | None = None,
) -> dict:
    """Call a specialist with its scoped prompt, tools, and conversation context."""
    messages = [{"role": "system", "content": system_prompt}]

    # Include thread context if available
    if thread_context:
        messages.extend(thread_context)

    messages.append({"role": "user", "content": question})

    # ... rest of the function unchanged

Workflow state: tracking multi-step progress

Workflow state is different from thread memory. Where thread memory is a sequence of messages, workflow state is a structured record of what the system is doing:

# orchestration/memory.py (continued)

@dataclass
class WorkflowState:
    """Structured state for a multi-step workflow.

    Tracks what's been done, what's pending, and intermediate results.
    Each workflow type can have its own state shape.
    """
    workflow_id: str = ""
    workflow_type: str = ""  # "debug", "review", "refactor", etc.
    status: str = "in_progress"  # in_progress, paused, completed, failed
    steps_completed: list[dict] = field(default_factory=list)
    steps_pending: list[str] = field(default_factory=list)
    intermediate_results: dict = field(default_factory=dict)
    created_at: str = ""
    updated_at: str = ""

    def complete_step(self, step_name: str, result: dict) -> None:
        """Mark a step as complete and store its result."""
        self.steps_completed.append({
            "step": step_name,
            "result": result,
            "completed_at": datetime.now(timezone.utc).isoformat(),
        })
        if step_name in self.steps_pending:
            self.steps_pending.remove(step_name)
        self.updated_at = datetime.now(timezone.utc).isoformat()

    def add_intermediate_result(self, key: str, value) -> None:
        """Store an intermediate result for use by later steps."""
        self.intermediate_results[key] = value
        self.updated_at = datetime.now(timezone.utc).isoformat()

    def get_progress_summary(self) -> str:
        """Generate a summary of workflow progress for the LLM context."""
        completed = [s["step"] for s in self.steps_completed]
        summary = f"Workflow: {self.workflow_type} ({self.status})\n"
        summary += f"Completed: {', '.join(completed) if completed else 'none'}\n"
        summary += f"Pending: {', '.join(self.steps_pending) if self.steps_pending else 'none'}\n"

        if self.intermediate_results:
            summary += "Key findings:\n"
            for key, value in self.intermediate_results.items():
                summary += f"  - {key}: {str(value)[:100]}\n"

        return summary


def create_debug_workflow(symptom: str) -> WorkflowState:
    """Create a workflow for a debugging session."""
    return WorkflowState(
        workflow_type="debug",
        steps_pending=[
            "identify_symptom",
            "locate_relevant_code",
            "hypothesize_root_cause",
            "verify_hypothesis",
            "propose_fix",
        ],
        intermediate_results={"initial_symptom": symptom},
        created_at=datetime.now(timezone.utc).isoformat(),
        updated_at=datetime.now(timezone.utc).isoformat(),
    )

Workflow state shines in multi-step tasks like debugging. Without it, each step in a debugging session starts from scratch, and the system has to re-discover the error, re-locate the relevant code, and re-derive the context. With workflow state, step three (hypothesize root cause) knows what step two found (the relevant code) and what step one identified (the symptom).
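
Resumability falls out of the state being plain data. A sketch of pausing and restoring a workflow through JSON, shown with a minimal dict stand-in rather than the full dataclass (the example values are illustrative):

```python
import json

# Minimal dict stand-in for WorkflowState (the real dataclass adds timestamps).
workflow = {
    "workflow_type": "debug",
    "status": "paused",
    "steps_completed": [{"step": "identify_symptom"}, {"step": "locate_relevant_code"}],
    "steps_pending": ["hypothesize_root_cause", "verify_hypothesis", "propose_fix"],
    "intermediate_results": {"initial_symptom": "KeyError in validate_path"},
}

blob = json.dumps(workflow)      # persist to a file, Redis, or a database row
restored = json.loads(blob)      # a later process picks up exactly where it left off
```

Because every field is JSON-serializable, a debugging session interrupted at step two can resume at step three days later, in a different process, with all findings intact.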

Managing context window pressure

As thread memory grows, it competes with retrieval evidence and specialist reasoning for context window space. Here's a practical approach to managing the budget:

# orchestration/context_budget.py
"""Context window budget management.

Allocates context window space across memory, retrieval, and generation
to prevent memory from crowding out useful content.
"""

# Target allocations as percentage of available context
BUDGET = {
    "system_prompt": 0.10,     # 10% for system instructions
    "thread_memory": 0.20,     # 20% for conversation history
    "workflow_state": 0.05,    # 5% for workflow progress
    "retrieval_evidence": 0.40, # 40% for retrieved evidence
    "generation": 0.25,        # 25% reserved for model output
}


def allocate_context_budget(
    total_tokens: int = 128_000,
    task_type: str = "general",
) -> dict:
    """Calculate token budgets for each context section.

    Adjusts based on task type:
    - Conversation-heavy tasks get more thread memory budget
    - Research-heavy tasks get more retrieval budget
    """
    budget = BUDGET.copy()

    if task_type == "follow_up":
        # Follow-ups need more conversation context
        budget["thread_memory"] = 0.30
        budget["retrieval_evidence"] = 0.30

    elif task_type == "research":
        # Research needs more retrieval space
        budget["thread_memory"] = 0.10
        budget["retrieval_evidence"] = 0.50

    return {
        section: int(total_tokens * fraction)
        for section, fraction in budget.items()
    }


def trim_thread_to_budget(
    thread: "ThreadMemory",
    token_budget: int,
) -> list[dict]:
    """Trim thread memory to fit within a token budget.

    Strategy: keep the summary and as many recent messages as fit.
    """
    context = thread.get_context_messages()

    # Estimate tokens
    total_tokens = 0
    trimmed = []

    # Always include summary if present
    if context and context[0]["role"] == "system" and "Summary" in context[0]["content"]:
        summary_tokens = len(context[0]["content"]) // 4
        if summary_tokens < token_budget:
            trimmed.append(context[0])
            total_tokens += summary_tokens
            context = context[1:]

    # Add recent messages from most recent backwards until budget is full.
    # We reverse twice so the final message list stays chronological.
    selected_recent = []
    for msg in reversed(context):
        msg_tokens = len(msg["content"]) // 4
        if total_tokens + msg_tokens > token_budget:
            break
        selected_recent.append(msg)
        total_tokens += msg_tokens

    trimmed.extend(reversed(selected_recent))

    return trimmed

The key insight here: memory isn't free. Every token of conversation history is a token that can't be used for evidence or reasoning. Setting explicit budgets and trimming to fit them prevents the slow quality degradation that happens when conversations grow long and memory crowds out useful context.
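
One cheap invariant worth enforcing: the budget fractions should sum to 1.0, or some section is silently over- or under-allocated. A small sanity check reproducing the default BUDGET above and the resulting token counts for a 128k window:

```python
import math

# Default fractions from context_budget.py above.
BUDGET = {
    "system_prompt": 0.10,
    "thread_memory": 0.20,
    "workflow_state": 0.05,
    "retrieval_evidence": 0.40,
    "generation": 0.25,
}

# Guard against drift when someone edits one fraction and forgets the others.
assert math.isclose(sum(BUDGET.values()), 1.0)

tokens = {section: int(128_000 * fraction) for section, fraction in BUDGET.items()}
# e.g. thread memory gets 25,600 tokens; retrieval evidence gets 51,200.
```

Putting this assertion at module import time catches a mis-edited budget before it quietly degrades a week of benchmark runs.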

Integrating memory with the orchestration graph

Here's how thread memory and workflow state flow through the orchestration pipeline:

# orchestration/graph.py (updated pipeline with memory)

from orchestration.memory import ThreadMemory, WorkflowState
from orchestration.context_budget import allocate_context_budget, trim_thread_to_budget


# Session store — in production, use a database
_sessions: dict[str, ThreadMemory] = {}
_workflows: dict[str, WorkflowState] = {}


def get_or_create_session(session_id: str) -> ThreadMemory:
    """Get an existing session or create a new one."""
    if session_id not in _sessions:
        _sessions[session_id] = ThreadMemory(session_id=session_id)
    return _sessions[session_id]


def run_with_memory(
    question: str,
    session_id: str,
    workflow_id: str | None = None,
) -> dict:
    """Run the orchestration pipeline with memory context."""
    # Get session memory
    thread = get_or_create_session(session_id)

    # Get workflow state if a workflow is active
    workflow = _workflows.get(workflow_id) if workflow_id else None

    # Budget allocation
    budget = allocate_context_budget(
        task_type="follow_up" if len(thread.messages) > 0 else "general",
    )

    # Trim memory to budget
    thread_context = trim_thread_to_budget(thread, budget["thread_memory"])

    # Build state for the graph.
    # thread_messages will be passed into each specialist's message list
    # by call_specialist() — see the specialist-design lesson for that wiring.
    state = {
        "question": question,
        "thread_messages": thread_context,
        "thread_summary": thread.summary,
    }

    if workflow:
        # Workflow progress is included in the question context so the
        # specialist knows what steps have already been completed.
        # In a production system, you'd inject this into the system prompt
        # or as a separate message rather than concatenating with the question.
        workflow_context = workflow.get_progress_summary()
        state["question"] = f"{workflow_context}\n\nCurrent question: {question}"
        state["workflow_state"] = {
            "progress": workflow_context,
            "intermediate_results": workflow.intermediate_results,
        }

    # Run the orchestration graph
    result = graph.invoke(state)

    # Update memory with the new exchange
    thread.add_message("user", question)
    thread.add_message("assistant", result.get("final_answer", ""))

    return result

Measuring memory's impact

Thread memory and workflow state should improve quality on conversational and multi-step benchmark subsets. Here's how to measure:

# Create a conversational benchmark subset
# These are multi-turn question sequences where context matters
python harness/run_harness.py \
    --pipeline orchestrated-with-memory \
    --benchmark benchmark-conversational.jsonl

python harness/graders/answer_grader.py harness/runs/latest.jsonl

# Compare against the no-memory orchestrated system
python harness/compare_runs.py \
    harness/runs/orchestrated-no-memory.jsonl \
    harness/runs/orchestrated-with-memory.jsonl

The conversational benchmark should include question sequences like:

{"id": "conv-001a", "question": "What does the validate_path function do?", "session": "conv-001", "turn": 1, "gold_answer": "..."}
{"id": "conv-001b", "question": "What about its error handling?", "session": "conv-001", "turn": 2, "gold_answer": "..."}
{"id": "conv-001c", "question": "Could that cause issues with symlinks?", "session": "conv-001", "turn": 3, "gold_answer": "..."}

Without thread memory, turn 2 has no idea what "its" refers to. With thread memory, the system connects "its error handling" to validate_path from turn 1. The quality difference on these sequences is where memory proves its value.
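
Running a conversational subset means replaying each session's turns in order through the same session_id, not firing questions independently. A sketch of grouping a JSONL subset by session before replay (field names follow the example records above; gold answers omitted for brevity):

```python
import json
from collections import defaultdict

# Two records in the benchmark format shown above, deliberately out of order.
lines = [
    '{"id": "conv-001b", "question": "What about its error handling?", "session": "conv-001", "turn": 2}',
    '{"id": "conv-001a", "question": "What does the validate_path function do?", "session": "conv-001", "turn": 1}',
]

sessions: dict[str, list[dict]] = defaultdict(list)
for line in lines:
    item = json.loads(line)
    sessions[item["session"]].append(item)

# Replay in turn order so thread memory accumulates the way a real user would build it.
replay_order = [
    item["id"]
    for session_items in sessions.values()
    for item in sorted(session_items, key=lambda i: i["turn"])
]
```

Each item in replay_order would then be passed to run_with_memory with session_id set to the record's session field, so turn 2 sees turn 1's exchange.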

Exercises

  1. Implement the ThreadMemory class and integrate it with your orchestration pipeline. Run a 5-turn conversation and verify that follow-up questions resolve correctly (e.g., "What does that function do?" after discussing a file).

  2. Implement the summarization logic. Start a conversation, add 35+ messages, and verify that older messages get summarized while recent ones remain verbatim. Check that the summary preserves key facts.

  3. Create a debug workflow using WorkflowState. Walk through a 4-step debugging session and verify that each step can see the results of the previous steps.

  4. Implement the context budget system. Run the same 10-question benchmark with budget limits of 20%, 30%, and 40% for thread memory. Does more memory budget always help, or is there a point where it hurts retrieval quality?

  5. Create a conversational benchmark subset (at least 5 multi-turn sequences). Compare accuracy with and without thread memory. What's the improvement on follow-up questions?

Completion checkpoint

You have:

  • Thread memory that maintains conversation context across turns within a session
  • Summarization that compresses older messages to manage context window pressure
  • Workflow state that tracks multi-step task progress and intermediate results
  • A context budget system that allocates tokens across memory, retrieval, and generation
  • Measured improvement on a conversational benchmark subset showing that memory helps follow-up questions

Reflection prompts

  • At what conversation length does thread memory start degrading quality instead of helping? How would you detect this automatically?
  • Workflow state makes multi-step tasks explicit. What workflows in your code assistant would benefit most from structured state tracking?
  • The context budget allocations in this lesson are starting points. Based on your benchmark results, what adjustments would improve quality?

What's next

Long-Term Memory and Write Policies. Session memory solves continuity, but some facts need to survive beyond the current run; the next lesson adds persistence without letting memory rot or accumulate garbage.



Glossary
API (Application Programming Interface)Foundational terms
A structured way for programs to communicate. In this context, usually an HTTP endpoint you call to interact with an LLM.
AST (Abstract Syntax Tree)Foundational terms
A tree representation of source code structure. Used by parsers like Tree-sitter to understand code as a hierarchy of functions, classes, and statements. You'll encounter this more deeply in the Code Retrieval module, but the concept appears briefly in retrieval fundamentals.
BM25 (Best Match 25)Foundational terms
A classical ranking function for keyword search. Scores documents by term frequency and inverse document frequency. Often competitive with or complementary to vector search.
ChunkingFoundational terms
Splitting a document into smaller pieces for indexing and retrieval. Chunk boundaries significantly affect retrieval quality. Split at the wrong place and your retrieval will return half a function or the end of one paragraph glued to the start of another.
Context engineeringFoundational terms
The discipline of selecting, packaging, and budgeting the information a model sees at inference time. Prompts, retrieved evidence, tool results, memory, and state are all parts of context. Context engineering is arguably the core skill of AI engineering. Bigger context windows are not a substitute for better context selection.
Context rotFoundational terms
Degradation of output quality caused by stale, noisy, or accumulated context. Symptoms include stale memory facts, conflicting retrieved evidence, bloated prompt history, and accumulated instructions that contradict each other. A form of technical debt in AI systems.
Context windowFoundational terms
The maximum number of tokens an LLM can process in a single request (input + output combined).
EmbeddingFoundational terms
A fixed-length numeric vector representing a piece of text. Used for similarity search: texts with similar meanings have nearby embeddings.
EndpointFoundational terms
A specific URL path that accepts requests and returns responses (e.g., POST /v1/chat/completions).
GGUFFoundational terms
A file format for quantized models used by llama.cpp and Ollama. When you see a model name like qwen2.5:7b-q4_K_M, the suffix indicates the quantization scheme. GGUF supports mixed quantization (different precision for different layers) and is the most common format for local inference.
HallucinationFoundational terms
When a model generates content that sounds confident but isn't supported by the evidence it was given, or fabricates details that don't exist. Not the same as "any wrong answer"; a model that misinterprets ambiguous instructions gave a bad answer but didn't hallucinate. Common causes: weak prompt, missing context, context rot, model limitation, or retrieval failure.
InferenceFoundational terms
Running a trained model to generate output from input. What happens when you call an API. Most AI engineering work is inference-time work: building systems around models, not training them. Use "inference," not "inferencing."
JSON (JavaScript Object Notation)Foundational terms
A lightweight text format for structured data. The lingua franca of API communication.
Lexical searchFoundational terms
Finding items by matching keywords or terms. Includes BM25, TF-IDF (Term Frequency–Inverse Document Frequency), and simple keyword matching. Returns exact term matches, not semantic similarity.
LLM (Large Language Model)Foundational terms
A neural network trained on large text corpora that generates text by predicting the next token. The core technology behind AI engineering; every tool, pattern, and pipeline in this curriculum runs on top of one.
MetadataFoundational terms
Structured information about a document or chunk (file path, language, author, date, symbol type). Used for filtering retrieval results.
Neural networkFoundational terms
A computing system loosely inspired by biological neurons, built from layers of mathematical functions that transform inputs into outputs. LLMs are a specific type of neural network (transformers) trained on text. You don't need to understand neural network internals to do AI engineering, but knowing the term helps when reading external resources.
Reasoning modelFoundational terms
A model optimized for complex multi-step planning, math, and logic (e.g., o3, o4-mini). Slower and more expensive but better on hard problems. Sometimes called "LRM" (large reasoning model), but "reasoning model" is the more consistent term across provider docs.
RerankingFoundational terms
A second-pass scoring step that re-orders retrieved results using a more expensive model. Improves precision after an initial broad retrieval.
SchemaFoundational terms
A formal description of the shape and types of a data structure. Used to validate inputs and outputs.
SLM (small language model)Foundational terms
A compact model (typically 1-7B parameters) that runs on consumer hardware with lower cost, latency, and better privacy (e.g., Phi, small Llama variants, Gemma). The right choice when privacy, offline operation, predictable cost, or low latency matter more than peak capability.
System promptFoundational terms
A special message that sets the model's behavior, role, and constraints for a conversation.
TemperatureFoundational terms
A parameter controlling output randomness. Lower values produce more deterministic output; higher values produce more varied output. Does not affect the model's intelligence.
TokenFoundational terms
The basic unit an LLM processes. Not a word. Tokens are sub-word fragments. "unhappiness" might be three tokens: "un", "happi", "ness". Token count determines cost and context window usage.
Top-kFoundational terms
The number of results returned from a retrieval query. "Top-5" means the five highest-scoring results.
Top-p (nucleus sampling)Foundational terms
An alternative to temperature for controlling output diversity. Selects from the smallest set of tokens whose cumulative probability exceeds p.
Vector searchFoundational terms
Finding items by proximity in embedding space (nearest neighbors). Returns "similar" results, not "exact match" results.
vLLM (virtual LLM)Foundational terms
An inference serving engine (not a model) that hosts open-weight models behind an OpenAI-compatible HTTP endpoint. Infrastructure layer, not model layer. Relevant when moving from hosted APIs to self-hosting.
WeightsFoundational terms
The learned parameters inside a model. Changed during training, fixed during inference.
Workhorse modelFoundational terms
A general-purpose LLM optimized for speed and broad capability (e.g., GPT-4o-mini, Claude Haiku, Gemini Flash). The default for most tasks. When someone says "LLM" without qualification, they usually mean this.
BaselineBenchmark and Harness terms
The first measured performance of your system on a benchmark. Everything else is compared against this. Without a baseline, you can't tell whether a change helped.
BenchmarkBenchmark and Harness terms
A fixed set of questions or tasks with known-good answers, used to measure system performance over time.
Run logBenchmark and Harness terms
A structured record (typically JSONL) of every system run: what input was given, what output was produced, what tools were called, how long it took, and what it cost. The raw data that evals, telemetry, and cost analysis are built from.
A2A (Agent-to-Agent protocol)Agent and Tool Building terms
An open protocol for peer-to-peer agent collaboration. Agents discover each other's capabilities and delegate or negotiate tasks as equals. Different from MCP (which connects agents to tools, not to other agents) and from handoffs (which transfer control within one system).
AgentAgent and Tool Building terms
A system where an LLM decides which tools to call, observes results, and iterates until a task is complete. Agent = model + tools + control loop.
Control loopAgent and Tool Building terms
The code that manages the agent's cycle: send prompt, check for tool calls, execute tools, append results, repeat or finish.
HandoffAgent and Tool Building terms
Passing control from one agent or specialist to another within an orchestrated system.
MCP (Model Context Protocol)Agent and Tool Building terms
An open protocol for exposing tools, resources, and prompts to AI applications in a standardized way. Connects agents to capabilities (tools and data), not to other agents.
Tool calling / function callingAgent and Tool Building terms
The model's ability to request execution of a specific function with structured arguments, rather than just generating text.
Context compilation / context packingCode Retrieval terms
The process of selecting and assembling the smallest useful set of evidence for a specific task. Not "dump everything retrieved into the prompt."
GroundingCode Retrieval terms
Tying model assertions to specific evidence. A grounded answer cites what it found; an ungrounded answer asserts without evidence.
Hybrid retrievalCode Retrieval terms
Combining multiple retrieval methods (e.g., vector search + keyword search + metadata filters) and merging or reranking the results.
Knowledge graphCode Retrieval terms
A data structure that stores entities and their relationships explicitly (e.g., "function A calls function B," "module X imports module Y"). Useful for traversal and dependency reasoning. One retrieval strategy among several, often overused when simpler metadata or adjacency tables would suffice.
RAG (Retrieval-Augmented Generation)Code Retrieval terms
A pattern where the model's response is grounded in retrieved external evidence rather than relying solely on its training data.
Symbol tableCode Retrieval terms
A mapping of code identifiers (functions, classes, variables) to their locations and metadata.
Tree-sitterCode Retrieval terms
An incremental parsing library that builds ASTs for source code. Used in this curriculum for code-aware chunking and symbol extraction.
Context packRAG and Grounded Answers terms
A structured bundle of evidence assembled for a specific task, with metadata about provenance, relevance, and token budget.
Evidence bundleRAG and Grounded Answers terms
A collection of retrieved items grouped for a specific sub-task, with enough metadata to evaluate whether the evidence is relevant and sufficient.
Retrieval routingRAG and Grounded Answers terms
Deciding which retrieval strategy or method to use for a given query. Different questions need different retrieval methods.
EvalObservability and Evals terms
A structured test that measures system quality. Not the same as training. Evals measure, they don't change the model.
Harness (AI harness / eval harness)Observability and Evals terms
The experiment and evaluation framework around your model or agent. It runs benchmark tasks, captures outputs, logs traces, grades results, and compares system versions. It turns ad hoc "try it and see" into repeatable, comparable experiments. Typically includes: input dataset, prompt and tool configuration, model/provider selection, execution loop, logging, grading, and artifact capture.
LLM-as-judge
Using a language model to evaluate or grade the output of another model or system. Useful for scaling evaluation beyond manual review, but requires rubric quality, judge consistency checks, and human spot-checking. Not a replacement for exact-match checks where they apply.
OpenTelemetry (OTel)
An open standard for collecting and exporting telemetry data (traces, metrics, logs). Vendor-agnostic.
RAGAS
A specific eval framework for retrieval-augmented generation. Measures metrics like faithfulness, relevance, and context precision. One tool example, not a foundational concept. Learn the metrics first, then the tool.
Span
A single operation within a trace (e.g., one tool call, one retrieval query). Traces are made of spans.
Telemetry
Structured data about system behavior: what happened, when, how long it took, what it cost. Includes traces, metrics, and events.
Trace
A structured record of one complete run through the system, including all steps, tool calls, and decisions.
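The trace/span relationship can be sketched as plain data structures; real tracing libraries add IDs, parent links, and attributes, but the shape is the same (names and timings here are made up):

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str        # one operation, e.g. "vector_search" or "llm_call"
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms

@dataclass
class Trace:
    run_id: str
    spans: list = field(default_factory=list)

    def total_ms(self) -> float:
        # Naive sum; real tracing must account for overlapping spans.
        return sum(s.duration_ms for s in self.spans)
```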
Orchestration and Memory terms

Long-term memory
Persistent facts that survive across conversations. Requires write policies to manage what gets stored, updated, or deleted.
Orchestration
Explicit control over how tasks are routed, delegated, and synthesized across multiple agents or specialists.
Router
A component that decides which specialist or workflow path to use for a given query.
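A router can be as simple as a keyword rule set; production routers are usually an LLM call with a constrained output, but the contract is the same (query in, specialist name out). The rules and specialist names below are invented:

```python
SPECIALISTS = ("code_search", "doc_lookup", "test_generation")

def route(query: str) -> str:
    """Pick a specialist for the query; fall back to code search."""
    q = query.lower()
    if "test" in q:
        return "test_generation"
    if any(w in q for w in ("docs", "documentation", "how do i")):
        return "doc_lookup"
    return "code_search"  # default path

route("write a unit test for the parser")  # "test_generation"
```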
Specialist
An agent or workflow tuned for a narrow task (e.g., "code search," "documentation lookup," "test generation"). Specialists are composed by an orchestrator.
Thread memory
Conversation state that persists within a single session or thread.
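A minimal thread-memory sketch: an ordered message list scoped to one session, with a windowing method to limit how much history reaches the next model call (the `last_n` cutoff is a stand-in for the truncation strategies this lesson covers):

```python
class ThreadMemory:
    """Conversation state for a single session; discarded when it ends."""

    def __init__(self):
        self.messages = []  # ordered {"role", "content"} dicts

    def append(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})

    def as_context(self, last_n: int = 20):
        """Return only the most recent turns for the next model call."""
        return self.messages[-last_n:]
```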
Workflow memory
Intermediate state that persists within a multi-step task but doesn't survive beyond the workflow's completion.
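Where thread memory is a message list, workflow memory is operational state. A sketch using the debugging workflow from this lesson (the step names and results are illustrative):

```python
# Each step records its status plus any intermediate result,
# which makes the workflow resumable after an interruption.
workflow = {
    "task": "fix failing login test",
    "steps": {
        "identify_symptom":  {"status": "done", "result": "TimeoutError in auth"},
        "locate_files":      {"status": "done", "result": ["auth/session.py"]},
        "hypothesize_cause": {"status": "in_progress", "result": None},
        "propose_fix":       {"status": "pending", "result": None},
        "run_tests":         {"status": "pending", "result": None},
    },
}

def next_step(state):
    """Resume point: the first step that isn't finished."""
    for name, step in state["steps"].items():
        if step["status"] != "done":
            return name
    return None

next_step(workflow)  # "hypothesize_cause"
```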
Optimization terms

Catastrophic forgetting
When fine-tuning causes a model to lose capabilities it had before training. The model gets better at the fine-tuned task but worse at tasks it previously handled. PEFT methods like LoRA reduce this risk by freezing original weights.
Distillation
Training a smaller (student) model to reproduce the behavior of a larger (teacher) model on a specific task.
DPO (Direct Preference Optimization)
A method for preference-based model optimization that's simpler than RLHF, training the model directly on preference pairs without a separate reward model.
Fine-tuning
Updating a model's weights on task-specific data to change its behavior permanently. An umbrella term that includes SFT, instruction tuning, RLHF, DPO, and other techniques. See the fine-tuning landscape table in Lesson 8.3 for how these relate.
Full fine-tuning
Updating all of a model's parameters during training, as opposed to PEFT methods that update only a small subset. Requires significantly more GPU memory and compute. Produces the most thorough adaptation but carries higher risk of catastrophic forgetting.
Inference server
Software (like vLLM or Ollama) that hosts a model and serves inference requests.
Instruction tuning
A specific application of SFT where the training data consists of instruction-response pairs. This is how base models become chat models: the technique is SFT, the data format is instructions. Not a separate technique from SFT.
LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning method that trains small adapter matrices instead of updating all model weights. Dramatically reduces GPU memory and compute requirements.
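The savings come from simple arithmetic: for one d_out × d_in weight matrix, LoRA trains two low-rank factors B (d_out × r) and A (r × d_in) instead of the full matrix. Back-of-envelope with typical (illustrative) transformer dimensions:

```python
# Trainable parameters for one weight matrix, full update vs. LoRA.
d_in, d_out, r = 4096, 4096, 8          # example layer dims and LoRA rank

full_params = d_in * d_out              # update every weight
lora_params = r * (d_in + d_out)        # only the two adapter factors
reduction = full_params / lora_params   # 256x fewer trainable weights here
```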
Parameter count
The number of learned weights in a model, commonly expressed in billions (e.g., "7B" = 7 billion parameters). Determines memory requirements (roughly 2 bytes per parameter at FP16) and broadly correlates with capability, though training quality and architecture matter as much as size. See Model Selection and Serving for sizing guidance.
PEFT (Parameter-Efficient Fine-Tuning)
A family of methods (including LoRA) that fine-tune a small subset of parameters instead of the full model.
Preference optimization
Training methods (RLHF, DPO) that use human or automated preference signals to improve model behavior. "This output is better than that output" rather than "this is the correct output."
QLoRA (Quantized LoRA)
LoRA applied to a quantized (compressed) base model. Further reduces memory requirements, enabling fine-tuning on consumer hardware.
Quantization
Reducing the precision of model weights (e.g., FP16 → INT4) to shrink memory usage and increase inference speed at some quality cost. A 7B model at FP16 needs ~14 GB VRAM; quantized to 4-bit, it fits in ~4 GB. Common formats include GGUF (llama.cpp/Ollama), GPTQ and AWQ (vLLM/HuggingFace). See Model Selection and Serving for format details and tradeoffs.
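The memory figures above follow from bits-per-weight arithmetic. A rough estimator (weights only; real deployments add overhead for activations and the KV cache, which is why a 4-bit 7B model is quoted as ~4 GB rather than the bare 3.5 GB below):

```python
def model_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Rough weight-memory estimate: params * bits / 8 bytes."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

model_memory_gb(7, 16)  # 14.0 GB at FP16
model_memory_gb(7, 4)   # 3.5 GB at 4-bit
```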
Overfitting
When a model memorizes training examples instead of learning generalizable patterns. The model performs well on training data but poorly on new inputs. Detected by monitoring validation loss alongside training loss.
RLHF (Reinforcement Learning from Human Feedback)
A training method that uses human preference signals to improve model behavior through a reward model. More complex than DPO (requires training a separate reward model) but offers more control over the optimization objective.
SFT (Supervised Fine-Tuning)
Fine-tuning using input-output pairs where the desired output is known. The most common fine-tuning approach.
TRL (Transformer Reinforcement Learning)
A Hugging Face library for training language models with reinforcement learning, SFT, and other optimization methods.
Cross-cutting terms

Consumer chat app
The browser or desktop product meant for human conversation (ChatGPT, Claude, HuggingChat). Useful for experimentation, but not the same as API access.
Developer platform
The provider's API, billing, API-key, and developer-docs surface. This is what you need for this learning path.
Hosted API
The provider runs the model for you, and you call it over HTTP.
Local inference
You run the model on your own machine.
Provider
The company or service that hosts a model API you call from code.
Prompt caching
Reusing computation from repeated prompt prefixes to reduce latency and cost on subsequent requests with the same prefix.
Rate limiting
Constraints on how many API requests you can make per unit of time. An operational concern that affects system design and cost.
Token budget
The maximum number of tokens you allocate for a specific part of the context (e.g., "retrieval evidence gets at most 4K tokens"). A context engineering tool for preventing any single component from dominating the context window.
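A sketch of budgets in practice, using the "retrieval gets at most 4K tokens" allocation from the definition (the other component budgets are invented, and oldest-first truncation is just one of the strategies this lesson compares):

```python
# Per-component token budgets for one request; numbers are illustrative.
BUDGETS = {
    "system_prompt": 1_000,
    "thread_memory": 3_000,
    "retrieval_evidence": 4_000,
    "workflow_state": 1_000,
}

def truncate_to_budget(tokens: list, budget: int) -> list:
    """Drop the oldest tokens first so the most recent context survives."""
    return tokens[-budget:] if len(tokens) > budget else tokens

truncate_to_budget(list(range(5_000)), BUDGETS["retrieval_evidence"])  # keeps newest 4,000
```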