
Choosing a Provider

Before you write your first model call, you'll need to pick a provider path that gives you actual API access. This is one of the most common stumbling blocks for newcomers to AI engineering. People (yours truly included) who have used ChatGPT, Gemini, or Claude in the browser assume they already have what they need to build from code. Usually they don't, and the distinction isn't obvious until you hit it.

This guide lays out the product boundaries clearly. The curriculum supports multiple paths: direct provider APIs (OpenAI, Gemini, Anthropic), hosted inference platforms (Hugging Face, Ollama Cloud, GitHub Models), and local runtimes (Ollama on your machine). You don't need them all. You don't need to marry one forever. You can start with one and add or switch to another later.

Using a cloud platform like Amazon Bedrock or Google Vertex AI? Those are valid paths to the same models. The curriculum concepts apply unchanged; only the API surface differs. See Cloud Provider Surfaces for setup, translation tables, and how to map lesson examples to your platform.

What you'll decide

  • Which provider you'll use first for Module 1 exercises
  • Which site gives you API keys and billing access for that provider
  • Whether you want one provider or more than one configured
  • When it makes sense to mix providers by capability

Concepts

The following terms come up throughout this guide and the rest of the learning path. If you're already familiar with these, skip ahead to the provider comparison table.

  • Provider: the company or service that hosts a model API you call from code.
  • Developer platform: the provider's API, billing, API-key, and developer-docs surface. This is what you'll need for this learning path.
  • Consumer chat app: the browser or desktop product meant for human conversation. ChatGPT, Claude.ai, and HuggingChat are examples. Useful for experimentation, but not the same thing as API access.
  • Developer/agent tools: tools like Claude Code, Codex CLI, OpenCode, and GitHub Copilot SDK are a different category. They're developer tools or agent runtimes that use APIs under the hood, not consumer chat apps. You may use them while learning (see Choose Your Track), but they're not the same as the provider platforms where you get API keys and billing.
  • Hosted API: the provider runs the model for you and you call it over HTTP.
  • Routing platform: a developer platform that can route your request to multiple underlying model providers through one account and one token. Hugging Face Inference Providers can work this way.
  • Hosted inference platform: a platform layer that exposes multiple publishers' models behind one account or auth surface. GitHub Models fits here. It's real and useful, but it's a different layer from "the direct provider API for a specific model family."
  • Local inference: you run the model on your own machine. Ollama supports this directly. Some of our lessons later in the path discuss when this matters.

Supported starting paths

Each of these is supported throughout the learning path. The table below shows where to go for developer access, what not to confuse each surface with, and when each path is a good first choice.

| Provider / platform path | Developer surface to use | Do not confuse it with | Primary auth env var | Best first if... |
| --- | --- | --- | --- | --- |
| OpenAI | platform.openai.com + developers.openai.com | chatgpt.com | OPENAI_API_KEY | You want a widely documented direct API surface with a large surrounding ecosystem of examples and SDK integrations |
| Gemini API | aistudio.google.com + ai.google.dev | gemini.google.com | GEMINI_API_KEY | You want Google's direct API surface, native structured outputs, and a clean path to Vertex AI later |
| Anthropic | platform.claude.com + platform.claude.com/docs | claude.ai | ANTHROPIC_API_KEY | You want Claude as your main model path and strong prompt/tool-use docs |
| Hugging Face | https://huggingface.co/settings/tokens + https://huggingface.co/docs/inference-providers/index | huggingface.co/chat | HF_TOKEN | You want one developer account that can access many open models/providers and later expand into datasets, Spaces, and dedicated endpoints |
| GitHub Models | https://github.com/marketplace/models + https://docs.github.com/en/rest/models/inference | GitHub Copilot SDK | GITHUB_TOKEN | You want a hosted inference path behind your GitHub account and like using publisher/model IDs through one API surface |
| Ollama (Local) | ollama.com/download + local API at http://localhost:11434 | Ollama Cloud at https://ollama.com/api | none | You want privacy, offline development, predictable cost, or the strongest beginner path for local structured-output experiments |
| Ollama Cloud | https://ollama.com/api + docs.ollama.com | local Ollama at http://localhost:11434 | OLLAMA_API_KEY | You want a hosted Ollama path for chat/generation today and the flexibility to learn local Ollama later |

Curious where GitHub fits? GitHub Copilot SDK is an agent-platform/runtime layer. GitHub Models is a hosted inference and routing layer. Both are valid ecosystem surfaces, but this curriculum teaches direct provider APIs first so you learn the underlying contracts before adding a higher-level platform layer. See Excluded Topics and Clarified Terms for the canonical taxonomy.

The product vs. developer-platform distinctions

Most providers have both a consumer product and a separate developer platform, and the line between them isn't always obvious. Here's what to watch for with each provider.

OpenAI: ChatGPT is not the API Platform

If you want to build from code, the places that matter are platform.openai.com for keys and billing and developers.openai.com for the current API docs, not chatgpt.com.

  • chatgpt.com is the consumer product for using ChatGPT directly.
  • platform.openai.com is the developer platform for API keys and API billing.
  • developers.openai.com is where OpenAI's current API docs and integration guides live.
  • OpenAI's own help center explicitly says ChatGPT and the API platform are two separate platforms, and billing does not automatically carry over between them.

Practical rule:

  • If a lesson tells you to set OPENAI_API_KEY, you need the OpenAI API platform.
  • A ChatGPT Plus/Pro/Business subscription does not automatically give you API access.

Anthropic: Claude app is not the developer platform

If you want to build from code, the place that matters is Anthropic's developer platform at platform.claude.com and the canonical API docs at platform.claude.com/docs, not claude.ai.

  • claude.ai is the consumer chat product.
  • platform.claude.com and platform.claude.com/docs are the developer surfaces for API keys, billing, Workbench, and programmatic access.
  • Anthropic's own help center says paid Claude plans and the Claude Console are separate products, and a paid Claude subscription does not include API or Console access.

Practical rule:

  • If a lesson tells you to set ANTHROPIC_API_KEY, you need Anthropic's developer platform access.
  • If you are setting Anthropic up now, go to https://platform.claude.com/ first, then use https://platform.claude.com/docs/ for the API walkthroughs.
  • A Claude Pro/Max/Team/Enterprise subscription does not automatically give you API access.
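As a minimal sketch of what the developer surface looks like from code (the model ID below is an assumption; check the current model list on platform.claude.com/docs), note that Anthropic's Messages API takes the system prompt as a top-level parameter and requires an explicit max_tokens, unlike OpenAI's chat API:

```python
def build_request(question: str) -> dict:
    # Request kwargs for Anthropic's Messages API. The system prompt is a
    # top-level parameter, and max_tokens is required on every call.
    return {
        "model": "claude-sonnet-4-5",  # assumed model ID; check current docs
        "max_tokens": 256,
        "system": "You are a helpful assistant.",
        "messages": [{"role": "user", "content": question}],
    }

def ask(question: str) -> str:
    import anthropic  # pip install anthropic

    # The client reads ANTHROPIC_API_KEY from the environment.
    client = anthropic.Anthropic()
    message = client.messages.create(**build_request(question))
    return message.content[0].text

# Usage (needs ANTHROPIC_API_KEY set):
#   print(ask("What is a token in the context of LLMs?"))
```

The call shape is the main translation cost when moving between OpenAI and Anthropic; the concepts are identical.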

Gemini: the Gemini app is not the Gemini API

If you want to build from code, the places that matter are aistudio.google.com for API-key management and ai.google.dev for the Gemini API docs, not gemini.google.com.

  • gemini.google.com is the consumer Gemini app
  • aistudio.google.com is the developer surface for Gemini API keys and quickstart workflows
  • ai.google.dev is where Google's Gemini API docs and SDK guides live

Practical rule:

  • If a lesson tells you to set GEMINI_API_KEY, go to Google AI Studio and Gemini API docs, not the consumer Gemini app.
  • If you later decide to use Gemini through Vertex AI instead of the direct Gemini API, the model concepts stay the same and only the platform surface changes. See Cloud Provider Surfaces.
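As a minimal sketch with the google-genai SDK (the model ID is an assumption; check the current list on ai.google.dev), the Gemini API keeps the system instruction in the request config rather than in the messages list:

```python
GEMINI_MODEL = "gemini-2.5-flash"  # assumed model ID; check current docs
SYSTEM_PROMPT = "You are a helpful assistant."

def ask(question: str) -> str:
    # pip install google-genai; genai.Client() reads GEMINI_API_KEY
    # from the environment.
    from google import genai
    from google.genai import types

    client = genai.Client()
    response = client.models.generate_content(
        model=GEMINI_MODEL,
        contents=question,
        config=types.GenerateContentConfig(system_instruction=SYSTEM_PROMPT),
    )
    return response.text

# Usage (needs GEMINI_API_KEY set):
#   print(ask("What is a token in the context of LLMs?"))
```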

Ollama: local Ollama is not Ollama Cloud

Ollama supports both local and cloud usage, and the names are similar enough to confuse you if you don't know what to look for.

  • local Ollama runs on your machine and serves an API by default at http://localhost:11434/api
  • Ollama Cloud exposes a hosted API at https://ollama.com/api
  • The official docs present the same API shape locally and in the cloud, but this curriculum only relies on direct Ollama Cloud for chat/generation. Embedding-heavy lessons use local Ollama or an explicit hybrid path.

Practical rule:

  • If you are using local Ollama, no API key is required for http://localhost:11434
  • If you are using Ollama Cloud directly, you need an API key and OLLAMA_API_KEY
  • In this curriculum, when a lesson says Ollama Cloud, it means the hosted API on ollama.com, not the local runtime
  • In embedding and vector-retrieval lessons, Ollama (Hybrid) means local Ollama for embeddings plus Ollama Cloud for generation
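Because local Ollama is just an HTTP API with no key, a first call needs nothing beyond the standard library. A sketch against the local /api/chat endpoint (the model name is an example; use whatever you've pulled):

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str) -> bytes:
    # Local Ollama's /api/chat takes an OpenAI-style messages list;
    # "stream": False asks for one complete JSON response instead of chunks.
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }).encode()

def chat(model: str, prompt: str) -> str:
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=build_chat_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]

# Usage (requires a running local Ollama and a pulled model,
# e.g. `ollama pull llama3.2`; no API key needed):
#   print(chat("llama3.2", "What is a token?"))
```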

Hugging Face: HuggingChat is not Inference Providers

Hugging Face is broader than a single model API. For this learning path, the hosted-inference surfaces that matter are your Hugging Face account, your user access token, Inference Providers, and sometimes Inference Endpoints. That is different from huggingface.co/chat.

  • huggingface.co/chat is HuggingChat, the consumer chat app
  • https://huggingface.co/settings/tokens is where you create the HF_TOKEN you'll use from code. Important: create a fine-grained token and enable the Make calls to Inference Providers permission. A default token without this permission won't work for the lessons that call hosted models.
  • https://huggingface.co/docs/inference-providers/index is the easiest hosted API path for this curriculum
  • Inference Endpoints is the dedicated-infrastructure option you may care about later

Practical rules:

  • If a lesson tells you to set HF_TOKEN, go to https://huggingface.co/settings/tokens and create a fine-grained token with Make calls to Inference Providers enabled
  • If you want the simplest hosted path, start at https://huggingface.co/docs/inference-providers/index
  • If you already have another provider account, Hugging Face can also route requests using a custom provider key while keeping the same client surface
  • In this curriculum, "Hugging Face" usually means Hub account + token + Inference Providers unless the lesson explicitly says Endpoints, datasets, or Spaces
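Because the Inference Providers router is OpenAI-compatible, the simplest hosted path reuses the openai client with a different base URL. A sketch (the model ID in the usage comment is an example, not a recommendation):

```python
import os

ROUTER_URL = "https://router.huggingface.co/v1"

def router_client():
    # The router speaks the OpenAI-compatible chat API, so the regular
    # openai client works; only the base URL and the credential change.
    from openai import OpenAI  # pip install openai

    return OpenAI(base_url=ROUTER_URL, api_key=os.environ["HF_TOKEN"])

# Usage (needs an HF_TOKEN with the Inference Providers permission):
#   r = router_client().chat.completions.create(
#       model="meta-llama/Llama-3.1-8B-Instruct",  # example routed model
#       messages=[{"role": "user", "content": "What is a token?"}],
#   )
#   print(r.choices[0].message.content)
```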

GitHub Models: hosted inference is not GitHub Copilot SDK

GitHub Models is a hosted inference surface behind GitHub authentication. In this curriculum, it matters as a way to call models through GitHub's API layer. That is different from GitHub Copilot SDK, which is an agent runtime layer.

  • GitHub Models uses your GitHub auth/token and publisher/model IDs such as openai/gpt-4.1
  • GitHub Copilot SDK is a higher-level runtime with sessions, tools, and control flow
  • In this curriculum, when a lesson says GitHub Models, it means the hosted inference API surface at models.github.ai/inference

Practical rules:

  • If a lesson tells you to set GITHUB_TOKEN, you need a GitHub token that can call GitHub Models
  • If you are using GitHub Models, keep the distinction clear: hosted inference surface here, Copilot SDK elsewhere
  • Do not assume every publisher in the broader AI ecosystem is available through your GitHub Models catalog. On the token I validated on April 1, 2026, the visible publishers were OpenAI, AI21 Labs, Cohere, DeepSeek, Meta, Microsoft, Mistral AI, and xAI. Anthropic and Google were not visible on that token.
  • Because of that catalog reality, the GitHub Models lesson variants in this curriculum are currently validated against OpenAI-published models on GitHub Models, not as a generic "Claude or Gemini through GitHub" path
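The same OpenAI-compatible pattern applies here: point the openai client at the GitHub Models inference endpoint and authenticate with your GitHub token. A sketch, assuming a token that can call GitHub Models:

```python
import os

GITHUB_MODELS_URL = "https://models.github.ai/inference"

def github_models_client():
    # GitHub Models exposes an OpenAI-compatible surface behind your
    # GitHub token; model IDs are publisher-prefixed, e.g. openai/gpt-4.1.
    from openai import OpenAI  # pip install openai

    return OpenAI(base_url=GITHUB_MODELS_URL,
                  api_key=os.environ["GITHUB_TOKEN"])

# Usage (needs a GITHUB_TOKEN that can call GitHub Models):
#   r = github_models_client().chat.completions.create(
#       model="openai/gpt-4.1",
#       messages=[{"role": "user", "content": "What is a token?"}],
#   )
#   print(r.choices[0].message.content)
```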

Pricing and budget notes

Pricing changes frequently, so be sure to check the official provider pricing pages before committing. Here's what was current as of this writing:

  • OpenAI API pricing lives on the OpenAI API pricing page
  • Gemini Developer API pricing lives on the Gemini Developer API pricing page
  • Anthropic API pricing lives on Anthropic's pricing docs / pricing page
  • As of March 24, 2026, Hugging Face's official pricing pages say:
    • PRO: $9 / mo
    • Team: $20 / user / mo
    • Enterprise: starting at $50 / user / mo
    • Inference Providers pricing docs also say free users get $0.10 monthly credits and PRO users get $2.00 monthly credits for routed requests, with pay-as-you-go after that
  • As of March 24, 2026, Ollama's official pricing page lists:
    • Free
    • Pro: $20 / mo or $200 / yr billed annually
    • Max: $100 / mo

If budget is your main concern, Ollama Cloud and Hugging Face are both solid hosted options. If you want a widely documented direct API surface, OpenAI is a reasonable place to start. If Claude is the ecosystem you already use daily, Anthropic is a first-class path. And if you like the Ollama ecosystem but don't have local hardware yet, Ollama Cloud gets you started without it.

You can have more than one provider

You don't have to pick one provider for the entire curriculum.

Normal mixed-provider setups include:

  • OpenAI for early hands-on examples, Anthropic for later comparison
  • Gemini for direct API work, Vertex AI later when cloud-platform controls matter
  • Anthropic for generation, another provider for embeddings
  • Hugging Face as a single token for many open models, with direct-provider keys added later only if needed
  • GitHub Models for hosted inference behind GitHub auth, with direct-provider keys added later if you want the native publisher surface
  • Ollama Cloud for hosted chat calls now, local Ollama later
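If you do configure several providers, a tiny helper like this makes it easy to see which paths your environment currently supports. The mapping below is my own sketch, built from the env vars and base URLs in the table above, not curriculum code:

```python
import os

# Hypothetical mapping from provider paths to their env vars and, where one
# exists, the OpenAI-compatible base URL mentioned in this guide.
PROVIDERS = {
    "openai": {"env": "OPENAI_API_KEY", "base_url": None},
    "gemini": {"env": "GEMINI_API_KEY", "base_url": None},
    "anthropic": {"env": "ANTHROPIC_API_KEY", "base_url": None},
    "huggingface": {"env": "HF_TOKEN",
                    "base_url": "https://router.huggingface.co/v1"},
    "github-models": {"env": "GITHUB_TOKEN",
                      "base_url": "https://models.github.ai/inference"},
    "ollama-cloud": {"env": "OLLAMA_API_KEY", "base_url": None},
}

def configured_providers() -> list:
    # A path counts as configured when its key is present in the environment.
    return [name for name, spec in PROVIDERS.items()
            if os.environ.get(spec["env"])]
```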

Provider choice is often per capability, not just per company.

One important example: Anthropic's own embeddings guide says Anthropic does not offer its own embedding model and points learners to separate embeddings providers. So "I use Anthropic" doesn't have to mean "I only use Anthropic for every component."

How to choose your first provider

Here's how I'd think about it:

  • Choose OpenAI Platform first if you want a widely documented direct API surface with broad SDK and community example coverage.
  • Choose Gemini API first if you want Google's direct API surface and you like the native structured-output, function-calling, and embeddings path exposed through google-genai.
  • Choose Anthropic's developer platform first if you want Claude as your main runtime and you are comfortable translating a few SDK/API surface differences.
  • Choose Hugging Face first if you want one developer account that can expose many open models/providers, and you like the idea of one token working across chat, embeddings, and later dedicated endpoints.
  • Choose GitHub Models first if you already live in GitHub, want a hosted inference path behind one GitHub auth surface, and are comfortable using publisher/model IDs like openai/gpt-4.1.
  • Choose Ollama (Local) first if you want to stay local from day one, keep your code on your machine, and learn the same chat/embedding patterns without a hosted API bill.
  • Choose Ollama Cloud first if you want a hosted API now and the option to move toward local Ollama later. In my smoke tests, it was a good fit for chat-first and summarization-heavy examples. For strict schema-constrained extraction, local Ollama or another provider was more reliable.
  • Keep a second provider configured if you can afford it. Cross-provider comparison builds better engineering judgment than single-provider habits.

Module 1 provider details

These are the first places where provider choice matters.

LLM Mental Models

Pick one of the seven quickstart variants for your first model call. The OpenAI variant is shown below; use platform.openai.com, not chatgpt.com.

```shell
mkdir llm-experiments && cd llm-experiments
python -m venv .venv && source .venv/bin/activate
pip install openai
export OPENAI_API_KEY="sk-..."
```

```python
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is a token in the context of LLMs?"},
    ],
    temperature=0,
)

print(response.choices[0].message.content)
print(response.usage.total_tokens)
```

Prompt Engineering Fundamentals

Use the same provider you chose for llm-mental-models.

  • OpenAI variant: keep using the openai client in prompt_lab.py
  • Gemini variant: use google-genai with GenerateContentConfig(system_instruction=...) and keep the same prompt-contract experiments
  • Anthropic variant: swap the client and call shape, but keep the same prompt-contract experiments
  • Hugging Face variant: keep the openai client but set base_url="https://router.huggingface.co/v1" and use HF_TOKEN
  • GitHub Models variant: keep the openai client, point it at https://models.github.ai/inference, add the GitHub headers, and use GITHUB_TOKEN
  • Ollama (Local) variant: use the Ollama client against http://localhost:11434 and keep the same prompt variants
  • Ollama Cloud variant: use the Ollama client or API with the same prompt variants; the lesson concept does not depend on one provider

The important thing here isn't which SDK you use; it's that you run the same prompt variants through the same provider so you can compare the outputs cleanly.
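That comparison loop can be sketched provider-agnostically. Everything here is my own illustration (the variant names and prompts are made up); `ask` stands in for whichever client call your chosen provider uses:

```python
# Run the same input through several prompt variants on one provider,
# so output differences are attributable to the prompt alone.
VARIANTS = {
    "bare": "Summarize this function.",
    "role": "You are a senior reviewer. Summarize this function.",
    "contract": (
        "You are a senior reviewer. Summarize this function "
        "in exactly two sentences, with no code blocks."
    ),
}

def run_variants(ask, code: str) -> dict:
    # `ask` is any callable that takes a prompt string and returns the
    # model's text response for your chosen provider.
    return {name: ask(f"{prompt}\n\n{code}")
            for name, prompt in VARIANTS.items()}
```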

Build with APIs, Not Chat Apps

This lesson is where provider surface differences become most visible.

  • OpenAI: use the OpenAI SDK surface for messages, structured outputs, embeddings, and tool calls with a large surrounding ecosystem of examples
  • Gemini: use the native Gemini API surface with google-genai and native structured outputs
  • Anthropic: use the cross-provider comparison exercise already in the lesson to see the equivalent message/tool surface
  • Hugging Face: use the OpenAI-compatible router or the Hugging Face client surface; model IDs and routed providers become part of the learning
  • GitHub Models: use the GitHub-hosted inference surface with GitHub auth, publisher/model IDs, and the same chat/structured-output/tool-call concepts
  • Ollama (Local): map the same concepts to the local Ollama API and keep model pulls explicit as part of setup
  • Ollama Cloud: map the same concepts to Ollama's chat API, structured output support, and tool-calling support

When reading this lesson, separate:

  • portable concept: messages, structured outputs, tool calls, conversation state
  • provider surface: SDK names, parameter names, response shapes, auth headers

It's worth keeping in mind: "the lesson is written with one SDK" doesn't mean "the concept only exists on one provider." There's a lot of overlap in provider implementations, and they're evolving at an astonishing pace. Master the concepts and you'll be able to adopt any provider's SDK.

Retrieval Fundamentals

Retrieval is where mixed-provider thinking becomes normal.

  • OpenAI path: the current starter lab uses OpenAI embeddings directly
  • Gemini path: Gemini exposes a direct embeddings surface, so generation and embeddings can live on the same provider if you want them to
  • Anthropic path: Anthropic's own embeddings guide points learners to a separate embeddings provider, so mixed-provider retrieval is expected
  • Hugging Face path: Hugging Face supports feature extraction / embeddings through Inference Providers, so it can be your retrieval path even when your generation model lives elsewhere
  • Ollama path: Ollama supports embeddings and exposes an /api/embed endpoint for them
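To make the embedding concept concrete, here is a sketch against local Ollama's /api/embed endpoint plus a cosine-similarity helper (the embedding model name is an example of a model you'd pull first, not a curriculum requirement):

```python
import json
import math
import urllib.request

def cosine(a: list, b: list) -> float:
    # Similar texts have nearby embeddings; cosine similarity is the
    # standard closeness measure (1.0 means same direction).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def embed(texts: list, model: str = "nomic-embed-text") -> list:
    # Local Ollama's /api/embed accepts a list of inputs in one call
    # and returns one vector per input under "embeddings".
    req = urllib.request.Request(
        "http://localhost:11434/api/embed",
        data=json.dumps({"model": model, "input": texts}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embeddings"]

# Usage (requires local Ollama and a pulled embedding model):
#   vecs = embed(["What is a token?", "Explain LLM tokens", "Bake a cake"])
#   print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Nothing in `cosine` depends on which provider produced the vectors, which is why the embedding provider can differ from the generation provider.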

The right beginner mental model is:

  • generation provider and embedding provider can be the same
  • generation provider and embedding provider can also be different
  • that's not a hack; it's normal AI engineering

What's next

Choose Your Track. Once you know which provider account will get you through Module 1, decide whether you're starting cloud-first, local-first, or balanced.

If you want the curriculum's operating principles before you begin building, read Hard Rules.



Glossary
Foundational terms

API (Application Programming Interface)
A structured way for programs to communicate. In this context, usually an HTTP endpoint you call to interact with an LLM.

AST (Abstract Syntax Tree)
A tree representation of source code structure. Used by parsers like Tree-sitter to understand code as a hierarchy of functions, classes, and statements. You'll encounter this more deeply in the Code Retrieval module, but the concept appears briefly in retrieval fundamentals.

BM25 (Best Match 25)
A classical ranking function for keyword search. Scores documents by term frequency and inverse document frequency. Often competitive with or complementary to vector search.

Chunking
Splitting a document into smaller pieces for indexing and retrieval. Chunk boundaries significantly affect retrieval quality. Split at the wrong place and your retrieval will return half a function or the end of one paragraph glued to the start of another.

Context engineering
The discipline of selecting, packaging, and budgeting the information a model sees at inference time. Prompts, retrieved evidence, tool results, memory, and state are all parts of context. Context engineering is arguably the core skill of AI engineering. Bigger context windows are not a substitute for better context selection.

Context rot
Degradation of output quality caused by stale, noisy, or accumulated context. Symptoms include stale memory facts, conflicting retrieved evidence, bloated prompt history, and accumulated instructions that contradict each other. A form of technical debt in AI systems.

Context window
The maximum number of tokens an LLM can process in a single request (input + output combined).

Embedding
A fixed-length numeric vector representing a piece of text. Used for similarity search: texts with similar meanings have nearby embeddings.

Endpoint
A specific URL path that accepts requests and returns responses (e.g., POST /v1/chat/completions).

GGUF
A file format for quantized models used by llama.cpp and Ollama. When you see a model name like qwen2.5:7b-q4_K_M, the suffix indicates the quantization scheme. GGUF supports mixed quantization (different precision for different layers) and is the most common format for local inference.

Hallucination
When a model generates content that sounds confident but isn't supported by the evidence it was given, or fabricates details that don't exist. Not the same as "any wrong answer"; a model that misinterprets ambiguous instructions gave a bad answer but didn't hallucinate. Common causes: weak prompt, missing context, context rot, model limitation, or retrieval failure.

Inference
Running a trained model to generate output from input. What happens when you call an API. Most AI engineering work is inference-time work: building systems around models, not training them. Use "inference," not "inferencing."

JSON (JavaScript Object Notation)
A lightweight text format for structured data. The lingua franca of API communication.

Lexical search
Finding items by matching keywords or terms. Includes BM25, TF-IDF (Term Frequency–Inverse Document Frequency), and simple keyword matching. Returns exact term matches, not semantic similarity.

LLM (Large Language Model)
A neural network trained on large text corpora that generates text by predicting the next token. The core technology behind AI engineering; every tool, pattern, and pipeline in this curriculum runs on top of one.

Metadata
Structured information about a document or chunk (file path, language, author, date, symbol type). Used for filtering retrieval results.

Neural network
A computing system loosely inspired by biological neurons, built from layers of mathematical functions that transform inputs into outputs. LLMs are a specific type of neural network (transformers) trained on text. You don't need to understand neural network internals to do AI engineering, but knowing the term helps when reading external resources.

Reasoning model
A model optimized for complex multi-step planning, math, and logic (e.g., o3, o4-mini). Slower and more expensive but better on hard problems. Sometimes called "LRM" (large reasoning model), but "reasoning model" is the more consistent term across provider docs.

Reranking
A second-pass scoring step that re-orders retrieved results using a more expensive model. Improves precision after an initial broad retrieval.

Schema
A formal description of the shape and types of a data structure. Used to validate inputs and outputs.

SLM (small language model)
A compact model (typically 1-7B parameters) that runs on consumer hardware with lower cost, latency, and better privacy (e.g., Phi, small Llama variants, Gemma). The right choice when privacy, offline operation, predictable cost, or low latency matter more than peak capability.

System prompt
A special message that sets the model's behavior, role, and constraints for a conversation.

Temperature
A parameter controlling output randomness. Lower values produce more deterministic output; higher values produce more varied output. Does not affect the model's intelligence.

Token
The basic unit an LLM processes. Not a word. Tokens are sub-word fragments. "unhappiness" might be three tokens: "un", "happi", "ness". Token count determines cost and context window usage.

Top-k
The number of results returned from a retrieval query. "Top-5" means the five highest-scoring results.

Top-p (nucleus sampling)
An alternative to temperature for controlling output diversity. Selects from the smallest set of tokens whose cumulative probability exceeds p.

Vector search
Finding items by proximity in embedding space (nearest neighbors). Returns "similar" results, not "exact match" results.

vLLM (virtual LLM)
An inference serving engine (not a model) that hosts open-weight models behind an OpenAI-compatible HTTP endpoint. Infrastructure layer, not model layer. Relevant when moving from hosted APIs to self-hosting.

Weights
The learned parameters inside a model. Changed during training, fixed during inference.

Workhorse model
A general-purpose LLM optimized for speed and broad capability (e.g., GPT-4o-mini, Claude Haiku, Gemini Flash). The default for most tasks. When someone says "LLM" without qualification, they usually mean this.
Benchmark and Harness terms

Baseline
The first measured performance of your system on a benchmark. Everything else is compared against this. Without a baseline, you can't tell whether a change helped.

Benchmark
A fixed set of questions or tasks with known-good answers, used to measure system performance over time.

Run log
A structured record (typically JSONL) of every system run: what input was given, what output was produced, what tools were called, how long it took, and what it cost. The raw data that evals, telemetry, and cost analysis are built from.
Agent and Tool Building terms

A2A (Agent-to-Agent protocol)
An open protocol for peer-to-peer agent collaboration. Agents discover each other's capabilities and delegate or negotiate tasks as equals. Different from MCP (which connects agents to tools, not to other agents) and from handoffs (which transfer control within one system).

Agent
A system where an LLM decides which tools to call, observes results, and iterates until a task is complete. Agent = model + tools + control loop.

Control loop
The code that manages the agent's cycle: send prompt, check for tool calls, execute tools, append results, repeat or finish.

Handoff
Passing control from one agent or specialist to another within an orchestrated system.

MCP (Model Context Protocol)
An open protocol for exposing tools, resources, and prompts to AI applications in a standardized way. Connects agents to capabilities (tools and data), not to other agents.

Tool calling / function calling
The model's ability to request execution of a specific function with structured arguments, rather than just generating text.
Code Retrieval terms

Context compilation / context packing
The process of selecting and assembling the smallest useful set of evidence for a specific task. Not "dump everything retrieved into the prompt."

Grounding
Tying model assertions to specific evidence. A grounded answer cites what it found; an ungrounded answer asserts without evidence.

Hybrid retrieval
Combining multiple retrieval methods (e.g., vector search + keyword search + metadata filters) and merging or reranking the results.

Knowledge graph
A data structure that stores entities and their relationships explicitly (e.g., "function A calls function B," "module X imports module Y"). Useful for traversal and dependency reasoning. One retrieval strategy among several, often overused when simpler metadata or adjacency tables would suffice.

RAG (Retrieval-Augmented Generation)
A pattern where the model's response is grounded in retrieved external evidence rather than relying solely on its training data.

Symbol table
A mapping of code identifiers (functions, classes, variables) to their locations and metadata.

Tree-sitter
An incremental parsing library that builds ASTs for source code. Used in this curriculum for code-aware chunking and symbol extraction.
RAG and Grounded Answers terms

Context pack
A structured bundle of evidence assembled for a specific task, with metadata about provenance, relevance, and token budget.

Evidence bundle
A collection of retrieved items grouped for a specific sub-task, with enough metadata to evaluate whether the evidence is relevant and sufficient.

Retrieval routing
Deciding which retrieval strategy or method to use for a given query. Different questions need different retrieval methods.
EvalObservability and Evals terms
A structured test that measures system quality. Not the same as training. Evals measure, they don't change the model.
Harness (AI harness / eval harness)Observability and Evals terms
The experiment and evaluation framework around your model or agent. It runs benchmark tasks, captures outputs, logs traces, grades results, and compares system versions. It turns ad hoc "try it and see" into repeatable, comparable experiments. Typically includes: input dataset, prompt and tool configuration, model/provider selection, execution loop, logging, grading, and artifact capture.
LLM-as-judgeObservability and Evals terms
Using a language model to evaluate or grade the output of another model or system. Useful for scaling evaluation beyond manual review, but requires rubric quality, judge consistency checks, and human spot-checking. Not a replacement for exact-match checks where they apply.
OpenTelemetry (OTel): An open standard for collecting and exporting telemetry data (traces, metrics, logs). Vendor-agnostic.
RAGAS: A specific eval framework for retrieval-augmented generation. Measures metrics like faithfulness, relevance, and context precision. One tool example, not a foundational concept. Learn the metrics first, then the tool.
Span: A single operation within a trace (e.g., one tool call, one retrieval query). Traces are made of spans.
Telemetry: Structured data about system behavior: what happened, when, how long it took, what it cost. Includes traces, metrics, and events.
Trace: A structured record of one complete run through the system, including all steps, tool calls, and decisions.
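The trace/span relationship is simple enough to sketch directly. This is a toy data model, not the OpenTelemetry API (which you'd use in practice); it exists only to show that a trace is a list of timed spans:

```python
import time
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str          # the operation: "tool_call", "retrieval_query", ...
    start: float
    end: float = 0.0

    @property
    def duration_ms(self) -> float:
        return (self.end - self.start) * 1000

@dataclass
class Trace:
    run_id: str        # one trace = one complete run through the system
    spans: list[Span] = field(default_factory=list)

    def record(self, name: str) -> Span:
        span = Span(name=name, start=time.monotonic())
        self.spans.append(span)
        return span

trace = Trace(run_id="run-001")
span = trace.record("retrieval_query")
# ... do the work being measured ...
span.end = time.monotonic()
```

Real tracing libraries add span nesting, attributes, and export; the shape stays the same.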
Orchestration and Memory terms

Long-term memory: Persistent facts that survive across conversations. Requires write policies to manage what gets stored, updated, or deleted.
Orchestration: Explicit control over how tasks are routed, delegated, and synthesized across multiple agents or specialists.
Router: A component that decides which specialist or workflow path to use for a given query.
Specialist: An agent or workflow tuned for a narrow task (e.g., "code search," "documentation lookup," "test generation"). Specialists are composed by an orchestrator.
Thread memory: Conversation state that persists within a single session or thread.
Workflow memory: Intermediate state that persists within a multi-step task but doesn't survive beyond the workflow's completion.
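The router and specialist terms above compose directly: a router inspects the query and picks a specialist. A minimal sketch using keyword matching, the simplest possible routing policy (real routers often use a classifier or an LLM call instead):

```python
# Router sketch: pick a specialist by inspecting the query.
SPECIALISTS = {
    "code_search": ["function", "class", "implementation"],
    "doc_lookup": ["documentation", "docs", "guide"],
}

def route(query: str) -> str:
    q = query.lower()
    for specialist, keywords in SPECIALISTS.items():
        if any(kw in q for kw in keywords):
            return specialist
    return "general"  # fallback path when no specialist matches

print(route("Where is the retry function implemented?"))  # code_search
```

Whatever the routing mechanism, the contract is the same: query in, specialist name out, with an explicit fallback.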
Optimization terms

Catastrophic forgetting: When fine-tuning causes a model to lose capabilities it had before training. The model gets better at the fine-tuned task but worse at tasks it previously handled. PEFT methods like LoRA reduce this risk by freezing original weights.
Distillation: Training a smaller (student) model to reproduce the behavior of a larger (teacher) model on a specific task.
DPO (Direct Preference Optimization): A method for preference-based model optimization that's simpler than RLHF, training the model directly on preference pairs without a separate reward model.
Fine-tuning: Updating a model's weights on task-specific data to change its behavior permanently. An umbrella term that includes SFT, instruction tuning, RLHF, DPO, and other techniques. See the fine-tuning landscape table in Lesson 8.3 for how these relate.
Full fine-tuning: Updating all of a model's parameters during training, as opposed to PEFT methods that update only a small subset. Requires significantly more GPU memory and compute. Produces the most thorough adaptation but carries higher risk of catastrophic forgetting.
Inference server: Software (like vLLM or Ollama) that hosts a model and serves inference requests.
Instruction tuning: A specific application of SFT where the training data consists of instruction-response pairs. This is how base models become chat models: the technique is SFT, the data format is instructions. Not a separate technique from SFT.
LoRA (Low-Rank Adaptation): A parameter-efficient fine-tuning method that trains small adapter matrices instead of updating all model weights. Dramatically reduces GPU memory and compute requirements.
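To see why LoRA is parameter-efficient, compare the adapter size to the weight matrix it wraps. For a d_out × d_in weight, a rank-r adapter adds only r × (d_in + d_out) trainable parameters. Illustrative arithmetic (the dimensions are typical, not tied to any specific model):

```python
# LoRA adapter arithmetic: trainable params for one weight matrix.
d_in, d_out, rank = 4096, 4096, 8    # typical hidden size, low adapter rank

full = d_in * d_out                  # params updated by full fine-tuning
lora = rank * (d_in + d_out)         # params in the A and B adapter matrices

print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

At rank 8 the adapter is well under 1% of the original matrix, which is where the memory and compute savings come from.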
Parameter count: The number of learned weights in a model, commonly expressed in billions (e.g., "7B" = 7 billion parameters). Determines memory requirements (roughly 2 bytes per parameter at FP16) and broadly correlates with capability, though training quality and architecture matter as much as size. See Model Selection and Serving for sizing guidance.
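The 2-bytes-per-parameter rule gives a quick back-of-envelope VRAM estimate. A sketch (weights only; KV cache and activations add more on top at runtime):

```python
# Rough VRAM estimate for loading model weights at FP16 (~2 bytes/param).
def weight_memory_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    return params_billion * 1e9 * bytes_per_param / 1e9  # decimal GB

print(weight_memory_gb(7))   # 7B at FP16 -> ~14 GB
print(weight_memory_gb(70))  # 70B at FP16 -> ~140 GB
```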
PEFT (Parameter-Efficient Fine-Tuning): A family of methods (including LoRA) that fine-tune a small subset of parameters instead of the full model.
Preference optimization: Training methods (RLHF, DPO) that use human or automated preference signals to improve model behavior. "This output is better than that output" rather than "this is the correct output."
QLoRA (Quantized LoRA): LoRA applied to a quantized (compressed) base model. Further reduces memory requirements, enabling fine-tuning on consumer hardware.
Quantization: Reducing the precision of model weights (e.g., FP16 → INT4) to shrink memory usage and increase inference speed at some quality cost. A 7B model at FP16 needs ~14 GB VRAM; quantized to 4-bit, it fits in ~4 GB. Common formats include GGUF (llama.cpp/Ollama), and GPTQ and AWQ (vLLM/Hugging Face). See Model Selection and Serving for format details and tradeoffs.
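The quantization numbers above follow from one relation: bytes per parameter = bits / 8. A quick sketch (raw weights only; the glossary's ~4 GB figure for 4-bit also includes format overhead beyond the weights themselves):

```python
# How bit width scales weight memory: bytes/param = bits / 8.
def quantized_memory_gb(params_billion: float, bits: int) -> float:
    return params_billion * 1e9 * (bits / 8) / 1e9  # decimal GB

for bits in (16, 8, 4):
    print(f"7B at {bits}-bit: ~{quantized_memory_gb(7, bits):.1f} GB")
```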
Overfitting: When a model memorizes training examples instead of learning generalizable patterns. The model performs well on training data but poorly on new inputs. Detected by monitoring validation loss alongside training loss.
RLHF (Reinforcement Learning from Human Feedback): A training method that uses human preference signals to improve model behavior through a reward model. More complex than DPO (requires training a separate reward model) but offers more control over the optimization objective.
SFT (Supervised Fine-Tuning): Fine-tuning using input-output pairs where the desired output is known. The most common fine-tuning approach.
TRL (Transformer Reinforcement Learning): A Hugging Face library for training language models with reinforcement learning, SFT, and other optimization methods.
Cross-cutting terms

Consumer chat app: The browser or desktop product meant for human conversation (ChatGPT, Claude, HuggingChat). Useful for experimentation, but not the same as API access.
Developer platform: The provider's API, billing, API-key, and developer-docs surface. This is what you need for this learning path.
Hosted API: The provider runs the model for you and you call it over HTTP.
Local inference: You run the model on your own machine.
Provider: The company or service that hosts a model API you call from code.
Prompt caching: Reusing computation from repeated prompt prefixes to reduce latency and cost on subsequent requests with the same prefix.
Rate limiting: Constraints on how many API requests you can make per unit of time. An operational concern that affects system design and cost.
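On the client side, the standard way to stay under a provider's quota is a token bucket: allow short bursts up to a capacity, refill at a steady rate. A minimal sketch (the clock is injected as a parameter for determinism; in practice you'd pass `time.monotonic()`):

```python
# Token-bucket sketch: allow bursts up to `capacity`, refill `rate` tokens/sec.
class TokenBucket:
    def __init__(self, capacity: float, rate: float):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=2, rate=1.0)  # 2-request burst, 1 request/sec sustained
print([bucket.allow(t) for t in (0.0, 0.1, 0.2, 1.5)])  # [True, True, False, True]
```

The third request is rejected because the burst is spent; by t=1.5 enough has refilled to allow another.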
Token budget: The maximum number of tokens you allocate for a specific part of the context (e.g., "retrieval evidence gets at most 4K tokens"). A context engineering tool for preventing any single component from dominating the context window.
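Enforcing a token budget is just a cutoff applied while assembling the prompt. A sketch using the rough ~4-characters-per-token heuristic (a real implementation would use the provider's tokenizer):

```python
# Keep retrieval evidence under a fixed token budget before it enters the prompt.
def fit_to_budget(passages: list[str], budget_tokens: int) -> list[str]:
    kept, used = [], 0
    for p in passages:
        cost = max(1, len(p) // 4)   # crude token estimate
        if used + cost > budget_tokens:
            break                     # stop before the budget is blown
        kept.append(p)
        used += cost
    return kept

evidence = ["a" * 400, "b" * 400, "c" * 400]  # ~100 tokens each
print(len(fit_to_budget(evidence, budget_tokens=250)))  # keeps 2 of 3
```

Dropping whole passages at the boundary (rather than truncating mid-passage) keeps each piece of evidence intact for the model.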