Model Selection and Serving
Decision framework for choosing the right model for a task and the right way to serve it. Consult this before defaulting to the largest model your budget allows. The right model is often smaller, faster, and cheaper than you might expect.
If you'd like to work with the model/provider pairs I tested with this guide, review the Model-Provider Matrix companion to this page.
Ecosystem layers
These categories get mixed together constantly in vendor docs and community posts. Keeping the layers separate will save you a lot of confusion.
| Layer | What it is | Examples |
|---|---|---|
| Direct provider API | The model publisher's own API surface | OpenAI Platform, Gemini API, Anthropic's developer platform |
| Hosted inference / routing platform | A platform that exposes one or more publishers' models behind its own auth surface | Hugging Face Inference Providers, Ollama Cloud, GitHub Models |
| Local inference runtime | You run the model on your own hardware | Ollama (local), vLLM, llama.cpp |
| Agent platform / runtime | A higher-level runtime that manages sessions, tools, and control loops | GitHub Copilot SDK |
| Cloud provider surface | A cloud platform's API for the same underlying models | AWS Bedrock, Vertex AI |
A common confusion: Gemini API vs. Vertex AI
Google exposes Gemini through two different product surfaces, and it is easy to blur them together if you are new to the ecosystem.
- Gemini API: the simpler developer-facing API used with Google AI Studio and API keys. This is the surface the curriculum means when it says "Gemini" in most runnable lesson examples.
- Vertex AI: Google Cloud's platform surface for Gemini. This is the place where project/location setup, IAM, service accounts, and Google Cloud governance enter the picture.
For the curriculum, the practical distinction is:
- If you want the fastest start for hosted inference examples, the Gemini API is usually the simpler path.
- If you need hosted Gemini tuning, Google currently supports that on Vertex AI, not on the Gemini Developer API.
For a deeper treatment of that distinction, consult the official Gemini API and Vertex AI product documentation.
Model families
Three model families serve different purposes. These are introduced in LLM Mental Models and used throughout the curriculum.
| Family | Examples | Strengths | Weaknesses | Cost |
|---|---|---|---|---|
| Workhorse LLM | GPT-4o, Claude Sonnet 4.6, Gemini 2.5 Flash | General-purpose generation, tool calling, instruction following, code | Expensive at scale, provider-dependent | $2-15 per 1M input tokens |
| Reasoning model | o3, Claude Opus, Gemini 2.5 Pro | Complex multi-step reasoning, math, analysis | Slower, more expensive, overkill for simple tasks | $10-30+ per 1M input tokens |
| SLM (Small Language Model) | Qwen 2.5 7B, Llama 3.1 8B, Phi-3.5, Gemma 2 | Fast, cheap, local, privacy-preserving | Limited capability on complex tasks | Free locally; hosting cost only |
Parameters and model sizing
A model's parameter count is the number of learned weights in the network. It's the single most commonly cited measure of model size, and it directly determines how much memory the model needs.
| Parameter count | Common label | Typical VRAM (full precision) | Capability range |
|---|---|---|---|
| 1-3B | SLM | 2-6 GB | Simple extraction, classification, formatting |
| 7-8B | SLM | 14-16 GB | General instruction following, code assistance, summarization |
| 13-14B | Mid-range | 26-28 GB | Stronger reasoning, multi-step tasks |
| 30-34B | Large | 60-68 GB | Near-frontier for open models |
| 70B+ | Very large | 140+ GB | Frontier open-weight capability |
What parameter count tells you: More parameters generally means more capability, and the model can represent more complex patterns. But it also means more memory, more compute, more cost, and more latency.
What parameter count doesn't tell you: Training data quality, architecture efficiency, and post-training alignment matter as much as size. A well-trained 7B model often outperforms a poorly trained 13B model on specific tasks. Don't choose models by parameter count alone. Choose by benchmark performance on your task type.
The practical question: "Can I run this model on my hardware?" See the Hardware and Model Size Guide for VRAM requirements and GPU recommendations.
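As a rough back-of-envelope check (a sketch, not a benchmark: real usage adds overhead for the KV cache, activations, and runtime buffers), weight memory is just parameter count times bytes per parameter:

```python
def estimate_weight_vram_gb(params_billion: float, bits_per_param: int = 16) -> float:
    """Rough VRAM needed just to hold the weights, in GB.

    Ignores KV cache, activations, and runtime overhead, which add
    roughly 10-30% on top in practice.
    """
    bytes_per_param = bits_per_param / 8
    return params_billion * 1e9 * bytes_per_param / 1e9  # decimal GB

# A 7B model at FP16: ~14 GB of weights, matching the table above.
print(estimate_weight_vram_gb(7, 16))  # 14.0
# The same model quantized to 4-bit: ~3.5 GB of weights.
print(estimate_weight_vram_gb(7, 4))   # 3.5
```

This is why the table's numbers scale linearly with parameter count: the weights dominate, and everything else is overhead on top.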
Quantization
Full-precision models store each parameter as a 16-bit (FP16) or 32-bit (FP32) floating-point number. Quantization reduces the precision of those numbers (typically to 8-bit, 4-bit, or even 2-bit integers) to shrink memory usage and increase inference speed.
Why quantization matters
A 7B parameter model at full precision (FP16) needs ~14 GB of VRAM. Quantized to 4-bit, that same model fits in ~4 GB. This is the difference between needing a $1,000 GPU and running on a laptop.
| Precision | Bits per parameter | 7B model VRAM | Quality impact |
|---|---|---|---|
| FP16 (full) | 16 | ~14 GB | Baseline (no quality loss) |
| INT8 (8-bit) | 8 | ~7 GB | Minimal quality loss for most tasks |
| INT4 (4-bit) | 4 | ~4 GB | Noticeable on complex reasoning, fine for extraction and classification |
| INT2 (2-bit) | 2 | ~2 GB | Significant degradation; use only for simple tasks |
Common quantization formats
| Format | Used by | Notes |
|---|---|---|
| GGUF | llama.cpp, Ollama | The most common format for local inference. Supports mixed quantization (different precision for different layers). When you see a model like qwen2.5:7b-q4_K_M, the q4_K_M suffix indicates 4-bit quantization with a specific scheme. |
| GPTQ | vLLM, HuggingFace | GPU-optimized quantization. Slightly better quality than GGUF at the same bit width but requires GPU. |
| AWQ | vLLM, HuggingFace | Activation-aware quantization. Preserves quality better than naive quantization by identifying and protecting important weights. |
| BitsAndBytes (bnb) | HuggingFace Transformers | Used for QLoRA training (4-bit inference + LoRA adapters). You'll encounter this in Module 8: Distillation. |
Practical guidance
- For curriculum exercises with Ollama: Ollama automatically downloads quantized models. When you run `ollama run qwen2.5:7b`, you get a quantized version that fits in reasonable VRAM. You don't need to choose a quantization format manually.
- For production serving: Start with the highest precision your hardware supports. Only quantize further if you need to reduce memory or improve throughput, and measure the quality impact on your specific task with your eval harness.
- For fine-tuning: QLoRA (Module 8) uses 4-bit quantization during training to reduce VRAM requirements. The trained LoRA adapter is applied on top of the quantized base model.
The tradeoff in one sentence
Quantization trades precision for speed and memory. The question isn't "should I quantize?" but "how much precision can I give up before my evals show a problem?"
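Answering that question is exactly what an eval harness is for: run the same eval set against the full-precision and quantized variants and compare scores. A minimal sketch — the models and eval set here are illustrative stand-ins, not curriculum APIs:

```python
def eval_accuracy(run_model, eval_set):
    """Score a model variant on an eval set of (prompt, expected) pairs."""
    correct = sum(1 for prompt, expected in eval_set if run_model(prompt) == expected)
    return correct / len(eval_set)

# Hypothetical stand-ins for two quantization levels of the same model:
# the 4-bit variant degrades on longer, more complex inputs.
fp16_model = lambda prompt: prompt.upper()
int4_model = lambda prompt: prompt.upper() if len(prompt) < 20 else prompt

eval_set = [
    ("extract name", "EXTRACT NAME"),
    ("classify this short text", "CLASSIFY THIS SHORT TEXT"),
]

baseline = eval_accuracy(fp16_model, eval_set)
quantized = eval_accuracy(int4_model, eval_set)
print(f"FP16: {baseline:.0%}, INT4: {quantized:.0%}")
```

When the gap between the two scores exceeds what your task can tolerate, you have quantized too far.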
Choosing by task type
| Task | Recommended family | Why | When to escalate |
|---|---|---|---|
| Retrieval routing / classification | SLM (after distillation) or workhorse | Bounded task, well-defined outputs | If SLM accuracy is insufficient |
| Code generation with evidence | Workhorse LLM | Needs instruction following + code quality | Complex multi-file reasoning → reasoning model |
| Structured answer generation | Workhorse LLM | Schema following + grounding | Consider SLM if the task is stable and distilled |
| Embedding generation | Embedding model | Different architecture optimized for similarity | N/A (don't use generative models for embeddings) |
| LLM-as-judge grading | Workhorse or reasoning | Needs nuanced quality assessment | High-stakes evals → reasoning model for better judgment |
| Simple extraction / formatting | SLM or small workhorse | Low complexity, speed matters | If extraction quality is low |
| Multi-step planning | Reasoning model | Needs deliberate step-by-step reasoning | Rarely worth downgrading |
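The "when to escalate" column can be operationalized as a routing rule: try the cheapest adequate family first and escalate only when a quality check fails. A sketch under stated assumptions — the task labels, `call_model`, and `passes_check` are illustrative placeholders, not a curriculum API:

```python
# Cheapest-first routing order per task, mirroring the table above.
ROUTES = {
    "classification": ["slm", "workhorse"],
    "code_generation": ["workhorse", "reasoning"],
    "extraction": ["slm", "workhorse"],
    "planning": ["reasoning"],
}

def route(task: str, call_model, passes_check) -> str:
    """Call models cheapest-first; escalate while the output fails the check."""
    for family in ROUTES[task]:
        output = call_model(family, task)
        if passes_check(output):
            return output
    return output  # best effort: keep the most capable family's answer

# Illustrative stubs: the SLM fails the extraction-quality check, so the
# router escalates to the workhorse.
call = lambda family, task: f"{family}:{task}"
ok = lambda out: out.startswith("workhorse")
print(route("extraction", call, ok))
```

In production the check would be a schema validation or a cheap heuristic, not a second LLM call, or escalation costs eat the savings.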
Hosted vs. self-hosted
| Factor | Hosted API | Self-hosted |
|---|---|---|
| Setup effort | Minutes (get API key) | Hours to days (hardware, software, configuration) |
| Per-request cost | Per-token pricing | Zero marginal cost (fixed hardware) |
| Scalability | Provider handles scaling | You handle scaling |
| Privacy | Data leaves your infrastructure | Data stays local |
| Model updates | Provider updates automatically | You update when you choose |
| Availability | Provider SLA, potential outages | Your uptime responsibility |
| Model access | Proprietary models (GPT-4o, Claude) | Open-weight models only |
Decision rule: Start with hosted APIs. Move to self-hosted when one of these is true:
- Privacy requirements prevent sending data to providers
- Per-token costs exceed the hardware cost of running locally
- You need a fine-tuned or distilled model not available as a hosted endpoint
- Latency requirements demand local inference
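The cost bullet is a straightforward break-even calculation. A sketch with illustrative numbers — your token prices, volumes, and hardware costs will differ:

```python
def monthly_api_cost(tokens_per_month: float, price_per_million: float) -> float:
    """Hosted cost: pure per-token pricing."""
    return tokens_per_month / 1e6 * price_per_million

def monthly_selfhost_cost(gpu_price: float, amortize_months: int, power_and_ops: float) -> float:
    """Self-hosted cost: amortized hardware plus fixed power/ops."""
    return gpu_price / amortize_months + power_and_ops

# Illustrative: 500M input tokens/month at $3 per 1M tokens...
api = monthly_api_cost(500e6, 3.0)            # $1500/month
# ...vs a $2400 GPU amortized over 24 months plus ~$100/month power/ops.
local = monthly_selfhost_cost(2400, 24, 100)  # $200/month
print(f"API: ${api:.0f}/mo, self-hosted: ${local:.0f}/mo")
```

The comparison only holds if the self-hosted model actually meets your quality bar, which is why the eval harness comes before the migration.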
Self-hosted serving options
Ollama, LM Studio, and llama.cpp are related tools, not competing alternatives. See Field Terms We Don't Teach for the architectural relationship and an architecture diagram.
Ollama
Best for: Development, prototyping, running models during curriculum exercises.
```bash
# Install
curl -fsSL https://ollama.com/install.sh | sh

# Run a model
ollama run qwen2.5:7b

# Serve via REST API
curl http://localhost:11434/api/chat -d '{
  "model": "qwen2.5:7b",
  "messages": [{"role": "user", "content": "Hello"}]
}'
```

Ollama downloads, quantizes, and serves models with minimal configuration. It's the right choice for individual development and curriculum work.
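The same endpoint can be called from Python with the standard library. A minimal sketch, assuming Ollama is running locally on its default port (11434) and the model has already been pulled; the request and response shape (`message.content`) follow Ollama's chat API:

```python
import json
import urllib.request

def build_chat_payload(model: str, prompt: str) -> dict:
    """Request body for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one JSON response instead of a chunk stream
    }

def ollama_chat(model: str, prompt: str, host: str = "http://localhost:11434") -> str:
    """Send one chat turn to a local Ollama server and return the reply text."""
    req = urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(build_chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

# ollama_chat("qwen2.5:7b", "Hello")  # requires a running Ollama server
```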
vLLM
Best for: Production serving with high throughput and concurrent users.
```bash
pip install vllm

# Serve a model with OpenAI-compatible API
vllm serve Qwen/Qwen2.5-7B-Instruct --port 8000
```

vLLM provides an OpenAI-compatible API endpoint, so switching from a hosted API to a local model requires changing only the base URL and model name. Its PagedAttention mechanism enables efficient memory management for concurrent requests.
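"Only the base URL and model name change" can be made concrete: the request path and body stay OpenAI-shaped either way. A stdlib-only sketch — the hosted values are illustrative placeholders, and in practice you'd more likely point an OpenAI SDK client at the local base URL:

```python
import json
import urllib.request

def chat_request(base_url: str, api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for any compatible server."""
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        }).encode(),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
    )

# Hosted provider (illustrative) vs. local vLLM: same code path, different config.
hosted = chat_request("https://api.openai.com/v1", "sk-...", "gpt-4o", "Hello")
local = chat_request("http://localhost:8000/v1", "not-needed", "Qwen/Qwen2.5-7B-Instruct", "Hello")
# urllib.request.urlopen(local) would send the request to the local vLLM server.
```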
When to use vLLM over Ollama: When serving models to multiple users, when you need precise control over batching and throughput, or when running in a production environment.
llama.cpp
Best for: Maximum control over quantization, running on CPU, or resource-constrained environments.
```bash
# Build from source
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build

# Run a GGUF model
./build/bin/llama-cli -m model.gguf -p "Hello"
```

llama.cpp supports CPU inference (slower but no GPU required) and fine-grained quantization options. Use it when you need to run models without a GPU or when you need specific quantization configurations.
Model selection by curriculum module
| Module | Primary model need | Recommended approach |
|---|---|---|
| 1: Foundation Sprint | API access for exercises | Hosted workhorse (OpenAI, Gemini, Anthropic, Ollama Cloud, HF Inference) |
| 2: Benchmark and Harness | Consistent model for baselines | Same hosted workhorse; consistency matters more than model choice |
| 3: Agent and Tool Building | Tool-calling capable model | Hosted workhorse (tool calling support varies by provider) |
| 4: Code Retrieval | Embedding model + generation model | Hosted embedding API + hosted workhorse |
| 5: RAG and Grounding | Generation with grounding | Hosted workhorse |
| 6: Observability and Evals | LLM-as-judge + generation | Hosted workhorse (consider reasoning model for judge) |
| 7: Orchestration and Memory | Multiple model calls per request | Hosted workhorse; consider model routing for cost |
| 8: Optimization | Training + local inference | Local SLM with Ollama for inference, QLoRA for training |
Cross-references
- Hardware and Model Size Guide — VRAM requirements, GPUs, and cloud options
- Model-Provider Matrix — the model and provider combinations that actually held the lesson contracts during validation
- LLM Mental Models — introduces the three model families
- Cost and Reliability Patterns — model routing pattern for cost optimization
- Choosing a Provider — provider setup and API key configuration