Model Selection and Serving
Decision framework for choosing the right model for a task and the right way to serve it. Consult this before defaulting to the largest model your budget allows. The right model is often smaller, faster, and cheaper than you might expect.
If you'd like to work with the model/provider pairs I tested with this guide, review the Model-Provider Matrix companion to this page.
Ecosystem layers
These categories get mixed together constantly in vendor docs and community posts. Keeping the layers separate will save you a lot of confusion.
| Layer | What it is | Examples |
|---|---|---|
| Direct provider API | The model publisher's own API surface | OpenAI Platform, Gemini API, Anthropic's developer platform |
| Hosted inference / routing platform | A platform that exposes one or more publishers' models behind its own auth surface | Hugging Face Inference Providers, Ollama Cloud, GitHub Models |
| Local inference runtime | You run the model on your own hardware | Ollama (local), vLLM, llama.cpp |
| Agent platform / runtime | A higher-level runtime that manages sessions, tools, and control loops | GitHub Copilot SDK |
| Cloud provider surface | A cloud platform's API for the same underlying models | AWS Bedrock, Vertex AI |
A common confusion: Gemini API vs. Vertex AI
Google exposes Gemini through two different product surfaces, and it is easy to blur them together if you are new to the ecosystem.
- Gemini API: the simpler developer-facing API used with Google AI Studio and API keys. This is the surface the curriculum means when it says "Gemini" in most runnable lesson examples.
- Vertex AI: Google Cloud's platform surface for Gemini. This is the place where project/location setup, IAM, service accounts, and Google Cloud governance enter the picture.
For the curriculum, the practical distinction is:
- If you want the fastest start for hosted inference examples, the Gemini API is usually the simpler path.
- If you need hosted Gemini tuning, Google currently supports that on Vertex AI, not on the Gemini Developer API.
For a deeper treatment of that distinction, consult the official Gemini API and Vertex AI product documentation.
Model families
Three model families serve different purposes. These are introduced in LLM Mental Models and used throughout the curriculum.
| Family | Examples | Strengths | Weaknesses | Cost |
|---|---|---|---|---|
| Workhorse LLM | GPT-4o, Claude Sonnet 4.6, Gemini 2.5 Flash | General-purpose generation, tool calling, instruction following, code | Expensive at scale, provider-dependent | $2-15 per 1M input tokens |
| Reasoning model | o3, Claude Opus, Gemini 2.5 Pro | Complex multi-step reasoning, math, analysis | Slower, more expensive, overkill for simple tasks | $10-30+ per 1M input tokens |
| SLM (Small Language Model) | Qwen 2.5 7B, Llama 3.1 8B, Phi-3.5, Gemma 2 | Fast, cheap, local, privacy-preserving | Limited capability on complex tasks | Free locally; hosting cost only |
Parameters and model sizing
A model's parameter count is the number of learned weights in the network. It's the single most commonly cited measure of model size, and it directly determines how much memory the model needs.
| Parameter count | Common label | Typical VRAM (full precision) | Capability range |
|---|---|---|---|
| 1-3B | SLM | 2-6 GB | Simple extraction, classification, formatting |
| 7-8B | SLM | 14-16 GB | General instruction following, code assistance, summarization |
| 13-14B | Mid-range | 26-28 GB | Stronger reasoning, multi-step tasks |
| 30-34B | Large | 60-68 GB | Near-frontier for open models |
| 70B+ | Very large | 140+ GB | Frontier open-weight capability |
What parameter count tells you: More parameters generally means more capability, and the model can represent more complex patterns. But it also means more memory, more compute, more cost, and more latency.
What parameter count doesn't tell you: Training data quality, architecture efficiency, and post-training alignment matter as much as size. A well-trained 7B model often outperforms a poorly trained 13B model on specific tasks. Don't choose models by parameter count alone. Choose by benchmark performance on your task type.
The practical question: "Can I run this model on my hardware?" See the Hardware and Model Size Guide for VRAM requirements and GPU recommendations.
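As a rough back-of-envelope check (a sketch, not a benchmark: real usage adds overhead for the KV cache, activations, and runtime buffers), weight memory is just parameter count times bytes per parameter:

```python
def estimate_weight_vram_gb(params_billion: float, bits_per_param: int = 16) -> float:
    """Rough VRAM needed just to hold the weights, in GB.

    Ignores KV cache, activations, and runtime overhead, which add
    roughly 10-30% on top in practice.
    """
    bytes_per_param = bits_per_param / 8
    return params_billion * 1e9 * bytes_per_param / 1e9  # decimal GB

# A 7B model at FP16: ~14 GB of weights, matching the table above.
print(estimate_weight_vram_gb(7, 16))  # 14.0
# The same model quantized to 4-bit: ~3.5 GB of weights.
print(estimate_weight_vram_gb(7, 4))   # 3.5
```

This is why the table's numbers scale linearly with parameter count: the weights dominate, and everything else is overhead on top.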
Quantization
Full-precision models store each parameter as a 16-bit (FP16) or 32-bit (FP32) floating-point number. Quantization reduces the precision of those numbers (typically to 8-bit, 4-bit, or even 2-bit integers) to shrink memory usage and increase inference speed.
Why quantization matters
A 7B parameter model at full precision (FP16) needs ~14 GB of VRAM. Quantized to 4-bit, that same model fits in ~4 GB. This is the difference between needing a $1,000 GPU and running on a laptop.
| Precision | Bits per parameter | 7B model VRAM | Quality impact |
|---|---|---|---|
| FP16 (full) | 16 | ~14 GB | Baseline (no quality loss) |
| INT8 (8-bit) | 8 | ~7 GB | Minimal quality loss for most tasks |
| INT4 (4-bit) | 4 | ~4 GB | Noticeable on complex reasoning, fine for extraction and classification |
| INT2 (2-bit) | 2 | ~2 GB | Significant degradation; use only for simple tasks |
Common quantization formats
| Format | Used by | Notes |
|---|---|---|
| GGUF | llama.cpp, Ollama | The most common format for local inference. Supports mixed quantization (different precision for different layers). When you see a model like qwen2.5:7b-q4_K_M, the q4_K_M suffix indicates 4-bit quantization with a specific scheme. |
| GPTQ | vLLM, HuggingFace | GPU-optimized quantization. Slightly better quality than GGUF at the same bit width but requires GPU. |
| AWQ | vLLM, HuggingFace | Activation-aware quantization. Preserves quality better than naive quantization by identifying and protecting important weights. |
| BitsAndBytes (bnb) | HuggingFace Transformers | Used for QLoRA training (4-bit inference + LoRA adapters). You'll encounter this in Module 8: Distillation. |
Practical guidance
- For curriculum exercises with Ollama: Ollama automatically downloads quantized models. When you run `ollama run qwen2.5:7b`, you get a quantized version that fits in reasonable VRAM. You don't need to choose a quantization format manually.
- For production serving: Start with the highest precision your hardware supports. Only quantize further if you need to reduce memory or improve throughput, and measure the quality impact on your specific task with your eval harness.
- For fine-tuning: QLoRA (Module 8) uses 4-bit quantization during training to reduce VRAM requirements. The trained LoRA adapter is applied on top of the quantized base model.
The tradeoff in one sentence
Quantization trades precision for speed and memory. The question isn't "should I quantize?" but "how much precision can I give up before my evals show a problem?"
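Answering that question is exactly what an eval harness is for: run the same eval set against the full-precision and quantized variants and compare scores. A minimal sketch — the models and eval set here are illustrative stand-ins, not curriculum APIs:

```python
def eval_accuracy(run_model, eval_set):
    """Score a model variant on an eval set of (prompt, expected) pairs."""
    correct = sum(1 for prompt, expected in eval_set if run_model(prompt) == expected)
    return correct / len(eval_set)

# Hypothetical stand-ins for two quantization levels of the same model:
# the 4-bit variant degrades on longer, more complex inputs.
fp16_model = lambda prompt: prompt.upper()
int4_model = lambda prompt: prompt.upper() if len(prompt) < 20 else prompt

eval_set = [
    ("extract name", "EXTRACT NAME"),
    ("classify this short text", "CLASSIFY THIS SHORT TEXT"),
]

baseline = eval_accuracy(fp16_model, eval_set)
quantized = eval_accuracy(int4_model, eval_set)
print(f"FP16: {baseline:.0%}, INT4: {quantized:.0%}")
```

When the gap between the two scores exceeds what your task can tolerate, you have quantized too far.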
Choosing by task type
| Task | Recommended family | Why | When to escalate |
|---|---|---|---|
| Retrieval routing / classification | SLM (after distillation) or workhorse | Bounded task, well-defined outputs | If SLM accuracy is insufficient |
| Code generation with evidence | Workhorse LLM | Needs instruction following + code quality | Complex multi-file reasoning → reasoning model |
| Structured answer generation | Workhorse LLM | Schema following + grounding | Consider SLM if the task is stable and distilled |
| Embedding generation | Embedding model | Different architecture optimized for similarity | N/A (don't use generative models for embeddings) |
| LLM-as-judge grading | Workhorse or reasoning | Needs nuanced quality assessment | High-stakes evals → reasoning model for better judgment |
| Simple extraction / formatting | SLM or small workhorse | Low complexity, speed matters | If extraction quality is low |
| Multi-step planning | Reasoning model | Needs deliberate step-by-step reasoning | Rarely worth downgrading |
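The "when to escalate" column can be operationalized as a routing rule: try the cheapest adequate family first and escalate only when a quality check fails. A sketch under stated assumptions — the task labels, `call_model`, and `passes_check` are illustrative placeholders, not a curriculum API:

```python
# Cheapest-first routing order per task, mirroring the table above.
ROUTES = {
    "classification": ["slm", "workhorse"],
    "code_generation": ["workhorse", "reasoning"],
    "extraction": ["slm", "workhorse"],
    "planning": ["reasoning"],
}

def route(task: str, call_model, passes_check) -> str:
    """Call models cheapest-first; escalate while the output fails the check."""
    for family in ROUTES[task]:
        output = call_model(family, task)
        if passes_check(output):
            return output
    return output  # best effort: keep the most capable family's answer

# Illustrative stubs: the SLM fails the extraction-quality check, so the
# router escalates to the workhorse.
call = lambda family, task: f"{family}:{task}"
ok = lambda out: out.startswith("workhorse")
print(route("extraction", call, ok))
```

In production the check would be a schema validation or a cheap heuristic, not a second LLM call, or escalation costs eat the savings.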
Hosted vs. self-hosted
| Factor | Hosted API | Self-hosted |
|---|---|---|
| Setup effort | Minutes (get API key) | Hours to days (hardware, software, configuration) |
| Per-request cost | Per-token pricing | Zero marginal cost (fixed hardware) |
| Scalability | Provider handles scaling | You handle scaling |
| Privacy | Data leaves your infrastructure | Data stays local |
| Model updates | Provider updates automatically | You update when you choose |
| Availability | Provider SLA, potential outages | Your uptime responsibility |
| Model access | Proprietary models (GPT-4o, Claude) | Open-weight models only |
Decision rule: Start with hosted APIs. Move to self-hosted when one of these is true:
- Privacy requirements prevent sending data to providers
- Per-token costs exceed the hardware cost of running locally
- You need a fine-tuned or distilled model not available as a hosted endpoint
- Latency requirements demand local inference
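The cost bullet is a straightforward break-even calculation. A sketch with illustrative numbers — your token prices, volumes, and hardware costs will differ:

```python
def monthly_api_cost(tokens_per_month: float, price_per_million: float) -> float:
    """Hosted cost: pure per-token pricing."""
    return tokens_per_month / 1e6 * price_per_million

def monthly_selfhost_cost(gpu_price: float, amortize_months: int, power_and_ops: float) -> float:
    """Self-hosted cost: amortized hardware plus fixed power/ops."""
    return gpu_price / amortize_months + power_and_ops

# Illustrative: 500M input tokens/month at $3 per 1M tokens...
api = monthly_api_cost(500e6, 3.0)            # $1500/month
# ...vs a $2400 GPU amortized over 24 months plus ~$100/month power/ops.
local = monthly_selfhost_cost(2400, 24, 100)  # $200/month
print(f"API: ${api:.0f}/mo, self-hosted: ${local:.0f}/mo")
```

The comparison only holds if the self-hosted model actually meets your quality bar, which is why the eval harness comes before the migration.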
Self-hosted serving options
Ollama, LM Studio, and llama.cpp are related tools, not competing alternatives. See Field Terms We Don't Teach for the architectural relationship and an architecture diagram.
Ollama
Best for: Development, prototyping, running models during curriculum exercises.
```bash
# Install
curl -fsSL https://ollama.com/install.sh | sh

# Run a model
ollama run qwen2.5:7b

# Serve via REST API
curl http://localhost:11434/api/chat -d '{
  "model": "qwen2.5:7b",
  "messages": [{"role": "user", "content": "Hello"}]
}'
```

Ollama downloads, quantizes, and serves models with minimal configuration. It's the right choice for individual development and curriculum work.
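The same endpoint can be called from Python with the standard library. A minimal sketch, assuming Ollama is running locally on its default port (11434) and the model has already been pulled; the request and response shape (`message.content`) follow Ollama's chat API:

```python
import json
import urllib.request

def build_chat_payload(model: str, prompt: str) -> dict:
    """Request body for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one JSON response instead of a chunk stream
    }

def ollama_chat(model: str, prompt: str, host: str = "http://localhost:11434") -> str:
    """Send one chat turn to a local Ollama server and return the reply text."""
    req = urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(build_chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

# ollama_chat("qwen2.5:7b", "Hello")  # requires a running Ollama server
```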
vLLM
Best for: Production serving with high throughput and concurrent users.
```bash
pip install vllm

# Serve a model with OpenAI-compatible API
vllm serve Qwen/Qwen2.5-7B-Instruct --port 8000
```

vLLM provides an OpenAI-compatible API endpoint, so switching from a hosted API to a local model requires changing only the base URL and model name. Its PagedAttention mechanism enables efficient memory management for concurrent requests.
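"Only the base URL and model name change" can be made concrete: the request path and body stay OpenAI-shaped either way. A stdlib-only sketch — the hosted values are illustrative placeholders, and in practice you'd more likely point an OpenAI SDK client at the local base URL:

```python
import json
import urllib.request

def chat_request(base_url: str, api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for any compatible server."""
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        }).encode(),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
    )

# Hosted provider (illustrative) vs. local vLLM: same code path, different config.
hosted = chat_request("https://api.openai.com/v1", "sk-...", "gpt-4o", "Hello")
local = chat_request("http://localhost:8000/v1", "not-needed", "Qwen/Qwen2.5-7B-Instruct", "Hello")
# urllib.request.urlopen(local) would send the request to the local vLLM server.
```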
When to use vLLM over Ollama: When serving models to multiple users, when you need precise control over batching and throughput, or when running in a production environment.
llama.cpp
Best for: Maximum control over quantization, running on CPU, or resource-constrained environments.
```bash
# Build from source
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build

# Run a GGUF model
./build/bin/llama-cli -m model.gguf -p "Hello"
```

llama.cpp supports CPU inference (slower but no GPU required) and fine-grained quantization options. Use it when you need to run models without a GPU or when you need specific quantization configurations.
Model selection by curriculum module
| Module | Primary model need | Recommended approach |
|---|---|---|
| 1: Foundation Sprint | API access for exercises | Hosted workhorse (OpenAI, Gemini, Anthropic, Ollama Cloud, HF Inference) |
| 2: Benchmark and Harness | Consistent model for baselines | Same hosted workhorse; consistency matters more than model choice |
| 3: Agent and Tool Building | Tool-calling capable model | Hosted workhorse (tool calling support varies by provider) |
| 4: Code Retrieval | Embedding model + generation model | Hosted embedding API + hosted workhorse |
| 5: RAG and Grounding | Generation with grounding | Hosted workhorse |
| 6: Observability and Evals | LLM-as-judge + generation | Hosted workhorse (consider reasoning model for judge) |
| 7: Orchestration and Memory | Multiple model calls per request | Hosted workhorse; consider model routing for cost |
| 8: Optimization | Training + local inference | Local SLM with Ollama for inference, QLoRA for training |
Cross-references
- Hardware and Model Size Guide — VRAM requirements, GPUs, and cloud options
- Model-Provider Matrix — the model and provider combinations that actually held the lesson contracts during validation
- LLM Mental Models — introduces the three model families
- Cost and Reliability Patterns — model routing pattern for cost optimization
- Choosing a Provider — provider setup and API key configuration