Hardware and Model Size Guide

This reference covers the hardware you'll need for the hands-on work in Module 8 (distillation and fine-tuning) and for running local models throughout the curriculum. Consult it before buying hardware or renting cloud GPUs. The right choice depends on what you're doing, not on what's most powerful.

VRAM requirements by task

Task	Model size	Technique	Minimum VRAM	Comfortable VRAM
Local inference (Ollama)	1-3B (SLM)	4-bit quantization	4 GB	6 GB
Local inference (Ollama)	7B	4-bit quantization	6 GB	8 GB
Local inference (Ollama)	14B	4-bit quantization	10 GB	12 GB
Local inference (vLLM)	7B	FP16	16 GB	24 GB
QLoRA training	3B	4-bit + LoRA adapters	6 GB	8 GB
QLoRA training	7B	4-bit + LoRA adapters	10 GB	12 GB
QLoRA training	14B	4-bit + LoRA adapters	18 GB	24 GB
LoRA training	7B	FP16 + LoRA adapters	16 GB	24 GB
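Full fine-tuning	7B	FP16, all parameters	28 GB+	40 GB+

"Minimum" means the job will run but may be slow or need small batch sizes. "Comfortable" means you can train at reasonable batch sizes without constant out-of-memory (OOM) errors. The full fine-tuning figures roughly match FP16 weights plus gradients (about 2 bytes per parameter each). Standard AdamW optimizer states kept in FP32 add about 8 more bytes per parameter unless you use an 8-bit optimizer or CPU offloading, so plan for more than the minimum.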

Consumer GPUs

GPU	VRAM	Approximate price (USD)	Good for
RTX 3060	12 GB	$250-300 (used)	QLoRA on 3B-7B, local inference up to 14B
RTX 3090	24 GB	$700-900 (used)	QLoRA on 7B-14B, LoRA on 7B, comfortable local inference
RTX 4060 Ti (16GB)	16 GB	$400-450	QLoRA on 7B, local inference up to 14B
RTX 4070 Ti Super	16 GB	$750-800	Same as 4060 Ti 16GB with faster training
RTX 4090	24 GB	$1,600-1,900	QLoRA on 14B, LoRA on 7B, fastest consumer training
RTX 5090	32 GB	$2,000+	QLoRA on 14B-32B, LoRA on 14B

Prices are approximate and fluctuate, so check current market prices before purchasing. Used older-generation cards (30-series) offer the best value for learning. You don't need the latest generation for the exercises in this curriculum.
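To check a specific card against the training rows of the VRAM table, a small lookup can help. The sketch below copies the QLoRA and LoRA rows from "VRAM requirements by task"; the names `TRAINING_ROWS` and `classify_fit` are illustrative, not part of any library.

```python
# (technique, model size, minimum VRAM in GB, comfortable VRAM in GB),
# copied from the QLoRA and LoRA rows of "VRAM requirements by task".
TRAINING_ROWS = [
    ("QLoRA", "3B", 6, 8),
    ("QLoRA", "7B", 10, 12),
    ("QLoRA", "14B", 18, 24),
    ("LoRA", "7B", 16, 24),
]


def classify_fit(vram_gb):
    """Map each (technique, size) pair to 'comfortable', 'minimum', or 'too small'."""
    result = {}
    for technique, size, minimum, comfortable in TRAINING_ROWS:
        if vram_gb >= comfortable:
            verdict = "comfortable"
        elif vram_gb >= minimum:
            verdict = "minimum"
        else:
            verdict = "too small"
        result[(technique, size)] = verdict
    return result


# Example: a 12 GB card such as the RTX 3060.
for (technique, size), verdict in classify_fit(12).items():
    print(f"{technique} on {size}: {verdict}")
```

For a 12 GB card this reports QLoRA on 3B and 7B as comfortable and the 14B and LoRA rows as too small, which matches the "Good for" column above.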

Cloud GPU options

If you don't have a local GPU or need more VRAM than your card provides, cloud GPUs are available by the hour:

Provider	GPU	VRAM	Approximate cost/hour (USD)	Notes
Google Colab Pro	T4 / L4	16 GB (T4) / 24 GB (L4)	$0.10-0.15 (subscription)	Easiest to start with. Free tier has a T4 but with usage limits.
Lambda Labs	A10	24 GB	$0.60-0.75	Good for QLoRA on 7B-14B
RunPod	A40 / A100	48-80 GB	$0.75-2.00	Good for larger models, pay-per-minute
Vast.ai	Various	Varies	$0.30-1.50	Marketplace pricing, cheapest for spot instances
AWS (SageMaker)	Various	Varies	$1.00-5.00+	Enterprise option, more operational overhead
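Whether to buy or rent often comes down to how many GPU-hours you expect to use. The sketch below is a rough break-even calculation using midpoints of the price ranges quoted in this guide (a used RTX 3090 and a 24 GB Lambda Labs instance). It ignores electricity, resale value, and price changes, and `break_even_hours` is an illustrative helper, not an established formula.

```python
def break_even_hours(purchase_price_usd, cloud_rate_usd_per_hour):
    """Hours of cloud rental that cost as much as buying the card outright."""
    if cloud_rate_usd_per_hour <= 0:
        raise ValueError("cloud rate must be positive")
    return purchase_price_usd / cloud_rate_usd_per_hour


# Midpoints of the ranges quoted in the tables above (assumed inputs).
used_3090_price = (700 + 900) / 2       # USD
lambda_24gb_rate = (0.60 + 0.75) / 2    # USD per hour

hours = break_even_hours(used_3090_price, lambda_24gb_rate)
print(f"Break-even after about {hours:.0f} GPU-hours")
```

With these inputs the break-even point is roughly 1,200 GPU-hours, so short or occasional training runs tend to favor renting.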

For the exercises in Module 8

A Google Colab Pro subscription or a single Lambda Labs session with a 24 GB A10 is sufficient. You don't need a multi-GPU setup or a dedicated server for curriculum work.

Model size tradeoffs

Size class	Parameter range	Examples	Inference speed	Quality	Training cost	Good for
Tiny SLM	0.5-1.5B	Qwen 2.5 0.5B, Llama 3.2 1B	Very fast	Limited	Very low	Classification, routing, simple extraction
Small SLM	1.5-3B	Qwen 2.5 3B, Gemma 2 2B	Fast	Moderate	Low	Distillation targets, structured generation
Medium SLM	7-8B	Llama 3.1 8B, Qwen 2.5 7B, Mistral 7B	Moderate	Good	Moderate	General-purpose local model, fine-tuning target
Large SLM	13-14B	Qwen 2.5 14B, Phi-4 14B	Slower	Very good	Higher	Best local quality when VRAM allows
Large LLM	32-70B+	Llama 3.1 70B, Qwen 2.5 72B	Slow locally	Excellent	High	Usually hosted; running 70B locally takes multiple GPUs or a 48 GB+ card, even at 4-bit

The sweet spot for curriculum work is 3B-7B. These models are large enough to produce useful outputs after fine-tuning but small enough to train on a single consumer GPU with QLoRA. Start with 3B for quick iteration, then move to 7B once your training pipeline is working.
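A back-of-envelope check shows why this range fits on consumer cards. Weight memory is roughly parameter count times bytes per parameter. The sketch below assumes 1 GB = 1e9 bytes and counts only weights, gradients, and optimizer states. Activations, KV cache, and framework overhead come on top, which is why the table figures are higher than these numbers. The function names and default byte counts are illustrative assumptions.

```python
def weight_memory_gb(params_billion, bits_per_param):
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8


def full_finetune_memory_gb(params_billion, weight_bytes=2, grad_bytes=2,
                            optimizer_bytes=8):
    """Weights + gradients + optimizer states, in GB.

    Defaults assume FP16 weights and gradients and two FP32 AdamW states
    per parameter; activations are not included.
    """
    return params_billion * (weight_bytes + grad_bytes + optimizer_bytes)


for size in (3, 7, 14):
    print(f"{size}B weights at 4-bit: {weight_memory_gb(size, 4):.1f} GB, "
          f"at FP16: {weight_memory_gb(size, 16):.1f} GB")

print(f"7B full fine-tune (weights + grads + AdamW): "
      f"{full_finetune_memory_gb(7)} GB")
```

A 7B model at 4-bit needs about 3.5 GB for weights, leaving room for LoRA adapters and activations on a 12 GB card. The same model fully fine-tuned with AdamW needs about 84 GB before activations.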

When SLMs make sense

Small language models (roughly 0.5-14B parameters) are the right choice when:

Privacy: Data can't leave your infrastructure. Local models keep everything on your hardware.
Offline operation: No internet connection available (air-gapped environments, edge devices).
Predictable cost: No per-token API charges. Cost is limited to hardware and electricity.
Low latency: Local inference on a 3B model can be faster than a round-trip API call to a hosted LLM.
Distillation target: You've identified a bounded task where a smaller model can match teacher quality.

SLMs are the wrong choice when:

General capability matters: For open-ended tasks, a hosted workhorse LLM will significantly outperform a local SLM.
You need the best quality: The gap between a 7B model and GPT-4o / Claude Sonnet on complex reasoning is substantial.
Training is premature: If you haven't exhausted prompt and retrieval improvements, training a local model won't fix the real problem.

Local serving options

Tool	Best for	How it works
Ollama	Quick local inference, development, prototyping	Downloads and serves quantized models with one command. `ollama run qwen2.5:7b` and you're running. REST API at `http://localhost:11434`.
llama.cpp	Maximum control over quantization and inference	C++ inference engine. Supports GGUF format with many quantization levels. More setup than Ollama but more configurable.
vLLM	Production-grade serving with high throughput	Python-based serving with PagedAttention for efficient memory management. Best for serving models to multiple concurrent users.

For curriculum work, Ollama is the default. It downloads pre-quantized models and serves them with minimal configuration. Move to vLLM when you need to serve models in production or to multiple concurrent users.
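As a minimal sketch of calling the Ollama REST API from Python, the example below posts to the `/api/generate` endpoint with streaming disabled and reads the `response` field of the returned JSON. It uses only the standard library. The helper names `build_request` and `generate` are illustrative, and the code assumes Ollama is running locally with `qwen2.5:7b` already pulled.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="qwen2.5:7b"):
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="qwen2.5:7b", timeout=120):
    """Send the prompt and return the generated text.

    Requires a running Ollama server; raises urllib.error.URLError otherwise.
    """
    with urllib.request.urlopen(build_request(prompt, model),
                                timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `generate("Explain QLoRA in one sentence.")` returns the model's reply as a string. Setting `stream` to false keeps the example simple. For interactive use you would likely stream tokens instead.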

Cross-references

Model Selection and Serving — decision framework for choosing model type and hosting
Distillation — uses QLoRA training with the VRAM requirements listed here
Fine-Tuning — uses SFT training with the VRAM requirements listed here
LLM Mental Models — introduces the model family taxonomy (workhorse, reasoning, SLM)