How to Run LLMs Locally with Ollama (2026 Guide)
Running LLMs locally used to mean fighting CUDA drivers and manually patching model loaders. Ollama changed that. It wraps model download, quantization, and a REST API into a single binary — you install it, pull a model, and you're running inference in under five minutes.
This guide covers the complete setup: hardware requirements, installation on all three platforms, launching your first model, adding a web interface, and the performance and troubleshooting details that actually matter.
Why Run LLMs Locally?
Privacy. Every prompt you send to a cloud API is logged, potentially reviewed, and stored. Local inference means nothing leaves your machine.
Cost. After the hardware investment, marginal token cost is zero. Teams sending millions of tokens per month save hundreds to thousands of dollars.
Offline access. Airplane, remote site, internet outage — your model is still there.
No rate limits. No 429 errors, no queue wait, no throttling. Run as many requests as your hardware supports.
Hardware Requirements
Minimum Specs
| Use case | GPU VRAM | RAM | Notes |
|---|---|---|---|
| 3B models (Phi-3.5, Llama 3.2 3B) | 3GB | 8GB | CPU-only viable |
| 7B models (Llama 3.1 8B, Mistral 7B) | 6GB | 16GB | RTX 3060 minimum |
| 13-14B models (Phi-4 14B, Gemma 2 12B) | 10GB | 16GB | RTX 3080/4060 Ti |
| 30B+ models | 20GB+ | 32GB+ | RTX 4090 or multi-GPU |
| 70B models (Llama 3.1 70B) | 40GB+ | 64GB | Requires server-class or cloud |
VRAM Budget by Quantization
Default Ollama models use Q4_K_M quantization — good balance of quality and size.
| Model size | Q4_K_M VRAM | Q8_0 VRAM |
|---|---|---|
| 3B | ~2GB | ~3.5GB |
| 7B | ~4.5GB | ~8GB |
| 13B | ~8GB | ~14GB |
| 30B | ~18GB | ~32GB |
| 70B | ~40GB | ~70GB |
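These figures follow a rule of thumb you can use to size models not in the table: weight memory ≈ parameter count × bits per weight / 8, plus headroom for the KV cache and runtime. The bits-per-weight values below are approximations (Q4_K_M averages roughly 4.5-5 effective bits per weight, Q8_0 about 8.5), not exact format sizes:
# 7B-class model (e.g. Llama 3.1 8B) at Q4_K_M:
#   8e9 params × 4.5 bits / 8 ≈ 4.5 GB  → matches the ~4.5GB row above
# Same model at Q8_0:
#   8e9 params × 8.5 bits / 8 ≈ 8.5 GB  → close to the ~8GB row
awk 'BEGIN { printf "Q4: %.1f GB  Q8: %.1f GB\n", 8e9*4.5/8/1e9, 8e9*8.5/8/1e9 }'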
If you don't have a GPU that hits these numbers, see the cloud GPU options section below.
Recommended hardware for most users:
- ASUS RTX 4060 8GB — handles all 7B models comfortably
- MSI RTX 4080 16GB — 13B models at full speed; 70B Q4 only with heavy CPU offload, which is slow
- NVIDIA RTX 4090 24GB — best consumer option for 30B models unquantized
Installing Ollama
Linux
curl -fsSL https://ollama.com/install.sh | sh
That's the full install. It sets up the ollama binary, creates a systemd service (ollama.service), and installs CUDA dependencies if an NVIDIA GPU is detected. Verify the service is running:
systemctl status ollama
To start it manually if the service isn't running:
ollama serve
macOS
Download the macOS app from ollama.com or install via Homebrew:
brew install ollama
On Apple Silicon (M1/M2/M3/M4), Ollama uses the Metal backend and runs very efficiently — a MacBook Pro M3 Pro handles 14B models at ~30 tokens/second, comparable to an RTX 4070.
Start the server:
ollama serve
Windows
Download the Windows installer from ollama.com. It installs as a background service and adds ollama to your PATH. NVIDIA drivers must be installed separately from nvidia.com for GPU acceleration.
Verify the install:
ollama --version
Running Your First Model
Pull and run Llama 3.1 8B:
ollama pull llama3.1:8b
ollama run llama3.1:8b
pull downloads the model (~4.7GB for Q4_K_M). run starts an interactive chat session. Exit with /bye or Ctrl+D.
Other useful commands:
# List downloaded models
ollama list
# See what's currently loaded in memory
ollama ps
# Run a one-shot prompt (no interactive session)
ollama run llama3.1:8b "Explain quantization in one paragraph"
# Delete a model to free disk space
ollama rm llama3.1:8b
API access. Ollama runs a REST API on localhost:11434 by default. The native generate endpoint:
curl http://localhost:11434/api/generate \
-d '{"model": "llama3.1:8b", "prompt": "What is 2+2?", "stream": false}'
Setting Up Open WebUI
Open WebUI gives you a ChatGPT-style interface connected to your local Ollama instance. It supports conversation history, model switching, file uploads, and RAG.
Requires Docker. Install Docker from docker.com if needed.
docker run -d \
-p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main
Open http://localhost:3000 in your browser. On first launch, create an admin account. Open WebUI auto-detects Ollama at http://host.docker.internal:11434.
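If Ollama runs somewhere other than the Docker host's default address, say on another machine on your LAN, point Open WebUI at it with the OLLAMA_BASE_URL environment variable (the address below is an example):
docker run -d -p 3000:8080 \
  -e OLLAMA_BASE_URL=http://192.168.1.50:11434 \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main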
From the interface you can pull new models directly, switch between models per conversation, and add documents for RAG without leaving the browser.
Popular Models in 2026
| Model | Size | Best for | Pull command |
|---|---|---|---|
| Llama 3.1 8B | 4.7GB | General use, instruction following | ollama pull llama3.1:8b |
| Llama 3.1 70B | 43GB | High-quality generation, complex reasoning | ollama pull llama3.1:70b |
| Mistral 7B | 4.1GB | Fast inference, European language support | ollama pull mistral:7b |
| CodeLlama 7B | 3.8GB | Code generation and completion | ollama pull codellama:7b |
| DeepSeek-R1 7B | 4.7GB | Reasoning, math, structured thinking | ollama pull deepseek-r1:7b |
| Phi-4 14B | 8.5GB | Strong reasoning relative to size | ollama pull phi4:14b |
| Gemma 2 9B | 5.5GB | Google's capable mid-size option | ollama pull gemma2:9b |
| Qwen 2.5 7B | 4.7GB | Strong multilingual and code tasks | ollama pull qwen2.5:7b |
For most users, Llama 3.1 8B is the right starting point — it runs on any GPU with 6GB+ VRAM and covers 90% of general tasks. DeepSeek-R1 7B is worth pulling if you do structured reasoning or math.
If you want a GUI comparison of other local LLM launchers alongside Ollama, see LM Studio vs Jan vs GPT4All.
GPU vs CPU Inference
Ollama automatically uses your GPU if one is detected. The performance difference is significant:
| Hardware | Model | Speed (tokens/sec) |
|---|---|---|
| RTX 4060 8GB | Llama 3.1 8B Q4 | ~55-70 t/s |
| RTX 4080 16GB | Llama 3.1 8B Q4 | ~90-110 t/s |
| RTX 4090 24GB | Llama 3.1 8B Q4 | ~130-160 t/s |
| M3 Pro (Apple) | Llama 3.1 8B Q4 | ~35-45 t/s |
| Ryzen 9 7950X (CPU only) | Llama 3.1 8B Q4 | ~8-12 t/s |
| Intel Core i9-14900K (CPU only) | Llama 3.1 8B Q4 | ~6-9 t/s |
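To benchmark your own setup, pass --verbose to ollama run; it prints timing stats after each response, including the eval rate in tokens per second:
ollama run llama3.1:8b --verbose "Summarize the plot of Hamlet in two sentences."
# The stats block at the end includes a line like:
#   eval rate:            62.31 tokens/s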
CPU inference is acceptable for:
- 3B models (fast enough for real-time chat at ~15-25 t/s)
- Batch processing where latency doesn't matter
- Testing models before committing to download size
- ARM servers with high memory bandwidth (Graviton, Ampere)
CPU inference is too slow for:
- 13B+ models (falls below comfortable reading speed)
- Production API serving with concurrent requests
- Anything requiring real-time interactive use
To force CPU-only inference (useful for debugging), set the num_gpu option, which controls how many layers are offloaded to the GPU, to 0: either per request via the API, as shown below, or permanently with PARAMETER num_gpu 0 in a Modelfile.
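This forces the whole model onto the CPU for that request (num_gpu is the documented Ollama option for GPU layer offload):
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b", "options": {"num_gpu": 0}, "prompt": "Why is the sky blue?", "stream": false}'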
Tips and Troubleshooting
VRAM Overflow (Model Falls Back to CPU)
If your model is too large for your GPU VRAM, Ollama offloads some layers to system RAM. Inference still works, just much slower. You'll see this in the output of ollama ps: the PROCESSOR column reports a CPU/GPU split (for example 40%/60% CPU/GPU) instead of 100% GPU.
Fix: use a smaller quantization or a smaller model. Pull the 3B variant instead of 8B, or switch to Q4 from Q8:
ollama pull llama3.1:8b-instruct-q4_K_M
Increase Context Length
Ollama's default context window is small (2048 tokens on older releases, 4096 on newer ones) regardless of what the model itself supports. Extend it with a Modelfile:
# Create a custom Modelfile
cat > Modelfile << EOF
FROM llama3.1:8b
PARAMETER num_ctx 8192
EOF
ollama create llama3.1-8k -f Modelfile
ollama run llama3.1-8k
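To confirm the override took, inspect the Modelfile Ollama stored for the custom model:
ollama show llama3.1-8k --modelfile
# The output should include: PARAMETER num_ctx 8192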
Or set via API:
curl http://localhost:11434/api/generate \
-d '{"model": "llama3.1:8b", "options": {"num_ctx": 8192}, "prompt": "..."}'
Note: larger context windows consume more VRAM. 8K context on a 7B model adds ~1.5GB VRAM over the base 2K context.
Ollama Not Detecting GPU
Check that drivers are installed and the GPU is visible:
nvidia-smi # Should show your GPU
ollama run llama3.1:8b # Check if GPU is used in `ollama ps`
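On Linux you can also check the server log directly; with the systemd install it records what hardware was detected at startup (exact wording varies by version, so treat the grep pattern as a starting point):
journalctl -u ollama --no-pager | grep -iE "cuda|gpu" | tail -n 20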
On Windows, check the CUDA version nvidia-smi reports in its header; that figure is the highest CUDA version your driver supports, and Ollama needs drivers supporting CUDA 11.8 or newer.
Slow First Response
The first request to a model takes longer because Ollama loads the weights into VRAM on the first call; that's normal. Subsequent requests are fast. You can preload a model into memory:
curl http://localhost:11434/api/generate \
-d '{"model": "llama3.1:8b", "keep_alive": "1h", "prompt": ""}'
Q4 vs Q8: Which to Use?
- Q4_K_M: Default. ~10-15% quality loss vs full precision. Use this for interactive chat and code gen.
- Q5_K_M: Better quality, ~25% more VRAM. Worth it if you have headroom.
- Q8_0: Near-lossless. ~2x the Q4 size. Use only if you have the VRAM and care about output quality for evals or production.
For most use cases, Q4_K_M is the right call.
Environment Variables
OLLAMA_KEEP_ALIVE=10m            # How long to keep a model loaded (default 5m)
OLLAMA_HOST=0.0.0.0:11434        # Expose API on all interfaces (for LAN access)
OLLAMA_MODELS=/data/models       # Custom model storage path
OLLAMA_NUM_PARALLEL=4            # Concurrent requests served per loaded model
OLLAMA_MAX_LOADED_MODELS=2       # Models kept resident in memory at once
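These variables configure the server, not the CLI client, so they must be visible to the ollama serve process. With the Linux systemd install, the standard route is a service override (the values below are examples):
sudo systemctl edit ollama.service
# In the editor that opens, add:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0:11434"
#   Environment="OLLAMA_MODELS=/data/models"
sudo systemctl restart ollama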
No GPU? Use Cloud Inference
If you want local LLM control without the hardware cost, Vast.ai rents individual GPUs by the hour. An RTX 4090 runs at ~$0.25-0.40/hour — enough to run Llama 3.1 70B at full speed. You install Ollama on the rented instance and expose the API to your local machine over SSH tunnel.
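The tunnel itself is one command. A minimal sketch, assuming the instance runs Ollama on its default port, you have SSH access, and nothing local is already bound to 11434 (replace user and host with your instance's details):
# Forward local port 11434 to the Ollama server on the rented instance
ssh -N -L 11434:localhost:11434 user@INSTANCE_IP
# In another terminal, talk to it exactly like a local install:
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:70b", "prompt": "Hello", "stream": false}'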
For a full comparison of GPU cloud platforms and pricing, see Best GPU Cloud Platforms for AI in 2026.
For production multi-agent setups with Ollama, see OpenClaw + Ollama Production Config.
Summary
Ollama makes local LLM deployment straightforward. The real decision points are:
1. Model size vs hardware: 7B on 8GB VRAM is the most practical starting point
2. Interface: Add Open WebUI if you want a browser UI; use the API directly for programmatic access
3. Quantization: Q4_K_M for most use cases, Q8 if you have VRAM headroom and care about output quality
4. GPU vs CPU: GPU for anything interactive; CPU-only works for small models or batch jobs
Start with ollama pull llama3.1:8b and ollama run llama3.1:8b. The rest follows from there.
*Disclosure: This article contains affiliate links. ToolHalla may earn a commission at no extra cost to you. We only recommend hardware we'd actually use.*
Frequently Asked Questions
How does Ollama simplify the process of running LLMs locally?
Ollama simplifies the process by wrapping model download, quantization, and a REST API into a single binary, allowing users to install it, pull a model, and start inference in under five minutes.
What are the hardware requirements for running larger LLMs like Llama 3.1 70B?
For running larger models such as Llama 3.1 70B, you need a GPU with 40GB+ VRAM and at least 64GB of RAM.
Is it possible to run LLMs locally without a GPU?
Yes, it is possible to run smaller models like Phi-3.5 or Llama 3.2 3B locally without a GPU, though performance will be significantly slower.
How does running LLMs locally compare in terms of cost to using cloud APIs?
Running LLMs locally is more cost-effective after the initial hardware investment, as the marginal cost for each token is zero, unlike cloud APIs which charge per token.
What are some alternatives to Ollama for running LLMs locally?
Alternatives to Ollama include GUI launchers like LM Studio, Jan, and GPT4All (compared in the guide linked above), as well as lower-level stacks like llama.cpp, vLLM, and Hugging Face's Transformers library, though these often require more manual setup and configuration.