How to Run LLMs Locally with Ollama (2026 Guide)
Running LLMs locally used to mean fighting CUDA drivers and manually patching model loaders. Ollama changed that. It wraps model download, quantization, and a REST API into a single binary — you install it, pull a model, and you're running inference in under five minutes.
This guide covers the complete setup: hardware requirements, installation on all three platforms, launching your first model, adding a web interface, and the performance and troubleshooting details that actually matter.
Why Run LLMs Locally?
Privacy. Every prompt you send to a cloud API is logged, potentially reviewed, and stored. Local inference means nothing leaves your machine.
Cost. After the hardware investment, marginal token cost is zero. Teams sending millions of tokens per month save hundreds to thousands of dollars.
Offline access. Airplane, remote site, internet outage — your model is still there.
No rate limits. No 429 errors, no queue wait, no throttling. Run as many requests as your hardware supports.
Hardware Requirements
Minimum Specs
| Use case | GPU VRAM | RAM | Notes |
|---|---|---|---|
| 3B models (Phi-3.5, Llama 3.2 3B) | 3GB | 8GB | CPU-only viable |
| 7B models (Llama 3.1 8B, Mistral 7B) | 6GB | 16GB | RTX 3060 minimum |
| 13-14B models (Phi-4 14B, Gemma 2 12B) | 10GB | 16GB | RTX 3080/4060 Ti |
| 30B+ models | 20GB+ | 32GB+ | RTX 4090 or multi-GPU |
| 70B models (Llama 3.1 70B) | 40GB+ | 64GB | Requires server-class or cloud |
VRAM Budget by Quantization
Default Ollama models use Q4_K_M quantization — good balance of quality and size.
| Model size | Q4_K_M VRAM | Q8_0 VRAM |
|---|---|---|
| 3B | ~2GB | ~3.5GB |
| 7B | ~4.5GB | ~8GB |
| 13B | ~8GB | ~14GB |
| 30B | ~18GB | ~32GB |
| 70B | ~40GB | ~70GB |
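These figures follow a rule of thumb you can use to size models not in the table: weight memory ≈ parameter count × bits per weight / 8, plus headroom for the KV cache and runtime. The bits-per-weight values below are approximations (Q4_K_M averages roughly 4.5-5 effective bits per weight, Q8_0 about 8.5), not exact format sizes:
# 7B-class model (e.g. Llama 3.1 8B) at Q4_K_M:
#   8e9 params × 4.5 bits / 8 ≈ 4.5 GB  → matches the ~4.5GB row above
# Same model at Q8_0:
#   8e9 params × 8.5 bits / 8 ≈ 8.5 GB  → close to the ~8GB row
awk 'BEGIN { printf "Q4: %.1f GB  Q8: %.1f GB\n", 8e9*4.5/8/1e9, 8e9*8.5/8/1e9 }'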
If you don't have a GPU that hits these numbers, see the cloud GPU options section below.
Recommended hardware for most users:
- ASUS RTX 4060 8GB — handles all 7B models comfortably
- MSI RTX 4080 16GB — 13B models at full speed; 70B Q4 only with heavy CPU offload, which is slow
- NVIDIA RTX 4090 24GB — best consumer option for 30B models unquantized
Installing Ollama
Linux
curl -fsSL https://ollama.com/install.sh | sh
That's the full install. It sets up the ollama binary, creates a systemd service (ollama.service), and installs CUDA dependencies if an NVIDIA GPU is detected. Verify the service is running:
systemctl status ollama
To start it manually if the service isn't running:
ollama serve
macOS
Download the macOS app from ollama.com or install via Homebrew:
brew install ollama
On Apple Silicon (M1/M2/M3/M4), Ollama uses the Metal backend and runs very efficiently — a MacBook Pro M3 Pro handles 14B models at ~30 tokens/second, comparable to an RTX 4070.
Start the server:
ollama serve
Windows
Download the Windows installer from ollama.com. It installs as a background service and adds ollama to your PATH. NVIDIA drivers must be installed separately from nvidia.com for GPU acceleration.
Verify the install:
ollama --version
Running Your First Model
Pull and run Llama 3.1 8B:
ollama pull llama3.1:8b
ollama run llama3.1:8b
pull downloads the model (~4.7GB for Q4_K_M). run starts an interactive chat session. Exit with /bye or Ctrl+D.
Other useful commands:
# List downloaded models
ollama list
# See what's currently loaded in memory
ollama ps
# Run a one-shot prompt (no interactive session)
ollama run llama3.1:8b "Explain quantization in one paragraph"
# Delete a model to free disk space
ollama rm llama3.1:8b
API access. Ollama runs a REST API on localhost:11434 by default. The native generate endpoint:
curl http://localhost:11434/api/generate \
-d '{"model": "llama3.1:8b", "prompt": "What is 2+2?", "stream": false}'
Setting Up Open WebUI
Open WebUI gives you a ChatGPT-style interface connected to your local Ollama instance. It supports conversation history, model switching, file uploads, and RAG.
Requires Docker. Install Docker from docker.com if needed.
docker run -d \
-p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main
Open http://localhost:3000 in your browser. On first launch, create an admin account. Open WebUI auto-detects Ollama at http://host.docker.internal:11434.
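If Ollama runs somewhere other than the Docker host's default address, say on another machine on your LAN, point Open WebUI at it with the OLLAMA_BASE_URL environment variable (the address below is an example):
docker run -d -p 3000:8080 \
  -e OLLAMA_BASE_URL=http://192.168.1.50:11434 \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main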
From the interface you can pull new models directly, switch between models per conversation, and add documents for RAG without leaving the browser.
Popular Models in 2026
| Model | Size | Best for | Pull command |
|---|---|---|---|
| Llama 3.1 8B | 4.7GB | General use, instruction following | ollama pull llama3.1:8b |
| Llama 3.1 70B | 43GB | High-quality generation, complex reasoning | ollama pull llama3.1:70b |
| Mistral 7B | 4.1GB | Fast inference, European language support | ollama pull mistral:7b |
| CodeLlama 7B | 3.8GB | Code generation and completion | ollama pull codellama:7b |
| DeepSeek-R1 7B | 4.7GB | Reasoning, math, structured thinking | ollama pull deepseek-r1:7b |
| Phi-4 14B | 8.5GB | Strong reasoning relative to size | ollama pull phi4:14b |
| Gemma 2 9B | 5.5GB | Google's capable mid-size option | ollama pull gemma2:9b |
| Qwen 2.5 7B | 4.7GB | Strong multilingual and code tasks | ollama pull qwen2.5:7b |
For most users, Llama 3.1 8B is the right starting point — it runs on any GPU with 6GB+ VRAM and covers 90% of general tasks. DeepSeek-R1 7B is worth pulling if you do structured reasoning or math.
If you want a GUI comparison of other local LLM launchers alongside Ollama, see LM Studio vs Jan vs GPT4All.
GPU vs CPU Inference
Ollama automatically uses your GPU if one is detected. The performance difference is significant:
| Hardware | Model | Speed (tokens/sec) |
|---|---|---|
| RTX 4060 8GB | Llama 3.1 8B Q4 | ~55-70 t/s |
| RTX 4080 16GB | Llama 3.1 8B Q4 | ~90-110 t/s |
| RTX 4090 24GB | Llama 3.1 8B Q4 | ~130-160 t/s |
| M3 Pro (Apple) | Llama 3.1 8B Q4 | ~35-45 t/s |
| Ryzen 9 7950X (CPU only) | Llama 3.1 8B Q4 | ~8-12 t/s |
| Intel Core i9-14900K (CPU only) | Llama 3.1 8B Q4 | ~6-9 t/s |
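To benchmark your own setup, pass --verbose to ollama run; it prints timing stats after each response, including the eval rate in tokens per second:
ollama run llama3.1:8b --verbose "Summarize the plot of Hamlet in two sentences."
# The stats block at the end includes a line like:
#   eval rate:            62.31 tokens/s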
CPU inference is acceptable for:
- 3B models (fast enough for real-time chat at ~15-25 t/s)
- Batch processing where latency doesn't matter
- Testing models before committing to download size
- ARM servers with high memory bandwidth (Graviton, Ampere)
CPU inference is too slow for:
- 13B+ models (falls below comfortable reading speed)
- Production API serving with concurrent requests
- Anything requiring real-time interactive use
To force CPU-only inference (useful for debugging), set the num_gpu option, which controls how many layers are offloaded to the GPU, to 0: either per request via the API, as shown below, or permanently with PARAMETER num_gpu 0 in a Modelfile.
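This forces the whole model onto the CPU for that request (num_gpu is the documented Ollama option for GPU layer offload):
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b", "options": {"num_gpu": 0}, "prompt": "Why is the sky blue?", "stream": false}'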
Tips and Troubleshooting
VRAM Overflow (Model Falls Back to CPU)
If your model is too large for your GPU VRAM, Ollama offloads some layers to system RAM. Inference still works, just much slower. You'll see this in the output of ollama ps: the PROCESSOR column reports a CPU/GPU split (for example 40%/60% CPU/GPU) instead of 100% GPU.
Fix: use a smaller quantization or a smaller model. Pull the 3B variant instead of 8B, or switch to Q4 from Q8:
ollama pull llama3.1:8b-instruct-q4_K_M
Increase Context Length
Ollama's default context window is small (2048 tokens on older releases, 4096 on newer ones) regardless of what the model itself supports. Extend it with a Modelfile:
# Create a custom Modelfile
cat > Modelfile << EOF
FROM llama3.1:8b
PARAMETER num_ctx 8192
EOF
ollama create llama3.1-8k -f Modelfile
ollama run llama3.1-8k
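To confirm the override took, inspect the Modelfile Ollama stored for the custom model:
ollama show llama3.1-8k --modelfile
# The output should include: PARAMETER num_ctx 8192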
Or set via API:
curl http://localhost:11434/api/generate \
-d '{"model": "llama3.1:8b", "options": {"num_ctx": 8192}, "prompt": "..."}'
Note: larger context windows consume more VRAM. 8K context on a 7B model adds ~1.5GB VRAM over the base 2K context.
Ollama Not Detecting GPU
Check that drivers are installed and the GPU is visible:
nvidia-smi # Should show your GPU
ollama run llama3.1:8b # Check if GPU is used in `ollama ps`
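On Linux you can also check the server log directly; with the systemd install it records what hardware was detected at startup (exact wording varies by version, so treat the grep pattern as a starting point):
journalctl -u ollama --no-pager | grep -iE "cuda|gpu" | tail -n 20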
On Windows, check the CUDA version nvidia-smi reports in its header; that figure is the highest CUDA version your driver supports, and Ollama needs drivers supporting CUDA 11.8 or newer.
Slow First Response
The first request to a model takes longer because Ollama loads the weights into VRAM on the first call; that's normal. Subsequent requests are fast. You can preload a model into memory:
curl http://localhost:11434/api/generate \
-d '{"model": "llama3.1:8b", "keep_alive": "1h", "prompt": ""}'
Q4 vs Q8: Which to Use?
- Q4_K_M: Default. ~10-15% quality loss vs full precision. Use this for interactive chat and code gen.
- Q5_K_M: Better quality, ~25% more VRAM. Worth it if you have headroom.
- Q8_0: Near-lossless. ~2x the Q4 size. Use only if you have the VRAM and care about output quality for evals or production.
For most use cases, Q4_K_M is the right call.
Environment Variables
OLLAMA_KEEP_ALIVE=10m            # How long to keep a model loaded (default 5m)
OLLAMA_HOST=0.0.0.0:11434        # Expose API on all interfaces (for LAN access)
OLLAMA_MODELS=/data/models       # Custom model storage path
OLLAMA_NUM_PARALLEL=4            # Concurrent requests served per loaded model
OLLAMA_MAX_LOADED_MODELS=2       # Models kept resident in memory at once
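These variables configure the server, not the CLI client, so they must be visible to the ollama serve process. With the Linux systemd install, the standard route is a service override (the values below are examples):
sudo systemctl edit ollama.service
# In the editor that opens, add:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0:11434"
#   Environment="OLLAMA_MODELS=/data/models"
sudo systemctl restart ollama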
No GPU? Use Cloud Inference
If you want local LLM control without the hardware cost, Vast.ai rents individual GPUs by the hour. An RTX 4090 runs at ~$0.25-0.40/hour — enough to run Llama 3.1 70B at full speed. You install Ollama on the rented instance and expose the API to your local machine over SSH tunnel.
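The tunnel itself is one command. A minimal sketch, assuming the instance runs Ollama on its default port, you have SSH access, and nothing local is already bound to 11434 (replace user and host with your instance's details):
# Forward local port 11434 to the Ollama server on the rented instance
ssh -N -L 11434:localhost:11434 user@INSTANCE_IP
# In another terminal, talk to it exactly like a local install:
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:70b", "prompt": "Hello", "stream": false}'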
For a full comparison of GPU cloud platforms and pricing, see Best GPU Cloud Platforms for AI in 2026.
For production multi-agent setups with Ollama, see OpenClaw + Ollama Production Config.
Summary
Ollama makes local LLM deployment straightforward. The real decision points are:
1. Model size vs hardware: 7B on 8GB VRAM is the most practical starting point
2. Interface: Add Open WebUI if you want a browser UI; use the API directly for programmatic access
3. Quantization: Q4_K_M for most use cases, Q8 if you have VRAM headroom and care about output quality
4. GPU vs CPU: GPU for anything interactive; CPU-only works for small models or batch jobs
Start with ollama pull llama3.1:8b and ollama run llama3.1:8b. The rest follows from there.
*Disclosure: This article contains affiliate links. ToolHalla may earn a commission at no extra cost to you. We only recommend hardware we'd actually use.*
Frequently Asked Questions
How does Ollama simplify the process of running LLMs locally?
Ollama simplifies the process by wrapping model download, quantization, and a REST API into a single binary, allowing users to install it, pull a model, and start inference in under five minutes.
What are the hardware requirements for running larger LLMs like Llama 3.1 70B?
For running larger models such as Llama 3.1 70B, you need a GPU with 40GB+ VRAM and at least 64GB of RAM.
Is it possible to run LLMs locally without a GPU?
Yes, it is possible to run smaller models like Phi-3.5 or Llama 3.2 3B locally without a GPU, though performance will be significantly slower.
How does running LLMs locally compare in terms of cost to using cloud APIs?
Running LLMs locally is more cost-effective after the initial hardware investment, as the marginal cost for each token is zero, unlike cloud APIs which charge per token.
What are some alternatives to Ollama for running LLMs locally?
Alternatives to Ollama include GUI launchers like LM Studio, Jan, and GPT4All (compared in the guide linked above), as well as lower-level stacks like llama.cpp, vLLM, and Hugging Face's Transformers library, though these often require more manual setup and configuration.