
How to Run DeepSeek R1 Locally: Complete Setup Guide (2026)


March 16, 2026 · 7 min read · 1,540 words


DeepSeek R1 is the most capable open-source reasoning model available. Its chain-of-thought approach — where the model explicitly shows its thinking before answering — beats GPT-4o on math, science, and coding benchmarks. And unlike closed-source alternatives, you can run it on your own hardware.

The full R1 is a 671B Mixture-of-Experts model (37B active parameters per inference). You can't run that on a consumer GPU. But DeepSeek released six distilled versions — from 1.5B to 70B — that capture R1's reasoning ability in sizes that run on everything from a laptop to a single desktop GPU.

Here's how to get it running in under 5 minutes.

Quick Start: Ollama (Fastest Path)

If you just want DeepSeek R1 running immediately:

# Install Ollama (if not already installed)
curl -fsSL https://ollama.com/install.sh | sh

# Pull the 14B distill and start chatting
ollama run deepseek-r1:14b

That's it. Ollama downloads the model, configures everything, and drops you into a chat. You'll see R1's characteristic <think> tags where it reasons through problems before answering.

For more on Ollama setup and other great models to run, see our Best Ollama Models 2026 guide.
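Ollama also exposes a local HTTP API on port 11434, which is handy once you want R1 in scripts rather than a chat window. A minimal sketch, assuming `ollama serve` is running and the 14B model is pulled (the `build_request` helper is ours, not part of Ollama):

```python
import json

# Ollama's local generate endpoint (default port)
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt: str, model: str = "deepseek-r1:14b") -> bytes:
    """Build the JSON body for a non-streaming /api/generate call."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a token stream
    }).encode("utf-8")

# To actually send it (requires a running Ollama server):
# import urllib.request
# req = urllib.request.Request(OLLAMA_URL, data=build_request("Why is the sky blue?"),
#                              headers={"Content-Type": "application/json"})
# print(json.loads(urllib.request.urlopen(req).read())["response"])
```

Note that the `response` field comes back with the `<think>` block included as plain text, so downstream code should expect it.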

Which Size Should You Run?

DeepSeek released six distilled models. Each is a full model (not a quantized version of R1) — they were trained by distilling R1's reasoning capabilities into smaller architectures based on Qwen 2.5 and Llama 3.

| Model | VRAM (Q4) | VRAM (FP16) | AIME 2024 | MATH-500 | Speed* | Ollama Command |
|---|---|---|---|---|---|---|
| R1-Distill 1.5B | ~2GB | ~3GB | 28.9% | 83.9% | ⚡⚡⚡⚡⚡ | ollama run deepseek-r1:1.5b |
| R1-Distill 7B | ~5GB | ~14GB | 55.5% | 92.9% | ⚡⚡⚡⚡ | ollama run deepseek-r1:7b |
| R1-Distill 8B | ~5GB | ~16GB | 50.4% | 89.1% | ⚡⚡⚡⚡ | ollama run deepseek-r1:8b |
| R1-Distill 14B | ~9GB | ~28GB | 69.7% | 93.9% | ⚡⚡⚡ | ollama run deepseek-r1:14b |
| R1-Distill 32B | ~20GB | ~64GB | 72.6% | 94.3% | ⚡⚡ | ollama run deepseek-r1:32b |
| R1-Distill 70B | ~40GB | ~140GB | 79.8% | 94.5% | ⚡ | ollama run deepseek-r1:70b |

*Speed relative to each other on equivalent hardware. Q4 = 4-bit quantization (default in Ollama).
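The VRAM figures in the table follow a simple rule of thumb: parameter count times bytes per weight, plus some headroom for the KV cache and activations. A rough estimator (the 20% overhead factor is our assumption, not a published formula):

```python
def vram_gb(params_b: float, bits: int, overhead: float = 1.2) -> float:
    """Rough VRAM need in GB: billions of params * bytes per weight * overhead."""
    return params_b * bits / 8 * overhead

# 14B at 4-bit: ~8.4 GB, in line with the ~9GB listed above
print(round(vram_gb(14, 4), 1))  # 8.4
```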

Our Recommendations

  • 8GB VRAM (RTX 4060, laptop GPUs): Run the 7B model. Solid reasoning, fast enough for interactive use.
  • 12-16GB VRAM (RTX 4070 Ti, M2/M3 Pro Mac): Run the 14B model. This is the sweet spot — AIME score jumps 14 points from 7B to 14B, the biggest quality leap in the lineup.
  • 24GB VRAM (RTX 4090, RTX 3090): Run the 32B model. Near-70B quality in a single-GPU package.
  • 48GB+ (dual GPUs, Mac Studio, A100): Run the 70B model for maximum reasoning quality.

GPU Buying Guide for DeepSeek R1

If you're buying a GPU specifically for running DeepSeek R1 (and other local LLMs), here's what we recommend:

Best Value: RTX 4090 (24GB) — Run 32B

The NVIDIA RTX 4090 remains the best consumer GPU for local AI. 24GB VRAM handles the 32B distill at Q4 with room to spare, and still runs every other popular model (Qwen 3, Gemma 3, Llama 4 Scout). Fast enough for interactive use with R1's chain-of-thought output.

Future-Proof: RTX 5090 (32GB) — Run 32B at Higher Quality

The NVIDIA RTX 5090 adds 8GB over the 4090, which means you can run the 32B at Q5 or Q6 quantization (better quality) or handle longer context windows. If you're buying new in 2026, the extra headroom is worth it.

Budget: RTX 3090 (24GB) — Still Great

Used RTX 3090 cards offer 24GB VRAM at roughly half the price of a 4090. Inference is ~40% slower, but for the 14B model that still means 30+ tokens/sec — perfectly fine for interactive use. The best budget option for serious local AI.

Mac Users: Apple Silicon

If you're on Mac, Ollama runs natively on Apple Silicon. The unified memory architecture means:

  • M2/M3/M4 Pro (18-36GB) → 14B comfortably, 32B tight
  • M4 Max (48-128GB) → 32B comfortably, 70B possible
  • Mac Studio M4 Ultra (192GB) → full 70B at a higher-precision quant (Q6/Q8)

Speed is slower than NVIDIA (no CUDA), but the massive unified memory pool means you can run larger models than any single consumer GPU.

Disclosure: GPU links are Amazon affiliate links. We earn a commission at no extra cost to you.

For the full GPU comparison, see our Best GPU for AI 2026 guide.

Understanding DeepSeek R1's Chain-of-Thought

What makes R1 different from other models is the visible reasoning. When you ask it a question, you'll see output like this:

<think>

The user is asking me to solve a probability problem. Let me break this down:

  • There are 52 cards in a deck
  • We need to find P(two aces in a row)
  • First draw: 4/52 = 1/13
  • Second draw (given first was ace): 3/51 = 1/17
  • Combined: (4/52) × (3/51) = 12/2652 = 1/221

Let me verify this is correct...

</think>

The probability of drawing two aces in a row from a standard deck is 1/221 (approximately 0.45%).
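That arithmetic is easy to sanity-check outside the model:

```python
from fractions import Fraction

# P(first ace) * P(second ace | first ace)
p = Fraction(4, 52) * Fraction(3, 51)
print(p)         # 1/221
print(float(p))  # ≈ 0.0045, i.e. about 0.45%
```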

This isn't just for show — the thinking process actually improves accuracy. On AIME 2024 (competition math), the 14B distill scores 69.7% with chain-of-thought enabled versus ~45% without.

Tip: If you want just the final answer (faster, fewer tokens), you can instruct the model: "Answer directly without showing your reasoning."
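If you consume R1's output programmatically, the other option is to strip the reasoning in post-processing. A minimal sketch (`split_reasoning` is our helper, not part of Ollama):

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Split R1 output into (thinking, answer) on the <think>...</think> block."""
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not m:
        return "", text.strip()
    return m.group(1).strip(), text[m.end():].strip()

sample = "<think>\n4/52 * 3/51 = 1/221\n</think>\nThe probability is 1/221."
thinking, answer = split_reasoning(sample)
print(answer)  # The probability is 1/221.
```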

Advanced Setup: vLLM for Production

If you're serving DeepSeek R1 to multiple users or building an API:

pip install vllm

# Serve the 14B distill with an OpenAI-compatible API
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-14B \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 32768

vLLM delivers 19x higher throughput than Ollama for concurrent requests. Use Ollama for personal use, vLLM for anything serving multiple users.
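vLLM speaks the OpenAI chat-completions protocol, so any OpenAI-compatible client can talk to it. A sketch of the request payload, assuming the vllm serve command above is running on its default port 8000:

```python
import json

VLLM_URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",
    "messages": [{"role": "user", "content": "What is 17 * 23?"}],
    "temperature": 0.6,  # low temperature keeps the reasoning on track
    "max_tokens": 2048,  # leave room for the <think> block
}

# To send (requires the server above):
# import urllib.request
# req = urllib.request.Request(VLLM_URL, data=json.dumps(payload).encode(),
#                              headers={"Content-Type": "application/json"})
# reply = json.loads(urllib.request.urlopen(req).read())
# print(reply["choices"][0]["message"]["content"])
```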

R1 vs Alternatives: When to Use What

| Task | Best Choice | Why |
|---|---|---|
| Math/science reasoning | DeepSeek R1 | Chain-of-thought dominates benchmarks |
| General coding | Qwen 3 14B or Qwen 3 Coder | Faster, broader coding support |
| General chat | Qwen 3.5 or Gemma 3 | Better instruction-following |
| Creative writing | Llama 4 Scout | More natural, less robotic |
| Logic puzzles | DeepSeek R1 | Reasoning traces catch errors |
| Quick Q&A | Qwen 3 8B or Gemma 3 4B | Faster; R1's thinking adds latency |

R1 isn't the best model for everything — the chain-of-thought process adds latency, and for simple tasks that overhead doesn't help. Use R1 when you need the model to actually reason through a problem, and faster models for everything else.

See our Open Source LLM Leaderboard for the full comparison.

Performance Tips

1. Use the right quantization.

Ollama defaults to Q4_K_M, which is the best balance of speed and quality. If you have VRAM headroom, pull a higher quant:

# Higher quality (needs more VRAM)
ollama run deepseek-r1:14b-q5_K_M
ollama run deepseek-r1:14b-q8_0

For more on how quantization works, see our What is Quantization guide.

2. Set context length wisely.

R1's chain-of-thought uses tokens. A problem that produces a 200-token answer might use 800+ tokens of thinking. Default context of 4096 can run out on complex problems. Increase it:

# Inside an interactive ollama run session:
/set parameter num_ctx 8192

In API calls, set "options": {"num_ctx": 8192} instead.
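To budget context, the roughly 4:1 thinking-to-answer ratio above gives a quick estimate (the ratio is a rough observation, not a DeepSeek specification):

```python
def needed_ctx(prompt_tokens: int, answer_tokens: int, think_ratio: int = 4) -> int:
    """Estimate tokens needed: prompt, answer, plus ~4x the answer for thinking."""
    return prompt_tokens + answer_tokens * (1 + think_ratio)

# A 500-token prompt with a 200-token answer already needs ~1500 tokens,
# before any follow-up turns accumulate in the window.
print(needed_ctx(500, 200))  # 1500
```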

3. GPU offloading on low-VRAM systems.

If the model doesn't fully fit in VRAM, Ollama automatically spills to CPU RAM. This works but is 5-10x slower for the spilled layers. Either:

  • Use a smaller model that fits entirely in VRAM
  • Add more VRAM (see the GPU recommendations above)
  • Accept the speed penalty for occasional heavy reasoning tasks

4. Temperature for reasoning.

Keep temperature at 0.6 or lower for reasoning tasks. Higher temperatures introduce randomness into the chain-of-thought, which can derail multi-step logic:

# In Ollama Modelfile or API call

PARAMETER temperature 0.6
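That PARAMETER line lives in an Ollama Modelfile. A minimal example that bakes the setting into a reusable model (the name my-r1 is arbitrary):

```
FROM deepseek-r1:14b
PARAMETER temperature 0.6
```

Build it with ollama create my-r1 -f Modelfile, then ollama run my-r1 from then on.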

The Bottom Line

DeepSeek R1 is the go-to model for reasoning tasks you want to run locally. The 14B distill on a 12-16GB GPU gives you math and science capabilities that rival GPT-4o, without sending your data anywhere.

Quickstart recap:

1. Install Ollama: curl -fsSL https://ollama.com/install.sh | sh

2. Pull the model: ollama run deepseek-r1:14b

3. Start asking hard questions

For the full model landscape, see our Best Ollama Models 2026. For GPU recommendations, see our Best GPU for AI 2026 guide.


Related: Best Ollama Models 2026 | vLLM vs Ollama vs TGI | Open Source LLM Leaderboard 2026 | Best GPU for AI 2026 | What is Quantization?


FAQ

What is the easiest way to run DeepSeek R1 locally?

Ollama is easiest: ollama pull deepseek-r1:7b then ollama run deepseek-r1:7b. Most users run the 7B, 14B, or 32B distilled variants — the full 671B requires ~400GB VRAM across multiple GPUs.

How much VRAM do I need for DeepSeek R1?

R1 Distill 7B Q4: ~5GB. 14B Q4: ~9GB. 32B Q4: ~20GB. 70B Q4: ~40GB. Full R1 671B: 8× H100 80GB. Distilled models retain most of the reasoning ability of the full model.

What is DeepSeek R1 best at?

DeepSeek R1 excels at multi-step reasoning, math, and code generation. It uses chain-of-thought 'thinking' tokens before answering, significantly improving accuracy on complex problems. Less suited for casual conversation.

Is DeepSeek R1 safe to run locally?

Yes — local inference means prompts never leave your machine. Released under MIT license. The safety training is somewhat less restrictive than OpenAI/Anthropic models by design, which is an advantage for technical use cases.

Can I use DeepSeek R1 via API instead of running locally?

Yes — DeepSeek's API at api.deepseek.com charges $0.14/M input tokens for R1. OpenRouter also hosts it. For occasional use, the API is cheaper than local electricity costs.

