
How to Run DeepSeek R1 Locally: Complete Setup Guide (2026)


March 16, 2026 · 7 min read · 1,540 words


DeepSeek R1 is the most capable open-source reasoning model available. Its chain-of-thought approach — where the model explicitly shows its thinking before answering — beats GPT-4o on math, science, and coding benchmarks. And unlike closed-source alternatives, you can run it on your own hardware.

The full R1 is a 671B Mixture-of-Experts model (37B active parameters per inference). You can't run that on a consumer GPU. But DeepSeek released six distilled versions — from 1.5B to 70B — that capture R1's reasoning ability in sizes that run on everything from a laptop to a single desktop GPU.

Here's how to get it running in under 5 minutes.

Quick Start: Ollama (Fastest Path)

If you just want DeepSeek R1 running immediately:

# Install Ollama (if not already installed)
curl -fsSL https://ollama.com/install.sh | sh

# Pull the 14B distill and start chatting
ollama run deepseek-r1:14b

That's it. Ollama downloads the model, configures everything, and drops you into a chat. You'll see R1's characteristic <think> tags where it reasons through problems before answering.

For more on Ollama setup and other great models to run, see our Best Ollama Models 2026 guide.
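Ollama also exposes a local HTTP API on port 11434, which is handy once you want R1 in scripts rather than a chat window. A minimal sketch, assuming `ollama serve` is running and the 14B model is pulled (the `build_request` helper is ours, not part of Ollama):

```python
import json

# Ollama's local generate endpoint (default port)
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt: str, model: str = "deepseek-r1:14b") -> bytes:
    """Build the JSON body for a non-streaming /api/generate call."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a token stream
    }).encode("utf-8")

# To actually send it (requires a running Ollama server):
# import urllib.request
# req = urllib.request.Request(OLLAMA_URL, data=build_request("Why is the sky blue?"),
#                              headers={"Content-Type": "application/json"})
# print(json.loads(urllib.request.urlopen(req).read())["response"])
```

Note that the `response` field comes back with the `<think>` block included as plain text, so downstream code should expect it.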

Which Size Should You Run?

DeepSeek released six distilled models. Each is a full model (not a quantized version of R1) — they were trained by distilling R1's reasoning capabilities into smaller architectures based on Qwen 2.5 and Llama 3.

| Model | VRAM (Q4) | VRAM (FP16) | AIME 2024 | MATH-500 | Speed* | Ollama Command |
|---|---|---|---|---|---|---|
| R1-Distill 1.5B | ~2GB | ~3GB | 28.9% | 83.9% | ⚡⚡⚡⚡⚡ | ollama run deepseek-r1:1.5b |
| R1-Distill 7B | ~5GB | ~14GB | 55.5% | 92.9% | ⚡⚡⚡⚡ | ollama run deepseek-r1:7b |
| R1-Distill 8B | ~5GB | ~16GB | 50.4% | 89.1% | ⚡⚡⚡⚡ | ollama run deepseek-r1:8b |
| R1-Distill 14B | ~9GB | ~28GB | 69.7% | 93.9% | ⚡⚡⚡ | ollama run deepseek-r1:14b |
| R1-Distill 32B | ~20GB | ~64GB | 72.6% | 94.3% | ⚡⚡ | ollama run deepseek-r1:32b |
| R1-Distill 70B | ~40GB | ~140GB | 79.8% | 94.5% | ⚡ | ollama run deepseek-r1:70b |

*Speed relative to each other on equivalent hardware. Q4 = 4-bit quantization (default in Ollama).
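The VRAM figures in the table follow a simple rule of thumb: parameter count times bytes per weight, plus some headroom for the KV cache and activations. A rough estimator (the 20% overhead factor is our assumption, not a published formula):

```python
def vram_gb(params_b: float, bits: int, overhead: float = 1.2) -> float:
    """Rough VRAM need in GB: billions of params * bytes per weight * overhead."""
    return params_b * bits / 8 * overhead

# 14B at 4-bit: ~8.4 GB, in line with the ~9GB listed above
print(round(vram_gb(14, 4), 1))  # 8.4
```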

Our Recommendations

  • 8GB VRAM (RTX 4060, laptop GPUs): Run the 7B model. Solid reasoning, fast enough for interactive use.
  • 12-16GB VRAM (RTX 4070 Ti, M2/M3 Pro Mac): Run the 14B model. This is the sweet spot — AIME score jumps 14 points from 7B to 14B, the biggest quality leap in the lineup.
  • 24GB VRAM (RTX 4090, RTX 3090): Run the 32B model. Near-70B quality in a single-GPU package.
  • 48GB+ (dual GPUs, Mac Studio, A100): Run the 70B model for maximum reasoning quality.

GPU Buying Guide for DeepSeek R1

If you're buying a GPU specifically for running DeepSeek R1 (and other local LLMs), here's what we recommend:

Best Value: RTX 4090 (24GB) — Run 32B

The NVIDIA RTX 4090 remains the best consumer GPU for local AI. 24GB VRAM handles the 32B distill at Q4 with room to spare, and still runs every other popular model (Qwen 3, Gemma 3, Llama 4 Scout). Fast enough for interactive use with R1's chain-of-thought output.

Future-Proof: RTX 5090 (32GB) — Run 32B at Higher Quality

The NVIDIA RTX 5090 adds 8GB over the 4090, which means you can run the 32B at Q5 or Q6 quantization (better quality) or handle longer context windows. If you're buying new in 2026, the extra headroom is worth it.

Budget: RTX 3090 (24GB) — Still Great

Used RTX 3090 cards offer 24GB VRAM at roughly half the price of a 4090. Inference is ~40% slower, but for the 14B model that still means 30+ tokens/sec — perfectly fine for interactive use. The best budget option for serious local AI.

Mac Users: Apple Silicon

If you're on Mac, Ollama runs natively on Apple Silicon. The unified memory architecture means:

  • M2/M3/M4 Pro (18-36GB) → 14B comfortably, 32B tight
  • M4 Max (48-128GB) → 32B comfortably, 70B possible
  • Mac Studio M4 Ultra (192GB) → full 70B at a higher-precision quant (Q6/Q8)

Speed is slower than NVIDIA (no CUDA), but the massive unified memory pool means you can run larger models than any single consumer GPU.

Disclosure: GPU links are Amazon affiliate links. We earn a commission at no extra cost to you.

For the full GPU comparison, see our Best GPU for AI 2026 guide.

Understanding DeepSeek R1's Chain-of-Thought

What makes R1 different from other models is the visible reasoning. When you ask it a question, you'll see output like this:

<think>

The user is asking me to solve a probability problem. Let me break this down:

  • There are 52 cards in a deck
  • We need to find P(two aces in a row)
  • First draw: 4/52 = 1/13
  • Second draw (given first was ace): 3/51 = 1/17
  • Combined: (4/52) × (3/51) = 12/2652 = 1/221

Let me verify this is correct...

</think>

The probability of drawing two aces in a row from a standard deck is 1/221 (approximately 0.45%).
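That arithmetic is easy to sanity-check outside the model:

```python
from fractions import Fraction

# P(first ace) * P(second ace | first ace)
p = Fraction(4, 52) * Fraction(3, 51)
print(p)         # 1/221
print(float(p))  # ≈ 0.0045, i.e. about 0.45%
```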

This isn't just for show — the thinking process actually improves accuracy. On AIME 2024 (competition math), the 14B distill scores 69.7% with chain-of-thought enabled versus ~45% without.

Tip: If you want just the final answer (faster, fewer tokens), you can instruct the model: "Answer directly without showing your reasoning."
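If you consume R1's output programmatically, the other option is to strip the reasoning in post-processing. A minimal sketch (`split_reasoning` is our helper, not part of Ollama):

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Split R1 output into (thinking, answer) on the <think>...</think> block."""
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not m:
        return "", text.strip()
    return m.group(1).strip(), text[m.end():].strip()

sample = "<think>\n4/52 * 3/51 = 1/221\n</think>\nThe probability is 1/221."
thinking, answer = split_reasoning(sample)
print(answer)  # The probability is 1/221.
```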

Advanced Setup: vLLM for Production

If you're serving DeepSeek R1 to multiple users or building an API:

pip install vllm

# Serve the 14B distill with an OpenAI-compatible API
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-14B \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 32768

vLLM delivers 19x higher throughput than Ollama for concurrent requests. Use Ollama for personal use, vLLM for anything serving multiple users.
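vLLM speaks the OpenAI chat-completions protocol, so any OpenAI-compatible client can talk to it. A sketch of the request payload, assuming the vllm serve command above is running on its default port 8000:

```python
import json

VLLM_URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",
    "messages": [{"role": "user", "content": "What is 17 * 23?"}],
    "temperature": 0.6,  # low temperature keeps the reasoning on track
    "max_tokens": 2048,  # leave room for the <think> block
}

# To send (requires the server above):
# import urllib.request
# req = urllib.request.Request(VLLM_URL, data=json.dumps(payload).encode(),
#                              headers={"Content-Type": "application/json"})
# reply = json.loads(urllib.request.urlopen(req).read())
# print(reply["choices"][0]["message"]["content"])
```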

R1 vs Alternatives: When to Use What

| Task | Best Choice | Why |
|---|---|---|
| Math/science reasoning | DeepSeek R1 | Chain-of-thought dominates benchmarks |
| General coding | Qwen 3 14B or Qwen 3 Coder | Faster, broader coding support |
| General chat | Qwen 3.5 or Gemma 3 | Better instruction-following |
| Creative writing | Llama 4 Scout | More natural, less robotic |
| Logic puzzles | DeepSeek R1 | Reasoning traces catch errors |
| Quick Q&A | Qwen 3 8B or Gemma 3 4B | Faster; R1's thinking adds latency |

R1 isn't the best model for everything — the chain-of-thought process adds latency, and for simple tasks that overhead doesn't help. Use R1 when you need the model to actually reason through a problem, and faster models for everything else.

See our Open Source LLM Leaderboard for the full comparison.

Performance Tips

1. Use the right quantization.

Ollama defaults to Q4_K_M, which is the best balance of speed and quality. If you have VRAM headroom, pull a higher quant:

# Higher quality (needs more VRAM)
ollama run deepseek-r1:14b-q5_K_M
ollama run deepseek-r1:14b-q8_0

For more on how quantization works, see our What is Quantization guide.

2. Set context length wisely.

R1's chain-of-thought uses tokens. A problem that produces a 200-token answer might use 800+ tokens of thinking. Default context of 4096 can run out on complex problems. Increase it:

# Inside an interactive ollama run session:
/set parameter num_ctx 8192

In API calls, set "options": {"num_ctx": 8192} instead.
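To budget context, the roughly 4:1 thinking-to-answer ratio above gives a quick estimate (the ratio is a rough observation, not a DeepSeek specification):

```python
def needed_ctx(prompt_tokens: int, answer_tokens: int, think_ratio: int = 4) -> int:
    """Estimate tokens needed: prompt, answer, plus ~4x the answer for thinking."""
    return prompt_tokens + answer_tokens * (1 + think_ratio)

# A 500-token prompt with a 200-token answer already needs ~1500 tokens,
# before any follow-up turns accumulate in the window.
print(needed_ctx(500, 200))  # 1500
```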

3. GPU offloading on low-VRAM systems.

If the model doesn't fully fit in VRAM, Ollama automatically spills to CPU RAM. This works but is 5-10x slower for the spilled layers. Either:

  • Use a smaller model that fits entirely in VRAM
  • Add more VRAM (see the GPU recommendations above)
  • Accept the speed penalty for occasional heavy reasoning tasks

4. Temperature for reasoning.

Keep temperature at 0.6 or lower for reasoning tasks. Higher temperatures introduce randomness into the chain-of-thought, which can derail multi-step logic:

# In Ollama Modelfile or API call

PARAMETER temperature 0.6
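That PARAMETER line lives in an Ollama Modelfile. A minimal example that bakes the setting into a reusable model (the name my-r1 is arbitrary):

```
FROM deepseek-r1:14b
PARAMETER temperature 0.6
```

Build it with ollama create my-r1 -f Modelfile, then ollama run my-r1 from then on.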

The Bottom Line

DeepSeek R1 is the go-to model for reasoning tasks you want to run locally. The 14B distill on a 12-16GB GPU gives you math and science capabilities that rival GPT-4o, without sending your data anywhere.

Quickstart recap:

1. Install Ollama: curl -fsSL https://ollama.com/install.sh | sh

2. Pull the model: ollama run deepseek-r1:14b

3. Start asking hard questions

For the full model landscape, see our Best Ollama Models 2026. For GPU recommendations, see our Best GPU for AI 2026 guide.


Related: Best Ollama Models 2026 | vLLM vs Ollama vs TGI | Open Source LLM Leaderboard 2026 | Best GPU for AI 2026 | What is Quantization?


FAQ

What is the easiest way to run DeepSeek R1 locally?

Ollama is easiest: ollama pull deepseek-r1:7b then ollama run deepseek-r1:7b. Most users run the 7B, 14B, or 32B distilled variants — the full 671B requires ~400GB VRAM across multiple GPUs.

How much VRAM do I need for DeepSeek R1?

R1 Distill 7B Q4: ~5GB. 14B Q4: ~9GB. 32B Q4: ~20GB. 70B Q4: ~40GB. Full R1 671B: 8× H100 80GB. Distilled models retain most of the reasoning ability of the full model.

What is DeepSeek R1 best at?

DeepSeek R1 excels at multi-step reasoning, math, and code generation. It uses chain-of-thought 'thinking' tokens before answering, significantly improving accuracy on complex problems. Less suited for casual conversation.

Is DeepSeek R1 safe to run locally?

Yes — local inference means prompts never leave your machine. Released under MIT license. The safety training is somewhat less restrictive than OpenAI/Anthropic models by design, which is an advantage for technical use cases.

Can I use DeepSeek R1 via API instead of running locally?

Yes — DeepSeek's API at api.deepseek.com charges $0.14/M input tokens for R1. OpenRouter also hosts it. For occasional use, the API is cheaper than local electricity costs.

