How to Run DeepSeek R1 Locally: Complete Setup Guide (2026)
DeepSeek R1 is the most capable open-source reasoning model available. Its chain-of-thought approach — where the model explicitly shows its thinking before answering — beats GPT-4o on math, science, and coding benchmarks. And unlike closed-source alternatives, you can run it on your own hardware.
The full R1 is a 671B Mixture-of-Experts model (37B active parameters per inference). You can't run that on a consumer GPU. But DeepSeek released six distilled versions — from 1.5B to 70B — that capture R1's reasoning ability in sizes that run on everything from a laptop to a single desktop GPU.
Here's how to get it running in under 5 minutes.
Quick Start: Ollama (Fastest Path)
If you just want DeepSeek R1 running immediately:
# Install Ollama (if not already installed)
curl -fsSL https://ollama.com/install.sh | sh
ollama run deepseek-r1:14b
That's it. Ollama downloads the model, configures everything, and drops you into a chat. You'll see R1's characteristic <think> tags where it reasons through problems before answering.
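Ollama also exposes a local HTTP API on port 11434, which is handy for scripting. A minimal sketch using the documented /api/generate endpoint (the prompt is just an example):
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:14b",
  "prompt": "What is 17 * 23?",
  "stream": false
}'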
For more on Ollama setup and other great models to run, see our Best Ollama Models 2026 guide.
Which Size Should You Run?
DeepSeek released six distilled models. Each is a full model (not a quantized version of R1) — they were trained by distilling R1's reasoning capabilities into smaller architectures based on Qwen 2.5 and Llama 3.
| Model | VRAM (Q4) | VRAM (FP16) | AIME 2024 | MATH-500 | Speed* | Ollama Command |
|---|---|---|---|---|---|---|
| R1-Distill 1.5B | ~2GB | ~3GB | 28.9% | 83.9% | ⚡⚡⚡⚡⚡ | ollama run deepseek-r1:1.5b |
| R1-Distill 7B | ~5GB | ~14GB | 55.5% | 92.9% | ⚡⚡⚡⚡ | ollama run deepseek-r1:7b |
| R1-Distill 8B | ~5GB | ~16GB | 50.4% | 89.1% | ⚡⚡⚡⚡ | ollama run deepseek-r1:8b |
| R1-Distill 14B | ~9GB | ~28GB | 69.7% | 93.9% | ⚡⚡⚡ | ollama run deepseek-r1:14b |
| R1-Distill 32B | ~20GB | ~64GB | 72.6% | 94.3% | ⚡⚡ | ollama run deepseek-r1:32b |
| R1-Distill 70B | ~40GB | ~140GB | 70.0% | 94.5% | ⚡ | ollama run deepseek-r1:70b |
*Speed relative to each other on equivalent hardware. Q4 = 4-bit quantization (default in Ollama).
Our Recommendations
- 8GB VRAM (RTX 4060, laptop GPUs): Run the 7B model. Solid reasoning, fast enough for interactive use.
- 12-16GB VRAM (RTX 4070 Ti, M2/M3 Pro Mac): Run the 14B model. This is the sweet spot — AIME score jumps 14 points from 7B to 14B, the biggest quality leap in the lineup.
- 24GB VRAM (RTX 4090, RTX 3090): Run the 32B model. Near-70B quality in a single-GPU package.
- 48GB+ (dual GPUs, Mac Studio, A100): Run the 70B model for maximum reasoning quality.
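Not sure how much VRAM you have? On NVIDIA systems, a quick check before pulling a model (nvidia-smi ships with the driver):
# Report card name plus total and free VRAM in MiB
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv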
GPU Buying Guide for DeepSeek R1
If you're buying a GPU specifically for running DeepSeek R1 (and other local LLMs), here's what we recommend:
Best Value: RTX 4090 (24GB) — Run 32B
The NVIDIA RTX 4090 remains the best consumer GPU for local AI. 24GB VRAM handles the 32B distill at Q4 with room to spare, and still runs every other popular model (Qwen 3, Gemma 3, Llama 4 Scout). Fast enough for interactive use with R1's chain-of-thought output.
Future-Proof: RTX 5090 (32GB) — Run 32B at Higher Quality
The NVIDIA RTX 5090 adds 8GB over the 4090, which means you can run the 32B at Q5 or Q6 quantization (better quality) or handle longer context windows. If you're buying new in 2026, the extra headroom is worth it.
Budget: RTX 3090 (24GB) — Still Great
Used RTX 3090 cards offer 24GB VRAM at roughly half the price of a 4090. Inference is ~40% slower, but for the 14B model that still means 30+ tokens/sec — perfectly fine for interactive use. The best budget option for serious local AI.
Mac Users: Apple Silicon
If you're on Mac, Ollama runs natively on Apple Silicon. The unified memory architecture means:
- M2/M3/M4 Pro (18-36GB) → 14B comfortably, 32B tight
- M4 Max (48-128GB) → 32B comfortably, 70B possible
- Mac Studio M4 Ultra (192GB) → Full 70B at high-precision quants, even FP16 (~140GB)
Speed is slower than NVIDIA (no CUDA), but the massive unified memory pool means you can run larger models than any single consumer GPU.
Disclosure: GPU links are Amazon affiliate links. We earn a commission at no extra cost to you.
For the full GPU comparison, see our Best GPU for AI 2026 guide.
Understanding DeepSeek R1's Chain-of-Thought
What makes R1 different from other models is the visible reasoning. When you ask it a question, you'll see output like this:
<think>
The user is asking me to solve a probability problem. Let me break this down:
- There are 52 cards in a deck
- We need to find P(two aces in a row)
- First draw: 4/52 = 1/13
- Second draw (given first was ace): 3/51 = 1/17
- Combined: (4/52) × (3/51) = 12/2652 = 1/221
Let me verify this is correct...
</think>
The probability of drawing two aces in a row from a standard deck is 1/221 (approximately 0.45%).
This isn't just for show — the thinking process actually improves accuracy. On AIME 2024 (competition math), the 14B distill scores 69.7% with chain-of-thought enabled versus ~45% without.
Tip: If you want just the final answer (faster, fewer tokens), you can instruct the model: "Answer directly without showing your reasoning."
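If you're scripting R1 and want to discard the reasoning block entirely, another option is to strip it after generation. A minimal sketch with sed, assuming the <think> tags arrive on their own lines as Ollama typically emits them:
# Run a one-shot prompt and delete everything between the think tags
ollama run deepseek-r1:14b "What is the probability of drawing two aces in a row?" | \
  sed '/<think>/,/<\/think>/d'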
Advanced Setup: vLLM for Production
If you're serving DeepSeek R1 to multiple users or building an API:
pip install vllm
# Serve the 14B distill with an OpenAI-compatible API
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-14B \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.9 \
--max-model-len 32768
vLLM delivers 19x higher throughput than Ollama for concurrent requests. Use Ollama for personal use, vLLM for anything serving multiple users.
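Once the server is up (vLLM defaults to port 8000), any OpenAI-compatible client can talk to it. A quick smoke test with curl:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",
    "messages": [{"role": "user", "content": "Solve: 2x + 6 = 20"}],
    "temperature": 0.6
  }'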
R1 vs Alternatives: When to Use What
| Task | Best Choice | Why |
|---|---|---|
| Math/science reasoning | DeepSeek R1 | Chain-of-thought dominates benchmarks |
| General coding | Qwen 3 14B or Qwen 3 Coder | Faster, broader coding support |
| General chat | Qwen 3.5 or Gemma 3 | Better instruction-following |
| Creative writing | Llama 4 Scout | More natural, less robotic |
| Logic puzzles | DeepSeek R1 | Reasoning traces catch errors |
| Quick Q&A | Qwen 3 8B or Gemma 3 4B | Faster, R1's thinking adds latency |
R1 isn't the best model for everything — the chain-of-thought process adds latency, and for simple tasks that overhead doesn't help. Use R1 when you need the model to actually reason through a problem, and faster models for everything else.
See our Open Source LLM Leaderboard for the full comparison.
Performance Tips
1. Use the right quantization.
Ollama defaults to Q4_K_M, which is the best balance of speed and quality. If you have VRAM headroom, pull a higher quant:
# Higher quality (needs more VRAM); see ollama.com/library/deepseek-r1 for the full tag list
ollama run deepseek-r1:14b-qwen-distill-q8_0
ollama run deepseek-r1:14b-qwen-distill-fp16
For more on how quantization works, see our What is Quantization guide.
2. Set context length wisely.
R1's chain-of-thought uses tokens. A problem that produces a 200-token answer might use 800+ tokens of thinking. Default context of 4096 can run out on complex problems. Increase it:
# Start the model, then raise the context window inside the chat session
ollama run deepseek-r1:14b
>>> /set parameter num_ctx 8192
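If you're hitting the HTTP API instead, the context window goes in the options object per Ollama's API docs:
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:14b",
  "prompt": "...",
  "options": {"num_ctx": 8192},
  "stream": false
}'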
3. GPU offloading on low-VRAM systems.
If the model doesn't fully fit in VRAM, Ollama automatically spills layers to CPU RAM (you can check the split with ollama ps, shown after this list). This works but is 5-10x slower for the spilled layers. Either:
- Use a smaller model that fits entirely in VRAM
- Add more VRAM (the GPU recs above)
- Accept the speed penalty for occasional heavy reasoning tasks
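To confirm whether a loaded model has spilled, check the PROCESSOR column:
# Reads "100% GPU" when fully resident, or a CPU/GPU
# split when layers have spilled to system RAM
ollama ps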
4. Temperature for reasoning.
Keep temperature at 0.6 or lower for reasoning tasks. Higher temperatures introduce randomness into the chain-of-thought, which can derail multi-step logic:
# In Ollama Modelfile or API call
PARAMETER temperature 0.6
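To bake these settings in permanently, a sketch of a custom Modelfile (the name r1-tuned is arbitrary):
FROM deepseek-r1:14b
PARAMETER temperature 0.6
PARAMETER num_ctx 8192
# Build and run the tuned variant
ollama create r1-tuned -f Modelfile
ollama run r1-tuned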
The Bottom Line
DeepSeek R1 is the go-to model for reasoning tasks you want to run locally. The 32B distill on a 24GB GPU (or the 14B on a 12GB card) gives you math and science reasoning that rivals GPT-4o, without sending your data anywhere.
Quickstart recap:
1. Install Ollama: curl -fsSL https://ollama.com/install.sh | sh
2. Pull the model: ollama run deepseek-r1:14b
3. Start asking hard questions
For the full model landscape, see our Best Ollama Models 2026. For GPU recommendations, see our Best GPU for AI 2026 guide.
Related: Best Ollama Models 2026 | vLLM vs Ollama vs TGI | Open Source LLM Leaderboard 2026 | Best GPU for AI 2026 | What is Quantization?
Related Articles
- How to Fine-Tune an LLM Locally: Complete Guide (2026)
- Stable Diffusion Setup Guide 2026: Run AI Image Generation Locally
- How to Run LLMs on a Raspberry Pi in 2026: Complete Setup Guide
FAQ
What is the easiest way to run DeepSeek R1 locally?
Ollama is easiest: ollama pull deepseek-r1:7b then ollama run deepseek-r1:7b. Most users run the 7B, 14B, or 32B distilled variants; the full 671B requires ~400GB of VRAM even at 4-bit quantization, spread across multiple GPUs.
How much VRAM do I need for DeepSeek R1?
R1 Distill 7B Q4: ~5GB. 14B Q4: ~9GB. 32B Q4: ~20GB. 70B Q4: ~40GB. Full R1 671B: 8× H100 80GB. Distilled models retain most of the reasoning ability of the full model.
What is DeepSeek R1 best at?
DeepSeek R1 excels at multi-step reasoning, math, and code generation. It uses chain-of-thought 'thinking' tokens before answering, significantly improving accuracy on complex problems. Less suited for casual conversation.
Is DeepSeek R1 safe to run locally?
Yes — local inference means prompts never leave your machine. Released under MIT license. The safety training is somewhat less restrictive than OpenAI/Anthropic models by design, which is an advantage for technical use cases.
Can I use DeepSeek R1 via API instead of running locally?
Yes — DeepSeek's API at api.deepseek.com serves R1 for around $0.14/M input tokens (cache-hit pricing; cache misses cost more). OpenRouter also hosts it. For occasional use, the API is cheaper than local electricity costs.
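The hosted API is OpenAI-compatible, with R1 served as the deepseek-reasoner model. A minimal call, assuming an API key from platform.deepseek.com in $DEEPSEEK_API_KEY:
curl https://api.deepseek.com/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPSEEK_API_KEY" \
  -d '{
    "model": "deepseek-reasoner",
    "messages": [{"role": "user", "content": "How many primes are below 100?"}]
  }'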