Qwen 3.5 Small: Best Open-Source LLM for Running AI on Your Phone
Alibaba's Qwen 3.5 8B outperforms models nine times its size on graduate-level reasoning. An 8-billion-parameter model beating 70B+ models on GPQA Diamond isn't supposed to happen — but the benchmarks are real, and you can run it on a phone with 8GB of RAM. This changes the practical floor for on-device AI. Here's what makes Qwen 3.5 Small different, how it compares to Llama and Gemma at similar sizes, and exactly how to run it on your phone and laptop, much as you would run other LLMs locally with Ollama.
Why Qwen 3.5 Small Matters
Most "small" LLMs trade quality for size — they run on limited hardware but produce noticeably worse output. Qwen 3.5 breaks that pattern. The 8B model matches or exceeds models with 70B+ parameters on specific reasoning benchmarks, while fitting in 5GB of VRAM.
Key facts:
- 8B parameters (also available as 0.6B, 1.7B, 4B, 14B, 32B)
- Apache 2.0 license — fully open, commercial use allowed
- Hybrid thinking mode — toggle /think for chain-of-thought reasoning on hard problems
- MoE variant: 30B-A3B (30B total parameters, only 3B active per token) — explained below
Benchmark Comparison
| Benchmark | Qwen 3.5 8B | Llama 3.3 8B | Gemma 3 9B | Qwen 2.5 72B |
|---|---|---|---|---|
| GPQA Diamond | ~45% | ~33% | ~38% | ~42% |
| MATH-500 | ~82% | ~68% | ~72% | ~80% |
| LiveCodeBench | ~35% | ~25% | ~28% | ~33% |
| Arena ELO (approx) | ~1180 | ~1120 | ~1140 | ~1175 |
*Sources: Qwen team blog, community benchmarks. Scores are approximate — verify against latest papers before citation.*
The standout number: Qwen 3.5 8B scores ~45% on GPQA Diamond (Graduate-Level Google-Proof Q&A), beating the 72B Qwen 2.5 that needs 40GB+ VRAM. It also leads MATH-500 and LiveCodeBench in the sub-10B category. Llama 3.3 8B and Gemma 3 9B are good models — but on reasoning benchmarks, Qwen 3.5 8B has a clear lead. If you're interested in running larger models like the 100B parameter ones on a single CPU, check out Microsoft BitNet for a no-GPU solution.
The Full Qwen 3.5 Lineup
| Model | Parameters | VRAM (Q4_K_M) | RAM for CPU | Best for |
|---|---|---|---|---|
| Qwen 3.5 0.6B | 0.6B | ~0.5GB | 2GB | Embedded, IoT, basic tasks |
| Qwen 3.5 4B | 4B | ~2.5GB | 6GB | Phone with 8GB RAM |
| Qwen 3.5 8B | 8B | ~5GB | 10GB | Sweet spot — phone/laptop |
| Qwen 3.5 14B | 14B | ~8GB | 18GB | Laptop with dedicated GPU |
| Qwen 3.5 32B | 32B | ~18GB | 38GB | Desktop with RTX 4090/5070 Ti |
| Qwen 3.5 30B-A3B (MoE) | 30B (3B active) | ~18GB | 38GB | Quality of 30B, speed of 3B |
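The VRAM column follows almost directly from the parameter count. Below is a back-of-envelope sketch, assuming roughly 4.85 bits per weight for Q4_K_M (an approximation; the KV cache and runtime buffers come on top of the weights):

```python
def q4_weight_size_gb(params_billion: float, bits_per_weight: float = 4.85) -> float:
    """Approximate size of the quantized weights alone, in GB (bits_per_weight is an assumption)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for size in (0.6, 4, 8, 14, 32):
    # KV cache and runtime buffers are not included, so leave extra headroom in practice.
    print(f"Qwen 3.5 {size}B -> ~{q4_weight_size_gb(size):.1f} GB of weights at Q4_K_M")
```

The results (roughly 0.4, 2.4, 4.9, 8.5, and 19.4 GB) line up with the table within rounding and overhead.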
The MoE Secret Weapon: Qwen 3.5 30B-A3B
This is the sleeper model in the lineup. Qwen 3.5 30B-A3B uses Mixture of Experts — 30B total parameters but only 3B are active for any given token. That means:
- Quality close to 30B — because the full parameter count contributes to learned knowledge
- Speed close to 3B — because only 3B parameters run per inference step
- VRAM matches 30B — you still need to load all parameters into memory
The catch: VRAM usage is the same as a 30B dense model (~18GB at Q4). But inference speed is dramatically faster. If you have the memory budget, the MoE variant gives you the best quality-per-second at this size.
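To put the trade-off in numbers, here is a rough sketch that treats decode speed as proportional to active parameters and memory as proportional to total parameters (routing overhead and memory-bandwidth limits are ignored, so real-world speedups are smaller):

```python
# Dense 30B vs MoE 30B-A3B at the same quantization level.
total_params_b = 30   # parameters that must be loaded into memory
active_params_b = 3   # parameters actually used per token via MoE routing

# Memory footprint is governed by total parameters: identical to a dense 30B (~18GB at Q4).
memory_ratio_vs_dense = total_params_b / 30
# Per-token compute is governed by active parameters: roughly 10x fewer FLOPs per token.
theoretical_speedup = 30 / active_params_b

print(f"memory vs dense 30B: {memory_ratio_vs_dense:.0f}x, "
      f"theoretical decode speedup: {theoretical_speedup:.0f}x")
```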
Running Qwen 3.5 on Your Phone
Android (8GB+ RAM)
The most practical option for on-device inference is MLC Chat (from the MLC LLM project):
1. Install MLC Chat from the Play Store or build from the MLC LLM GitHub repo
2. Download the Qwen 3.5 4B or 8B model (Q4 quantized)
3. Run — expect 5-15 tokens/second on modern flagship phones
Which model size for your phone:
| Phone RAM | Recommended model | Performance |
|---|---|---|
| 8GB | Qwen 3.5 4B (Q4) | Usable, ~10 tok/s |
| 12GB | Qwen 3.5 8B (Q4) | Good, ~8 tok/s |
| 16GB | Qwen 3.5 8B (Q8) | Better quality, ~6 tok/s |
iOS
On iOS, MLC Chat also supports iPhone 15 Pro and later (6GB+ RAM). The 4B model runs well. The 8B model works on iPhones with 8GB RAM (iPhone 16 Pro) but may be slow.
Practical Expectations
Phone inference is real but limited. Good for:
- Quick Q&A and lookups
- Translation while traveling
- Private note summarization
- Coding assistance on the go
Not great for: long document analysis, complex multi-turn conversations, or anything requiring sustained high throughput. For those, use a laptop or desktop.
Running on Laptop with Ollama
Ollama is the fastest path to running Qwen 3.5 on a laptop or desktop.
Installation
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
On macOS, you can also download the app from ollama.com.
Pull and Run
```bash
# Download and run the 8B model
ollama pull qwen3.5:8b
ollama run qwen3.5:8b

# For the 14B model (needs 10GB+ VRAM)
ollama pull qwen3.5:14b
ollama run qwen3.5:14b
```
That's it. No conda, no pip, no CUDA setup. Ollama handles quantization and GPU offloading automatically.
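Ollama also exposes a local REST API on port 11434, so you can script the same model from code. A minimal sketch using Python's requests library (the qwen3.5:8b tag matches the pull command above; adjust it if your local tag differs):

```python
import requests

# Ollama serves a local HTTP API at http://localhost:11434 once it's running.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3.5:8b",   # tag pulled with `ollama pull` above
        "prompt": "Explain quantization in two sentences.",
        "stream": False,         # return a single JSON object instead of a stream
    },
    timeout=120,
)
print(resp.json()["response"])
```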
Enable Thinking Mode
Qwen 3.5 supports a hybrid /think mode for harder problems:
```
>>> /think What is the time complexity of Dijkstra's algorithm and why?
```
This activates chain-of-thought reasoning, producing longer but more accurate responses for math, logic, and coding problems.
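The same toggle can be used when scripting the model instead of typing into the interactive prompt. Here is a minimal sketch with the ollama Python package, assuming the /think prefix is honored inside chat messages just as in the REPL above (check the Qwen model card if behavior differs):

```python
import ollama  # pip install ollama; talks to the local Ollama server

response = ollama.chat(
    model="qwen3.5:8b",  # tag pulled earlier with `ollama pull`
    messages=[{
        "role": "user",
        # Assumption: prefixing the message with /think enables chain-of-thought,
        # mirroring the interactive toggle shown above.
        "content": "/think What is the time complexity of Dijkstra's algorithm and why?",
    }],
)
print(response["message"]["content"])
```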
Performance by Hardware
| Hardware | Model | Speed (tok/s) |
|---|---|---|
| Apple M3 (16GB) | Qwen 3.5 8B Q4 | ~25 tok/s |
| Apple M4 (24GB) | Qwen 3.5 14B Q4 | ~20 tok/s |
| Apple M4 Pro (48GB) | Qwen 3.5 32B Q4 | ~15 tok/s |
| RTX 4060 (8GB) | Qwen 3.5 8B Q4 | ~40 tok/s |
| RTX 4090 (24GB) | Qwen 3.5 14B Q4 | ~60 tok/s |
| RTX 5070 Ti (16GB) | Qwen 3.5 14B Q4 | ~55 tok/s |
*Approximate speeds — varies by system configuration and prompt length.*
Recommended Hardware
- Apple Mac Mini M4 — unified memory makes it ideal for LLM inference. 16GB model handles 8B comfortably, 24GB handles 14B.
- NVIDIA RTX 4060 8GB — budget GPU that runs all 8B models at full speed.
- NVIDIA RTX 5070 Ti 16GB — 14B models at Q4, 8B at Q8 for higher quality.
- NVIDIA RTX 5090 32GB — runs the 32B model natively.
No GPU? Rent cloud GPUs on Vast.ai — RTX 4090 instances start under $0.50/hour.
Qwen 3.5 8B vs Llama 3.3 8B vs Gemma 3 9B
All three are strong sub-10B models. Here's how they compare for practical use:
| | Qwen 3.5 8B | Llama 3.3 8B | Gemma 3 9B |
|---|---|---|---|
| Reasoning | Best in class | Good | Good |
| Coding | Strong (LiveCodeBench lead) | Solid | Average |
| Math | MATH-500 ~82% | ~68% | ~72% |
| License | Apache 2.0 | Llama License | Google Terms |
| Thinking mode | Yes (/think) | No | No |
| Multilingual | Excellent (Chinese/English) | Good | Good |
| MoE variant | Yes (30B-A3B) | No | No |
| Ecosystem | Growing | Largest | Google-backed |
Verdict: Qwen 3.5 8B wins on reasoning, math, and coding. Llama 3.3 8B has the largest ecosystem and community support. Gemma 3 9B is solid but doesn't lead in any category at this size.
If raw reasoning quality matters most, go with Qwen 3.5. If you want the most community resources, tutorials, and fine-tunes, Llama has the edge.
Best Use Cases for Qwen 3.5 Small
- Coding assistant on the go — run on your laptop with Ollama for private, offline code completion and debugging
- Private AI chat — no data leaves your device, Apache 2.0 license, no usage restrictions
- Offline translation — excellent Chinese-English, solid on other language pairs
- Document summarization — feed documents through the Ollama API for local summarization pipelines (see the sketch after this list)
- Education and research — /think mode makes it useful for math tutoring and step-by-step problem solving
- Edge deployment — the 0.6B and 1.7B variants run on embedded devices and IoT hardware
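For the summarization pipeline mentioned in the list above, a minimal local sketch against Ollama's /api/generate endpoint could look like this (the chunk size, prompt wording, and report.txt filename are placeholders to adapt to your documents):

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "qwen3.5:8b"  # any locally pulled tag works

def generate(prompt: str) -> str:
    r = requests.post(OLLAMA_URL, json={"model": MODEL, "prompt": prompt, "stream": False}, timeout=300)
    return r.json()["response"]

def summarize(text: str, max_chunk_chars: int = 6000) -> str:
    """Summarize a long document by chunking, then condensing the per-chunk summaries."""
    chunks = [text[i:i + max_chunk_chars] for i in range(0, len(text), max_chunk_chars)]
    partial = [generate(f"Summarize the following text in 3 bullet points:\n\n{c}") for c in chunks]
    return generate("Combine these notes into a single short summary:\n\n" + "\n".join(partial))

if __name__ == "__main__":
    print(summarize(open("report.txt").read()))
```

Everything runs against the local server, so no document content leaves the machine.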
FAQ
Can Qwen 3.5 8B really compete with 70B models?
On specific reasoning benchmarks (GPQA Diamond), yes — the 8B model matches or exceeds Qwen 2.5 72B. On general conversation quality and breadth, larger models still win. The 8B model punches above its weight on structured reasoning tasks.
What's the best Qwen 3.5 model for my phone?
For 8GB RAM phones: Qwen 3.5 4B at Q4 quantization. For 12GB+: Qwen 3.5 8B at Q4. Use MLC Chat from llm.mlc.ai for the easiest setup.
How does Ollama compare to running llama.cpp directly?
Ollama wraps llama.cpp with automatic model management, GPU detection, and a REST API. Performance is nearly identical. Use Ollama unless you need fine-grained control over quantization or sampling. See our full Ollama guide.
Is Qwen 3.5 safe for commercial use?
Yes. The Apache 2.0 license allows commercial use, modification, and distribution with no restrictions beyond attribution.
What about Qwen 3.5 for coding specifically?
Qwen 3.5 8B leads LiveCodeBench in the sub-10B category. For serious coding work, also consider Qwen 3.5 32B or the frontier models in our Best LLM for Coding 2026 comparison. For a comparison of AI coding tools, see Claude Code vs Cursor vs GitHub Copilot.