
Qwen 3.5 Small: Best Open-Source LLM for Running AI on Your Phone


March 16, 2026·7 min read·1,402 words

Alibaba's Qwen 3.5 8B outperforms models roughly nine times its size on graduate-level reasoning. An 8-billion-parameter model beating 70B+ models on GPQA Diamond isn't supposed to happen — but the benchmarks are real, and you can run it on a phone with 8GB of RAM. This changes the practical floor for on-device AI. Here's what makes Qwen 3.5 Small different, how it compares to Llama and Gemma at similar sizes, and exactly how to run it on your phone and laptop, using the same workflow as running any LLM locally with Ollama.


Why Qwen 3.5 Small Matters

Most "small" LLMs trade quality for size — they run on limited hardware but produce noticeably worse output. Qwen 3.5 breaks that pattern. The 8B model matches or exceeds models with 70B+ parameters on specific reasoning benchmarks, while fitting in 5GB of VRAM.

Key facts:

  • 8B parameters (also available as 0.6B, 1.7B, 4B, 14B, 32B)
  • Apache 2.0 license — fully open, commercial use allowed
  • Hybrid thinking mode — toggle /think for chain-of-thought reasoning on hard problems
  • MoE variant: 30B-A3B (30B total parameters, only 3B active per token) — explained below

Benchmark Comparison

| Benchmark | Qwen 3.5 8B | Llama 3.3 8B | Gemma 3 9B | Qwen 2.5 72B |
|---|---|---|---|---|
| GPQA Diamond | ~45% | ~33% | ~38% | ~42% |
| MATH-500 | ~82% | ~68% | ~72% | ~80% |
| LiveCodeBench | ~35% | ~25% | ~28% | ~33% |
| Arena ELO (approx.) | ~1180 | ~1120 | ~1140 | ~1175 |

*Sources: Qwen team blog, community benchmarks. Scores are approximate — verify against latest papers before citation.*

The standout number: Qwen 3.5 8B scores ~45% on GPQA Diamond (Graduate-Level Google-Proof Q&A), beating the 72B Qwen 2.5 — a model that needs 40GB+ of VRAM. It also leads MATH-500 and LiveCodeBench in the sub-10B category. Llama 3.3 8B and Gemma 3 9B are good models, but on reasoning benchmarks Qwen 3.5 8B has a clear lead. And if you want to run far larger models (100B-class) on a single CPU with no GPU at all, see Microsoft BitNet.


The Full Qwen 3.5 Lineup

| Model | Parameters | VRAM (Q4_K_M) | RAM for CPU | Best for |
|---|---|---|---|---|
| Qwen 3.5 0.6B | 0.6B | ~0.5GB | 2GB | Embedded, IoT, basic tasks |
| Qwen 3.5 4B | 4B | ~2.5GB | 6GB | Phone with 8GB RAM |
| Qwen 3.5 8B | 8B | ~5GB | 10GB | Sweet spot — phone/laptop |
| Qwen 3.5 14B | 14B | ~8GB | 18GB | Laptop with dedicated GPU |
| Qwen 3.5 32B | 32B | ~18GB | 38GB | Desktop with RTX 4090/5070 Ti |
| Qwen 3.5 30B-A3B (MoE) | 30B (3B active) | ~18GB | 38GB | Quality of 30B, speed of 3B |

The MoE Secret Weapon: Qwen 3.5 30B-A3B

This is the sleeper model in the lineup. Qwen 3.5 30B-A3B uses Mixture of Experts — 30B total parameters but only 3B are active for any given token. That means:

  • Quality close to 30B — because the full parameter count contributes to learned knowledge
  • Speed close to 3B — because only 3B parameters run per inference step
  • VRAM matches 30B — you still need to load all parameters into memory

The catch: VRAM usage is the same as a 30B dense model (~18GB at Q4). But inference speed is dramatically faster. If you have the memory budget, the MoE variant gives you the best quality-per-second at this size.
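The mechanism is easy to see in code. Below is a toy sketch of top-k expert routing in plain Python — the expert count, gate scores, and k=2 are made-up illustration values, not Qwen's actual architecture:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(token, experts, gate_scores, k=2):
    """Route one token through only the top-k experts.

    experts: list of callables (stand-ins for expert feed-forward nets)
    gate_scores: one router score per expert for this token (normally
    produced by a small learned network; hardcoded here for illustration)
    """
    probs = softmax(gate_scores)
    top_k = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    # Only k experts execute — this is where the compute savings come from,
    # even though all experts' weights must be resident in memory.
    return sum(probs[i] * experts[i](token) for i in top_k)

# 8 toy "experts": each just scales the input differently
experts = [(lambda t, s=s: s * t) for s in range(1, 9)]
gate = [0.1, 2.0, 0.3, 1.5, 0.0, 0.2, 0.1, 0.4]  # router scores (made up)
out = moe_forward(10.0, experts, gate, k=2)  # only experts 1 and 3 run
```

Every expert's weights have to sit in memory so the router can pick any of them for any token, which is why VRAM tracks the 30B total while per-token compute tracks the 3B active.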


Running Qwen 3.5 on Your Phone

Android (8GB+ RAM)

The most practical option for on-device inference is MLC Chat (from the MLC LLM project):

1. Install MLC Chat from the Play Store or build from the MLC LLM GitHub repo

2. Download the Qwen 3.5 4B or 8B model (Q4 quantized)

3. Run — expect 5-15 tokens/second on modern flagship phones

Which model size for your phone:

| Phone RAM | Recommended model | Performance |
|---|---|---|
| 8GB | Qwen 3.5 4B (Q4) | Usable, ~10 tok/s |
| 12GB | Qwen 3.5 8B (Q4) | Good, ~8 tok/s |
| 16GB | Qwen 3.5 8B (Q8) | Better quality, ~6 tok/s |

iOS

On iOS, MLC Chat also supports iPhone 15 Pro and later (6GB+ RAM). The 4B model runs well. The 8B model works on iPhones with 8GB RAM (iPhone 16 Pro) but may be slow.

Practical Expectations

Phone inference is real but limited. Good for:

  • Quick Q&A and lookups
  • Translation while traveling
  • Private note summarization
  • Coding assistance on the go

Not great for: long document analysis, complex multi-turn conversations, or anything requiring sustained high throughput. For those, use a laptop or desktop.


Running on Laptop with Ollama

Ollama is the fastest path to running Qwen 3.5 on a laptop or desktop.

Installation


```bash
curl -fsSL https://ollama.com/install.sh | sh
```

On macOS, you can also download the app from ollama.com.

Pull and Run


```bash
# Download and run the 8B model
ollama pull qwen3.5:8b
ollama run qwen3.5:8b

# For the 14B model (needs 10GB+ VRAM)
ollama pull qwen3.5:14b
ollama run qwen3.5:14b
```

That's it. No conda, no pip, no CUDA setup. Ollama handles quantization and GPU offloading automatically.
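Because Ollama also exposes a local REST API (default port 11434), everything you can do in the terminal is scriptable. Here is a minimal non-streaming call from Python using only the standard library — the model tag and prompt are just examples, substitute whatever you pulled:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model: str, prompt: str) -> dict:
    """Request body for Ollama's /api/generate endpoint.
    stream=False returns one JSON object instead of a token stream."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """Send a prompt to the local Ollama server and return the completion."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    print(generate("qwen3.5:8b", "Explain mixture-of-experts in one sentence."))
```

Setting `"stream": False` returns a single JSON object with the full completion in its `response` field; omit it to receive newline-delimited JSON chunks as tokens are generated.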

Enable Thinking Mode

Qwen 3.5 supports a hybrid /think mode for harder problems:


```
>>> /think What is the time complexity of Dijkstra's algorithm and why?
```

This activates chain-of-thought reasoning, producing longer but more accurate responses for math, logic, and coding problems.

Performance by Hardware

| Hardware | Model | Speed (tok/s) |
|---|---|---|
| Apple M3 (16GB) | Qwen 3.5 8B Q4 | ~25 |
| Apple M4 (24GB) | Qwen 3.5 14B Q4 | ~20 |
| Apple M4 Pro (48GB) | Qwen 3.5 32B Q4 | ~15 |
| RTX 4060 (8GB) | Qwen 3.5 8B Q4 | ~40 |
| RTX 4090 (24GB) | Qwen 3.5 14B Q4 | ~60 |
| RTX 5070 Ti (16GB) | Qwen 3.5 14B Q4 | ~55 |

*Approximate speeds — varies by system configuration and prompt length.*

No GPU? Rent cloud GPUs on Vast.ai — RTX 4090 instances start under $0.50/hour.


Qwen 3.5 8B vs Llama 3.3 8B vs Gemma 3 9B

All three are strong sub-10B models. Here's how they compare for practical use:

| | Qwen 3.5 8B | Llama 3.3 8B | Gemma 3 9B |
|---|---|---|---|
| Reasoning | Best in class | Good | Good |
| Coding | Strong (LiveCodeBench lead) | Solid | Average |
| Math (MATH-500) | ~82% | ~68% | ~72% |
| License | Apache 2.0 | Llama License | Google Terms |
| Thinking mode | Yes (/think) | No | No |
| Multilingual | Excellent (Chinese/English) | Good | Good |
| MoE variant | Yes (30B-A3B) | No | No |
| Ecosystem | Growing | Largest | Google-backed |

Verdict: Qwen 3.5 8B wins on reasoning, math, and coding. Llama 3.3 8B has the largest ecosystem and community support. Gemma 3 9B is solid but doesn't lead in any category at this size.

If raw reasoning quality matters most, go with Qwen 3.5. If you want the most community resources, tutorials, and fine-tunes, Llama has the edge.


Best Use Cases for Qwen 3.5 Small

  • Coding assistant on the go — run on your laptop with Ollama for private, offline code completion and debugging
  • Private AI chat — no data leaves your device, Apache 2.0 license, no usage restrictions
  • Offline translation — excellent Chinese-English, solid on other language pairs
  • Document summarization — feed documents through the Ollama API for local summarization pipelines
  • Education and research — /think mode makes it useful for math tutoring and step-by-step problem solving
  • Edge deployment — the 0.6B and 1.7B variants run on embedded devices and IoT hardware
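The document-summarization use case above is a natural fit for a small map-reduce pipeline: split the document, summarize each chunk, then summarize the summaries. Here is a sketch in Python — the chunk size and prompt wording are arbitrary choices, and `llm` is any function that sends a prompt to your local model (for example, a call to Ollama's REST API) and returns the completion:

```python
from typing import Callable, List

def chunk_text(text: str, max_chars: int = 4000) -> List[str]:
    """Split on paragraph boundaries, packing paragraphs into chunks
    no longer than max_chars (oversized paragraphs are split hard)."""
    paras = [p for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for p in paras:
        if current and len(current) + len(p) + 2 > max_chars:
            chunks.append(current)          # flush the full chunk
            current = ""
        while len(p) > max_chars:           # hard-split a giant paragraph
            chunks.append(p[:max_chars])
            p = p[max_chars:]
        current = (current + "\n\n" + p).strip() if current else p
    if current:
        chunks.append(current)
    return chunks

def summarize(text: str, llm: Callable[[str], str]) -> str:
    """Map-reduce summarization: summarize each chunk, then combine."""
    partials = [llm(f"Summarize in 3 bullet points:\n\n{c}")
                for c in chunk_text(text)]
    if len(partials) == 1:
        return partials[0]
    return llm("Combine these partial summaries into one summary:\n\n"
               + "\n\n".join(partials))
```

Keeping the chunk size well under the model's context window leaves room for the prompt and the generated summary; tune `max_chars` to the model and quantization you actually run.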

FAQ

Can Qwen 3.5 8B really compete with 70B models?

On specific reasoning benchmarks (GPQA Diamond), yes — the 8B model matches or exceeds Qwen 2.5 72B. On general conversation quality and breadth, larger models still win. The 8B model punches above its weight on structured reasoning tasks.

What's the best Qwen 3.5 model for my phone?

For 8GB RAM phones: Qwen 3.5 4B at Q4 quantization. For 12GB+: Qwen 3.5 8B at Q4. Use MLC Chat from llm.mlc.ai for the easiest setup.

How does Ollama compare to running llama.cpp directly?

Ollama wraps llama.cpp with automatic model management, GPU detection, and a REST API. Performance is nearly identical. Use Ollama unless you need fine-grained control over quantization or sampling. See our full Ollama guide.

Is Qwen 3.5 safe for commercial use?

Yes. The Apache 2.0 license allows commercial use, modification, and distribution with no restrictions beyond attribution.

What about Qwen 3.5 for coding specifically?

Qwen 3.5 8B leads LiveCodeBench in the sub-10B category. For serious coding work, also consider Qwen 3.5 32B or the frontier models in our Best LLM for Coding 2026 comparison. For a comparison of AI coding tools, see Claude Code vs Cursor vs GitHub Copilot.

