Qwen 3.5 Small: Best Open-Source LLM for Running AI on Your Phone
Alibaba's Qwen 3.5 8B outperforms models nine times its size on graduate-level reasoning. An 8-billion-parameter model beating 70B+ models on GPQA Diamond isn't supposed to happen — but the benchmarks are real, and you can run it on a phone with 8GB of RAM. This changes the practical floor for on-device AI. Here's what makes Qwen 3.5 Small different, how it compares to Llama and Gemma at similar sizes, and exactly how to run it on your phone and laptop, much as you would run other LLMs locally with Ollama.
Why Qwen 3.5 Small Matters
Most "small" LLMs trade quality for size — they run on limited hardware but produce noticeably worse output. Qwen 3.5 breaks that pattern. The 8B model matches or exceeds models with 70B+ parameters on specific reasoning benchmarks, while fitting in 5GB of VRAM.
Key facts:
- 8B parameters (also available as 0.6B, 1.7B, 4B, 14B, 32B)
- Apache 2.0 license — fully open, commercial use allowed
- Hybrid thinking mode — toggle /think for chain-of-thought reasoning on hard problems
- MoE variant: 30B-A3B (30B total parameters, only 3B active per token) — explained below
Benchmark Comparison
| Benchmark | Qwen 3.5 8B | Llama 3.3 8B | Gemma 3 9B | Qwen 2.5 72B |
|---|---|---|---|---|
| GPQA Diamond | ~45% | ~33% | ~38% | ~42% |
| MATH-500 | ~82% | ~68% | ~72% | ~80% |
| LiveCodeBench | ~35% | ~25% | ~28% | ~33% |
| Arena ELO (approx) | ~1180 | ~1120 | ~1140 | ~1175 |
*Sources: Qwen team blog, community benchmarks. Scores are approximate — verify against latest papers before citation.*
The standout number: Qwen 3.5 8B scores ~45% on GPQA Diamond (Graduate-Level Google-Proof Q&A), beating the 72B Qwen 2.5 that needs 40GB+ VRAM. It also leads MATH-500 and LiveCodeBench in the sub-10B category. Llama 3.3 8B and Gemma 3 9B are good models — but on reasoning benchmarks, Qwen 3.5 8B has a clear lead. If you're interested in running larger models like the 100B parameter ones on a single CPU, check out Microsoft BitNet for a no-GPU solution.
The Full Qwen 3.5 Lineup
| Model | Parameters | VRAM (Q4_K_M) | RAM for CPU | Best for |
|---|---|---|---|---|
| Qwen 3.5 0.6B | 0.6B | ~0.5GB | 2GB | Embedded, IoT, basic tasks |
| Qwen 3.5 4B | 4B | ~2.5GB | 6GB | Phone with 8GB RAM |
| Qwen 3.5 8B | 8B | ~5GB | 10GB | Sweet spot — phone/laptop |
| Qwen 3.5 14B | 14B | ~8GB | 18GB | Laptop with dedicated GPU |
| Qwen 3.5 32B | 32B | ~18GB | 38GB | Desktop with RTX 4090/5070 Ti |
| Qwen 3.5 30B-A3B (MoE) | 30B (3B active) | ~18GB | 38GB | Quality of 30B, speed of 3B |
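The VRAM column follows almost directly from the parameter count. Below is a back-of-envelope sketch, assuming roughly 4.85 bits per weight for Q4_K_M (an approximation; the KV cache and runtime buffers come on top of the weights):

```python
def q4_weight_size_gb(params_billion: float, bits_per_weight: float = 4.85) -> float:
    """Approximate size of the quantized weights alone, in GB (bits_per_weight is an assumption)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for size in (0.6, 4, 8, 14, 32):
    # KV cache and runtime buffers are not included, so leave extra headroom in practice.
    print(f"Qwen 3.5 {size}B -> ~{q4_weight_size_gb(size):.1f} GB of weights at Q4_K_M")
```

The results (roughly 0.4, 2.4, 4.9, 8.5, and 19.4 GB) line up with the table within rounding and overhead.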
The MoE Secret Weapon: Qwen 3.5 30B-A3B
This is the sleeper model in the lineup. Qwen 3.5 30B-A3B uses Mixture of Experts — 30B total parameters but only 3B are active for any given token. That means:
- Quality close to 30B — because the full parameter count contributes to learned knowledge
- Speed close to 3B — because only 3B parameters run per inference step
- VRAM matches 30B — you still need to load all parameters into memory
The catch: VRAM usage is the same as a 30B dense model (~18GB at Q4). But inference speed is dramatically faster. If you have the memory budget, the MoE variant gives you the best quality-per-second at this size.
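To put the trade-off in numbers, here is a rough sketch that treats decode speed as proportional to active parameters and memory as proportional to total parameters (routing overhead and memory-bandwidth limits are ignored, so real-world speedups are smaller):

```python
# Dense 30B vs MoE 30B-A3B at the same quantization level.
total_params_b = 30   # parameters that must be loaded into memory
active_params_b = 3   # parameters actually used per token via MoE routing

# Memory footprint is governed by total parameters: identical to a dense 30B (~18GB at Q4).
memory_ratio_vs_dense = total_params_b / 30
# Per-token compute is governed by active parameters: roughly 10x fewer FLOPs per token.
theoretical_speedup = 30 / active_params_b

print(f"memory vs dense 30B: {memory_ratio_vs_dense:.0f}x, "
      f"theoretical decode speedup: {theoretical_speedup:.0f}x")
```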
Running Qwen 3.5 on Your Phone
Android (8GB+ RAM)
The most practical option for on-device inference is MLC Chat (from the MLC LLM project):
1. Install MLC Chat from the Play Store or build from the MLC LLM GitHub repo
2. Download the Qwen 3.5 4B or 8B model (Q4 quantized)
3. Run — expect 5-15 tokens/second on modern flagship phones
Which model size for your phone:
| Phone RAM | Recommended model | Performance |
|---|---|---|
| 8GB | Qwen 3.5 4B (Q4) | Usable, ~10 tok/s |
| 12GB | Qwen 3.5 8B (Q4) | Good, ~8 tok/s |
| 16GB | Qwen 3.5 8B (Q8) | Better quality, ~6 tok/s |
iOS
On iOS, MLC Chat also supports iPhone 15 Pro and later (6GB+ RAM). The 4B model runs well. The 8B model works on iPhones with 8GB RAM (iPhone 16 Pro) but may be slow.
Practical Expectations
Phone inference is real but limited. Good for:
- Quick Q&A and lookups
- Translation while traveling
- Private note summarization
- Coding assistance on the go
Not great for: long document analysis, complex multi-turn conversations, or anything requiring sustained high throughput. For those, use a laptop or desktop.
Running on Laptop with Ollama
Ollama is the fastest path to running Qwen 3.5 on a laptop or desktop.
Installation
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
On macOS, you can also download the app from ollama.com.
Pull and Run
```bash
# Download and run the 8B model
ollama pull qwen3.5:8b
ollama run qwen3.5:8b

# For the 14B model (needs 10GB+ VRAM)
ollama pull qwen3.5:14b
ollama run qwen3.5:14b
```
That's it. No conda, no pip, no CUDA setup. Ollama handles quantization and GPU offloading automatically.
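Ollama also exposes a local REST API on port 11434, so you can script the same model from code. A minimal sketch using Python's requests library (the qwen3.5:8b tag matches the pull command above; adjust it if your local tag differs):

```python
import requests

# Ollama serves a local HTTP API at http://localhost:11434 once it's running.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3.5:8b",   # tag pulled with `ollama pull` above
        "prompt": "Explain quantization in two sentences.",
        "stream": False,         # return a single JSON object instead of a stream
    },
    timeout=120,
)
print(resp.json()["response"])
```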
Enable Thinking Mode
Qwen 3.5 supports a hybrid /think mode for harder problems:
```
>>> /think What is the time complexity of Dijkstra's algorithm and why?
```
This activates chain-of-thought reasoning, producing longer but more accurate responses for math, logic, and coding problems.
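The same toggle can be used when scripting the model instead of typing into the interactive prompt. Here is a minimal sketch with the ollama Python package, assuming the /think prefix is honored inside chat messages just as in the REPL above (check the Qwen model card if behavior differs):

```python
import ollama  # pip install ollama; talks to the local Ollama server

response = ollama.chat(
    model="qwen3.5:8b",  # tag pulled earlier with `ollama pull`
    messages=[{
        "role": "user",
        # Assumption: prefixing the message with /think enables chain-of-thought,
        # mirroring the interactive toggle shown above.
        "content": "/think What is the time complexity of Dijkstra's algorithm and why?",
    }],
)
print(response["message"]["content"])
```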
Performance by Hardware
| Hardware | Model | Speed (tok/s) |
|---|---|---|
| Apple M3 (16GB) | Qwen 3.5 8B Q4 | ~25 tok/s |
| Apple M4 (24GB) | Qwen 3.5 14B Q4 | ~20 tok/s |
| Apple M4 Pro (48GB) | Qwen 3.5 32B Q4 | ~15 tok/s |
| RTX 4060 (8GB) | Qwen 3.5 8B Q4 | ~40 tok/s |
| RTX 4090 (24GB) | Qwen 3.5 14B Q4 | ~60 tok/s |
| RTX 5070 Ti (16GB) | Qwen 3.5 14B Q4 | ~55 tok/s |
*Approximate speeds — varies by system configuration and prompt length.*
Recommended Hardware
- Apple Mac Mini M4 — unified memory makes it ideal for LLM inference. 16GB model handles 8B comfortably, 24GB handles 14B.
- NVIDIA RTX 4060 8GB — budget GPU that runs all 8B models at full speed.
- NVIDIA RTX 5070 Ti 16GB — 14B models at Q4, 8B at Q8 for higher quality.
- NVIDIA RTX 5090 32GB — runs the 32B model natively.
No GPU? Rent cloud GPUs on Vast.ai — RTX 4090 instances start under $0.50/hour.
Qwen 3.5 8B vs Llama 3.3 8B vs Gemma 3 9B
All three are strong sub-10B models. Here's how they compare for practical use:
| | Qwen 3.5 8B | Llama 3.3 8B | Gemma 3 9B |
|---|---|---|---|
| Reasoning | Best in class | Good | Good |
| Coding | Strong (LiveCodeBench lead) | Solid | Average |
| Math | MATH-500 ~82% | ~68% | ~72% |
| License | Apache 2.0 | Llama License | Google Terms |
| Thinking mode | Yes (/think) | No | No |
| Multilingual | Excellent (Chinese/English) | Good | Good |
| MoE variant | Yes (30B-A3B) | No | No |
| Ecosystem | Growing | Largest | Google-backed |
Verdict: Qwen 3.5 8B wins on reasoning, math, and coding. Llama 3.3 8B has the largest ecosystem and community support. Gemma 3 9B is solid but doesn't lead in any category at this size.
If raw reasoning quality matters most, go with Qwen 3.5. If you want the most community resources, tutorials, and fine-tunes, Llama has the edge.
Best Use Cases for Qwen 3.5 Small
- Coding assistant on the go — run on your laptop with Ollama for private, offline code completion and debugging
- Private AI chat — no data leaves your device, Apache 2.0 license, no usage restrictions
- Offline translation — excellent Chinese-English, solid on other language pairs
- Document summarization — feed documents through the Ollama API for local summarization pipelines (see the sketch after this list)
- Education and research — /think mode makes it useful for math tutoring and step-by-step problem solving
- Edge deployment — the 0.6B and 1.7B variants run on embedded devices and IoT hardware
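For the summarization pipeline mentioned in the list above, a minimal local sketch against Ollama's /api/generate endpoint could look like this (the chunk size, prompt wording, and report.txt filename are placeholders to adapt to your documents):

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "qwen3.5:8b"  # any locally pulled tag works

def generate(prompt: str) -> str:
    r = requests.post(OLLAMA_URL, json={"model": MODEL, "prompt": prompt, "stream": False}, timeout=300)
    return r.json()["response"]

def summarize(text: str, max_chunk_chars: int = 6000) -> str:
    """Summarize a long document by chunking, then condensing the per-chunk summaries."""
    chunks = [text[i:i + max_chunk_chars] for i in range(0, len(text), max_chunk_chars)]
    partial = [generate(f"Summarize the following text in 3 bullet points:\n\n{c}") for c in chunks]
    return generate("Combine these notes into a single short summary:\n\n" + "\n".join(partial))

if __name__ == "__main__":
    print(summarize(open("report.txt").read()))
```

Everything runs against the local server, so no document content leaves the machine.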
FAQ
Can Qwen 3.5 8B really compete with 70B models?
On specific reasoning benchmarks (GPQA Diamond), yes — the 8B model matches or exceeds Qwen 2.5 72B. On general conversation quality and breadth, larger models still win. The 8B model punches above its weight on structured reasoning tasks.
What's the best Qwen 3.5 model for my phone?
For 8GB RAM phones: Qwen 3.5 4B at Q4 quantization. For 12GB+: Qwen 3.5 8B at Q4. Use MLC Chat from llm.mlc.ai for the easiest setup.
How does Ollama compare to running llama.cpp directly?
Ollama wraps llama.cpp with automatic model management, GPU detection, and a REST API. Performance is nearly identical. Use Ollama unless you need fine-grained control over quantization or sampling. See our full Ollama guide.
Is Qwen 3.5 safe for commercial use?
Yes. The Apache 2.0 license allows commercial use, modification, and distribution with no restrictions beyond attribution.
What about Qwen 3.5 for coding specifically?
Qwen 3.5 8B leads LiveCodeBench in the sub-10B category. For serious coding work, also consider Qwen 3.5 32B or the frontier models in our Best LLM for Coding 2026 comparison. For a comparison of AI coding tools, see Claude Code vs Cursor vs GitHub Copilot.