Local LLM

MiniMax M3 VRAM requirements: workstation-class memory

MiniMax M3 is open weight with 428B total parameters and 23B active parameters. That makes it a serious local-inference story, but not a casual desktop model. Here is the practical VRAM and quantization picture.

June 13, 2026·8 min read·1,614 words

Last verified: 2026-06-13.

In short: Treat MiniMax M3 as a workstation, server, or cloud-GPU model. At about 428B total parameters, even a rough 4-bit weight-only estimate lands near 214GB before overhead, far beyond a single 24GB consumer GPU. Evaluate it via API or hosted inference first, and keep smaller models for desktop work.

MiniMax M3 is now an open-weight model on Hugging Face, but it is not a casual laptop-local model. The official model card describes MiniMax-M3 as a native multimodal model with a 1M context window, about 428B total parameters, and about 23B activated parameters (MiniMax M3 on Hugging Face). That sparse active-parameter count matters for compute. The total 428B parameter count still matters for storage, loading, quantization, and practical VRAM planning.

The short answer: treat MiniMax M3 as a workstation, server, or cloud-GPU model unless you are deliberately testing extreme low-bit quantization. Toolhalla has not run MiniMax M3 locally. This guide separates official model facts from rough memory arithmetic and community quant estimates.

*Disclosure: Toolhalla may earn from affiliate/referral links, such as eligible Amazon links or the Vast.ai referral link, when they are useful buying or compute alternatives for readers.*

Quick answer

Choose the MiniMax M3 API or hosted inference if you want to evaluate its coding, agent, long-context, or multimodal behavior before buying hardware. MiniMax links its API and product page from the official model card, and the official product page says the M3 API supports up to 1M tokens with a guaranteed minimum of 512K tokens via MiniMax Sparse Attention (MiniMax product page).

Choose local MiniMax M3 only if your goal is serious local inference testing, quantization work, or high-memory deployment research. The Hugging Face repository currently exposes 59 safetensor weight shards; Toolhalla measured the model weight files at about 854GB from the Hugging Face model API on 2026-06-13. That is consistent with a 428B-parameter model stored near 16-bit precision.
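The consistency check above is simple division. Here is a minimal sketch, assuming "GB" means 10^9 bytes and using only the two figures quoted in this guide; the function name is illustrative.

```python
def implied_bytes_per_param(total_bytes: float, total_params: float) -> float:
    """Average bytes stored per parameter, given file size and parameter count."""
    return total_bytes / total_params


# Figures quoted in this guide; assumption: 1GB = 10**9 bytes.
WEIGHT_FILE_BYTES = 854e9  # measured size of the weight shards
TOTAL_PARAMS = 428e9       # total parameters from the model card

bytes_per_param = implied_bytes_per_param(WEIGHT_FILE_BYTES, TOTAL_PARAMS)
bits_per_param = bytes_per_param * 8

print(f"{bytes_per_param:.2f} bytes/param, {bits_per_param:.1f} bits/param")
```

A result near 2 bytes per parameter is what 16-bit storage would produce. It does not prove the exact dtype of every tensor in the repository.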

Choose a smaller local model if you want a practical desktop assistant. For normal desktop local AI, start with Toolhalla's Ollama local LLM guide, LM Studio vs Jan vs GPT4All comparison, or hardware-focused Ryzen AI Halo vs Mac Studio vs DGX Spark guide.

What MiniMax officially says M3 is

The official Hugging Face model card says MiniMax-M3 is:

a native multimodal model for image-text-to-text use cases,
a mixture-of-experts model tagged for agent, coding, and video workflows,
roughly 428B total parameters,
roughly 23B activated parameters,
built for a 1M context window,
released under the MiniMax Community License (model card, license).

The license detail matters. Hugging Face labels the license as other with the minimax-community license name. The license text grants non-commercial use and includes commercial-use conditions such as displaying “Built with MiniMax M3” for products or services using the software (MiniMax M3 license). Do not treat “open weight” as the same thing as an unrestricted permissive license.

MiniMax's own M3 page also emphasizes the architecture: MiniMax Sparse Attention, or MSA. The product page says the API supports up to 1M tokens with a guaranteed minimum of 512K tokens, and frames that context window around long-range agent tasks, long-range coding, and long-video understanding (MiniMax M3 product page).

Why 23B active parameters does not mean 23B storage

Mixture-of-experts models can activate only part of the model for each token. That is why “~23B activated parameters” is useful for estimating inference compute. It does not mean the model takes up only 23B parameters' worth of space on disk or in memory.

For deployment planning, you still need to store the full parameter set or the quantized representation of it. That is why MiniMax M3 can be efficient per token while still requiring large memory and storage budgets.
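To make the compute-versus-storage split concrete, here is a rough sketch. It assumes the common rule of thumb of about 2 FLOPs per active parameter per token, which ignores attention cost and is not a MiniMax figure, and it assumes 2 bytes per parameter for 16-bit storage.

```python
TOTAL_PARAMS = 428e9   # from the model card (approximate)
ACTIVE_PARAMS = 23e9   # from the model card (approximate)

# Share of the model that participates in each token's forward pass.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Assumption: ~2 FLOPs per active parameter per token (rule of thumb only).
flops_per_token = 2 * ACTIVE_PARAMS

# Storage follows the TOTAL count, not the active count.
# Assumption: 2 bytes per parameter, 1GB = 10**9 bytes.
storage_gb_full = TOTAL_PARAMS * 2 / 1e9
storage_gb_if_dense_23b = ACTIVE_PARAMS * 2 / 1e9  # hypothetical dense 23B model

print(f"Active fraction per token: {active_fraction:.1%}")
print(f"Approx. compute per token: {flops_per_token:.2e} FLOPs")
print(f"16-bit storage, full model: ~{storage_gb_full:.0f}GB")
print(f"16-bit storage, dense 23B for comparison: ~{storage_gb_if_dense_23b:.0f}GB")
```

Per-token compute looks like a mid-size model, while the storage line looks like a data-center model. Both are true at once.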

A rough weight-only estimate for a 428B-parameter model looks like this:

Format	Rough weight-only size	Practical meaning
16-bit / BF16-like	~856GB	close to the full released weight footprint
8-bit	~428GB	still server/workstation territory
6-bit	~321GB	multi-GPU or high-memory unified-memory systems
5-bit	~268GB	possible only for specialized local rigs
4-bit	~214GB	the first range that looks plausible for extreme enthusiast/server setups
3-bit	~161GB	aggressive quality tradeoff territory
2-bit	~107GB	last-resort compression, not a default recommendation

These are weight-only arithmetic estimates, so treat each figure as a floor. They exclude runtime and framework overhead, KV cache, multimodal components, and quantization metadata, and memory use also grows with batch size and context length. Real requirements can be higher, and quality can drop as quantization becomes more aggressive.
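The table can be reproduced with one formula: parameters times bits per weight, divided by 8 bits per byte. A minimal sketch, assuming 1GB = 10^9 bytes and a uniform bit width across all weights (real quant formats mix widths and add metadata):

```python
def weight_only_gb(total_params: float, bits_per_weight: float) -> float:
    """Weight-only size in GB (10**9 bytes), ignoring all runtime overhead."""
    return total_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 428e9  # approximate total from the model card

for bits in (16, 8, 6, 5, 4, 3, 2):
    size_gb = weight_only_gb(TOTAL_PARAMS, bits)
    print(f"{bits:>2}-bit: ~{size_gb:.1f}GB")
```

Swapping in a different parameter count or bit width gives the same kind of floor for any other model.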

What the TeksEdge post adds

David Hendrickson's TeksEdge post is useful because it translates the release into the local-AI buyer question: what sizes should people expect once MiniMax M3 gets quantized for GGUF-style local use? His community estimates put Q8 around 430–450GB, Q6 around 340–360GB, Q5 around 280–310GB, Q4 around 220–250GB, Q3 around 170–200GB, and Q2 around 110–140GB (TeksEdge X post).

Those numbers are not an official MiniMax sizing table. They are still directionally useful because they line up with the simple parameter-count math: even at 4-bit, a 428B model remains far beyond one 24GB consumer GPU.
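One way to see how the community ranges relate to the parameter-count math is to convert each size back into effective bits per weight. This is a sketch using only the ranges quoted above and the ~428B total; the dictionary and function names are illustrative, and 1GB is assumed to be 10^9 bytes.

```python
TOTAL_PARAMS = 428e9  # approximate total from the model card

# Community estimate ranges in GB, as quoted from the TeksEdge post.
COMMUNITY_ESTIMATES_GB = {
    "Q8": (430, 450),
    "Q6": (340, 360),
    "Q5": (280, 310),
    "Q4": (220, 250),
    "Q3": (170, 200),
    "Q2": (110, 140),
}


def effective_bits_per_weight(size_gb: float, total_params: float = TOTAL_PARAMS) -> float:
    """Average bits per parameter implied by a file size in GB."""
    return size_gb * 1e9 * 8 / total_params


for label, (low_gb, high_gb) in COMMUNITY_ESTIMATES_GB.items():
    low_bits = effective_bits_per_weight(low_gb)
    high_bits = effective_bits_per_weight(high_gb)
    print(f"{label}: {low_gb}-{high_gb}GB -> {low_bits:.2f}-{high_bits:.2f} bits/weight")
```

Every range lands slightly above its nominal bit width, which is what you would expect if the estimates allow for quantization metadata or some tensors kept at higher precision.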

Hardware fit: who can actually run this locally?

MiniMax M3 is better viewed as a local-infrastructure test than a normal desktop model.

Practical options:

Hosted/API first: use MiniMax API or supported gateways to evaluate quality before hardware spend.
Cloud GPU rental: use rented high-VRAM machines when you only need experiments. Toolhalla's canonical cloud-GPU referral is Vast.ai.
Multi-GPU workstation: consider this only if you already understand tensor parallelism, framework support, power, cooling, and VRAM limits. If you are pricing parts, start from current NVIDIA GeForce GPU information and compare against renting.
Unified-memory boxes: systems with 128GB unified memory may help for smaller open models, but MiniMax M3 exceeds that on weights alone at 3-bit (~161GB) and above, and even the ~107GB 2-bit estimate leaves little room for overhead. See Toolhalla's Ryzen AI Halo vs Mac Studio vs DGX Spark guide for the broader local-AI hardware tradeoff.
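For a first-pass fit check on the options above, divide the weight-only estimate by the usable memory per device. This is a sketch, not a sizing tool: the `usable_fraction` headroom parameter is a hypothetical knob, not a measured value, and the result ignores KV cache, interconnect limits, and framework support.

```python
import math


def min_devices(weight_gb: float, memory_per_device_gb: float,
                usable_fraction: float = 1.0) -> int:
    """Smallest device count whose combined usable memory holds the weights.

    usable_fraction is a hypothetical headroom factor (1.0 = no headroom).
    """
    usable_gb = memory_per_device_gb * usable_fraction
    return math.ceil(weight_gb / usable_gb)


Q4_WEIGHT_ONLY_GB = 214  # rough 4-bit estimate from the table above

# 24GB consumer GPUs, with and without 10% headroom (headroom is an assumption).
print("24GB GPUs, no headroom:", min_devices(Q4_WEIGHT_ONLY_GB, 24))
print("24GB GPUs, 10% headroom:", min_devices(Q4_WEIGHT_ONLY_GB, 24, 0.9))

# 128GB unified-memory systems.
print("128GB boxes, no headroom:", min_devices(Q4_WEIGHT_ONLY_GB, 128))
```

Even the no-headroom count for 24GB cards is a lower bound on weights alone, which is why renting is the sensible first step for experiments.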

Do not buy hardware for MiniMax M3 from a headline alone. Wait for tested inference recipes, quantized artifacts, framework support notes, and real throughput measurements for your workload.

Benchmark caveats

MiniMax's official M3 blog reports strong coding and agent benchmark results: SWE-Bench Pro 59.0%, Terminal-Bench 2.1 66.0%, SWE-fficiency 34.8%, KernelBench Hard 28.8%, and MCP Atlas 74.2% (MiniMax M3 blog). Treat those as vendor-provided claims unless you have independent benchmark runs under your target inference stack.

The MiniMax Sparse Attention paper adds technical context. Its abstract says MSA reduced per-token attention compute by 28.4x at 1M context on a 109B-parameter native multimodal model, and that a co-designed kernel achieved 14.2x prefill and 7.6x decoding wall-clock speedups on H800 (MiniMax Sparse Attention paper). That supports the architectural story, but it does not automatically answer the desktop question: can your actual rig run M3 at acceptable speed, quality, and context length?

Decision matrix

Reader	Best next step	Why
Builder testing M3 quality	use API/hosted access first	isolates model quality from local inference pain
Local AI hobbyist	wait for mature quants and recipes	raw weight size is too large for casual desktop use
Inference engineer	track HF files, vLLM/SGLang/Transformers support, quant artifacts	this is an infrastructure project, not just a model download
Company evaluating commercial use	read the MiniMax Community License first	open weight does not mean unrestricted commercial use
Hardware buyer	compare rented GPUs before buying	one model release is not enough reason to build a multi-GPU rig

What remains unclear

The release answers the big identity questions: M3 has official weights, a model card, a license, and a public architecture story. The practical local story still needs more evidence:

stable inference recipes across vLLM, SGLang, Transformers, llama.cpp, and other local stacks,
official or widely tested GGUF quant artifacts,
throughput and memory reports on real multi-GPU consumer rigs,
quality loss at Q4, Q3, and Q2,
long-context memory behavior with real prompts, images, or video inputs,
commercial-use interpretation for specific product deployments.

Toolhalla verdict

MiniMax M3 is important because it pushes open-weight models further into coding, agent, multimodal, and long-context territory. It is also a clean reminder that “open weights” and “locally practical” are different claims.

For most readers, the right move is to evaluate M3 through hosted access, watch quant and runtime support, and keep smaller local models for day-to-day desktop work. For infrastructure teams, M3 is worth tracking because its sparse-attention and MoE design may shape how future open-weight agent models are served.

FAQ

Is MiniMax M3 open source?

MiniMax M3 is open weight on Hugging Face, but it uses the MiniMax Community License rather than a standard permissive open-source license (model card, license). Check the license before commercial use.

How much VRAM does MiniMax M3 need?

There is no single official VRAM number for every runtime and quant. A rough 4-bit weight-only estimate for 428B parameters is about 214GB before overhead. Community estimates in the TeksEdge post put Q4-style GGUF sizes around 220–250GB (TeksEdge X post).

Does 23B active parameters mean I can run it like a 23B model?

No. The ~23B activated-parameter figure helps explain per-token compute, but deployment still needs access to the full 428B-parameter weight set or its quantized representation (MiniMax M3 model card).

Should I buy GPUs for MiniMax M3?

Not from the launch details alone. Test hosted access first, then wait for proven local recipes, quant quality reports, and throughput numbers. Renting GPUs can be lower-risk than buying hardware for one model.

What should Toolhalla track next?

Track official quant releases, llama.cpp/GGUF support, vLLM and SGLang deployment notes, independent benchmark runs, and real memory/throughput reports on RTX 5090-class multi-GPU systems.

Directory update pack

Add or update model entry: MiniMax M3 / MiniMax-M3.
Tags: open-weight, multimodal, MoE, coding, agent model, long context, video, local AI, high-memory inference.
License field: MiniMax Community License; not standard permissive open source.
Access fields: Hugging Face weights, MiniMax API/product page, hosted evaluation first.
Watchlist: GGUF quants, vLLM/SGLang/Transformers support, llama.cpp support, independent benchmarks, commercial-use clarifications.



