How to Run Qwen 3.6 on an M5 Pro MacBook (64 GB)
A recipe-style guide to running Alibaba’s Qwen3.6-27B dense model (or the 35B-A3B MoE) on the new M5 Pro MacBook — MLX and llama.cpp side-by-side, MTP speculative decoding, and an OpenAI-compatible HTTP server you can point your agent at.
Ingredients
- An M5 Pro MacBook Pro with 64 GB unified memory (~307 GB/s bandwidth, 16- or 20-core GPU with the new per-core Neural Accelerators)
- macOS Tahoe 26 (or Sequoia 15), with at least 60 GB free disk — Q4 weights are 16–22 GB but you’ll want room for two model variants
- Xcode command-line tools (xcode-select --install) for building llama.cpp
- Python 3.12 and uv (or conda) for the MLX path
- Homebrew — we’ll use it to install cmake and llama.cpp
- A Hugging Face account (for downloading the larger MLX/GGUF repos)
By the end of this recipe you will have an OpenAI-compatible HTTP
endpoint on http://localhost:8080 serving Qwen3.6 at
a comfortable 128k context (or the full 256k
native window if you cap concurrency), with MTP
speculative decoding doing the heavy lifting on the
dense model.
Pick your runtime
Two engines are worth your time on Apple Silicon. Both are free, both are mature, and they have meaningfully different strengths.
MLX (mlx-lm)
- Strengths: fastest decode on M-series, native MoE, Apple-maintained, simple pip install
- Weakness: classical speculative decoding currently broken for Qwen3 (issue #846); use MTP path instead
- Server: mlx_lm.server — OpenAI-compatible
llama.cpp
- Strengths: biggest GGUF quant zoo, robust MTP support, mature KV-cache quantization, ecosystem of tools
- Weakness: decode is typically 10–20% behind MLX on M-series; current Metal regression on IQ4_XS
- Server: llama-server — OpenAI-compatible
We’ll set up both below — they coexist happily, and you’ll want to swap between them depending on workload. Skip directly to your preferred path if you’re in a hurry.
Pick your model
Qwen released two open-weight flagships on April 16, 2026. On a 64 GB M5 Pro both fit comfortably at 4-bit with room for long context, so the decision is about workload shape rather than memory.
Qwen3.6-27B
- Params: 27 B (all active)
- Context: 262,144 native, 1 M with YaRN
- 4-bit footprint: ~16–18 GB weights
- Best for: coding, long-form reasoning, single-stream quality
Qwen3.6-35B-A3B
- Params: 35 B total / 3 B active
- Context: 262,144 native, 1 M with YaRN
- 4-bit footprint: ~20–22 GB weights
- Best for: fast generation, vision, cheap tokens
1 Free up VRAM with iogpu.wired_limit_mb
macOS caps the memory the GPU can wire at roughly 75% of unified memory by default. On a 64 GB machine that’s ~48 GB — enough for most cases, but tight if you want 27B at 8-bit or 100k+ KV cache. Raise the cap with a single sysctl. The OS still gets the rest; don’t starve it.
# Hand the GPU up to 56 GB; keep 8 GB for the OS and browser tabs.
sudo sysctl iogpu.wired_limit_mb=57344
# Verify
sysctl iogpu.wired_limit_mb
The setting resets on reboot. If you want it persistent, drop a
/Library/LaunchDaemons/iogpu.wired.plist with the
sysctl baked in — but honestly, a 5-second one-liner before
you start serving is fine.
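If you do want it persistent, here is a minimal sketch of that LaunchDaemon, assuming the plist path above and a label chosen for this guide; adjust the value before loading it.
# Write the daemon (runs sysctl once at boot), then load it immediately.
sudo tee /Library/LaunchDaemons/iogpu.wired.plist >/dev/null <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key><string>iogpu.wired</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/sbin/sysctl</string>
    <string>iogpu.wired_limit_mb=57344</string>
  </array>
  <key>RunAtLoad</key><true/>
</dict>
</plist>
EOF
sudo launchctl load /Library/LaunchDaemons/iogpu.wired.plist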
2 Path A — MLX (the fast path)
2.1 Install mlx-lm
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv --python 3.12 ~/.venvs/mlx
source ~/.venvs/mlx/bin/activate
uv pip install --upgrade "mlx-lm" "mlx-vlm" "huggingface_hub[cli]"
# Sanity check
python -c "import mlx_lm; print(mlx_lm.__version__)"
mlx-vlm is only needed if you want the multimodal
(vision) side of the 35B-A3B. Text-only? You can skip it.
2.2 Pull the weights
The mlx-community org publishes pre-converted
checkpoints. There are also OptiQ (sensitivity-aware mixed
precision) variants and Unsloth’s “Dynamic 2.0”
quants that recover most of the Q4 accuracy gap.
hf auth login # paste an HF read token
# Option A: dense 27B, plain 4-bit (~16 GB)
hf download mlx-community/Qwen3.6-27B-4bit \
--local-dir ~/models/Qwen3.6-27B-mlx-4bit
# Option B: dense 27B, Unsloth Dynamic 4-bit (slightly higher quality, same size)
hf download unsloth/Qwen3.6-27B-UD-MLX-4bit \
--local-dir ~/models/Qwen3.6-27B-mlx-UD-4bit
# Option C: MoE 35B-A3B, 4-bit (~21 GB)
hf download mlx-community/Qwen3.6-35B-A3B-4bit \
--local-dir ~/models/Qwen3.6-35B-A3B-mlx-4bit
2.3 Start the OpenAI-compatible server
mlx_lm.server ships in the same package. Default
--max-tokens 512 is silly — raise it.
mlx_lm.server \
--model ~/models/Qwen3.6-27B-mlx-4bit \
--host 127.0.0.1 \
--port 8080 \
--max-tokens 8192
Smoke-test with curl:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3.6-27b",
"messages": [{"role": "user", "content": "Say hi."}],
"max_tokens": 64
}'
2.4 Speculative decoding on MLX — read this first
mlx-lm’s --draft-model speculative decoding currently skips tokens on the Qwen3 family across q4/q6/bf16 drafts (issue #846). The output looks plausible but is subtly wrong. Until it’s fixed, do not use the generic --draft-model flag with the standard checkpoints.
The workaround is to use a checkpoint that bakes Qwen’s trained Multi-Token Prediction (MTP) head into the same file. That avoids the broken external-drafter code path entirely. Two community projects ship MTP-ready MLX builds:
- Youssofal/Qwen3.6-27B-MTPLX-Optimized-Speed — uses the MTPLX runtime, ~2.2× decode reported at temperature 0.6
- bstnxbt/dflash-mlx — DFlash port to MLX, comparable speedup, slightly different licence
Both are drop-in: download the repo, point
mlx_lm.server --model at it, and the MTP head is
picked up automatically. Don’t pass
--draft-model.
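For example, the MTPLX variant above pulls and serves exactly like the plain checkpoint; the local directory name below is just a suggestion.
# Download the MTP-ready build and serve it -- note: no --draft-model flag
hf download Youssofal/Qwen3.6-27B-MTPLX-Optimized-Speed \
  --local-dir ~/models/Qwen3.6-27B-mlx-mtp
mlx_lm.server \
  --model ~/models/Qwen3.6-27B-mlx-mtp \
  --host 127.0.0.1 --port 8080 \
  --max-tokens 8192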
3 Path B — llama.cpp
3.1 Install
Homebrew ships a recent build, but you’ll likely want the
head of master for MTP support and the latest Metal
kernels. Both options below work.
# Quick path: Homebrew
brew install llama.cpp
# Or build from source for the freshest kernels
brew install cmake
git clone https://github.com/ggml-org/llama.cpp ~/src/llama.cpp
cd ~/src/llama.cpp
cmake -B build -DGGML_METAL=ON -DLLAMA_CURL=ON
cmake --build build -j --config Release
# Add to PATH
echo 'export PATH="$HOME/src/llama.cpp/build/bin:$PATH"' >> ~/.zshrc
exec zsh
3.2 Pull the GGUF
Unsloth and bartowski both publish full quant ladders. The Unsloth “UD” (Dynamic 2.0) repos calibrate per-tensor and recover most of the Q4 accuracy gap.
# Dense 27B, Q4_K_M (~16 GB) — safe default
hf download unsloth/Qwen3.6-27B-GGUF \
--include "Qwen3.6-27B-UD-Q4_K_M.gguf" \
--local-dir ~/models/Qwen3.6-27B-gguf
# MoE 35B-A3B, Q4_K_M (~21 GB)
hf download unsloth/Qwen3.6-35B-A3B-GGUF \
--include "Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" \
--local-dir ~/models/Qwen3.6-35B-A3B-gguf
# For MTP speculative decoding (next step), also fetch the MTP variant
hf download unsloth/Qwen3.6-27B-MTP-GGUF \
--include "Qwen3.6-27B-MTP-Q4_K_M.gguf" \
--local-dir ~/models/Qwen3.6-27B-mtp-gguf
3.3 Start the server
llama-server exposes the same OpenAI-compatible
endpoints as MLX. The flags below turn on flash attention
(required for KV quant), pin KV cache to q8_0 (cuts
KV memory roughly in half with <0.1 PPL drop), and reserve
a 128k window.
llama-server \
--model ~/models/Qwen3.6-27B-gguf/Qwen3.6-27B-UD-Q4_K_M.gguf \
--alias qwen3.6-27b \
--host 127.0.0.1 --port 8080 \
--ctx-size 131072 \
--n-gpu-layers 999 \
--flash-attn on \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--parallel 1 \
--jinja
What the important flags do:
- --n-gpu-layers 999 — offload every layer to the GPU. Apple Silicon has unified memory, so this is essentially free.
- --flash-attn on — Metal flash attention. Required for KV cache quantization to work.
- --cache-type-k q8_0 --cache-type-v q8_0 — quantize the KV cache. q8_0 is the safe choice; q4_0 is more aggressive and shows visible quality loss, with value-side quant the more sensitive of the two.
- --ctx-size 131072 — 128k window. Push to 262144 if you really need the full native range, but watch memory pressure.
- --parallel 1 — single-user laptop, so don’t split KV across slots.
- --jinja — use the model’s shipped chat template (matters for Qwen3.6 reasoning blocks).
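Before wiring tools up, a quick check that the server is alive and has registered the alias; both endpoints are standard llama-server routes, though the exact JSON shape varies by build.
curl http://127.0.0.1:8080/health      # {"status":"ok"} once the model is loaded
curl http://127.0.0.1:8080/v1/models   # should list the qwen3.6-27b alias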
3.4 MTP speculative decoding on llama.cpp
Qwen ships a trained MTP head with Qwen3.6. llama.cpp added
native support in mid-2026; you’ll need a build that
includes the --spec-type mtp flag (recent
master or the linked PR). Reported gains on
M4 Max for the 35B-A3B MoE: +24–36% tokens/sec.
For the dense 27B the speedup is bigger because the draft is
essentially free compared to the 27 B verify pass.
llama-server \
--model ~/models/Qwen3.6-27B-mtp-gguf/Qwen3.6-27B-MTP-Q4_K_M.gguf \
--alias qwen3.6-27b-mtp \
--host 127.0.0.1 --port 8080 \
--ctx-size 131072 \
--n-gpu-layers 999 \
--flash-attn on \
--cache-type-k q8_0 --cache-type-v q8_0 \
--spec-type mtp \
--spec-draft-n-max 3 \
--jinja
If your llama.cpp build doesn’t know --spec-type,
update first — the older
--model-draft <separate-tiny-gguf> path
(classical speculative decoding) works too, but on Qwen3.6 it’s
been shown to be net-negative on the MoE and only modestly
positive on the dense. MTP is the recommended route.
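A quick way to check whether your binary already knows the flag, before restarting anything:
llama-server --help 2>&1 | grep -i "spec-type" \
  || echo "no MTP flag here -- rebuild from current master"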
4 Pick the right quant
Two practical realities: the M5 Pro has plenty of memory for higher-precision quants, and on a memory-bandwidth-bound device smaller weights mean faster decode. So the sweet spot isn’t “biggest your RAM allows” — it’s the lowest precision where quality is still indistinguishable from BF16 for your task.
No Qwen3.6-specific accuracy table is published yet (May 2026). The numbers below are from mlx-lm’s BENCHMARKS.md on the closest published proxies — Qwen3-30B-A3B for the MoE and Qwen3-4B for the small-dense behaviour — and the 2025 quantization survey for the general llama.cpp ladder. Take exact deltas with a grain of salt; the shape of the curve is what matters.
MLX ladder (MMLU Pro drop, larger models)
- Q8 — <0.2 pt drop. Lossless for all practical purposes. Use if you have RAM and don’t mind 2× the bandwidth cost.
- Q6 — ~0.2 pt drop. Excellent quality, ~25% smaller than Q8.
- Q5 (group 32) — ~0.5 pt drop. A reasonable middle ground if Q4 feels lossy for your task.
- Q4 (group 64, default) — ~1.5–2 pt drop. The standard. Fits everything with room for long context.
- Q4 OptiQ / Unsloth UD — same size as Q4, ~0.5–1 pt drop. Use these instead of plain Q4 when available.
- Q3 — not in BENCHMARKS.md. Expect 4–8 pt drop; do not use for the 27B on coding tasks.
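If the community repo only ships plain Q4 and you want one of the higher rungs, you can quantize locally with mlx_lm.convert. A sketch, assuming the BF16 weights are published under a Qwen/Qwen3.6-27B Hub repo (check the real repo id on the model card) and that you have the disk for the full-precision download:
# Convert + quantize to Q6; the same flags cover the other rungs
# (e.g. --q-bits 4 --q-group-size 64 reproduces the default Q4).
mlx_lm.convert \
  --hf-path Qwen/Qwen3.6-27B \
  --mlx-path ~/models/Qwen3.6-27B-mlx-6bit \
  -q --q-bits 6 --q-group-size 64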
llama.cpp / GGUF ladder
- Q8_0 — ~8.5 bpw, lossless. Same caveat as MLX Q8.
- Q6_K — ~6.5 bpw, near-lossless.
- Q5_K_M — ~5.5 bpw, <0.5 pt drop.
- Q4_K_M — ~4.5 bpw, ~1–2 pt drop. Recommended default on Apple Silicon right now.
- IQ4_XS / IQ4_NL — ~4.25 bpw, <1 pt drop on paper. Avoid on Metal until the kernel regression is fixed — 3× slower than Q4_K_M.
- Q3_K_M — ~3.7 bpw, 3–6 pt drop. Last resort.
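If a GGUF rung you want isn’t published, llama.cpp’s llama-quantize tool can re-quantize an existing file. A sketch, assuming the repo also ships a Q8_0 file (check the actual file listing; quality is best when you start from an unquantized F16/BF16 source):
# Fetch a high-precision source, then requantize down to Q5_K_M.
hf download unsloth/Qwen3.6-27B-GGUF \
  --include "Qwen3.6-27B-UD-Q8_0.gguf" \
  --local-dir ~/models/Qwen3.6-27B-gguf
llama-quantize --allow-requantize \
  ~/models/Qwen3.6-27B-gguf/Qwen3.6-27B-UD-Q8_0.gguf \
  ~/models/Qwen3.6-27B-gguf/Qwen3.6-27B-Q5_K_M.gguf \
  Q5_K_M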
5 What you should see
Caveat. The M5 Pro launched too recently for a full Qwen3.6 benchmark set to exist in the wild. The table below is projected from measured M4 Pro / M4 Max / M5 Max numbers, scaled by the +12% bandwidth uplift on decode and the ~4× Neural-Accelerator speedup on prefill. Treat them as ballparks — if you measure your own, please publish them.
Prefill is where the M5 generation really shines — the new per-GPU-core Neural Accelerators make the compute-bound prompt phase about 4× faster than M4 at the same memory footprint. For agentic coding loops where you re-read large contexts on every turn, this is the bigger practical win than the modest decode bump.
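If you’d rather measure than trust projections, both stacks have a one-liner for it, using the paths downloaded earlier:
# llama.cpp: reports prefill (pp) and decode (tg) tokens/sec
llama-bench -m ~/models/Qwen3.6-27B-gguf/Qwen3.6-27B-UD-Q4_K_M.gguf -p 2048 -n 256
# MLX: prints prompt and generation tok/s after the run
mlx_lm.generate --model ~/models/Qwen3.6-27B-mlx-4bit \
  --prompt "Write a haiku about unified memory." --max-tokens 256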
6 Point your tools at it
Whichever runtime you picked, you now have an OpenAI-compatible
server on http://127.0.0.1:8080/v1. Aider, Continue,
LibreChat, Open WebUI, the OpenAI Python SDK — all of them
just work with a dummy API key.
from openai import OpenAI
client = OpenAI(
base_url="http://127.0.0.1:8080/v1",
api_key="not-needed",
)
resp = client.chat.completions.create(
model="qwen3.6-27b",
messages=[{"role": "user", "content": "Refactor this function..."}],
)
print(resp.choices[0].message.content)
For Aider:
OPENAI_API_BASE=http://127.0.0.1:8080/v1 \
OPENAI_API_KEY=not-needed \
aider --model openai/qwen3.6-27b
Troubleshooting
Memory pressure goes yellow / system gets sluggish
You’ve over-allocated. Drop
iogpu.wired_limit_mb by 4–8 GB, or move
from 8-bit to 4-bit, or cut --ctx-size in half.
Swap is death on a laptop — one page-out and decode tok/s
collapses.
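Two quick ways to see where you stand while the server is running:
memory_pressure              # prints the system-wide memory free percentage
sysctl iogpu.wired_limit_mb  # the GPU wired-memory cap you set in step 1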
llama.cpp tok/s is mysteriously bad on IQ4_XS
That’s the upstream Metal kernel regression. Re-download as Q4_K_M. Worth checking the llama.cpp issue tracker occasionally — this will get fixed.
MLX speculative decoding produces gibberish or wrong answers
You’ve hit
issue #846.
Drop the --draft-model flag. Use an MTP checkpoint
(MTPLX or DFlash variant) instead.
Server starts but the first response takes ages
Cold-start Metal kernel compilation, plus mmap warmup. The second request is much faster; budget one throwaway request after a server restart.
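An easy habit is to fire a throwaway request right after launch so the slow one never lands in your editor:
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3.6-27b", "messages": [{"role": "user", "content": "warmup"}], "max_tokens": 1}' \
  > /dev/null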
Want the full 1M context?
Both engines support YaRN scaling. In llama.cpp add
--rope-scaling yarn --yarn-orig-ctx 262144 --rope-scale 4
and bump --ctx-size. Realistically, 1M tokens of
KV cache will not fit on a 64 GB laptop alongside 4-bit
weights — cap concurrency at one request and accept the
tradeoff.
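Back-of-the-envelope arithmetic shows why. The layer and head counts below are assumed placeholder values, not published figures; read the real ones out of your checkpoint’s config.json before trusting the result.
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes/elem * tokens
LAYERS=64 KV_HEADS=8 HEAD_DIM=128   # ASSUMED architecture numbers, check config.json
BYTES=1                             # ~1 byte per element with q8_0 KV cache
TOKENS=1000000                      # a 1M-token window
echo "$(( 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES * TOKENS / 1024 / 1024 / 1024 )) GiB of KV cache"
# Prints ~122 GiB -- nowhere near fitting next to ~16 GB of weights in 64 GB.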
What you end up with
A laptop you can close the lid on and carry to the cafe, that serves Qwen3.6 over plain HTTP at coding-assistant-grade speed and quality. No per-token fees, no cloud calls, no rate limits — just an M5 Pro humming quietly under ~30 W, returning tokens to whatever editor or agent framework you point at port 8080.
References
- QwenLM/Qwen3.6 on GitHub — official repo and model cards
- ml-explore/mlx-lm — Apple’s MLX language model toolkit
- mlx-lm BENCHMARKS.md — MMLU Pro across quants for Qwen3 family
- mlx_lm.server docs — OpenAI-compatible endpoint reference
- mlx-lm issue #846 — the Qwen3 speculative-decoding bug
- ggml-org/llama.cpp — the canonical C++/Metal runtime
- youssofal/MTPLX — MTP speculative decoding on MLX
- z-lab/dflash — alternative MTP runtime
- unsloth/Qwen3.6-27B-GGUF — full GGUF quant ladder with Dynamic 2.0 calibration
- mlx-community/Qwen3.6-27B-4bit — pre-converted MLX 4-bit dense
- mlx-community/Qwen3.6-35B-A3B-4bit — pre-converted MLX 4-bit MoE
- How to Host Qwen 3.6 with vLLM on Two RTX 3090s — the GPU-desktop companion to this recipe