How to Run Qwen 3.6 on an M5 Pro MacBook (64 GB)

A recipe-style guide to running Alibaba’s Qwen3.6-27B dense model (or the 35B-A3B MoE) on the new M5 Pro MacBook — MLX and llama.cpp side-by-side, MTP speculative decoding, and an OpenAI-compatible HTTP server you can point your agent at.

Prep time: 30–60 minutes (mostly model download)
One-time cost: the MacBook you already bought
Running cost: ~30 W under load, a rounding error on your power bill

Ingredients

An M5 Pro MacBook with 64 GB of unified memory
macOS admin rights (you will run one sudo sysctl)
Homebrew and/or uv for installs
A Hugging Face account with a read token
Roughly 20–25 GB of free disk per model you plan to keep

By the end of this recipe you will have an OpenAI-compatible HTTP endpoint on http://localhost:8080 serving Qwen3.6 at a comfortable 128k context (or the full 256k native window if you cap concurrency), with MTP speculative decoding doing the heavy lifting on the dense model.

Companion article: if you have a desktop with two RTX 3090s instead of a laptop, see How to Host Qwen 3.6 with vLLM on Two RTX 3090s for the GPU-server recipe. Same models, different machine.

Pick your runtime

Two engines are worth your time on Apple Silicon. Both are free, both are mature, and they have meaningfully different strengths.

Apple-native: MLX (mlx-lm)
Strengths: fastest decode on M-series, native MoE support, Apple-maintained, simple pip install
Weakness: classical speculative decoding is currently broken for Qwen3 (issue #846); use the MTP path instead
Server: mlx_lm.server — OpenAI-compatible

Battle-tested: llama.cpp
Strengths: biggest GGUF quant zoo, robust MTP support, mature KV-cache quantization, broad ecosystem of tools
Weakness: decode is typically 10–20% behind MLX on M-series; current Metal regression on IQ4_XS
Server: llama-server — OpenAI-compatible

We’ll set up both below — they coexist happily, and you’ll want to swap between them depending on workload. Skip directly to your preferred path if you’re in a hurry.

Pick your model

Qwen released two open-weight flagships on April 16, 2026. On a 64 GB M5 Pro both fit comfortably at 4-bit with room for long context, so the decision is about workload shape rather than memory.

Qwen3.6-27B (Dense · Text only)
Params: 27 B (all active)
Context: 262,144 native, 1 M with YaRN
4-bit footprint: ~16–18 GB weights
Best for: coding, long-form reasoning, single-stream quality

Qwen3.6-35B-A3B (MoE · Multimodal)
Params: 35 B total / 3 B active
Context: 262,144 native, 1 M with YaRN
4-bit footprint: ~20–22 GB weights
Best for: fast generation, vision, cheap tokens

Rule of thumb for the agentic / long-context coder: run the 27B dense with MTP speculative decoding for code quality and the 35B-A3B MoE when you need bulk-tokens-per-second (the 3 B active params make decode unusually fast). The MoE’s gains from external speculative decoding are small to negative — more on that below.

1 Free up VRAM with iogpu.wired_limit_mb

By default macOS caps GPU-wired memory at roughly 75% of unified memory. On a 64 GB machine that’s ~48 GB — enough for most cases, but tight if you want 27B at 8-bit or a 100k+-token KV cache. Raise the cap with a single sysctl. The OS still gets the rest; don’t starve it.

# Hand the GPU up to 56 GB; keep 8 GB for the OS and browser tabs.
sudo sysctl iogpu.wired_limit_mb=57344

# Verify
sysctl iogpu.wired_limit_mb

The setting resets on reboot. If you want it persistent, drop a /Library/LaunchDaemons/iogpu.wired.plist with the sysctl baked in — but honestly, a 5-second one-liner before you start serving is fine.
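
If you do want it to survive reboots, a minimal LaunchDaemon along these lines does the job (a sketch; adjust 57344 to whatever cap you settled on):

sudo tee /Library/LaunchDaemons/iogpu.wired.plist >/dev/null <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>iogpu.wired</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/sbin/sysctl</string>
        <string>iogpu.wired_limit_mb=57344</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
</dict>
</plist>
EOF

# Register it so the cap is applied at every boot
sudo launchctl load /Library/LaunchDaemons/iogpu.wired.plist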

Watch Memory Pressure. Activity Monitor → Memory. If the graph goes yellow while serving, lower the cap or switch to a smaller quant. Swap on a MacBook will tank tok/s faster than any quantization choice.
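
You can also watch it from the terminal; on recent macOS releases the built-in memory_pressure tool prints a summary whose last line is the number you care about (exact wording may vary by version):

# Quick headroom check while the server is running
memory_pressure | grep "System-wide memory free percentage"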

2 Path A — MLX (the fast path)

2.1 Install mlx-lm

curl -LsSf https://astral.sh/uv/install.sh | sh

uv venv --python 3.12 ~/.venvs/mlx
source ~/.venvs/mlx/bin/activate

uv pip install --upgrade "mlx-lm" "mlx-vlm" "huggingface_hub[cli]"

# Sanity check
python -c "import mlx_lm; print(mlx_lm.__version__)"

mlx-vlm is only needed if you want the multimodal (vision) side of the 35B-A3B. Text-only? You can skip it.
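
If you do go the vision route, a quick test once you have pulled the 35B-A3B checkpoint in the next step looks roughly like this (a sketch, assuming mlx-vlm recognises the Qwen3.6 architecture; any local image path works):

python -m mlx_vlm.generate \
    --model ~/models/Qwen3.6-35B-A3B-mlx-4bit \
    --image ~/Pictures/screenshot.png \
    --prompt "Describe this screenshot." \
    --max-tokens 256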

2.2 Pull the weights

The mlx-community org publishes pre-converted checkpoints. There are also OptiQ (sensitivity-aware mixed precision) variants and Unsloth’s “Dynamic 2.0” quants that recover most of the Q4 accuracy gap.

hf auth login   # paste an HF read token

# Option A: dense 27B, plain 4-bit (~16 GB)
hf download mlx-community/Qwen3.6-27B-4bit \
    --local-dir ~/models/Qwen3.6-27B-mlx-4bit

# Option B: dense 27B, Unsloth Dynamic 4-bit (slightly higher quality, same size)
hf download unsloth/Qwen3.6-27B-UD-MLX-4bit \
    --local-dir ~/models/Qwen3.6-27B-mlx-UD-4bit

# Option C: MoE 35B-A3B, 4-bit (~21 GB)
hf download mlx-community/Qwen3.6-35B-A3B-4bit \
    --local-dir ~/models/Qwen3.6-35B-A3B-mlx-4bit
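
Before wiring up a server, a one-off generation confirms the download is intact and the weights actually fit:

mlx_lm.generate \
    --model ~/models/Qwen3.6-27B-mlx-4bit \
    --prompt "Write a haiku about unified memory." \
    --max-tokens 64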

2.3 Start the OpenAI-compatible server

mlx_lm.server ships in the same package. Default --max-tokens 512 is silly — raise it.

mlx_lm.server \
    --model ~/models/Qwen3.6-27B-mlx-4bit \
    --host 127.0.0.1 \
    --port 8080 \
    --max-tokens 8192

Smoke-test with curl:

curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "qwen3.6-27b",
      "messages": [{"role": "user", "content": "Say hi."}],
      "max_tokens": 64
    }'

2.4 Speculative decoding on MLX — read this first

Heads-up: as of mlx-lm 0.31.x there is an open bug (#846) where classical --draft-model speculative decoding skips tokens on the Qwen3 family across q4/q6/bf16 drafts. The output looks plausible but is subtly wrong. Until it’s fixed, do not use the generic --draft-model flag with the standard checkpoints.

The workaround is to use a checkpoint that bakes Qwen’s trained Multi-Token Prediction (MTP) head into the same file. That avoids the broken external-drafter code path entirely. Two community projects ship MTP-ready MLX builds: the MTPLX and DFlash variants (the same ones referenced in Troubleshooting below).

Both are drop-in: download the repo, point mlx_lm.server --model at it, and the MTP head is picked up automatically. Don’t pass --draft-model.
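
Concretely, the MTP path looks like the sketch below; the repo id is a placeholder, so substitute whichever MTP-ready conversion you picked:

# Placeholder repo id: use the actual MTPLX or DFlash checkpoint you chose
hf download <mtp-ready-org>/Qwen3.6-27B-MTP-4bit \
    --local-dir ~/models/Qwen3.6-27B-mlx-mtp-4bit

mlx_lm.server \
    --model ~/models/Qwen3.6-27B-mlx-mtp-4bit \
    --host 127.0.0.1 --port 8080 \
    --max-tokens 8192
# Note: no --draft-model flag; the MTP head baked into the checkpoint is used automatically.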

Why no external draft for the 35B-A3B MoE? Speculative decoding only pays off when the draft model is much cheaper than the target. The MoE’s 3 B active params are already in the same ballpark as a Qwen3-1.7B draft — verify cost dominates and you can end up slower. A published mlx-lm benchmark on a related Qwen3.5 MoE saw -35% throughput. Use MTP, not external drafting, for the MoE.
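
A back-of-the-envelope model makes the asymmetry concrete (assumptions: decode cost scales with active parameter count, the drafter proposes k tokens per cycle, and sampling overhead is ignored). The target verifies each batch of drafts in one pass, so

    speedup ≈ E[tokens accepted per cycle] / (k × t_draft / t_target + 1)

With a 1.7 B drafter against the dense 27B, t_draft/t_target ≈ 0.06; at k = 3 and ~2.2 accepted tokens per cycle that is about 2.2 / 1.2 ≈ 1.9×. Against the MoE’s 3 B active parameters the cost ratio jumps to ≈ 0.57, the denominator grows to ≈ 2.7, and the same acceptance rate lands you at ≈ 0.8×, i.e. a slowdown, which points the same way as the -35% benchmark above.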

3 Path B — llama.cpp

3.1 Install

Homebrew ships a recent build, but you’ll likely want the head of master for MTP support and the latest Metal kernels. Both options below work.

# Quick path: Homebrew
brew install llama.cpp

# Or build from source for the freshest kernels
brew install cmake
git clone https://github.com/ggml-org/llama.cpp ~/src/llama.cpp
cd ~/src/llama.cpp
cmake -B build -DGGML_METAL=ON -DLLAMA_CURL=ON
cmake --build build -j --config Release

# Add to PATH
echo 'export PATH="$HOME/src/llama.cpp/build/bin:$PATH"' >> ~/.zshrc
exec zsh
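
Either way, confirm the binary on your PATH is the one you expect before moving on (a Homebrew bottle and a source build can coexist and shadow each other):

# Should print the build/commit of the binary that will actually run
which llama-server
llama-server --version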

3.2 Pull the GGUF

Unsloth and bartowski both publish full quant ladders. The Unsloth “UD” (Dynamic 2.0) repos calibrate per-tensor and recover most of the Q4 accuracy gap.

# Dense 27B, Q4_K_M (~16 GB) — safe default
hf download unsloth/Qwen3.6-27B-GGUF \
    --include "Qwen3.6-27B-UD-Q4_K_M.gguf" \
    --local-dir ~/models/Qwen3.6-27B-gguf

# MoE 35B-A3B, Q4_K_M (~21 GB)
hf download unsloth/Qwen3.6-35B-A3B-GGUF \
    --include "Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" \
    --local-dir ~/models/Qwen3.6-35B-A3B-gguf

# For MTP speculative decoding (next step), also fetch the MTP variant
hf download unsloth/Qwen3.6-27B-MTP-GGUF \
    --include "Qwen3.6-27B-MTP-Q4_K_M.gguf" \
    --local-dir ~/models/Qwen3.6-27B-mtp-gguf

Avoid IQ4_XS on Apple Silicon right now. On paper IQ4_XS is the sweet spot (~4.25 bpw, near-lossless). In practice there is a current upstream Metal kernel regression that makes it roughly 3× slower than Q4_K_M on M4-class machines (5.5 t/s vs 16.6 t/s in community benchmarks). Q4_K_M is the right default until that’s patched.

3.3 Start the server

llama-server exposes the same OpenAI-compatible endpoints as MLX. The flags below turn on flash attention (required for KV quant), pin KV cache to q8_0 (cuts KV memory roughly in half with <0.1 PPL drop), and reserve a 128k window.

llama-server \
    --model ~/models/Qwen3.6-27B-gguf/Qwen3.6-27B-UD-Q4_K_M.gguf \
    --alias qwen3.6-27b \
    --host 127.0.0.1 --port 8080 \
    --ctx-size 131072 \
    --n-gpu-layers 999 \
    --flash-attn on \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --parallel 1 \
    --jinja

What the important flags do:

--ctx-size 131072: reserves a 128k-token context window; KV-cache memory grows linearly with this number.
--n-gpu-layers 999: offloads every layer to the GPU (Metal); any number larger than the layer count means "all of them".
--flash-attn on: enables flash attention, a prerequisite for quantizing the KV cache.
--cache-type-k / --cache-type-v q8_0: stores the KV cache at 8 bits, roughly halving its footprint for a negligible quality cost.
--parallel 1: one request at a time, so the whole context budget goes to a single stream.
--jinja: applies the chat template embedded in the GGUF, which Qwen’s tool-calling format relies on.

3.4 MTP speculative decoding on llama.cpp

Qwen ships a trained MTP head with Qwen3.6. llama.cpp added native support in mid-2026; you’ll need a build that includes the --spec-type mtp flag (recent master or the linked PR). Reported gains on M4 Max for the 35B-A3B MoE: +24–36% tokens/sec. For the dense 27B the speedup is bigger because the draft is essentially free compared to the 27 B verify pass.

llama-server \
    --model ~/models/Qwen3.6-27B-mtp-gguf/Qwen3.6-27B-MTP-Q4_K_M.gguf \
    --alias qwen3.6-27b-mtp \
    --host 127.0.0.1 --port 8080 \
    --ctx-size 131072 \
    --n-gpu-layers 999 \
    --flash-attn on \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --spec-type mtp \
    --spec-draft-n-max 3 \
    --jinja

If your llama.cpp build doesn’t know --spec-type, update first — the older --model-draft <separate-tiny-gguf> path (classical speculative decoding) works too, but on Qwen3.6 it’s been shown to be net-negative on the MoE and only modestly positive on the dense. MTP is the recommended route.
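
For completeness, the classical path on an older build looks roughly like this; the draft checkpoint name is a placeholder, and the exact draft flag spellings have shifted between llama.cpp releases, so check llama-server --help on yours:

llama-server \
    --model ~/models/Qwen3.6-27B-gguf/Qwen3.6-27B-UD-Q4_K_M.gguf \
    --model-draft ~/models/<small-draft-model>.gguf \
    --gpu-layers-draft 999 \
    --draft-max 8 --draft-min 1 \
    --ctx-size 131072 --n-gpu-layers 999 \
    --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 \
    --host 127.0.0.1 --port 8080 --jinja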

4 Pick the right quant

Two practical realities: the M5 Pro has plenty of memory for higher-precision quants, and on a memory-bandwidth-bound device smaller weights mean faster decode. So the sweet spot isn’t “biggest your RAM allows” — it’s the lowest precision where quality is still indistinguishable from BF16 for your task.

No Qwen3.6-specific accuracy table is published yet (May 2026). The numbers below are from mlx-lm’s BENCHMARKS.md on the closest published proxies — Qwen3-30B-A3B for the MoE and Qwen3-4B for the small-dense behaviour — and the 2025 quantization survey for the general llama.cpp ladder. Take exact deltas with a grain of salt; the shape of the curve is what matters.

MLX ladder (MMLU Pro drop, larger models)

llama.cpp / GGUF ladder

5 What you should see

Caveat. The M5 Pro launched too recently for a full Qwen3.6 benchmark set to exist in the wild. The table below is projected from measured M4 Pro / M4 Max / M5 Max numbers, scaled by the +12% bandwidth uplift on decode and the ~4× Neural-Accelerator speedup on prefill. Treat them as ballparks — if you measure your own, please publish them.

27B dense, Q4, MLX: decode 18–24 tok/s · prefill ~2–3 s for 32k tokens [projected]
27B dense, Q4_K_M, llama.cpp: decode 16–22 tok/s · prefill ~3–5 s for 32k tokens [projected]
27B dense + MTP speculative: decode 22–35 tok/s · prefill unchanged [projected]
35B-A3B MoE, Q4, MLX: decode 40–55 tok/s · prefill ~1–2 s for 32k tokens [projected]
35B-A3B MoE + MTP: decode 60–95 tok/s · prefill unchanged [projected]

Prefill is where the M5 generation really shines — the new per-GPU-core Neural Accelerators make the compute-bound prompt phase about 4× faster than M4 at the same memory footprint. For agentic coding loops where you re-read large contexts on every turn, this matters more in practice than the modest decode bump.

6 Point your tools at it

Whichever runtime you picked, you now have an OpenAI-compatible server on http://127.0.0.1:8080/v1. Aider, Continue, LibreChat, Open WebUI, the OpenAI Python SDK — all of them just work with a dummy API key.

from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8080/v1",
    api_key="not-needed",
)

resp = client.chat.completions.create(
    model="qwen3.6-27b",
    messages=[{"role": "user", "content": "Refactor this function..."}],
)
print(resp.choices[0].message.content)

For Aider:

OPENAI_API_BASE=http://127.0.0.1:8080/v1 \
OPENAI_API_KEY=not-needed \
aider --model openai/qwen3.6-27b

Troubleshooting

Memory pressure goes yellow / system gets sluggish

You’ve over-allocated. Drop iogpu.wired_limit_mb by 4–8 GB, or move from 8-bit to 4-bit, or cut --ctx-size in half. Swap is death on a laptop — one page-out and decode tok/s collapses.

llama.cpp tok/s is mysteriously bad on IQ4_XS

That’s the upstream Metal kernel regression. Re-download as Q4_K_M. Worth checking the llama.cpp issue tracker occasionally — this will get fixed.

MLX speculative decoding produces gibberish or wrong answers

You’ve hit issue #846. Drop the --draft-model flag. Use an MTP checkpoint (MTPLX or DFlash variant) instead.

Server starts but the first response takes ages

Cold-start Metal kernel compilation, plus mmap warmup. The second request is much faster; budget one throwaway request after a server restart.
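
A scripted throwaway request right after launch keeps the first real one snappy:

# Warm up Metal kernels and mmap before pointing your editor at the server
curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen3.6-27b", "messages": [{"role": "user", "content": "ok"}], "max_tokens": 1}' \
    > /dev/null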

Want the full 1M context?

Both engines support YaRN scaling. In llama.cpp add --rope-scaling yarn --yarn-orig-ctx 262144 --rope-scale 4 and bump --ctx-size. Realistically, 1M tokens of KV cache will not fit on a 64 GB laptop alongside 4-bit weights — cap concurrency at one request and accept the tradeoff.
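
Putting those flags together, an extended-window launch might look like the sketch below. The 384k --ctx-size and the q4_0 KV cache are illustrative guesses rather than measured limits (q4_0 on the cache trades a little quality for room); whether they fit next to the 4-bit weights depends on the model’s KV layout, so watch Memory Pressure and shrink --ctx-size if it goes yellow.

llama-server \
    --model ~/models/Qwen3.6-27B-gguf/Qwen3.6-27B-UD-Q4_K_M.gguf \
    --ctx-size 393216 \
    --rope-scaling yarn --yarn-orig-ctx 262144 --rope-scale 4 \
    --n-gpu-layers 999 \
    --flash-attn on \
    --cache-type-k q4_0 --cache-type-v q4_0 \
    --parallel 1 \
    --host 127.0.0.1 --port 8080 --jinja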

What you end up with

A laptop you can close the lid on and carry to the cafe, that serves Qwen3.6 over plain HTTP at coding-assistant-grade speed and quality. No per-token fees, no cloud calls, no rate limits — just an M5 Pro humming quietly under ~30 W, returning tokens to whatever editor or agent framework you point at port 8080.
