How to Host Qwen3.6 with vLLM on Two RTX 3090s
A recipe-style guide to running Alibaba’s freshly-released Qwen3.6-27B dense model (or the 35B-A3B MoE) on a dual RTX 3090 desktop — AWQ 4-bit weights, FP8 KV cache, full 256k context.
Ingredients
- A desktop with 2× NVIDIA RTX 3090 (24 GB each, 48 GB total VRAM)
- An NVLink bridge if your card spacing allows it — roughly +50% throughput in tensor-parallel mode
- At least 64 GB system RAM and 200 GB free disk (the raw BF16 weights are 55 GB; AWQ is ~18 GB)
- Ubuntu 24.04 with NVIDIA driver ≥ 560 and CUDA 12.8 (or 13.0)
- Python 3.12 and uv (or conda)
- vLLM 0.19 or newer — earlier versions lack Qwen3.6 support
- A Hugging Face account for downloading gated/large repos
By the end of this recipe you will have an OpenAI-compatible HTTP
endpoint on http://localhost:8000 serving Qwen3.6 at
the full native 262,144-token context window, with
both GPUs sharing the load through tensor parallelism.
Pick your model
Qwen released two open-weight flagships on April 16, 2026. Both are excellent — the choice comes down to whether you prefer raw single-stream quality (dense) or faster generation and multimodal input (MoE).
Qwen3.6-27B
- Params
- 27 B (all active)
- Context
- 262,144 native, 1 M with YaRN
- AWQ weights
- ~15–16 GB
- Best for
- Coding, long-form reasoning, agentic loops
Qwen3.6-35B-A3B
- Params
- 35 B total / 3 B active (256 experts, 8+1 active)
- Context
- 262,144 native, 1 M with YaRN
- AWQ weights
- ~18–20 GB
- Best for
- Fast generation, vision/video, cheap tokens
1 Verify the GPUs and driver
SSH or sit down at the box and check the driver can see both cards.
nvidia-smi --query-gpu=index,name,memory.total,driver_version --format=csv
You should see two rows like NVIDIA GeForce RTX 3090,
24576 MiB, 560.xx. If the driver is older than 560, upgrade
it first — CUDA 12.8 runtime requires it.
If the two cards are on the same PCIe root complex, enable peer-to-peer so vLLM can shuttle tensors between them efficiently.
nvidia-smi topo -m
# Look for "NV1" or "NV2" between GPU0 and GPU1 if you have an NVLink bridge.
# "PIX" or "PHB" is PCIe — still fine, just a bit slower.
2 Install vLLM in a clean environment
Use uv for speed — it creates an isolated Python
3.12 venv and installs vLLM with its CUDA 12.8 wheels in under a
minute. (Swap uv for conda or
python -m venv if you prefer.)
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv --python 3.12 ~/.venvs/vllm
source ~/.venvs/vllm/bin/activate
uv pip install --upgrade "vllm>=0.19.0" "transformers>=5.5.4"
# Sanity check — should print 0.19.x or newer
python -c "import vllm; print(vllm.__version__)"
There is no need to run uv pip install "vllm[serve]" — the base install already
includes the OpenAI-compatible server.
3 Download the AWQ weights
Use the hf CLI (formerly huggingface-cli)
to pull the quantized checkpoint. Pick one of the two
commands below depending on which model you chose.
Option A — Dense 27B
uv pip install "huggingface_hub[cli]"
hf auth login # paste an HF read token
hf download QuantTrio/Qwen3.6-27B-AWQ \
--local-dir ~/models/Qwen3.6-27B-AWQ
Option B — MoE 35B-A3B
hf download QuantTrio/Qwen3.6-35B-A3B-AWQ \
--local-dir ~/models/Qwen3.6-35B-A3B-AWQ
Expect a download of 15–20 GB. Both repos ship
config.json with quantization_config
already set, so vLLM will auto-detect AWQ; no extra CLI flag
needed.
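If you want to confirm that auto-detection will kick in before launching the server, you can inspect the downloaded config.json yourself. A minimal sketch — the quant_method field under quantization_config is what AWQ checkpoints declare:

```python
import json
from pathlib import Path

def detect_quant_method(model_dir: str):
    """Return the quantization method declared in config.json, if any."""
    cfg = json.loads((Path(model_dir).expanduser() / "config.json").read_text())
    return cfg.get("quantization_config", {}).get("quant_method")

# e.g. detect_quant_method("~/models/Qwen3.6-27B-AWQ")
```

If this returns "awq", vLLM will pick its AWQ kernels without any extra CLI flag.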
4 Understand the memory budget
Two 3090s give you 48 GB of VRAM total. With tensor-parallel size 2 the model weights are split in half across the cards, but the KV cache is also sharded — so the headroom equation is:
free_per_gpu = 24 GB
− (weights / 2) # AWQ 4-bit, split
− ~1.5 GB activations # forward pass buffers
− ~0.5 GB CUDA overhead
= KV cache budget per GPU
A single 262,144-token sequence with BF16 KV would need roughly 32 GB per GPU — impossible. Two tricks make 256k fit on dual 3090s:
- --kv-cache-dtype fp8_e5m2 halves KV memory. fp8_e5m2 is the only FP8 KV dtype that actually works on Ampere (E4M3 is Hopper-only in Triton).
- Chunked prefill (on by default in vLLM 0.19+) keeps activation memory bounded no matter the prompt length.
After these two, a 262,144-token conversation fits in roughly
16 GB of KV cache per GPU — tight but workable. If you
need to serve multiple concurrent requests at full context, drop
--max-model-len so each sequence reserves a smaller slice of the cache.
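The figures above can be sanity-checked with a few lines of Python. The layer and head counts below are illustrative assumptions, not Qwen3.6's published dimensions (those live in the model's config.json); they are chosen so the BF16 number reproduces the ~32 GB estimate:

```python
def kv_cache_gib_per_gpu(tokens, layers, kv_heads, head_dim,
                         dtype_bytes, tp_size):
    """KV cache per GPU: K and V tensors, every layer, sharded across TP ranks."""
    total_bytes = 2 * tokens * layers * kv_heads * head_dim * dtype_bytes
    return total_bytes / tp_size / 2**30

# Hypothetical dims: 64 layers, 8 KV heads (GQA), head_dim 128, TP=2
bf16 = kv_cache_gib_per_gpu(262_144, 64, 8, 128, dtype_bytes=2, tp_size=2)
fp8  = kv_cache_gib_per_gpu(262_144, 64, 8, 128, dtype_bytes=1, tp_size=2)
print(f"BF16: {bf16:.0f} GiB/GPU, FP8: {fp8:.0f} GiB/GPU")  # BF16: 32, FP8: 16
```

Swapping dtype_bytes from 2 to 1 is exactly what --kv-cache-dtype fp8_e5m2 does, which is where the "2× more tokens" claim comes from.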
5 Launch vLLM
This is the command. Save it as a shell script (serve.sh)
so you can restart the server with a single invocation. The values
below are tuned to maximise context length first and throughput
second — this is the “single power user” profile.
For Qwen3.6-27B (dense)
#!/usr/bin/env bash
set -euo pipefail
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export NCCL_P2P_DISABLE=0 # keep P2P on; disable only if NCCL hangs
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
vllm serve ~/models/Qwen3.6-27B-AWQ \
--served-model-name qwen3.6-27b \
--tensor-parallel-size 2 \
--max-model-len 262144 \
--kv-cache-dtype fp8_e5m2 \
--gpu-memory-utilization 0.95 \
--enable-chunked-prefill \
--max-num-batched-tokens 2048 \
--max-num-seqs 4 \
--reasoning-parser qwen3 \
--host 0.0.0.0 \
--port 8000
For Qwen3.6-35B-A3B (MoE)
#!/usr/bin/env bash
set -euo pipefail
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export NCCL_P2P_DISABLE=0
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
vllm serve ~/models/Qwen3.6-35B-A3B-AWQ \
--served-model-name qwen3.6-35b-a3b \
--tensor-parallel-size 2 \
--max-model-len 262144 \
--kv-cache-dtype fp8_e5m2 \
--gpu-memory-utilization 0.95 \
--enable-expert-parallel \
--enable-chunked-prefill \
--max-num-batched-tokens 2048 \
--max-num-seqs 4 \
--reasoning-parser qwen3 \
--host 0.0.0.0 \
--port 8000
What the important flags do:
- --tensor-parallel-size 2 — shard every layer across both GPUs. Qwen3.6’s attention-head count divides evenly by 2, so no padding is needed.
- --max-model-len 262144 — request the full 256k native context. No YaRN scaling required.
- --kv-cache-dtype fp8_e5m2 — 2× more tokens fit in the same VRAM.
- --gpu-memory-utilization 0.95 — vLLM preallocates 95% of each card for weights + KV. Lower this to 0.90 if you also run a display server on the same box.
- --enable-expert-parallel (MoE only) — distributes the 256 experts across the two GPUs rather than replicating them.
- --max-num-seqs 4 — cap concurrent sequences so a single long-context request can grab most of the KV cache. Raise this for many-short-request workloads.
The first launch takes a few minutes; the server is ready once you see the
INFO: Application startup complete line.
Subsequent launches are faster thanks to the torch.compile
cache in ~/.cache/vllm/.
6 Smoke-test the endpoint
vLLM speaks the OpenAI API. A plain curl is enough to
confirm tokens come back.
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3.6-27b",
"messages": [
{"role": "user", "content": "In one sentence: what makes a 3090 different from a 4090 for LLM inference?"}
],
"max_tokens": 128,
"temperature": 0.7
}'
Point any OpenAI-compatible client at
http://<your-box>:8000/v1 with any dummy API
key. Aider, Continue, LibreChat, Open WebUI, LangChain, the
OpenAI Python SDK — all just work.
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="not-needed",
)
resp = client.chat.completions.create(
model="qwen3.6-27b",
messages=[{"role": "user", "content": "Hi!"}],
)
print(resp.choices[0].message.content)
7 Stress-test the 256k window
A common mistake is trusting --max-model-len 262144
without ever feeding the model a long prompt. Generate a ~200k-token
needle-in-a-haystack prompt and check latency and coherence.
python - <<'PY'
import random, string, time
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")
# ~200k tokens of filler with a needle two-thirds through
# (random 5-letter words tokenize to roughly 2 tokens each)
filler = " ".join("".join(random.choices(string.ascii_lowercase, k=5))
                  for _ in range(100_000))
needle = "The secret code is GRAVITY-47-MAGNOLIA."
prompt = filler[:len(filler)*2//3] + " " + needle + " " + filler[len(filler)*2//3:]
t0 = time.time()
r = client.chat.completions.create(
model="qwen3.6-27b",
messages=[
{"role": "system", "content": "Answer questions from the provided context."},
{"role": "user", "content": prompt + "\n\nWhat is the secret code?"},
],
max_tokens=64,
)
print(f"Latency: {time.time()-t0:.1f}s")
print(r.choices[0].message.content)
PY
On a 2× 3090 rig with NVLink, expect 60–90 seconds of prefill for a 200k-token prompt and 15–25 tokens/sec during generation for the dense 27B. The MoE 35B-A3B prefills slower (more memory traffic in the router) but generates meaningfully faster.
Troubleshooting
“CUDA out of memory” on startup
vLLM allocates the KV cache before serving the first token, so OOMs show up immediately. Step down in this order:
1. Drop --gpu-memory-utilization to 0.90.
2. Reduce --max-model-len to 131072 (still a very generous 128k).
3. Lower --max-num-seqs to 2.
NCCL hangs on first inference
Some motherboards have flaky P2P over PCIe. Disable it:
export NCCL_P2P_DISABLE=1
export NCCL_SHM_DISABLE=0
Throughput feels slow
- Confirm NVLink is actually in use: nvidia-smi nvlink --status.
- Pin the power limit high: sudo nvidia-smi -pl 350 on both cards.
- Check the CPU isn’t the bottleneck — vLLM’s tokenizer and scheduler run on one core. If that core is pinned at 100%, you’re CPU-bound; try a faster chip or raise --max-num-batched-tokens.
Want even more context (up to 1M)?
Qwen3.6 supports YaRN scaling to 1,010,000 tokens. Add the flags below — but be warned that 1M tokens will not fit KV-wise on two 3090s without extreme batch limits:
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve ... \
--rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":262144}' \
--max-model-len 1010000
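As a quick check that the factor is large enough: YaRN stretches the usable window to factor × original_max_position_embeddings, so factor 4.0 over the 262,144 native window yields 1,048,576 positions — comfortably above the 1,010,000 requested by --max-model-len:

```python
factor = 4.0
original = 262_144          # native context, per the rope-scaling flag above
max_positions = int(factor * original)
print(max_positions)        # 1048576, >= the 1,010,000 max-model-len target
```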
What you end up with
A self-hosted, OpenAI-compatible endpoint running Qwen3.6 at the full 256k context on hardware you own. No per-token fees, no rate limits, no data leaving your desk — just two 3090s humming quietly at around 600 W, returning tokens over plain HTTP. Point your editor, your agent framework, or your chat UI at it and keep building.
References
- QwenLM/Qwen3.6 on GitHub — official repo and model cards
- Qwen3.5 & Qwen3.6 Usage Guide (vLLM Recipes)
- QuantTrio/Qwen3.6-35B-A3B-AWQ — community AWQ checkpoint
- vLLM: Quantized KV Cache — background on the FP8 KV trick