How to Host Qwen 3.6 with vLLM on Two RTX 3090s

A recipe-style guide to running Alibaba's freshly released Qwen3.6-27B dense model (or the 35B-A3B MoE) on a dual RTX 3090 desktop — AWQ 4-bit weights, FP8 KV cache, full 256k context.

Prep time: 1–2 hours (mostly model download)
One-time cost: ~€1,500–2,000 for a used 2× 3090 desktop
Running cost: your electricity bill (~600 W under load)

Ingredients

  - A desktop with two NVIDIA RTX 3090s (24 GB VRAM each), ideally NVLink-bridged
  - NVIDIA driver 560 or newer (required by the CUDA 12.8 runtime)
  - Python 3.12 and uv (or conda / venv)
  - vLLM 0.19 or newer
  - ~20 GB of disk for the AWQ checkpoint

By the end of this recipe you will have an OpenAI-compatible HTTP endpoint on http://localhost:8000 serving Qwen3.6 at the full native 262,144-token context window, with both GPUs sharing the load through tensor parallelism.

Pick your model

Qwen released two open-weight flagships on April 16, 2026. Both are excellent — the choice comes down to whether you prefer raw single-stream quality (dense) or faster generation and multimodal input (MoE).

Qwen3.6-27B (Dense · Text only)
  Params:       27 B (all active)
  Context:      262,144 native, 1 M with YaRN
  AWQ weights:  ~15–16 GB
  Best for:     Coding, long-form reasoning, agentic loops

Qwen3.6-35B-A3B (MoE · Multimodal)
  Params:       35 B total / 3 B active (256 experts, 8+1 active)
  Context:      262,144 native, 1 M with YaRN
  AWQ weights:  ~18–20 GB
  Best for:     Fast generation, vision/video, cheap tokens

Why AWQ and not FP8? The RTX 3090 is Ampere (SM 8.6) and has no native FP8 math units — those arrived with Ada Lovelace / Hopper. Loading an FP8 checkpoint on a 3090 works, but vLLM will decompress weights to FP16 on the fly, which is both larger and slower than 4-bit AWQ. Community benchmarks put FP8 roughly 13% behind AWQ on the same MoE model when run on Ampere. Stick to AWQ 4-bit on 3090s.
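The size argument can be checked on the back of an envelope. A minimal sketch, taking the 27 B parameter count from the table above and assuming a rough 1.1× overhead for AWQ group scales and zeros:

```python
# Back-of-envelope weight footprint for the dense 27B model at each precision.
# The 1.1x AWQ overhead factor (group scales and zeros) is a rough assumption.
params = 27e9
GB = 1e9

fp16 = params * 2 / GB          # 54 GB: exceeds the ~51 GB of total VRAM
fp8  = params * 1 / GB          # 27 GB: fits, but decompressed to FP16 on Ampere
awq4 = params * 0.5 * 1.1 / GB  # ~15 GB: matches the checkpoint size above

print(f"FP16 {fp16:.0f} GB | FP8 {fp8:.0f} GB | AWQ 4-bit {awq4:.0f} GB")
```

FP16 weights alone overflow both cards combined, while AWQ leaves most of the VRAM free for the KV cache.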

1 Verify the GPUs and driver

SSH or sit down at the box and check the driver can see both cards.

nvidia-smi --query-gpu=index,name,memory.total,driver_version --format=csv

You should see two rows like NVIDIA GeForce RTX 3090, 24576 MiB, 560.xx. If the driver is older than 560, upgrade it first — CUDA 12.8 runtime requires it.

Next, check the interconnect topology. If the two cards share an NVLink bridge or the same PCIe root complex, NCCL can use peer-to-peer transfers to shuttle tensors between them efficiently.

nvidia-smi topo -m
# Look for "NV1" or "NV2" between GPU0 and GPU1 if you have an NVLink bridge.
# "PIX" or "PHB" is PCIe — still fine, just a bit slower.

2 Install vLLM in a clean environment

Use uv for speed — it creates an isolated Python 3.12 venv and installs vLLM with its CUDA 12.8 wheels in under a minute. (Swap uv for conda or python -m venv if you prefer.)

curl -LsSf https://astral.sh/uv/install.sh | sh

uv venv --python 3.12 ~/.venvs/vllm
source ~/.venvs/vllm/bin/activate

uv pip install --upgrade "vllm>=0.19.0" "transformers>=5.5.4"

# Sanity check — should print 0.19.x or newer
python -c "import vllm; print(vllm.__version__)"
Tip: The base install already includes the OpenAI-compatible server, so there is nothing extra to install before serving over the network.

3 Download the AWQ weights

Use the hf CLI (formerly huggingface-cli) to pull the quantized checkpoint. Pick one of the two commands below depending on which model you chose.

Option A — Dense 27B

uv pip install "huggingface_hub[cli]"
hf auth login   # paste an HF read token

hf download QuantTrio/Qwen3.6-27B-AWQ \
    --local-dir ~/models/Qwen3.6-27B-AWQ

Option B — MoE 35B-A3B

hf download QuantTrio/Qwen3.6-35B-A3B-AWQ \
    --local-dir ~/models/Qwen3.6-35B-A3B-AWQ

Expect a download of 15–20 GB. Both repos ship config.json with quantization_config already set, so vLLM will auto-detect AWQ; no extra CLI flag needed.
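To confirm the auto-detection will kick in, you can inspect config.json. A sketch; the fallback dict mirrors the typical AutoAWQ field shape, which may vary slightly by repo:

```python
import json
from pathlib import Path

# Confirm the downloaded checkpoint advertises AWQ so vLLM auto-detects it.
cfg_path = Path.home() / "models/Qwen3.6-27B-AWQ/config.json"
if cfg_path.exists():
    cfg = json.loads(cfg_path.read_text())
else:
    # Typical AutoAWQ-style quantization_config, used here as a stand-in.
    cfg = {"quantization_config": {"quant_method": "awq", "bits": 4, "group_size": 128}}

qc = cfg.get("quantization_config") or {}
assert qc.get("quant_method") == "awq", "vLLM will not auto-detect AWQ"
print(f"AWQ {qc.get('bits', '?')}-bit, group size {qc.get('group_size', '?')}")
```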

4 Understand the memory budget

Two 3090s give you 48 GB of VRAM total. With tensor-parallel size 2 the model weights are split in half across the cards, but the KV cache is also sharded — so the headroom equation is:

free_per_gpu  = 24 GB
                − (weights / 2)           # AWQ 4-bit, split
                − ~1.5 GB activations     # forward pass buffers
                − ~0.5 GB CUDA overhead
                = KV cache budget per GPU

A single 262,144-token sequence with BF16 KV would need roughly 32 GB per GPU — impossible. Two tricks make 256k fit on dual 3090s:

  1. --kv-cache-dtype fp8_e5m2 halves KV memory. fp8_e5m2 is the only FP8 KV dtype that actually works on Ampere (E4M3 is Hopper-only in Triton).
  2. Chunked prefill (on by default in vLLM 0.19+) keeps activation memory bounded no matter the prompt length.

After these two, a 262,144-token conversation fits in roughly 16 GB of KV cache per GPU — tight but workable. If you need to serve multiple concurrent requests at full context, lower --max-model-len or cap concurrency with --max-num-seqs (fp8_e5m2 is already the smallest KV dtype Ampere supports).
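The 32 GB and 16 GB figures can be reproduced with simple arithmetic. The model dimensions below are assumptions for illustration; the real values live in the checkpoint's config.json:

```python
# KV cache size per GPU under tensor parallelism. Model dims (64 layers,
# 8 KV heads, head dim 128) are assumed for illustration.
layers, kv_heads, head_dim = 64, 8, 128
tokens = 262_144
tp = 2             # tensor parallel: KV heads are sharded across the two GPUs
GIB = 1024**3

def kv_gib_per_gpu(dtype_bytes):
    per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V
    return per_token * tokens / tp / GIB

bf16 = kv_gib_per_gpu(2)  # 32 GiB/GPU: cannot fit in 24 GB
fp8  = kv_gib_per_gpu(1)  # 16 GiB/GPU: tight but workable
print(f"BF16 KV: {bf16:.0f} GiB/GPU | FP8 KV: {fp8:.0f} GiB/GPU")
```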

5 Launch vLLM

This is the command. Save it as a shell script (serve.sh) so you can restart the server with a single invocation. The values below are tuned to maximise context length first and throughput second — this is the “single power user” profile.

For Qwen3.6-27B (dense)

#!/usr/bin/env bash
set -euo pipefail

export VLLM_WORKER_MULTIPROC_METHOD=spawn
export NCCL_P2P_DISABLE=0          # keep P2P on; disable only if NCCL hangs
export VLLM_ATTENTION_BACKEND=FLASH_ATTN

vllm serve ~/models/Qwen3.6-27B-AWQ \
    --served-model-name qwen3.6-27b \
    --tensor-parallel-size 2 \
    --max-model-len 262144 \
    --kv-cache-dtype fp8_e5m2 \
    --gpu-memory-utilization 0.95 \
    --enable-chunked-prefill \
    --max-num-batched-tokens 2048 \
    --max-num-seqs 4 \
    --reasoning-parser qwen3 \
    --host 0.0.0.0 \
    --port 8000

For Qwen3.6-35B-A3B (MoE)

#!/usr/bin/env bash
set -euo pipefail

export VLLM_WORKER_MULTIPROC_METHOD=spawn
export NCCL_P2P_DISABLE=0
export VLLM_ATTENTION_BACKEND=FLASH_ATTN

vllm serve ~/models/Qwen3.6-35B-A3B-AWQ \
    --served-model-name qwen3.6-35b-a3b \
    --tensor-parallel-size 2 \
    --max-model-len 262144 \
    --kv-cache-dtype fp8_e5m2 \
    --gpu-memory-utilization 0.95 \
    --enable-expert-parallel \
    --enable-chunked-prefill \
    --max-num-batched-tokens 2048 \
    --max-num-seqs 4 \
    --reasoning-parser qwen3 \
    --host 0.0.0.0 \
    --port 8000

What the important flags do:

  - --tensor-parallel-size 2: splits every weight matrix across both GPUs.
  - --kv-cache-dtype fp8_e5m2: halves KV-cache memory (see step 4).
  - --gpu-memory-utilization 0.95: lets vLLM claim 95% of each card's VRAM.
  - --enable-chunked-prefill with --max-num-batched-tokens 2048: processes long prompts in 2,048-token chunks so activation memory stays bounded.
  - --max-num-seqs 4: caps concurrent sequences so the KV cache is not oversubscribed.
  - --reasoning-parser qwen3: separates the model's thinking tokens into the reasoning_content field of API responses.
  - --enable-expert-parallel (MoE only): distributes the experts across both GPUs instead of splitting each expert.

First boot is slow. vLLM compiles CUDA graphs and profiles KV memory on startup — expect 2–4 minutes before the INFO: Application startup complete line. Subsequent launches are faster thanks to the torch.compile cache in ~/.cache/vllm/.

6 Smoke-test the endpoint

vLLM speaks the OpenAI API. A plain curl is enough to confirm tokens come back.

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "qwen3.6-27b",
      "messages": [
        {"role": "user", "content": "In one sentence: what makes a 3090 different from a 4090 for LLM inference?"}
      ],
      "max_tokens": 128,
      "temperature": 0.7
    }'

Point any OpenAI-compatible client at http://<your-box>:8000/v1 with any dummy API key. Aider, Continue, LibreChat, Open WebUI, LangChain, the OpenAI Python SDK — all just work.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)

resp = client.chat.completions.create(
    model="qwen3.6-27b",
    messages=[{"role": "user", "content": "Hi!"}],
)
print(resp.choices[0].message.content)

7 Stress-test the 256k window

A common mistake is trusting --max-model-len 262144 without ever feeding the model a long prompt. Generate a ~200k-token needle-in-a-haystack prompt and check latency and coherence.

python - <<'PY'
import random, string, time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")

# ~200k tokens of filler with a needle two-thirds through
filler = " ".join("".join(random.choices(string.ascii_lowercase, k=5))
                  for _ in range(150_000))
needle = "The secret code is GRAVITY-47-MAGNOLIA."
prompt = filler[:len(filler)*2//3] + " " + needle + " " + filler[len(filler)*2//3:]

t0 = time.time()
r = client.chat.completions.create(
    model="qwen3.6-27b",
    messages=[
        {"role": "system", "content": "Answer questions from the provided context."},
        {"role": "user", "content": prompt + "\n\nWhat is the secret code?"},
    ],
    max_tokens=64,
)
print(f"Latency: {time.time()-t0:.1f}s")
print(r.choices[0].message.content)
PY

On a 2× 3090 rig with NVLink, expect 60–90 seconds of prefill for a 200k-token prompt and 15–25 tokens/sec during generation for the dense 27B. The MoE 35B-A3B prefills slower (more memory traffic in the router) but generates meaningfully faster.
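Those figures imply the following rates (plain arithmetic from the numbers above, no GPU needed):

```python
# Implied throughput from the figures above.
prompt_tokens = 200_000
prefill_fast = prompt_tokens / 60   # best case: ~3,333 tok/s
prefill_slow = prompt_tokens / 90   # worst case: ~2,222 tok/s
print(f"Prefill: ~{prefill_slow:,.0f}-{prefill_fast:,.0f} tok/s")

# At 15-25 tok/s of generation, a 1,000-token answer takes 40-67 s.
answer = 1_000
print(f"1k-token answer: {answer/25:.0f}-{answer/15:.0f} s")
```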

Troubleshooting

“CUDA out of memory” on startup

vLLM allocates the KV cache before serving the first token, so OOMs show up immediately. Step down in this order:

  1. Drop --gpu-memory-utilization to 0.90.
  2. Reduce --max-model-len to 131072 (still a very generous 128k).
  3. Lower --max-num-seqs to 2.

NCCL hangs on first inference

Some motherboards have flaky P2P over PCIe. Disable it and let NCCL fall back to shared-memory transfers:

export NCCL_P2P_DISABLE=1   # stop NCCL from using direct GPU-to-GPU copies
export NCCL_SHM_DISABLE=0   # keep host shared-memory transport enabled

Throughput feels slow

The launch flags above trade throughput for context length. If you rarely use the full window, raise --max-num-batched-tokens to speed up prefill, and lower --max-model-len so the freed KV memory can serve more concurrent sequences.

Want even more context (up to 1M)?

Qwen3.6 supports YaRN scaling to 1,010,000 tokens. Add the env var and flags below — but be warned that a 1M-token KV cache will not fit on two 3090s without extreme batch limits:

VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve ... \
    --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":262144}' \
    --max-model-len 1010000
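A quick sanity check that the YaRN factor actually covers the requested window:

```python
# YaRN extends the native window by `factor`; the requested --max-model-len
# must fit inside the scaled window.
native = 262_144
factor = 4.0
scaled = int(native * factor)   # 1,048,576
max_model_len = 1_010_000
assert max_model_len <= scaled
print(f"Scaled window {scaled:,} covers requested {max_model_len:,}")
```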

What you end up with

A self-hosted, OpenAI-compatible endpoint running Qwen3.6 at the full 256k context on hardware you own. No per-token fees, no rate limits, no data leaving your desk — just two 3090s humming quietly at around 600 W, returning tokens over plain HTTP. Point your editor, your agent framework, or your chat UI at it and keep building.
