Recoleta Item Note

How to run Qwen 3.5 locally

local-llmmodel-quantizationllama-cppqwentool-callingagentic-coding

Summary

This is an engineering-oriented local deployment guide explaining how to run the full Qwen3.5 model family on llama.cpp, LM Studio, and llama-server, while using Unsloth dynamic quantization to lower the barrier for local inference. It is not a standard academic paper, but more like a product/system documentation page focused on runnability, quantization configuration, inference mode switching, and resource requirements.

Problem

The problem it addresses is: how to efficiently run Qwen3.5 models ranging from 0.8B to 397B on local devices, while minimizing memory usage and preserving reasoning, coding, long-context, and tool-calling capabilities.
This matters because large models usually require very high VRAM/memory; if they can be run locally through quantization and backend adaptation, that enables offline use, lower-cost deployment, private data processing, and local agent/coding workflows.
The document also implicitly addresses an engineering problem: different “thinking / non-thinking” modes, tool-calling templates, quantization formats, and inference backends are easy to misconfigure, which affects real-world usability.

Approach

The core method is to use Unsloth Dynamic 2.0 GGUF dynamic quantization, combining low-bit quantization such as 4-bit with upcasting key layers to 8/16-bit to maintain performance while keeping model size as small as possible.
The runtime workflow is straightforward: first download the GGUF quantized file for the corresponding Qwen3.5 variant, then load it with llama.cpp / llama-server / LM Studio, and switch between “thinking” and “non-thinking” modes depending on the task.
The document provides local hardware recommendations, context length, sampling parameters, and command-line settings for different model sizes; for small models, reasoning is disabled by default and must be explicitly enabled with enable_thinking=true.
For the ultra-large 397B model, the method relies on MoE offloading + quantization, allowing it to run in an environment with a single 24GB GPU plus large system memory, rather than requiring a purely high-end multi-GPU cluster.
It also fixes a tool-calling chat template bug and updates imatrix data and the quantization algorithm to improve chat, coding, long-context, and tool-calling performance.

Results

The document claims Qwen3.5 supports 256K context and can be extended to 1M via YaRN; it covers 201 languages.
In terms of local resource requirements: 27B can run on about 18GB RAM/Mac, 35B-A3B can run on roughly 22–24GB devices, 9B near full precision requires about 12GB RAM/VRAM, and 122B-A10B requires about 70GB.
For 397B-A17B: the full checkpoint is about 807GB; 3-bit quantization can fit into 192GB RAM, and 4-bit can fit into 256GB RAM; its UD-Q4_K_XL uses about 214GB of disk space and can be loaded directly on a 256GB M3 Ultra; with MoE offloading using a single 24GB GPU + 256GB system memory, it is reported to reach 25+ tokens/s.
In a third-party 750-prompt mixed evaluation (LiveCodeBench v6, MMLU Pro, GPQA, Math500), Qwen3.5-397B-A17B original weights scored 81.3%; UD-Q4_K_XL scored 80.5% (-0.8 points, +4.3% relative error increase); UD-Q3_K_XL scored 80.7% (-0.6 points, +3.5% relative error increase). Based on this, the document emphasizes that the performance gap after quantization is less than 1 percentage point versus the original model.
For the 35B series, the document says the updated quantization performs as SOTA on Pareto Frontier across 150+ KL Divergence benchmarks and a total of 9TB GGUFs, and specifically mentions results such as UD-Q4_K_XL, IQ3_XXS, though the excerpt does not provide more complete per-benchmark values.
Most other results are qualitative claims: the improved quantization algorithm, imatrix data, and template fixes can improve performance in chat, coding, long context, and tool-calling scenarios; however, aside from the benchmarks above, the excerpt does not provide more detailed numbers that can be independently verified.

Link

https://unsloth.ai/docs/models/qwen3.5

Built with Recoleta

Run your own research radar

Turn arXiv, Hacker News, OpenReview, Hugging Face Daily Papers, and RSS into local Markdown, Obsidian notes, Telegram digests, and a public site.

View repo 5-minute quickstart