Recoleta Item Note

How to run Qwen 3.5 locally

This is an engineering-oriented local deployment guide explaining how to run the full Qwen3.5 model family on llama.cpp, LM Studio, and llama-server, while using Unsloth dynamic quantization to lower the barrier for…

local-llmmodel-quantizationllama-cppqwentool-callingagentic-coding

This is an engineering-oriented local deployment guide explaining how to run the full Qwen3.5 model family on llama.cpp, LM Studio, and llama-server, while using Unsloth dynamic quantization to lower the barrier for local inference. It is not a standard academic paper, but more like a product/system documentation page focused on runnability, quantization configuration, inference mode switching, and resource requirements.

  • The problem it addresses is: how to efficiently run Qwen3.5 models ranging from 0.8B to 397B on local devices, while minimizing memory usage and preserving reasoning, coding, long-context, and tool-calling capabilities.
  • This matters because large models usually require very high VRAM/memory; if they can be run locally through quantization and backend adaptation, that enables offline use, lower-cost deployment, private data processing, and local agent/coding workflows.
  • The document also implicitly addresses an engineering problem: different “thinking / non-thinking” modes, tool-calling templates, quantization formats, and inference backends are easy to misconfigure, which affects real-world usability.
  • The core method is to use Unsloth Dynamic 2.0 GGUF dynamic quantization, combining low-bit quantization such as 4-bit with upcasting key layers to 8/16-bit to maintain performance while keeping model size as small as possible.
  • The runtime workflow is straightforward: first download the GGUF quantized file for the corresponding Qwen3.5 variant, then load it with llama.cpp / llama-server / LM Studio, and switch between “thinking” and “non-thinking” modes depending on the task.
  • The document provides local hardware recommendations, context length, sampling parameters, and command-line settings for different model sizes; for small models, reasoning is disabled by default and must be explicitly enabled with enable_thinking=true.
  • For the ultra-large 397B model, the method relies on MoE offloading + quantization, allowing it to run in an environment with a single 24GB GPU plus large system memory, rather than requiring a purely high-end multi-GPU cluster.
  • It also fixes a tool-calling chat template bug and updates imatrix data and the quantization algorithm to improve chat, coding, long-context, and tool-calling performance.
  • The document claims Qwen3.5 supports 256K context and can be extended to 1M via YaRN; it covers 201 languages.
  • In terms of local resource requirements: 27B can run on about 18GB RAM/Mac, 35B-A3B can run on roughly 22–24GB devices, 9B near full precision requires about 12GB RAM/VRAM, and 122B-A10B requires about 70GB.
  • For 397B-A17B: the full checkpoint is about 807GB; 3-bit quantization can fit into 192GB RAM, and 4-bit can fit into 256GB RAM; its UD-Q4_K_XL uses about 214GB of disk space and can be loaded directly on a 256GB M3 Ultra; with MoE offloading using a single 24GB GPU + 256GB system memory, it is reported to reach 25+ tokens/s.
  • In a third-party 750-prompt mixed evaluation (LiveCodeBench v6, MMLU Pro, GPQA, Math500), Qwen3.5-397B-A17B original weights scored 81.3%; UD-Q4_K_XL scored 80.5% (-0.8 points, +4.3% relative error increase); UD-Q3_K_XL scored 80.7% (-0.6 points, +3.5% relative error increase). Based on this, the document emphasizes that the performance gap after quantization is less than 1 percentage point versus the original model.
  • For the 35B series, the document says the updated quantization performs as SOTA on Pareto Frontier across 150+ KL Divergence benchmarks and a total of 9TB GGUFs, and specifically mentions results such as UD-Q4_K_XL, IQ3_XXS, though the excerpt does not provide more complete per-benchmark values.
  • Most other results are qualitative claims: the improved quantization algorithm, imatrix data, and template fixes can improve performance in chat, coding, long context, and tool-calling scenarios; however, aside from the benchmarks above, the excerpt does not provide more detailed numbers that can be independently verified.
Built with Recoleta

Run your own research radar

Turn arXiv, Hacker News, OpenReview, Hugging Face Daily Papers, and RSS into local Markdown, Obsidian notes, Telegram digests, and a public site.