Currently, the only available DFlash draft model is Qwen3.5-4B-DFlash, which is paired with the Qwen3.5-4B target model. However, when deploying on consumer-grade GPUs (e.g., 2× 16GB), the Mamba cache required by the DFlash draft model consumes significant VRAM, making it difficult to use the recommended block_size=16 without running into OOM errors. A smaller DFlash draft checkpoint (e.g., trained from Qwen3.5-2B or Qwen3.5-0.8B) would be highly beneficial for memory-constrained deployments, while still enabling speculative decoding acceleration.