Describe the bug
When combining torchao quantization (`TorchAoConfig` with `Float8WeightOnlyConfig`) and group offloading with `use_stream=True`, inference fails with a device mismatch error. The quantized weight remains on CPU while the input tensor is on CUDA.
Reproduction
```python
import torch
from PIL import Image
from diffusers import QwenImageEditPlusPipeline  # or any pipeline
from diffusers import PipelineQuantizationConfig, TorchAoConfig
from diffusers.hooks import apply_group_offloading
from torchao.quantization import Float8WeightOnlyConfig

model_path = "path/to/model"
onload_device = torch.device("cuda")
offload_device = torch.device("cpu")

pipe = QwenImageEditPlusPipeline.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    quantization_config=PipelineQuantizationConfig(
        quant_mapping={"transformer": TorchAoConfig(Float8WeightOnlyConfig())}
    ),
    device_map="cpu",
)

# This will cause a RuntimeError during inference
pipe.transformer.enable_group_offload(
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="leaf_level",
    use_stream=True,  # ← Required for performance, but triggers the bug
    non_blocking=True,
)
pipe.vae.enable_group_offload(
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="leaf_level",
    use_stream=True,
    non_blocking=True,
)
apply_group_offloading(
    pipe.text_encoder,
    onload_device=onload_device,
    offload_type="leaf_level",
    use_stream=True,
    non_blocking=True,
)

image = Image.open("input.png")  # placeholder input image
prompt = "..."                   # placeholder edit prompt

# Triggers the error:
pipe(image, prompt, height=2048, width=2048, num_inference_steps=50).images[0]
```
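As the inline comment indicates, the failure appears specific to the streamed prefetch path. A minimal workaround sketch (an assumption on my part: disabling streaming avoids the buggy code path, at the cost of overlapping transfers with compute):

```python
# Workaround sketch (assumption: the device mismatch only occurs with
# use_stream=True, per the comment above): fall back to non-streamed offloading.
pipe.transformer.enable_group_offload(
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="leaf_level",
    use_stream=False,
)
```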
Note that, unlike the official examples that load the pipeline directly to CUDA via `device_map="cuda"`, I am constrained by VRAM size, which is the main intended use case for group offloading: the components (a multi-billion-parameter transformer, plus text encoder and VAE) are simply too large to fit into VRAM simultaneously. Therefore, the pipeline must be initialized on the CPU (`device_map="cpu"`) before applying group offloading.
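For context, a rough sanity check of the footprint (a sketch; quantized tensor subclasses may report storage sizes differently, so treat this as an estimate):

```python
# Sketch: compare summed parameter bytes of the pipeline components
# against currently free VRAM to confirm they cannot all be resident at once.
import torch

def param_bytes(module: torch.nn.Module) -> int:
    return sum(p.numel() * p.element_size() for p in module.parameters())

total = sum(param_bytes(m) for m in (pipe.transformer, pipe.text_encoder, pipe.vae))
free_vram, _ = torch.cuda.mem_get_info()
print(f"components ≈ {total / 1e9:.1f} GB, free VRAM ≈ {free_vram / 1e9:.1f} GB")
```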
Traceback:

```
File ".../diffusers/models/transformers/transformer_qwenimage.py", line 896, in forward
    hidden_states = self.img_in(hidden_states)
File ".../torch/nn/modules/linear.py", line 134, in forward
    return F.linear(input, self.weight, self.bias)
File ".../torchao/quantization/quantize_/workflows/float8/float8_tensor.py", line 448, in _float8_addmm_impl
    out = torch.matmul(input_tensor, weight_tensor.dequantize())
RuntimeError: Expected all tensors to be on the same device, but got mat2 is on cpu,
different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA_mm)
```
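The last frame shows why the error surfaces in torchao: the weight is dequantized on whatever device it currently lives on, so a weight that was never onloaded collides with the CUDA activation. A minimal illustration of the same mismatch, independent of diffusers/torchao:

```python
# Minimal reproduction of the same error class in plain PyTorch:
# a weight left on CPU (like the un-onloaded quantized weight) vs. a CUDA input.
import torch
import torch.nn.functional as F

weight = torch.randn(8, 8)            # stays on CPU
x = torch.randn(2, 8, device="cuda")  # activation already on CUDA
F.linear(x, weight)  # RuntimeError: Expected all tensors to be on the same device ...
```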
Logs
System Info
Environment
| Component  | Version     |
|------------|-------------|
| diffusers  | 0.37.0      |
| torch      | 2.9.1+cu126 |
| torchao    | 0.16.0      |
| accelerate | 1.3.0       |
| Python     | 3.11        |
| CUDA       | 12.6        |
Who can help?
No response