Fix Wan batch conditioning and config handling #13693
hlky wants to merge 1 commit into huggingface:main
Conversation
```python
)
block_state.negative_prompt_embeds = block_state.negative_prompt_embeds.repeat_interleave(
    block_state.num_videos_per_prompt, dim=0
).to(block_state.dtype)
```
I think we already cast dtype later in the denoise step, no?
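For reference, a minimal sketch (assumed PyTorch semantics, toy shapes) of why the trailing cast may be redundant if a later step casts again: `Tensor.to` returns `self` unchanged when the dtype already matches.

```python
import torch

# Toy stand-in for the embeds: 2 prompts, 2 videos per prompt, cast to float16.
embeds = torch.randn(2, 3, 4).repeat_interleave(2, dim=0).to(torch.float16)

# A later cast to the same dtype (e.g. inside the denoise step) is a no-op
# that returns the very same tensor object, so the earlier cast is harmless
# but possibly unnecessary.
assert embeds.to(torch.float16) is embeds
```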
```python
]

@property
def intermediate_outputs(self) -> list[OutputParam]:
```
I think we don't need to declare it here; this does not go beyond the loop step:
https://huggingface.co/docs/diffusers/modular_diffusers/loop_sequential_pipeline_blocks#loop-blocks
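As a conceptual sketch (plain Python, not the modular-diffusers API described in the linked docs): values that live only inside the loop body never cross the loop boundary, so they don't need to be declared as intermediate outputs of the enclosing loop.

```python
# Conceptual sketch only; the real loop blocks are described in the docs above.
def denoise_loop(latents, timesteps, step):
    for t in timesteps:
        # latent_model_input exists only within one iteration; callers of the
        # loop never observe it, so it is not an intermediate output.
        latent_model_input = latents
        latents = step(latent_model_input, t)
    return latents  # only values crossing the loop boundary are outputs
```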
```diff
-hidden_states=block_state.latent_model_input.to(block_state.dtype),
-timestep=t.expand(block_state.latent_model_input.shape[0]).to(block_state.dtype),
+hidden_states=block_state.latent_model_input.to(dtype),
+timestep=t.expand(block_state.latent_model_input.shape[0]),
```
the dtype update here makes sense: it keeps modular Wan timesteps in scheduler precision while casting model inputs to the selected transformer dtype.
I also like that we use transformer.dtype instead of prompt_embeds.dtype -> should be the standard, preferred way to infer dtype moving forward.
But most of the rest of the changes on inputs/outputs do not seem valid to me; the loop step works a bit differently: https://huggingface.co/docs/diffusers/modular_diffusers/loop_sequential_pipeline_blocks#loop-blocks. Let me know if I missed anything.
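For illustration, a minimal sketch of that convention (toy module; diffusers models expose the parameter dtype as `transformer.dtype`, recovered here via `parameters()` since a plain `nn.Module` has no `.dtype`):

```python
import torch

# Toy transformer whose parameters are float16.
transformer = torch.nn.Linear(4, 4).to(torch.float16)
dtype = next(transformer.parameters()).dtype  # diffusers exposes this as .dtype

latent_model_input = torch.randn(2, 4)  # float32, as produced by the scheduler
t = torch.tensor(999.0)                 # scheduler-precision timestep

hidden_states = latent_model_input.to(dtype)      # cast model inputs
timestep = t.expand(latent_model_input.shape[0])  # left in scheduler precision
assert hidden_states.dtype == torch.float16 and timestep.dtype == torch.float32
```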
```python
return mask_lat_size


def _expand_tensor_to_effective_batch(
```
can we make it a regular function?
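Something like this hedged sketch, perhaps; the actual signature and body in the PR may differ:

```python
import torch

def _expand_tensor_to_effective_batch(
    tensor: torch.Tensor, batch_size: int, num_videos_per_prompt: int
) -> torch.Tensor:
    """Expand dim 0 to batch_size * num_videos_per_prompt in per-prompt order."""
    effective_batch = batch_size * num_videos_per_prompt
    if tensor.shape[0] == effective_batch:
        return tensor  # already expanded
    return tensor.repeat_interleave(num_videos_per_prompt, dim=0)
```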
| " only forward one of the two." | ||
| ) | ||
| if image is None and image_embeds is None: | ||
| if image is None: |
I don't think a list of images makes sense here for the I2V pipeline, no?
Maybe we should just fix the docstring instead?
Fixes #13578.
What does this PR do?
This PR addresses the Wan systemic review findings:
- Expands conditioning tensors to `batch_size * num_videos_per_prompt` in per-prompt order (see the sketch after this list).
- Keeps `image` required where VAE conditioning still needs it, while allowing precomputed CLIP `image_embeds`.
- Covers `output_type="latent"`.
- Covers `num_videos_per_prompt != 1`.
- Adds a `.ai/pipelines.md` checklist note for `num_*_per_prompt` expansion drift.
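To illustrate what "per-prompt order" means, a toy sketch (assumes 2 prompts and `num_videos_per_prompt = 3`; not code from the PR):

```python
import torch

prompt_ids = torch.tensor([0, 1])
num_videos_per_prompt = 3

# Per-prompt order keeps each prompt's videos contiguous:
per_prompt = prompt_ids.repeat_interleave(num_videos_per_prompt)
print(per_prompt.tolist())   # [0, 0, 0, 1, 1, 1]

# .repeat would interleave prompts instead, misaligning conditioning tensors
# with the expanded prompt embeds:
interleaved = prompt_ids.repeat(num_videos_per_prompt)
print(interleaved.tolist())  # [0, 1, 0, 1, 0, 1]
```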
Tests

- `ruff check ...` on touched Python files
- `ruff format --check ...` on touched Python files
- `git diff --check`
- `python -m compileall -q src/diffusers/pipelines/wan src/diffusers/modular_pipelines/wan tests/pipelines/wan tests/modular_pipelines/wan`
- `python -m pytest tests/pipelines/wan/test_wan_image_to_video.py tests/pipelines/wan/test_wan_vace.py tests/pipelines/wan/test_wan_video_to_video.py tests/pipelines/wan/test_wan_animate.py tests/modular_pipelines/wan/test_modular_pipeline_wan.py -q -k "not test_save_load_float16"`

Result:
180 passed, 21 skipped, 5 deselected, 61 warnings, 6 subtests passed.

Note:
`WanFLFToVideoPipelineFastTests::test_save_load_float16` fails the same way on clean `upstream/main` (max diff 0.04175 > 0.01), so it was excluded from the clean touched-file verification.