sd: support for CLIP and VAE on different devices #2184

Open
wbruna wants to merge 4 commits into LostRuins:concedo_experimental from wbruna:kcpp_sd_multi_device_backend

Conversation

@wbruna wbruna commented May 3, 2026

Support for placing CLIP or the VAE on a separate device (e.g. diffusion on Vulkan0, VAE on Vulkan1). It also allows keeping the diffusion model itself on the CPU.

The first two commits adapt the C++ code: the interface receives device numbers instead of booleans, with -1 for "main device" and -2 for "CPU", and the backend includes a global config to choose which model gets which device. The last commit changes the sdclipgpu and sdvaecpu boolean parameters to accept "CPU", "main", or a device number.
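As a rough sketch of the device-number convention described above (the names here are hypothetical, not the PR's actual interface):

```cpp
#include <cassert>
#include <string>

// Illustrative mapping of the PR's device codes to backend names:
// -1 follows the main model's device, -2 (and below) forces the CPU,
// and a non-negative code picks an explicit device index.
std::string device_code_to_name(int code, const std::string& main_device) {
    if (code == -1) return main_device;   // "main device"
    if (code <= -2) return "CPU";         // forced CPU placement
    return "GPU" + std::to_string(code);  // explicit device, zero-based
}
```

So a config of `-1` for CLIP and `1` for the VAE would place CLIP wherever the diffusion model lives and the VAE on the second device.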

Tested on Vulkan with my GPU and iGPU. Seems to work fine with command-line and config settings; however, I wasn't able to fully test the launcher, because there doesn't seem to be a way to select a discrete GPU and an iGPU through it (so I likely got its 1-based indexes wrong).

Edit: rebased on top of #2213 for the upstream multi-device support.

@wbruna wbruna force-pushed the kcpp_sd_multi_device_backend branch 2 times, most recently from 1517922 to 197cc2f Compare May 4, 2026 10:23
@wbruna wbruna marked this pull request as ready for review May 4, 2026 10:24
@LostRuins
Owner

Merged your other PR, so now this conflicts.

@wbruna wbruna force-pushed the kcpp_sd_multi_device_backend branch from 197cc2f to ed81427 Compare May 7, 2026 15:37
wbruna commented May 7, 2026

Fixed.

@wbruna wbruna force-pushed the kcpp_sd_multi_device_backend branch from ed81427 to bd5330a Compare May 9, 2026 12:08
@LostRuins
Owner

So I looked through this PR, and it seems like quite a lot of changes and complexity for something that doesn't seem too useful, in my opinion.

Especially with the ability to already use offload_cpu (runtime load/unload for each component), it doesn't really seem too useful compared to simply using the same GPU for all image gen components. Is there something I'm missing?

Also it does modify a bunch of extra upstream code too.

wbruna commented May 11, 2026

So I looked through this PR, and it seems like quite a lot of changes and complexity for something that doesn't seem too useful, in my opinion.

Especially with the ability to already use offload_cpu (runtime load/unload for each component), it doesn't really seem too useful compared to simply using the same GPU for all image gen components. Is there something I'm missing?

offload_cpu doesn't help situations that benefit from a second GPU for the same gen: a weaker card with more memory (like an iGPU) could, e.g., run a video VAE that wouldn't fit on the main GPU. Also, its cost isn't trivial: it pins a lot of extra system RAM and introduces extra latency for every generation.

We (and upstream) do get requests for this functionality from time to time.

Also it does modify a bunch of extra upstream code too.

Kind of? It's mostly the unavoidable device initialization for each component. I expect that code to change upstream when multi-device support gets implemented, but in that case dropping our changes would be simple enough.

Edit: rebased on top of #2204 to avoid conflicts.

@wbruna wbruna force-pushed the kcpp_sd_multi_device_backend branch from bd5330a to 49ca0ac Compare May 12, 2026 22:00
@LostRuins
Owner

Merged #2204.

@wbruna wbruna force-pushed the kcpp_sd_multi_device_backend branch 2 times, most recently from c900e1c to 93caa6d Compare May 15, 2026 21:21
@LostRuins
Owner

This PR is still behaving very weirdly for me. I tried to add a patch to show more info, but you didn't enable push permissions on this PR branch, so here's the patch:

From e4975d84c666d250e755c6a9d42910cf972b1076 Mon Sep 17 00:00:00 2001
From: Concedo <39025047+LostRuins@users.noreply.github.com>
Date: Sat, 16 May 2026 11:07:10 +0800
Subject: [PATCH] add a bit of debug info, this PR is still behaving very
 strangely

---
 otherarch/sdcpp/sdtype_adapter.cpp | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/otherarch/sdcpp/sdtype_adapter.cpp b/otherarch/sdcpp/sdtype_adapter.cpp
index c1d16fc84..aaf37f1fc 100644
--- a/otherarch/sdcpp/sdtype_adapter.cpp
+++ b/otherarch/sdcpp/sdtype_adapter.cpp
@@ -35,6 +35,7 @@ using namespace kcpp_sd;
 #include "stb_image_resize.h"
 
 #include "avi_writer.h"
+#include "util.h"
 
 struct LoraMap {
     std::vector<std::pair<std::string, float>> items;
@@ -426,8 +427,10 @@ bool sdtype_load_model(const sd_load_model_inputs inputs) {
     params.vae_conv_direct = sd_params->vae_conv_direct;
     params.chroma_use_dit_mask = sd_params->chroma_use_dit_mask;
     params.offload_params_to_cpu = inputs.offload_cpu;
-    //params.keep_vae_on_cpu = (inputs.kcpp_vae_device <= -2);
-    //params.keep_clip_on_cpu = (inputs.kcpp_clip_device <= -2);
+    std::string vae_dev = kcpp_sd_get_device_name("vae");
+    std::string clip_dev = kcpp_sd_get_device_name("clip");
+    // params.keep_vae_on_cpu =  (vae_dev=="CPU");
+    // params.keep_clip_on_cpu = (clip_dev=="CPU");
     params.lora_apply_mode = (lora_apply_mode_t)lora_apply_mode;
 
     // also switches flash attn for the vae and conditioner
@@ -440,6 +443,8 @@ bool sdtype_load_model(const sd_load_model_inputs inputs) {
 
     if(inputs.debugmode==1)
     {
+        printf("\nkcpp_vae_device: %d = %s",inputs.kcpp_vae_device, vae_dev.c_str());
+        printf("\nkcpp_clip_device: %d = %s\n",inputs.kcpp_clip_device, clip_dev.c_str());
         char* buf = sd_ctx_params_to_str(&params);
         if(buf)
         {
-- 
2.30.2.windows.1


Anyway, as I mentioned in discord:

right now keep_clip_on_cpu is incorrectly set (and displayed), but it seems to break when I set it correctly (see params.keep_clip_on_cpu = (clip_dev=="CPU");). It's almost as if bool clip_on_cpu = sd_ctx_params->keep_clip_on_cpu; must be false for the PR to work correctly.

Then there's this line in particular, which doesn't feel right:

 auto backend = std::shared_ptr<struct ggml_backend>(init_named_backend(device_name), ggml_backend_free);

Simply querying a backend force-initializes it, so it takes up double the VRAM. But without it, the VAE seems to be forced onto the GPU for some reason, even if I attempt to select CPU explicitly.

From the Python side, the defaults are also wrong. Loading any existing config somehow sets both options to "1" instead:

[screenshot]

The expected behavior would need changes to:

  • match previous defaults (CLIP on CPU, VAE on the main GPU)
  • port the old config if applicable (inside convert_invalid_args: if found, detect and port the old flags to the new ones)

@wbruna wbruna marked this pull request as draft May 16, 2026 10:19
@wbruna wbruna force-pushed the kcpp_sd_multi_device_backend branch from 93caa6d to a8eb544 Compare May 16, 2026 10:19
wbruna commented May 16, 2026

Just rebased for now, to deal with the conflict with the mmap flag.

The params.keep_*_on_cpu flags are basically ignored when setting the backend through kcpp_get_backend. Setting them on sdtype_adapter would simply force the backend to "CPU" regardless of the string; but since both settings come from the same place, that's redundant (the only effect should be the original sd.cpp log line). So I've removed the assignment to make it easier to test, because other backends wouldn't have this special case.

There is a related bug if the boolean flags are set, though 😕 And I agree the display should be adjusted for that; will fix it.

auto backend = std::shared_ptr<struct ggml_backend>(init_named_backend(device_name), ggml_backend_free);

This is not a query: kcpp_get_backend may be badly named, but it's supposed to initialize it: "give me the (possibly already initialized) backend that matches this device name".
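A minimal sketch of that "get or init" contract, with made-up names standing in for the real kcpp backend map:

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Stand-in for a real ggml backend handle (illustrative only).
struct FakeBackend { std::string device; };

// Cache of backends keyed by device name, mirroring the idea that
// the first request for a name initializes the backend and later
// requests return the same cached instance.
std::map<std::string, std::shared_ptr<FakeBackend>> g_backends;

std::shared_ptr<FakeBackend> get_backend_sketch(const std::string& device_name) {
    auto it = g_backends.find(device_name);
    if (it != g_backends.end())
        return it->second;  // already initialized: reuse it
    auto backend = std::make_shared<FakeBackend>(FakeBackend{device_name});
    g_backends[device_name] = backend;  // first use: initialize and cache
    return backend;
}
```

Under this reading, the double-allocation above would come from two different keys ending up in the map for what should be one backend.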

@LostRuins
Owner

I see. But yeah, it was glitching out and double-allocating when I set params.keep_clip_on_cpu = (clip_dev=="CPU");

wbruna commented May 16, 2026

I see. But yeah, it was glitching out and double-allocating when I set params.keep_clip_on_cpu = (clip_dev=="CPU");

Indeed: it would pose as a separate module and get its own entry on the backend map. 🤦

And I think I understand why it looks like it is initializing everything prematurely:

Setting clip device to -2
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7600 XT (RADV NAVI33) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = AMD Radeon Vega 11 Graphics (RADV RAVEN) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
Setting vae device to 2

kcpp_sd::config_device validates the device number before it gets stored. To do that, it needs the number of available devices, and that path calls the device initialization code, printing the GPU info.

I could either store the config as-is first and validate it later, or make the device scan explicit at a previous point.
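The "store first, validate later" option could look roughly like this (hypothetical names; the real config lives in kcpp_sd):

```cpp
#include <cassert>
#include <stdexcept>

// Sketch: store the raw device number at config time so the
// device-enumeration path (which prints the GPU info) is not
// touched early; validate only when the backend is resolved.
struct SdDeviceConfig {
    int requested = -1;  // as given: -1 main, -2 CPU, 0+ explicit

    // Negative codes (main device / CPU) always pass; a non-negative
    // code must be a valid index into the enumerated devices.
    int resolve(int num_devices) const {
        if (requested >= num_devices)
            throw std::out_of_range("invalid device number");
        return requested;
    }
};
```

Deferring validation this way keeps the "Found N devices" log where it belongs, at the point the backend is actually needed.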

wbruna commented May 16, 2026

Oh, one thing I forgot to ask you about: GPU numbering. Env vars, device names, and (I think) our own command line use zero-based counting, but the launcher shows them as one-based. I guess we avoid zero-based counting on the launcher because it doesn't feel natural for normal people; but maybe showing them as names instead ("CUDA0", "GPU0", etc.) would avoid that issue while still keeping everything consistent?
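The name-based labeling could be sketched like this (illustrative only; the backend prefix and function are assumptions, not the launcher's actual code):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Build display labels from the backend name plus the zero-based
// device index, so what the launcher shows matches env vars and
// device names exactly ("Vulkan0", "CUDA1", ...).
std::vector<std::string> device_labels(const std::string& backend, int count) {
    std::vector<std::string> labels;
    for (int i = 0; i < count; ++i)
        labels.push_back(backend + std::to_string(i));  // zero-based, like env vars
    return labels;
}
```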

wbruna commented May 16, 2026

GUI-wise, what do you think about organizing the device selection like this?

[screenshot]

(asking before trying to fight with control offsets 🙂)

wbruna commented May 16, 2026

I believe the config selection is fixed; but since leejet just pushed multi-device support, I'll focus on getting that ready, then adapt this PR for it.

@LostRuins
Owner

GUI-wise, what do you think about organizing the device selection like this?

Main concern is space: the window is very tight, so I try to maximize it as much as possible. How does it look with that?

@wbruna wbruna force-pushed the kcpp_sd_multi_device_backend branch from 0e70800 to c93423a Compare May 16, 2026 21:11
@wbruna wbruna marked this pull request as ready for review May 16, 2026 21:12
wbruna commented May 16, 2026

Adapted to the upstream support.

wbruna commented May 16, 2026

GUI-wise, what do you think about organizing the device selection like this?

Main concern is space: the window is very tight, so I try to maximize it as much as possible. How does it look with that?

Since they wouldn't take much horizontal space, one approach could be shrinking the model filename fields a bit:

[screenshot]

and maybe replace "Browse" with "📂".

Or... since the filename fields refer to the same modules, we could make a full grid of module × (filenames, backend) and drop the backend labels:

[screenshot]

(the mock looks awful, but hopefully the idea is clear 🙂)

LostRuins commented May 17, 2026

Hmm, if they enable TAE SD, then the VAE backend selector will disappear?
Or if they load a model that has a different number of files (e.g. SDXL is all-in-one, Flux needs both clip1 and clip2, but z-image only needs one, etc.)?

That could probably be confusing.

@LostRuins
Owner

Should I still consider merging this, or is it superseded by #2213?

wbruna commented May 17, 2026

Should I still consider merging this, or is it superseded by #2213?

#2213 just syncs with upstream, and by itself it keeps the current "*_on_cpu" support. This one provides the multi-device selection.

wbruna commented May 17, 2026

Hmm, if they enable TAE SD, then the VAE backend selector will disappear?

I think the VAE backend is used for the TAE model, too. TAE is a VAE, just a lightweight one.

Or if they load a model that has a different number of files (e.g. SDXL is all-in-one, Flux needs both clip1 and clip2, but z-image only needs one, etc.)?

If we had a lot of space, or better controls, I'd add a "Text encoder" box grouping the backend selector and all the text models; a VAE one with backend, models, threshold; etc.

@wbruna wbruna force-pushed the kcpp_sd_multi_device_backend branch from c93423a to 8b998f9 Compare May 17, 2026 03:44