feat: support for memory-mapping model weights by wbruna · Pull Request #1414 · leejet/stable-diffusion.cpp

wbruna · 2026-04-13T23:13:41Z

A follow-up for #1059, this adds support for pointing tensor storage buffers directly into memory-mapped model files.

~~Apart from the expected limitations (e.g. weight types need to match), for now a lot of stars need to be properly aligned:~~

only enabled for 100% CPU backends, to avoid the complexity of tracking backend information per tensor; so e.g. --clip-on-cpu won't benefit from it. On the other hand, it does work with --offload-to-cpu
~~only enabled if LoRA apply mode is at_runtime (even if no LoRAs are loaded). I've reused the I/O mmap support, which is read-only, so it needs to avoid trying to modify the mapped weights in place.~~

Edit: added device compatibility detection in the same way as llama.cpp, and per-tensor tracking; so all compatible devices should be supported, including with --clip-on-cpu and --vae-on-cpu.

Edit 2: for LoRA apply mode immediately, turn the mapping writable. With certain LoRAs, the weight patching may cancel most of the mmap savings, but it will still work for some of the unchanged tensors (note: working fine on Linux, but I couldn't test it on Windows).

The existing mmap support on the I/O path isn't affected.

[INFO ] model.cpp:1469 - memory-mapped 606 tensors in 3 files (8356.31 MB), taking 0.00s
[DEBUG] ggml_extend.hpp:2046 - qwen3 params backend buffer size = 1483.75 MB(RAM) (398 tensors)
[DEBUG] ggml_extend.hpp:2046 - z_image params backend buffer size = 6.93 MB(RAM) (453 tensors)
[DEBUG] ggml_extend.hpp:2046 - vae params backend buffer size = 92.57 MB(RAM) (138 tensors)

Instead of disabling mmap, we turn the mapping writable.

Without an explicit posix_fadvise(POSIX_FADV_DONTNEED), the Linux kernel keeps a model file's pages cached as buff/cache long after we're done with it, so loading the LLM (13.7 GB) followed by the DiT (17 GB) piles up to 30+ GB of cached pages on a 32 GB box and triggers the OOM-killer. - Keep the file descriptor alive in MmapWrapperImpl so we can posix_fadvise(POSIX_FADV_DONTNEED) on it before munmap. madvise alone only unmaps the address range — it does not evict pagecache. - Add POSIX_FADV_SEQUENTIAL on open: nudges the kernel toward a smaller working set during the read. - Make the "using mmap" log line INFO instead of DEBUG so the user can confirm at a glance. - Bound the lazy-load worker count to 2: the per-thread staging buffers grow to the largest tensor seen, so n_threads=8 doubles RAM peak for no measurable read-throughput gain. Result on 32 GB box: peak RSS ~6 GB, peak buff/cache ~12 GB during LLM lazy load — comfortably within budget.

- drop superfluous validity tests from the mmap handler destructor, since by design they are always valid on the manager object - check against zero-sized files - control read-ahead and discard hints through an environment variable: on my own system, with a warm cache, all these flags actually hurt performance for common sd-cli runs (~10-20% worse loading times), so they should probably be enabled on a case-by-case basis

wbruna · 2026-05-11T01:48:11Z

@pwilkin , I've cherry-picked b8d1c99 here to make it easier to test mmap behavior.

I'm not sure why, but the performance flags made loading times consistently worse for me, so I've made them opt-in through an env var. For consistency, and because consecutive sd-cli runs would also benefit from a cached model, I've made the cache eviction opt-in too; but I don't feel strongly about it.

junmo-kim · 2026-05-11T04:16:06Z

Hi @wbruna, thanks for this PR — I've been running a merged build (master + this branch) for image generation/edit workloads. Hit a consistent failure with Qwen-Image GGUF models + --offload-to-cpu + --mmap on AMD Radeon 780M / Vulkan / Windows 11 / 64 GB RAM:

[ERROR] ggml_extend.hpp:2063 - qwen2.5vl alloc params backend buffer failed, num_tensors = 857
[ERROR] ggml_extend.hpp:2063 - qwen_image alloc params backend buffer failed, num_tensors = 1933
[INFO ] wan_vae params backend buffer size = 242.10 MB(RAM) (194 tensors)
[INFO ] main.cpp:148  - listening on: 127.0.0.1:7860

sd-server enters listen state but with diffusion_model 0.00MB — any inference request returns blank/noise.

Root cause

When all tensors in params_ctx are already memory-mapped (i.e. t->data != NULL), ggml_backend_alloc_ctx_tensors_from_buft_impl in ggml-alloc.c correctly returns NULL with n_buffers == 0:

// ggml/src/ggml-alloc.c L1210-1215
if (n_buffers == 0) {
#ifndef NDEBUG
    GGML_LOG_DEBUG("%s: all tensors in the context are already allocated\n", __func__);
#endif
    GGML_ASSERT(!buffers);
    return NULL;
}

But GGMLRunner::alloc_params_buffer() treats any NULL return as a hard failure. For GGUF models with mmap enabled, every tensor has a valid t->data pointer → n_buffers == 0 → spurious LOG_ERROR. VAE (loaded from safetensors here) allocates fine because its tensors aren't in the mmap region — only the GGUF diffusion model and GGUF text encoder fail.

This is consistent with the failing components in the log above: qwen2.5vl and qwen_image are both GGUF, wan_vae (safetensors) is fine.

Proposed fix

Add a check before the failure path: if all tensors in params_ctx already have t->data (or are views), treat it as "no separate buffer needed":

bool alloc_params_buffer() {
    size_t num_tensors = ggml_tensor_num(params_ctx);
    params_buffer = ggml_backend_alloc_ctx_tensors(params_ctx, params_backend);
    // mmap-aware path: ggml returns NULL when all tensors are already allocated
    // (typical for memory-mapped weights). See ggml-alloc.c n_buffers==0 branch.
    if (params_buffer == nullptr && num_tensors > 0) {
        bool all_have_data = true;
        for (ggml_tensor * t = ggml_get_first_tensor(params_ctx); t != nullptr; t = ggml_get_next_tensor(params_ctx, t)) {
            if (t->data == nullptr && t->view_src == nullptr) {
                all_have_data = false;
                break;
            }
        }
        if (all_have_data) {
            LOG_DEBUG("%s all params already mmap-allocated (no separate buffer needed)", get_desc().c_str());
            rebuild_params_tensor_set();
            return true;
        }
    }
    if (params_buffer == nullptr) {
        LOG_ERROR("%s alloc params backend buffer failed, num_tensors = %i",
                  get_desc().c_str(), num_tensors);
        return false;
    }
    rebuild_params_tensor_set();
    ggml_backend_buffer_set_usage(params_buffer, GGML_BACKEND_BUFFER_USAGE_WEIGHTS);
    // ... rest unchanged
}

free_params_buffer and get_params_buffer_size already have null guards, and grepping the source I haven't found other paths that dereference params_buffer without checking — so leaving it nullptr in the mmap case appears safe.

Verification

Built on Windows 11 + Vulkan SDK 1.4.341 + MSVC, 64 GB RAM, branch merged with current master (so the failure is not RAM exhaustion — peak working set stays well under available memory)
Tested with Qwen-Image-Q8 (21.8 GB GGUF) + Qwen 2.5 VL text encoder (safetensors) + qwen_image_vae (safetensors)
Tested with Qwen-Image-Edit-2511-Q8 (21.8 GB GGUF) + Qwen 2.5 VL text encoder (Q8 GGUF) + qwen_image_vae (safetensors) — both GGUF diffusion and GGUF TE trigger the n_buffers==0 path, both handled by the fix
4-step 1024² t2i: wall 421s, valid 638 KB PNG, no errors/warnings
Confirmed safetensors-only paths are unaffected (VAE always loads correctly in both setups)

Happy to open a separate PR if you'd prefer, or you can incorporate it directly. The underlying ggml-alloc behavior is backend-agnostic, so I expect this generalizes to CUDA/Metal as well — confirmation from users on those backends would be welcome.

For models with mmap enabled, all tensors could already have a valid `t->data` pointer, but this condition triggers an error on `ggml_backend_alloc_ctx_tensors` (either a `NULL` return or an assertion failure).

wbruna · 2026-05-11T11:53:34Z

@junmo-kim , thanks for testing, and the fix! Unfortunately, we can't count on ggml_backend_alloc_ctx_tensors returning NULL in this case, because of that GGML_ASSERT: it could trigger an abort().

But I believe just moving the test before the allocation would work fine. Could you give 90370bf a try? (I've also removed the view_src test, since the mmap path never touches it).

junmo-kim · 2026-05-14T13:24:02Z

@wbruna Confirmed 90370bf works on my setup (Win11 + Vulkan + Radeon 780M, Qwen-Image-Q8 + --offload-to-cpu --mmap --lora-apply-mode at_runtime). alloc params backend buffer failed is gone, inference completes end-to-end with valid output.

Functionally equivalent to my earlier post-NULL workaround in side-by-side runs (same outputs, same wall-time). Dropping the view_src check is fine — mmap path doesn't touch it. LGTM, happy to switch over once this lands.

leejet · 2026-05-14T16:30:10Z

Thank you for your contribution.

Picks up 8 commits since the previous sync at 90e87bc: 0b82969 docs: add .github/pull_request_template.md 381e0df docs: add CONTRIBUTING.md 0665a7f feat: add hidream o1 image support (leejet#1485) eeac950 fix: Use PkgConfig for WebP and WebM (leejet#1400) 57ff2eb feat: support for memory-mapping model weights (leejet#1414) 9d68341 feat: add Euler CFG++ and Euler-A CFG++ samplers (leejet#1354) 60477fd docs: add new go bindings for stable-diffusion.cpp (leejet#1480) 6ee0684 feat: display server url with "http://" prefix. (leejet#1486) Conflicts, all in src/ggml_extend.hpp: 1. copy_data_to_backend_tensor signature: upstream made gf required (graph-cut needs the segment's graph to restrict uploads); our layer-streaming path needs gf=nullptr so each mini-graph uploads its full backend_tensor_data_map without filtering. Resolution: keep gf optional (default nullptr) and guard the graph_tensor_set filter on gf != nullptr. Upstream's new read_graph_tensor<T> template is added unchanged above copy_data_to_backend_tensor. 2. Tensor-loop null check: upstream added tensor/data null guards and a single ggml_get_name() lookup. Kept both, with our gf-gate layered on top of upstream's set-membership check. 3. alloc_params_buffer: upstream's mmap fast-path (skip allocation when every tensor already has data, since ggml_backend_alloc_ctx_tensors would hit n_buffers==0) and our pinned-host fast-path (allocate weights in the GPU device's host buffer for async H2D under offload) collide on the same function. Resolution: mmap check runs first and returns early — mmapped tensors can't be moved into pinned host memory — then the pinned-host path runs for the non-mmap CPU-params-with-GPU-runtime case, then the original pageable params_backend alloc as the final fallback. Smoke-tested on Z-Image-Turbo Q8 at 512x512: --offload-mode layer_streaming -> 4.0s total (coarse-stage path) --offload-to-cpu --max-vram 4 -> 8.3s total (3 graph-cut segments) HiDream O1 streaming hooks deferred to a follow-up commit.

wbruna force-pushed the sd_mmap_weights branch from bb1731d to 49115cb Compare April 14, 2026 22:30

wbruna changed the title ~~feat: initial support for memory-mapping model weights~~ feat: support for memory-mapping model weights Apr 14, 2026

wbruna force-pushed the sd_mmap_weights branch 2 times, most recently from 97190f6 to 776fea2 Compare April 19, 2026 19:32

wbruna force-pushed the sd_mmap_weights branch from 776fea2 to 12d6f98 Compare April 30, 2026 01:47

wbruna mentioned this pull request May 1, 2026

Add tensor splitting (row + tensor), lazy loading and autofitting logic #1470

Open

wbruna added 5 commits May 6, 2026 12:01

refactor: postpone alloc_params_buffer calls

4bc2d66

refactor: split model file processing from tensor loading

c805561

feat: initial support for memory-mapping model weights

06184f2

feat: enable memory-mapped tensors for all compatible backends

2a14f52

feat: enable memory-mapped tensors with --lora-apply-mode immediately

ec8de10

Instead of disabling mmap, we turn the mapping writable.

wbruna force-pushed the sd_mmap_weights branch from 12d6f98 to ec8de10 Compare May 6, 2026 23:34

pwilkin and others added 2 commits May 10, 2026 21:17

fix: avoid context allocation failure due to memory-mapped weights

90370bf

For models with mmap enabled, all tensors could already have a valid `t->data` pointer, but this condition triggers an error on `ggml_backend_alloc_ctx_tensors` (either a `NULL` return or an assertion failure).

leejet added 2 commits May 15, 2026 00:20

fix get_param_tensors_p call when tae_preview_only

f154eba

format code

2d684e6

leejet merged commit 57ff2eb into leejet:master May 14, 2026
12 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: support for memory-mapping model weights#1414

feat: support for memory-mapping model weights#1414
leejet merged 10 commits into
leejet:masterfrom
wbruna:sd_mmap_weights

wbruna commented Apr 13, 2026 •

edited

Loading

Uh oh!

wbruna commented May 11, 2026

Uh oh!

junmo-kim commented May 11, 2026 •

edited

Loading

Uh oh!

wbruna commented May 11, 2026

Uh oh!

junmo-kim commented May 14, 2026

Uh oh!

Uh oh!

leejet commented May 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

wbruna commented Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

wbruna commented May 11, 2026

Uh oh!

junmo-kim commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Root cause

Proposed fix

Verification

Uh oh!

wbruna commented May 11, 2026

Uh oh!

junmo-kim commented May 14, 2026

Uh oh!

Uh oh!

leejet commented May 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

wbruna commented Apr 13, 2026 •

edited

Loading

junmo-kim commented May 11, 2026 •

edited

Loading