VRAM Management

Overview

The VRAM Manager is a Python service that runs as a Docker container alongside Ollama and ComfyUI. It monitors Ollama's model loading and automatically frees ComfyUI's VRAM when needed, preventing Ollama from falling back to CPU mode (which is ~100x slower).

How It Works

┌─────────────────────────────────────────────────────────┐
│         VRAM Manager (Docker Container)                  │
│         Monitors every 5 seconds (configurable)          │
└─────────────────────────────────────────────────────────┘
            │                            │
            ▼                            ▼
    ┌──────────────┐          ┌───────────────────┐
    │    Ollama    │          │      ComfyUI      │
    │  GET /api/ps │          │ GET /system_stats │
    └──────────────┘          └───────────────────┘
            │                            │
            ▼                            │
    New model loading?                   │
            │                            │
            ├─── YES ────────────────────┤
            │                            ▼
            │                   POST /free API
            │                   (cache + models)
            │                            │
            ▼                            ▼
    Ollama loads model using freed VRAM
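
In code terms, one cycle of this loop might look like the minimal Python sketch below. It is an illustration, not the service's actual source: the container hostnames are the ones used elsewhere on this page, Ollama's /api/ps is assumed to return a "models" list with "name" fields, and the vram_free/vram_total field names are assumptions about ComfyUI's /system_stats device entries.

import time

import requests

OLLAMA = "http://aistack-ollama:11434"
COMFYUI = "http://aistack-comfyui:8188"
CHECK_INTERVAL = 5  # seconds, mirrors VRAM_CHECK_INTERVAL

def loaded_ollama_models() -> set[str]:
    """Model names Ollama currently holds in memory (GET /api/ps)."""
    resp = requests.get(f"{OLLAMA}/api/ps", timeout=5)
    resp.raise_for_status()
    return {m["name"] for m in resp.json().get("models", [])}

def vram_used_percent() -> float:
    """GPU usage derived from ComfyUI's GET /system_stats (field names assumed)."""
    stats = requests.get(f"{COMFYUI}/system_stats", timeout=5).json()
    device = stats["devices"][0]
    return 100.0 * (1 - device["vram_free"] / device["vram_total"])

if __name__ == "__main__":
    known = loaded_ollama_models()
    while True:
        time.sleep(CHECK_INTERVAL)
        current = loaded_ollama_models()
        new_models = current - known   # trigger 1: a model we have not seen before
        known = current
        usage = vram_used_percent()    # trigger 2 input: current GPU pressure
        print(f"loaded={sorted(current)} new={sorted(new_models)} vram={usage:.0f}%")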

Triggers

  1. New Model Loading — A model name appears in Ollama's /api/ps response that was not present on the previous check
  2. VRAM Threshold — Total GPU memory usage exceeds 75% (configurable via VRAM_THRESHOLD)
  3. Rate Limiting — A guard rather than a trigger: frees happen at most once every 30 seconds, to prevent thrashing
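
A hedged sketch of how these conditions might combine (the predicate name and arguments are ours; the constants mirror the defaults above):

import time

VRAM_THRESHOLD = 75.0     # percent, mirrors VRAM_THRESHOLD
MIN_FREE_INTERVAL = 30.0  # seconds, the rate limit between frees

def should_free(new_models: set[str], vram_used_pct: float,
                last_free_ts: float) -> bool:
    """True when either trigger fires and the rate limit allows a free."""
    if time.monotonic() - last_free_ts < MIN_FREE_INTERVAL:
        return False                        # rate limited: by design, not an error
    return bool(new_models) or vram_used_pct > VRAM_THRESHOLD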

Two-Stage Freeing

  1. Soft Free (default) — Clears ComfyUI's memory cache only (unload_models=false)
  2. Aggressive Free — If VRAM remains above 85% after soft free, fully unloads models (unload_models=true)
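
Sketched in Python, with the /free payload taken from the API table at the end of this page (the helper names and the callable parameter are ours):

import requests

COMFYUI = "http://aistack-comfyui:8188"

def free_comfyui(unload_models: bool) -> None:
    """POST /free — clears ComfyUI's cache; unload_models=True also evicts models."""
    requests.post(f"{COMFYUI}/free",
                  json={"unload_models": unload_models, "free_memory": True},
                  timeout=10).raise_for_status()

def two_stage_free(vram_used_percent) -> None:
    """Soft free first; escalate to a full unload if VRAM stays above 85%."""
    free_comfyui(unload_models=False)      # stage 1: cache only
    if vram_used_percent() > 85.0:         # still above the aggressive threshold?
        free_comfyui(unload_models=True)   # stage 2: unload models as well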

Configuration

All settings are in .env.dev:

# VRAM Manager
VRAM_MANAGER_CONTAINER_NAME=aistack-vram-manager
VRAM_CHECK_INTERVAL=5           # Check interval in seconds
VRAM_THRESHOLD=75               # Free ComfyUI when VRAM exceeds this %
VRAM_DEBUG=false                # Enable debug logging

# Ollama VRAM limits
OLLAMA_MAX_VRAM=10737418240     # 10 GB max VRAM allocation
OLLAMA_MAX_LOADED_MODELS=2      # Keep max 2 models in memory
OLLAMA_NUM_PARALLEL=2           # Handle 2 parallel requests

# ComfyUI memory mode
COMFYUI_CLI_ARGS=--normalvram   # Options: --normalvram, --lowvram, --highvram
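
Inside the container these arrive as ordinary environment variables, so the manager can read them along these lines (illustrative only; the service's actual variable handling may differ):

import os

CHECK_INTERVAL = int(os.environ.get("VRAM_CHECK_INTERVAL", "5"))
THRESHOLD = float(os.environ.get("VRAM_THRESHOLD", "75"))
DEBUG = os.environ.get("VRAM_DEBUG", "false").lower() == "true"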

VRAM Allocation Strategy

Service   Max VRAM   Purpose
Ollama    ~10 GB     LLM inference (chat)
ComfyUI   ~6 GB      Image generation (uses remaining VRAM plus system RAM)

Tuning for Different Workloads

Chat-Heavy (Many Users, Larger Models)

OLLAMA_MAX_VRAM=12884901888     # 12 GB for Ollama
OLLAMA_MAX_LOADED_MODELS=3
VRAM_THRESHOLD=70               # More proactive freeing

Image-Heavy (Complex Workflows, Video)

OLLAMA_MAX_VRAM=8589934592      # 8 GB for Ollama
COMFYUI_CLI_ARGS=--highvram     # Keep models in VRAM
VRAM_THRESHOLD=85               # Less aggressive freeing

Both Services Struggling

OLLAMA_MAX_VRAM=7516192768      # 7 GB for Ollama
OLLAMA_MAX_LOADED_MODELS=1
OLLAMA_NUM_PARALLEL=1
COMFYUI_CLI_ARGS=--lowvram      # Aggressive memory saving
VRAM_THRESHOLD=65
VRAM_CHECK_INTERVAL=3

After changing settings, restart the stack:

docker compose -f docker-compose.dev.yaml down
docker compose -f docker-compose.dev.yaml up -d

Model Size Reference

Ollama LLM Models (approximate VRAM)

Size   Examples                           VRAM Usage
7B     Mistral, Llama2-7B, Phi-3          4–5 GB
13B    Llama2-13B, Vicuna-13B             7–8 GB
30B+   Llama2-70B Q4, CodeLlama-34B Q4    10–15 GB

ComfyUI Workflows (approximate VRAM)

Workflow               VRAM Usage
Simple SDXL            3–4 GB
Flux                   4–6 GB
Multiple LoRAs         +1–2 GB each
ControlNet             +1–2 GB
Video (AnimateDiff)    8–10 GB

Monitoring

# VRAM Manager logs
docker logs -f aistack-vram-manager

# GPU usage (real-time)
watch -n 1 nvidia-smi

# Container resource usage
docker stats aistack-ollama aistack-comfyui aistack-vram-manager

# Count memory frees
docker logs aistack-vram-manager | grep "Freed ComfyUI memory" | wc -l

# Check for errors
docker logs aistack-vram-manager | grep ERROR

# Verify Ollama environment
docker inspect aistack-ollama | grep -A 10 "Env"

Performance Expectations

Operation                 Expected Time
Chat response (7B)        1–3 seconds
Chat response (13B)       2–5 seconds
SDXL image (512×512)      3–8 seconds
SDXL image (1024×1024)    8–15 seconds
Video generation          2–5 minutes (uses RAM)

Concurrent Load

  • 2–3 chat + 1 image: ✅ Works well
  • 5 chat users: ✅ Queue forms but manageable
  • 2+ simultaneous images: ⚠️ May cause slowdown
  • 13B+ model + image gen: ✅ VRAM Manager handles it

Troubleshooting

Ollama Falls Back to CPU (Very Slow Responses)

# Check if Ollama is using GPU
docker logs aistack-ollama | grep -E "VRAM|offload"

# Check VRAM usage
nvidia-smi

# Manually free ComfyUI memory
docker exec aistack-comfyui curl -X POST http://localhost:8188/free \
  -H "Content-Type: application/json" \
  -d '{"unload_models": true, "free_memory": true}'

VRAM Manager Not Freeing Memory

# Enable debug logging
# Set VRAM_DEBUG=true in .env.dev, then:
docker compose -f docker-compose.dev.yaml restart vram-manager
docker logs -f aistack-vram-manager

Common causes:

  • Check interval too long → try VRAM_CHECK_INTERVAL=2
  • Threshold too high → try VRAM_THRESHOLD=70
  • Rate limiting → 30s minimum between frees (by design)

Container Won't Start

# Check logs
docker logs aistack-vram-manager

# Verify services are reachable from inside the container
docker exec aistack-vram-manager curl http://aistack-ollama:11434/api/tags
docker exec aistack-vram-manager curl http://aistack-comfyui:8188/system_stats

# Verify all containers are on the same network
docker network inspect ai-stack-network | grep -A 5 "Containers"

ComfyUI Out of Memory

# Switch to low VRAM mode
# Set COMFYUI_CLI_ARGS=--lowvram in .env.dev

# Or reduce Ollama's allocation
# Set OLLAMA_MAX_VRAM=8589934592 in .env.dev (8 GB)

# Then restart
docker compose -f docker-compose.dev.yaml down
docker compose -f docker-compose.dev.yaml up -d

Services Won't Start (GPU Not Found)

# Check GPU is accessible
nvidia-smi

# Verify Docker GPU support
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi

API Endpoints Used

Ollama

Method   Endpoint    Purpose
GET      /api/tags   List available models
GET      /api/ps     List running/loaded models

ComfyUI

Method   Endpoint        Purpose
GET      /system_stats   System status
POST     /free           Free memory ({"unload_models": bool, "free_memory": true})
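
For a quick scripted probe of both APIs (run from inside the Docker network, or adjust hostnames and ports to your published mappings; the /system_stats field names are assumptions):

import requests

ps = requests.get("http://aistack-ollama:11434/api/ps", timeout=5).json()
print("Loaded Ollama models:", [m["name"] for m in ps.get("models", [])])

stats = requests.get("http://aistack-comfyui:8188/system_stats", timeout=5).json()
for device in stats.get("devices", []):
    used = device["vram_total"] - device["vram_free"]
    print(f"{device.get('name', 'GPU')}: {used / 2**30:.1f} GiB VRAM in use")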

Resources