The VRAM Manager is a Python service that runs as a Docker container alongside Ollama and ComfyUI. It monitors Ollama's model loading and automatically frees ComfyUI's VRAM when needed, preventing Ollama from falling back to CPU mode (which is ~100x slower).
How It Works
┌───────────────────────────────────────────────────────┐
│            VRAM Manager (Docker Container)            │
│        Monitors every 5 seconds (configurable)        │
└───────────────────────────────────────────────────────┘
          │                               │
          ▼                               ▼
  ┌──────────────┐              ┌───────────────────┐
  │    Ollama    │              │      ComfyUI      │
  │ GET /api/ps  │              │ GET /system_stats │
  └──────────────┘              └───────────────────┘
          │                               │
          ▼                               │
  New model loading?                      │
          │                               │
          ├─── YES ───────────────────────┤
          │                               ▼
          │                        POST /free API
          │                       (cache + models)
          │                               │
          ▼                               ▼
     Ollama loads model using freed VRAM
Triggers
New Model Loading — A new model name appears in Ollama's /api/ps response
VRAM Threshold — Total GPU memory usage exceeds 75% (configurable)
Rate Limited — Minimum 30 seconds between frees to prevent thrashing
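A minimal sketch of how these three checks could be combined, assuming the manager polls Ollama's /api/ps over the compose network and reads GPU usage from nvidia-smi. The function names and module-level state are illustrative, not the actual implementation:

import subprocess
import time

import requests

OLLAMA_URL = "http://aistack-ollama:11434"    # container hostname used elsewhere in this guide
VRAM_THRESHOLD = 75                           # percent, mirrors VRAM_THRESHOLD in .env.dev
MIN_FREE_INTERVAL = 30                        # seconds, the built-in rate limit

_known_models = set()
_last_free = 0.0


def new_model_loading():
    """True when a model name shows up in Ollama's /api/ps that we have not seen yet."""
    global _known_models
    resp = requests.get(f"{OLLAMA_URL}/api/ps", timeout=5)
    resp.raise_for_status()
    current = {m["name"] for m in resp.json().get("models", [])}
    new_names = current - _known_models
    _known_models = current
    return bool(new_names)


def vram_usage_percent():
    """Total GPU memory usage as a percentage, read from nvidia-smi (single GPU assumed)."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    used, total = (float(x) for x in out.strip().splitlines()[0].split(","))
    return 100.0 * used / total


def should_free():
    """Combine the three triggers: new model, VRAM threshold, and the 30 s rate limit."""
    global _last_free
    if time.monotonic() - _last_free < MIN_FREE_INTERVAL:
        return False                                          # rate limited by design
    if new_model_loading() or vram_usage_percent() > VRAM_THRESHOLD:
        _last_free = time.monotonic()
        return True
    return False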
Two-Stage Freeing
Soft Free (default) — Clears ComfyUI's memory cache only (unload_models=false)
Aggressive Free — If VRAM remains above 85% after soft free, fully unloads models (unload_models=true)
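Both stages map onto ComfyUI's /free endpoint (the same endpoint used in the manual command further below). A minimal sketch, with read_vram_percent standing in for whatever usage probe the manager uses:

import requests

COMFYUI_URL = "http://aistack-comfyui:8188"   # container hostname used elsewhere in this guide
AGGRESSIVE_THRESHOLD = 85                     # percent, hard-coded here for illustration


def free_comfyui(read_vram_percent):
    """Stage 1: soft free (cache only). Stage 2: unload models if VRAM is still above 85%."""
    # Soft free: clear ComfyUI's cache but keep models resident (unload_models=false)
    requests.post(f"{COMFYUI_URL}/free",
                  json={"unload_models": False, "free_memory": True}, timeout=10)

    # If usage is still above the aggressive threshold, fully unload models
    if read_vram_percent() > AGGRESSIVE_THRESHOLD:
        requests.post(f"{COMFYUI_URL}/free",
                      json={"unload_models": True, "free_memory": True}, timeout=10)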
Configuration
All settings are in .env.dev:
# VRAM Manager
VRAM_MANAGER_CONTAINER_NAME=aistack-vram-manager
VRAM_CHECK_INTERVAL=5 # Check interval in seconds
VRAM_THRESHOLD=75 # Free ComfyUI when VRAM exceeds this %
VRAM_DEBUG=false # Enable debug logging

# Ollama VRAM limits
OLLAMA_MAX_VRAM=10737418240 # 10 GB max VRAM allocation
OLLAMA_MAX_LOADED_MODELS=2 # Keep max 2 models in memory
OLLAMA_NUM_PARALLEL=2 # Handle 2 parallel requests

# ComfyUI memory mode
COMFYUI_CLI_ARGS=--normalvram # Options: --normalvram, --lowvram, --highvram
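The VRAM_* settings drive the manager itself, while the OLLAMA_* and COMFYUI_* variables are presumably consumed by their respective containers. A minimal sketch of reading the manager's settings, assuming it uses os.getenv (illustrative, not the actual implementation):

import os

# Illustrative defaults mirror the values shown in .env.dev above
CHECK_INTERVAL = int(os.getenv("VRAM_CHECK_INTERVAL", "5"))      # seconds between checks
VRAM_THRESHOLD = float(os.getenv("VRAM_THRESHOLD", "75"))        # free ComfyUI above this %
DEBUG = os.getenv("VRAM_DEBUG", "false").lower() == "true"       # verbose logging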
VRAM Allocation Strategy
Service    Max VRAM    Purpose
Ollama     ~10 GB      LLM inference (chat)
ComfyUI    ~6 GB       Image generation (uses remaining VRAM + RAM)
Tuning for Different Workloads
Chat-Heavy (Many Users, Larger Models)
OLLAMA_MAX_VRAM=12884901888 # 12 GB for Ollama
OLLAMA_MAX_LOADED_MODELS=3
VRAM_THRESHOLD=70 # More proactive freeing
Image-Heavy (Complex Workflows, Video)
OLLAMA_MAX_VRAM=8589934592 # 8 GB for Ollama
COMFYUI_CLI_ARGS=--highvram # Keep models in VRAM
VRAM_THRESHOLD=85 # Less aggressive freeing
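The OLLAMA_MAX_VRAM values in these profiles are plain byte counts (GiB values times 1024**3); a quick sanity check of the arithmetic:

GIB = 1024 ** 3                      # 1073741824 bytes
assert 8 * GIB == 8589934592         # 8 GB (image-heavy profile)
assert 10 * GIB == 10737418240       # 10 GB (default)
assert 12 * GIB == 12884901888       # 12 GB (chat-heavy profile)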
# Check if Ollama is using GPU
docker logs aistack-ollama | grep -E "VRAM|offload"

# Check VRAM usage
nvidia-smi
# Manually free ComfyUI memory
docker exec aistack-comfyui curl -X POST http://localhost:8188/free \
  -H "Content-Type: application/json" \
  -d '{"unload_models": true, "free_memory": true}'
VRAM Manager Not Freeing Memory
# Enable debug logging: set VRAM_DEBUG=true in .env.dev, then:
docker compose -f docker-compose.dev.yaml restart vram-manager
docker logs -f aistack-vram-manager
Common causes:
Check interval too long → try VRAM_CHECK_INTERVAL=2
Threshold too high → try VRAM_THRESHOLD=70
Rate limiting → 30s minimum between frees (by design)
Container Won't Start
# Check logs
docker logs aistack-vram-manager
# Verify services are reachable from inside the container
docker exec aistack-vram-manager curl http://aistack-ollama:11434/api/tags
docker exec aistack-vram-manager curl http://aistack-comfyui:8188/system_stats
# Verify all containers are on the same network
docker network inspect ai-stack-network | grep -A 5 "Containers"
ComfyUI Out of Memory
# Switch to low VRAM mode: set COMFYUI_CLI_ARGS=--lowvram in .env.dev
# Or reduce Ollama's allocation: set OLLAMA_MAX_VRAM=8589934592 in .env.dev (8 GB)
# Then restart:
docker compose -f docker-compose.dev.yaml down
docker compose -f docker-compose.dev.yaml up -d
Services Won't Start (GPU Not Found)
# Check GPU is accessible
nvidia-smi
# Verify Docker GPU support
docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi