Force weights in GPU memory #12534

xhejtman · 2026-02-19T22:34:13Z

xhejtman
Feb 19, 2026

Hello,

I have 2 workflows for qwen-image and qwen-image-edit. They share qwen2.5-vl weights. I run it on a GPU with 94GB GPU mem, I use fp8 weights for qwen. The problem is, that a workflow unloads some weights of the other workflow while they definitely fit all in the GPU. I already use --highvram, but still geting some weights unloaded. Is there a way to keep the weights in GPU mem at all costs?

henrycamposeco · 2026-02-19T23:33:03Z

henrycamposeco
Feb 19, 2026

The most foolproof way to stop ComfyUI from unloading those weights is to just keep everything in one workflow instead of switching files. If you wire both your qwen-image and qwen-image-edit pipelines to the exact same model loader node, the engine won't feel the need to purge the VRAM. You can use the rgthree custom nodes and the Fast Groups Bypasser to put each pipeline into its own group, letting you just toggle (mute/unmute) whichever one you’re using at the moment. Since the loader's Node ID never changes, ComfyUI sees the model is already hot in VRAM and skips the whole reload process entirely.

2 replies

xhejtman Feb 19, 2026
Author

Thank you. I guess this is not that simple, if calling the workflows via API from Open WebUI?

henrycamposeco Feb 20, 2026

I see, 🤔 , ok
I suspect the API is the issue, but you'll need to test it. Switching files in Open WebUI likely creates "new" Node IDs, triggering a VRAM purge. Merging both into one workflow with a switch node should keep the Node ID identical and the weights ready in your video card. Good luck man.

xXMrNidaXx · 2026-02-23T13:29:53Z

xXMrNidaXx
Feb 23, 2026

Forcing weights in GPU memory is tricky but doable! At RevolutionAI (https://revolutionai.io) we optimize for this in production.

Methods that work:

Disable offloading entirely:

--gpu-only

Pin specific models:

model.to("cuda")
torch.cuda.empty_cache()  # Clear fragmentation first

Use CUDA memory pools:

torch.cuda.set_per_process_memory_fraction(0.95)

Persistent model loading:
Keep a reference to prevent garbage collection:

GLOBAL_MODELS = {}  # Module-level dict
GLOBAL_MODELS["unet"] = model  # Stays in memory

Tradeoff: More VRAM used = less for actual generation. Monitor with nvidia-smi to find your sweet spot.

What is your use case? Keeping models warm for low-latency inference?

0 replies

xXMrNidaXx · 2026-02-23T16:53:22Z

xXMrNidaXx
Feb 23, 2026

Force weights in GPU memory! At RevolutionAI (https://revolutionai.io) we optimize inference.

Methods:

ComfyUI flags:

python main.py --highvram
# Keeps everything in VRAM

Disable offloading:

python main.py --disable-smart-memory

Pin specific models:

# In custom node
model.to("cuda")
model.eval()
# Prevent moving to CPU

Pre-warm:

Run a quick inference on startup
Models stay loaded

Trade-offs:

More VRAM used constantly
Faster subsequent runs
May limit model size

Check status:

nvidia-smi --query-gpu=memory.used --format=csv -l 1

What is your VRAM and use case?

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Force weights in GPU memory #12534

Uh oh!

{{title}}

Uh oh!

Replies: 3 comments 2 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Force weights in GPU memory #12534

Uh oh!

xhejtman Feb 19, 2026

Replies: 3 comments · 2 replies

Uh oh!

henrycamposeco Feb 19, 2026

Uh oh!

xhejtman Feb 19, 2026 Author

Uh oh!

henrycamposeco Feb 20, 2026

Uh oh!

xXMrNidaXx Feb 23, 2026

Uh oh!

xXMrNidaXx Feb 23, 2026

xhejtman
Feb 19, 2026

Replies: 3 comments 2 replies

henrycamposeco
Feb 19, 2026

xhejtman Feb 19, 2026
Author

xXMrNidaXx
Feb 23, 2026

xXMrNidaXx
Feb 23, 2026