@stancampbell3

Description

Problem

Users with consumer-grade GPUs (like an RTX 4090 with 11.49 GiB of VRAM) encounter OOM errors when running the T2V-1.3B model, even with the existing optimization flags (--offload_model True --t5_cpu). The OOM occurs because the VAE remains on the GPU throughout the entire generation pipeline, despite only being needed briefly for encoding and decoding.

Solution

This PR adds a --vae_cpu flag that works like the existing --t5_cpu flag. When enabled (see the sketch after this list):

  • VAE initializes on CPU instead of GPU
  • VAE moves to GPU only when needed for encode/decode operations
  • VAE returns to CPU after use, freeing VRAM for other models
  • Saves ~100-200 MB of VRAM without performance degradation
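
A minimal sketch of the round-trip, assuming a pipeline object that keeps the VAE in `pipe.vae` (with weights under `pipe.vae.model`) and exposes `pipe.device` and `pipe.vae_cpu`; these names are illustrative, not necessarily the exact identifiers in the PR:

```python
import torch

def decode_with_offload(pipe, latents):
    # pipe.vae_cpu, pipe.vae.model, and pipe.device are assumed names.
    if pipe.vae_cpu:
        # Bring the VAE onto the GPU only for the decode call.
        pipe.vae.model.to(pipe.device)
    videos = pipe.vae.decode(latents)
    if pipe.vae_cpu:
        # Return the VAE to the CPU and release cached allocator blocks,
        # freeing VRAM for the DiT and the text encoder.
        pipe.vae.model.to("cpu")
        torch.cuda.empty_cache()
    return videos
```

Under this flag, the same round-trip would wrap the encode path as well, since the image-conditioned pipelines need the VAE once before sampling and once after.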

Implementation Details

  1. Added a --vae_cpu argument to generate.py (mirrors the --t5_cpu pattern)
  2. Updated all four pipelines: WanT2V, WanI2V, WanFLF2V, WanVace
  3. Fixed critical DiT offloading: when offload_model=True and t5_cpu=False, the DiT now offloads before T5 loads, preventing an OOM
  4. Handled the VAE scale tensors: ensured the mean and std tensors move with the model (see the sketch after this list)
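
Point 4 matters because, if the normalization statistics are stored as plain tensors on the VAE wrapper rather than as registered buffers, moving the module alone leaves them on the old device and a later encode/decode hits a device-mismatch error. A hedged sketch, where the attribute names vae.model, vae.mean, and vae.std are assumptions for illustration:

```python
def move_vae(vae, device):
    # Move the VAE weights and its scale tensors together. Plain tensors
    # are not registered buffers, so vae.model.to(device) alone would not
    # relocate them (assumed attribute names, for illustration only).
    vae.model.to(device)
    vae.mean = vae.mean.to(device)
    vae.std = vae.std.to(device)
    return vae
```

Point 3 is the same discipline applied to ordering: with offload_model=True and t5_cpu=False, the DiT is moved off the GPU (and the allocator cache emptied) before the T5 encoder is loaded, so the two large models never occupy VRAM at the same time.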

stancampbell3 and others added 2 commits on October 17, 2025 at 01:34
Added environment.yml for project dependencies and a wok/go.sh script. Updated .gitignore to exclude symlinks in the wok directory.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Add --vae_cpu argument to enable VAE offloading for consumer GPUs with
limited VRAM. When enabled, the VAE initializes on the CPU and moves to
the GPU only when needed for encoding/decoding operations.

Key changes:
- Add --vae_cpu argument to generate.py (mirrors the --t5_cpu pattern; see the sketch after this list)
- Update all 4 pipelines (T2V, I2V, FLF2V, VACE) with conditional VAE offloading
- Fix DiT offloading to free VRAM before T5 loading when offload_model=True
- Handle VAE scale tensors (mean/std) during device transfers
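
A minimal sketch of the argument wiring, assuming the flag is registered the same way generate.py registers its other boolean flags; the help text here is illustrative:

```python
import argparse

parser = argparse.ArgumentParser()
# Register --vae_cpu alongside the existing offload flags; default=False
# preserves current behavior for anyone not passing the flag.
parser.add_argument(
    "--vae_cpu",
    action="store_true",
    default=False,
    help="Place the VAE on CPU and move it to GPU only for encode/decode.")
args = parser.parse_args()
```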

Benefits:
- Saves ~100-200 MB of VRAM without performance degradation
- Enables T2V-1.3B on more consumer GPUs (tested on an 11.49 GiB GPU)
- Backward compatible (default=False)
- Consistent with the existing --t5_cpu flag

Test results on an 11.49 GiB VRAM GPU:
- Baseline: OOM (needed 80 MB, only 85 MB free)
- With --vae_cpu: Success
- With --t5_cpu: Success
- With both flags: Success (maximum VRAM savings)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>