@stancampbell3

Description

Problem

Users with consumer-grade GPUs (like an RTX 4090 with 11.49 GiB of VRAM) encounter OOM errors when running the T2V-1.3B model, even with the existing optimization flags (--offload_model True --t5_cpu). The OOM occurs because the VAE remains on the GPU throughout the entire generation pipeline, despite only being needed briefly for encoding and decoding.

Solution

This PR adds a --vae_cpu flag that works like the existing --t5_cpu flag. When enabled (see the sketch after this list):

  • VAE initializes on CPU instead of GPU
  • VAE moves to GPU only when needed for encode/decode operations
  • VAE returns to CPU after use, freeing VRAM for other models
  • Saves ~100-200 MB of VRAM without performance degradation
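
A minimal sketch of the round-trip, assuming a pipeline object that keeps the VAE in `pipe.vae` (with weights under `pipe.vae.model`) and exposes `pipe.device` and `pipe.vae_cpu`; these names are illustrative, not necessarily the exact identifiers in the PR:

```python
import torch

def decode_with_offload(pipe, latents):
    # pipe.vae_cpu, pipe.vae.model, and pipe.device are assumed names.
    if pipe.vae_cpu:
        # Bring the VAE onto the GPU only for the decode call.
        pipe.vae.model.to(pipe.device)
    videos = pipe.vae.decode(latents)
    if pipe.vae_cpu:
        # Return the VAE to the CPU and release cached allocator blocks,
        # freeing VRAM for the DiT and the text encoder.
        pipe.vae.model.to("cpu")
        torch.cuda.empty_cache()
    return videos
```

Under this flag, the same round-trip would wrap the encode path as well, since the image-conditioned pipelines need the VAE once before sampling and once after.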

Implementation Details

  1. Added a --vae_cpu argument to generate.py (mirrors the --t5_cpu pattern)
  2. Updated all four pipelines: WanT2V, WanI2V, WanFLF2V, WanVace
  3. Fixed critical DiT offloading: when offload_model=True and t5_cpu=False, the DiT now offloads before T5 loads, preventing an OOM
  4. Handled the VAE scale tensors: ensured the mean and std tensors move with the model (see the sketch after this list)
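
Point 4 matters because, if the normalization statistics are stored as plain tensors on the VAE wrapper rather than as registered buffers, moving the module alone leaves them on the old device and a later encode/decode hits a device-mismatch error. A hedged sketch, where the attribute names vae.model, vae.mean, and vae.std are assumptions for illustration:

```python
def move_vae(vae, device):
    # Move the VAE weights and its scale tensors together. Plain tensors
    # are not registered buffers, so vae.model.to(device) alone would not
    # relocate them (assumed attribute names, for illustration only).
    vae.model.to(device)
    vae.mean = vae.mean.to(device)
    vae.std = vae.std.to(device)
    return vae
```

Point 3 is the same discipline applied to ordering: with offload_model=True and t5_cpu=False, the DiT is moved off the GPU (and the allocator cache emptied) before the T5 encoder is loaded, so the two large models never occupy VRAM at the same time.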

stancampbell3 and others added 2 commits on October 17, 2025 at 01:34
Added environment.yml for project dependencies and a wok/go.sh script. Updated .gitignore to exclude symlinks in the wok directory.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Add --vae_cpu argument to enable VAE offloading for consumer GPUs with
limited VRAM. When enabled, the VAE initializes on the CPU and moves to
the GPU only when needed for encoding/decoding operations.

Key changes:
- Add --vae_cpu argument to generate.py (mirrors the --t5_cpu pattern; see the sketch after this list)
- Update all 4 pipelines (T2V, I2V, FLF2V, VACE) with conditional VAE offloading
- Fix DiT offloading to free VRAM before T5 loading when offload_model=True
- Handle VAE scale tensors (mean/std) during device transfers
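
A minimal sketch of the argument wiring, assuming the flag is registered the same way generate.py registers its other boolean flags; the help text here is illustrative:

```python
import argparse

parser = argparse.ArgumentParser()
# Register --vae_cpu alongside the existing offload flags; default=False
# preserves current behavior for anyone not passing the flag.
parser.add_argument(
    "--vae_cpu",
    action="store_true",
    default=False,
    help="Place the VAE on CPU and move it to GPU only for encode/decode.")
args = parser.parse_args()
```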

Benefits:
- Saves ~100-200 MB of VRAM without performance degradation
- Enables T2V-1.3B on more consumer GPUs (tested on an 11.49 GiB GPU)
- Backward compatible (default=False)
- Consistent with the existing --t5_cpu flag

Test results on an 11.49 GiB VRAM GPU:
- Baseline: OOM (needed 80 MB, only 85 MB free)
- With --vae_cpu: Success
- With --t5_cpu: Success
- With both flags: Success (maximum VRAM savings)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>