Releases: huggingface/diffusers
Diffusers 0.37.0: Modular Diffusers, New image and video pipelines, multiple core library improvements, and more 🔥
Modular Diffusers
Modular Diffusers introduces a new way to build diffusion pipelines by composing reusable blocks. Instead of writing entire pipelines from scratch, you can now mix and match building blocks to create custom workflows tailored to your specific needs! This complements the existing DiffusionPipeline class, providing a more flexible way to create custom diffusion pipelines.
Find more details on how to get started with Modular Diffusers here, and also check out the announcement post.
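For a flavor of the API, here is a minimal text-to-image sketch adapted from the Modular Diffusers guide. The preset and method names (`TEXT2IMAGE_BLOCKS`, `init_pipeline`, `load_components`) follow the docs at the time of writing, so verify them against the guide linked above:

```python
import torch
from diffusers.modular_pipelines import SequentialPipelineBlocks
from diffusers.modular_pipelines.stable_diffusion_xl import TEXT2IMAGE_BLOCKS

# Compose reusable blocks (text encoding, denoising, decoding, ...) into one workflow.
blocks = SequentialPipelineBlocks.from_blocks_dict(TEXT2IMAGE_BLOCKS)

# Bind the blocks to a checkpoint and load the model components.
pipeline = blocks.init_pipeline("stabilityai/stable-diffusion-xl-base-1.0")
pipeline.load_components(torch_dtype=torch.float16)
pipeline.to("cuda")

image = pipeline(prompt="a photo of an astronaut riding a horse", output="images")[0]
```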
New Pipelines and Models
Image 🌆
- Z Image Omni Base: Z-Image is the foundation model of the Z-Image family, engineered for good quality, robust generative diversity, broad stylistic coverage, and precise prompt adherence. While Z-Image-Turbo is built for speed, Z-Image is a full-capacity, undistilled transformer designed to be the backbone for creators, researchers, and developers who require the highest level of creative freedom. Thanks to @RuoyiDu for contributing this in #12857.
- Flux2 Klein: FLUX.2 [Klein] unifies generation and editing in a single compact architecture, delivering state-of-the-art quality with end-to-end inference in under a second. It is built for applications that require real-time image generation without sacrificing quality, and runs on consumer hardware with as little as 13GB of VRAM.
- Qwen Image Layered: Qwen-Image-Layered is a model capable of decomposing an image into multiple RGBA layers. This layered representation unlocks inherent editability: each layer can be independently manipulated without affecting other content. Thanks to @naykun for contributing this in #12853.
- FIBO Edit: Fibo Edit is an 8B parameter image-to-image model that introduces a new paradigm of structured control, operating on JSON inputs paired with source images to enable deterministic and repeatable editing workflows. Featuring native masking for granular precision, it moves beyond simple prompt-based diffusion to offer explicit, interpretable control optimized for production environments. Its lightweight architecture is designed for deep customization, empowering researchers to build specialized “Edit” models for domain-specific tasks while delivering top-tier aesthetic quality. Thanks to @galbria for contributing it in #12930.
- Cosmos Predict2.5: Cosmos-Predict2.5 is the latest version of the Cosmos World Foundation Models (WFMs) family, specialized in simulating and predicting the future state of the world. Thanks to @miguelmartin75 for contributing it in #12852.
- Cosmos Transfer2.5: Cosmos-Transfer2.5 is a conditional world generation model with adaptive multimodal control that produces high-quality world simulations conditioned on multiple control inputs. These inputs can span different modalities, including edges, blurred video, segmentation maps, and depth maps. Thanks to @miguelmartin75 for contributing it in #13066.
- GLM-Image: GLM-Image is an image generation model that adopts a hybrid autoregressive + diffusion decoder architecture, effectively pushing the upper bound of visual fidelity and fine-grained details. In general image generation quality, it aligns with industry-standard LDM-based approaches, while demonstrating significant advantages in knowledge-intensive image generation scenarios. Thanks to @zRzRzRzRzRzRzR for contributing it in #12973.
- RAE: Representation Autoencoders (RAEs) are an exciting alternative to the traditional VAEs typically used in latent-space diffusion models for image generation. RAEs leverage pre-trained vision encoders and train lightweight decoders for reconstruction.
Video + audio 🎥 🎼
- LTX-2: LTX-2 is an audio-conditioned text-to-video generation model that can generate videos with synced audio. Full and distilled model inference, as well as two-stage inference with spatial sampling, is supported. We also support a conditioning pipeline that allows for passing different conditions (such as images, series of images, etc.). Check out the docs to learn more!
- Helios: Helios is a 14B video generation model that runs at 17 FPS on a single NVIDIA H100 GPU and supports minute-scale generation while matching a strong baseline in quality. Thanks to @SHYuanBest for contributing this in #13208.
Improvements to Core Library
New caching methods
- MagCache — thanks to @AlanPonnachan!
- TaylorSeer — thanks to @toilaluan!
New context-parallelism (CP) backends
- Unified Sequence Parallel attention — thanks to @Bissmella!
- Ulysses Anything Attention — thanks to @DefTruth!
Misc
- Mambo-G Guidance: New guider implementation (#12862)
- Laplace Scheduler for DDPM (#11320)
- Custom Sigmas in UniPCMultistepScheduler (#12109)
- MultiControlNet support for SD3 Inpainting (#11251)
- Context parallel in native flash attention (#12829)
- NPU Ulysses Attention Support (#12919)
- Fix Wan 2.1 I2V Context Parallel Inference (#12909)
- Fix Qwen-Image Context Parallel Inference (#12970)
- Introduction of the `@apply_lora_scale` decorator for simplifying model definitions (#12994)
- Introduction of pipeline-level "cpu" `device_map` (#12811)
- Enable CP for kernels-based attention backends (#12812)
- Diffusers is fully functional with Transformers V5 (#12976)
A lot of the above features/improvements came as part of the MVP program we have been running. Immense thanks to the contributors!
Bug Fixes
- Fix QwenImageEditPlus on NPU (#13017)
- Fix MT5Tokenizer → use `T5Tokenizer` for Transformers v5.0+ compatibility (#12877)
- Fix Wan/WanI2V patchification (#13038)
- Fix LTX-2 inference with `num_videos_per_prompt > 1` and CFG (#13121)
- Fix Flux2 img2img prediction (#12855)
- Fix QwenImage `txt_seq_lens` handling (#12702)
- Fix `prefix_token_len` bug (#12845)
- Fix ftfy imports in Wan and SkyReels-V2 (#12314, #13113)
- Fix `is_fsdp` determination (#12960)
- Fix GLM-Image `get_image_features` API (#13052)
- Fix Wan 2.2 when either transformer isn't present (#13055)
- Fix guider issue (#13147)
- Fix torchao quantizer for new versions (#12901)
- Fix GGUF for unquantized types with unquantize kernels (#12498)
- Make Qwen hidden states contiguous for torchao (#13081)
- Make Flux hidden states contiguous (#13068)
- Fix Kandinsky 5 hardcoded CUDA autocast (#12814)
- Fix `aiter` availability check (#13059)
- Fix attention mask check for unsupported backends (#12892)
- Allow `prompt` and `prior_token_ids` simultaneously in `GlmImagePipeline` (#13092)
- GLM-Image batch support (#13007)
- Cosmos 2.5 Video2World frame extraction fix (#13018)
- ResNet: only use contiguous in training mode (#12977)
All commits
- [PRX] Improve model compilation by @WaterKnight1998 in #12787
- Improve docstrings and type hints in scheduling_dpmsolver_singlestep.py by @delmalih in #12798
- [Modular]z-image by @yiyixuxu in #12808
- Fix Qwen Edit Plus modular for multi-image input by @sayakpaul in #12601
- [WIP] Add Flux2 modular by @DN6 in #12763
- [docs] improve distributed inference cp docs. by @sayakpaul in #12810
- post release 0.36.0 by @sayakpaul in #12804
- Update distributed_inference.md to correct syntax by @sayakpaul in #12827
- [lora] Remove lora docs unneeded and add " # Copied from ..." by @sayakpaul in #12824
- support CP in native flash attention by @sywangyi in #12829
- [qwen-image] edit 2511 support by @naykun in #12839
- fix pytest tests/pipelines/pixart_sigma/test_pixart.py::PixArtSigmaPi… by @sywangyi in #12842
- Support for control-lora by @lavinal712 in #10686
- Add support for LongCat-Image by @junqiangwu in #12828
- fix the prefix_token_len bug by @junqiangwu in #12845
- extend TorchAoTest::test_model_memory_usage to other platform by @sywangyi in #12768
- Qwen Image Layered Support by @naykun in #12853
- Z-Image-Turbo ControlNet by @hlky in #12792
- Cosmos Predict2.5 Base: inference pipeline, scheduler & chkpt conversion by @miguelmartin75 in #12852
- more update in modular by @yiyixuxu in #12560
- Feature: Add Mambo-G Guidance as Guider by @MatrixTeam-AI in #12862
- Add `OvisImagePipeline` in `AUTO_TEXT2IMAGE_PIPELINES_MAPPING` by @alvarobartt in #12876
- Cosmos Predict2.5 14b Conversion by @miguelmartin75 in #12863
- Use `T5Tokenizer` instead of `MT5Tokenizer` (removed in Transformers v5.0+) by @alvarobartt in #12877
- Add z-image-omni-base implementation by @RuoyiDu in #12857
- fix torchao quantizer for new torchao versions by @vkuzo in #12901
- fix Qwen Image Transformer s...
Diffusers 0.36.0: Pipelines galore, new caching method, training scripts, and more 🎄
The release features a number of new image and video pipelines, a new caching method, a new training script, new kernels-powered attention backends, and more. It is quite packed with new stuff, so make sure you read the release notes fully 🚀
New image pipelines
- Flux2: Flux2 is the latest generation of image generation and editing model from Black Forest Labs. It’s capable of taking multiple input images as reference, making it versatile for different use cases.
- Z-Image: Z-Image is a best-of-its-kind image generation model in the 6B param regime. Thanks to @JerryWu-code in #12703.
- QwenImage Edit Plus: It’s an upgrade of QwenImage Edit and is capable of taking multiple input images as references. It can act as both a generation and an editing model. Thanks to @naykun for contributing in #12357.
- Bria FIBO: FIBO is trained on structured JSON captions up to 1,000+ words and designed to understand and control different visual parameters such as lighting, composition, color, and camera settings, enabling precise and reproducible outputs. Thanks to @galbria for contributing this in #12545.
- Kandinsky Image Lite: Kandinsky 5.0 Image Lite is a lightweight image generation model (6B parameters). Thanks to @leffff for contributing this in #12664.
- ChronoEdit: ChronoEdit reframes image editing as a video generation task, using input and edited images as start/end frames to leverage pretrained video models with temporal consistency. A temporal reasoning stage introduces reasoning tokens to ensure physically plausible edits and visualize the editing trajectory. Thanks to @zhangjiewu for contributing this in #12593.
New video pipelines
- Sana-Video: Sana-Video is a fast and efficient video generation model, equipped to handle long video sequences, thanks to its incorporation of linear attention. Thanks to @lawrence-cj for contributing this in #12634.
- Kandinsky 5: Kandinsky 5.0 T2V Lite is a lightweight video generation model (2B parameters) that ranks #1 among open-source models in its class. It outperforms larger models and offers the best understanding of Russian concepts in the open-source ecosystem. Thanks to @leffff for contributing this in #12478.
- Hunyuan 1.5: HunyuanVideo-1.5 is a lightweight yet powerful video generation model that achieves state-of-the-art visual quality and motion coherence with only 8.3 billion parameters, enabling efficient inference on consumer-grade GPUs.
- Wan Animate: Wan-Animate is a state-of-the-art character animation and replacement video model based on Wan2.1. Given a reference character image and a driving motion video, it can either animate the character with motion from the driving video, or replace the existing character in the driving video with the reference character.
New kernels-powered attention backends
The kernels library helps you save a lot of time by providing pre-built kernel interfaces for various environments and accelerators. This release features three new kernels-powered attention backends:
- Flash Attention 3 (+ its `varlen` variant)
- Flash Attention 2 (+ its `varlen` variant)
- SAGE
This means that if any of the above backends is supported by your development environment, you should be able to skip the manual process of building the corresponding kernels and just use:
```python
# Make sure you have `kernels` installed: `pip install kernels`.
# You can choose `flash_hub` or `sage_hub`, too.
pipe.transformer.set_attention_backend("_flash_3_hub")
```

For more details, check out the documentation.
TaylorSeer cache
TaylorSeer is now supported in Diffusers, delivering up to 3x speedups with negligible to no quality compromise. Thanks to @toilaluan for contributing this in #12648. Check out the documentation here.
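Enabling it follows the library's usual `enable_cache` pattern. A minimal sketch, assuming the config class is exported as `TaylorSeerCacheConfig` (check the linked docs for the exact name and options):

```python
import torch
from diffusers import DiffusionPipeline
# Assumed name: verify against the TaylorSeer docs linked above.
from diffusers import TaylorSeerCacheConfig

pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Models exposing CacheMixin enable caching through a config object.
pipe.transformer.enable_cache(TaylorSeerCacheConfig())

image = pipe("a photo of a cat", num_inference_steps=28).images[0]
```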
New training script
Our Flux.2 integration features a LoRA fine-tuning script that you can check out here. We provide a number of optimizations to help make it run on consumer GPUs.
Misc
- Reusing `AttentionMixin`: Making certain compatible models subclass from the `AttentionMixin` class helped us get rid of 2K LoC. Going forward, users can expect more such refactorings that will help make the library leaner and simpler. Check out #12463 for more details.
- Diffusers backend in SGLang: sgl-project/sglang#14112.
- We started the Diffusers MVP program to work with talented community members who will help us improve the library across multiple fronts. Check out the link for more information.
All commits
- remove unneeded checkpoint imports. by @sayakpaul in #12488
- [tests] fix clapconfig for text backbone in audioldm2 by @sayakpaul in #12490
- ltx0.9.8 (without IC lora, autoregressive sampling) by @yiyixuxu in #12493
- [docs] Attention checks by @stevhliu in #12486
- [CI] Check links by @stevhliu in #12491
- [ci] xfail more incorrect transformer imports. by @sayakpaul in #12455
- [tests] introduce `VAETesterMixin` to consolidate tests for slicing and tiling by @sayakpaul in #12374
- docs: cleanup of runway model by @EazyAl in #12503
- Kandinsky 5 is finally in Diffusers! by @leffff in #12478
- Remove Qwen Image Redundant RoPE Cache by @dg845 in #12452
- Raise warning instead of error when imports are missing for custom code by @DN6 in #12513
- Fix: Use incorrect temporary variable key when replacing adapter name… by @FeiXie8 in #12502
- [docs] Organize toctree by modality by @stevhliu in #12514
- styling issues. by @sayakpaul in #12522
- Add Photon model and pipeline support by @DavidBert in #12456
- purge HF_HUB_ENABLE_HF_TRANSFER; promote Xet by @Vaibhavs10 in #12497
- Prx by @DavidBert in #12525
- [core] `AutoencoderMixin` to abstract common methods by @sayakpaul in #12473
- Kandinsky5 No cfg fix by @asomoza in #12527
- Fix: Add _skip_keys for AutoencoderKLWan by @yiyixuxu in #12523
- [CI] xfail the test_wuerstchen_prior test by @sayakpaul in #12530
- [tests] Test attention backends by @sayakpaul in #12388
- fix CI bug for kandinsky3_img2img case by @kaixuanliu in #12474
- Fix MPS compatibility in get_1d_sincos_pos_embed_from_grid #12432 by @Aishwarya0811 in #12449
- Handle deprecated transformer classes by @DN6 in #12517
- fix constants.py to user `upper()` by @sayakpaul in #12479
- HunyuanImage21 by @yiyixuxu in #12333
- Loose the criteria tolerance appropriately for Intel XPU devices by @kaixuanliu in #12460
- Deprecate Stable Cascade by @DN6 in #12537
- [chore] Move guiders experimental warning by @sayakpaul in #12543
- Fix Chroma attention padding order and update docs to use `lodestones/Chroma1-HD` by @josephrocca in #12508
- Add AITER attention backend by @lauri9 in #12549
- Fix small inconsistency in output dimension of "_get_t5_prompt_embeds" function in sd3 pipeline by @alirezafarashah in #12531
- Kandinsky 5 10 sec (NABLA suport) by @leffff in #12520
- Improve pos embed for Flux.1 inference on Ascend NPU by @gameofdimension in #12534
- support latest few-step wan LoRA. by @sayakpaul in #12541
- [Pipelines] Enable Wan VACE to run since single transformer by @DN6 in #12428
- fix crash if tiling mode is enabled by @sywangyi in #12521
- Fix typos in kandinsky5 docs by @Meatfucker in #12552
- [ci] don't run sana layerwise casting tests in CI. by @sayakpaul in #12551
- Bria fibo by @galbria in #12545
- Avoiding graph break by changing the way we infer dtype in vae.decoder by @ppadjinTT in #12512
- [Modular] Fix for custom block kwargs by @DN6 in #12561
- [Modular] Allow custom blocks to be saved to `local_dir` by @DN6 in #12381
- Fix Stable Diffusion 3.x pooled prompt embedding with multiple images by @friedrich in #12306
- Fix custom code loading in Automodel by @DN6 in #12571
- [modular] better warn message by @yiyixuxu in #12573
- [tests] add tests for flux modular (t2i, i2i, kontext) by @sayakpaul in #12566
- [modular]pass hub_kwargs to load_config by @yiyixuxu in #12577
- ulysses enabling in native attention path by @sywangyi in #12563
- Kandinsky ...
🐞 fixes for `transformers` models, imports,
All commits
- Release: v0.35.1-patch by @sayakpaul (direct commit on v0.35.2-patch)
- handle offload_state_dict when initing transformers models by @sayakpaul in #12438
- [CI] Fix TRANSFORMERS_FLAX_WEIGHTS_NAME import issue by @DN6 in #12354
- Fix PyTorch 2.3.1 compatibility: add version guard for torch.library.… by @Aishwarya0811 in #12206
- fix scale_shift_factor being on cpu for wan and ltx by @vladmandic in #12347
- Release: v0.35.2-patch by @sayakpaul (direct commit on v0.35.2-patch)
v0.35.1 for improvements in Qwen-Image Edit
Diffusers 0.35.0: Qwen Image pipelines, Flux Kontext, Wan 2.2, and more
This release comes packed with new image generation and editing pipelines, a new video pipeline, new training scripts, quality-of-life improvements, and much more. Read the rest of the release notes fully to not miss out on the fun stuff.
New pipelines 🧨
We welcomed new pipelines in this release:
- Wan 2.2
- Flux-Kontext
- Qwen-Image
- Qwen-Image-Edit
Wan 2.2 📹
This update to Wan provides significant improvements in video fidelity, prompt adherence, and style. Please check out the official doc to learn more.
Flux-Kontext 🎇
Flux-Kontext is a 12-billion-parameter rectified flow transformer capable of editing images based on text instructions. Please check out the official doc to learn more about it.
Qwen-Image 🌅
After a successful run of delivering language models and vision-language models, the Qwen team is back with an image generation model, which is Apache-2.0 licensed! It achieves significant advances in complex text rendering and precise image editing. To learn more about this powerful model, refer to our docs.
Thanks to @naykun for contributing both Qwen-Image and Qwen-Image-Edit via this PR and this PR.
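As a quick sketch, loading works through the standard DiffusionPipeline API; the prompt and generation settings below are just illustrative:

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16
).to("cuda")

# Qwen-Image is particularly strong at rendering text inside images.
image = pipe(
    prompt='A coffee shop sign that reads "Diffusers"',
    num_inference_steps=50,
).images[0]
image.save("qwen_image.png")
```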
New training scripts 🎛️
Make these newly added models your own with our training scripts:
Single-file modeling implementations
Following the 🤗 Transformers philosophy of single-file modeling implementations, we have started implementing modeling code in single, self-contained files. The Flux Transformer code is one example of this.
Attention refactor
We have massively refactored how we do attention in the models. This allows us to provide support for different attention backends (such as PyTorch native scaled_dot_product_attention, Flash Attention 3, SAGE attention, etc.) in the library seamlessly.
Having attention supported this way also allows us to integrate different parallelization mechanisms, which we’re actively working on. Follow this PR if you’re interested.
Users shouldn’t be affected at all by these changes. Please open an issue if you face any problems.
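On models that support the new attention dispatcher, switching backends is a one-liner. A sketch (backend availability depends on your environment and Diffusers version):

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Dispatch attention to FlashAttention (requires flash-attn to be installed);
# passing "native" switches back to PyTorch's scaled_dot_product_attention.
pipe.transformer.set_attention_backend("flash")
```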
Regional compilation
Regional compilation trims cold-start latency by only compiling the small and frequently-repeated block(s) of a model - typically a transformer layer - and enables reusing compiled artifacts for every subsequent occurrence. For many diffusion architectures, this delivers the same runtime speedups as full-graph compilation and reduces compile time by 8–10x. Refer to this doc to learn more.
Thanks to @anijain2305 for contributing this feature in this PR.
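In user code this boils down to one call. A sketch, assuming your model exposes the `compile_repeated_blocks` helper described in the doc:

```python
import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Compile only the repeated transformer blocks; the compiled artifact is
# reused for every other occurrence of the block, cutting cold-start time.
pipeline.transformer.compile_repeated_blocks(fullgraph=True)
```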
We have also authored a number of posts that center around the use of torch.compile. You can check them out at the links below:
- Presenting Flux Fast: Making Flux go brrr on H100s
- torch.compile and Diffusers: A Hands-On Guide to Peak Performance
- Fast LoRA inference for Flux with Diffusers and PEFT
Faster pipeline loading ⚡️
Users can now load pipelines directly on an accelerator device, leading to significantly faster load times. This becomes particularly evident when loading large pipelines like Wan and Qwen-Image.
```diff
  from diffusers import DiffusionPipeline
  import torch

  ckpt_id = "Qwen/Qwen-Image"
  pipe = DiffusionPipeline.from_pretrained(
-     ckpt_id, torch_dtype=torch.bfloat16
- ).to("cuda")
+     ckpt_id, torch_dtype=torch.bfloat16, device_map="cuda"
+ )
```

You can speed up loading even more by enabling parallelized loading of state dict shards. This is particularly helpful when you’re working with large models like Wan and Qwen-Image, where the model state dicts are typically sharded across multiple files.
```python
import os

os.environ["HF_ENABLE_PARALLEL_LOADING"] = "yes"

# rest of the loading code
...
```

Better GGUF integration
@Isotr0py contributed support for native GGUF CUDA kernels in this PR. This should provide an approximately 10% improvement in inference speed.
We have also worked on a tool for converting regular checkpoints to GGUF, letting the community easily share their GGUF checkpoints. Learn more here.
We now support loading of Diffusers format GGUF checkpoints.
You can learn more about all of this in our GGUF official docs.
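For reference, loading a GGUF checkpoint into a pipeline looks like this (the city96 checkpoint URL is one public example; any compatible GGUF file works):

```python
import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig

# Weights stay GGUF-quantized in memory; compute runs in bfloat16.
transformer = FluxTransformer2DModel.from_single_file(
    "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q2_K.gguf",
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", transformer=transformer, torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()

image = pipe("A cat holding a sign that says hello world").images[0]
```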
Modular Diffusers (Experimental)
Modular Diffusers is a system for building diffusion pipelines with individual pipeline blocks. It is highly customizable, with blocks that can be mixed and matched to adapt or create a pipeline for a specific workflow or multiple workflows.
The API is currently in active development and is being released as an experimental feature. Learn more in our docs.
All commits
- [tests] skip instead of returning. by @sayakpaul in #11793
- adjust to get CI test cases passed on XPU by @kaixuanliu in #11759
- fix deprecation in lora after 0.34.0 release by @sayakpaul in #11802
- [chore] post release v0.34.0 by @sayakpaul in #11800
- Follow up for Group Offload to Disk by @DN6 in #11760
- [rfc][compile] compile method for DiffusionPipeline by @anijain2305 in #11705
- [tests] add a test on torch compile for varied resolutions by @sayakpaul in #11776
- adjust tolerance criteria for `test_float16_inference` in unit test by @kaixuanliu in #11809
- Flux Kontext by @a-r-r-o-w in #11812
- Kontext training by @sayakpaul in #11813
- Kontext fixes by @a-r-r-o-w in #11815
- remove syncs before denoising in Kontext by @sayakpaul in #11818
- [CI] disable onnx, mps, flax from the CI by @sayakpaul in #11803
- TorchAO compile + offloading tests by @a-r-r-o-w in #11697
- Support dynamically loading/unloading loras with group offloading by @a-r-r-o-w in #11804
- [lora] fix: lora unloading behvaiour by @sayakpaul in #11822
- [lora]feat: use exclude modules to loraconfig. by @sayakpaul in #11806
- ENH: Improve speed of function expanding LoRA scales by @BenjaminBossan in #11834
- Remove print statement in SCM Scheduler by @a-r-r-o-w in #11836
- [tests] add test for hotswapping + compilation on resolution changes by @sayakpaul in #11825
- reset deterministic in tearDownClass by @jiqing-feng in #11785
- [tests] Fix failing float16 cuda tests by @a-r-r-o-w in #11835
- [single file] Cosmos by @a-r-r-o-w in #11801
- [docs] fix single_file example. by @sayakpaul in #11847
- Use real-valued instead of complex tensors in Wan2.1 RoPE by @mjkvaak-amd in #11649
- [docs] Batch generation by @stevhliu in #11841
- [docs] Deprecated pipelines by @stevhliu in #11838
- fix norm not training in train_control_lora_flux.py by @Luo-Yihang in #11832
- [From Single File] support `from_single_file` method for `WanVACE3DTransformer` by @J4BEZ in #11807
- [lora] tests for `exclude_modules` with Wan VACE by @sayakpaul in #11843
- update: FluxKontextInpaintPipeline support by @vuongminh1907 in #11820
- [Flux Kontext] Support Fal Kontext LoRA by @linoytsaban in #11823
- [docs] Add a note of `_keep_in_fp32_modules` by @a-r-r-o-w in #11851
- [benchmarks] overhaul benchmarks by @sayakpaul in #11565
- FIX set_lora_device when target layers differ by @BenjaminBossan in #11844
- Fix Wan AccVideo/CausVid fuse_lora by @a-r-r-o-w in #11856
- [chore] deprecate blip controlnet pipeline. by @sayakpaul in #11877
- [docs] fix references in flux pipelines. by @sayakpaul in #11857
- [tests] remove tests for deprecated pipelines. by @sayakpaul in #11879
- [docs] LoRA metadata by @stevhliu in #11848
- [training ] add Kontext i2i training by @sayakpaul in #11858
- [CI] Fix big GPU test marker by @DN6 in #11786
- First Block Cache by @a-r-r-o-w in #11180
- [tests] annotate compilation test classes with bnb by @sayakpaul in #11715
- Update chroma.md by @shm4r7 in #11891
- [CI] Speed up GPU PR Tests by @DN6 in #11887
- Pin k-diffusion for CI by @sayakpaul in #11894
- [Docker] update doc builder dockerfile to include quant libs. by @sayakpaul in #11728
- [tests] Remove more deprecated tests by @sayakpaul in #11895
- [tests] mark the wanvace lora tester flaky by @sayakpaul in #11883
- [tests] add compile + offload tests for GGUF. by @sayakpaul in #11740
- feat: add multiple input image support in Flux Kontext by @Net-Mist in #11880
- Fix unique memory address when doing group-offloading with disk by @sayakpaul in #11767
- [SD3] CFG Cutoff fix and official callback by @asomoza in #11890
- The Modular Diffusers by @yiyixuxu in #9672
- [quant] QoL improvements for pipeline-level quant config by @sayakpaul in ...
Diffusers 0.34.0: New Image and Video Models, Better torch.compile Support, and more
📹 New video generation pipelines
Wan VACE
Wan VACE supports various generation techniques which achieve controllable video generation. It comes in two variants: a 1.3B model for fast iteration and prototyping, and a 14B model for high-quality generation. Some of the capabilities include:
- Control to Video (Depth, Pose, Sketch, Flow, Grayscale, Scribble, Layout, Boundary Box, etc.). Recommended library for preprocessing videos to obtain control videos: huggingface/controlnet_aux
- Image/Video to Video (first frame, last frame, starting clip, ending clip, random clips)
- Inpainting and Outpainting
- Subject to Video (faces, object, characters, etc.)
- Composition to Video (reference anything, animate anything, swap anything, expand anything, move anything, etc.)
The code snippets available in this pull request demonstrate some examples of how videos can be generated with controllability signals.
Check out the docs to learn more.
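For orientation, here is a minimal text-to-video sketch; the repo id is assumed to be the Diffusers-format 1.3B checkpoint, and control inputs (video, mask, reference images) go in as optional pipeline kwargs, as shown in the PR linked above:

```python
import torch
from diffusers import WanVACEPipeline
from diffusers.utils import export_to_video

# Assumed repo id for the Diffusers-format 1.3B checkpoint.
pipe = WanVACEPipeline.from_pretrained(
    "Wan-AI/Wan2.1-VACE-1.3B-diffusers", torch_dtype=torch.bfloat16
).to("cuda")

# Without control inputs this acts as plain text-to-video; pass video, mask,
# or reference images as optional kwargs to use the controllable modes.
frames = pipe(
    prompt="A sleek cat lounging on a windowsill at sunset",
    height=480,
    width=832,
    num_frames=81,
    guidance_scale=5.0,
).frames[0]
export_to_video(frames, "output.mp4", fps=16)
```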
Cosmos Predict2 Video2World
Cosmos-Predict2 is a key branch of the Cosmos World Foundation Models (WFMs) ecosystem for Physical AI, specializing in future state prediction through advanced world modeling. It offers two powerful capabilities: text-to-image generation for creating high-quality images from text descriptions, and video-to-world generation for producing visual simulations from video inputs.
The Video2World model comes in a 2B and 14B variant. Check out the docs to learn more.
LTX 0.9.7 and Distilled
LTX 0.9.7 and its distilled variants are the latest in the family of models released by Lightricks.
Check out the docs to learn more.
Hunyuan Video Framepack and F1
Framepack is a novel method for enabling long video generation. There are two released variants of Hunyuan Video trained using this technique. Check out the docs to learn more.
FusionX
The FusionX family of models and LoRAs, built on top of Wan2.1-14B, should already be supported. To load the model, use from_single_file():
```python
import torch
from diffusers import WanTransformer3DModel

transformer = WanTransformer3DModel.from_single_file(
    "https://huggingface.co/vrgamedevgirl84/Wan14BT2VFusioniX/blob/main/Wan14Bi2vFusioniX_fp16.safetensors",
    torch_dtype=torch.bfloat16
)
```

To load the LoRAs, use load_lora_weights():
```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-14B-Diffusers",
    torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights(
    "vrgamedevgirl84/Wan14BT2VFusioniX", weight_name="FusionX_LoRa/Wan2.1_T2V_14B_FusionX_LoRA.safetensors"
)
```

AccVideo and CausVid (only LoRAs)
AccVideo and CausVid are two novel distillation techniques that speed up the generation time of video diffusion models while preserving quality. Diffusers supports loading their extracted LoRAs with their respective models.
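Loading one of these follows the standard `load_lora_weights()` flow. A sketch where the LoRA repo id and filename are placeholders for the extracted checkpoint you want:

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-14B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")

# Placeholders: point these at an extracted AccVideo or CausVid LoRA.
pipe.load_lora_weights("<lora-repo-id>", weight_name="<extracted-lora>.safetensors")
pipe.fuse_lora()  # optionally bake the LoRA into the base weights
```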
🌠 New image generation pipelines
Cosmos Predict2 Text2Image
Text-to-image models from the Cosmos-Predict2 release. The models come in 2B and 14B variants. Check out the docs to learn more.
Chroma
Chroma is an 8.9B parameter model based on FLUX.1-schnell. It’s fully Apache 2.0 licensed, ensuring that anyone can use, modify, and build on top of it. Check out the docs to learn more.
Thanks to @Ednaordinary for contributing it in this PR!
VisualCloze
VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning is an innovative universal image generation framework based on visual in-context learning that offers key capabilities:
- Support for various in-domain tasks
- Generalization to unseen tasks through in-context learning
- Unification of multiple tasks into one step, generating both the target image and intermediate results
- Support for reverse-engineering conditions from target images
Check out the docs to learn more. Thanks to @lzyhha for contributing this in this PR!
Better torch.compile support
We have worked with the PyTorch team to improve how we provide torch.compile() compatibility throughout the library. More specifically, we now test widely used models like Flux for recompilation and graph-break issues that can get in the way of fully realizing torch.compile() benefits. Refer to the following links to learn more:
Additionally, users can combine offloading with compilation to get a better speed-memory trade-off. Below is an example:
Code
```python
import torch
from diffusers import DiffusionPipeline

torch._dynamo.config.cache_size_limit = 10000

pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipeline.enable_model_cpu_offload()

# Compile.
pipeline.transformer.compile()

image = pipeline(
    prompt="An astronaut riding a horse on Mars",
    guidance_scale=0.,
    height=768,
    width=1360,
    num_inference_steps=4,
    max_sequence_length=256,
).images[0]
print(f"Max memory allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
```

This is compatible with group offloading, too. Interested readers can check out the concerned PRs below:
You can substantially reduce memory requirements by combining quantization with offloading and then improving speed with torch.compile(). Below is an example:
Code
```python
import torch
from diffusers import AutoModel, FluxPipeline
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
from transformers import T5EncoderModel

torch._dynamo.config.recompile_limit = 1000

torch_dtype = torch.bfloat16  # compute dtype used throughout
quant_kwargs = {"load_in_4bit": True, "bnb_4bit_compute_dtype": torch_dtype, "bnb_4bit_quant_type": "nf4"}
text_encoder_2_quant_config = TransformersBitsAndBytesConfig(**quant_kwargs)
dit_quant_config = DiffusersBitsAndBytesConfig(**quant_kwargs)

ckpt_id = "black-forest-labs/FLUX.1-dev"
text_encoder_2 = T5EncoderModel.from_pretrained(
    ckpt_id,
    subfolder="text_encoder_2",
    quantization_config=text_encoder_2_quant_config,
    torch_dtype=torch_dtype,
)
transformer = AutoModel.from_pretrained(
    ckpt_id,
    subfolder="transformer",
    quantization_config=dit_quant_config,
    torch_dtype=torch_dtype,
)
pipe = FluxPipeline.from_pretrained(
    ckpt_id,
    transformer=transformer,
    text_encoder_2=text_encoder_2,
    torch_dtype=torch_dtype,
)
pipe.enable_model_cpu_offload()
pipe.transformer.compile()

image = pipe(
    prompt="An astronaut riding a horse on Mars",
    guidance_scale=3.5,
    height=768,
    width=1360,
    num_inference_steps=28,
    max_sequence_length=512,
).images[0]
```

Starting from bitsandbytes==0.46.0, bnb-quantized models should be fully compatible with torch.compile() without graph breaks. This means that when compiling a bnb-quantized model, users can do: model.compile(fullgraph=True). This can significantly improve speed while still providing memory benefits. The figure below provides a comparison with Flux.1-Dev. Refer to this benchmarking script to learn more.
Note that for 4bit bnb models, it’s currently needed to install PyTorch nightly if fullgraph=True is specified during compilation.
Huge shoutout to @anijain2305 and @StrongerXi from the PyTorch team for the incredible support.
PipelineQuantizationConfig
Users can now provide a quantization config while initializing a pipeline:
```python
import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig

pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
    components_to_quantize=["transformer", "text_encoder_2"],
)
pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe("photo of a cute dog").images[0]
```

This reduces the barrier to entry for users who want to use quantization without having to write too much code. Refer to the documentation to learn more about [different configurations](https://huggingface.co/docs/diffusers/main/en/quantization/overview...
v0.33.1: fix ftfy import
Diffusers 0.33.0: New Image and Video Models, Memory Optimizations, Caching Methods, Remote VAEs, New Training Scripts, and more
New Pipelines for Video Generation
Wan 2.1
Wan2.1 is a comprehensive and open suite of video foundation models that pushes the boundaries of video generation. The release includes four model variants and three pipelines for Text to Video, Image to Video, and Video to Video:
- Wan-AI/Wan2.1-T2V-1.3B-Diffusers
- Wan-AI/Wan2.1-T2V-14B-Diffusers
- Wan-AI/Wan2.1-I2V-14B-480P-Diffusers
- Wan-AI/Wan2.1-I2V-14B-720P-Diffusers
Check out the docs here to learn more.
LTX Video 0.9.5
LTX Video 0.9.5 is the updated version of the super-fast LTX Video model series. The latest model introduces additional conditioning options, such as keyframe-based animation and video extension (both forward and backward).
To support these additional conditioning inputs, we’ve introduced the LTXConditionPipeline and LTXVideoCondition object.
To learn more about the usage, check out the docs here.
Hunyuan Image to Video
Hunyuan utilizes a pre-trained Multimodal Large Language Model (MLLM) with a Decoder-Only architecture as the text encoder. The input image is processed by the MLLM to generate semantic image tokens. These tokens are then concatenated with the video latent tokens, enabling comprehensive full-attention computation across the combined data and seamlessly integrating information from both the image and its associated caption.
To learn more, check out the docs here.
Others
- EasyAnimateV5 (thanks to @bubbliiiing for contributing this in this PR)
- ConsisID (thanks to @SHYuanBest for contributing this in this PR)
New Pipelines for Image Generation
Sana-Sprint
SANA-Sprint is an efficient diffusion model for ultra-fast text-to-image generation. SANA-Sprint is built on a pre-trained foundation model and augmented with hybrid distillation, dramatically reducing inference steps from 20 to 1-4, rivaling the quality of models like Flux.
Shoutout to @lawrence-cj for their help and guidance on this PR.
Check out the pipeline docs of SANA-Sprint to learn more.
Lumina2
Lumina-Image-2.0 is a 2B parameter flow-based diffusion transformer for text-to-image generation released under the Apache 2.0 license.
Check out the docs to learn more. Thanks to @zhuole1025 for contributing this through this PR.
One can also LoRA fine-tune Lumina2, taking advantage of its Apache 2.0 licensing. Check out the guide for more details.
Omnigen
OmniGen is a unified image generation model that can handle multiple tasks including text-to-image, image editing, subject-driven generation, and various computer vision tasks within a single framework. The model consists of a VAE, and a single transformer based on Phi-3 that handles text and image encoding as well as the diffusion process.
Check out the docs to learn more about OmniGen. Thanks to @staoxiao for contributing OmniGen in this PR.
Others
- CogView4 (thanks to @zRzRzRzRzRzRzR for contributing CogView4 in this PR)
New Memory Optimizations
Layerwise Casting
PyTorch supports torch.float8_e4m3fn and torch.float8_e5m2 as weight storage dtypes, but they can’t be used for computation on many devices due to unimplemented kernel support.
However, you can still use these dtypes to store model weights in FP8 precision and upcast them to a widely supported dtype such as torch.float16 or torch.bfloat16 on-the-fly when the layers are used in the forward pass. This is known as layerwise weight-casting. This can potentially cut down the VRAM requirements of a model by 50%.
Code
```python
import torch
from diffusers import CogVideoXPipeline, CogVideoXTransformer3DModel
from diffusers.utils import export_to_video

model_id = "THUDM/CogVideoX-5b"

# Load the model in bfloat16 and enable layerwise casting
transformer = CogVideoXTransformer3DModel.from_pretrained(model_id, subfolder="transformer", torch_dtype=torch.bfloat16)
transformer.enable_layerwise_casting(storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16)

# Load the pipeline
pipe = CogVideoXPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = (
    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
    "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
    "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
    "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
    "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
    "atmosphere of this unique musical performance."
)
video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
export_to_video(video, "output.mp4", fps=8)
```

Group Offloading
Group offloading is the middle ground between sequential and model offloading. It works by offloading groups of internal layers (either torch.nn.ModuleList or torch.nn.Sequential), which uses less memory than model-level offloading. It is also faster than sequential-level offloading because the number of device synchronizations is reduced.
On CUDA devices, we also have the option to enable using layer prefetching with CUDA Streams. The next layer to be executed is loaded onto the accelerator device while the current layer is being executed which makes inference substantially faster while still keeping VRAM requirements very low. With this, we introduce the idea of overlapping computation with data transfer.
One thing to note is that using CUDA streams can cause a considerable spike in CPU RAM usage. Please ensure that the available CPU RAM is 2 times the size of the model if you choose to set use_stream=True. You can reduce CPU RAM usage by setting low_cpu_mem_usage=True. This should limit the CPU RAM used to be roughly the same as the size of the model, but will introduce slight latency in the inference process.
You can also use record_stream=True when using use_stream=True to obtain more speedups at the expense of slightly increased memory usage.
Code
```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Load the pipeline
onload_device = torch.device("cuda")
offload_device = torch.device("cpu")
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

# We can utilize the enable_group_offload method for Diffusers model implementations
pipe.transformer.enable_group_offload(
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="leaf_level",
    use_stream=True
)

prompt = (
    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
    "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
    "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
    "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
    "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
    "atmosphere of this unique musical performance."
)
video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]

# This utilized about 14.79 GB. It can be further reduced by using tiling and leaf_level offloading throughout the pipeline.
print(f"Max memory allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
export_to_video(video, "output.mp4", fps=8)
```

Group offloading can also be applied to non-Diffusers models such as text encoders from the transformers library.
Code
```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.hooks import apply_group_offloading
from diffusers.utils import export_to_video

# Load the pipeline
onload_device = torch.device("cuda")
offload_device = torch.device("cpu")
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

# For any other model implementations, the apply_group_offloading function can be used
apply_group_offloading(pipe.text_encoder, onload_device=onload_device, offload_type="block_level", num_blocks_per_group=2)
```

Remote Components
Remote components are an experimental feature designed to offload memory-intensive steps of t...
v0.32.2
Fixes for Flux Single File loading, LoRA loading for 4bit BnB Flux, Hunyuan Video
This patch release
- Fixes a regression in loading Comfy UI format single file checkpoints for Flux
- Fixes a regression in loading LoRAs with bitsandbytes 4bit quantized Flux models
- Adds `unload_lora_weights` for Flux Control
- Fixes a bug that prevents Hunyuan Video from running with batch size > 1
- Allow Hunyuan Video to load LoRAs created from the original repository code
All commits
- [Single File] Fix loading Flux Dev finetunes with Comfy Prefix by @DN6 in #10545
- [CI] Update HF Token on Fast GPU Model Tests by @DN6 #10570
- [CI] Update HF Token in Fast GPU Tests by @DN6 #10568
- Fix batch > 1 in HunyuanVideo by @hlky in #10548
- Fix HunyuanVideo produces NaN on PyTorch<2.5 by @hlky in #10482
- Fix hunyuan video attention mask dim by @a-r-r-o-w in #10454
- [LoRA] Support original format loras for HunyuanVideo by @a-r-r-o-w in #10376
- [LoRA] feat: support loading loras into 4bit quantized Flux models. by @sayakpaul in #10578
- [LoRA] clean up `load_lora_into_text_encoder()` and `fuse_lora()` copied from by @sayakpaul in #10495
- [LoRA] feat: support `unload_lora_weights()` for Flux Control. by @sayakpaul in #10206
- Fix Flux multiple Lora loading bug by @maxs-kan in #10388
- [LoRA] fix: lora unloading when using expanded Flux LoRAs. by @sayakpaul in #10397
v0.32.1
TorchAO Quantizer fixes
This patch release fixes a few bugs related to the TorchAO Quantizer introduced in v0.32.0.
- Importing Diffusers would raise an error in PyTorch versions lower than 2.3.0. This should no longer be a problem.
- Device Map does not work as expected when using the quantizer. We now raise an error if it is used. Support for using device maps with different quantization backends will be added in the near future.
- Quantization was not performed due to faulty logic. This is now fixed and better tested.
Refer to our documentation to learn more about how to use different quantization backends.
All commits
- make style for #10368 by @yiyixuxu in #10370
- fix test pypi installation in the release workflow by @sayakpaul in #10360
- Fix TorchAO related bugs; revert device_map changes by @a-r-r-o-w in #10371
