05 Mar 15:05

d791c5c

Diffusers 0.37.0: Modular Diffusers, New image and video pipelines, multiple core library improvements, and more 🔥

Modular Diffusers

Modular Diffusers introduces a new way to build diffusion pipelines by composing reusable blocks. Instead of writing entire pipelines from scratch, you can now mix and match building blocks to create custom workflows tailored to your specific needs! This complements the existing DiffusionPipeline class, providing a more flexible way to create custom diffusion pipelines.

Find more details on how to get started with Modular Diffusers here, and also check out the announcement post.
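To make the idea of composing reusable blocks concrete, here is a minimal, library-free sketch. It does not use the Modular Diffusers API: `Block`, `compose`, and the three toy steps are hypothetical names invented for this illustration, and the "encode", "denoise", and "decode" functions are trivial stand-ins for real model components.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, object]


@dataclass
class Block:
    """A named, reusable step that reads from and writes to a shared state."""
    name: str
    fn: Callable[[State], State]

    def __call__(self, state: State) -> State:
        # Merge the block's outputs into a copy of the incoming state.
        return {**state, **self.fn(state)}


def compose(blocks: List[Block]) -> Callable[[State], State]:
    """Chain blocks so that each one sees the state left by the previous one."""
    def run(state: State) -> State:
        for block in blocks:
            state = block(state)
        return state
    return run


# Toy stand-ins for text encoding, denoising, and decoding (hypothetical).
encode = Block("encode", lambda s: {"embedding": len(s["prompt"])})
denoise = Block("denoise", lambda s: {"latent": s["embedding"] * s["steps"]})
decode = Block("decode", lambda s: {"image": f"image<{s['latent']}>"})

text_to_image = compose([encode, denoise, decode])
result = text_to_image({"prompt": "a cat", "steps": 4})
print(result["image"])
```

Because each block only depends on the keys it reads from the state, a custom workflow can swap one block (for example, a different `denoise`) without touching the others.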

New Pipelines and Models

Image 🌆

Z Image Omni Base: Z-Image is the foundation model of the Z-Image family, engineered for good quality, robust generative diversity, broad stylistic coverage, and precise prompt adherence. While Z-Image-Turbo is built for speed, Z-Image is a full-capacity, undistilled transformer designed to be the backbone for creators, researchers, and developers who require the highest level of creative freedom. Thanks to @RuoyiDu for contributing this in #12857.
Flux2 Klein: FLUX.2 [Klein] unifies generation and editing in a single compact architecture, delivering state-of-the-art quality with end-to-end inference in under a second. Built for applications that require real-time image generation without sacrificing quality, it runs on consumer hardware with as little as 13GB VRAM.
Qwen Image Layered: Qwen-Image-Layered is a model capable of decomposing an image into multiple RGBA layers. This layered representation unlocks inherent editability: each layer can be independently manipulated without affecting other content. Thanks to @naykun for contributing this in #12853.
FIBO Edit: Fibo Edit is an 8B parameter image-to-image model that introduces a new paradigm of structured control, operating on JSON inputs paired with source images to enable deterministic and repeatable editing workflows. Featuring native masking for granular precision, it moves beyond simple prompt-based diffusion to offer explicit, interpretable control optimized for production environments. Its lightweight architecture is designed for deep customization, empowering researchers to build specialized “Edit” models for domain-specific tasks while delivering top-tier aesthetic quality. Thanks to @galbria for contributing it in #12930.
Cosmos Predict2.5: Cosmos-Predict2.5 is the latest version of the Cosmos World Foundation Models (WFMs) family, specialized for simulating and predicting the future state of the world. Thanks to @miguelmartin75 for contributing it in #12852.
Cosmos Transfer2.5: Cosmos-Transfer2.5 is a conditional world generation model with adaptive multimodal control that produces high-quality world simulations conditioned on multiple control inputs. These inputs can take different modalities, including edges, blurred video, segmentation maps, and depth maps. Thanks to @miguelmartin75 for contributing it in #13066.
GLM-Image: GLM-Image is an image generation model that adopts a hybrid autoregressive + diffusion decoder architecture, effectively pushing the upper bound of visual fidelity and fine-grained details. In general image generation quality, it aligns with industry-standard LDM-based approaches, while demonstrating significant advantages in knowledge-intensive image generation scenarios. Thanks to @zRzRzRzRzRzRzR for contributing it in #12973.
RAE: Representation Autoencoders (aka RAE) are an exciting alternative to traditional VAEs, which are typically used in latent-space diffusion models for image generation. RAEs leverage pre-trained vision encoders and train lightweight decoders for the task of reconstruction.

Video + audio 🎥 🎼

LTX-2: LTX-2 is an audio-conditioned text-to-video generation model that can generate videos with synced audio. Full and distilled model inference, as well as two-stage inference with spatial sampling, is supported. We also support a conditioning pipeline that allows for passing different conditions (such as images, series of images, etc.). Check out the docs to learn more!
Helios: Helios is a 14B video generation model that runs at 17 FPS on a single NVIDIA H100 GPU and supports minute-scale generation while matching a strong baseline in quality. Thanks to @SHYuanBest for contributing this in #13208.

Improvements to Core Library

New caching methods

MagCache — thanks to @AlanPonnachan!
TaylorSeer — thanks to @toilaluan!

New context-parallelism (CP) backends

Unified Sequence Parallel attention — thanks to @Bissmella!
Ulysses Anything Attention — thanks to @DefTruth!

Misc

Mambo-G Guidance: New guider implementation (#12862)
Laplace Scheduler for DDPM (#11320)
Custom Sigmas in UniPCMultistepScheduler (#12109)
MultiControlNet support for SD3 Inpainting (#11251)
Context parallel in native flash attention (#12829)
NPU Ulysses Attention Support (#12919)
Fix Wan 2.1 I2V Context Parallel Inference (#12909)
Fix Qwen-Image Context Parallel Inference (#12970)
Introduction of the @apply_lora_scale decorator for simplifying model definitions (#12994)
Introduction of pipeline-level “cpu” device_map (#12811)
Enable CP for kernels-based attention backends (#12812)
Diffusers is fully functional with Transformers V5 (#12976)

A lot of the above features/improvements came as part of the MVP program we have been running. Immense thanks to the contributors!

Bug Fixes

Fix QwenImageEditPlus on NPU (#13017)
Fix MT5Tokenizer → use T5Tokenizer for Transformers v5.0+ compatibility (#12877)
Fix Wan/WanI2V patchification (#13038)
Fix LTX-2 inference with num_videos_per_prompt > 1 and CFG (#13121)
Fix Flux2 img2img prediction (#12855)
Fix QwenImage txt_seq_lens handling (#12702)
Fix prefix_token_len bug (#12845)
Fix ftfy imports in Wan and SkyReels-V2 (#12314, #13113)
Fix is_fsdp determination (#12960)
Fix GLM-Image get_image_features API (#13052)
Fix Wan 2.2 when either transformer isn't present (#13055)
Fix guider issue (#13147)
Fix torchao quantizer for new versions (#12901)
Fix GGUF for unquantized types with unquantize kernels (#12498)
Make Qwen hidden states contiguous for torchao (#13081)
Make Flux hidden states contiguous (#13068)
Fix Kandinsky 5 hardcoded CUDA autocast (#12814)
Fix aiter availability check (#13059)
Fix attention mask check for unsupported backends (#12892)
Allow prompt and prior_token_ids simultaneously in GlmImagePipeline (#13092)
GLM-Image batch support (#13007)
Cosmos 2.5 Video2World frame extraction fix (#13018)
ResNet: only use contiguous in training mode (#12977)

All commits

[PRX] Improve model compilation by @WaterKnight1998 in #12787
Improve docstrings and type hints in scheduling_dpmsolver_singlestep.py by @delmalih in #12798
[Modular]z-image by @yiyixuxu in #12808
Fix Qwen Edit Plus modular for multi-image input by @sayakpaul in #12601
[WIP] Add Flux2 modular by @DN6 in #12763
[docs] improve distributed inference cp docs. by @sayakpaul in #12810
post release 0.36.0 by @sayakpaul in #12804
Update distributed_inference.md to correct syntax by @sayakpaul in #12827
[lora] Remove lora docs unneeded and add " # Copied from ..." by @sayakpaul in #12824
support CP in native flash attention by @sywangyi in #12829
[qwen-image] edit 2511 support by @naykun in #12839
fix pytest tests/pipelines/pixart_sigma/test_pixart.py::PixArtSigmaPi… by @sywangyi in #12842
Support for control-lora by @lavinal712 in #10686
Add support for LongCat-Image by @junqiangwu in #12828
fix the prefix_token_len bug by @junqiangwu in #12845
extend TorchAoTest::test_model_memory_usage to other platform by @sywangyi in #12768
Qwen Image Layered Support by @naykun in #12853
Z-Image-Turbo ControlNet by @hlky in #12792
Cosmos Predict2.5 Base: inference pipeline, scheduler & chkpt conversion by @miguelmartin75 in #12852
more update in modular by @yiyixuxu in #12560
Feature: Add Mambo-G Guidance as Guider by @MatrixTeam-AI in #12862
Add OvisImagePipeline in AUTO_TEXT2IMAGE_PIPELINES_MAPPING by @alvarobartt in #12876
Cosmos Predict2.5 14b Conversion by @miguelmartin75 in #12863
Use T5Tokenizer instead of MT5Tokenizer (removed in Transformers v5.0+) by @alvarobartt in #12877
Add z-image-omni-base implementation by @RuoyiDu in #12857
fix torchao quantizer for new torchao versions by @vkuzo in #12901
fix Qwen Image Transformer s...

Contributors

kashif, geekuillaume, and 75 other contributors


08 Dec 10:17

sayakpaul

v0.36.0

9380e58

Diffusers 0.36.0: Pipelines galore, new caching method, training scripts, and more 🎄

The release features a number of new image and video pipelines, a new caching method, a new training script, new `kernels`-powered attention backends, and more. It is packed with new features, so make sure you read the release notes fully 🚀

New image pipelines

Flux2: Flux2 is the latest generation of image generation and editing model from Black Forest Labs. It’s capable of taking multiple input images as reference, making it versatile for different use cases.
Z-Image: Z-Image is a best-of-its-kind image generation model in the 6B param regime. Thanks to @JerryWu-code for contributing this in #12703.
QwenImage Edit Plus: It’s an upgrade of QwenImage Edit and is capable of taking multiple input images as references. It can act as both a generation and an editing model. Thanks to @naykun for contributing in #12357.
Bria FIBO: FIBO is trained on structured JSON captions up to 1,000+ words and designed to understand and control different visual parameters such as lighting, composition, color, and camera settings, enabling precise and reproducible outputs. Thanks to @galbria for contributing this in #12545.
Kandinsky Image Lite: Kandinsky 5.0 Image Lite is a lightweight image generation model (6B parameters). Thanks to @leffff for contributing this in #12664.
ChronoEdit: ChronoEdit reframes image editing as a video generation task, using input and edited images as start/end frames to leverage pretrained video models with temporal consistency. A temporal reasoning stage introduces reasoning tokens to ensure physically plausible edits and visualize the editing trajectory. Thanks to @zhangjiewu for contributing this in #12593.

New video pipelines

Sana-Video: Sana-Video is a fast and efficient video generation model, equipped to handle long video sequences, thanks to its incorporation of linear attention. Thanks to @lawrence-cj for contributing this in #12634.
Kandinsky 5: Kandinsky 5.0 T2V Lite is a lightweight video generation model (2B parameters) that ranks #1 among open-source models in its class. It outperforms larger models and offers the best understanding of Russian concepts in the open-source ecosystem. Thanks to @leffff for contributing this in #12478.
Hunyuan 1.5: HunyuanVideo-1.5 is a lightweight yet powerful video generation model that achieves state-of-the-art visual quality and motion coherence with only 8.3 billion parameters, enabling efficient inference on consumer-grade GPUs.
Wan Animate: Wan-Animate is a state-of-the-art character animation and replacement video model based on Wan2.1. Given a reference character image and driving motion video, it can either animate the character with motion from the driving video, or replace the existing character in that video with that character.

New `kernels`-powered attention backends

The kernels library helps you save a lot of time by providing pre-built kernel interfaces for various environments and accelerators. This release features three new kernels-powered attention backends:

Flash Attention 3 (+ its varlen variant)
Flash Attention 2 (+ its varlen variant)
SAGE

This means that if any of the above backends is supported by your development environment, you can skip the manual process of building the corresponding kernels and just use:

# Make sure you have `kernels` installed: `pip install kernels`.
# You can choose `flash_hub` or `sage_hub`, too.
pipe.transformer.set_attention_backend("_flash_3_hub")

For more details, check out the documentation.

TaylorSeer cache

TaylorSeer is now supported in Diffusers, delivering up to 3x speedups with little to no loss in quality. Thanks to @toilaluan for contributing this in #12648. Check out the documentation here.
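As a rough intuition for how a forecasting cache of this kind can save compute, the sketch below skips some "expensive" evaluations and extrapolates them from cached values with a first-order Taylor step. This is a conceptual toy under our own assumptions, not the Diffusers implementation: `expensive_feature`, `forecast`, `run`, and the `refresh_every` parameter are hypothetical names, and a scalar stands in for a model's intermediate features.

```python
from typing import List, Tuple


def expensive_feature(t: float) -> float:
    """Stand-in for a costly forward pass of a model block at timestep t."""
    return t ** 2


def forecast(history: List[Tuple[float, float]], t: float) -> float:
    """First-order Taylor forecast from the two most recent real evaluations."""
    (t0, f0), (t1, f1) = history[-2], history[-1]
    slope = (f1 - f0) / (t1 - t0)  # finite-difference estimate of the derivative
    return f1 + slope * (t - t1)


def run(num_steps: int = 9, refresh_every: int = 3) -> Tuple[List[float], int]:
    history: List[Tuple[float, float]] = []  # (timestep, feature) from real passes
    outputs: List[float] = []
    calls = 0
    for step in range(num_steps):
        t = float(step)
        # Do a real pass on refresh steps, or while there is too little history.
        if step % refresh_every == 0 or len(history) < 2:
            feature = expensive_feature(t)
            calls += 1
            history.append((t, feature))
        else:
            feature = forecast(history, t)
        outputs.append(feature)
    return outputs, calls


outputs, calls = run()
print(f"{calls} real forward passes for {len(outputs)} steps")
```

The forecasted values are approximations (for example, step 7 yields 45.0 where the true value is 49.0), which is the speed/quality trade-off such caches have to manage.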

New training script

Our Flux.2 integration features a LoRA fine-tuning script that you can check out here. We provide a number of optimizations to help make it run on consumer GPUs.

Misc

Reusing AttentionMixin: Making certain compatible models subclass from the AttentionMixin class helped us get rid of 2K LoC. Going forward, users can expect more such refactorings that will help make the library leaner and simpler. Check out #12463 for more details.
Diffusers backend in SGLang: sgl-project/sglang#14112.
We started the Diffusers MVP program to work with talented community members who will help us improve the library across multiple fronts. Check out the link for more information.

All commits

remove unneeded checkpoint imports. by @sayakpaul in #12488
[tests] fix clapconfig for text backbone in audioldm2 by @sayakpaul in #12490
ltx0.9.8 (without IC lora, autoregressive sampling) by @yiyixuxu in #12493
[docs] Attention checks by @stevhliu in #12486
[CI] Check links by @stevhliu in #12491
[ci] xfail more incorrect transformer imports. by @sayakpaul in #12455
[tests] introduce VAETesterMixin to consolidate tests for slicing and tiling by @sayakpaul in #12374
docs: cleanup of runway model by @EazyAl in #12503
Kandinsky 5 is finally in Diffusers! by @leffff in #12478
Remove Qwen Image Redundant RoPE Cache by @dg845 in #12452
Raise warning instead of error when imports are missing for custom code by @DN6 in #12513
Fix: Use incorrect temporary variable key when replacing adapter name… by @FeiXie8 in #12502
[docs] Organize toctree by modality by @stevhliu in #12514
styling issues. by @sayakpaul in #12522
Add Photon model and pipeline support by @DavidBert in #12456
purge HF_HUB_ENABLE_HF_TRANSFER; promote Xet by @Vaibhavs10 in #12497
Prx by @DavidBert in #12525
[core] AutoencoderMixin to abstract common methods by @sayakpaul in #12473
Kandinsky5 No cfg fix by @asomoza in #12527
Fix: Add _skip_keys for AutoencoderKLWan by @yiyixuxu in #12523
[CI] xfail the test_wuerstchen_prior test by @sayakpaul in #12530
[tests] Test attention backends by @sayakpaul in #12388
fix CI bug for kandinsky3_img2img case by @kaixuanliu in #12474
Fix MPS compatibility in get_1d_sincos_pos_embed_from_grid #12432 by @Aishwarya0811 in #12449
Handle deprecated transformer classes by @DN6 in #12517
fix constants.py to user upper() by @sayakpaul in #12479
HunyuanImage21 by @yiyixuxu in #12333
Loose the criteria tolerance appropriately for Intel XPU devices by @kaixuanliu in #12460
Deprecate Stable Cascade by @DN6 in #12537
[chore] Move guiders experimental warning by @sayakpaul in #12543
Fix Chroma attention padding order and update docs to use lodestones/Chroma1-HD by @josephrocca in #12508
Add AITER attention backend by @lauri9 in #12549
Fix small inconsistency in output dimension of "_get_t5_prompt_embeds" function in sd3 pipeline by @alirezafarashah in #12531
Kandinsky 5 10 sec (NABLA suport) by @leffff in #12520
Improve pos embed for Flux.1 inference on Ascend NPU by @gameofdimension in #12534
support latest few-step wan LoRA. by @sayakpaul in #12541
[Pipelines] Enable Wan VACE to run since single transformer by @DN6 in #12428
fix crash if tiling mode is enabled by @sywangyi in #12521
Fix typos in kandinsky5 docs by @Meatfucker in #12552
[ci] don't run sana layerwise casting tests in CI. by @sayakpaul in #12551
Bria fibo by @galbria in #12545
Avoiding graph break by changing the way we infer dtype in vae.decoder by @ppadjinTT in #12512
[Modular] Fix for custom block kwargs by @DN6 in #12561
[Modular] Allow custom blocks to be saved to local_dir by @DN6 in #12381
Fix Stable Diffusion 3.x pooled prompt embedding with multiple images by @friedrich in #12306
Fix custom code loading in Automodel by @DN6 in #12571
[modular] better warn message by @yiyixuxu in #12573
[tests] add tests for flux modular (t2i, i2i, kontext) by @sayakpaul in #12566
[modular]pass hub_kwargs to load_config by @yiyixuxu in #12577
ulysses enabling in native attention path by @sywangyi in #12563
Kandinsky ...

Contributors

turian, friedrich, and 50 other contributors


15 Oct 04:14

sayakpaul

v0.35.2

b712696

🐞 fixes for `transformers` models, imports,

All commits

Release: v0.35.1-patch by @sayakpaul (direct commit on v0.35.2-patch)
handle offload_state_dict when initing transformers models by @sayakpaul in #12438
[CI] Fix TRANSFORMERS_FLAX_WEIGHTS_NAME import issue by @DN6 in #12354
Fix PyTorch 2.3.1 compatibility: add version guard for torch.library.… by @Aishwarya0811 in #12206
fix scale_shift_factor being on cpu for wan and ltx by @vladmandic in #12347
Release: v0.35.2-patch by @sayakpaul (direct commit on v0.35.2-patch)

Contributors

DN6, sayakpaul, and 2 other contributors


20 Aug 04:17

sayakpaul

v0.35.1

0f252be

v0.35.1 for improvements in Qwen-Image Edit

Thanks to @naykun for the following PRs that improve Qwen-Image Edit:

Contributors

naykun


19 Aug 03:28

sayakpaul

v0.35.0

f27949d

Diffusers 0.35.0: Qwen Image pipelines, Flux Kontext, Wan 2.2, and more

This release comes packed with new image generation and editing pipelines, a new video pipeline, new training scripts, quality-of-life improvements, and much more. Read the rest of the release notes fully to not miss out on the fun stuff.

New pipelines 🧨

We welcomed new pipelines in this release:

Wan 2.2
Flux-Kontext
Qwen-Image
Qwen-Image-Edit

Wan 2.2 📹

This update to Wan provides significant improvements in video fidelity, prompt adherence, and style. Please check out the official doc to learn more.

Flux-Kontext 🎇

Flux-Kontext is a 12-billion-parameter rectified flow transformer capable of editing images based on text instructions. Please check out the official doc to learn more about it.

Qwen-Image 🌅

After a successful run of delivering language models and vision-language models, the Qwen team is back with an image generation model, which is Apache-2.0 licensed! It achieves significant advances in complex text rendering and precise image editing. To learn more about this powerful model, refer to our docs.

Thanks to @naykun for contributing both Qwen-Image and Qwen-Image-Edit via this PR and this PR.

New training scripts 🎛️

Make these newly added models your own with our training scripts:

Single-file modeling implementations

Following the 🤗 Transformers’ philosophy of single-file modeling implementations, we have started implementing modeling code in single and self-contained files. The Flux Transformer code is one example of this.

Attention refactor

We have massively refactored how we do attention in the models. This allows us to provide support for different attention backends (such as PyTorch native scaled_dot_product_attention, Flash Attention 3, SAGE attention, etc.) in the library seamlessly.

Having attention supported this way also allows us to integrate different parallelization mechanisms, which we’re actively working on. Follow this PR if you’re interested.

Users shouldn’t be affected at all by these changes. Please open an issue if you face any problems.

Regional compilation

Regional compilation trims cold-start latency by only compiling the small and frequently-repeated block(s) of a model - typically a transformer layer - and enables reusing compiled artifacts for every subsequent occurrence. For many diffusion architectures, this delivers the same runtime speedups as full-graph compilation and reduces compile time by 8–10x. Refer to this doc to learn more.

Thanks to @anijain2305 for contributing this feature in this PR.
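The cost accounting behind regional compilation can be illustrated with a small, library-free toy: compile once per distinct block structure, then reuse that artifact for every repeated block, passing per-layer weights in as data. The sketch does not call torch.compile; `Layer`, `RegionCompiler`, and `forward` are hypothetical names, and the "compilation" is simulated by a counter.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Layer:
    kind: str      # structural identity, e.g. "transformer_block"
    weight: float  # per-layer parameter; does not change the compiled code


class RegionCompiler:
    """Caches one simulated compiled artifact per distinct layer structure."""

    def __init__(self) -> None:
        self.cache: Dict[str, Callable[[float, float], float]] = {}
        self.compile_count = 0

    def get(self, kind: str) -> Callable[[float, float], float]:
        if kind not in self.cache:
            # Simulated (expensive) compilation; happens once per structure.
            self.compile_count += 1
            self.cache[kind] = lambda weight, x: weight * x
        return self.cache[kind]


def forward(layers: List[Layer], x: float, compiler: RegionCompiler) -> float:
    for layer in layers:
        # Every repeated block reuses the same artifact with its own weight.
        x = compiler.get(layer.kind)(layer.weight, x)
    return x


layers = [Layer("transformer_block", w) for w in (0.5, 2.0, 4.0, 0.25)]
compiler = RegionCompiler()
output = forward(layers, 3.0, compiler)
print(output, compiler.compile_count)
```

Four blocks run, but only one simulated compilation takes place, which is the effect the paragraph above describes for models built from many identical layers.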

We have also authored a number of posts that center around the use of torch.compile. You can check them out at the links below:

Faster pipeline loading ⚡️

Users can now load pipelines directly on an accelerator device, leading to significantly faster load times. This is particularly evident when loading large pipelines like Wan and Qwen-Image.

from diffusers import DiffusionPipeline
import torch 

ckpt_id = "Qwen/Qwen-Image"
pipe = DiffusionPipeline.from_pretrained(
-    ckpt_id, torch_dtype=torch.bfloat16
- ).to("cuda")
+    ckpt_id, torch_dtype=torch.bfloat16, device_map="cuda"
+ )

You can speed up loading even more by enabling parallelized loading of state dict shards. This is particularly helpful when you’re working with large models like Wan and Qwen-Image, where the model state dicts are typically sharded across multiple files.

import os
os.environ["HF_ENABLE_PARALLEL_LOADING"] = "yes"

# rest of the loading code
...
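Conceptually, parallelized shard loading means reading several shard files concurrently and merging them into one state dict. The following is a minimal sketch of that idea using only the standard library; it is not how Diffusers implements the feature, and `SHARDS`, `load_shard`, and `load_state_dict_parallel` are hypothetical names with in-memory dictionaries standing in for files on disk.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

# Hypothetical in-memory "shard files"; a real checkpoint would be read from disk.
SHARDS: Dict[str, Dict[str, List[float]]] = {
    "model-00001-of-00003": {"block.0.weight": [0.1, 0.2]},
    "model-00002-of-00003": {"block.1.weight": [0.3, 0.4]},
    "model-00003-of-00003": {"block.2.weight": [0.5, 0.6]},
}


def load_shard(name: str) -> Dict[str, List[float]]:
    """Stand-in for reading one shard file (I/O-bound in practice)."""
    return dict(SHARDS[name])


def load_state_dict_parallel(
    shard_names: List[str], max_workers: int = 4
) -> Dict[str, List[float]]:
    state_dict: Dict[str, List[float]] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, so the merge is deterministic.
        for shard in pool.map(load_shard, shard_names):
            state_dict.update(shard)
    return state_dict


state_dict = load_state_dict_parallel(sorted(SHARDS))
print(sorted(state_dict))
```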

Better GGUF integration

@Isotr0py contributed support for native GGUF CUDA kernels in this PR. This should provide an approximately 10% improvement in inference speed.

We have also worked on a tool for converting regular checkpoints to GGUF, letting the community easily share their GGUF checkpoints. Learn more here.

We now support loading of Diffusers format GGUF checkpoints.

You can learn more about all of this in our GGUF official docs.

Modular Diffusers (Experimental)

Modular Diffusers is a system for building diffusion pipelines with individual pipeline blocks. It is highly customisable, with blocks that can be mixed and matched to adapt to or create a pipeline for a specific workflow or multiple workflows.

The API is currently in active development and is being released as an experimental feature. Learn more in our docs.

All commits

[tests] skip instead of returning. by @sayakpaul in #11793
adjust to get CI test cases passed on XPU by @kaixuanliu in #11759
fix deprecation in lora after 0.34.0 release by @sayakpaul in #11802
[chore] post release v0.34.0 by @sayakpaul in #11800
Follow up for Group Offload to Disk by @DN6 in #11760
[rfc][compile] compile method for DiffusionPipeline by @anijain2305 in #11705
[tests] add a test on torch compile for varied resolutions by @sayakpaul in #11776
adjust tolerance criteria for test_float16_inference in unit test by @kaixuanliu in #11809
Flux Kontext by @a-r-r-o-w in #11812
Kontext training by @sayakpaul in #11813
Kontext fixes by @a-r-r-o-w in #11815
remove syncs before denoising in Kontext by @sayakpaul in #11818
[CI] disable onnx, mps, flax from the CI by @sayakpaul in #11803
TorchAO compile + offloading tests by @a-r-r-o-w in #11697
Support dynamically loading/unloading loras with group offloading by @a-r-r-o-w in #11804
[lora] fix: lora unloading behvaiour by @sayakpaul in #11822
[lora]feat: use exclude modules to loraconfig. by @sayakpaul in #11806
ENH: Improve speed of function expanding LoRA scales by @BenjaminBossan in #11834
Remove print statement in SCM Scheduler by @a-r-r-o-w in #11836
[tests] add test for hotswapping + compilation on resolution changes by @sayakpaul in #11825
reset deterministic in tearDownClass by @jiqing-feng in #11785
[tests] Fix failing float16 cuda tests by @a-r-r-o-w in #11835
[single file] Cosmos by @a-r-r-o-w in #11801
[docs] fix single_file example. by @sayakpaul in #11847
Use real-valued instead of complex tensors in Wan2.1 RoPE by @mjkvaak-amd in #11649
[docs] Batch generation by @stevhliu in #11841
[docs] Deprecated pipelines by @stevhliu in #11838
fix norm not training in train_control_lora_flux.py by @Luo-Yihang in #11832
[From Single File] support from_single_file method for WanVACE3DTransformer by @J4BEZ in #11807
[lora] tests for exclude_modules with Wan VACE by @sayakpaul in #11843
update: FluxKontextInpaintPipeline support by @vuongminh1907 in #11820
[Flux Kontext] Support Fal Kontext LoRA by @linoytsaban in #11823
[docs] Add a note of _keep_in_fp32_modules by @a-r-r-o-w in #11851
[benchmarks] overhaul benchmarks by @sayakpaul in #11565
FIX set_lora_device when target layers differ by @BenjaminBossan in #11844
Fix Wan AccVideo/CausVid fuse_lora by @a-r-r-o-w in #11856
[chore] deprecate blip controlnet pipeline. by @sayakpaul in #11877
[docs] fix references in flux pipelines. by @sayakpaul in #11857
[tests] remove tests for deprecated pipelines. by @sayakpaul in #11879
[docs] LoRA metadata by @stevhliu in #11848
[training ] add Kontext i2i training by @sayakpaul in #11858
[CI] Fix big GPU test marker by @DN6 in #11786
First Block Cache by @a-r-r-o-w in #11180
[tests] annotate compilation test classes with bnb by @sayakpaul in #11715
Update chroma.md by @shm4r7 in #11891
[CI] Speed up GPU PR Tests by @DN6 in #11887
Pin k-diffusion for CI by @sayakpaul in #11894
[Docker] update doc builder dockerfile to include quant libs. by @sayakpaul in #11728
[tests] Remove more deprecated tests by @sayakpaul in #11895
[tests] mark the wanvace lora tester flaky by @sayakpaul in #11883
[tests] add compile + offload tests for GGUF. by @sayakpaul in #11740
feat: add multiple input image support in Flux Kontext by @Net-Mist in #11880
Fix unique memory address when doing group-offloading with disk by @sayakpaul in #11767
[SD3] CFG Cutoff fix and official callback by @asomoza in #11890
The Modular Diffusers by @yiyixuxu in #9672
[quant] QoL improvements for pipeline-level quant config by @sayakpaul in ...

Contributors

piercus, okaris, and 48 other contributors


24 Jun 15:13

sayakpaul

v0.34.0

50dea89

Diffusers 0.34.0: New Image and Video Models, Better torch.compile Support, and more

📹 New video generation pipelines

Wan VACE

Wan VACE supports various generation techniques for controllable video generation. It comes in two variants: a 1.3B model for fast iteration and prototyping, and a 14B model for high-quality generation. Some of the capabilities include:

Control to Video (Depth, Pose, Sketch, Flow, Grayscale, Scribble, Layout, Boundary Box, etc.). Recommended library for preprocessing videos to obtain control videos: huggingface/controlnet_aux
Image/Video to Video (first frame, last frame, starting clip, ending clip, random clips)
Inpainting and Outpainting
Subject to Video (faces, object, characters, etc.)
Composition to Video (reference anything, animate anything, swap anything, expand anything, move anything, etc.)

The code snippets available in this pull request demonstrate some examples of how videos can be generated with controllability signals.

Check out the docs to learn more.

Cosmos Predict2 Video2World

Cosmos-Predict2 is a key branch of the Cosmos World Foundation Models (WFMs) ecosystem for Physical AI, specializing in future state prediction through advanced world modeling. It offers two powerful capabilities: text-to-image generation for creating high-quality images from text descriptions, and video-to-world generation for producing visual simulations from video inputs.

The Video2World model comes in 2B and 14B variants. Check out the docs to learn more.

LTX 0.9.7 and Distilled

LTX 0.9.7 and its distilled variants are the latest in the family of models released by Lightricks.

Check out the docs to learn more.

Hunyuan Video Framepack and F1

Framepack is a novel method for enabling long video generation. There are two released variants of Hunyuan Video trained using this technique. Check out the docs to learn more.

FusionX

The FusionX family of models and LoRAs, built on top of Wan2.1-14B, should already be supported. To load the model, use from_single_file():

import torch
from diffusers import WanTransformer3DModel

transformer = WanTransformer3DModel.from_single_file(
    "https://huggingface.co/vrgamedevgirl84/Wan14BT2VFusioniX/blob/main/Wan14Bi2vFusioniX_fp16.safetensors",
    torch_dtype=torch.bfloat16
)

To load the LoRAs, use load_lora_weights():

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-14B-Diffusers",
    torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights(
    "vrgamedevgirl84/Wan14BT2VFusioniX", weight_name="FusionX_LoRa/Wan2.1_T2V_14B_FusionX_LoRA.safetensors"
)

AccVideo and CausVid (only LoRAs)

AccVideo and CausVid are two novel distillation techniques that speed up the generation time of video diffusion models while preserving quality. Diffusers supports loading their extracted LoRAs with their respective models.

🌠 New image generation pipelines

Cosmos Predict2 Text2Image

Text-to-image models from the Cosmos-Predict2 release. The models come in 2B and 14B variants. Check out the docs to learn more.

Chroma

Chroma is an 8.9B parameter model based on FLUX.1-schnell. It’s fully Apache 2.0 licensed, ensuring that anyone can use, modify, and build on top of it. Check out the docs to learn more.

Thanks to @Ednaordinary for contributing it in this PR!

VisualCloze

VisualCloze (A Universal Image Generation Framework via Visual In-Context Learning) is a universal image generation framework based on in-context learning. It offers the following key capabilities:

Support for various in-domain tasks
Generalization to unseen tasks through in-context learning
Unification of multiple tasks into one step, generating both the target image and intermediate results
Support for reverse-engineering conditions from target images

Check out the docs to learn more. Thanks to @lzyhha for contributing this in this PR!

Better `torch.compile` support

We have worked with the PyTorch team to improve how we provide torch.compile() compatibility throughout the library. More specifically, we now test the widely used models like Flux for any recompilation and graph break issues which can get in the way of fully realizing torch.compile() benefits. Refer to the following links to learn more:

Additionally, users can combine offloading with compilation to get a better speed-memory trade-off. Below is an example:


import torch
from diffusers import DiffusionPipeline
torch._dynamo.config.cache_size_limit = 10000

pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipeline.enable_model_cpu_offload()
# Compile.
pipeline.transformer.compile()

image = pipeline(
    prompt="An astronaut riding a horse on Mars",
    guidance_scale=0.,
    height=768,
    width=1360,
    num_inference_steps=4,
    max_sequence_length=256,
).images[0]
print(f"Max memory allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")

This is compatible with group offloading, too. Interested readers can check out the concerned PRs below:

You can substantially reduce memory requirements by combining quantization with offloading and then improving speed with torch.compile(). Below is an example:


from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
from diffusers import AutoModel, FluxPipeline
from transformers import T5EncoderModel

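import torch
torch._dynamo.config.recompile_limit = 1000

# Dtype used for compute and loading below (bfloat16 is assumed here).
torch_dtype = torch.bfloat16
quant_kwargs = {"load_in_4bit": True, "bnb_4bit_compute_dtype": torch_dtype, "bnb_4bit_quant_type": "nf4"}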
text_encoder_2_quant_config = TransformersBitsAndBytesConfig(**quant_kwargs)
dit_quant_config = DiffusersBitsAndBytesConfig(**quant_kwargs)

ckpt_id = "black-forest-labs/FLUX.1-dev"
text_encoder_2 = T5EncoderModel.from_pretrained(
    ckpt_id,
    subfolder="text_encoder_2",
    quantization_config=text_encoder_2_quant_config,
    torch_dtype=torch_dtype,
)
transformer = AutoModel.from_pretrained(
    ckpt_id,
    subfolder="transformer",
    quantization_config=dit_quant_config,
    torch_dtype=torch_dtype,
)
pipe = FluxPipeline.from_pretrained(
    ckpt_id,
    transformer=transformer,
    text_encoder_2=text_encoder_2,
    torch_dtype=torch_dtype,
)
pipe.enable_model_cpu_offload()
pipe.transformer.compile()

image = pipe(
    prompt="An astronaut riding a horse on Mars",
    guidance_scale=3.5,
    height=768,
    width=1360,
    num_inference_steps=28,
    max_sequence_length=512,
).images[0]

Starting from bitsandbytes==0.46.0 onwards, bnb-quantized models should be fully compatible with torch.compile() without graph-breaks. This means that when compiling a bnb-quantized model, users can do: model.compile(fullgraph=True). This can significantly improve speed while still providing memory benefits. The figure below provides a comparison with Flux.1-Dev. Refer to this benchmarking script to learn more.

Note that for 4-bit bnb models, compiling with fullgraph=True currently requires a PyTorch nightly build.
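The version requirement above can be turned into a small guard. This is a minimal sketch, not part of Diffusers or bitsandbytes: `parse_release` and `compile_kwargs` are hypothetical helper names, and the parser only looks at the leading numeric release components (a dedicated version-parsing package would treat pre-release tags more rigorously). It also does not check the PyTorch nightly requirement for 4-bit models.

```python
import re

# Minimum bitsandbytes release for graph-break-free compilation, per the note above.
MIN_BNB_FOR_FULLGRAPH = (0, 46, 0)


def parse_release(version: str) -> tuple:
    """Return the leading numeric components, padded to three, e.g. '0.46.0.dev0' -> (0, 46, 0)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component such as 'dev0'
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def compile_kwargs(bnb_version: str) -> dict:
    """Pick keyword arguments for model.compile() based on the bitsandbytes version."""
    if parse_release(bnb_version) >= MIN_BNB_FOR_FULLGRAPH:
        return {"fullgraph": True}
    return {}


print(compile_kwargs("0.45.5"))
print(compile_kwargs("0.46.0"))
```

With a real installation, the version string could come from `importlib.metadata.version("bitsandbytes")`, and the result could be passed along as `pipe.transformer.compile(**compile_kwargs(version))`.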

Huge shoutout to @anijain2305 and @StrongerXi from the PyTorch team for the incredible support.

PipelineQuantizationConfig

Users can now provide a quantization config while initializing a pipeline:

import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig

pipeline_quant_config = PipelineQuantizationConfig(
     quant_backend="bitsandbytes_4bit",
     quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
     components_to_quantize=["transformer", "text_encoder_2"],
)
pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe("photo of a cute dog").images[0]

This lowers the barrier to entry for users who want to use quantization without writing much code. Refer to the documentation to learn more about [different configurations](https://huggingface.co/docs/diffusers/main/en/quantization/overview...

Contributors

iamwavecut, apolinario, and 86 other contributors


10 Apr 05:38

yiyixuxu

v0.33.1

375ec93

v0.33.1: fix ftfy import

All commits

fix ftfy import for wan pipelines by @yiyixuxu in #11262

Contributors

yiyixuxu


09 Apr 13:37

sayakpaul

v0.33.0

a2ed6b4

Diffusers 0.33.0: New Image and Video Models, Memory Optimizations, Caching Methods, Remote VAEs, New Training Scripts, and more

New Pipelines for Video Generation

Wan 2.1

Wan2.1 is a comprehensive and open suite of video foundation models that pushes the boundaries of video generation. The release includes four model variants and three pipelines, for text-to-video, image-to-video, and video-to-video.

Wan-AI/Wan2.1-T2V-1.3B-Diffusers
Wan-AI/Wan2.1-T2V-14B-Diffusers
Wan-AI/Wan2.1-I2V-14B-480P-Diffusers
Wan-AI/Wan2.1-I2V-14B-720P-Diffusers

Check out the docs here to learn more.

LTX Video 0.9.5

LTX Video 0.9.5 is the updated version of the super-fast LTX Video model series. The latest model introduces additional conditioning options, such as keyframe-based animation and video extension (both forward and backward).

To support these additional conditioning inputs, we’ve introduced the LTXConditionPipeline and LTXVideoCondition object.

To learn more about the usage, check out the docs here.

Hunyuan Image to Video

Hunyuan utilizes a pre-trained Multimodal Large Language Model (MLLM) with a Decoder-Only architecture as the text encoder. The input image is processed by the MLLM to generate semantic image tokens. These tokens are then concatenated with the video latent tokens, enabling comprehensive full-attention computation across the combined data and seamlessly integrating information from both the image and its associated caption.

To learn more, check out the docs here.

Others

EasyAnimateV5 (thanks to @bubbliiiing for contributing this in this PR)
ConsisID (thanks to @SHYuanBest for contributing this in this PR)

New Pipelines for Image Generation

Sana-Sprint

SANA-Sprint is an efficient diffusion model for ultra-fast text-to-image generation. SANA-Sprint is built on a pre-trained foundation model and augmented with hybrid distillation, dramatically reducing inference steps from 20 to 1-4, rivaling the quality of models like Flux.

Shoutout to @lawrence-cj for their help and guidance on this PR.

Check out the pipeline docs of SANA-Sprint to learn more.

Lumina2

Lumina-Image-2.0 is a 2B parameter flow-based diffusion transformer for text-to-image generation released under the Apache 2.0 license.

Check out the docs to learn more. Thanks to @zhuole1025 for contributing this through this PR.

One can also LoRA fine-tune Lumina2, taking advantage of its Apache 2.0 licensing. Check out the guide for more details.

Omnigen

OmniGen is a unified image generation model that can handle multiple tasks including text-to-image, image editing, subject-driven generation, and various computer vision tasks within a single framework. The model consists of a VAE, and a single transformer based on Phi-3 that handles text and image encoding as well as the diffusion process.

Check out the docs to learn more about OmniGen. Thanks to @staoxiao for contributing OmniGen in this PR.

Others

CogView4 (thanks to @zRzRzRzRzRzRzR for contributing CogView4 in this PR)

New Memory Optimizations

Layerwise Casting

PyTorch supports torch.float8_e4m3fn and torch.float8_e5m2 as weight storage dtypes, but they can’t be used for computation on many devices due to unimplemented kernel support.

However, you can still use these dtypes to store model weights in FP8 precision and upcast them to a widely supported dtype such as torch.float16 or torch.bfloat16 on-the-fly when the layers are used in the forward pass. This is known as layerwise weight-casting. This can potentially cut down the VRAM requirements of a model by 50%.

Code

import torch
from diffusers import CogVideoXPipeline, CogVideoXTransformer3DModel
from diffusers.utils import export_to_video

model_id = "THUDM/CogVideoX-5b"

# Load the model in bfloat16 and enable layerwise casting
transformer = CogVideoXTransformer3DModel.from_pretrained(model_id, subfolder="transformer", torch_dtype=torch.bfloat16)
transformer.enable_layerwise_casting(storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16)

# Load the pipeline
pipe = CogVideoXPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = (
    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
    "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
    "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
    "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
    "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
    "atmosphere of this unique musical performance."
)
video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
export_to_video(video, "output.mp4", fps=8)
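The "up to 50%" figure follows from simple byte accounting, which can be sketched as below. This is a toy estimate, not how Diffusers measures memory: the function names and the layer sizes are hypothetical, and it ignores activations and any layers that are kept in higher precision.

```python
# Bytes per element for the dtypes discussed above.
BYTES_PER_ELEMENT = {"float8_e4m3fn": 1, "float8_e5m2": 1, "float16": 2, "bfloat16": 2}


def weight_bytes_plain(layer_params, dtype="bfloat16"):
    """Weight memory when every layer is held in a single dtype."""
    return sum(layer_params) * BYTES_PER_ELEMENT[dtype]


def weight_bytes_layerwise(layer_params, storage_dtype="float8_e4m3fn", compute_dtype="bfloat16"):
    """Peak weight memory when only the layer currently running is upcast."""
    storage = BYTES_PER_ELEMENT[storage_dtype]
    compute = BYTES_PER_ELEMENT[compute_dtype]
    stored = sum(layer_params) * storage
    # The largest layer sets the peak: while it runs, it occupies
    # compute-dtype memory instead of storage-dtype memory.
    extra_for_active_layer = max(layer_params) * (compute - storage)
    return stored + extra_for_active_layer


# Hypothetical model: ten layers with one million parameters each.
layers = [1_000_000] * 10
baseline = weight_bytes_plain(layers)
layerwise = weight_bytes_layerwise(layers)
print(f"bfloat16 everywhere: {baseline / 1e6:.1f} MB")
print(f"FP8 storage, bfloat16 compute: {layerwise / 1e6:.1f} MB")
```

The more layers a model has relative to its largest layer, the closer the saving gets to the 50% bound, since the upcast cost is paid for only one layer at a time.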

Group Offloading

Group offloading is the middle ground between sequential and model offloading. It works by offloading groups of internal layers (either torch.nn.ModuleList or torch.nn.Sequential), which uses less memory than model-level offloading. It is also faster than sequential-level offloading because the number of device synchronizations is reduced.

On CUDA devices, you can also enable layer prefetching with CUDA streams. The next layer to be executed is loaded onto the accelerator device while the current layer is being executed, which makes inference substantially faster while still keeping VRAM requirements very low. This overlaps computation with data transfer.

One thing to note is that using CUDA streams can cause a considerable spike in CPU RAM usage. Please ensure that the available CPU RAM is at least twice the size of the model if you set use_stream=True. You can reduce CPU RAM usage by setting low_cpu_mem_usage=True. This should limit the CPU RAM used to roughly the size of the model, but it introduces slight latency in the inference process.

You can also use record_stream=True when using use_stream=True to obtain more speedups at the expense of slightly increased memory usage.

Code

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Load the pipeline
onload_device = torch.device("cuda")
offload_device = torch.device("cpu")
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

# We can utilize the enable_group_offload method for Diffusers model implementations
pipe.transformer.enable_group_offload(
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="leaf_level",
    use_stream=True,
)

prompt = (
    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
    "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
    "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
    "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
    "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
    "atmosphere of this unique musical performance."
)
video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
# This used about 14.79 GB. It can be reduced further by enabling tiling and applying leaf_level offloading throughout the pipeline.
print(f"Max memory allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
export_to_video(video, "output.mp4", fps=8)
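The trade-off between group size, peak device memory, and the number of transfers can be illustrated with a toy simulation. This is a sketch under stated assumptions rather than the Diffusers implementation: `make_groups` and `simulate_group_offload` are hypothetical names, layer sizes are made-up values in MB, and `prefetch=True` only mimics the idea behind use_stream=True by keeping the next group resident alongside the current one.

```python
def make_groups(layer_sizes, num_blocks_per_group):
    """Split a flat list of layer sizes into consecutive groups."""
    return [
        layer_sizes[i:i + num_blocks_per_group]
        for i in range(0, len(layer_sizes), num_blocks_per_group)
    ]


def simulate_group_offload(layer_sizes, num_blocks_per_group, prefetch=False):
    """Return peak on-device weight memory and the number of host-to-device transfers."""
    groups = make_groups(layer_sizes, num_blocks_per_group)
    peak = 0
    for index, group in enumerate(groups):
        resident = sum(group)  # the group currently executing
        if prefetch and index + 1 < len(groups):
            resident += sum(groups[index + 1])  # next group is loaded in the background
        peak = max(peak, resident)
    return {"peak_device_mb": peak, "num_transfers": len(groups)}


# Hypothetical model: eight blocks of 100 MB each.
blocks = [100] * 8
print(simulate_group_offload(blocks, num_blocks_per_group=1))
print(simulate_group_offload(blocks, num_blocks_per_group=2))
print(simulate_group_offload(blocks, num_blocks_per_group=2, prefetch=True))
print(simulate_group_offload(blocks, num_blocks_per_group=8))
```

Smaller groups lower the peak but need more transfers, which is the sequential-offloading end of the spectrum; a single group covering every block behaves like model-level offloading. Prefetching raises the peak by one extra group in exchange for hiding transfer time behind computation.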

Group offloading can also be applied to non-Diffusers models such as text encoders from the transformers library.

Code

import torch
from diffusers import CogVideoXPipeline
from diffusers.hooks import apply_group_offloading
from diffusers.utils import export_to_video

# Load the pipeline
onload_device = torch.device("cuda")
offload_device = torch.device("cpu")
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

# For any other model implementation, the apply_group_offloading function can be used
apply_group_offloading(
    pipe.text_encoder,
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="block_level",
    num_blocks_per_group=2,
)

Remote Components

Remote components are an experimental feature designed to offload memory-intensive steps of t...

Contributors

catwell, dimitribarbot, and 88 other contributors


15 Jan 16:46

DN6

v0.32.2

560fb5f

v0.32.2

Fixes for Flux Single File loading, LoRA loading for 4bit BnB Flux, Hunyuan Video

This patch release:

Fixes a regression in loading ComfyUI-format single-file checkpoints for Flux
Fixes a regression in loading LoRAs with bitsandbytes 4-bit quantized Flux models
Adds unload_lora_weights for Flux Control
Fixes a bug that prevents Hunyuan Video from running with batch size > 1
Allows Hunyuan Video to load LoRAs created from the original repository code

All commits

[Single File] Fix loading Flux Dev finetunes with Comfy Prefix by @DN6 in #10545
[CI] Update HF Token on Fast GPU Model Tests by @DN6 in #10570
[CI] Update HF Token in Fast GPU Tests by @DN6 in #10568
Fix batch > 1 in HunyuanVideo by @hlky in #10548
Fix HunyuanVideo produces NaN on PyTorch<2.5 by @hlky in #10482
Fix hunyuan video attention mask dim by @a-r-r-o-w in #10454
[LoRA] Support original format loras for HunyuanVideo by @a-r-r-o-w in #10376
[LoRA] feat: support loading loras into 4bit quantized Flux models. by @sayakpaul in #10578
[LoRA] clean up load_lora_into_text_encoder() and fuse_lora() copied from by @sayakpaul in #10495
[LoRA] feat: support unload_lora_weights() for Flux Control. by @sayakpaul in #10206
Fix Flux multiple Lora loading bug by @maxs-kan in #10388
[LoRA] fix: lora unloading when using expanded Flux LoRAs. by @sayakpaul in #10397

Contributors

DN6, sayakpaul, and 3 other contributors


25 Dec 12:34

a-r-r-o-w

v0.32.1

e8aacda

v0.32.1

TorchAO Quantizer fixes

This patch release fixes a few bugs related to the TorchAO Quantizer introduced in v0.32.0.

Importing Diffusers would raise an error in PyTorch versions lower than 2.3.0. This should no longer be a problem.
Device Map does not work as expected when using the quantizer. We now raise an error if it is used. Support for using device maps with different quantization backends will be added in the near future.
Quantization was not performed due to faulty logic. This is now fixed and better tested.

Refer to our documentation to learn more about how to use different quantization backends.

All commits

make style for #10368 by @yiyixuxu in #10370
fix test pypi installation in the release workflow by @sayakpaul in #10360
Fix TorchAO related bugs; revert device_map changes by @a-r-r-o-w in #10371

Contributors

yiyixuxu, sayakpaul, and a-r-r-o-w


Releases: huggingface/diffusers
