[Flux.1] improve pos embed for ascend npu by computing on npu #12897
Conversation
@sayakpaul Please review this PR, thanks.
sayakpaul left a comment:
Thanks! We should also take care of others that follow this pattern. For example:
diffusers/src/diffusers/models/transformers/transformer_flux2.py, lines 838 to 845 in f6b6a71:

```python
if is_torch_npu_available():
    freqs_cos_image, freqs_sin_image = self.pos_embed(img_ids.cpu())
    image_rotary_emb = (freqs_cos_image.npu(), freqs_sin_image.npu())
    freqs_cos_text, freqs_sin_text = self.pos_embed(txt_ids.cpu())
    text_rotary_emb = (freqs_cos_text.npu(), freqs_sin_text.npu())
else:
    image_rotary_emb = self.pos_embed(img_ids)
    text_rotary_emb = self.pos_embed(txt_ids)
```
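If the same change is applied here, the CPU round-trip and the copies back to device disappear; a minimal sketch of what the simplified call site could look like (assuming CANN >= 8.3.RC1, where torch.repeat_interleave is fast on Ascend):

```python
# Sketch of the simplified call site once pos_embed runs directly on NPU.
# The img_ids.cpu() offload and the .npu() copies are no longer needed.
image_rotary_emb = self.pos_embed(img_ids)
text_rotary_emb = self.pos_embed(txt_ids)
```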
Thanks for your suggestion. I tested the FLUX.2-Dev, LongCat-Image, and Ovis-Image models on the Ascend platform; their performance also improved after switching the position embedding calculation from CPU back to the NPU.
Ohh, but the pattern was added in this PR: #12534. Do you have any idea what caused the difference?
The primary reason is that the torch.repeat_interleave operator has been optimized on Ascend starting from CANN 8.3.RC1; PR #12534 was likely tested on an earlier CANN version. The relevant call site is in diffusers/src/diffusers/models/embeddings.py, lines 1171 to 1175 in f6b6a71.
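For context, the rotary-embedding helper around those lines expands the per-position frequencies with torch.repeat_interleave; a rough sketch of that pattern (illustrative, not the exact diffusers code):

```python
import torch

def rotary_freqs(pos: torch.Tensor, dim: int, theta: float = 10000.0):
    # pos: (N,) position ids; returns cos/sin tables of shape (N, dim).
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2, device=pos.device).float() / dim))
    angles = torch.outer(pos.float(), freqs)  # (N, dim // 2)
    # repeat_interleave duplicates each frequency for the (x, y) rotation pair;
    # this is the op that was slow on Ascend before CANN 8.3.RC1.
    cos = angles.cos().repeat_interleave(2, dim=1)
    sin = angles.sin().repeat_interleave(2, dim=1)
    return cos, sin
```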
@zhangtao0408 |
Is it possible to check the effective CANN version? If so, perhaps we can fall back to the CPU path when the CANN version is below 8.3.RC1.
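A minimal sketch of such a gate, assuming the CANN toolkit records its version in a version.cfg file under its install directory (the path, file format, and helper names below are assumptions for illustration, not a documented torch_npu API):

```python
import os
import re
from typing import Optional

from packaging import version

def get_cann_version() -> Optional[str]:
    # Assumption: CANN toolkit installed at the conventional location and
    # recording its version in version.cfg (e.g. "version=8.3.RC1").
    root = os.environ.get("ASCEND_TOOLKIT_HOME", "/usr/local/Ascend/ascend-toolkit/latest")
    try:
        with open(os.path.join(root, "version.cfg")) as f:
            match = re.search(r"(\d+\.\d+(\.\w+)?)", f.read())
            return match.group(1) if match else None
    except OSError:
        return None

def pos_embed_on_npu() -> bool:
    # Compute pos_embed on NPU only when repeat_interleave is known to be fast.
    cann = get_cann_version()
    return cann is not None and version.parse(cann) >= version.parse("8.3.RC1")
```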
What does this PR do?
Moving pos_embed computation from CPU back to NPU results in a 1.07x speedup in Flux.1's end-to-end latency.
Since CANN 8.3.RC1, the poor performance of the torch.repeat_interleave operator has been fixed. Results are shown below.
Tested Hardware
Ascend 910B3
Repro Code
1. FLUX.1-dev
2. FLUX.2-Dev
3. LongCat-Image
4. Ovis-Image
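A minimal sketch of how such an end-to-end latency comparison can be timed on Ascend for the Flux.1 case (the prompt, resolution defaults, and step counts below are assumptions for illustration):

```python
import time

import torch
import torch_npu  # registers the "npu" device with PyTorch on Ascend
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("npu")

prompt = "A photo of a cat holding a sign that says hello world"

# Warm up once so one-time initialization does not skew the measurement.
pipe(prompt, num_inference_steps=5)

torch.npu.synchronize()
start = time.perf_counter()
image = pipe(prompt, num_inference_steps=28).images[0]
torch.npu.synchronize()
print(f"end-to-end latency: {time.perf_counter() - start:.2f}s")
```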
Before submitting
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.