[Flux.1] improve pos embed for ascend npu by computing on npu #12897
Conversation
@sayakpaul Please review this PR, thanks.
sayakpaul left a comment:
Thanks! We should also take care of others that follow this pattern. For example:
diffusers/src/diffusers/models/transformers/transformer_flux2.py, lines 838 to 845 in f6b6a71:

```python
if is_torch_npu_available():
    freqs_cos_image, freqs_sin_image = self.pos_embed(img_ids.cpu())
    image_rotary_emb = (freqs_cos_image.npu(), freqs_sin_image.npu())
    freqs_cos_text, freqs_sin_text = self.pos_embed(txt_ids.cpu())
    text_rotary_emb = (freqs_cos_text.npu(), freqs_sin_text.npu())
else:
    image_rotary_emb = self.pos_embed(img_ids)
    text_rotary_emb = self.pos_embed(txt_ids)
```
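If the same change is applied here, the CPU round-trip and the copies back to device disappear; a minimal sketch of what the simplified call site could look like (assuming CANN >= 8.3.RC1, where torch.repeat_interleave is fast on Ascend):

```python
# Sketch of the simplified call site once pos_embed runs directly on NPU.
# The img_ids.cpu() offload and the .npu() copies are no longer needed.
image_rotary_emb = self.pos_embed(img_ids)
text_rotary_emb = self.pos_embed(txt_ids)
```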
Thanks for your suggestion. I tested the FLUX.2-Dev, LongCat-Image, and Ovis-Image models on the Ascend platform; their performance also improved after switching the position embedding calculation from CPU back to the NPU.
Ohh, but the pattern was added in this PR: #12534. Do you have any idea what caused the difference?
The primary reason is that the torch.repeat_interleave operator has been optimized on Ascend starting from CANN 8.3.RC1; PR #12534 was likely tested on an earlier CANN version. The relevant call site is in diffusers/src/diffusers/models/embeddings.py, lines 1171 to 1175 in f6b6a71.
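For context, the rotary-embedding helper around those lines expands the per-position frequencies with torch.repeat_interleave; a rough sketch of that pattern (illustrative, not the exact diffusers code):

```python
import torch

def rotary_freqs(pos: torch.Tensor, dim: int, theta: float = 10000.0):
    # pos: (N,) position ids; returns cos/sin tables of shape (N, dim).
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2, device=pos.device).float() / dim))
    angles = torch.outer(pos.float(), freqs)  # (N, dim // 2)
    # repeat_interleave duplicates each frequency for the (x, y) rotation pair;
    # this is the op that was slow on Ascend before CANN 8.3.RC1.
    cos = angles.cos().repeat_interleave(2, dim=1)
    sin = angles.sin().repeat_interleave(2, dim=1)
    return cos, sin
```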
@zhangtao0408 |
Is it possible to check the effective CANN version? If so, perhaps we can fall back to the CPU path when the CANN version is below 8.3.RC1.
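A minimal sketch of such a gate, assuming the CANN toolkit records its version in a version.cfg file under its install directory (the path, file format, and helper names below are assumptions for illustration, not a documented torch_npu API):

```python
import os
import re
from typing import Optional

from packaging import version

def get_cann_version() -> Optional[str]:
    # Assumption: CANN toolkit installed at the conventional location and
    # recording its version in version.cfg (e.g. "version=8.3.RC1").
    root = os.environ.get("ASCEND_TOOLKIT_HOME", "/usr/local/Ascend/ascend-toolkit/latest")
    try:
        with open(os.path.join(root, "version.cfg")) as f:
            match = re.search(r"(\d+\.\d+(\.\w+)?)", f.read())
            return match.group(1) if match else None
    except OSError:
        return None

def pos_embed_on_npu() -> bool:
    # Compute pos_embed on NPU only when repeat_interleave is known to be fast.
    cann = get_cann_version()
    return cann is not None and version.parse(cann) >= version.parse("8.3.RC1")
```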
What does this PR do?
Moving pos_embed computation from CPU back to NPU results in a 1.07x speedup in Flux.1's end-to-end latency.
Since CANN 8.3.RC1, the poor performance of the torch.repeat_interleave operator has been fixed. Results are shown below.
Tested Hardware
Ascend 910B3
Repro Code
1. FLUX.1-dev
2. FLUX.2-Dev
3. LongCat-Image
4. Ovis-Image
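A minimal sketch of how such an end-to-end latency comparison can be timed on Ascend for the Flux.1 case (the prompt, resolution defaults, and step counts below are assumptions for illustration):

```python
import time

import torch
import torch_npu  # registers the "npu" device with PyTorch on Ascend
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("npu")

prompt = "A photo of a cat holding a sign that says hello world"

# Warm up once so one-time initialization does not skew the measurement.
pipe(prompt, num_inference_steps=5)

torch.npu.synchronize()
start = time.perf_counter()
image = pipe(prompt, num_inference_steps=28).images[0]
torch.npu.synchronize()
print(f"end-to-end latency: {time.perf_counter() - start:.2f}s")
```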
Before submitting
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.