[WIP]ggml-hexagon: improve leftover element calc at vec_dot_f16_f32
#18336
This reverts commit 8600ecd20d6c902fe16271d6af1e59504eff4a27.
- volatile HVX_Vector hi = Q6_Vqf32_vmpy_VsfVsf(Q6_Vsf_equals_Vqf32(Q6_V_hi_W(xp)), Q6_V_hi_W(yp));
- volatile HVX_Vector lo = Q6_Vqf32_vmpy_VsfVsf(Q6_Vsf_equals_Vqf32(Q6_V_lo_W(xp)), Q6_V_lo_W(yp));
+ HVX_Vector hi = Q6_Vqf32_vmpy_VsfVsf(Q6_Vsf_equals_Vqf32(Q6_V_hi_W(xp)), Q6_V_hi_W(yp));
+ HVX_Vector lo = Q6_Vqf32_vmpy_VsfVsf(Q6_Vsf_equals_Vqf32(Q6_V_lo_W(xp)), Q6_V_lo_W(yp));
Based on my observations, using `volatile` here seems to have several drawbacks:
- Prevents inlining: with `volatile`, the binary retains a separate `vec_dot_f16_f32` function instead of inlining it into `matmul_f16_f32`.
- Generates extra store instructions: I noticed that the compiler generates extra `vmem` instructions to write the result registers to the stack, as seen in the highlight below. This increases memory bandwidth pressure, which slows down processing.
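To see the effect described above outside of HVX, here is a minimal scalar sketch. It is not the kernel from this PR: `dot_volatile` and `dot_plain` are hypothetical names, and plain `float` stands in for the `HVX_Vector` partial products. The two functions return the same value; the difference is only in the generated code, which can be inspected by compiling with optimizations and comparing the assembly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar stand-in for the HVX kernel (not the code in this PR).
   Every access to a volatile object is an observable side effect, so the
   compiler has to keep `hi` and `lo` in memory: one store when each is
   written and one load when it is read back. */
float dot_volatile(const float *x, const float *y, size_t n) {
    assert(x != NULL && y != NULL);
    float sum = 0.0f;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        volatile float hi = x[i + 1] * y[i + 1];
        volatile float lo = x[i] * y[i];
        sum += hi + lo;
    }
    if (i < n) {
        sum += x[i] * y[i]; /* odd leftover element */
    }
    return sum;
}

/* Same arithmetic without volatile: `hi` and `lo` are free to stay in
   registers, and the function is an ordinary inlining candidate. */
float dot_plain(const float *x, const float *y, size_t n) {
    assert(x != NULL && y != NULL);
    float sum = 0.0f;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        float hi = x[i + 1] * y[i + 1];
        float lo = x[i] * y[i];
        sum += hi + lo;
    }
    if (i < n) {
        sum += x[i] * y[i]; /* odd leftover element */
    }
    return sum;
}
```

Whether a given compiler still inlines the `volatile` variant depends on the toolchain; the inlining behaviour reported in this review is an observation about the Hexagon build, not something this sketch proves.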

  uint32_t prof_usecs;
  uint32_t prof_cycles;
  uint32_t prof_pkts;
+ uint64_t avail_mem_bytes = kMaxMemPerSessInBytes; // available memory for allocations
Here we add a new field to track the available bytes per session.
@chraac Re: f16/f32 matmuls. I've been playing with the Flash Attention implementation and ended up overhauling how f16/f32 is handled in many places. We should do as much as possible in f16 and only accumulate into f32. Leftover handling is quite simple with f16 using the mask predicate trick.
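For readers unfamiliar with the idea, the following is a scalar sketch of masked leftover handling, not the HVX implementation. It assumes a hypothetical 8-lane vector (`LANES`) and uses `float` in place of f16 inputs, since portable C has no standard half type. The point is that the final partial block goes through the same loop body as the full blocks, with a per-lane predicate zeroing the lanes past the end.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 8 /* hypothetical vector width for the demo */

/* Dot product in LANES-wide blocks with per-lane f32 accumulators.
   There is no separate scalar tail loop: in the last block, lanes at or
   beyond `valid` are masked to zero and so contribute nothing. */
float dot_masked_tail(const float *x, const float *y, size_t n) {
    assert(n == 0 || (x != NULL && y != NULL));
    float acc[LANES] = {0};

    for (size_t base = 0; base < n; base += LANES) {
        size_t valid = (n - base < LANES) ? (n - base) : LANES;
        for (size_t lane = 0; lane < LANES; ++lane) {
            int keep = lane < valid; /* predicate bit for this lane */
            /* Vector hardware would load a full vector and then apply the
               predicate; in scalar C the load itself is guarded so that
               nothing is read past the end of the arrays. */
            float xv = keep ? x[base + lane] : 0.0f;
            float yv = keep ? y[base + lane] : 0.0f;
            acc[lane] += xv * yv;
        }
    }

    /* Horizontal reduction of the per-lane accumulators. */
    float sum = 0.0f;
    for (size_t lane = 0; lane < LANES; ++lane) {
        sum += acc[lane];
    }
    return sum;
}
```

Because masked lanes multiply to zero, the accumulate step is identical for full and partial blocks, which is what keeps the leftover path simple.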
@max-krasnyansky
Only the DMA engine has a 64-bit address space; the rest of the HW has a 32-bit address space.
Performance
Device: 8Gen2
Baseline: ed7597771
Optimization: 2058f28b3