
Conversation

@chraac (Contributor) commented Dec 24, 2025

Performance

Device: 8Gen2

Baseline: ed7597771
Optimization: 2058f28b3

| Operation | Params | Baseline (GFLOPS) | Optimization (GFLOPS) | Speedup |
|---|---|---|---|---|
| MUL_MAT (f16, f32) | k=128, n=1 | 3.68 | 5.74 | 1.56x |
| MUL_MAT (f16, f32) | k=14336, n=1 | 3.43 | 7.26 | 2.12x |
| MUL_MAT (f16, f32) | k=14336, n=2 | 3.46 | 7.29 | 2.11x |
| MUL_MAT (f16, f32) | k=14336, n=3 | 3.46 | 7.29 | 2.11x |
| MUL_MAT (f16, f32) | k=14336, n=4 | 3.46 | 7.34 | 2.12x |
| MUL_MAT (f16, f32) | k=14336, n=5 | 3.46 | 7.31 | 2.11x |
| MUL_MAT (f16, f32) | k=14336, n=8 | 3.48 | 7.35 | 2.11x |

@chraac marked this pull request as draft on December 24, 2025 03:01
github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Dec 24, 2025
-volatile HVX_Vector hi = Q6_Vqf32_vmpy_VsfVsf(Q6_Vsf_equals_Vqf32(Q6_V_hi_W(xp)), Q6_V_hi_W(yp));
-volatile HVX_Vector lo = Q6_Vqf32_vmpy_VsfVsf(Q6_Vsf_equals_Vqf32(Q6_V_lo_W(xp)), Q6_V_lo_W(yp));
+HVX_Vector hi = Q6_Vqf32_vmpy_VsfVsf(Q6_Vsf_equals_Vqf32(Q6_V_hi_W(xp)), Q6_V_hi_W(yp));
+HVX_Vector lo = Q6_Vqf32_vmpy_VsfVsf(Q6_Vsf_equals_Vqf32(Q6_V_lo_W(xp)), Q6_V_lo_W(yp));
@chraac (Contributor, Author) commented Dec 24, 2025

Based on my observations, using volatile here has several drawbacks (a sketch of the resulting non-volatile pattern follows after this list):

  • Prevents inlining: with volatile, the binary retains a separate vec_dot_f16_f32 function instead of inlining it into matmul_f16_f32. [screenshot: disassembly showing the retained vec_dot_f16_f32 call]
  • Generates extra store instructions: the compiler emits extra vmem instructions to spill the result registers to the stack, as seen in the highlighted disassembly below. This increases memory bandwidth pressure, which hurts processing speed. [screenshot: highlighted vmem stores]
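For illustration, a minimal sketch of the non-volatile pattern, assuming xp holds the f16 operand already widened to a qf32 vector pair and yp holds the f32 operand as an sf vector pair; the function name, accumulator handling, and surrounding loop are illustrative, not the exact code in this PR:

```c
#include <hexagon_types.h>       // HVX_Vector / HVX_VectorPair (Hexagon SDK)
#include <hvx_hexagon_protos.h>  // Q6_* HVX intrinsics (Hexagon SDK)

// One inner step of an f16 x f32 dot product. Keeping hi/lo and the accumulators
// in plain (non-volatile) HVX_Vector values lets the compiler keep everything in
// vector registers and inline this body into the matmul loop.
static inline void vec_dot_f16_f32_step(HVX_VectorPair xp, HVX_VectorPair yp,
                                        HVX_Vector * acc_hi, HVX_Vector * acc_lo) {
    // Convert the widened f16 halves from qf32 to IEEE sf, then multiply by the f32 halves.
    HVX_Vector hi = Q6_Vqf32_vmpy_VsfVsf(Q6_Vsf_equals_Vqf32(Q6_V_hi_W(xp)), Q6_V_hi_W(yp));
    HVX_Vector lo = Q6_Vqf32_vmpy_VsfVsf(Q6_Vsf_equals_Vqf32(Q6_V_lo_W(xp)), Q6_V_lo_W(yp));

    // Accumulate in qf32; no stack spill is forced because nothing is volatile.
    *acc_hi = Q6_Vqf32_vadd_Vqf32Vqf32(*acc_hi, hi);
    *acc_lo = Q6_Vqf32_vadd_Vqf32Vqf32(*acc_lo, lo);
}
```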

 uint32_t prof_usecs;
 uint32_t prof_cycles;
 uint32_t prof_pkts;
+uint64_t avail_mem_bytes = kMaxMemPerSessInBytes;  // available memory for allocations
@chraac (Contributor, Author) commented Dec 28, 2025

Here we add a new field to track the available bytes per session.
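For context, a hedged sketch of how such a per-session counter could gate allocations; the struct and helper names below are hypothetical, not the actual code in this PR:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical session bookkeeping; only avail_mem_bytes mirrors the new field,
// which starts at kMaxMemPerSessInBytes.
struct htp_session {
    uint64_t avail_mem_bytes;
};

// Hypothetical helper: reserve `size` bytes from the session budget and refuse
// the allocation once the budget is exhausted.
static bool session_reserve_mem(struct htp_session * sess, size_t size) {
    if (sess->avail_mem_bytes < size) {
        return false;  // would exceed the per-session limit
    }
    sess->avail_mem_bytes -= size;
    return true;
}
```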

@max-krasnyansky (Collaborator) commented

@chraac
Apologies for the delayed review. I looked at it earlier and the host-side changes were breaking all kinds of things.
We don't need to enforce a 2GB limit; the actual limit is 3.8GB, so it's OK to spill over.

Re: f16/f32 matmuls. I've been playing with a Flash Attention implementation and ended up overhauling how the f16/f32 handling works in many places. We should do as much as possible in f16 and only accumulate into f32. Leftover handling is quite simple with f16 using the mask-predicate trick.
Take a look at #18611
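For reference, a hedged sketch of the kind of mask-predicate leftover handling this could refer to (illustrative only, not the code from #18611; it assumes the source buffer is padded so a full-vector load at the tail is safe):

```c
#include <hexagon_types.h>       // HVX_Vector / HVX_UVector / HVX_VectorPred (Hexagon SDK)
#include <hvx_hexagon_protos.h>  // Q6_* HVX intrinsics (Hexagon SDK)
#include <stdint.h>

// Load the final, partially filled f16 vector: lanes beyond `nleft` elements are
// zeroed via a byte predicate, so the result can be fed straight into the usual
// f16 multiply/accumulate (zero lanes contribute nothing to the dot product).
static inline HVX_Vector hvx_load_f16_tail(const __fp16 * src, uint32_t nleft) {
    HVX_Vector v = *(const HVX_UVector *) src;                   // unaligned full-vector load
    HVX_VectorPred keep = Q6_Q_vsetq_R(nleft * sizeof(__fp16));  // predicate covers the first nleft*2 bytes
    return Q6_V_vmux_QVV(keep, v, Q6_V_vzero());                 // zero out the lanes past the tail
}
```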

@chraac (Contributor, Author) commented Jan 6, 2026

> Apologies for the delayed review. I looked at it earlier and the host-side changes were breaking all kinds of things.
> We don't need to enforce a 2GB limit; the actual limit is 3.8GB, so it's OK to spill over.

@max-krasnyansky
No worries. Regarding the limit, I assume it comes from the 32-bit address space of the Hexagon RTOS. Given that newer versions support a 64-bit address space, will we be able to increase the memory allocation per session?

@max-krasnyansky (Collaborator) commented

> Apologies for the delayed review. I looked at it earlier and the host-side changes were breaking all kinds of things.
> We don't need to enforce a 2GB limit; the actual limit is 3.8GB, so it's OK to spill over.

> @max-krasnyansky No worries. Regarding the limit, I assume it comes from the 32-bit address space of the Hexagon RTOS. Given that newer versions support a 64-bit address space, will we be able to increase the memory allocation per session?

Only the DMA engine has a 64-bit address space; the rest of the hardware uses a 32-bit address space.
If we were to use the DDR <-> DMA <-> VTCM path for every single part of the processing, then yes, we could increase the memory allocation per session. Otherwise we need to keep it under 4GB.
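Purely to illustrate the staging pattern described above (nothing here is from the PR; the names and the memcpy stand-in for the DMA transfer are hypothetical):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TILE_BYTES (64 * 1024)  // hypothetical tile size that fits in VTCM

// Hypothetical stand-in for a user-DMA transfer: the real DMA engine can source
// from a 64-bit DDR address even though the compute side only sees 32-bit VTCM
// pointers. memcpy is used only to keep the sketch self-contained.
static void dma_copy_in(void * vtcm_dst, const uint8_t * ddr_src, size_t n) {
    memcpy(vtcm_dst, ddr_src, n);
}

// Hypothetical tile loop: stage each tile into VTCM, run the compute kernel on
// it, then move on; the kernels never touch DDR directly.
static void process_tensor(const uint8_t * ddr_base, size_t total_bytes, uint8_t * vtcm_buf,
                           void (*compute_tile)(uint8_t * tile, size_t n)) {
    for (size_t off = 0; off < total_bytes; off += TILE_BYTES) {
        size_t n = total_bytes - off < TILE_BYTES ? total_bytes - off : TILE_BYTES;
        dma_copy_in(vtcm_buf, ddr_base + off, n);
        compute_tile(vtcm_buf, n);
    }
}
```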

@chraac closed this Jan 8, 2026