
Conversation

@chraac (Contributor) commented Dec 24, 2025

Performance

Device: 8Gen2

Baseline: ed7597771
Optimization: 2058f28b3

| Operation | Params | Baseline (GFLOPS) | Optimization (GFLOPS) | Speedup |
|---|---|---|---|---|
| MUL_MAT (f16, f32) | k=128, n=1 | 3.68 | 5.74 | 1.56x |
| MUL_MAT (f16, f32) | k=14336, n=1 | 3.43 | 7.26 | 2.12x |
| MUL_MAT (f16, f32) | k=14336, n=2 | 3.46 | 7.29 | 2.11x |
| MUL_MAT (f16, f32) | k=14336, n=3 | 3.46 | 7.29 | 2.11x |
| MUL_MAT (f16, f32) | k=14336, n=4 | 3.46 | 7.34 | 2.12x |
| MUL_MAT (f16, f32) | k=14336, n=5 | 3.46 | 7.31 | 2.11x |
| MUL_MAT (f16, f32) | k=14336, n=8 | 3.48 | 7.35 | 2.11x |

@chraac marked this pull request as draft on December 24, 2025 03:01
github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Dec 24, 2025
-volatile HVX_Vector hi = Q6_Vqf32_vmpy_VsfVsf(Q6_Vsf_equals_Vqf32(Q6_V_hi_W(xp)), Q6_V_hi_W(yp));
-volatile HVX_Vector lo = Q6_Vqf32_vmpy_VsfVsf(Q6_Vsf_equals_Vqf32(Q6_V_lo_W(xp)), Q6_V_lo_W(yp));
+HVX_Vector hi = Q6_Vqf32_vmpy_VsfVsf(Q6_Vsf_equals_Vqf32(Q6_V_hi_W(xp)), Q6_V_hi_W(yp));
+HVX_Vector lo = Q6_Vqf32_vmpy_VsfVsf(Q6_Vsf_equals_Vqf32(Q6_V_lo_W(xp)), Q6_V_lo_W(yp));
@chraac (Contributor, Author) commented Dec 24, 2025

Based on my observations, using volatile here has several drawbacks (a sketch of the resulting non-volatile pattern follows after this list):

  • Prevents inlining: with volatile, the binary retains a separate vec_dot_f16_f32 function instead of inlining it into matmul_f16_f32. [screenshot: disassembly showing the retained vec_dot_f16_f32 call]
  • Generates extra store instructions: the compiler emits extra vmem instructions to spill the result registers to the stack, as seen in the highlighted disassembly below. This increases memory bandwidth pressure, which hurts processing speed. [screenshot: highlighted vmem stores]
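For illustration, a minimal sketch of the non-volatile pattern, assuming xp holds the f16 operand already widened to a qf32 vector pair and yp holds the f32 operand as an sf vector pair; the function name, accumulator handling, and surrounding loop are illustrative, not the exact code in this PR:

```c
#include <hexagon_types.h>       // HVX_Vector / HVX_VectorPair (Hexagon SDK)
#include <hvx_hexagon_protos.h>  // Q6_* HVX intrinsics (Hexagon SDK)

// One inner step of an f16 x f32 dot product. Keeping hi/lo and the accumulators
// in plain (non-volatile) HVX_Vector values lets the compiler keep everything in
// vector registers and inline this body into the matmul loop.
static inline void vec_dot_f16_f32_step(HVX_VectorPair xp, HVX_VectorPair yp,
                                        HVX_Vector * acc_hi, HVX_Vector * acc_lo) {
    // Convert the widened f16 halves from qf32 to IEEE sf, then multiply by the f32 halves.
    HVX_Vector hi = Q6_Vqf32_vmpy_VsfVsf(Q6_Vsf_equals_Vqf32(Q6_V_hi_W(xp)), Q6_V_hi_W(yp));
    HVX_Vector lo = Q6_Vqf32_vmpy_VsfVsf(Q6_Vsf_equals_Vqf32(Q6_V_lo_W(xp)), Q6_V_lo_W(yp));

    // Accumulate in qf32; no stack spill is forced because nothing is volatile.
    *acc_hi = Q6_Vqf32_vadd_Vqf32Vqf32(*acc_hi, hi);
    *acc_lo = Q6_Vqf32_vadd_Vqf32Vqf32(*acc_lo, lo);
}
```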

 uint32_t prof_usecs;
 uint32_t prof_cycles;
 uint32_t prof_pkts;
+uint64_t avail_mem_bytes = kMaxMemPerSessInBytes;  // available memory for allocations
@chraac (Contributor, Author) commented Dec 28, 2025

Here we add a new field to track the available bytes per session.
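For context, a hedged sketch of how such a per-session counter could gate allocations; the struct and helper names below are hypothetical, not the actual code in this PR:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical session bookkeeping; only avail_mem_bytes mirrors the new field,
// which starts at kMaxMemPerSessInBytes.
struct htp_session {
    uint64_t avail_mem_bytes;
};

// Hypothetical helper: reserve `size` bytes from the session budget and refuse
// the allocation once the budget is exhausted.
static bool session_reserve_mem(struct htp_session * sess, size_t size) {
    if (sess->avail_mem_bytes < size) {
        return false;  // would exceed the per-session limit
    }
    sess->avail_mem_bytes -= size;
    return true;
}
```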

@max-krasnyansky (Collaborator) commented

@chraac
Apologies for the delayed review. I looked at it earlier and the host-side changes were breaking all kinds of things.
We don't need to enforce a 2GB limit; the actual limit is 3.8GB, so it's OK to spill over.

Re: f16/f32 matmuls. I've been playing with a Flash Attention implementation and ended up overhauling how the f16/f32 handling works in many places. We should do as much as possible in f16 and only accumulate into f32. Leftover handling is quite simple with f16 using the mask-predicate trick.
Take a look at #18611
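For reference, a hedged sketch of the kind of mask-predicate leftover handling this could refer to (illustrative only, not the code from #18611; it assumes the source buffer is padded so a full-vector load at the tail is safe):

```c
#include <hexagon_types.h>       // HVX_Vector / HVX_UVector / HVX_VectorPred (Hexagon SDK)
#include <hvx_hexagon_protos.h>  // Q6_* HVX intrinsics (Hexagon SDK)
#include <stdint.h>

// Load the final, partially filled f16 vector: lanes beyond `nleft` elements are
// zeroed via a byte predicate, so the result can be fed straight into the usual
// f16 multiply/accumulate (zero lanes contribute nothing to the dot product).
static inline HVX_Vector hvx_load_f16_tail(const __fp16 * src, uint32_t nleft) {
    HVX_Vector v = *(const HVX_UVector *) src;                   // unaligned full-vector load
    HVX_VectorPred keep = Q6_Q_vsetq_R(nleft * sizeof(__fp16));  // predicate covers the first nleft*2 bytes
    return Q6_V_vmux_QVV(keep, v, Q6_V_vzero());                 // zero out the lanes past the tail
}
```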

@chraac (Contributor, Author) commented Jan 6, 2026

> Apologies for the delayed review. I looked at it earlier and the host-side changes were breaking all kinds of things.
> We don't need to enforce a 2GB limit; the actual limit is 3.8GB, so it's OK to spill over.

@max-krasnyansky
No worries. Regarding the limit, I assume it comes from the 32-bit address space of the Hexagon RTOS. Given that newer versions support a 64-bit address space, will we be able to increase the memory allocation per session?

@max-krasnyansky (Collaborator) commented

> Apologies for the delayed review. I looked at it earlier and the host-side changes were breaking all kinds of things.
> We don't need to enforce a 2GB limit; the actual limit is 3.8GB, so it's OK to spill over.

> @max-krasnyansky No worries. Regarding the limit, I assume it comes from the 32-bit address space of the Hexagon RTOS. Given that newer versions support a 64-bit address space, will we be able to increase the memory allocation per session?

Only the DMA engine has a 64-bit address space; the rest of the hardware uses a 32-bit address space.
If we were to use the DDR <-> DMA <-> VTCM path for every single part of the processing, then yes, we could increase the memory allocation per session. Otherwise we need to keep it under 4GB.
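Purely to illustrate the staging pattern described above (nothing here is from the PR; the names and the memcpy stand-in for the DMA transfer are hypothetical):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TILE_BYTES (64 * 1024)  // hypothetical tile size that fits in VTCM

// Hypothetical stand-in for a user-DMA transfer: the real DMA engine can source
// from a 64-bit DDR address even though the compute side only sees 32-bit VTCM
// pointers. memcpy is used only to keep the sketch self-contained.
static void dma_copy_in(void * vtcm_dst, const uint8_t * ddr_src, size_t n) {
    memcpy(vtcm_dst, ddr_src, n);
}

// Hypothetical tile loop: stage each tile into VTCM, run the compute kernel on
// it, then move on; the kernels never touch DDR directly.
static void process_tensor(const uint8_t * ddr_base, size_t total_bytes, uint8_t * vtcm_buf,
                           void (*compute_tile)(uint8_t * tile, size_t n)) {
    for (size_t off = 0; off < total_bytes; off += TILE_BYTES) {
        size_t n = total_bytes - off < TILE_BYTES ? total_bytes - off : TILE_BYTES;
        dma_copy_in(vtcm_buf, ddr_base + off, n);
        compute_tile(vtcm_buf, n);
    }
}
```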

@chraac closed this Jan 8, 2026