Triton -> CuteDSL dim0 quantization kernel with RCEIL scaling that writes scales directly to ((32,4),4) layout for tcgen05 mma #4052

@danielvegamyhre

Description

For shape (128000, 7168):

  • Current: ~5.5 TB/s
  • Goal: ~6.4 TB/s

The current kernel writes scales in row-major order, so an additional lightweight kernel is required to convert them to the per-group blocked layout. We should write the scales directly to the blocked layout instead.
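For reference, here is a minimal sketch of the index mapping a fused kernel would need in order to write each scale straight to its blocked position. It assumes the ((32,4),4) shape carries strides ((16,4),1) within a 128x4 tile of scales, that tiles are stored in row-major order, and that the scale matrix is already padded to multiples of 128 rows and 4 columns. The function name `blocked_scale_offset` is hypothetical, and per-group offsets are not handled here.

```python
TILE_ROWS = 128   # 32 * 4 rows of scales per tile
TILE_COLS = 4     # 4 scale columns per tile
TILE_SIZE = TILE_ROWS * TILE_COLS  # 512 scales per tile


def blocked_scale_offset(row: int, col: int, n_col_blocks: int) -> int:
    """Flat offset of scale (row, col) in the blocked layout.

    Assumption: within a tile the layout is shape ((32,4),4) with
    strides ((16,4),1), and tiles are ordered row-major.
    """
    row_block, r = divmod(row, TILE_ROWS)
    col_block, c = divmod(col, TILE_COLS)
    tile = row_block * n_col_blocks + col_block
    # Split the in-tile row into (r_outer in 0..3, r_inner in 0..31).
    r_outer, r_inner = divmod(r, 32)
    return tile * TILE_SIZE + r_inner * 16 + r_outer * 4 + c


def to_blocked(scales: list[list[int]]) -> list[int]:
    """Reference conversion of a padded row-major scale matrix."""
    rows, cols = len(scales), len(scales[0])
    assert rows % TILE_ROWS == 0 and cols % TILE_COLS == 0
    n_col_blocks = cols // TILE_COLS
    out = [0] * (rows * cols)
    for i in range(rows):
        for j in range(cols):
            out[blocked_scale_offset(i, j, n_col_blocks)] = scales[i][j]
    return out
```

In a fused kernel, the same arithmetic would be applied to the store address of each scale, which removes the separate row-major to blocked conversion pass.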

Also, we should try CuteDSL instead of Triton for this next iteration.
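As a reference point for a port, the RCEIL scaling named in the title can be sketched in plain Python. This is a sketch under assumptions, not the existing kernel's code: it assumes E8M0 scales (bias 127), an FP8 E4M3 element type with maximum magnitude 448, and blocks of 32 values. The helper name `rceil_e8m0` and the handling of all-zero blocks are hypothetical.

```python
import math

E8M0_BIAS = 127
FP8_E4M3_MAX = 448.0  # assumption: target element dtype is FP8 E4M3


def rceil_e8m0(block: list[float], elem_max: float = FP8_E4M3_MAX) -> int:
    """Biased E8M0 exponent for one block, rounding the scale up.

    The scale is the smallest power of two >= amax / elem_max, so that
    every value divided by the scale fits within elem_max.
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0  # hypothetical convention: smallest scale for all-zero blocks
    # frexp gives ratio = m * 2**e with 0.5 <= m < 1, which avoids
    # floating-point error from log2 at exact powers of two.
    m, e = math.frexp(amax / elem_max)
    exp = e - 1 if m == 0.5 else e      # ceil(log2(ratio))
    exp = max(-127, min(127, exp))      # clamp to the finite E8M0 range
    return exp + E8M0_BIAS


def quantize_block(block: list[float]) -> tuple[int, list[float]]:
    """Return the biased scale and the scaled values (before FP8 casting)."""
    biased = rceil_e8m0(block)
    scale = 2.0 ** (biased - E8M0_BIAS)
    return biased, [x / scale for x in block]
```

Rounding up means the scaled values never exceed the element maximum, at the cost of up to one bit of unused range in a block.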
