30 Apr 07:53

taozha2

b4f7216

SYCL*TLA 0.9

Enhancements

Support FP8 upconversion for CuTe SLM Copy (#772)
Support AOT (Ahead-Of-Time) compilation instead of JIT (#763)
Add vectorized test cases for CuTe SLM Copy (#766)
Add some GEMM and Flash Attention benchmark cases (#773)
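As a rough illustration of what FP8 upconversion involves, the sketch below decodes E4M3 and E5M2 bytes into ordinary floats. It assumes the OCP FP8 bit layouts (E4M3 without infinities, E5M2 with IEEE-style infinities and NaNs). The function names are hypothetical, and this is not the CuTe SLM Copy implementation.

```python
import math


def decode_fp8(byte: int, exp_bits: int, man_bits: int, ieee_specials: bool) -> float:
    """Decode one FP8 byte (1 sign bit, exp_bits exponent bits, man_bits mantissa bits)."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    bias = (1 << (exp_bits - 1)) - 1
    max_exp = (1 << exp_bits) - 1

    if ieee_specials and exp == max_exp:
        # Assumed E5M2 behavior: all-ones exponent encodes infinity or NaN.
        return sign * math.inf if man == 0 else math.nan
    if not ieee_specials and exp == max_exp and man == (1 << man_bits) - 1:
        # Assumed E4M3 behavior: only the all-ones pattern is NaN, no infinities.
        return math.nan
    if exp == 0:
        # Subnormal: no implicit leading one.
        return sign * math.ldexp(man, 1 - bias - man_bits)
    # Normal: implicit leading one in front of the mantissa bits.
    return sign * math.ldexp((1 << man_bits) + man, exp - bias - man_bits)


def decode_e4m3(byte: int) -> float:
    return decode_fp8(byte, exp_bits=4, man_bits=3, ieee_specials=False)


def decode_e5m2(byte: int) -> float:
    return decode_fp8(byte, exp_bits=5, man_bits=2, ieee_specials=True)


# 0x38 encodes 1.0 in E4M3; 0x3C encodes 1.0 in E5M2.
print(decode_e4m3(0x38), decode_e5m2(0x3C))
```

Every finite value in both formats fits exactly in FP16, so an upconversion of this kind loses no precision.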

Bug Fixes

Fix example06 memory bandwidth computation bugs (#778)
Fix Python GEMM generation bugs (#768)
Fix Python EVT test case bugs (#762)
Fix AOT multi-target support bugs (#765)

See the CHANGELOG for details of all past releases and updates.

26 Mar 07:45

taozha2

v0.8

2fc0997

SYCL*TLA 0.8

Major Architecture Changes

Support BMG G31 Platform (#755)
SLM Copy API functionality and examples:
- Support CuTe copy engines for 1D LDSM/STSM operations with vISA (#753)
- Enable fusion example of 2 matmul operations through SLM Copy API (#747)
- Enable subgroup specialization example with SLM Copy API (#735)
Support default sub-byte reorder for low-precision data types (#709)

Enhancements

Flash Attention Performance Improvements (for BMG and BF16):
- Fix long context OOM issue (#728)
- Overall performance improved from ~45% to ~78% of peak (#728, #743, #749, #750)
- Refine code and fix bugs (#715, #716, #720)
Epilogue Visitor Tree (EVT) Enhancements:
- Combine with SIGMOID function (#686)
- Add Relu variation test cases (#693)
- Enhance and refine code and test cases (#703, #717)
GEMM Enhancements:
- Support all GEMM tile shapes (#738)
- Enhanced examples (#726)
Python Package: PyPI

See the CHANGELOG for details of all past releases and updates.

29 Jan 16:04

Antonyvance

v0.7

5b9e70a

SYCL*TLA 0.7

Major Architecture Changes

Epilogue Visitor Tree (EVT):
- Extended Xe epilogue fusion support for various post-GEMM computations (#647, #650)
- Added support for XeAuxLoad (#674), XeAuxStore (#691, #698, #704), XeRow/XeCol broadcast (#690) and XeRow/XeCol/XeScalar reduction (#680, #694)
- Added Python EVT support with comprehensive test coverage
Xe Epilogue Rearchitecture (#621): redesigned epilogue path for the new Xe architecture.

Enhancements

Flash Attention: performance improvements (BF16 speedup) (#679).
GEMM: Column Major C bias support (#656).
Shared Local Memory (SLM): new SLM copy helper functions (#673).
Build: multi-target builds via DPCPP_SYCL_TARGET (#630).
Reorder: API updates and fixes (#639, #635).

Bug Fixes / Notes

Flash Attention KV cache / prefill fixes (#617).
CuTe atom partitioning edge-case fix (#628).
Fix CMake path issue (#700).
Unit tests for VNNI load disabled due to driver issue (#707).

Python Packages

Available on PyPI
Install with pip install sycl-tla==0.7.0
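After installing, one way to confirm which release is present is to query the package metadata. This is a minimal sketch using only the Python standard library. The helper names are hypothetical, and the distribution name `sycl-tla` is taken from the install command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "sycl-tla") -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def matches_release(expected: str = "0.7.0", dist_name: str = "sycl-tla") -> bool:
    """True only when the installed distribution reports exactly `expected`."""
    return installed_version(dist_name) == expected


if __name__ == "__main__":
    print("installed:", installed_version())
    print("matches 0.7.0:", matches_release())
```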

Known Issues

CuTe Column Major Support: Column Major support for the C matrix may cause stability issues with older driver versions. Please update to the latest driver version for optimal stability.

Deprecation Notice

Legacy APIs with old CuTe atoms are deprecated and will be removed in future releases. Users are encouraged to migrate to the new CuTe atom APIs for better performance and support. Refer to the Xe Rearchitecture for the new APIs.

See the CHANGELOG for details of all past releases and updates.

04 Nov 00:48

Antonyvance

v0.6

d2292f0

v0.6

What's New in SYCL*TLA 0.6

SYCL*TLA 0.6 (2025-11-03)

Major Architecture Changes

Flash Attention Reimplementation (#547): Complete rewrite of Flash Attention using new Xe atoms
- Enhanced performance with optimized memory access patterns
- Better integration with Intel Xe hardware capabilities
CUTLASS Library Generation (#578): Full support for CUTLASS library generation and operations
- New Xe architecture support in library generation pipeline
- Automated kernel instantiation and compilation support
Python package via PyPI distribution
- Install with pip install sycl-tla

Enhancements

Python Operations Support (#595): Enhanced Python bindings with comprehensive test coverage
- Improved Python API stability and usability
- Enhanced test framework for Python operations
CuTe Subgroup Extensions: New subgroup-scope operations for Intel Xe
- Subgroup broadcast and reduction operations (#9a6aa27)
- make_subgroup_tensor helpers for improved tensor manipulation (#21fb89a)
Enhanced 2D Copy Operations: Extended block 2D copy functionality
- New make_block_2d_copy_{C,D} variants with subtiling support (#48d82e8)
- Support for size-1 fragments in block 2D copies (#2212f1b)
4-bit VNNI Reorders (#593): New 4-bit unit stride to VNNI reorder operations
Batch GEMM with new APIs (#540): Enhanced Batch GEMM with new streamlined APIs
Grouped GEMM with new APIs (#574): Enhanced grouped GEMM with new streamlined APIs

See the CHANGELOG for details of all past releases and updates.

SYCL is a trademark of the Khronos Group Inc. Other names and brands may be claimed as the property of others.

26 Sep 22:42

rolandschulz

v0.5

b0cb10e

v0.5

New in CUTLASS SYCL 0.5

Major Architecture Changes

Xe Rearchitecture (#477): Complete redesign of Xe CuTe atoms with new architecture
- New MMA atoms for improved performance
- Enhanced 2D copy atoms (loads, stores, prefetch with VNNI/transpose support)
- New 2D copy helpers (low-level make_block_2d_copy and high-level make_block_2d_copy_{A,B,C})
- Generic and optimized reorder atoms for {int4, uint4, int8, uint8, e2m1, e4m3, e5m2} -> {half, bfloat16}
- Requires IGC version v2.18.5 or later

New Features

G++ Host Compiler Support (#490): Support for G++ 13 as host compiler
Migrated syclcompat to this repository as cutlasscompat for better compatibility
- Fixed compilation issues when using G++ instead of clang++
- Added new CI workflow for testing G++ host compiler builds
- Enhanced build system to support -DDPCPP_HOST_COMPILER=g++ option
Grouped GEMM for Mixed Dtype (#457): Extended grouped GEMM support to mixed precision operations
- Added support for BF16 + S8 mixed dtype grouped GEMM
- Added support for FP16 + U4 mixed dtype grouped GEMM
- New examples: 10_bmg_grouped_gemm_bf16_f16_s8.cpp and 10_bmg_grouped_gemm_f16_u4.cpp
See the CHANGELOG-SYCL for details of all past releases and updates.

30 Jun 21:12

mehdi-goli

v3.9-0.3

467a2bb

v3.9-0.3

What's Changed

Cutlass 3.9.2 SYCL backend Version 0.3 (2025-06-30)

Add support for GEMM FP8 (E5M2 and E4M3)
Add example for GEMM FP8 with support for channel-wise and group-wise quantization
Add support for Grouped GEMM FP8
Improve performance for FP8 to FP16 conversion
Add support for epilogue data conversion
Add support for FP16 GEMM with FP16 accumulator
Add support for BF16 GEMM with BF16 accumulator
Add support for mixed dtype GEMM with support for tensor-wise, channel-wise and group-wise quantization
Add example of mixed dtype BF16 + INT8 using channel-wise and group-wise quantization
Add example of mixed dtype FP16 + INT8 using tensor-wise quantization
Add example of mixed dtype FP16 + INT4 using channel-wise and group-wise quantization
Add support for zero-point quantization in INT4 and INT8 data types
Add support for Flash Attention prefill FP8 with and without KV cache
Add support for Flash Attention decode FP8 with and without KV cache
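To make the quantization granularities above concrete, here is a small sketch of dequantization with scales and zero points. It assumes the common convention `w = scale * (q - zero_point)` with groups laid out along the K dimension of each output channel. The function name and data layout are illustrative assumptions, not the library's API.

```python
def dequantize(q, scales, zeros, group_size):
    """Dequantize integer weights row by row.

    q:       one list of integer codes per output channel (length K each)
    scales:  per channel, one scale per group of `group_size` elements along K
    zeros:   per channel, one zero point per group, same shape as `scales`
    """
    out = []
    for row, row_scales, row_zeros in zip(q, scales, zeros):
        out.append([
            row_scales[k // group_size] * (code - row_zeros[k // group_size])
            for k, code in enumerate(row)
        ])
    return out


# Two output channels, K = 4, unsigned 4-bit codes in the range 0..15.
q_example = [[0, 15, 8, 8],
             [4, 4, 12, 0]]

# Group-wise: group_size = 2, so each channel has two (scale, zero point) pairs.
group_scales = [[0.5, 0.25], [1.0, 2.0]]
group_zeros = [[8, 8], [4, 0]]
group_wise = dequantize(q_example, group_scales, group_zeros, group_size=2)

# Channel-wise: a single group spanning all of K for each channel.
channel_wise = dequantize(q_example, [[0.5], [1.0]], [[8], [4]], group_size=4)

# Tensor-wise: the same scale and zero point repeated for every channel.
tensor_wise = dequantize(q_example, [[0.5], [0.5]], [[8], [8]], group_size=4)

print(group_wise)
```

In this framing, tensor-wise and channel-wise quantization are special cases of group-wise quantization with larger groups, which is why a single routine can cover all three.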

Full Changelog: v3.9-0.2...v3.9-0.3

30 May 23:43

mehdi-goli

v3.9-0.2

dd43242

Cutlass 3.9.2 SYCL backend Version 0.2

Cutlass 3.9.2 SYCL backend Version 0.2 (2025-05-30)
Based on CUTLASS 3.9.2 - May 2025 release

Platforms

Support for Intel GPU Data Center Max (1100 and 1550)
Support for Intel Arc B580 ("Battlemage")

Features

GEMM/StreamK/SplitK with support for FP16 data type
Flash attention prefill with Paged KV cache with support for FP16 data type
Performance improvements for flash attention prefill and decode

Full Changelog: v3.9-0.1...v3.9-0.2

30 Apr 01:12

mehdi-goli

v3.9-0.1

ef9797f

Cutlass 3.9 sycl backend version 0.1

Based on CUTLASS 3.9.0 March 2025 release

Platforms

Support for Intel GPU Data Center Max (1100 and 1550)
Support for Intel Arc B580 ("Battlemage")

Features

GEMM/StreamK/SplitK with support for bfloat16 data type
Flash attention prefill and decode with KV cache with support for bfloat16 data type
Support for epilogue operations:
- Element-wise, row-wise and column-wise bias
- ReLU, SiLU, GELU activation functions
- Softmax
Mixed precision GEMM (bfloat16/int8, half/int4) with dequantization support
Dual GEMM & Grouped GEMM
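The epilogue operations listed above can be pictured as a post-processing step applied to the GEMM accumulator. The sketch below is a plain-Python reference, assuming "row-wise" bias means one value per output row and "column-wise" bias means one value per output column. All function names are hypothetical, and none of this reflects the library's actual kernels.

```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def silu(x: float) -> float:
    # x * sigmoid(x), written so that exp() never receives a large positive argument.
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def gelu(x: float) -> float:
    # Exact (erf-based) GELU.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def epilogue(acc, activation, row_bias=None, col_bias=None):
    """Apply optional per-row and per-column bias, then an activation, to each element."""
    out = []
    for i, row in enumerate(acc):
        r = row_bias[i] if row_bias is not None else 0.0
        out.append([
            activation(v + r + (col_bias[j] if col_bias is not None else 0.0))
            for j, v in enumerate(row)
        ])
    return out


def softmax_rows(mat):
    """Row-wise softmax, subtracting the row maximum for numerical stability."""
    result = []
    for row in mat:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


acc_example = [[1.0, -2.0], [0.5, 0.0]]
biased_relu = epilogue(acc_example, relu, col_bias=[1.0, 1.0])
probabilities = softmax_rows(acc_example)
print(biased_relu)
```

Fusing these steps into the GEMM epilogue avoids writing the raw accumulator to memory and reading it back for a separate kernel.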

Full Changelog: https://github.com/codeplaysoftware/cutlass-sycl/commits/v3.9-0.1

Releases: intel/sycl-tla
