Releases: intel/sycl-tla
SYCL*TLA 0.9
Enhancements
- Support FP8 upconversion for CuTe SLM Copy (#772)
- Support AOT (Ahead-Of-Time) compilation instead of JIT (#763)
- Add vectorized test cases for CuTe SLM Copy (#766)
- Add some GEMM and Flash Attention benchmark cases (#773)
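As an illustration of what FP8 upconversion involves, the sketch below decodes a single FP8 byte into a Python float. It assumes the widely used 8-bit float encodings (E4M3 with bias 7 and no infinities, E5M2 with bias 15 and IEEE-style infinities and NaNs); the release notes do not state the exact encoding, and the function names are hypothetical and are not part of the SYCL*TLA API.

```python
import math

def decode_fp8(byte, exp_bits, man_bits, bias, ieee_specials):
    """Decode one FP8 byte to a Python float (illustrative helper, not a library API)."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    exp_max = (1 << exp_bits) - 1
    man_max = (1 << man_bits) - 1
    if ieee_specials and exp == exp_max:
        # Assumed E5M2 behaviour: all-ones exponent encodes inf (man == 0) or NaN.
        return sign * math.inf if man == 0 else math.nan
    if not ieee_specials and exp == exp_max and man == man_max:
        # Assumed E4M3 behaviour: only the all-ones pattern is NaN, no infinities.
        return math.nan
    if exp == 0:
        # Subnormal: no implicit leading one.
        return sign * man * 2.0 ** (1 - bias - man_bits)
    return sign * (1.0 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

def decode_e4m3(byte):
    return decode_fp8(byte, exp_bits=4, man_bits=3, bias=7, ieee_specials=False)

def decode_e5m2(byte):
    return decode_fp8(byte, exp_bits=5, man_bits=2, bias=15, ieee_specials=True)
```

A real kernel would perform this conversion with vectorized bit manipulation rather than per-element branching; the sketch only shows the arithmetic.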
Bug Fixes
- Fix example06 memory bandwidth computation bugs (#778)
- Fix Python GEMM Generation bugs (#768)
- Fix Python EVT test cases bugs (#762)
- Fix AOT multitarget support bugs (#765)
See the CHANGELOG for details of all past releases and updates.
SYCL*TLA 0.8
Major Architecture Changes
- Support BMG G31 Platform (#755)
- SLM Copy API functionalities and examples
- Support default sub-byte reorder for low-precision data types (#709)
Enhancements
- Flash Attention Performance Improvements (for BMG and BF16):
- Epilogue Visitor Tree (EVT) Enhancements:
- GEMM Enhancements:
- Python Package: PyPI
See the CHANGELOG for details of all past releases and updates.
SYCL*TLA 0.7
Major Architecture Changes
- Epilogue Visitor Tree (EVT):
- Xe Epilogue Rearchitecture (#621): redesigned epilogue path for the new Xe architecture.
Enhancements
- Flash Attention: performance improvements (BF16 speedup) (#679).
- GEMM: Column Major C bias support (#656).
- Shared Local Memory (SLM): new SLM copy helper functions (#673).
- Build: multi-target builds via DPCPP_SYCL_TARGET (#630).
- Reorder: API updates and fixes (#639), (#635).
Bug Fixes / Notes
- Flash Attention KV cache / prefill fixes (#617).
- CuTe atom partitioning edge-case fix (#628).
- Fix CMake path issue (#700).
- Unit tests for VNNI load disabled due to driver issue (#707).
Python Packages
- Available on PyPI
- Install with pip install sycl-tla==0.7.0
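After installing, the pinned version can be checked from Python. The sketch below uses only the standard library and the distribution name sycl-tla given above; the helper name installed_version is hypothetical.

```python
from importlib import metadata

def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

if __name__ == "__main__":
    version = installed_version("sycl-tla")
    if version is None:
        print("sycl-tla is not installed")
    else:
        print("sycl-tla version:", version)
```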
Known Issues
- CuTe Column Major Support: Column-major support for the C matrix may introduce stability issues with older driver versions. Please update to the latest driver version for optimal stability.
Deprecation Notice
- Legacy APIs with old CuTe atoms are deprecated and will be removed in future releases. Users are encouraged to migrate to the new CuTe atom APIs for better performance and support. Refer to Xe Rearchitecture for the new APIs.
See the CHANGELOG for details of all past releases and updates.
v0.6
What's New in SYCL*TLA 0.6
SYCL*TLA 0.6 (2025-11-03)
Major Architecture Changes
- Flash Attention Reimplementation (#547): Complete rewrite of Flash Attention using new Xe atoms
- Enhanced performance with optimized memory access patterns
- Better integration with Intel Xe hardware capabilities
- CUTLASS Library Generation (#578): Full support for CUTLASS library generation and operations
- New Xe architecture support in library generation pipeline
- Automated kernel instantiation and compilation support
- Python package via PyPI distribution
- Install with pip install sycl-tla
Enhancements
- Python Operations Support (#595): Enhanced Python bindings with comprehensive test coverage
- Improved Python API stability and usability
- Enhanced test framework for Python operations
- CuTe Subgroup Extensions: New subgroup-scope operations for Intel Xe
- Enhanced 2D Copy Operations: Extended block 2D copy functionality
- 4-bit VNNI Reorders (#593): New 4-bit unit stride to VNNI reorder operations
- Batch GEMM with new APIs (#540): Enhanced Batch GEMM with new streamlined APIs
- Grouped GEMM with new APIs (#574): Enhanced grouped GEMM with new streamlined APIs
See the CHANGELOG for details of all past releases and updates.
SYCL is a trademark of the Khronos Group Inc. Other names and brands may be claimed as the property of others.
v0.5
New in CUTLASS SYCL 0.5
Major Architecture Changes
- Xe Rearchitecture (#477): Complete redesign of Xe CuTe atoms with new architecture
- New MMA atoms for improved performance
- Enhanced 2D copy atoms (loads, stores, prefetch with VNNI/transpose support)
- New 2D copy helpers (low-level make_block_2d_copy and high-level make_block_2d_copy_{A,B,C})
- Generic and optimized reorder atoms for {int4, uint4, int8, uint8, e2m1, e4m3, e5m2} -> {half, bfloat16}
- Requires IGC version v2.18.5 or later
New Features
- G++ Host Compiler Support (#490): Support for G++ 13 as host compiler
- Migrated syclcompat to this repository as cutlasscompat for better compatibility
- Fixed compilation issues when using G++ instead of clang++
- Added new CI workflow for testing G++ host compiler builds
- Enhanced build system to support -DDPCPP_HOST_COMPILER=g++ option
- Grouped GEMM for Mixed Dtype (#457): Extended grouped GEMM support to mixed precision operations
- Added support for BF16 + S8 mixed dtype grouped GEMM
- Added support for FP16 + U4 mixed dtype grouped GEMM
- New examples: 10_bmg_grouped_gemm_bf16_f16_s8.cpp and 10_bmg_grouped_gemm_f16_u4.cpp
See the CHANGELOG-SYCL for details of all past releases and updates.
v3.9-0.3
What's Changed
Cutlass 3.9.2 SYCL backend Version 0.3 (2025-06-30)
- Add support for GEMM FP8 (E5M2 and E4M3)
- Add example for GEMM FP8 with support for channel-wise and group-wise quantization
- Add support for Grouped GEMM FP8
- Improve performance for FP8 to FP16 conversion
- Add support for epilogue data conversion
- Add support for FP16 GEMM with FP16 accumulator
- Add support for BF16 GEMM with BF16 accumulator
- Add support for mixed dtype GEMM with support for tensor-wise, channel-wise and group-wise quantization
- Add example of mixed dtype BF16 + INT8 using channel-wise and group-wise quantization
- Add example of mixed dtype FP16 + INT8 using tensor-wise quantization
- Add example of mixed dtype FP16 + INT4 using channel-wise and group-wise quantization
- Add support for zero-point quantization in INT4 and INT8 data types
- Add support for Flash Attention prefill FP8 with and without KV cache
- Add support for Flash Attention decode FP8 with and without KV cache
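The tensor-wise, channel-wise and group-wise quantization modes mentioned above differ only in how many values share one scale (and zero point). A minimal sketch in plain Python, assuming the common affine scheme value = (q - zero_point) * scale; the function name and argument layout are hypothetical and do not reflect the library's actual interface.

```python
def dequantize_groupwise(q, scales, zero_points, group_size):
    """Dequantize one channel of integer values along K.

    q           : list of quantized integers (e.g. INT4 or INT8 values)
    scales      : one scale per group of `group_size` consecutive values
    zero_points : one zero point per group (use 0 for symmetric quantization)
    """
    out = []
    for i, value in enumerate(q):
        g = i // group_size  # index of the group this element belongs to
        out.append((value - zero_points[g]) * scales[g])
    return out

# Channel-wise quantization is the special case of a single group per channel.
def dequantize_channelwise(q, scale, zero_point=0):
    return dequantize_groupwise(q, [scale], [zero_point], group_size=len(q))
```

Tensor-wise quantization would reuse the same single scale for every channel of the tensor.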
Full Changelog: v3.9-0.2...v3.9-0.3
Cutlass 3.9.2 SYCL backend Version 0.2
Cutlass 3.9.2 SYCL backend Version 0.2 (2025-05-30)
Based on CUTLASS 3.9.2 - May 2025 release
Platforms
- Support for Intel GPU Data Center Max (1100 and 1550)
- Support for Intel Arc B580 ("Battlemage")
Features
- GEMM/StreamK/SplitK with support for FP16 data type
- Flash attention prefill with Paged KV cache with support for FP16 data type
- Performance improvements for flash attention prefill and decode
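For readers unfamiliar with the SplitK variant listed above, the idea is to partition the K (reduction) dimension into slices, compute a partial product per slice, and then sum the partial results. The following is a minimal pure-Python sketch of that idea only; the names are hypothetical, and the actual kernels distribute the slices across GPU work-groups rather than looping.

```python
def matmul(a, b):
    """Reference M x K by K x N matrix product on nested lists."""
    k, n = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)] for row in a]

def splitk_matmul(a, b, splits):
    """Compute a @ b by splitting the K dimension into `splits` slices."""
    m, k, n = len(a), len(b), len(b[0])
    chunk = -(-k // splits)  # ceiling division: slice length along K
    c = [[0] * n for _ in range(m)]
    for s in range(splits):
        lo, hi = s * chunk, min((s + 1) * chunk, k)
        # Each slice is independent; on a GPU it could run in its own work-group.
        for i in range(m):
            for j in range(n):
                partial = sum(a[i][p] * b[p][j] for p in range(lo, hi))
                c[i][j] += partial  # reduction step across slices
    return c
```

Splitting K exposes more parallelism when M and N are small, at the cost of the extra reduction over partial results.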
Full Changelog: v3.9-0.1...v3.9-0.2
Cutlass 3.9 sycl backend version 0.1
Based on CUTLASS 3.9.0 March 2025 release
Platforms
- Support for Intel GPU Data Center Max (1100 and 1550)
- Support for Intel Arc B580 ("Battlemage")
Features
- GEMM/StreamK/SplitK with support for bfloat16 data type
- Flash attention prefill and decode with KV cache with support for bfloat16 data type
- Support for epilogue operations:
- Element-wise, row-wise and column-wise bias
- ReLU, SiLU, GELU activation functions
- Softmax
- Mixed precision GEMM (bfloat16/int8, half/int4) with dequantization support
- Dual GEMM & Grouped GEMM
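The epilogue operations listed above are applied to the GEMM accumulators before the result is written out. A minimal sketch of their arithmetic in plain Python follows. The helper names are hypothetical, and "row-wise" is assumed here to mean one bias value per output row and "column-wise" one per output column, which the release notes do not specify.

```python
import math

def relu(x):
    return max(0.0, x)

def silu(x):
    # x * sigmoid(x), written to avoid overflow for large negative x.
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)

def gelu(x):
    # Exact (erf-based) GELU.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def softmax(row):
    m = max(row)  # subtract the row max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def apply_epilogue(acc, bias, mode, activation):
    """Add a bias to an M x N accumulator matrix, then apply an activation.

    mode: "element" (bias is M x N), "row" (length M), or "column" (length N).
    """
    out = []
    for i, row in enumerate(acc):
        if mode == "element":
            biased = [v + bias[i][j] for j, v in enumerate(row)]
        elif mode == "row":
            biased = [v + bias[i] for v in row]
        elif mode == "column":
            biased = [v + bias[j] for j, v in enumerate(row)]
        else:
            raise ValueError("unknown bias mode: " + mode)
        out.append([activation(v) for v in biased])
    return out
```

For example, apply_epilogue(acc, bias, "column", gelu) adds a per-column bias and applies GELU to every element.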
Full Changelog: https://github.com/codeplaysoftware/cutlass-sycl/commits/v3.9-0.1