Releases: intel/sycl-tla

SYCL*TLA 0.9

30 Apr 07:53
b4f7216

Enhancements

  • Support FP8 upconversion for CuTe SLM Copy (#772)
  • Support AOT (Ahead-Of-Time) compilation instead of JIT (#763)
  • Add vectorized test cases for CuTe SLM Copy (#766)
  • Add some GEMM and Flash Attention benchmark cases (#773)
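
To make the FP8 upconversion entry above concrete, here is a minimal host-side sketch of what widening an FP8 byte to a higher-precision value involves. It assumes the OCP-style E4M3 (bias 7, no infinities, a single NaN pattern per sign) and E5M2 (bias 15, IEEE-like infinities and NaNs) encodings. The function names are hypothetical and are not part of the SYCL*TLA or CuTe API; the library performs this conversion on the device during the SLM copy.

```python
import math
import struct


def decode_fp8(byte, exp_bits, man_bits):
    """Decode one FP8 byte laid out as sign | exponent | mantissa."""
    if not 0 <= byte <= 0xFF or exp_bits + man_bits != 7:
        raise ValueError("expected one byte and a 1+exp+man = 8 bit layout")
    bias = (1 << (exp_bits - 1)) - 1
    sign = -1.0 if byte >> 7 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    all_ones = (1 << exp_bits) - 1
    if exp_bits == 5 and exp == all_ones:
        # E5M2 keeps IEEE-style infinities and NaNs.
        return sign * math.inf if man == 0 else math.nan
    if exp_bits == 4 and exp == all_ones and man == (1 << man_bits) - 1:
        # E4M3 has no infinity; only S.1111.111 encodes NaN.
        return math.nan
    if exp == 0:
        # Subnormal: no implicit leading one.
        return sign * man * 2.0 ** (1 - bias - man_bits)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


def decode_e4m3(byte):
    return decode_fp8(byte, exp_bits=4, man_bits=3)


def decode_e5m2(byte):
    return decode_fp8(byte, exp_bits=5, man_bits=2)


def exact_in_fp16(x):
    """True if x survives a round trip through IEEE half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0] == x


if __name__ == "__main__":
    for b in (0x38, 0x7E, 0xC0):
        print(hex(b), decode_e4m3(b))
```

Under these assumptions every finite FP8 value is exactly representable in FP16, so the upconversion itself loses no precision; `exact_in_fp16` can be used to check that claim for all 256 byte patterns.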

Bug Fixes

  • Fix example06 memory bandwidth computation bugs (#778)
  • Fix Python GEMM Generation bugs (#768)
  • Fix Python EVT test case bugs (#762)
  • Fix AOT multitarget support bugs (#765)

See the CHANGELOG for details of all past releases and updates.

SYCL*TLA 0.8

26 Mar 07:45
2fc0997

Major Architecture Changes

  • Support BMG G31 Platform (#755)
  • SLM Copy API functionalities and examples
    • Support CuTe copy engines for 1D LDSM/STSM operations with vISA (#753)
    • Enable fusion example of 2 matmul operations through SLM Copy API (#747)
    • Enable subgroup specialization example with SLM Copy API (#735)
  • Support default sub-byte reorder for low-precision data types (#709)

Enhancements

  • Flash Attention Performance Improvements (for BMG and BF16):

  • Epilogue Visitor Tree (EVT) Enhancements:

    • Combine with SIGMOID function (#686)
    • Add Relu variation test cases (#693)
    • Enhance and refine code and test cases (#703, #717)
  • GEMM Enhancements:

    • Support all GEMM tile shapes (#738)
    • Enhanced examples (#726)
  • Python package: available on PyPI

See the CHANGELOG for details of all past releases and updates.

SYCL*TLA 0.7

29 Jan 16:04
5b9e70a

Major Architecture Changes

  • Epilogue Visitor Tree (EVT):
    • Extended Xe epilogue fusion support for various post-GEMM computations (#647, #650)
    • Added support for XeAuxLoad (#674), XeAuxStore (#691, #698, #704), XeRow/XeCol broadcast (#690) and XeRow/XeCol/XeScalar reduction (#680, #694)
    • Added Python EVT support with comprehensive test coverage
  • Xe Epilogue Rearchitecture (#621): redesigned epilogue path for the new Xe architecture.

Enhancements

  • Flash Attention: performance improvements (BF16 speedup) (#679).
  • GEMM: Column Major C bias support (#656).
  • Shared Local Memory (SLM): new SLM copy helper functions (#673).
  • Build: multi-target builds via DPCPP_SYCL_TARGET (#630).
  • Reorder: API updates and fixes (#639, #635).

Bug Fixes / Notes

  • Flash Attention KV cache / prefill fixes (#617).
  • CuTe atom partitioning edge-case fix (#628).
  • Fix CMake path issue (#700).
  • Unit tests for VNNI load disabled due to driver issue (#707).

Python Packages

  • Available on PyPI
  • Install with pip install sycl-tla==0.7.0
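
After running the install command above, the pinned version can be confirmed from Python. The following is a small sketch using only the standard library. The distribution name `sycl-tla` and version `0.7.0` come from the install line; the helper names are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def matches_pin(dist_name, expected):
    """True only if the distribution is installed at exactly the expected version."""
    return installed_version(dist_name) == expected


if __name__ == "__main__":
    found = installed_version("sycl-tla")
    if found is None:
        print("sycl-tla is not installed in this environment")
    else:
        print("sycl-tla", found, "(pin satisfied)" if matches_pin("sycl-tla", "0.7.0") else "(different version)")
```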

Known Issues

  • CuTe Column Major Support: Column Major support for the C matrix may cause stability issues with older driver versions. Please update to the latest driver version for optimal stability.

Deprecation Notice

  • Legacy APIs with old CuTe atoms are deprecated and will be removed in future releases. Users are encouraged to migrate to the new CuTe atom APIs for better performance and support. Refer to Xe Rearchitecture for the new APIs.

See the CHANGELOG for details of all past releases and updates.

v0.6

04 Nov 00:48
d2292f0

What's New in SYCL*TLA 0.6

SYCL*TLA 0.6 (2025-11-03)

Major Architecture Changes

  • Flash Attention Reimplementation (#547): Complete rewrite of Flash Attention using new Xe atoms
    • Enhanced performance with optimized memory access patterns
    • Better integration with Intel Xe hardware capabilities
  • CUTLASS Library Generation (#578): Full support for CUTLASS library generation and operations
    • New Xe architecture support in library generation pipeline
    • Automated kernel instantiation and compilation support
  • Python package via PyPI distribution

Enhancements

  • Python Operations Support (#595): Enhanced Python bindings with comprehensive test coverage

    • Improved Python API stability and usability
    • Enhanced test framework for Python operations
  • CuTe Subgroup Extensions: New subgroup-scope operations for Intel Xe

    • Subgroup broadcast and reduction operations (#9a6aa27)
    • make_subgroup_tensor helpers for improved tensor manipulation (#21fb89a)
  • Enhanced 2D Copy Operations: Extended block 2D copy functionality

    • New make_block_2d_copy_{C,D} variants with subtiling support (#48d82e8)
    • Support for size-1 fragments in block 2D copies (#2212f1b)
  • 4-bit VNNI Reorders (#593): New 4-bit unit stride to VNNI reorder operations

  • Batch GEMM with new APIs (#540): Enhanced Batch GEMM with new streamlined APIs

  • Grouped GEMM with new APIs (#574): Enhanced grouped GEMM with new streamlined APIs
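
As background for the 4-bit VNNI reorder entry above, the sketch below shows the general idea of a VNNI-style layout on plain Python lists: consecutive elements along the reduction (K) dimension of each column are grouped so that they can be packed into one machine word. The packing factor is an assumption here (with 32-bit words it would be 8 for 4-bit data and 2 for 16-bit data), and `to_vnni` is a hypothetical helper, not the reorder atom added in #593.

```python
def to_vnni(rows, factor):
    """Regroup a K x N row-major matrix into shape (K // factor, N, factor).

    out[kb][n][i] == rows[kb * factor + i][n], so the `factor` K-neighbours
    of a column end up adjacent in memory.
    """
    k_extent = len(rows)
    n_extent = len(rows[0])
    if k_extent % factor != 0:
        raise ValueError("K must be a multiple of the packing factor")
    return [
        [[rows[kb * factor + i][n] for i in range(factor)] for n in range(n_extent)]
        for kb in range(k_extent // factor)
    ]


if __name__ == "__main__":
    matrix = [[0, 1], [2, 3], [4, 5], [6, 7]]  # K = 4, N = 2
    print(to_vnni(matrix, factor=2))
```

A "unit stride" source in this picture is the plain row-major input; the reorder only changes which elements are neighbours, not their values.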

See the CHANGELOG for details of all past releases and updates.

SYCL is a trademark of the Khronos Group Inc. Other names and brands may be claimed as the property of others.

v0.5

26 Sep 22:42
b0cb10e

New in CUTLASS SYCL 0.5

Major Architecture Changes

  • Xe Rearchitecture (#477): Complete redesign of Xe CuTe atoms with new architecture
    • New MMA atoms for improved performance
    • Enhanced 2D copy atoms (loads, stores, prefetch with VNNI/transpose support)
    • New 2D copy helpers (low-level make_block_2d_copy and high-level make_block_2d_copy_{A,B,C})
    • Generic and optimized reorder atoms for {int4, uint4, int8, uint8, e2m1, e4m3, e5m2} -> {half, bfloat16}
    • Requires IGC version v2.18.5 or later

New Features

  • G++ Host Compiler Support (#490): Support for G++ 13 as host compiler

  • Migrated syclcompat to this repository as cutlasscompat for better compatibility

    • Fixed compilation issues when using G++ instead of clang++
    • Added new CI workflow for testing G++ host compiler builds
    • Enhanced build system to support -DDPCPP_HOST_COMPILER=g++ option
  • Grouped GEMM for Mixed Dtype (#457): Extended grouped GEMM support to mixed precision operations

    • Added support for BF16 + S8 mixed dtype grouped GEMM
    • Added support for FP16 + U4 mixed dtype grouped GEMM
    • New examples: 10_bmg_grouped_gemm_bf16_f16_s8.cpp and 10_bmg_grouped_gemm_f16_u4.cpp

See the CHANGELOG-SYCL for details of all past releases and updates.

v3.9-0.3

30 Jun 21:12
467a2bb

What's Changed

Cutlass 3.9.2 SYCL backend Version 0.3 (2025-06-30)

  • Add support for GEMM FP8 (E5M2 and E4M3)
  • Add example for GEMM FP8 with support for channel-wise and group-wise quantization
  • Add support for Grouped GEMM FP8
  • Improve performance for FP8 to FP16 conversion
  • Add support for epilogue data conversion
  • Add support for FP16 GEMM with FP16 accumulator
  • Add support for BF16 GEMM with BF16 accumulator
  • Add support for mixed dtype GEMM with support for tensor-wise, channel-wise and group-wise quantization
  • Add example of mixed dtype BF16 + INT8 using channel-wise and group-wise quantization
  • Add example of mixed dtype FP16 + INT8 using tensor-wise quantization
  • Add example of mixed dtype FP16 + INT4 using channel-wise and group-wise quantization
  • Add support for zero-point quantization in INT4 and INT8 data types
  • Add support for Flash Attention prefill FP8 with and without KV cache
  • Add support for Flash Attention decode FP8 with and without KV cache
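
The tensor-wise, channel-wise and group-wise quantization modes listed above differ only in how many weights share one scale (and, with zero-point quantization, one zero point). The sketch below illustrates that relationship with the common affine formula w = (q - zero_point) * scale. It assumes groups run along the K dimension of a K x N weight matrix with one scale per group per column; that orientation and the function name are assumptions for illustration, not the library's API.

```python
def dequantize(q, scales, zero_points, group_size):
    """Dequantize a K x N integer matrix given as a list of rows.

    scales[g][n] and zero_points[g][n] apply to rows
    g * group_size .. (g + 1) * group_size - 1 of column n.
    """
    k_extent = len(q)
    n_extent = len(q[0])
    if k_extent % group_size != 0:
        raise ValueError("K must be a multiple of group_size")
    return [
        [
            (q[k][n] - zero_points[k // group_size][n]) * scales[k // group_size][n]
            for n in range(n_extent)
        ]
        for k in range(k_extent)
    ]


if __name__ == "__main__":
    q = [[1, 2], [3, 4], [5, 6], [7, 8]]  # K = 4, N = 2
    # Group-wise: a separate scale and zero point for every 2 rows of each column.
    print(dequantize(q, [[0.5, 1.0], [2.0, 0.25]], [[1, 0], [3, 4]], group_size=2))
    # Channel-wise: one group spanning all of K, so one scale per column.
    print(dequantize(q, [[0.5, 1.0]], [[0, 0]], group_size=4))
```

In this formulation tensor-wise quantization is the further special case where every column also shares the same scale and zero point.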

Full Changelog: v3.9-0.2...v3.9-0.3

Cutlass 3.9.2 SYCL backend Version 0.2

30 May 23:43
dd43242

Cutlass 3.9.2 SYCL backend Version 0.2 (2025-05-30)
Based on CUTLASS 3.9.2 - May 2025 release

Platforms

  • Support for Intel Data Center GPU Max (1100 and 1550)
  • Support for Intel Arc B580 ("Battlemage")

Features

  • GEMM/StreamK/SplitK with support for FP16 data type
  • Flash attention prefill with Paged KV cache with support for FP16 data type
  • Performance improvements for flash attention prefill and decode

Full Changelog: v3.9-0.1...v3.9-0.2

Cutlass 3.9 sycl backend version 0.1

30 Apr 01:12
ef9797f

Based on CUTLASS 3.9.0 - March 2025 release

Platforms

  • Support for Intel Data Center GPU Max (1100 and 1550)
  • Support for Intel Arc B580 ("Battlemage")

Features

  • GEMM/StreamK/SplitK with support for bfloat16 data type

  • Flash attention prefill and decode with KV cache with support for bfloat16 data type

  • Support for epilogue operations:

    • Element-wise, row-wise and column-wise bias
    • ReLU, SiLU, GELU activation functions
    • Softmax
  • Mixed precision GEMM (bfloat16/int8, half/int4) with dequantization support

  • Dual GEMM & Grouped GEMM
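
The epilogue operations listed above are applied to the GEMM accumulator before the result is written out. A reference-style sketch of bias addition followed by an activation is shown below. The bias-mode names are hypothetical labels chosen for this example ("per_row" adds bias[m] to every element of row m, "per_col" adds bias[n] to every element of column n), and GELU is written in its exact erf form; whether the library uses this form or the tanh approximation is not stated in these notes.

```python
import math


def relu(x):
    return max(0.0, x)


def silu(x):
    return x / (1.0 + math.exp(-x))


def gelu(x):
    # Exact (erf-based) GELU; an assumption, see the note above.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def apply_epilogue(acc, bias, activation, bias_mode):
    """Return activation(acc + bias) for an M x N accumulator given as a list of rows."""
    out = []
    for m, row in enumerate(acc):
        new_row = []
        for n, value in enumerate(row):
            if bias_mode == "per_row":
                b = bias[m]
            elif bias_mode == "per_col":
                b = bias[n]
            elif bias_mode == "elementwise":
                b = bias[m][n]
            else:
                raise ValueError("unknown bias_mode: " + str(bias_mode))
            new_row.append(activation(value + b))
        out.append(new_row)
    return out


if __name__ == "__main__":
    acc = [[1.0, -2.0], [3.0, 4.0]]
    print(apply_epilogue(acc, [0.5, -10.0], relu, "per_row"))
    print(apply_epilogue(acc, [0.5, -10.0], silu, "per_col"))
```

Fusing these steps into the epilogue avoids a separate pass over the output matrix, which is the motivation for offering them as epilogue operations rather than standalone kernels.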

Full Changelog: https://github.com/codeplaysoftware/cutlass-sycl/commits/v3.9-0.1