Add AArch64 SIMD for Blake, SHA, CRC, XXH3, Argon2, and Adler32 #88

Merged: Xor-el merged 34 commits into master from feature/arm-simd on Jun 25, 2026

Xor-el (Owner) commented on Jun 24, 2026

Summary

Adds runtime-dispatched AArch64 SIMD implementations across HashLib4Pascal, bringing ARM64 to parity with the existing x86 SIMD tier ladder. Kernels use inline assembly with .long-encoded vector/crypto instructions for broad FPC assembler compatibility, and are selected at startup via the existing *Dispatch.pas + HlpArmSimdFeatures infrastructure.
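To illustrate what ".long-encoded" means here, the following is a minimal sketch that builds the 32-bit word for one crypto instruction and prints it as a `.long` directive. The helper names are hypothetical, and the base opcode `0x5E004000` for `SHA256H Qd, Qn, Vm.4S` is the value commonly used in hand-encoded assembly; it should be checked against the Arm Architecture Reference Manual before reuse. The PR's actual kernels are Pascal include files and are not reproduced here.

```python
# Sketch only: hand-encoding an AArch64 instruction as a raw 32-bit word so an
# assembler without crypto-extension mnemonics can still emit it.
# ASSUMPTION: 0x5E004000 is the base opcode of SHA256H Qd, Qn, Vm.4S, with
# Rm in bits 20:16, Rn in bits 9:5 and Rd in bits 4:0.
SHA256H_BASE = 0x5E004000


def encode_sha256h(rd: int, rn: int, rm: int) -> int:
    """Return the instruction word for SHA256H Q<rd>, Q<rn>, V<rm>.4S."""
    for reg in (rd, rn, rm):
        if not 0 <= reg <= 31:
            raise ValueError("register number must be in 0..31")
    return SHA256H_BASE | (rm << 16) | (rn << 5) | rd


def as_long_directive(word: int) -> str:
    """Format an instruction word as a GNU-style .long directive."""
    return f".long 0x{word:08x}"


if __name__ == "__main__":
    # sha256h q0, q1, v2.4s
    print(as_long_directive(encode_sha256h(0, 1, 2)))
```

Emitting raw words trades readability for portability, which is why such kernels usually carry the intended mnemonic in a comment next to each `.long`.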

This PR also refactors the CRC fold core (unified runtime context, clearer include naming).

What's new on AArch64

Crypto Extensions (FEAT_SHA*)

| Algorithm | Dispatch probe | Kernel |
|---|---|---|
| SHA-1 | HasSHA1() | SHA1CompressCryptoExt_aarch64.inc |
| SHA-256 | HasSHA256() | SHA256CompressCryptoExt_aarch64.inc |
| SHA-512 | HasSHA512() | SHA512CompressCryptoExt_aarch64.inc |
| SHA-3 (Keccak-f[1600]) | HasSHA3() | KeccakF1600CryptoExt_aarch64.inc + absorb variant |
| CRC fold | HasPMULL() | CRCFoldForwardPmull_aarch64.inc, CRCFoldReflectedPmull_aarch64.inc |

NEON (Advanced SIMD)

| Algorithm | Dispatch | Kernel |
|---|---|---|
| BLAKE2b / BLAKE2s | SelectSlot([NEON]) | Blake2BCompressNeon_aarch64.inc, Blake2SCompressNeon_aarch64.inc |
| BLAKE3 | SelectSlot([NEON]) | Blake3CompressNeon_aarch64.inc, Blake3Hash4Neon_aarch64.inc |
| Adler-32 | SelectSlot([NEON]) | Adler32BlocksNeon_aarch64.inc |
| XXH3 | SelectSlot([NEON]) | XXH3Acc512Neon_aarch64.inc, XXH3InitSecretNeon_aarch64.inc, XXH3ScrambleNeon_aarch64.inc |
| Argon2 | SelectSlot([NEON]) | Argon2FillBlockNeon_aarch64.inc |
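
As a rough illustration of why a checksum such as Adler-32 maps onto block-wise SIMD, the sketch below computes it one block at a time using per-block sums instead of a per-byte loop. It is a scalar Python model of the general technique, not the code in Adler32BlocksNeon_aarch64.inc; the function name and the block size of 32 are assumptions made for the example.

```python
import zlib

MOD_ADLER = 65521  # largest prime below 2**16, as defined for Adler-32


def adler32_blocks(data: bytes, block: int = 32) -> int:
    """Adler-32 computed block by block (hypothetical helper).

    For a block d[0..n-1] with incoming sums (a, b):
        a' = a + sum(d[i])
        b' = b + n*a + sum((n - i) * d[i])
    The two sums are independent per byte, which is what a vector kernel
    can evaluate across lanes before a single modular reduction.
    """
    a, b = 1, 0
    for off in range(0, len(data), block):
        chunk = data[off:off + block]
        n = len(chunk)
        plain_sum = sum(chunk)
        weighted_sum = sum((n - i) * byte for i, byte in enumerate(chunk))
        b = (b + n * a + weighted_sum) % MOD_ADLER  # uses the old value of a
        a = (a + plain_sum) % MOD_ADLER
    return (b << 16) | a


if __name__ == "__main__":
    sample = b"Wikipedia"
    # Cross-check against the reference implementation in the standard library.
    print(hex(adler32_blocks(sample)), hex(zlib.adler32(sample)))
```

Comparing against `zlib.adler32` on a mix of block-aligned and unaligned lengths is a cheap way to validate any block-wise variant.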

Scrypt (intentional scalar default)

A verified ScryptSalsaXor_Neon kernel is included, but dispatch keeps the scalar path on AArch64. Benchmarks on Apple Silicon show scalar wins at every tested N because Scrypt's serial Salsa20/8 chain does not benefit from lane parallelism, while AArch64's 31 GPRs let the scalar kernel avoid spills. This matches upstream practice (OpenSSL/libsodium ship x86 SSE2 Scrypt but no NEON variant).

Infrastructure changes

  • HlpArmSimdFeatures: probes SHA-512, SHA-3, and PMULL; adds DisableAllExtraFeatures() for uniform HASHLIB_FORCE_* override baselines
  • AArch64 asm prologues: new SimdProc1Begin_aarch64.inc through SimdProc6Begin_aarch64.inc under Include/Simd/Common/
  • Dispatch documentation: standardized SIMD index blocks in all *Dispatch.pas units; kernel header conventions documented in HashLib.Tests/docs/SimdDispatch.md and SimdAarch64Headers.md
  • Package wiring: FPC/Delphi packages updated for renamed/consolidated CRC units

CRC core refactor (#86, #87)

  • Introduces unified TCRCFoldRuntimeCtx (fold constants + slicing table rows in one packed record)
  • Renames fold include files for consistency (CRCFoldForwardPclmul_x86_64.inc, etc.)
  • Renames HlpGF2.pas to HlpCRCFoldConstants.pas
  • Consolidates width-specific CRC wrappers into HlpCRCStandard.pas (replaces separate HlpCRC16/32/64.pas units)
  • Registers PMULL carry-less multiply fold when HasPMULL() is true (analogous to x86 PCLMUL/VPCLMUL chain)
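
For readers unfamiliar with carry-less multiply folding, the sketch below models the arithmetic that PMULL (or PCLMUL on x86) performs and the identity that fold constants rely on. It is a bit-level Python model under stated assumptions: the helper names are hypothetical, the polynomial is the standard CRC-32 generator in forward (non-reflected) form, and nothing here mirrors the layout of TCRCFoldRuntimeCtx.

```python
CRC32_POLY = 0x104C11DB7  # CRC-32 generator including the x^32 term


def clmul(a: int, b: int) -> int:
    """Carry-less (GF(2) polynomial) multiplication of two integers."""
    result = 0
    while b:
        if b & 1:
            result ^= a  # addition in GF(2) is XOR, so no carries propagate
        a <<= 1
        b >>= 1
    return result


def polymod(value: int, poly: int) -> int:
    """Remainder of value divided by poly over GF(2)."""
    degree = poly.bit_length() - 1
    while value.bit_length() > degree:
        value ^= poly << (value.bit_length() - 1 - degree)
    return value


def fold_constant(shift_bits: int, poly: int = CRC32_POLY) -> int:
    """x^shift_bits mod poly: the kind of constant a fold step multiplies by."""
    return polymod(1 << shift_bits, poly)


if __name__ == "__main__":
    chunk = 0xDEADBEEF
    k = fold_constant(128)
    # Shifting a chunk 128 bits ahead is congruent to multiplying it by k.
    assert polymod(clmul(chunk, k), CRC32_POLY) == polymod(chunk << 128, CRC32_POLY)
    print(hex(k))
```

The identity checked in the last assertion is what lets a fold kernel advance a wide chunk of input with one multiply by a precomputed constant instead of processing every intervening bit.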

Other changes

  • CI: adds benchmark support to the CI workflow
  • Benchmark: renames benchmark project/source files to HashLib.Benchmark* convention (no functional change)

Architecture

flowchart TD
  Init["InitDispatch at unit load"] --> Scalar["Assign scalar fallback"]
  Scalar --> ArmProbe{"AArch64?"}
  ArmProbe -->|CryptoExt| SHA["HasSHA* / HasSHA3 probes"]
  ArmProbe -->|NEON tier| NEON["SelectSlot NEON"]
  ArmProbe -->|PMULL| CRC["HasPMULL for CRC fold"]
  ArmProbe -->|Scrypt skip| ScryptScalar["Keep Scrypt_SalsaXor_Scalar"]
  SHA --> Active["Active proc pointer"]
  NEON --> Active
  CRC --> Active
  ScryptScalar --> Active
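
The flow above can be modelled in a few lines. This is a language-neutral sketch in Python, not the Pascal in the *Dispatch.pas units: the table contents, feature strings, and the `init_dispatch` name are all hypothetical stand-ins for the real probes (HasSHA256(), HasPMULL(), SelectSlot([NEON]), and so on).

```python
from typing import Dict, Iterable

# Hypothetical map from algorithm to the CPU feature its fast kernel needs.
REQUIRED_FEATURE: Dict[str, str] = {
    "sha256": "sha256",
    "sha3": "sha3",
    "crc_fold": "pmull",
    "blake3": "neon",
    "adler32": "neon",
}

# Algorithms that deliberately stay on the scalar path even when a
# SIMD kernel exists (Scrypt in this PR).
SCALAR_ONLY = ("scrypt",)


def init_dispatch(cpu_features: Iterable[str]) -> Dict[str, str]:
    """Return the kernel tier chosen for each algorithm."""
    available = set(cpu_features)
    # Step 1: every algorithm starts on the scalar fallback.
    active = {name: "scalar" for name in (*REQUIRED_FEATURE, *SCALAR_ONLY)}
    # Step 2: upgrade only where the required feature was detected.
    for name, feature in REQUIRED_FEATURE.items():
        if feature in available:
            active[name] = feature
    return active


if __name__ == "__main__":
    for name, tier in init_dispatch({"neon", "pmull"}).items():
        print(f"{name}: {tier}")
```

Assigning the scalar fallback first means a missing or failed probe can never leave an algorithm without a valid implementation.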

Xor-el added 30 commits June 21, 2026 02:13
- unify TCRCFoldRuntimeCtx32/64 into one TCRCFoldRuntimeCtx (untyped TableRow,
  same +96 asm layout); collapse the two init builders into one overload
- rename fold fns/globals to ISA-neutral names (Lsb->Reflected, Msb->Forward,
  CRC_Fold_UsesPclmul->CRC_Fold_UsesCarrylessMul) and the 10 CRCFold*.inc files
- factor scalar reflected slice into CRC_FoldReflected_OneSlice
- rename HlpGF2 -> HlpCRCFoldConstants; merge HlpCRC16/32/64 into HlpCRCStandard
  (HlpCRC32Fast kept separate)
- fix stale .inc header refs (e.g. TConverters.le2me_32)
public API (THashFactory.TChecksum.TCRC, TCRCStandard, ICRC) unchanged
Xor-el merged commit 4d2cf6d into master on Jun 25, 2026
24 checks passed
Xor-el deleted the feature/arm-simd branch on June 25, 2026 00:20