Skip to content

[asm] Replace RawOp with typed SALU ops + PackOp in SRD construction#1188

Open
Hardcode84 wants to merge 6 commits intoiree-org:mainfrom
Hardcode84:waveasm-srd-contruction-refac
Open

[asm] Replace RawOp with typed SALU ops + PackOp in SRD construction#1188
Hardcode84 wants to merge 6 commits intoiree-org:mainfrom
Hardcode84:waveasm-srd-contruction-refac

Conversation

@Hardcode84
Copy link
Contributor

@Hardcode84 Hardcode84 commented Mar 25, 2026

Eliminate all raw assembly string manipulation from SRD (Shader Resource Descriptor) construction, replacing it with typed MLIR ops visible to the register allocator.

What changed

Core refactor: SRD construction in handleFatRawBufferCast, emitSRDBaseAdjustment, and handleMakeBufferRsrc now uses ExtractOp to decompose source SRDs, typed SALU ops (S_MOV_B32, S_AND_B32, S_OR_B32, S_ADD_U32, S_ADDC_U32) to build individual words, and PackOp to assemble the result. Regalloc assigns contiguous registers to pack inputs, eliminating all hardcoded register index arithmetic and string-based RawOp("s_mov_b32 sN, ...") patterns.

Prologue SRD setup (in-place writes to precolored registers) uses typed ops targeting PSRegType + DCEProtectOp instead of RawOp.

Infrastructure: srdValueMap/getSRDValue/setSRDValue added to TranslationContext for SSA-based SRD lookup. emitSrdNumRecords split into buildSrdWord2 (returns free Value for PackOp) and patchSrdWord2InPlace. Dead getNextSwizzleSRDIndex/getFirstFreeSgprAfterSRDs removed.

Follow-up fixes

Moving from opaque RawOp strings to SSA-visible typed ops exposed several interactions with existing passes:

  • SRD lookup in stores: handleVectorStore had an inline SRD lookup checking getSRDIndex but not getSRDValue. Replaced with lookupSRD().

  • CSE duplicate PackOp inputs: ScopedCSE merges identical S_MOV_B32 ops, creating shared SSA values across PackOp slots. Liveness pass now inserts copies so each slot gets a distinct physical register.

  • SRD liveness across prologue-to-adjustment gap: emitSRDBaseAdjustment used disconnected PrecoloredSRegOp instead of the prologue's SSA value, allowing regalloc to reuse kernel-arg SRD registers. Propagated srcSrdValue through PendingSRDBaseAdjust and connected S_LOAD_DWORDX2 results to S_MOV_B64 sources.

  • VALU->readfirstlane hazard detection: Hazard mitigation missed conflicts when producer/consumer were different SSA values at the same physical VGPR, or when non-emitting ops sat between them. Now compares physical register indices and skips non-emitting ops.

  • ExtractOp-into-PackOp register conflict: ExtractOp is non-emitting (aliases source sub-register). When used as PackOp input, the post-allocation pass could assign it to a different register than the extract source. Liveness pass now inserts explicit copies for ExtractOp inputs to PackOps.

Hardcode84 and others added 5 commits March 25, 2026 13:52
Eliminate all raw assembly string manipulation from SRD (Shader Resource
Descriptor) construction, replacing it with typed MLIR ops that are
visible to the register allocator.

For new SRD construction (handleFatRawBufferCast, emitSRDBaseAdjustment,
handleMakeBufferRsrc): extract source SRD words via ExtractOp, build
each word with S_MOV_B32/S_AND_B32/S_OR_B32/S_ADD_U32/S_ADDC_U32, and
assemble via PackOp. Regalloc assigns contiguous registers to pack
inputs, eliminating hardcoded register index arithmetic.

For prologue SRD setup (in-place writes to precolored registers): replace
RawOp with typed S_MOV_B32/S_MOV_B64 targeting PSRegType + DCEProtectOp.

Infrastructure changes:
- Add srdValueMap/getSRDValue/setSRDValue to TranslationContext for
  SSA-based SRD lookup (parallel to the existing index-based API).
- Split emitSrdNumRecords into buildSrdWord2 (returns free Value for
  PackOp) and patchSrdWord2InPlace (writes to pinned register).
- Remove dead getNextSwizzleSRDIndex and getFirstFreeSgprAfterSRDs.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: Ivan Butygin <[email protected]>
…actor

The previous commit replaced RawOp with typed SALU ops + PackOp for SRD
construction, introducing two bugs:

1. handleVectorStore had an inline SRD lookup that checked getSRDIndex
   but not getSRDValue. Since emitSRDBaseAdjustment now uses setSRDValue,
   subsequent stores fell back to the unadjusted prologue SRD. Fix by
   using lookupSRD(), matching the load and masked-store handlers.

2. ScopedCSE can merge identical S_MOV_B32 ops across PackOps, creating
   a shared SSA value. Since one value can only have one physical
   register, the last PackOp assignment won, leaving earlier PackOps
   with an uninitialized SRD word. Fix by detecting duplicate PackOp
   inputs in the liveness pass and inserting copies (S_MOV_B32/V_MOV_B32)
   to give each slot its own distinct value.

Made-with: Cursor
Signed-off-by: Ivan Butygin <[email protected]>
Add two regression tests for bugs fixed in e9a9fca:

1. buffer-ops-store-reuses-srd.mlir: vector.store on the same adjusted
   memref as a prior vector.load must reuse the PackOp-produced SRD via
   getSRDValue, not fall back to the unadjusted prologue SRD.

2. lit-regalloc-pack-cse-dedup.mlir: when a single SSA value is shared
   across multiple PackOp slots (same value across packs, or duplicated
   within one pack), the liveness pass must insert copies so each slot
   gets a distinct physical register.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: Ivan Butygin <[email protected]>
…ection

Two issues caused the bufops e2e test to fail after the typed-op refactor:

1. SRD base adjustment used disconnected PrecoloredSRegOp instead of the
   prologue's SSA value, allowing the register allocator to reuse
   kernel-arg SRD registers. Propagate srcSrdValue through
   PendingSRDBaseAdjust and connect S_LOAD_DWORDX2 results to S_MOV_B64
   sources in the prologue.

2. Hazard mitigation missed VALU→v_readfirstlane conflicts when (a) the
   producer and consumer were different SSA values at the same physical
   VGPR, or (b) non-emitting ops (extract, constant) sat between them in
   the MLIR. Compare physical register indices instead of SSA identity,
   and skip non-emitting ops when searching for the preceding instruction.

Made-with: Cursor
Signed-off-by: Ivan Butygin <[email protected]>
ExtractOp is non-emitting (aliases source[offset] without a copy
instruction). When used as a PackOp input, the post-allocation pass
assigns it to result_base + i, which can differ from the extract's
physical register, silently corrupting the packed value.

Insert an explicit s_mov_b32/v_mov_b32 copy for ExtractOp inputs to
PackOps, matching the existing handling for duplicate inputs.

Fixes test_dynamic_copy_water_waveasm[shape0] where the SRD stride
word (word 2) was left with garbage after the register allocator
placed the new SRD at a different base than the source.

Made-with: Cursor
Signed-off-by: Ivan Butygin <[email protected]>
@Hardcode84 Hardcode84 force-pushed the waveasm-srd-contruction-refac branch from a527eda to f06e145 Compare March 26, 2026 12:45
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: Ivan Butygin <[email protected]>
@Hardcode84 Hardcode84 changed the title WIP: Replace RawOp with typed SALU ops + PackOp in SRD construction Replace RawOp with typed SALU ops + PackOp in SRD construction Mar 26, 2026
@Hardcode84 Hardcode84 changed the title Replace RawOp with typed SALU ops + PackOp in SRD construction [asm] Replace RawOp with typed SALU ops + PackOp in SRD construction Mar 26, 2026
@Hardcode84 Hardcode84 requested a review from harsh-nod March 26, 2026 13:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant