[asm] Replace RawOp with typed SALU ops + PackOp in SRD construction#1188
Open
Hardcode84 wants to merge 6 commits intoiree-org:mainfrom
Open
[asm] Replace RawOp with typed SALU ops + PackOp in SRD construction#1188Hardcode84 wants to merge 6 commits intoiree-org:mainfrom
Hardcode84 wants to merge 6 commits intoiree-org:mainfrom
Conversation
Eliminate all raw assembly string manipulation from SRD (Shader Resource Descriptor) construction, replacing it with typed MLIR ops that are visible to the register allocator. For new SRD construction (handleFatRawBufferCast, emitSRDBaseAdjustment, handleMakeBufferRsrc): extract source SRD words via ExtractOp, build each word with S_MOV_B32/S_AND_B32/S_OR_B32/S_ADD_U32/S_ADDC_U32, and assemble via PackOp. Regalloc assigns contiguous registers to pack inputs, eliminating hardcoded register index arithmetic. For prologue SRD setup (in-place writes to precolored registers): replace RawOp with typed S_MOV_B32/S_MOV_B64 targeting PSRegType + DCEProtectOp. Infrastructure changes: - Add srdValueMap/getSRDValue/setSRDValue to TranslationContext for SSA-based SRD lookup (parallel to the existing index-based API). - Split emitSrdNumRecords into buildSrdWord2 (returns free Value for PackOp) and patchSrdWord2InPlace (writes to pinned register). - Remove dead getNextSwizzleSRDIndex and getFirstFreeSgprAfterSRDs. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> Signed-off-by: Ivan Butygin <[email protected]>
…actor The previous commit replaced RawOp with typed SALU ops + PackOp for SRD construction, introducing two bugs: 1. handleVectorStore had an inline SRD lookup that checked getSRDIndex but not getSRDValue. Since emitSRDBaseAdjustment now uses setSRDValue, subsequent stores fell back to the unadjusted prologue SRD. Fix by using lookupSRD(), matching the load and masked-store handlers. 2. ScopedCSE can merge identical S_MOV_B32 ops across PackOps, creating a shared SSA value. Since one value can only have one physical register, the last PackOp assignment won, leaving earlier PackOps with an uninitialized SRD word. Fix by detecting duplicate PackOp inputs in the liveness pass and inserting copies (S_MOV_B32/V_MOV_B32) to give each slot its own distinct value. Made-with: Cursor Signed-off-by: Ivan Butygin <[email protected]>
Add two regression tests for bugs fixed in e9a9fca: 1. buffer-ops-store-reuses-srd.mlir: vector.store on the same adjusted memref as a prior vector.load must reuse the PackOp-produced SRD via getSRDValue, not fall back to the unadjusted prologue SRD. 2. lit-regalloc-pack-cse-dedup.mlir: when a single SSA value is shared across multiple PackOp slots (same value across packs, or duplicated within one pack), the liveness pass must insert copies so each slot gets a distinct physical register. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> Signed-off-by: Ivan Butygin <[email protected]>
…ection Two issues caused the bufops e2e test to fail after the typed-op refactor: 1. SRD base adjustment used disconnected PrecoloredSRegOp instead of the prologue's SSA value, allowing the register allocator to reuse kernel-arg SRD registers. Propagate srcSrdValue through PendingSRDBaseAdjust and connect S_LOAD_DWORDX2 results to S_MOV_B64 sources in the prologue. 2. Hazard mitigation missed VALU→v_readfirstlane conflicts when (a) the producer and consumer were different SSA values at the same physical VGPR, or (b) non-emitting ops (extract, constant) sat between them in the MLIR. Compare physical register indices instead of SSA identity, and skip non-emitting ops when searching for the preceding instruction. Made-with: Cursor Signed-off-by: Ivan Butygin <[email protected]>
ExtractOp is non-emitting (aliases source[offset] without a copy instruction). When used as a PackOp input, the post-allocation pass assigns it to result_base + i, which can differ from the extract's physical register, silently corrupting the packed value. Insert an explicit s_mov_b32/v_mov_b32 copy for ExtractOp inputs to PackOps, matching the existing handling for duplicate inputs. Fixes test_dynamic_copy_water_waveasm[shape0] where the SRD stride word (word 2) was left with garbage after the register allocator placed the new SRD at a different base than the source. Made-with: Cursor Signed-off-by: Ivan Butygin <[email protected]>
a527eda to
f06e145
Compare
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> Signed-off-by: Ivan Butygin <[email protected]>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Eliminate all raw assembly string manipulation from SRD (Shader Resource Descriptor) construction, replacing it with typed MLIR ops visible to the register allocator.
What changed
Core refactor: SRD construction in
handleFatRawBufferCast,emitSRDBaseAdjustment, andhandleMakeBufferRsrcnow uses ExtractOp to decompose source SRDs, typed SALU ops (S_MOV_B32, S_AND_B32, S_OR_B32, S_ADD_U32, S_ADDC_U32) to build individual words, and PackOp to assemble the result. Regalloc assigns contiguous registers to pack inputs, eliminating all hardcoded register index arithmetic and string-basedRawOp("s_mov_b32 sN, ...")patterns.Prologue SRD setup (in-place writes to precolored registers) uses typed ops targeting PSRegType + DCEProtectOp instead of RawOp.
Infrastructure:
srdValueMap/getSRDValue/setSRDValueadded to TranslationContext for SSA-based SRD lookup.emitSrdNumRecordssplit intobuildSrdWord2(returns free Value for PackOp) andpatchSrdWord2InPlace. DeadgetNextSwizzleSRDIndex/getFirstFreeSgprAfterSRDsremoved.Follow-up fixes
Moving from opaque RawOp strings to SSA-visible typed ops exposed several interactions with existing passes:
SRD lookup in stores:
handleVectorStorehad an inline SRD lookup checkinggetSRDIndexbut notgetSRDValue. Replaced withlookupSRD().CSE duplicate PackOp inputs: ScopedCSE merges identical S_MOV_B32 ops, creating shared SSA values across PackOp slots. Liveness pass now inserts copies so each slot gets a distinct physical register.
SRD liveness across prologue-to-adjustment gap:
emitSRDBaseAdjustmentused disconnected PrecoloredSRegOp instead of the prologue's SSA value, allowing regalloc to reuse kernel-arg SRD registers. PropagatedsrcSrdValuethroughPendingSRDBaseAdjustand connected S_LOAD_DWORDX2 results to S_MOV_B64 sources.VALU->readfirstlane hazard detection: Hazard mitigation missed conflicts when producer/consumer were different SSA values at the same physical VGPR, or when non-emitting ops sat between them. Now compares physical register indices and skips non-emitting ops.
ExtractOp-into-PackOp register conflict: ExtractOp is non-emitting (aliases source sub-register). When used as PackOp input, the post-allocation pass could assign it to a different register than the extract source. Liveness pass now inserts explicit copies for ExtractOp inputs to PackOps.