cc8fca8 to
33264be
Compare
| /// register. There may be multiple current definitions for a register with | ||
| /// disjunct lanemasks. | ||
| VReg2SUnitMultiMap CurrentVRegDefs; | ||
| VReg2SUnitOperIdxMultiMap CurrentVRegDefs; |
This was asymmetric between Uses and Defs. We need the operand index of the outstanding defs to compute operand latencies.
|
|
||
| // Use TRI's regsOverlap which handles both physical and virtual registers, | ||
| // including subregisters and lane masks | ||
| return TRI->regsOverlap(SrcReg, DstReg); |
I guess this was only needed transiently, but it looks really good.
Nice, they work on RegUnits.
| PostSWP->isPostPipelineCandidate(*TheBlock)) | ||
| staticallyMaterializeMultiSlotInstructions(*TheBlock, HR); | ||
| PostSWP->isPostPipelineCandidate(*TheBlock)) { | ||
| staticallyMaterializeMultiSlotInstructions(*TheBlock, HR, MaterializeAll); |
Would have been nice to be able to skip the scheduler before postpipelining. Sadly, the scheduler sometimes makes better decisions.
| for (int T = 0; T < II; ++T) { | ||
| LaneBitmask Mask = LanesByOffset[T]; | ||
| if (Mask.any()) { | ||
| // Show a simple indicator - could be enhanced to show actual lanes |
Indeed. Full lanemasks are bulky though.
| static cl::opt<bool> TestRegDefUseTracker( | ||
| "aie-test-regdefuse-tracker", cl::Hidden, cl::init(false), | ||
| cl::desc("[AIE] TEST MODE: Run RegDefUseTracker analysis on all loops " | ||
| "(for testing only)")); |
This is accommodating a dump for the early stages of live range analysis.
|
|
||
| void BlockState::restorePipelining() { | ||
| // Restore to the original allocation of the virtual registers | ||
| RegTracker->restoreOriginalPhysRegs(); |
These registers were used by the scheduler whose result we're going to use as a fallback.
7930abc to
dcc908c
Compare
| BS.FixPoint.PipelinerMode = firstPipelinerMode(); | ||
| if (BS.FixPoint.PipelinerMode != PostPipelinerMode::None) { | ||
| return SchedulingStage::Pipelining; | ||
| } |
This looks a bit weird: we have been pipelining and are trying to restore the first allowed pipeliner mode for the next II. This should be invariant, so I don't think we can get None here. Perhaps assert instead.
|
|
||
| // For virtual mode, re-analyze and virtualize | ||
| if (FixPoint.PipelinerMode == PostPipelinerMode::Virtual) { | ||
| // RegTracker might not exist if we have multiple regions |
Someone missed that we can't do physical mode either if we have more than one region.
I would hope that RegTracker is always there for a SWP candidate.
Also for virtual registers. For architectures where latencies can go negative, this has an impact on RecMII.
The old check lines matched, ignoring the leading nops.
The FixPoint updaters just return the new state.
This abstracts the live ranges to be used by PostRegAlloc.
This module analyses live ranges of physical registers that can be safely reallocated in a basic block. It supplies facilities to rewrite to virtual registers and to restore the original allocation.
This module produces an EventSchedule from the instructions and their issue cycles. The event schedule contains the read and write events of the virtual registers occurring in the instructions, ordered along the processor pipeline stage timeline. From the EventSchedule, the modulo live ranges for a particular II can be constructed. These represent the lanes of each register that are live at a particular point.
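To illustrate what folding a live range onto a particular II means, here is a minimal sketch. It is not the PR's EventSchedule: `LiveInterval`, `moduloOccupancy` and `fitsInOneRegister` are hypothetical names, and it assumes a whole-register range from one write cycle to one last-read cycle, ignoring lanes and pipeline stages.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified live range: the value is written at cycle Def
// and read for the last time at cycle LastUse. Lanes are ignored here.
struct LiveInterval {
  int Def;
  int LastUse;
};

// Fold the linear interval [Def, LastUse) onto II modulo slots and count
// how many overlapped copies of the value are live in each slot.
std::vector<int> moduloOccupancy(const LiveInterval &LI, int II) {
  assert(II > 0 && LI.Def <= LI.LastUse);
  std::vector<int> Live(II, 0);
  for (int C = LI.Def; C < LI.LastUse; ++C)
    ++Live[((C % II) + II) % II]; // Non-negative modulo, cycles may be < 0.
  return Live;
}

// A single register can hold the range at this II only if no modulo slot
// sees more than one live copy.
bool fitsInOneRegister(const LiveInterval &LI, int II) {
  for (int N : moduloOccupancy(LI, II))
    if (N > 1)
      return false;
  return true;
}
```

A range longer than II wraps onto itself, which is exactly the case where one register is not enough for that II.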
This is a dedicated register allocator for use by the postpipeliner. We compute some metrics and run with a few different score functions on those metrics to define an allocation order. We allocate in that order, and fail as soon as we can't find a register that is available over the live range.
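The allocate-in-score-order, fail-fast idea can be sketched roughly as follows. This is an illustration under assumptions, not the allocator in this PR: `Range`, its precomputed `Score` field and `allocateInScoreOrder` are hypothetical, ranges are plain half-open cycle intervals, and all registers are interchangeable. It needs C++17 for `std::optional`.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical live range: half-open interval [Start, End) with a
// precomputed score; Id indexes the result vector.
struct Range {
  int Id;
  int Start;
  int End;
  int Score;
};

// Allocate ranges in descending score order. Returns the register chosen
// for each range Id, or std::nullopt as soon as one range has no register
// that is free over its whole interval.
std::optional<std::vector<int>> allocateInScoreOrder(std::vector<Range> Ranges,
                                                     int NumRegs) {
  std::stable_sort(Ranges.begin(), Ranges.end(),
                   [](const Range &A, const Range &B) {
                     return A.Score > B.Score;
                   });
  std::vector<std::vector<Range>> Assigned(NumRegs);
  std::vector<int> RegOf(Ranges.size(), -1);
  for (const Range &R : Ranges) {
    assert(R.Id >= 0 && R.Id < static_cast<int>(Ranges.size()));
    int Chosen = -1;
    for (int Reg = 0; Reg < NumRegs && Chosen < 0; ++Reg) {
      // The register is free if R overlaps none of its assigned ranges.
      bool Free = std::none_of(Assigned[Reg].begin(), Assigned[Reg].end(),
                               [&](const Range &O) {
                                 return R.Start < O.End && O.Start < R.End;
                               });
      if (Free)
        Chosen = Reg;
    }
    if (Chosen < 0)
      return std::nullopt; // Fail fast; a caller may retry with other scores.
    Assigned[Chosen].push_back(R);
    RegOf[R.Id] = Chosen;
  }
  return RegOf;
}
```

Trying several score functions then amounts to recomputing `Score` and calling the function again until one order succeeds.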
This is a strategy that prioritizes scheduling of scarce ranges. Scarce ranges are live ranges that compete for one available register. The live ranges are virtualized, which means we have no serializing WAR deps. However, we need to be careful not to have more than one of them live at a time, which means we want to finish a range before starting a new one. We try all legal permutations of these live ranges. For the current live range, we first prioritize all its ancestors, then the instructions in the range itself. Once we are finished with the range, we simulate the WAR dependences that are necessary to keep the next ranges non-overlapping.
dcc908c to
133f034
Compare
133f034 to
699934a
Compare
| const AIEHazardRecognizer &HR, | ||
| ResourceScoreboard<FuncUnitWrapper> &Scoreboard) { | ||
| const int Step = fromTop() ? 1 : -1; | ||
| if (Step < 0) { |
Why does fromTop change the stepping direction? Also, why are First and Last in the wrong order?
| ; CHECK-NEXT: vshuffle x2, x8, x0, r16; vmac.f bml5, bml5, x9, x7, r2 | ||
| ; CHECK-NEXT: .L_LEnd0: | ||
| ; CHECK-NEXT: vshuffle x10, x1, x3, r3; vmac.f bmh4, bmh4, x6, x5, r2 | ||
| ; CHECK-NEXT: nopb ; nopa ; nops ; nopx ; vshuffle x10, x1, x3, r3; vmac.f bmh4, bmh4, x6, x5, r2 |
Curious, what induced this change?
| auto *AltA = Formats->getAlternateInstsOpcode(A->getOpcode()); | ||
| auto *AltB = Formats->getAlternateInstsOpcode(B->getOpcode()); | ||
|
|
||
| return AltA->size() < AltB->size(); |
Check: fewer options, higher priority.
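For reference, the ordering being checked here can be sketched in isolation. `Candidate` and `sortMostConstrainedFirst` are hypothetical stand-ins; the PR itself compares the sizes returned by `getAlternateInstsOpcode`.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for an instruction with its alternate opcodes.
struct Candidate {
  std::string Name;
  std::vector<unsigned> AltOpcodes;
};

// Order candidates so that those with the fewest alternatives come first.
// The most constrained candidates get the first pick of resources.
void sortMostConstrainedFirst(std::vector<Candidate> &Cands) {
  std::stable_sort(Cands.begin(), Cands.end(),
                   [](const Candidate &A, const Candidate &B) {
                     return A.AltOpcodes.size() < B.AltOpcodes.size();
                   });
}
```

With this comparator, sorting in ascending order puts the candidate with the fewest options at the front, which matches "fewer options, higher priority" only if the consumer takes elements from the front.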
| auto SlotToBanks = getAssignedSlots(MBB, TII, HR); | ||
|
|
||
| if (!assignSlots(SlotToBanks, MBB, TII, HR)) { | ||
| if (assignSlots(SlotToBanks, MBB, TII, HR)) { |
Maybe a comment here explaining that we care about memory banks first.
| } | ||
|
|
||
| const int Limit = Last + Step; | ||
| for (int C = First; C != Limit; C += Step) { |
The original implementation has some debug messages. Are they not useful? Also, we could remove the original fit.
| const AIEHazardRecognizer &HR, | ||
| ResourceScoreboard<FuncUnitWrapper> &Scoreboard) { | ||
| const int Step = fromTop() ? 1 : -1; | ||
| if (Step < 0) { |
We could have something like:
if (!fromTop()) {
std::swap(First, Last);
}
It self-documents the code.
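A standalone sketch of how that suggestion could combine with the `Step`/`Limit` loop from the diff is below. `cyclesToVisit` is a hypothetical helper, and it assumes the caller passes `First <= Last` and that the `Step < 0` branch in the PR is meant to swap the bounds.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Return the cycles in [First, Last] in the order they would be visited:
// ascending when scheduling from the top, descending otherwise.
std::vector<int> cyclesToVisit(int First, int Last, bool FromTop) {
  assert(First <= Last);
  const int Step = FromTop ? 1 : -1;
  if (!FromTop)
    std::swap(First, Last); // Start at the high end when walking backwards.
  const int Limit = Last + Step; // One step past the final cycle.
  std::vector<int> Order;
  for (int C = First; C != Limit; C += Step)
    Order.push_back(C);
  return Order;
}
```

The `C != Limit` condition only terminates if the bounds are ordered consistently with `Step`, which is why the swap and the step direction have to be derived from the same flag.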
I still find this whole business funny. Why exactly are we doing that?
| /// Key identifying a live range and its subregister | ||
| struct LRKey { | ||
| unsigned LRId; // Live range identifier | ||
| unsigned SubRegIdx; // Subregister index (0 for full register) |
CHECK: is SubRegIdx sufficient for comparing registers of different regclasses like ex and x regs or ex and y regs?
I think this check could be stricter by checking whether the blocked regunits overlap.
This is a POC of register allocation during postpipelining.
We add
Status:
It's aggressive enough to reach II=7 on gemm-bfp16-opt0, but sadly, the code it produces is not correct. I'm trying to find out what is causing my diff failure.