[AIEX] Improve prologue scheduling by using AIERegMemEventTracker#765
[AIEX] Improve prologue scheduling by using AIERegMemEventTracker#765andcarminati wants to merge 2 commits intoaie-publicfrom
Conversation
| ; CHECK-LABEL: zol_elongated_preheader_plus_NumOfLoopBundles_ge112bytes: | ||
| ; CHECK: .p2align 4 | ||
| ; CHECK-NEXT: // %bb.0: // %entry | ||
| ; CHECK-NEXT: vldb x6, [p0], #0 |
There was a problem hiding this comment.
This is something relevant. The old approach detected a dependency through p0, but by using maxLatency, we extend the distance to 7 (load latency), what is quite pessimistic.
|
Next step is to refactor a bit the code. |
eced1ea to
d3620c9
Compare
| /*InSeparateRegion=*/false); | ||
|
|
||
| // Find the loop successor - must exist if we have bot-fixed bundles in a | ||
| // prologue |
There was a problem hiding this comment.
What is the difference between regular interblock scheduling and the pipelining interblock scheduling?
There was a problem hiding this comment.
They are different in the sense that without pipelining:
- We push a safety margin on the epilogue block based on what we see on the loop.
- We initialize the bot scoreboard of the prologue based on the loop state.
With a pipelined loop, we have the top and bot insert regions that are already scheduled. We carefully dump those regions to a place where the normal DAG construction will not see them (sequence of bundles), because we need full control to not rearrange/change timing properties. After this timing region placement, we carefully create DAG nodes for the bundles (Also, normal DAG construction is not prepared for bundles- but this is not the reason), tying them together to a proper in-sequence scheduling. After, we construct a timeline of events relative the EntrySU and ExitSU, so we can create dependencies for the "free instructions". Register dependencies are the easy part on this timeline construction, the problem arises when we consider also memory cycles and alias relations (most of the code). As result, for each "free" instruction, we have a single edge to Entry/Exit modeling all memory and register latencies. For example, a memory instruction in the prologue can execute a register write inside de loop, and this must be also tracked - we do this here using events with negative cycles.
Also, this is the starting point for prologue/epilogue merging as a whole, the most complicated part is done by the scheduler: release each fixed node in sequence before any other node. This involves, for example, the switching the from top to bottom-um scheduling and delta cycles in both directions.
| const BlockState &LoopBS = | ||
| Scheduler->getInterBlock().getBlockState(LoopSucc); | ||
| ArrayRef<AIE::MachineBundle> LoopTimedBundles = LoopBS.getTop().Bundles; | ||
| BackwardRET.computeUseDefBackward(LoopTimedBundles, |
There was a problem hiding this comment.
How do we handle the wrap around nature of the loops?
There was a problem hiding this comment.
Once we execute the first iteration of the loop (all instructions will be there), this first iteration will be responsible for the first events in the timeline that may influence the prologue. Once we iterate again, events will not be interesting anymore because we already saw them previously, closer to the prologue. Happily, pipelined loops execute at least once and they contain all instructions.
| } | ||
|
|
||
| void AIERegMemEventTracker::updateLastStoreCycle(int StoreCycle) { | ||
| LastStoreCycle = std::max(LastStoreCycle, StoreCycle); |
There was a problem hiding this comment.
What is the difference between this and the scoreboard we use in the Postpipeliner? Can't we reuse it here?
There was a problem hiding this comment.
A scoreboard can track resources (functional units/slots etc.) in the pipeline without the notion of latency. An instruction may reserve/require resources the the scoreboard, but there in no notion of operand latency/data dependency there. Here, we are not interested any resources, only latencies that can accommodate data dependencies and also memory relations between "free" instructions and already scheduled instructions.
| SDep Dep(FixedDepSU, SDep::Artificial); | ||
| Dep.setLatency(1); | ||
| FreeSU.addPred(Dep, /*Required=*/true); | ||
| // Handle bot-fixed instructions (prologue) with backward tracking |
There was a problem hiding this comment.
Looks really good. I think that we can make this function even more understandable by factoring out the intermediate steps into helper functions. Maybe it's even possible to share some code between the prologue and epilogue handling.
| FreeSU.addPred(Dep, /*Required=*/true); | ||
| // Handle bot-fixed instructions (prologue) with backward tracking | ||
| // Only handle prologues (preheader blocks), not epilogues here | ||
| if (!BotFixedBundles.empty() && BS.Kind != BlockType::Epilogue) { |
There was a problem hiding this comment.
I think this check is redundant. I suggest moving the block type check into an assert (as we do in the prologue case) or removing it altogether.
There was a problem hiding this comment.
New checks available. We can have 1-stage pipelined loop with BotFixedBundles.
| // SUnits get placed exactly at their depth (for the Top zone) or height | ||
| // (for the Bot zone). | ||
| SUnit *Pred = &DAG->EntrySU; | ||
| // We itarate over BUNDLEs or standalone instructions. |
| int Cycle = 0; | ||
| const int TotalCycles = Bundles.size(); | ||
|
|
||
| // Track top-fixed region size (only for the first call, not separate region) |
There was a problem hiding this comment.
Nit: This comment explains the what and not the why. Could you add an explanation of why this tracking is only needed in the first call? Also, why not track it in both cases and move the check to the place where the total cycles is actually used, i.e. if the size is only needed in a computation in certain scenarios, then I think it makes sense to keep that check closer to the computation rather than here.
|
|
||
| // Count progressively from 0 as we iterate backward through bundles | ||
| // Cycle represents the distance from ExitSU for each bundle | ||
| int Cycle = 0; |
There was a problem hiding this comment.
Nit: AFAIK we usually call this the height. Please rename for consistency.
There was a problem hiding this comment.
Definitely not. AIERegMemEventTracker is a register and memory event tracker so we register events in their respective cycles.
| int Cycle = 0; | ||
|
|
||
| for (const auto &Bundle : reverse(Bundles)) { | ||
| // Cycle is the distance from ExitSU for this bundle |
| if (getFirstMemoryAccessCycle() > INT_MIN) { | ||
| const int CoreResumeCycle = TII->getCoreResumeCycleAfterLock(); | ||
| const int MIResumeCycle = CoreResumeCycle - 1; | ||
| const int MemDep = getFirstMemoryAccessCycle() - MIResumeCycle + 2; |
There was a problem hiding this comment.
I rewrote this part for clarity.
| [](const MachineOperand &MO) { return MO.isReg(); }); | ||
| }; | ||
|
|
||
| return MI.hasUnmodeledSideEffects() && !MI.mayLoadOrStore() && |
There was a problem hiding this comment.
why do we treat unmodeled sideffects differently than locks or load/stores?
There was a problem hiding this comment.
They are opaque instructions: no operands, no memory cycles, do not load or store and they are not locks (with respective associated cycles). In essence, they are scheduling barriers.
| } | ||
| // Check stores in top-fixed (RAW for loads, WAR for stores) | ||
| const int StoreCycle = | ||
| AA ? getMaxAliasingMemCycle(MI, /*IsStore=*/true) : getLastStoreCycle(); |
There was a problem hiding this comment.
shouldn't this be getLastMemoryAccessCycle since we are unifying store and load?
There was a problem hiding this comment.
This seems correct to me, we care about stores here.
d3620c9 to
9acbf08
Compare
The goal is to reduce pessimism by precise register event tracking. Locks and events are handled in the same way as in the epilogue.
9acbf08 to
097e353
Compare


The goal is to reduce pessimism by precise register event tracking. Locks and events are handled in the same way as in the epilogue.