Skip to content

[AIEX] Improve prologue scheduling by using AIERegMemEventTracker#765

Draft
andcarminati wants to merge 2 commits intoaie-publicfrom
andreu.prologue.improve
Draft

[AIEX] Improve prologue scheduling by using AIERegMemEventTracker#765
andcarminati wants to merge 2 commits intoaie-publicfrom
andreu.prologue.improve

Conversation

@andcarminati
Copy link
Collaborator

The goal is to reduce pessimism by precise register event tracking. Locks and events are handled in the same way as in the epilogue.

; CHECK-LABEL: zol_elongated_preheader_plus_NumOfLoopBundles_ge112bytes:
; CHECK: .p2align 4
; CHECK-NEXT: // %bb.0: // %entry
; CHECK-NEXT: vldb x6, [p0], #0
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something relevant. The old approach detected a dependency through p0, but by using maxLatency, we extend the distance to 7 (load latency), what is quite pessimistic.

@andcarminati
Copy link
Collaborator Author

Next step is to refactor a bit the code.

@andcarminati andcarminati force-pushed the andreu.prologue.improve branch from eced1ea to d3620c9 Compare January 30, 2026 13:05
/*InSeparateRegion=*/false);

// Find the loop successor - must exist if we have bot-fixed bundles in a
// prologue
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference between regular interblock scheduling and the pipelining interblock scheduling?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are different in the sense that without pipelining:

  • We push a safety margin on the epilogue block based on what we see on the loop.
  • We initialize the bot scoreboard of the prologue based on the loop state.

With a pipelined loop, we have the top and bot insert regions that are already scheduled. We carefully dump those regions to a place where the normal DAG construction will not see them (sequence of bundles), because we need full control to not rearrange/change timing properties. After this timing region placement, we carefully create DAG nodes for the bundles (Also, normal DAG construction is not prepared for bundles- but this is not the reason), tying them together to a proper in-sequence scheduling. After, we construct a timeline of events relative the EntrySU and ExitSU, so we can create dependencies for the "free instructions". Register dependencies are the easy part on this timeline construction, the problem arises when we consider also memory cycles and alias relations (most of the code). As result, for each "free" instruction, we have a single edge to Entry/Exit modeling all memory and register latencies. For example, a memory instruction in the prologue can execute a register write inside de loop, and this must be also tracked - we do this here using events with negative cycles.

Also, this is the starting point for prologue/epilogue merging as a whole, the most complicated part is done by the scheduler: release each fixed node in sequence before any other node. This involves, for example, the switching the from top to bottom-um scheduling and delta cycles in both directions.

const BlockState &LoopBS =
Scheduler->getInterBlock().getBlockState(LoopSucc);
ArrayRef<AIE::MachineBundle> LoopTimedBundles = LoopBS.getTop().Bundles;
BackwardRET.computeUseDefBackward(LoopTimedBundles,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we handle the wrap around nature of the loops?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we execute the first iteration of the loop (all instructions will be there), this first iteration will be responsible for the first events in the timeline that may influence the prologue. Once we iterate again, events will not be interesting anymore because we already saw them previously, closer to the prologue. Happily, pipelined loops execute at least once and they contain all instructions.

}

void AIERegMemEventTracker::updateLastStoreCycle(int StoreCycle) {
LastStoreCycle = std::max(LastStoreCycle, StoreCycle);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference between this and the scoreboard we use in the Postpipeliner? Can't we reuse it here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A scoreboard can track resources (functional units/slots etc.) in the pipeline without the notion of latency. An instruction may reserve/require resources the the scoreboard, but there in no notion of operand latency/data dependency there. Here, we are not interested any resources, only latencies that can accommodate data dependencies and also memory relations between "free" instructions and already scheduled instructions.

SDep Dep(FixedDepSU, SDep::Artificial);
Dep.setLatency(1);
FreeSU.addPred(Dep, /*Required=*/true);
// Handle bot-fixed instructions (prologue) with backward tracking
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really good. I think that we can make this function even more understandable by factoring out the intermediate steps into helper functions. Maybe it's even possible to share some code between the prologue and epilogue handling.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

FreeSU.addPred(Dep, /*Required=*/true);
// Handle bot-fixed instructions (prologue) with backward tracking
// Only handle prologues (preheader blocks), not epilogues here
if (!BotFixedBundles.empty() && BS.Kind != BlockType::Epilogue) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this check is redundant. I suggest moving the block type check into an assert (as we do in the prologue case) or removing it altogether.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New checks available. We can have 1-stage pipelined loop with BotFixedBundles.

// SUnits get placed exactly at their depth (for the Top zone) or height
// (for the Bot zone).
SUnit *Pred = &DAG->EntrySU;
// We itarate over BUNDLEs or standalone instructions.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: iterate

int Cycle = 0;
const int TotalCycles = Bundles.size();

// Track top-fixed region size (only for the first call, not separate region)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: This comment explains the what and not the why. Could you add an explanation of why this tracking is only needed in the first call? Also, why not track it in both cases and move the check to the place where the total cycles is actually used, i.e. if the size is only needed in a computation in certain scenarios, then I think it makes sense to keep that check closer to the computation rather than here.


// Count progressively from 0 as we iterate backward through bundles
// Cycle represents the distance from ExitSU for each bundle
int Cycle = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: AFAIK we usually call this the height. Please rename for consistency.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely not. AIERegMemEventTracker is a register and memory event tracker so we register events in their respective cycles.

int Cycle = 0;

for (const auto &Bundle : reverse(Bundles)) {
// Cycle is the distance from ExitSU for this bundle
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Repeated comment.

if (getFirstMemoryAccessCycle() > INT_MIN) {
const int CoreResumeCycle = TII->getCoreResumeCycleAfterLock();
const int MIResumeCycle = CoreResumeCycle - 1;
const int MemDep = getFirstMemoryAccessCycle() - MIResumeCycle + 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why +2?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rewrote this part for clarity.

[](const MachineOperand &MO) { return MO.isReg(); });
};

return MI.hasUnmodeledSideEffects() && !MI.mayLoadOrStore() &&
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we treat unmodeled sideffects differently than locks or load/stores?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are opaque instructions: no operands, no memory cycles, do not load or store and they are not locks (with respective associated cycles). In essence, they are scheduling barriers.

}
// Check stores in top-fixed (RAW for loads, WAR for stores)
const int StoreCycle =
AA ? getMaxAliasingMemCycle(MI, /*IsStore=*/true) : getLastStoreCycle();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be getLastMemoryAccessCycle since we are unifying store and load?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems correct to me, we care about stores here.

@andcarminati
Copy link
Collaborator Author

QoR:
image
image

This helps to improve the outermost loop of GEMM inst8.

@andcarminati andcarminati force-pushed the andreu.prologue.improve branch from d3620c9 to 9acbf08 Compare February 6, 2026 13:25
The goal is to reduce pessimism by precise register event tracking.
Locks and events are handled in the same way as in the epilogue.
@andcarminati andcarminati force-pushed the andreu.prologue.improve branch from 9acbf08 to 097e353 Compare February 6, 2026 14:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants