Draft

Conversation

Contributor: nice, the benchmark tests pass

Member (Author): waiting for #931 to be merged
Description
This PR simplifies the sharding metadata used to track tensor shapes across ranks.
Previously, we stored the full tensor shape for each rank (e.g. `list[list[int]]`). This required layers to manually track reshapes and dimension changes whenever tensors were transformed, which made the sharding logic fragile and tightly coupled to tensor layouts.
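To make the fragility concrete, here is a minimal illustration (with made-up numbers, not actual Anemoi code) of how full per-rank shapes break under a reshape, while per-rank sizes along the sharded dimension do not:

```python
# Suppose 3 ranks shard dim 0 of a (296, 64) tensor: 100 + 100 + 96 rows.
# Old metadata: the full shape of every rank's shard.
old_shapes = [[100, 64], [100, 64], [96, 64]]

# A layer reshapes each shard to (num_heads, rows, head_dim) with
# num_heads = head_dim = 8. Every entry (and the index of the sharded
# dimension) must now be recomputed by hand:
old_shapes_reshaped = [[8, 100, 8], [8, 100, 8], [8, 96, 8]]

# New metadata: only the per-rank sizes along the sharded dimension,
# which are unchanged by reshapes of the other dimensions.
shard_sizes = [100, 100, 96]
```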
This refactor introduces `ShardSizes = Union[list[int], None]`, representing the per-rank shard sizes along only the sharded dimension. Layers now propagate this information through bundled `GraphShardInfo`/`BipartiteGraphInfo` objects, which track shard metadata for both nodes and edges. The full shape expansion happens only at the level of the communication primitive, where shapes are assumed equal across ranks for the non-sharded dimensions.
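A minimal sketch of the idea: the `ShardSizes` alias matches the PR, but the helper name `expand_shapes` and its signature are hypothetical stand-ins for the expansion done inside the communication primitive.

```python
from typing import Union

import torch

# Per-rank sizes along the sharded dimension, or None if not sharded.
ShardSizes = Union[list[int], None]

def expand_shapes(shard_sizes: list[int], local_shape: torch.Size, dim: int) -> list[list[int]]:
    """Rebuild full per-rank shapes at the communication primitive,
    assuming the non-sharded dimensions are equal across ranks."""
    shapes = []
    for size in shard_sizes:
        shape = list(local_shape)
        shape[dim] = size
        shapes.append(shape)
    return shapes

# e.g. a rank holding a (100, 64) shard, with shard_sizes [100, 100, 96]:
assert expand_shapes([100, 100, 96], torch.Size([100, 64]), dim=0) == [
    [100, 64],
    [100, 64],
    [96, 64],
]
```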
It also consolidates the separate all-to-all primitives for head/channel <-> grid sharding into a single common all-to-all primitive.
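Below is a rough sketch of what such a single primitive could look like on top of `torch.distributed`; the function name and signature are illustrative, not the actual implementation.

```python
from typing import Optional

import torch
import torch.distributed as dist

def reshard_all_to_all(
    x: torch.Tensor,
    split_sizes: list[int],   # ShardSizes along split_dim (to be sharded)
    gather_sizes: list[int],  # ShardSizes along gather_dim (currently sharded)
    split_dim: int,
    gather_dim: int,
    group: Optional[dist.ProcessGroup] = None,
) -> torch.Tensor:
    """Move a tensor from being sharded along gather_dim to being sharded
    along split_dim (e.g. grid <-> heads) with a single all-to-all."""
    rank = dist.get_rank(group)
    world_size = dist.get_world_size(group)

    # One outgoing chunk per rank, cut along the dimension being sharded.
    inputs = list(torch.split(x, split_sizes, dim=split_dim))

    # Pre-allocate the incoming chunks. Non-sharded dimensions are assumed
    # equal across ranks, so only the two sharded dimensions vary.
    outputs = []
    for src in range(world_size):
        shape = list(x.shape)
        shape[split_dim] = split_sizes[rank]
        shape[gather_dim] = gather_sizes[src]
        outputs.append(torch.empty(shape, dtype=x.dtype, device=x.device))

    dist.all_to_all(outputs, inputs, group=group)

    # The received chunks together cover the full gather_dim.
    return torch.cat(outputs, dim=gather_dim)
```

With the dimension roles swapped, both head/channel -> grid and grid -> head/channel resharding reduce to calls to this one primitive.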
Additional notes
For now I've tested sharding for a global model across combinations of:
Please feel free to also test your favourite use case.
As a contributor to the Anemoi framework, please ensure that your changes include unit tests, updates to any affected dependencies and documentation, and have been tested in a parallel setting (i.e., with multiple GPUs). As a reviewer, you are also responsible for verifying these aspects and requesting changes if they are not adequately addressed. For guidelines, please refer to https://anemoi.readthedocs.io/en/latest/
By opening this pull request, I affirm that all authors agree to the Contributor License Agreement.