Conversation

@sophie-xhonneux (Contributor) commented Jan 14, 2026

Description

Make sure register and class tokens are used in the query aggregation engine.

Issue Number

Closes #1608
Closes #1673

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings.
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a hedgedoc in the GitHub issue with all the configurations and runs for these experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

@github-actions github-actions bot added the model Related to model training or definition (not generic infra) label Jan 14, 2026
@sophie-xhonneux sophie-xhonneux requested review from clessig and removed request for shmh40 January 14, 2026 16:58

# streams_directory: "./config/streams/era5_1deg/"
streams_directory: "./config/streams/era5_nppatms_synop/"
streams_directory: "./config/streams/era5_1deg/"
Collaborator
Restore this file

Contributor Author
Some changes make sense and shouldn't be restored, e.g. training_mode including masking.

# currently fixed to 1.0 (due to limitations with flex_attention and triton)
forecast_att_dense_rate: 1.0

sslpred_num_blocks: 12
Collaborator
Move to JEPA loss terms; it also looks like this has already been done.

# get target_aux calculators for different loss terms
self.target_and_aux_calculators = self.get_target_aux_calculators(self.training_cfg)
self.validate_with_ema_cfg = self.get_target_aux_calculators(self.validation_cfg)
# self.validate_with_ema_cfg = self.get_target_aux_calculators(self.validation_cfg)
Collaborator
Re-enable

Contributor Author
I actually think this breaks things and should be removed

target_aux.update_state_pre_backward(self.cf.general.istep, batch, self.model)
for _, target_aux in self.target_and_aux_calculators.items()
]
[
Collaborator
Why has this been removed?

target_aux.update_state_post_opt_step(step, batch, self.model)
for _, target_aux in self.target_and_aux_calculators.items()
]
[
Collaborator
Why has this been removed?

.repeat(rs, 1)
)
cell_lens_r = cell_lens.unsqueeze(0).reshape(rs, self.num_healpix_cells)
mask = torch.cat([mask_reg_class_tokens, cell_lens_r.to(torch.bool)], dim=1)
Contributor
@sophie-xhonneux to check: here we have applied the aggregation engine to the unmasked tokens and the register + class tokens; then we create this mask for the reg and class tokens (all 1s), concat it with the appropriate mask for the normal tokens, and then fill tokens_global in the corresponding positions with the output of the aggregation engine on the unmasked tokens?
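For reference, roughly the pattern being described, as a minimal self-contained sketch (the shapes, the random values, and names such as num_reg_class are illustrative assumptions, not the actual encoder code):

import torch

rs, num_reg_class, num_healpix_cells, dim = 2, 4, 12, 8

# hypothetical per-cell token counts; cells with count 0 contain no valid tokens
cell_lens = torch.tensor([1, 0, 2, 0, 1, 1, 0, 3, 0, 1, 2, 0]).repeat(rs)

# register/class positions are always kept, so their mask entries are all True
mask_reg_class_tokens = torch.ones(rs, num_reg_class, dtype=torch.bool)
cell_lens_r = cell_lens.unsqueeze(0).reshape(rs, num_healpix_cells)
mask = torch.cat([mask_reg_class_tokens, cell_lens_r.to(torch.bool)], dim=1)

# output of the aggregation engine for the kept (unmasked) positions only
agg_out = torch.randn(int(mask.sum()), dim)

# scatter the aggregated tokens back into the full global token buffer
tokens_global = torch.zeros(rs, num_reg_class + num_healpix_cells, dim)
tokens_global[mask] = agg_out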

if tokens_c.shape[0] == 0:
# Check if this chunk is empty
if l0 == l1 or toks.shape[0] == 0:
continue
Contributor
@sophie-xhonneux do you think the FSDP problem we had before might also be fixed by the PR to the FSDP package? Worth trying with just a copy-and-paste of the posted PR https://github.com/pytorch/pytorch/pull/170667/files ?

Collaborator
One can just edit it in the local venv and try:

.venv/lib/python3.12/site-packages/torch/distributed/fsdp/_fully_shard/_fully_shard.py
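A quick way to confirm which file to edit (a small sketch; assumes the torch build in the venv ships the _fully_shard layout shown above):

import torch.distributed.fsdp._fully_shard._fully_shard as fully_shard

# print the installed location, so the changes from the upstream PR can be
# pasted into the local copy
print(fully_shard.__file__)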

Contributor Author
Agreed, we should.


Labels

model Related to model training or definition (not generic infra)

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

[Bug] Tensor Shape and Indexing Mismatches in encoder.py when rs > 1
Add the register tokens before the aggregation engine

5 participants