Conversation

@clessig (Collaborator) commented Jan 24, 2026

Description

Cleanups and improvements to the target branch.

Issue Number

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and updated the docstrings
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and listed the run_id(s) in a comment: launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a HedgeDoc in the GitHub issue with all the configurations and runs for these experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

assert all(step > 0 for step in forecast_cfg.num_steps), valid_forecast_steps_offset1

# check forecast offset
if forecast_cfg.get("offset") is not None:
Contributor:

We set a default forecast_offset=0 in the multi_stream_data_sampler, so we should still do the checks below if forecast_cfg.get("offset") is None.
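
A minimal sketch of the suggested flow, assuming forecast_cfg supports a dict-style get with a default (the structure mirrors the snippet above; not the actual project code):

    # Fall back to the sampler's default offset of 0 so the checks below
    # always run, even when "offset" is absent from the config.
    offset = forecast_cfg.get("offset", 0)
    if offset == 1:
        assert all(step > 0 for step in forecast_cfg.num_steps), valid_forecast_steps_offset1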

Collaborator (Author):

Which line was this? Still cannot see it in the deleted code above.


self.forecast_policy = None
self.time_step = np.timedelta64(0, "ms")

fsm = self.list_num_forecast_steps[0]
Contributor:

Why [0]? list_num_forecast_steps is not necessarily sorted.

Contributor:

If I understand correctly, the forecast_steps (if several options are given) are drawn further down for each batch step. So maybe the dataset indices should be reduced by the maximum possible forecast length?
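
A sketch of the suggestion; len_dataset and num_usable_samples are hypothetical names standing in for however the sampler derives its index range:

    # Size the usable index range by the worst-case horizon instead of
    # whatever happens to be first in the (possibly unsorted) list.
    max_forecast_steps = max(self.list_num_forecast_steps)
    num_usable_samples = len_dataset - max_forecast_steps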

Collaborator (Author):

I didn't rethink the logic here but just moved things to more appropriate places in the function. I think fixing/checking this should be a separate PR.

github-project-automation bot moved this to In Progress in WeatherGen-dev (Jan 26, 2026)
@Jubeku (Contributor) left a comment:

output_idxs was not only renamed from forecast_idxs but also the definition changed so that it always contains all output indices, not only in the case for forecasting.
len(output_idxs) == batch.get_output_len(), unless forecast.offset == 1, in which case len(output_idxs) == batch.get_output_len() - 1.
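
An illustrative assertion capturing the invariant described above (not actual project code):

    expected_len = batch.get_output_len() - (1 if forecast.offset == 1 else 0)
    assert len(output_idxs) == expected_len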

register_tokens=z[:, : self.register_token_idx],
class_token=z[:, self.register_token_idx : self.class_token_idx],
patch_tokens=z[:, self.class_token_idx :],
register_tokens=z[:, self.register_token_idxs] if z is not None else z,
Contributor:

Maybe [...] if z is not None else None is more readable?
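
The suggested form, applied to the new line above:

    register_tokens=z[:, self.register_token_idxs] if z is not None else None,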

for step in range(batch.get_output_len()):
    # apply forecasting engine
    if self.forecast_engine:
        tokens = self.forecast_engine(tokens, step)
Contributor:

Do you always want to call forecast_engine already at step 0, and overwrite the tokens at step 1, when your first output_idx is 1 (i.e. your first forecast step)?

Collaborator (Author):

If output_idxs[0] == 1 and self.forecast_engine is not None, then this is the correct behavior. If output_idxs[0] == 0 and self.forecast_engine is None, we also obtain the correct behavior. We could enforce that output_idxs[0] == 1 with self.forecast_engine is None is an invalid config option.
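
A sketch of the config guard the author proposes (placement and message are illustrative):

    # Reject the invalid combination: a forecast-only first output step
    # requires a forecast engine to produce it.
    if output_idxs[0] == 1 and self.forecast_engine is None:
        raise ValueError("output_idxs starts at 1 but no forecast_engine is configured")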


tokens = self.forecast_engine(tokens, fstep)
# save latent prediction
tokens_pre_norm = self.latent_pre_norm(tokens) if step == 0 else None
Contributor:

Are tokens_pre_norm the tokens after the pre-model LayerNorm?

Collaborator (Author):

tokens_pre_norm is what was previously called z. But yes, the name is wrong; it should be tokens_post_norm.

output = ModelOutput(batch.get_output_len())

tokens, posteriors = self.encoder(model_params, batch)
output.add_latent_prediction(0, "posteriors", posteriors)
Contributor:

This will be overwritten when output_idxs contains 0. But maybe that's intended?

Collaborator (Author):

No, posteriors is only written here (latents is set in predict_latent()).

# collect all targets, concatenating across batch dimension since this is also how it
# happens for predictions in the model
timestep_idxs = [0] if len(forecast_idxs) == 0 else forecast_idxs
timestep_idxs = [0] if len(output_idxs) == 0 else output_idxs
Contributor:

That's not needed anymore if we always enforce len(output_idxs) > 0 (see the assertion in line 102).
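
Under that assumption the fallback collapses (a sketch; the assertion is the one the reviewer references in the reviewed file):

    assert len(output_idxs) > 0
    # The [0]-fallback then becomes unnecessary:
    timestep_idxs = output_idxs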

timestep_idxs = [0] if len(output_idxs) == 0 else output_idxs
for stream_name in stream_names:
    # collect targets for all forecast steps
    for t_idx in timestep_idxs:
Contributor:

We can directly loop over output_idxs.
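
The suggested simplification (a sketch, assuming len(output_idxs) > 0 is enforced upstream):

    for stream_name in stream_names:
        # collect targets for all forecast steps
        for t_idx in output_idxs:
            ...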

@Jubeku (Contributor) commented Jan 26, 2026

Running inference I got:

num_forecast_steps at mini_epoch=0 : 6
  0%|                                                                                                                                           | 0/16 [00:20<?, ?it/s]
Traceback (most recent call last):
  File "/users/jkuehner/CODE/WeatherGenerator/src/weathergen/run_train.py", line 78, in inference_from_args
    trainer.inference(cf, devices, args.from_run_id, args.mini_epoch)
  File "/users/jkuehner/CODE/WeatherGenerator/src/weathergen/train/trainer.py", line 213, in inference
    self.validate(0, self.test_cfg, self.batch_size_test_per_gpu)
  File "/users/jkuehner/CODE/WeatherGenerator/src/weathergen/train/trainer.py", line 535, in validate
    preds = self.model(
            ^^^^^^^^^^^
  File "/users/jkuehner/CODE/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/jkuehner/CODE/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/jkuehner/CODE/WeatherGenerator/src/weathergen/model/model.py", line 590, in forward
    output = self.predict_decoders(model_params, step, tokens, batch, output)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/jkuehner/CODE/WeatherGenerator/src/weathergen/model/model.py", line 649, in predict_decoders
    tokens = tokens[:, (self.aux_token_idxs[-1] + 1) :]
                        ~~~~~~~~~~~~~~~~~~~^^^^
IndexError: list index out of range
[6] > /users/jkuehner/CODE/WeatherGenerator/src/weathergen/model/model.py(649)predict_decoders()
-> tokens = tokens[:, (self.aux_token_idxs[-1] + 1) :]

Command:
uv run inference --from_run_id=eu7fp1nm --samples=16 --options training_config.forecast.num_steps=6 zarr_store=zip

@Jubeku (Contributor) commented Jan 26, 2026

...it's because self.aux_token_idxs = [] is empty, so self.aux_token_idxs[-1] throws an IndexError.
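
A minimal sketch of a guard for the failing line in predict_decoders (treating an empty aux_token_idxs as "no auxiliary tokens to strip" is an assumption, not a confirmed fix):

    # Hypothetical guard: only strip auxiliary tokens when any exist.
    if len(self.aux_token_idxs) > 0:
        tokens = tokens[:, (self.aux_token_idxs[-1] + 1) :]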
