Skip to content

Conversation

@ankitpatnala
Copy link
Contributor

@ankitpatnala ankitpatnala commented Jan 21, 2026

Description

We have indexing issue if step_hr > len_hr. We have repetitive index.
Before the index was based on step_hr. Now, in this PR, it has been change to DTRange.

This change introduces how we get data from datareader

Issue Number

Closes #1659

Is this PR a draft? Mark it as draft.

Still need to test

Need to adapt the change to other datareader type ("FESOM", "Obs")
Failed while using num_input_steps : 2

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings.
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a hegdedoc in the github issue with all the configurations and runs for this experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

@ankitpatnala
Copy link
Contributor Author

@clessig I just created a draft PR for forecast time index issue.
I made it a draft PR as I still need to address 2 things (mentioned in the comment)

@ankitpatnala ankitpatnala changed the title Ankit issue1659 forecast time index issue by using timerange instead of index created by step_hr Issue1659 forecast time index issue by using timerange instead of index created by step_hr Jan 21, 2026
Copy link
Collaborator

@clessig clessig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please have a look at my comment on _get(). Why should/could the index change there?


@override
def _get(self, idx: TIndex, channels_idx: list[int]) -> ReaderData:
def _get(self, t_range: DTRange, channels_idx: list[int]) -> ReaderData:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The _get() function is derived and defined in the base class--we cannot just change it here. It's signature is also something that has been thought about, so any change would have to be discussed first. self._get_dataset_idxs(idx) also takes a TIndex as argument and not a DTRange:

def _get_dataset_idxs(self, idx: TIndex) -> tuple[NDArray[np.int64], DTRange]:

Copy link
Contributor Author

@ankitpatnala ankitpatnala Jan 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Working directly on indices based on step_hr makes it difficult to obtain the right window.
We have to index them based on forecast_hr then and sampling with a frequency ($step\_hr</code> / <code>forecast\_hr$)

Copy link
Collaborator

@clessig clessig Jan 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We specifically abstracted all the functionality to map from time windows to the indices for each (time-stepped) dataset here:

def get_dataset_indexes_timestep(

Currently I don't see a reason to change this.

Copy link
Contributor Author

@ankitpatnala ankitpatnala Jan 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dtr is picked from index at this line.

with step_hr = 24:00:00, the consecutive indices are now 24h apart

def window(self, idx: TIndex) -> DTRange:
"""
Temporal window corresponding to index
Parameters
----------
idx :
index of temporal window
Returns
-------
start and end of temporal window
"""
t_start_win = self.t_start + self.t_window_step * idx
t_end_win = t_start_win + self.t_window_len
return DTRange(t_start_win, t_end_win)

as len_hr is still 06:00:00 it looks for time range within next 6 hours. So some time_windows are not mapped to any indices in such case.

index 1 : day1 00:00:00 - 06:00:00
index 2:  day2  00:00:00 - 06:00:00

Alternate can be we need to change TIndex from np.int to float type to look for right window

t_start_win = self.t_start + self.t_window_step * idx

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dtr is picked from index at this line.

with step_hr = 24:00:00, the consecutive indices are now 24h apart

def window(self, idx: TIndex) -> DTRange:
"""
Temporal window corresponding to index
Parameters
----------
idx :
index of temporal window
Returns
-------
start and end of temporal window
"""
t_start_win = self.t_start + self.t_window_step * idx
t_end_win = t_start_win + self.t_window_len
return DTRange(t_start_win, t_end_win)

as len_hr is still 06:00:00 it looks for time range within next 6 hours. So some time_windows are not mapped to any indices in such case.

index 1 : day1 00:00:00 - 06:00:00
index 2:  day2  00:00:00 - 06:00:00

Alternate can be we need to change TIndex from np.int to float type to look for right window

t_start_win = self.t_start + self.t_window_step * idx

If step_hrs=24h and len_hrs=6:00 then this is the correct behaviour:

index 1 : day1 00:00:00 - 06:00:00
index 2: day2 00:00:00 - 06:00:00

Copy link
Contributor Author

@ankitpatnala ankitpatnala Jan 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

index1 + 1 forecast_step shouldn't it give us : day1 06:00:00 - 12:00:00 ?
there is no index mapping to this time_window then.

step_forecast_dt = idx + (self.forecast_delta_dt * fstep) // self.step_timedelta
time_win_target = self.time_window_handler.window(step_forecast_dt)

output_data = []
for fstep in range(self.forecast_offset, self.forecast_offset + forecast_dt + 1):
step_forecast_dt = base_idx + (self.forecast_delta_dt * fstep) // self.step_timedelta
rdata = collect_datasources(stream_ds, step_forecast_dt, "target")

Forecast data is also obtained by index.

Copy link
Contributor Author

@ankitpatnala ankitpatnala Jan 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def collect_datasources(stream_datasets: list, idx: int, type: str) -> IOReaderData:
"""
Utility function to collect all sources / targets from streams list
"""
rdatas = []
for ds in stream_datasets:
if type == "source":
get_reader_data = ds.get_source
normalize_channels = ds.normalize_source_channels
elif type == "target":
get_reader_data = ds.get_target
normalize_channels = ds.normalize_target_channels
else:
assert False, "invalid value for argument `type`"
# get source (of potentially multi-step length)
rdata = get_reader_data(idx).remove_nan_coords()
rdata.data = normalize_channels(rdata.data)
rdata.geoinfos = ds.normalize_geoinfos(rdata.geoinfos)
rdatas += [rdata]

the index from this is then called to ds.get_target which does not find forecast time_window

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

Forecast steps time index error when step_hr exceeds len_hr length

2 participants