
@TillHae (Contributor) commented Jan 20, 2026

Description

This PR changes the SYNOP data reader so that it also works on German SYNOP station data, not just the data from MetNorway. The following bullet points explain all of my changes.

  • Added a height dimension, which affected the base class as well
  • The German data does not require explicit missing-value handling, as missing values are already encoded as np.nan (e.g. ll. 164-166)
  • The German data is structured by time and station_id; the geoinfo has to be flattened to 1-D before use (ll. 105-113)
  • The German data dimensions needed to be transposed, as they are (variables, spatial, time) instead of (variables, time, spatial) (ll. 231-238); see the sketch below
  • I've added a file for testing, which will be removed after the review (the test only works on JUWELS, as the underlying data is only available there)
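
For reviewers, below is a minimal NumPy sketch of the flattening and transpose steps described above; the array names and shapes are illustrative placeholders, not the actual reader code.

```python
import numpy as np

# Illustrative only: shapes and names are made up for this sketch.

# The German geoinfo comes per (time, station_id) and is flattened to 1-D before use.
lat = np.random.rand(4, 10)           # (time, station_id), placeholder values
lat_flat = lat.ravel()                # 1-D, length time * station_id

# The German data arrives as (variables, spatial, time) and is transposed to the
# (variables, time, spatial) layout expected by the base class.
data = np.random.rand(5, 10, 4)       # (variables, spatial, time)
data = np.transpose(data, (0, 2, 1))  # -> (variables, time, spatial)

# Missing values are already np.nan in the German data, so no extra masking is needed.
```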

Issue Number

This PR closes no issue, but is related to #862

Is this PR a draft? Mark it as draft.

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings.
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and I have written the run_id(s) in the comment: launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a HedgeDoc in the GitHub issue with all the configurations and runs for these experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

@TillHae (Contributor, Author) commented Jan 20, 2026

@wael-mika

@TillHae (Contributor, Author) commented Jan 22, 2026

@clessig should I also complete the open TODOs that are mentioned in the comments as part of this PR?

@clessig (Collaborator) commented Jan 22, 2026

> @clessig should I also complete the open TODOs that are mentioned in the comments as part of this PR?

Yes, this would be great! Let me know if you have questions about this.

@TillHae (Contributor, Author) commented Jan 22, 2026

> @clessig should I also complete the open TODOs that are mentioned in the comments as part of this PR?

> Yes, this would be great! Let me know if you have questions about this.

Yes, I don't know what exactly is meant by the caching of the mean and standard deviation. It looks like they are already stored in the object. Do you also want them to be stored in the dataset file for the next time the dataset is read?

@clessig (Collaborator) commented Jan 22, 2026

> @clessig should I also complete the open TODOs that are mentioned in the comments as part of this PR?

> Yes, this would be great! Let me know if you have questions about this.

> Yes, I don't know what exactly is meant by the caching of the mean and standard deviation. It looks like they are already stored in the object. Do you also want them to be stored in the dataset file for the next time the dataset is read?

They are computed on the fly if they are not present in the data file. That's expensive and not an option if the dataset is bigger. We should also not modify the existing data file; it should be considered immutable. We could generate an auxiliary file with the data, but then we need to decide where to store it.

@TillHae (Contributor, Author) commented Jan 23, 2026

> @clessig should I also complete the open TODOs that are mentioned in the comments as part of this PR?

> Yes, this would be great! Let me know if you have questions about this.

> Yes, I don't know what exactly is meant by the caching of the mean and standard deviation. It looks like they are already stored in the object. Do you also want them to be stored in the dataset file for the next time the dataset is read?

> They are computed on the fly if they are not present in the data file. That's expensive and not an option if the dataset is bigger. We should also not modify the existing data file; it should be considered immutable. We could generate an auxiliary file with the data, but then we need to decide where to store it.

It looks like they are always computed on the fly, even when they are present in the data file. I think the best way to handle this is to have a single file storing a hashmap-like structure, with the datasets as keys and their corresponding properties as values. This is the simplest solution to manage and lets us seamlessly append datasets and properties to the file.
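
A minimal sketch of what I have in mind, with made-up file and function names:

```python
import json
from pathlib import Path

# Illustrative only: one cache file mapping dataset identifiers to their properties.
CACHE_FILE = Path("dataset_properties.json")

def load_properties(dataset_id: str):
    """Return the cached properties (e.g. mean/stdev) for a dataset, or None."""
    if not CACHE_FILE.exists():
        return None
    return json.loads(CACHE_FILE.read_text()).get(dataset_id)

def store_properties(dataset_id: str, properties: dict) -> None:
    """Add or update the properties of a dataset without touching its data file."""
    cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
    cache[dataset_id] = properties
    CACHE_FILE.write_text(json.dumps(cache, indent=2))
```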

@clessig (Collaborator) commented Jan 25, 2026

> @clessig should I also complete the open TODOs that are mentioned in the comments as part of this PR?

> Yes, this would be great! Let me know if you have questions about this.

> Yes, I don't know what exactly is meant by the caching of the mean and standard deviation. It looks like they are already stored in the object. Do you also want them to be stored in the dataset file for the next time the dataset is read?

> They are computed on the fly if they are not present in the data file. That's expensive and not an option if the dataset is bigger. We should also not modify the existing data file; it should be considered immutable. We could generate an auxiliary file with the data, but then we need to decide where to store it.

> It looks like they are always computed on the fly, even when they are present in the data file. I think the best way to handle this is to have a single file storing a hashmap-like structure, with the datasets as keys and their corresponding properties as values. This is the simplest solution to manage and lets us seamlessly append datasets and properties to the file.

Having one file is appealing at first sight, but it comes with a lot of question marks: most datasets already contain the information, and we shouldn't duplicate it since we otherwise run the risk that it becomes inconsistent. Having one file for all datasets would also mean we need to copy it to all HPCs whenever something for one dataset changes. This also seems sub-optimal.

My suggestion is to have a stream config argument that specifies which field in the data file contains mean and stdev. If it's not specified or does not exist, then we fall back to computing it.
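
A minimal sketch of how that fallback could look; the config key norm_stats_field and the dataset access below are only illustrative, not the actual stream config or reader API:

```python
import numpy as np

def get_mean_stdev(ds, stream_cfg: dict, var: str):
    """Read mean/stdev from the configured field if present, else compute them."""
    field = stream_cfg.get("norm_stats_field")
    if field is not None and field in ds:
        # statistics are stored in the data file under the configured field
        return ds[field]["mean"][var], ds[field]["stdev"][var]
    # fall back to computing them on the fly
    values = np.asarray(ds[var], dtype=float)
    return float(np.nanmean(values)), float(np.nanstd(values))
```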
