Skip to content

feat: automatic data nodes creation based on dataset entries#968

Open
icedoom888 wants to merge 84 commits intomainfrom
feat/auto_data_graph_creation
Open

feat: automatic data nodes creation based on dataset entries#968
icedoom888 wants to merge 84 commits intomainfrom
feat/auto_data_graph_creation

Conversation

@icedoom888
Copy link
Copy Markdown
Contributor

@icedoom888 icedoom888 commented Mar 11, 2026

Description

This PR tackles automatric data nodes creation based on dataset created in the dataloader.

What problem does this change solve?

Reduces redundancy avoiding multiple specifications of the same datasets.
Realtes to #966

As a contributor to the Anemoi framework, please ensure that your changes include unit tests, updates to any affected dependencies and documentation, and have been tested in a parallel setting (i.e., with multiple GPUs). As a reviewer, you are also responsible for verifying these aspects and requesting changes if they are not adequately addressed. For guidelines about those please refer to https://anemoi.readthedocs.io/en/latest/

By opening this pull request, I affirm that all authors agree to the Contributor License Agreement.


📚 Documentation preview 📚: https://anemoi-training--968.org.readthedocs.build/en/968/


📚 Documentation preview 📚: https://anemoi-graphs--968.org.readthedocs.build/en/968/


📚 Documentation preview 📚: https://anemoi-models--968.org.readthedocs.build/en/968/

@icedoom888 icedoom888 added the ATS Approval Not Needed No approval needed by ATS label Mar 12, 2026
@github-actions github-actions bot added the bug Something isn't working label Mar 12, 2026
@icedoom888 icedoom888 added enhancement New feature or request config Affects the config files to training contributor dependencies and removed bug Something isn't working labels Mar 12, 2026
@github-actions github-actions bot added the bug Something isn't working label Mar 12, 2026
@mpvginde
Copy link
Copy Markdown
Contributor

Hi @icedoom888, one minor comment is that now (unless I looked over it) there no longer an example graph config that would allow you to do stand-alone graph building.

Copy link
Copy Markdown
Contributor

@MeraX MeraX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Alberto, thank you for this contribution. I think it helps simplifying the config. At the same time it makes the discoverability of alternative node builders from the configs alone harder, but I think that should be acceptable as long as the note builders are kept in the documentation.

I've put a few comments regarding the maintainability and user-friendliness, that I', happy to discuss.

from pytorch_lightning.utilities.rank_zero import rank_zero_only
from torch_geometric.data import HeteroData

from anemoi.graphs.utils.config import integrate_data_nodes_in_config
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
from anemoi.graphs.utils.config import integrate_data_nodes_in_config
from anemoi.models.utils.config import get_multiple_datasets_config

Is it possible to re-use the existent copy of this function?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally this was done to remove dependency of graphs from models!

Comment on lines +18 to +24
DEFAULT_NODE_ATTR = {
"area_weight": {
"_target_": "anemoi.graphs.nodes.attributes.SphericalAreaWeights",
"norm": "unit-max",
"fill_value": 0,
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR also introduces yet another place where defaults are stored (DEFAULT_NODE_ATTR). Are these defaults actually necessary or would it be possible to set these defaults simply as the default python argument values of anemoi.graphs.nodes.AnemoiDatasetNodes?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would agree, these global variables seem unecessary

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the time being "area_weight" is a necessary attribute for any data nodes, this isn't enforced in the code, and test will fail if not provided.
I could try and find another workaround by enforcing creation somewhere else, but not sure where. Suggestions?

return OmegaConf.create({default_dataset_name: config})


def integrate_data_nodes_in_config(config: DictConfig):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how it's handled elsewhere in the code, but I would suggest noting here that config is actually modified. Thus I would also recommend considering returning nothing in this function.

graph_filename = None

# Introduce Data Nodes in graph config
self.config = integrate_data_nodes_in_config(self.config)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you manipulating the global config object here? Which version of the config is logged to MLflow?

Did you consider adding a flag to the configuration that explicitly enables the automatic generation of AnemoiDatasetNodes nodes? I think this would help catching config errors where the the user specification of custom note builders is dismissed due to a typo.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good point @MeraX, will look into this!

@JPXKQX
Copy link
Copy Markdown
Member

JPXKQX commented Mar 18, 2026

Hi Alberto, thanks for this pull request. I think this work is in line with the direction we agreed with anemoi-core for the coming months, but I'm not sure if we should wait for the other pieces discussed in the discussion.

I think removing the graph dependency from the tasks should be done first. Once this has been done, we can remove the new default node attributes, as these will be moved to a standalone scaler. At that point, I would expect the code to use the latlons from the dataset object directly. The current change attempts to solve the problem of updating your graph configuration when you have a new dataset, but I think this would still be problematic because you would need to specify the edges.

I think getting the coordinates from the dataset is what we want, but I would wait a little while longer to do this directly from the dataset object rather than populating the graph config

@dietervdb-meteo
Copy link
Copy Markdown
Contributor

To be very honest I find this idea a bit strange... Doesn't this reverse the logic? Why not base the dataloader config on that of the graph??
We can build graphs on their own, not dataloaders (so far). The graph part of the training config now becomes completely useless as a config to just build a graph with. It looks quite strange, in edge builders things like 'era5' and 'cerra' show up but there is no trace of them anymore in the nodes part.

@anaprietonem
Copy link
Copy Markdown
Contributor

Are there any updates from this @icedoom888? Based on the comments is this feature something we still need or consider merging? To flag that there is some overlap with the work Simon started in #855 and it would be good to get that one in first, as it's already in advanced review state.

@icedoom888
Copy link
Copy Markdown
Contributor Author

Hey @anaprietonem, sorry I got sidetracked previous weeks and didn't have time to look into this.
Reading the previous discussions, i'm not completely sure we all agree on a solution for the upcoming changes.

  • This PR as is: introduces an in-place config change that grabs data nodes information from the declared datasets.
  • Mario's suggestion would be to deprecate completely anemoi.graphs.nodes.AnemoiDatasetNodes and generating datasets from the dataloader's lat-lon coordinates.

At this point i'm not sure how to proceed.
#855 seems to be quite unrelated but somehow acting on graph configs too?

It's quite hard to divide, parallelize work and try to help out if the roadmap is not clear!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ATS Approval Not Needed No approval needed by ATS bug Something isn't working config Affects the config files to training contributor dependencies enhancement New feature or request graphs models training

Projects

Status: To be triaged

Development

Successfully merging this pull request may close these issues.

7 participants