feat: automatic data nodes creation based on dataset entries#968
feat: automatic data nodes creation based on dataset entries#968icedoom888 wants to merge 84 commits intomainfrom
Conversation
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
…oi-core into feat/auto_data_graph_creation
for more information, see https://pre-commit.ci
…oi-core into feat/auto_data_graph_creation
|
Hi @icedoom888, one minor comment is that now (unless I looked over it) there no longer an example graph config that would allow you to do stand-alone graph building. |
MeraX
left a comment
There was a problem hiding this comment.
Hi Alberto, thank you for this contribution. I think it helps simplifying the config. At the same time it makes the discoverability of alternative node builders from the configs alone harder, but I think that should be acceptable as long as the note builders are kept in the documentation.
I've put a few comments regarding the maintainability and user-friendliness, that I', happy to discuss.
| from pytorch_lightning.utilities.rank_zero import rank_zero_only | ||
| from torch_geometric.data import HeteroData | ||
|
|
||
| from anemoi.graphs.utils.config import integrate_data_nodes_in_config |
There was a problem hiding this comment.
| from anemoi.graphs.utils.config import integrate_data_nodes_in_config | |
| from anemoi.models.utils.config import get_multiple_datasets_config |
Is it possible to re-use the existent copy of this function?
There was a problem hiding this comment.
Ideally this was done to remove dependency of graphs from models!
| DEFAULT_NODE_ATTR = { | ||
| "area_weight": { | ||
| "_target_": "anemoi.graphs.nodes.attributes.SphericalAreaWeights", | ||
| "norm": "unit-max", | ||
| "fill_value": 0, | ||
| } | ||
| } |
There was a problem hiding this comment.
This PR also introduces yet another place where defaults are stored (DEFAULT_NODE_ATTR). Are these defaults actually necessary or would it be possible to set these defaults simply as the default python argument values of anemoi.graphs.nodes.AnemoiDatasetNodes?
There was a problem hiding this comment.
I would agree, these global variables seem unecessary
There was a problem hiding this comment.
At the time being "area_weight" is a necessary attribute for any data nodes, this isn't enforced in the code, and test will fail if not provided.
I could try and find another workaround by enforcing creation somewhere else, but not sure where. Suggestions?
| return OmegaConf.create({default_dataset_name: config}) | ||
|
|
||
|
|
||
| def integrate_data_nodes_in_config(config: DictConfig): |
There was a problem hiding this comment.
I'm not sure how it's handled elsewhere in the code, but I would suggest noting here that config is actually modified. Thus I would also recommend considering returning nothing in this function.
| graph_filename = None | ||
|
|
||
| # Introduce Data Nodes in graph config | ||
| self.config = integrate_data_nodes_in_config(self.config) |
There was a problem hiding this comment.
Are you manipulating the global config object here? Which version of the config is logged to MLflow?
Did you consider adding a flag to the configuration that explicitly enables the automatic generation of AnemoiDatasetNodes nodes? I think this would help catching config errors where the the user specification of custom note builders is dismissed due to a typo.
There was a problem hiding this comment.
Very good point @MeraX, will look into this!
|
Hi Alberto, thanks for this pull request. I think this work is in line with the direction we agreed with anemoi-core for the coming months, but I'm not sure if we should wait for the other pieces discussed in the discussion. I think removing the graph dependency from the tasks should be done first. Once this has been done, we can remove the new default node attributes, as these will be moved to a standalone scaler. At that point, I would expect the code to use the latlons from the dataset object directly. The current change attempts to solve the problem of updating your graph configuration when you have a new dataset, but I think this would still be problematic because you would need to specify the edges. I think getting the coordinates from the dataset is what we want, but I would wait a little while longer to do this directly from the dataset object rather than populating the graph config |
|
To be very honest I find this idea a bit strange... Doesn't this reverse the logic? Why not base the dataloader config on that of the graph?? |
|
Are there any updates from this @icedoom888? Based on the comments is this feature something we still need or consider merging? To flag that there is some overlap with the work Simon started in #855 and it would be good to get that one in first, as it's already in advanced review state. |
|
Hey @anaprietonem, sorry I got sidetracked previous weeks and didn't have time to look into this.
At this point i'm not sure how to proceed. It's quite hard to divide, parallelize work and try to help out if the roadmap is not clear! |
Description
This PR tackles automatric data nodes creation based on dataset created in the dataloader.
What problem does this change solve?
Reduces redundancy avoiding multiple specifications of the same datasets.
Realtes to #966
As a contributor to the Anemoi framework, please ensure that your changes include unit tests, updates to any affected dependencies and documentation, and have been tested in a parallel setting (i.e., with multiple GPUs). As a reviewer, you are also responsible for verifying these aspects and requesting changes if they are not adequately addressed. For guidelines about those please refer to https://anemoi.readthedocs.io/en/latest/
By opening this pull request, I affirm that all authors agree to the Contributor License Agreement.
📚 Documentation preview 📚: https://anemoi-training--968.org.readthedocs.build/en/968/
📚 Documentation preview 📚: https://anemoi-graphs--968.org.readthedocs.build/en/968/
📚 Documentation preview 📚: https://anemoi-models--968.org.readthedocs.build/en/968/