Skip to content

Refine/improve collect_training_data#1474

Open
cbur24 wants to merge 30 commits intodevelopfrom
refine_collect_td
Open

Refine/improve collect_training_data#1474
cbur24 wants to merge 30 commits intodevelopfrom
refine_collect_td

Conversation

@cbur24
Copy link
Copy Markdown
Collaborator

@cbur24 cbur24 commented Mar 3, 2026

Proposed changes

The collect_training_data function has been refactored to improve error handling, fix several bugs, and enhance the documentation. Additionally, the multiprocessing has switched from using mp.Pool.apply_async to mp.Pool.imap to ensure results are returned in the order they are input, as well as to increase the overall robustness of the multiprocessing.

The function now returns a pandas.DataFrame instead of the previous NumPy array plus separate list of column names. This introduces a breaking change, so the ML notebooks have been updated accordingly. Minor improvements were also made to those notebooks during this process.

Another breaking change is the time_delta parameter has been removed. To provide custom time-ranges for each row in the input, the user now needs to provide a column in the input gdf that contains time(ranges). These can be in any format that datacube.load accepts. Overall, this makes the function more flexible.

Its now also possible to return time series from the feature_func, where previously it was enforced that only 2D (x,y) datasets could be returned. To faciliate understanding what has been returned, a new paramters return_time_coords has been introduced where the resulting dataframe will contain a new column with time-stamps for each sample.

I've also added a "how-to-guide" notebook to the repo to make more visible the fuctionality of collect_training_data, which is quite flexible and helpful for any task that requires loading many small but sparse data loads, not only training/validation data collection.

A small test suite has been added under Tests/test_classification.py

If reviewing, I suggest focusing on the classification.py, test_classification.py, and the Collect_training_and_validation_data.ipynb files, changes to other files are minor.

Checklist

If this is a notebook, then have you:

  • Checked the structure of the notebook follows our DEA-notebooks template
  • Removed any unused Python packages from Load packages
  • Removed any unused/empty code cells
  • Removed any guidance cells (e.g. General advice)
  • Ensured that all code cells follow the PEP8 standard for code. The jupyterlab_code_formatter tool can be used to format code cells to a consistent style: select each code cell, then click Edit and then one of the Apply X Formatter options (YAPF or Black are recommended).
  • Included relevant tags in the final notebook cell (refer to the DEA Tags Index, and re-use tags if possible)
  • Tested notebook on the DEA Sandbox
  • Cleared all outputs, run notebook from start to finish, and save the notebook in the state where all cells have been sequentially evaluated
  • If applicable, update the Notebook currently compatible with line below the notebook title to reflect the environments the notebook is compatible with
  • Check for any spelling mistakes using the DEA Sandbox's built-in spellchecker (double click on markdown cells then right-click on pink highlighted words). For example:

@review-notebook-app
Copy link
Copy Markdown

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@cbur24 cbur24 changed the title Refine collect td Refine/improve collect_training_data Mar 3, 2026
@cbur24 cbur24 marked this pull request as ready for review March 11, 2026 22:05
Copy link
Copy Markdown
Collaborator

@GL-S GL-S left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed and provided feedback in the past few days while the improvements were added.
The function is now better performing and allows for more flexibility regarding what information to include in the output.
The notebook on how to use the updated function is very detailed and easy to follow at the same time, very useful in getting familiar with collect_training_data

@cbur24
Copy link
Copy Markdown
Collaborator Author

cbur24 commented Mar 11, 2026

Seems like the testing is failing for a reason unrelated to these changes:

ImportError: cannot import name 'configure_rio' from 'odc.loader' (unknown location).

For the stable image which uses datacube v1.8, configure rio is located in datacube.utils.aws.configure_s3_access

Copy link
Copy Markdown
Member

@robbibt robbibt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @cbur24, this is an incredible piece of work! The updated function is fantastic, and should be so much easier to use.

I have pushed a few changes to clear up the tests by ignoring several problematic notebooks (they work fine on Sandbox, but return 403 errors in the tests). I think most of the odc-loader issues should already be fixed externally.

Only two suggested changes to the new notebook:

@cbur24
Copy link
Copy Markdown
Collaborator Author

cbur24 commented Mar 26, 2026

Many thanks for reviewing, @robbibt! On the use of load_reproject, I've been finding it uses about 40 % more memory than rio_slurp_xarray, which is okay for these demonstrator notebooks, but when doing heavier work with lots of CPUs and larger sized geotiffs (high-res, float), it can become a bottleneck for scaling. However, I can see the argument for not adding more datacube dependencies into the repo. Would changing the code to using load_reproject but leaving a markdown comment that rio_slurp_xarray may be preferable in some situations be acceptable?

@robbibt
Copy link
Copy Markdown
Member

robbibt commented Mar 26, 2026

@cbur24 Ah, I didn't know that - in that case, it's probably fine either way (maybe a note explaining that there are different options would be good).

I am offline today but I realise I broke the tests yesterday with my change... seems like a simple formatting thing. Will have a look next week if it's still an issue!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants