Refine/improve collect_training_data by cbur24 · Pull Request #1474 · GeoscienceAustralia/dea-notebooks

cbur24 · 2026-03-03T03:34:57Z

Proposed changes

The collect_training_data function has been refactored to improve error handling, fix several bugs, and improve the documentation. The multiprocessing has also switched from mp.Pool.apply_async to mp.Pool.imap, which returns results in the same order as the inputs and makes the multiprocessing more robust overall.
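The ordering difference between the two pool methods can be sketched as below. This is a minimal illustration, not the code in classification.py: slow_square is a hypothetical stand-in for the per-sample work, and ThreadPool is used only because it shares the mp.Pool interface and keeps the sketch lightweight.

```python
import time
from multiprocessing.pool import ThreadPool  # same interface as mp.Pool


def slow_square(x):
    # Hypothetical worker: later inputs sleep less, so they tend to finish
    # first and completion order differs from input order.
    time.sleep(0.05 * (4 - x))
    return x * x


inputs = [1, 2, 3, 4]

# imap yields results in input order, regardless of which task finishes first.
with ThreadPool(4) as pool:
    ordered = list(pool.imap(slow_square, inputs))

# apply_async with a callback collects results in completion order,
# which is not guaranteed to match the input order.
completed = []
with ThreadPool(4) as pool:
    handles = [
        pool.apply_async(slow_square, (x,), callback=completed.append)
        for x in inputs
    ]
    for handle in handles:
        handle.wait()

print("imap order:       ", ordered)
print("completion order: ", completed)
```

With apply_async, the caller has to track which result belongs to which input; with imap, position in the output is enough.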

The function now returns a pandas.DataFrame instead of a NumPy array plus a separate list of column names. This is a breaking change, so the ML notebooks have been updated accordingly. Minor improvements were also made to those notebooks along the way.
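For downstream code, the practical effect might look like the sketch below. The column names and values are made up for illustration, and the first column being the class label is an assumption, not something this PR states.

```python
import numpy as np
import pandas as pd

# Hypothetical old-style return values: a list of names and a 2D array.
column_names = ["class", "red", "nir"]
model_input = np.array(
    [
        [1.0, 0.12, 0.45],
        [0.0, 0.30, 0.28],
    ]
)

# The same information as a single DataFrame.
df = pd.DataFrame(model_input, columns=column_names)

# Labels and features can now be selected by name rather than by position.
y = df["class"].to_numpy()
X = df.drop(columns="class").to_numpy()
```

Code that previously indexed the array by column position would need to be updated to select columns from the DataFrame instead.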

Another breaking change is that the time_delta parameter has been removed. To provide a custom time range for each row in the input, the user now needs to include a column in the input gdf that contains the times or time ranges. These can be in any format that datacube.load accepts. Overall, this makes the function more flexible.
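One way to build such per-row time ranges is sketched below. It assumes a symmetric window around each sample date, which is an illustrative choice rather than a description of how time_delta behaved; window_around and the sample dates are hypothetical.

```python
from datetime import date, timedelta


def window_around(day, delta):
    """Return a (start, end) pair of ISO date strings centred on `day`."""
    return ((day - delta).isoformat(), (day + delta).isoformat())


# Hypothetical sample dates, one per row of the input gdf.
sample_dates = [date(2020, 1, 15), date(2021, 6, 1)]

# One (start, end) tuple per row; a tuple of date strings is one of the
# forms accepted for the time argument of datacube.load.
time_ranges = [window_around(d, timedelta(days=10)) for d in sample_dates]
```

The resulting list could then be assigned to a column of the input gdf, and that column used as the source of the time query for each row.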

It's now also possible to return time series from the feature_func, where previously only 2D (x, y) datasets could be returned. To make it easier to understand what has been returned, a new parameter, return_time_coords, has been introduced; when it is used, the resulting dataframe contains a new column with the time-stamp of each sample.

I've also added a "how-to guide" notebook to the repo to make the functionality of collect_training_data more visible. The function is quite flexible and helpful for any task that requires many small, sparse data loads, not only training/validation data collection.

A small test suite has been added under Tests/test_classification.py.

If reviewing, I suggest focusing on classification.py, test_classification.py, and Collect_training_and_validation_data.ipynb; changes to other files are minor.

Checklist

If this is a notebook, then have you:

review-notebook-app · 2026-03-03T03:35:02Z

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

GL-S

Reviewed and provided feedback over the past few days while the improvements were being added.
The function now performs better and allows more flexibility in what information to include in the output.
The notebook on how to use the updated function is very detailed yet easy to follow, and very useful for getting familiar with collect_training_data.

cbur24 · 2026-03-11T23:38:36Z

Seems like the testing is failing for a reason unrelated to these changes:

ImportError: cannot import name 'configure_rio' from 'odc.loader' (unknown location).

For the stable image, which uses datacube v1.8, configure_rio is located in datacube.utils.aws.configure_s3_access
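A version-tolerant import along these lines could be sketched as follows. The import_first helper is hypothetical, and the two candidates may not share a signature, so the caller would still need to check how the returned function should be called.

```python
import importlib


def import_first(candidates):
    """Return the first importable attribute from (module, name) pairs.

    Returns None if none of the candidates can be imported.
    """
    for module_name, attr_name in candidates:
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attr_name)
        except (ImportError, AttributeError):
            # Module missing, or present without the requested name.
            continue
    return None


# Candidate locations taken from the error message and the note above.
configure = import_first(
    [
        ("odc.loader", "configure_rio"),
        ("datacube.utils.aws", "configure_s3_access"),
    ]
)
```

If neither package is installed, configure is None and the caller can skip the configuration step or raise a clearer error.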

robbibt

Hey @cbur24, this is an incredible piece of work! The updated function is fantastic, and should be so much easier to use.

I have pushed a few changes to clear up the tests by ignoring several problematic notebooks (they work fine on Sandbox, but return 403 errors in the tests). I think most of the odc-loader issues should already be fixed externally.

Only two suggested changes to the new notebook:

There are a few cells that need Black etc. formatting
We've been trying to remove use of rio_slurp_xarray as it requires datacube. In this case you're already using datacube so it's not a major issue, but it would be nice to replace it with load_reproject if you can: it also exists to load small sections of a larger COG, but is fully contained within dea-tools. See example here: https://github.com/GeoscienceAustralia/dea-notebooks/blob/refine_collect_td/How_to_guides/Downloading_data_with_STAC.ipynb and here: https://knowledge.dea.ga.gov.au/notebooks/Tools/gen/dea_tools.datahandling/#dea_tools.datahandling.load_reproject

cbur24 · 2026-03-26T23:28:45Z

Many thanks for reviewing, @robbibt! On the use of load_reproject, I've been finding it uses about 40% more memory than rio_slurp_xarray. That is okay for these demonstrator notebooks, but when doing heavier work with lots of CPUs and larger GeoTIFFs (high-res, float), it can become a bottleneck for scaling. However, I can see the argument for not adding more datacube dependencies to the repo. Would it be acceptable to change the code to use load_reproject but leave a markdown comment noting that rio_slurp_xarray may be preferable in some situations?
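A comparison of this kind could be set up roughly as below. This is a sketch under stated assumptions: loader_a and loader_b are hypothetical stand-ins rather than the real load_reproject and rio_slurp_xarray, and tracemalloc only sees allocations made through Python's allocator, so memory used inside native libraries such as GDAL would need a process-level tool instead.

```python
import tracemalloc


def peak_bytes(func, *args, **kwargs):
    """Run func and return (result, peak traced allocation in bytes)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def loader_a(n):
    # Stand-in loader that allocates the output buffer once.
    return bytearray(n)


def loader_b(n):
    # Stand-in loader that builds a temporary buffer before copying it.
    scratch = bytearray(n)
    return bytearray(scratch)


_, peak_a = peak_bytes(loader_a, 1_000_000)
_, peak_b = peak_bytes(loader_b, 1_000_000)
print(f"loader_b peak is {peak_b / peak_a:.2f}x loader_a")
```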

robbibt · 2026-03-26T23:49:19Z

@cbur24 Ah, I didn't know that - in that case, it's probably fine either way (maybe a note explaining that there are different options would be good).

I am offline today but I realise I broke the tests yesterday with my change... seems like a simple formatting thing. Will have a look next week if it's still an issue!

cbur24 added 5 commits March 2, 2026 06:15

refactor collect_training_data for increased robustness, error handling, and performance

6edc06c

begin altering notebooks

053c9ae

notebooks working

1d21e40

fix return docstring

76f3427

fix up wo nb

f64e42f

cbur24 changed the title ~~Refine collect td~~ Refine/improve collect_training_data Mar 3, 2026

cbur24 added 17 commits March 4, 2026 01:12

switch to pool.imap and remove shared lists

7182095

typos

d9737f8

refactor to use pandas throughout

f5b32c3

fix up code comments

036885c

rm unneeded lists

5e4afe4

refactor how time filed works and all user to return time-coords

840d994

fix typo

a213624

rm print

23300cd

draft a notebook explaining collect_training_data

a3dacae

typos

3e89d7f

typos

588d5c8

rm reference to shapefiles

94d6ab3

rm reference to models

2cb18cf

add tests for collect_traing_data

a16e72a

get test suite working

dfa8356

reference to new nb in ML nbs

1dfe3a5

rm chanes to sklearn_flatten

8a43c79

cbur24 marked this pull request as ready for review March 11, 2026 22:05

cbur24 requested review from BexDunn, Kooie-cate, erialC-P, geoscience-aman and robbibt as code owners March 11, 2026 22:05

cbur24 requested review from Ariana-B, GL-S, JM-GA, KimBaldry, LaurenSchenk1, arcisad, caitlinadams, colourmeamused, jennaguffogg, margaretharrison, supermarkion and vnewey as code owners March 11, 2026 22:05

GL-S approved these changes Mar 11, 2026

View reviewed changes

cbur24 and others added 5 commits March 17, 2026 04:18

revert back to apply_async to better handle database reads

f0067df

update notebook after function change

5901247

Update tests to temporarily remove working notebooks with data access issues in tests

c687c0c

Formatting

ccd5815

Documentation update for xr_reproject

40dec95

robbibt requested changes Mar 26, 2026

View reviewed changes

cbur24 added 3 commits March 26, 2026 23:57

minor revisions to nb, rm rio_slurp_xarray dependency

009bc0d

make nb work on stable image

72e9cc6

reorder description markdown

6a5b43f

Refine/improve collect_training_data #1474
cbur24 wants to merge 30 commits into develop from refine_collect_td


3 participants
