Conversation
|
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
GL-S
left a comment
There was a problem hiding this comment.
Reviewed and provided feedback in the past few days while the improvements were added.
The function is now better performing and allows for more flexibility regarding what information to include in the output.
The notebook on how to use the updated function is very detailed and easy to follow at the same time, very useful in getting familiar with collect_training_data
|
Seems like the testing is failing for a reason unrelated to these changes:
For the |
There was a problem hiding this comment.
Hey @cbur24, this is an incredible piece of work! The updated function is fantastic, and should be so much easier to use.
I have pushed a few changes to clear up the tests by ignoring several problematic notebooks (they work fine on Sandbox, but return 403 errors in the tests). I think most of the odc-loader issues should already be fixed externally.
Only two suggested changes to the new notebook:
- There's a few cells that need Black etc formatting
- We've been trying to remove use of
rio_slurp_xarrayas it requiresdatacube. In this case you're already using datacube so it's not a major issue, but it would be nice to replace it withload_reprojectif you can: it also exists to load small sections of a larger COG, but is fully contained withindea-tools. See example here: https://github.com/GeoscienceAustralia/dea-notebooks/blob/refine_collect_td/How_to_guides/Downloading_data_with_STAC.ipynb and here: https://knowledge.dea.ga.gov.au/notebooks/Tools/gen/dea_tools.datahandling/#dea_tools.datahandling.load_reproject
|
Many thanks for reviewing, @robbibt! On the use of |
|
@cbur24 Ah, I didn't know that - in that case, it's probably fine either way (maybe a note explaining that there are different options would be good). I am offline today but I realise I broke the tests yesterday with my change... seems like a simple formatting thing. Will have a look next week if it's still an issue! |
Proposed changes
The
collect_training_datafunction has been refactored to improve error handling, fix several bugs, and enhance the documentation. Additionally, the multiprocessing has switched from usingmp.Pool.apply_asynctomp.Pool.imapto ensure results are returned in the order they are input, as well as to increase the overall robustness of the multiprocessing.The function now returns a
pandas.DataFrameinstead of the previous NumPy array plus separate list of column names. This introduces a breaking change, so the ML notebooks have been updated accordingly. Minor improvements were also made to those notebooks during this process.Another breaking change is the
time_deltaparameter has been removed. To provide custom time-ranges for each row in the input, the user now needs to provide a column in the inputgdfthat contains time(ranges). These can be in any format thatdatacube.loadaccepts. Overall, this makes the function more flexible.Its now also possible to return time series from the
feature_func, where previously it was enforced that only 2D (x,y) datasets could be returned. To faciliate understanding what has been returned, a new paramtersreturn_time_coordshas been introduced where the resulting dataframe will contain a new column with time-stamps for each sample.I've also added a "how-to-guide" notebook to the repo to make more visible the fuctionality of
collect_training_data, which is quite flexible and helpful for any task that requires loading many small but sparse data loads, not only training/validation data collection.A small test suite has been added under
Tests/test_classification.pyIf reviewing, I suggest focusing on the
classification.py,test_classification.py, and theCollect_training_and_validation_data.ipynbfiles, changes to other files are minor.Checklist
If this is a notebook, then have you:
Load packagesGeneral advice)jupyterlab_code_formattertool can be used to format code cells to a consistent style: select each code cell, then clickEditand then one of theApply X Formatteroptions (YAPForBlackare recommended).Notebook currently compatible withline below the notebook title to reflect the environments the notebook is compatible with