Description
Hi everyone,
I noticed something that I think is worth mentioning as I don't see it in the doc, although I am not sure this should lead to a modification of the library.
It is very important to specify the input_ds_chunks parameter when calling load_xorca_dataset. Otherwise, reading the data can trigger an unreasonably large number of tasks (many of them open_dataset and rechunking operations), which drastically increases the computation time. My take is that it is best to specify the chunks actually used in the netCDF file (I am not sure how this works out for contiguous storage). Specifying the same chunks as target_ds_chunks might also be an option, depending on whether reading or rechunking/transferring data between workers dominates the computation time.
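For illustration, here is a minimal sketch of how the on-disk chunk sizes could be collected and turned into a dict suitable for input_ds_chunks. It assumes the files are opened with xarray and that the backend reports chunking in each variable's `encoding["chunksizes"]` (which is `None` for contiguous storage). The helper names `chunks_from_encoding` and `on_disk_chunks`, the dimension names, and the file path are made up for this example and are not part of xorca.

```python
def chunks_from_encoding(dims, chunksizes):
    """Map dimension names to on-disk chunk sizes for one variable.

    `chunksizes` is what xarray stores in `var.encoding.get("chunksizes")`.
    It is None for contiguous storage, in which case nothing is returned
    and the choice of chunks is left to the caller.
    """
    if chunksizes is None:
        return {}
    return dict(zip(dims, chunksizes))


def on_disk_chunks(path):
    """Collect on-disk chunk sizes for all data variables of a netCDF file.

    Hypothetical helper: if variables disagree on the chunk size of a
    dimension, the first value seen is kept.
    """
    import xarray as xr  # imported lazily; only needed when reading a file

    chunks = {}
    with xr.open_dataset(path) as ds:
        for var in ds.data_vars.values():
            per_var = chunks_from_encoding(
                var.dims, var.encoding.get("chunksizes")
            )
            for dim, size in per_var.items():
                chunks.setdefault(dim, size)
    return chunks


# Example with typical NEMO dimension names (assumed, adapt to your files):
example = chunks_from_encoding(
    ("time_counter", "deptht", "y", "x"), (1, 46, 100, 100)
)
print(example)

# Intended use (not executed here; file name is a placeholder):
#   from xorca.lib import load_xorca_dataset
#   input_chunks = on_disk_chunks("my_grid_T_file.nc")
#   ds = load_xorca_dataset(data_files=data_files, aux_files=aux_files,
#                           input_ds_chunks=input_chunks,
#                           target_ds_chunks=target_chunks)
```

For contiguous files the sketch returns an empty dict, so some explicit choice (for example the same values as target_ds_chunks) would still be needed there.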
I attach a PDF of my notebook, in which I tested this on a small subdomain of a larger simulation on my laptop. It shows that computing a mean, which takes only 1 s when input_ds_chunks is specified, goes up to 1 minute when input_ds_chunks is left blank.