critical impact of specifying ds_chunks #45

@NoeLahaye

Hi everyone,

I noticed something that I think is worth mentioning because I don't see it in the docs, although I am not sure it should lead to a change in the library.

It is very important to specify the input_ds_chunks parameter when calling load_xorca_dataset. Otherwise, reading the data can trigger an unreasonably large number of tasks, many of them open_dataset and rechunking operations, which drastically increases the computation time. My take is that it is best to specify the actual chunks used in the netCDF file (I am not sure how this works out for contiguous storage), but using the same values as target_ds_chunks might also be an option, depending on whether reading the data or rechunking/transferring data between workers dominates the computation time.
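To make the point concrete, here is a minimal sketch of why the chunking matters and how both arguments could be passed explicitly. The dimension sizes, chunk dictionaries, and the helper names count_blocks and load_with_explicit_chunks are all hypothetical and only for illustration; the keyword arguments are the ones named above, but the exact call should be checked against the xorca version in use.

```python
import math


def count_blocks(dim_sizes, chunks):
    """Return how many blocks one variable is split into for a given chunking.

    Dimensions missing from `chunks` are treated as a single block.
    """
    n_blocks = 1
    for dim, size in dim_sizes.items():
        n_blocks *= math.ceil(size / chunks.get(dim, size))
    return n_blocks


# Hypothetical dimension sizes of one 4D variable in the input files.
dim_sizes = {"time_counter": 73, "deptht": 46, "y": 1021, "x": 1442}

# Hypothetical chunkings: one assumed to match the on-disk netCDF chunks,
# and one that is much finer along depth.
file_chunks = {"time_counter": 1, "deptht": 46, "y": 1021, "x": 1442}
fine_chunks = {"time_counter": 1, "deptht": 1}

blocks_file = count_blocks(dim_sizes, file_chunks)  # 73 blocks
blocks_fine = count_blocks(dim_sizes, fine_chunks)  # 73 * 46 = 3358 blocks


def load_with_explicit_chunks(data_files, aux_files, input_chunks, target_chunks):
    """Call xorca with both chunking arguments set explicitly (assumed usage)."""
    from xorca.lib import load_xorca_dataset  # lazy import; requires xorca

    return load_xorca_dataset(
        data_files=data_files,
        aux_files=aux_files,
        input_ds_chunks=input_chunks,
        target_ds_chunks=target_chunks,
    )
```

Each block becomes at least one task per variable, so the finer chunking multiplies the size of the task graph before any rechunking to target_ds_chunks is even counted. The on-disk chunk sizes can be inspected with `ncdump -hs`, which lists them as the `_ChunkSizes` special attribute.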

I attach a PDF of my notebook, where I tested this on a small subdomain of a larger simulation on my laptop. It shows that computing a mean takes only 1 s when input_ds_chunks is specified, but up to 1 minute when input_ds_chunks is left blank.
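For anyone who wants to reproduce this kind of comparison, a small timing helper along these lines might be enough. This is only a sketch: time_call is a hypothetical helper, and the workload shown is a stand-in, not the computation from the notebook.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload. In the real comparison this would be the mean computed
# on each of the two datasets, e.g. `lambda: ds.some_variable.mean().compute()`
# where `some_variable` is a placeholder name.
result, seconds = time_call(sum, range(1_000_000))
print(f"elapsed: {seconds:.3f} s")
```

Timing the same computation once per dataset (with and without input_ds_chunks) on a fresh kernel should avoid caching effects that would make the second run look faster.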
