[FEA] Dask Array Support for rsc.pp.scrublet: A Straightforward implementation #388

@MPebworthEpana

Right now, rsc.pp.scrublet doesn't support Dask arrays, but there's a relatively straightforward path to adding that support (at least, from what I know).

Background:

  1. Scrublet only really needs to run within a single sample or batch. That grouping is passed to the function as 'batch_key'.
  2. Batches are typically under 100k cells and samples under 10k cells, so each one can fit within a typical GPU's memory.
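To put a rough number on point 2, here is a back-of-the-envelope sketch of the memory one batch would take as a CSR sparse matrix. The gene count (30,000) and the fraction of non-zero entries (5%) are assumptions picked for illustration, not measurements, and `csr_bytes` is a hypothetical helper, not part of any library. Actual working memory during scrublet will be larger than the input matrix alone.

```python
def csr_bytes(n_cells, n_genes, density, value_bytes=4, index_bytes=4):
    """Approximate size in bytes of a CSR matrix.

    CSR stores one value and one column index per non-zero entry,
    plus a row-pointer array of length n_cells + 1.
    """
    nnz = round(n_cells * n_genes * density)
    return nnz * (value_bytes + index_bytes) + (n_cells + 1) * index_bytes


# Assumed example: 100k cells, 30k genes, 5% non-zero, float32 values, int32 indices.
size = csr_bytes(100_000, 30_000, 0.05)
print(f"{size / 1e9:.2f} GB")
```

Under those assumptions a 100k-cell batch is on the order of a gigabyte, which is consistent with the claim that a single batch fits on one GPU.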

Implementation concept:

  1. Check whether the AnnData object holds a Dask array. If so, require that a batch_key be provided.
  2. Rechunk the Dask array by batch_key, so that there is one Dask array for each batch.
  3. Run scrublet in memory on each GPU (.compute_chunk_sizes())
  4. Save the results in obs as normal.
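A minimal sketch of the four steps above, to make the control flow concrete. This is not the rsc implementation: `scrublet_per_batch` and `score_fn` are hypothetical names, `score_fn` stands in for whatever in-memory scrublet routine would actually be called, and selecting rows by index stands in for the per-batch rechunk. The only thing assumed about a Dask-backed matrix is that it supports row indexing and has a `.compute()` method.

```python
import numpy as np


def scrublet_per_batch(X, batch_labels, score_fn):
    """Score each batch separately and return one score per cell.

    X            : (n_cells, n_genes) matrix, NumPy or Dask-like (lazy).
    batch_labels : per-cell batch labels, or None.
    score_fn     : hypothetical stand-in for an in-memory scrublet call;
                   takes an in-memory block and returns one score per row.
    """
    is_lazy = hasattr(X, "compute")

    # Step 1: a lazy (Dask-like) matrix requires a batch key.
    if batch_labels is None:
        if is_lazy:
            raise ValueError("batch labels are required for Dask-backed data")
        return np.asarray(score_fn(np.asarray(X)), dtype=np.float64)

    batch_labels = np.asarray(batch_labels)
    if batch_labels.shape[0] != X.shape[0]:
        raise ValueError("need exactly one batch label per cell")

    scores = np.empty(X.shape[0], dtype=np.float64)
    for batch in np.unique(batch_labels):
        # Step 2: pick out the rows that belong to this batch.
        rows = np.flatnonzero(batch_labels == batch)
        block = X[rows]
        # Step 3: materialise only this batch, then score it in memory.
        if hasattr(block, "compute"):
            block = block.compute()
        # Step 4: write the results back in the original cell order.
        scores[rows] = score_fn(np.asarray(block))
    return scores
```

The returned array lines up with the rows of the input, so it could be assigned to a column in obs directly. Only one batch is held in memory at a time, which is the point of requiring batch_key for Dask input.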

Labels: enhancement (New feature or request)