[User question] speed of a random access query #3990

@sjfleming

I was interested in testing out TileDB for some in-house single-cell data I've been working with. I created a TileDB datastore in a Google Cloud bucket using tiledbsoma.io.from_anndata (uploading lots of h5ad files after running tiledbsoma.io.register_anndatas).

I then tried querying data from this datastore, which is backed by the Google Cloud bucket and holds 3 million cells.

A query like this (the first 10k cells, contiguous indices)

    logger.info("starting quick check")
    inds = range(10000)
    with tiledbsoma.Experiment.open(tiledb_bucket_path) as exp:
        with exp.axis_query(
            measurement_name,
            obs_query=tiledbsoma.AxisQuery(coords=(inds,)),
        ) as query:
            adata = query.to_anndata(
                X_name=x_layer_name,
                column_names={"obs": ["soma_joinid"], "var": ["soma_joinid"]},
            )
            logger.info("quick check done")

ran in 30 seconds, and I was thrilled!

But as soon as I tried to query 10k random cell indices, I ran into a long delay:

    logger.info("starting quick shuffled check")
    inds = np.arange(3_000_000)
    inds_shuffled = np.random.permutation(inds)
    inds = [i for i in inds_shuffled[:10000]]
    with tiledbsoma.Experiment.open(tiledb_bucket_path) as exp:
        with exp.axis_query(
            measurement_name,
            obs_query=tiledbsoma.AxisQuery(coords=(inds,)),
        ) as query:
            adata = query.to_anndata(
                X_name=x_layer_name,
                column_names={"obs": ["soma_joinid"], "var": ["soma_joinid"]},
            )
            logger.info("quick shuffled check done")

The above took an hour to run.

Questions:

  1. Am I doing something wrong / suboptimal above?
  2. Is this much longer query time for random access expected? Is it inherent to TileDB, where a truly random query forces it to open a large number of tiles, so it will inevitably take a very long time?
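
To make question 2 concrete, here is a rough back-of-the-envelope sketch of how many tiles along the obs axis each index set would touch. The tile extent of 2048 rows is purely an assumption for illustration (I don't know the actual extent used for my arrays), and `count_tiles_touched` is a hypothetical helper, not part of tiledbsoma:

```python
import numpy as np


def count_tiles_touched(inds, tile_extent):
    """Count distinct tiles along one axis that contain at least one index.

    Hypothetical helper: assumes tiles are fixed-size, contiguous blocks of
    `tile_extent` rows starting at row 0.
    """
    return np.unique(np.asarray(inds) // tile_extent).size


N_CELLS = 3_000_000
TILE_EXTENT = 2048  # assumed value, not read from the actual array schema

# Same two index sets as in the queries above: first 10k cells vs. 10k random cells.
contiguous = np.arange(10_000)
rng = np.random.default_rng(0)
shuffled = rng.permutation(N_CELLS)[:10_000]

n_contiguous = count_tiles_touched(contiguous, TILE_EXTENT)
n_shuffled = count_tiles_touched(shuffled, TILE_EXTENT)
print(f"contiguous: {n_contiguous} tiles, shuffled: {n_shuffled} tiles")
```

Under that assumption, the contiguous query touches 5 tiles, while the shuffled query touches nearly all of the roughly 1,465 tiles along the obs axis. If query cost scales with the number of tiles opened, a large slowdown would not be surprising, but I don't know whether that is what actually dominates here.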

Thanks!!
