Host datasets on scverse S3 #4005

@Zethson

What kind of feature would you like to request?

Other?

Please describe your wishes

Hey,

the pertpy CI fails every once in a while because a test that loads the pbmc3k dataset hits a download timeout. I don't know whether this is a concurrency issue or something else is at fault. By contrast, I haven't run into any issues with my datasets that are hosted on our S3. To consolidate where we host our datasets and to reduce the CI noise, I'd be happy if we could host these datasets on our S3 as well.

Example issue: scverse/pertpy#937

Error example:

    @pytest.fixture
    def add_nhood_expression_mdata(milo):
>       adata = sc.datasets.pbmc3k()
                ^^^^^^^^^^^^^^^^^^^^

tests/tools/test_milo.py:274: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/opt/hostedtoolcache/Python/3.12.13/x64/lib/python3.12/site-packages/scanpy/datasets/_utils.py:16: in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
/opt/hostedtoolcache/Python/3.12.13/x64/lib/python3.12/site-packages/scanpy/datasets/_datasets.py:410: in pbmc3k
    adata = read(settings.datasetdir / "pbmc3k_raw.h5ad", backup_url=url)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/opt/hostedtoolcache/Python/3.12.13/x64/lib/python3.12/site-packages/scanpy/readwrite.py:150: in read
    return _read(
/opt/hostedtoolcache/Python/3.12.13/x64/lib/python3.12/site-packages/scanpy/readwrite.py:832: in _read
    is_present = _check_datafile_present_and_download(filename, backup_url=backup_url)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/opt/hostedtoolcache/Python/3.12.13/x64/lib/python3.12/site-packages/scanpy/readwrite.py:1135: in _check_datafile_present_and_download
    _download(backup_url, path)
/opt/hostedtoolcache/Python/3.12.13/x64/lib/python3.12/site-packages/scanpy/readwrite.py:1092: in _download
    open_url = urlopen(req, context=create_default_context(cafile=where()))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/opt/hostedtoolcache/Python/3.12.13/x64/lib/python3.12/urllib/request.py:215: in urlopen
    return opener.open(url, data, timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/opt/hostedtoolcache/Python/3.12.13/x64/lib/python3.12/urllib/request.py:515: in open
    response = self._open(req, data)
               ^^^^^^^^^^^^^^^^^^^^^
/opt/hostedtoolcache/Python/3.12.13/x64/lib/python3.12/urllib/request.py:532: in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
/opt/hostedtoolcache/Python/3.12.13/x64/lib/python3.12/urllib/request.py:492: in _call_chain
    result = func(*args)
             ^^^^^^^^^^^
/opt/hostedtoolcache/Python/3.12.13/x64/lib/python3.12/urllib/request.py:1392: in https_open
                h.request(req.get_method(), req.selector, req.data, headers,
                          encode_chunked=req.has_header('Transfer-encoding'))
            except OSError as err: # timeout error
>               raise URLError(err)
E               urllib.error.URLError: <urlopen error [Errno 110] Connection timed out>

/opt/hostedtoolcache/Python/3.12.13/x64/lib/python3.12/urllib/request.py:1347: URLError
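
Independent of where the datasets end up being hosted, a test-side stopgap might be to retry the download when it times out. The following is only a minimal sketch, assuming the failures are transient `URLError`s like the one above. `retry_on_url_error` and its parameters are hypothetical names for illustration and are not part of scanpy or pertpy.

```python
import time
from urllib.error import URLError


def retry_on_url_error(load, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `load()` and retry on URLError with exponential backoff.

    Hypothetical helper for illustration; not part of scanpy or pertpy.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return load()
        except URLError:
            # Re-raise once the last allowed attempt has failed.
            if attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2**attempt)
```

In a fixture this could be used as `adata = retry_on_url_error(sc.datasets.pbmc3k)`. It would only mask transient network errors; it does not address the underlying hosting question.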
