[Feature request]: Download specific folders from datasets #285

@innat

Description

Currently, kagglehub supports downloading:

  • An entire dataset/competition, or
  • A single file via the path argument

However, it does not support downloading specific folders within a dataset. This becomes a major limitation when working with large public datasets or competition data where only a subset of folders is needed.

Example dataset structure

dataset/
├── train_images/
├── train_mask/
└── additional_dataset/

In many real-world scenarios:

  • The full dataset is very large
  • Users may only need train_images + train_mask
  • Or only additional_dataset

At the moment, the only option is to download the entire dataset, which is inefficient and sometimes impractical.

Current behavior

Single file download is supported:

import kagglehub

# dataset_id is a placeholder for the dataset handle ("<owner>/<dataset-name>")
filename = "dataset/img.png"
path = kagglehub.dataset_download(
    dataset_id,
    path=filename,
)

Folder-level downloads are not supported:

# Not supported (but highly desired)
folder = "dataset/additional_dataset"
path = kagglehub.dataset_download(
    dataset_id,
    path=folder
)

# Not supported (but highly desired)
folders = ["dataset/train_images", "dataset/train_mask"]
path = kagglehub.dataset_download(
    dataset_id,
    path=folders
)

Proposed enhancement

Extend the path argument to support:

  1. Single folder download
  2. Multiple folder downloads

Suggested API behavior:

# Single folder
path = kagglehub.dataset_download(
    dataset_id,
    path="dataset/additional_dataset"
)

# Multiple folders
path = kagglehub.dataset_download(
    dataset_id,
    path=["dataset/train_images", "dataset/train_mask"]
)

Expected behavior:

  • Only the specified folder(s) are downloaded
  • Directory structure is preserved
  • Works for both datasets and competition data
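Accepting either a single string or a list in the path argument implies a small normalization step on the client side. A minimal sketch of that step follows; `normalize_paths` is a hypothetical helper written for illustration and is not part of kagglehub:

```python
from typing import List, Sequence, Union


def normalize_paths(path: Union[str, Sequence[str], None]) -> List[str]:
    """Turn the proposed `path` argument into a flat list of folder paths.

    Hypothetical helper, not part of kagglehub.
    """
    if path is None:
        # No path given: the caller would fall back to a full download.
        return []
    # A bare string is treated as a single entry, not as a sequence of characters.
    items = [path] if isinstance(path, str) else list(path)
    # Drop leading/trailing slashes so "a/b" and "a/b/" mean the same folder.
    return [item.strip("/") for item in items]


single = normalize_paths("dataset/additional_dataset/")
multiple = normalize_paths(["dataset/train_images", "dataset/train_mask/"])
print(single)
print(multiple)
```

With this shape, the single-folder and multi-folder calls above would go through the same code path after normalization.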

Why this matters

  • Large Kaggle datasets can run to hundreds of gigabytes
  • Partial downloads save:
    • Bandwidth
    • Disk space
    • Time
  • This feature is crucial for:
    • Model prototyping
    • Educational and research workflows

Additional notes

  • Kaggle’s backend already supports file-level access
  • Folder-level support could be implemented as:
    • Recursive file resolution under a prefix
    • Or client-side filtering after metadata listing
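The client-side filtering option could look roughly like the sketch below. It assumes the full list of file paths in the dataset is already available from a metadata listing; how that listing is obtained is left open here, and `filter_by_folders` is a hypothetical name, not an existing kagglehub function:

```python
from typing import Iterable, List


def filter_by_folders(file_paths: Iterable[str], folders: Iterable[str]) -> List[str]:
    """Keep only the files located under one of the given folders.

    Hypothetical helper, not part of kagglehub.
    """
    # End every folder with exactly one "/" so matching happens on whole
    # path segments: "dataset/train" must not match "dataset/train_images/0.png".
    prefixes = tuple(folder.strip("/") + "/" for folder in folders)
    return [p for p in file_paths if p.startswith(prefixes)]


# Example listing, made up to mirror the dataset structure in this issue.
listing = [
    "dataset/train_images/0.png",
    "dataset/train_mask/0.png",
    "dataset/additional_dataset/extra.csv",
]
selected = filter_by_folders(listing, ["dataset/train_images", "dataset/train_mask"])
print(selected)
```

Each selected path could then be fetched through the existing single-file path argument, which would also keep the directory structure intact, though issuing one request per file may be slow for folders with many files.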
