Description
Currently, kagglehub supports downloading:
- An entire dataset/competition, or
- A single file via the path argument
However, it does not support downloading specific folders within a dataset. This becomes a major limitation when working with large public datasets or competition data where only a subset of folders is needed.
Example dataset structure
dataset/
├── train_images/
├── train_mask/
└── additional_dataset/
In many real-world scenarios:
- The full dataset is very large
- Users may only need train_images + train_mask
- Or only additional_dataset
At the moment, the only options are to download the entire dataset or to request files one at a time, which is inefficient and sometimes impractical.
Current behavior
Single file download is supported:
import kagglehub

# dataset_id is the dataset handle; filename is a path inside the dataset
filename = "dataset/img.png"
path = kagglehub.dataset_download(
    dataset_id,
    path=filename
)
Folder-level downloads are not supported:
# Not supported (but highly desired)
folder = "dataset/additional_dataset"
path = kagglehub.dataset_download(
    dataset_id,
    path=folder
)
# Not supported (but highly desired)
folders = ["dataset/train_images", "dataset/train_mask"]
path = kagglehub.dataset_download(
    dataset_id,
    path=folders
)
Proposed enhancement
Extend the path argument to support:
- Single folder download
- Multiple folder downloads
Suggested API behavior:
# Single folder
path = kagglehub.dataset_download(
    dataset_id,
    path="dataset/additional_dataset"
)
# Multiple folders
path = kagglehub.dataset_download(
    dataset_id,
    path=["dataset/train_images", "dataset/train_mask"]
)
Expected behavior:
- Only the specified folder(s) are downloaded
- Directory structure is preserved
- Works for both datasets and competition data
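To make the suggested API concrete, here is a minimal sketch of how a `path` value that may be a single string or a list of strings could be normalized before any download logic runs. The helper name `normalize_path_arg` is hypothetical and not part of kagglehub; it only illustrates one way to accept both forms.

```python
from typing import List, Optional, Sequence, Union


def normalize_path_arg(path: Optional[Union[str, Sequence[str]]]) -> List[str]:
    """Hypothetical helper (not a kagglehub function).

    Turns the proposed `path` argument into a list of clean prefixes.
    An empty list means "no filter", i.e. download everything.
    """
    if path is None:
        return []
    # A bare string is treated as a single entry, not as a sequence of characters.
    raw = [path] if isinstance(path, str) else list(path)
    cleaned: List[str] = []
    for item in raw:
        item = item.strip("/")  # "a/b/" and "a/b" refer to the same folder
        if item and item not in cleaned:  # drop empties and duplicates, keep order
            cleaned.append(item)
    return cleaned
```

With this shape, existing single-file calls would keep working unchanged, since a plain string is simply wrapped into a one-element list.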
Why this matters
- Large Kaggle datasets can be hundreds of gigabytes in size
- Partial downloads save:
  - Bandwidth
  - Disk space
  - Time
- This feature is crucial for:
  - Model prototyping
  - Educational and research workflows
Additional notes
- Kaggle’s backend already supports file-level access
- Folder-level support could be implemented as:
  - Recursive file resolution under a prefix
  - Or client-side filtering after metadata listing
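The client-side filtering option could look roughly like the sketch below. It assumes a flat list of file paths has already been obtained from a metadata listing (how that listing is fetched is left out), and the function name `select_files` and the example file names are hypothetical. Matching is done on whole path segments so that a prefix such as `dataset/train_images` does not accidentally select a sibling folder like `dataset/train_images_extra`.

```python
from typing import Iterable, List


def select_files(all_files: Iterable[str], prefixes: Iterable[str]) -> List[str]:
    """Hypothetical helper: keep only files that live under one of the prefixes.

    `all_files` stands in for the result of a metadata listing; each entry is
    a "/"-separated path relative to the dataset root.
    """
    wanted = [p.strip("/") for p in prefixes]
    selected: List[str] = []
    for file_path in all_files:
        for prefix in wanted:
            # Exact match covers the single-file case; the "/" suffix
            # ensures the prefix ends on a folder boundary.
            if file_path == prefix or file_path.startswith(prefix + "/"):
                selected.append(file_path)
                break
    return selected


# Made-up listing for illustration only
listing = [
    "dataset/train_images/0.png",
    "dataset/train_images_extra/1.png",
    "dataset/train_mask/0.png",
    "dataset/additional_dataset/a.csv",
]
subset = select_files(listing, ["dataset/train_images", "dataset/train_mask"])
```

Each selected path could then be fetched through the existing single-file download, which would also preserve the directory structure because the relative paths are kept as-is.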