Issue: I am trying to create a dataset from a Kaggle kernel. My workflow converts an existing input dataset into TFRecord format, shard by shard. Because disk space in the kernel is limited, my plan was to build and upload the dataset incrementally, one shard at a time, deleting each local shard after upload to free space.
The problem is that each shard upload creates a new version of the dataset. This is inconvenient for any workflow that needs to upload a large dataset incrementally.

As the code stands, if the dataset already exists, dataset_upload() automatically creates a new version on every call. Ideally, there would be a way to append files to the same dataset version before publishing, so that large datasets could be uploaded incrementally without generating multiple versions. A sketch of the current loop, and of the API shape I have in mind, follows.
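For reference, here is a minimal sketch of the loop I am describing, assuming the kaggle Python package and a staging folder that already contains a dataset-metadata.json. The folder path, num_shards, and convert_shard_to_tfrecord are placeholders standing in for my actual conversion logic:

```python
import os
from kaggle.api.kaggle_api_extended import KaggleApi

# Staging folder; assumed to already contain a dataset-metadata.json.
folder = "/kaggle/working/my-dataset"
num_shards = 100  # however many shards the conversion produces


def convert_shard_to_tfrecord(shard_id: int, out_dir: str) -> str:
    """Hypothetical stand-in for the real TFRecord conversion step."""
    path = os.path.join(out_dir, f"shard-{shard_id:05d}.tfrecord")
    with open(path, "wb") as f:
        pass  # placeholder: the real code writes serialized tf.train.Examples
    return path


api = KaggleApi()
api.authenticate()

for shard_id in range(num_shards):
    shard_path = convert_shard_to_tfrecord(shard_id, folder)

    # Publishes immediately: every shard upload becomes its own
    # dataset version, which is the behavior described above.
    api.dataset_create_version(folder, version_notes=f"add shard {shard_id}")

    # Delete the local shard to free kernel disk space before the next one.
    os.remove(shard_path)
```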
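What I am asking for could look something like the sketch below. To be clear, every method in it (dataset_start_version, append_file, publish) is purely hypothetical and does not exist in the Kaggle API today; it reuses the names from the sketch above only to show the shape of the workflow:

```python
# Hypothetical feature sketch -- none of these methods exist today.
version = api.dataset_start_version(folder, version_notes="TFRecord shards")

for shard_id in range(num_shards):
    shard_path = convert_shard_to_tfrecord(shard_id, folder)
    version.append_file(shard_path)  # upload one shard; version stays unpublished
    os.remove(shard_path)            # reclaim kernel disk space immediately

version.publish()  # a single new version containing all of the shards
```

With something like this, only one new version would be created per run, no matter how many shards are uploaded, and the kernel would never need enough disk space to hold more than one shard at a time.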