Describe the bug
The datasets library is widely used by many Python packages, so it is naturally a requirement on many platforms, including vLLM for ROCm. During audio dataset tests, the following exception is triggered:
    def decode_example(
        self, value: dict, token_per_repo_id: Optional[dict[str, Union[str, bool, None]]] = None
    ) -> "AudioDecoder":
        """Decode example audio file into audio data.

        Args:
            value (`dict`):
                A dictionary with keys:

                - `path`: String with relative audio file path.
                - `bytes`: Bytes of the audio file.
            token_per_repo_id (`dict`, *optional*):
                To access and decode audio files from private repositories on
                the Hub, you can pass a dictionary repo_id (`str`) -> token (`bool` or `str`)

        Returns:
            `torchcodec.decoders.AudioDecoder`
        """
        if config.TORCHCODEC_AVAILABLE:
            from ._torchcodec import AudioDecoder
        else:
>           raise ImportError("To support decoding audio data, please install 'torchcodec'.")
E           ImportError: To support decoding audio data, please install 'torchcodec'.

At the same time, torchcodec cannot be installed on ROCm: its GPU acceleration relies on NVIDIA's NVDEC hardware decoder, which is NVIDIA-specific. As a result, any code path that reaches this block fails on ROCm. Could you add an alternative package there as a fallback instead of raising an ImportError?
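One possible shape for such a fallback is a minimal sketch, not the actual datasets API: the backend list, `pick_audio_backend`, and `decode_example_with_fallback` below are illustrative assumptions. The idea is to probe for any importable decoder and only raise when nothing usable is found, so CPU-only backends such as soundfile can serve ROCm machines:

```python
import importlib.util

# Hypothetical backend preference order; soundfile decodes on the CPU and
# therefore works on ROCm machines where torchcodec cannot be installed.
_AUDIO_BACKENDS = ("torchcodec", "soundfile")


def pick_audio_backend():
    """Return the name of the first importable audio backend, else None."""
    for name in _AUDIO_BACKENDS:
        if importlib.util.find_spec(name) is not None:
            return name
    return None


def decode_example_with_fallback(value: dict):
    """Sketch of decode_example with a CPU fallback instead of an ImportError.

    `value` mirrors the dict from the issue: {"path": ..., "bytes": ...}.
    """
    backend = pick_audio_backend()
    if backend == "torchcodec":
        from torchcodec.decoders import AudioDecoder
        return AudioDecoder(value["bytes"])
    if backend == "soundfile":
        import io
        import soundfile as sf
        # Decode the raw bytes on the CPU; no NVDEC hardware required.
        array, sampling_rate = sf.read(io.BytesIO(value["bytes"]))
        return {"array": array, "sampling_rate": sampling_rate}
    raise ImportError(
        "To decode audio data, install one of: " + ", ".join(_AUDIO_BACKENDS)
    )
```

This keeps the current torchcodec behavior when it is present and only changes what happens when it is missing.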
Steps to reproduce the bug
On a machine with MI300/MI325/MI355:
pytest -s -v tests/entrypoints/openai/correctness/test_transcription_api_correctness.py::test_wer_correctness[12.74498-D4nt3/esb-datasets-earnings22-validation-tiny-filtered-openai/whisper-large-v3]

Expected behavior
_________________________________________________ test_wer_correctness[12.74498-D4nt3/esb-datasets-earnings22-validation-tiny-filtered-openai/whisper-large-v3] _________________________________________________
model_name = 'openai/whisper-large-v3', dataset_repo = 'D4nt3/esb-datasets-earnings22-validation-tiny-filtered', expected_wer = 12.74498, n_examples = -1, max_concurrent_request = None
    @pytest.mark.parametrize("model_name", ["openai/whisper-large-v3"])
    # Original dataset is 20GB+ in size, hence we use a pre-filtered slice.
    @pytest.mark.parametrize(
        "dataset_repo", ["D4nt3/esb-datasets-earnings22-validation-tiny-filtered"]
    )
    # NOTE: Expected WER measured with equivalent hf.transformers args:
    # whisper-large-v3 + esb-datasets-earnings22-validation-tiny-filtered.
    @pytest.mark.parametrize("expected_wer", [12.744980])
    def test_wer_correctness(
        model_name, dataset_repo, expected_wer, n_examples=-1, max_concurrent_request=None
    ):
        # TODO refactor to use `ASRDataset`
        with RemoteOpenAIServer(model_name, ["--enforce-eager"]) as remote_server:
>           dataset = load_hf_dataset(dataset_repo)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tests/entrypoints/openai/correctness/test_transcription_api_correctness.py:160:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/entrypoints/openai/correctness/test_transcription_api_correctness.py:111: in load_hf_dataset
if "duration_ms" not in dataset[0]:
^^^^^^^^^^
/usr/local/lib/python3.12/dist-packages/datasets/arrow_dataset.py:2876: in __getitem__
return self._getitem(key)
^^^^^^^^^^^^^^^^^^
/usr/local/lib/python3.12/dist-packages/datasets/arrow_dataset.py:2858: in _getitem
formatted_output = format_table(
/usr/local/lib/python3.12/dist-packages/datasets/formatting/formatting.py:658: in format_table
return formatter(pa_table, query_type=query_type)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/usr/local/lib/python3.12/dist-packages/datasets/formatting/formatting.py:411: in __call__
return self.format_row(pa_table)
^^^^^^^^^^^^^^^^^^^^^^^^^
/usr/local/lib/python3.12/dist-packages/datasets/formatting/formatting.py:460: in format_row
row = self.python_features_decoder.decode_row(row)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/usr/local/lib/python3.12/dist-packages/datasets/formatting/formatting.py:224: in decode_row
return self.features.decode_example(row, token_per_repo_id=self.token_per_repo_id) if self.features else row
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/usr/local/lib/python3.12/dist-packages/datasets/features/features.py:2111: in decode_example
column_name: decode_nested_example(feature, value, token_per_repo_id=token_per_repo_id)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/usr/local/lib/python3.12/dist-packages/datasets/features/features.py:1419: in decode_nested_example
return schema.decode_example(obj, token_per_repo_id=token_per_repo_id) if obj is not None else None
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Environment info
- datasets version: 4.4.2
- Platform: Linux-5.15.0-161-generic-x86_64-with-glibc2.35
- Python version: 3.12.12
- huggingface_hub version: 0.36.0
- PyArrow version: 22.0.0
- Pandas version: 2.3.3
- fsspec version: 2025.10.0