RWC_popular update to 2.0 #699
yujin-kimmm wants to merge 7 commits into mir-dataset-loaders:master
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## master #699 +/- ##
==========================================
+ Coverage 97.13% 97.22% +0.08%
==========================================
Files 71 71
Lines 7825 7856 +31
==========================================
+ Hits 7601 7638 +37
+ Misses 224 218 -6
@stefan-balke Hi Stefan! This is the PR for the new 2.0 RWC database loader. Right now it's only (cc'd @magdalenafuentes)
Thanks for this, I will look into it!
Started to look into it. Maybe the main questions:
- You have now replaced rwc_popular with RWC 2.0. Doesn't this break the interface for everyone using the original data, or was it not used at all?
- Does it make sense to call it rwc2_popular to make the distinction? Would it make sense to stick with the abbreviations we used in the paper (RWC-P, RWC-C, ...)?
- Is the plan to have a loader for each subcollection? There will be a lot of shared code since the annotations should all be in the same format across datasets. How will this be handled? Start with RWC-P and import those common functions into RWC-C?
- When creating the index, are you using a specific tag/version from the annotation repository or always the most recent one?
Test script:
import mirdata
rwc_popular = mirdata.initialize('rwc_popular')
rwc_popular.download() # download the dataset
rwc_popular.validate() # validate that all the expected files are there
example_track = rwc_popular.choice_track() # choose a random example track
print(example_track) # see the available data
Output:
╰─ python test_rwc.py (mirdata)
/Users/stefan/micromamba/envs/mirdata/lib/python3.12/site-packages/requests/__init__.py:113: RequestsDependencyWarning: urllib3 (2.6.3) or chardet (6.0.0.post1)/charset_normalizer (3.4.4) doesn't match a supported version!
warnings.warn(
WARNING: Downloading ['audio', 'annotation', 'annotation_archive', 'index']. Index is being stored in /Users/stefan/dev/mirdata/mirdata/datasets/indexes, and the rest of files in /Users/stefan/mir_datasets/rwc_popular
WARNING: [audio] downloading RWC-P.zip.zip
3.79GB [08:08, 8.33MB/s]
WARNING: [annotation] downloading rwc-annotations-main.zip
5.59MB [00:01, 5.03MB/s]
WARNING: [annotation_archive] downloading rwc-annotations-archive-main.zip
24.2MB [00:05, 5.06MB/s]
WARNING: [index] downloading rwc_popular_index_2.0.json
0.00B [00:00, ?B/s]ERROR:
mirdata failed to download the dataset from https://zenodo.org/records/18751784/files/rwc_popular_index_2.0.json?download=1!
Please try again in a few minutes.
If this error persists, please raise an issue at
https://github.com/mir-dataset-loaders/mirdata,
and tag it with 'broken-link'.
0.00B [00:00, ?B/s]
Traceback (most recent call last):
File "/Users/stefan/dev/mirdata/test_rwc.py", line 4, in <module>
rwc_popular.download() # download the dataset
^^^^^^^^^^^^^^^^^^^^^^
File "/Users/stefan/dev/mirdata/mirdata/core.py", line 454, in download
download_utils.downloader(
File "/Users/stefan/dev/mirdata/mirdata/download_utils.py", line 156, in downloader
download_path = download_from_remote(
^^^^^^^^^^^^^^^^^^^^^
File "/Users/stefan/dev/mirdata/mirdata/download_utils.py", line 279, in download_from_remote
raise exc
File "/Users/stefan/dev/mirdata/mirdata/download_utils.py", line 264, in download_from_remote
urllib.request.urlretrieve(
File "/Users/stefan/micromamba/envs/mirdata/lib/python3.12/urllib/request.py", line 240, in urlretrieve
with contextlib.closing(urlopen(url, data)) as fp:
^^^^^^^^^^^^^^^^^^
File "/Users/stefan/micromamba/envs/mirdata/lib/python3.12/urllib/request.py", line 215, in urlopen
return opener.open(url, data, timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/stefan/micromamba/envs/mirdata/lib/python3.12/urllib/request.py", line 521, in open
response = meth(req, response)
^^^^^^^^^^^^^^^^^^^
File "/Users/stefan/micromamba/envs/mirdata/lib/python3.12/urllib/request.py", line 630, in http_response
response = self.parent.error(
^^^^^^^^^^^^^^^^^^
File "/Users/stefan/micromamba/envs/mirdata/lib/python3.12/urllib/request.py", line 559, in error
return self._call_chain(*args)
^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/stefan/micromamba/envs/mirdata/lib/python3.12/urllib/request.py", line 492, in _call_chain
result = func(*args)
^^^^^^^^^^^
File "/Users/stefan/micromamba/envs/mirdata/lib/python3.12/urllib/request.py", line 639, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: NOT FOUND
I guess the index is not yet released? Is there any way to circumvent this?
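One way to fail fast, instead of hitting the 404 only after the 3.79 GB audio download, would be to probe the index URL before starting. The following is a minimal sketch using only the standard library; `remote_file_exists` is a hypothetical helper and not part of mirdata, and the `opener` parameter exists only so the check can be exercised without network access.

```python
import urllib.error
import urllib.request


def remote_file_exists(url, opener=urllib.request.urlopen, timeout=10):
    """Return True if a HEAD request to `url` succeeds, False on an HTTP error.

    Hypothetical helper, not a mirdata function. Only HTTP-level errors
    (such as 404) are mapped to False; connection problems still raise.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except urllib.error.HTTPError:
        return False
```

Calling it on the index URL from the log above before `rwc_popular.download()` would have reported the missing file up front.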
sections_rel = os.path.join(
    "rwc-annotations-archive-main",
    "AIST_RWC-MDB-P-2001_CHORUS",
I would not add annotations from the archive. Let us first switch to the preprocessed annotations.
voca_inst_rel = os.path.join(
    "rwc-annotations-archive-main",
    "AIST_RWC-MDB-P-2001_VOCA_INST",
voca_inst_idx, voca_inst_md5 = None, None
melody_rel = os.path.join(
    "rwc-annotations-archive-main",
    year={2002},
    series={ISMIR},
    note={Cite this if using audio, beat or section annotations},
BIBTEX = """@inproceedings{GotoHNO02_RWC_ISMIR,
This would be the paper describing RWC 2.0:
@Article{BalkeZAMTGM26_RWCRevisited_TISMIR,
author = {Stefan Balke and Johannes Zeitler and Vlora Arifi-M{\"u}ller and Brian McFee and Tomoyasu Nakano and Masataka Goto and Meinard M{\"u}ller},
title = {{RWC} Revisited: {T}owards a Community-Driven {MIR} Corpus},
journal = {Transactions of the International Society for Music Information Retrieval},
volume = {9},
issue = {1},
pages = {21--35},
year = {2026},
doi = {10.5334/tismir.326}
}
modern Japanese popular music typical of songs on the Japanese hit charts in
the 1990s.

For more details, please visit: https://staff.aist.go.jp/m.goto/RWC-MDB/rwc-mdb-p.html
This website will be offline soon. Better to link to the Zenodo repository.
Hey @stefan-balke, thanks for taking a look. Some thoughts:
It will be clear that this is version 2.0 from the index version and the docs.
For the loader name, you mean? E.g. initialize('rwc-p')?
This one is an open question. One loader per collection is how this is done now, and common functions are pulled from rwc classical. We could go this way, or adapt a single loader to have the subcollections as "versions" of the dataset, so you could still load each of them separately, and maybe have a "full" version with all collections. I don't think we have a loader like this, but it can be done. I'm fine either way, but I do think that the collections should be accessible independently for simplicity (e.g. for users who only want to work with one of them). This "unified" loader could be called simply
Not sure I understand this. The index points to a particular version of the dataset, and you could have more than one version.
It should be available now. I have a question regarding annotations: RWC-P has a lot of annotations available that we currently pull from here and there (e.g. voice activity), but I saw that you have fewer annotations included in RWC2. Do you plan to include them in the future?
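To make the "subcollections as versions" option more concrete, here is a rough sketch of how a version name could map to one or more index files. Everything in it is an assumption for illustration: `SUBCOLLECTIONS` and `indexes_for` are hypothetical names, and only `rwc_popular_index_2.0.json` is a filename that actually appears in this thread.

```python
# Hypothetical mapping from a "version" name to an index file.
# Only the popular filename comes from this PR; the classical one is assumed.
SUBCOLLECTIONS = {
    "popular": "rwc_popular_index_2.0.json",
    "classical": "rwc_classical_index_2.0.json",  # assumed filename
}


def indexes_for(version):
    """Return the index files for one subcollection, or all of them for "full"."""
    if version == "full":
        return list(SUBCOLLECTIONS.values())
    if version not in SUBCOLLECTIONS:
        valid = sorted(SUBCOLLECTIONS) + ["full"]
        raise ValueError(f"Unknown version {version!r}; expected one of {valid}")
    return [SUBCOLLECTIONS[version]]
```

With this layout each collection stays independently loadable, while "full" simply combines every registered index.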
@magdalenafuentes I think this is about pinning a specific version/commit of the annotation repository in the index. Right now the rwc-popular index points to the annotation repository's master branch, which might cause checksum issues whenever a new commit lands on that branch. I remember we had a similar issue for
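To illustrate the checksum concern: an index stores one checksum per file, so any change to an annotation file after the index was built makes validation fail. The sketch below shows that comparison with the standard library only; `md5_of_file` and `matches_index` are hypothetical helper names, not mirdata's own functions.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fhandle:
        # Read until the file is exhausted (read() returns b"" at EOF).
        for chunk in iter(lambda: fhandle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_index(path, expected_md5):
    """Return True if the file on disk still matches the checksum in the index."""
    return md5_of_file(path) == expected_md5
```

Pointing the index at a tag or commit rather than a moving branch would keep the downloaded files aligned with the checksums recorded when the index was created.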
I think it would be better to have separate datasets.
As Yujin said, I mean specific versions of the annotations. Maybe this is also a new concept: the community might update the annotations, which would lead to a new release on the annotation side.
Will give it a shot!
All annotations I am aware of are here: If you have more, let me know. This is kind of the landing zone for the annotations. Some have really strange formats. Thanks for all this!
This PR updates the RWC-Popular dataset loader to version 2.0 (re-released in 2026).
Major Update
- The load_beats function, which was originally shared with rwc_classical, is now in rwc_popular. (Future updates for other RWC collections can share the load_beats function from rwc_popular.)
Dataset loaders checklist:
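- Create a script in scripts/, e.g. make_my_dataset_index.py, which generates an index file.
- Create a module in mirdata, e.g. mirdata/my_dataset.py
- Create tests for your loader in tests/datasets/, e.g. test_my_dataset.py
- Add your module to docs/source/mirdata.rst and docs/source/table.rst
- Run black, flake8 and mypy (see Running your tests locally).
- Run tests/test_full_dataset.py on your dataset.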