RWC_popular update to 2.0 by yujin-kimmm · Pull Request #699 · mir-dataset-loaders/mirdata

yujin-kimmm · 2026-02-24T15:16:49Z

This PR updates the RWC-Popular dataset loader to version 2.0 (re-released in 2026).

The index for version 2.0 has been submitted to Zenodo and is pending approval.

Major Update

Audio: now publicly available and downloadable with the loader.
Annotations: beat and chord annotations now come from the new RWC 2.0 annotations. The load_beats function, which was previously shared with rwc_classical, therefore now lives in rwc_popular. (Future updates for other RWC collections can share the load_beats function from rwc_popular.)
Metadata: now uses the new metadata from 2.0.
F0 annotations added.

Dataset loaders checklist:

Create a script in scripts/, e.g. make_my_dataset_index.py, which generates an index file.
Run the script on the canonical version of the dataset and upload the index to Zenodo Audio Data Loaders community.
Create a sample version of the index with the necessary information for testing.
Create a module in mirdata, e.g. mirdata/my_dataset.py
Create tests for your loader in tests/datasets/, e.g. test_my_dataset.py
Add your module to docs/source/mirdata.rst and docs/source/table.rst
Run black, flake8 and mypy (see Running your tests locally).
Run tests/test_full_dataset.py on your dataset.
Check that codecov coverage does not decrease.
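
For readers unfamiliar with the first checklist item, here is a minimal sketch of what an index-generating script does: walk the dataset, and record a relative path plus an MD5 checksum for each file. It is not the actual scripts/make_rwc_popular_index.py from this PR. The function names, the `audio` folder, the `.wav` extension, and the track id taken from the file name are all assumptions for illustration.

```python
import hashlib
import os


def md5(file_path):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    hasher = hashlib.md5()
    with open(file_path, "rb") as fhandle:
        for chunk in iter(lambda: fhandle.read(4096), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def make_index(data_path, audio_dir="audio", version="2.0"):
    """Build an index dict: track id -> {"audio": [relative path, checksum]}.

    Assumed layout (hypothetical): <data_path>/<audio_dir>/<track_id>.wav
    """
    index = {"version": version, "tracks": {}}
    audio_root = os.path.join(data_path, audio_dir)
    for fname in sorted(os.listdir(audio_root)):
        if not fname.endswith(".wav"):
            continue  # skip anything that is not an audio file
        track_id = os.path.splitext(fname)[0]
        rel_path = os.path.join(audio_dir, fname)
        checksum = md5(os.path.join(data_path, rel_path))
        index["tracks"][track_id] = {"audio": [rel_path, checksum]}
    return index
```

The resulting dict can be written out with `json.dump` and uploaded; the checksums are what the loader's validate step later compares against the local files.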

codecov · 2026-02-24T15:42:54Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.22%. Comparing base (b95bf38) to head (e8b5165).

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #699      +/-   ##
==========================================
+ Coverage   97.13%   97.22%   +0.08%     
==========================================
  Files          71       71              
  Lines        7825     7856      +31     
==========================================
+ Hits         7601     7638      +37     
+ Misses        224      218       -6


yujin-kimmm · 2026-02-24T17:06:08Z

@stefan-balke Hi Stefan! This is the PR for the new RWC 2.0 database loader. Right now it covers only rwc-popular, and we are still working on it. The PR is a work in progress at the moment, but feel free to start taking a look if you want. Thanks!

(cc'd @magdalenafuentes )

stefan-balke · 2026-02-24T21:26:07Z

Thanks for this, I will look into it!

stefan-balke

Started to look into it. Maybe the main questions:

You have now replaced rwc_popular with RWC 2.0.
Doesn't this break the interfaces for all people using the original data or was this not used at all?
Does it make sense to call it rwc2_popular to make the distinction?
Would it make sense to stick with the abbreviations we used in the paper? RWC-P, RWC-C, ...?
The plan is to have a loader for each subcollection? There will be a lot of shared code since the annotations should all be in the same format across datasets. How will this be handled? Start with RWC-P and import those common functions into RWC-C?
When creating the index, are you using a specific tag/version from the annotation repository or always the most recent one?

Testscript

import mirdata

rwc_popular = mirdata.initialize('rwc_popular')
rwc_popular.download()  # download the dataset
rwc_popular.validate()  # validate that all the expected files are there

example_track = rwc_popular.choice_track()  # choose a random example track
print(example_track)  # see the available data

Output:

╰─ python test_rwc.py                                                                                            (mirdata)
/Users/stefan/micromamba/envs/mirdata/lib/python3.12/site-packages/requests/__init__.py:113: RequestsDependencyWarning: urllib3 (2.6.3) or chardet (6.0.0.post1)/charset_normalizer (3.4.4) doesn't match a supported version!
  warnings.warn(
WARNING: Downloading ['audio', 'annotation', 'annotation_archive', 'index']. Index is being stored in /Users/stefan/dev/mirdata/mirdata/datasets/indexes, and the rest of files in /Users/stefan/mir_datasets/rwc_popular
WARNING: [audio] downloading RWC-P.zip.zip
3.79GB [08:08, 8.33MB/s]
WARNING: [annotation] downloading rwc-annotations-main.zip
5.59MB [00:01, 5.03MB/s]
WARNING: [annotation_archive] downloading rwc-annotations-archive-main.zip
24.2MB [00:05, 5.06MB/s]
WARNING: [index] downloading rwc_popular_index_2.0.json
0.00B [00:00, ?B/s]ERROR:
                            mirdata failed to download the dataset from https://zenodo.org/records/18751784/files/rwc_popular_index_2.0.json?download=1!
                            Please try again in a few minutes.
                            If this error persists, please raise an issue at
                            https://github.com/mir-dataset-loaders/mirdata,
                            and tag it with 'broken-link'.

0.00B [00:00, ?B/s]
Traceback (most recent call last):
  File "/Users/stefan/dev/mirdata/test_rwc.py", line 4, in <module>
    rwc_popular.download()  # download the dataset
    ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stefan/dev/mirdata/mirdata/core.py", line 454, in download
    download_utils.downloader(
  File "/Users/stefan/dev/mirdata/mirdata/download_utils.py", line 156, in downloader
    download_path = download_from_remote(
                    ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stefan/dev/mirdata/mirdata/download_utils.py", line 279, in download_from_remote
    raise exc
  File "/Users/stefan/dev/mirdata/mirdata/download_utils.py", line 264, in download_from_remote
    urllib.request.urlretrieve(
  File "/Users/stefan/micromamba/envs/mirdata/lib/python3.12/urllib/request.py", line 240, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
                            ^^^^^^^^^^^^^^^^^^
  File "/Users/stefan/micromamba/envs/mirdata/lib/python3.12/urllib/request.py", line 215, in urlopen
    return opener.open(url, data, timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stefan/micromamba/envs/mirdata/lib/python3.12/urllib/request.py", line 521, in open
    response = meth(req, response)
               ^^^^^^^^^^^^^^^^^^^
  File "/Users/stefan/micromamba/envs/mirdata/lib/python3.12/urllib/request.py", line 630, in http_response
    response = self.parent.error(
               ^^^^^^^^^^^^^^^^^^
  File "/Users/stefan/micromamba/envs/mirdata/lib/python3.12/urllib/request.py", line 559, in error
    return self._call_chain(*args)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stefan/micromamba/envs/mirdata/lib/python3.12/urllib/request.py", line 492, in _call_chain
    result = func(*args)
             ^^^^^^^^^^^
  File "/Users/stefan/micromamba/envs/mirdata/lib/python3.12/urllib/request.py", line 639, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: NOT FOUND

I guess the index is not yet released? Any ways of circumventing this?
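
Since the index is fetched last, after the multi-gigabyte audio download, a quick probe of the index URL beforehand can show whether the record has been published yet. The sketch below uses only the standard library; `remote_file_status` and its injectable `opener` parameter are hypothetical helpers for illustration, not part of mirdata.

```python
import urllib.error
import urllib.request


def remote_file_status(url, opener=urllib.request.urlopen):
    """Return the HTTP status code for url.

    If the server answers with an error status (e.g. 404 for a record
    that is not published yet), return that error code instead of raising.
    """
    try:
        with opener(url) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code
```

A return value of 404 for the index URL from the log above would indicate that the Zenodo record is still pending, rather than a transient network failure.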

stefan-balke · 2026-02-26T07:57:28Z

scripts/make_rwc_popular_index.py

+
+        sections_rel = os.path.join(
+            "rwc-annotations-archive-main",
+            "AIST_RWC-MDB-P-2001_CHORUS",


I would not add annotations from the archive. Let us first switch to the preprocessed ones.

stefan-balke · 2026-02-26T07:57:43Z

scripts/make_rwc_popular_index.py

+
+        voca_inst_rel = os.path.join(
+            "rwc-annotations-archive-main",
+            "AIST_RWC-MDB-P-2001_VOCA_INST",


same here with the archive

stefan-balke · 2026-02-26T07:57:50Z

scripts/make_rwc_popular_index.py

+            voca_inst_idx, voca_inst_md5 = None, None
+
+        melody_rel = os.path.join(
+            "rwc-annotations-archive-main",


stefan-balke · 2026-02-26T07:59:14Z

mirdata/datasets/rwc_popular.py

-  year={2002},
-  series={ISMIR},
-  note={Cite this if using audio, beat or section annotations},
+BIBTEX = """@inproceedings{GotoHNO02_RWC_ISMIR,


This would be the paper describing RWC2.0

@Article{BalkeZAMTGM26_RWCRevisited_TISMIR,
author = {Stefan Balke and Johannes Zeitler and Vlora Arifi-M{\"u}ller and Brian McFee and Tomoyasu Nakano and Masataka Goto and Meinard M{\"u}ller},
title = {{RWC} Revisited: {T}owards a Community-Driven {MIR} Corpus},
journal = {Transactions of the International Society for Music Information Retrieval},
volume = {9},
issue = {1},
pages = {21--35},
year = {2026},
doi = {10.5334/tismir.326}
}

stefan-balke · 2026-02-26T08:16:05Z

mirdata/datasets/rwc_popular.py

    modern Japanese popular music typical of songs on the Japanese hit charts in
    the 1990s.

    For more details, please visit: https://staff.aist.go.jp/m.goto/RWC-MDB/rwc-mdb-p.html


This website will be offline soon. Better to link to the Zenodo repository.

magdalenafuentes · 2026-03-03T17:14:04Z

Hey @stefan-balke, thanks for taking a look, some thoughts:

Started to look into it. Maybe the main questions:

You have now replaced rwc_popular with RWC 2.0.
Doesn't this break the interfaces for all people using the original data or was this not used at all?

You're right that this is a breaking change. Although mirdata does have a way of supporting dataset versions (1.0, 2.0) within the same loader (i.e. downloading/loading from different sources/indexes), for this particular loader we strongly suspect that it is not being used, because the data wasn't available. So, at the risk of making a few users unhappy but in exchange for cleaner code, we decided to replace the loader with the current, open version of RWC. As a note, folks can still use the past loader if they pin mirdata to a version before this next release. If there are future requests to use the old, proprietary version of the data (again, I'm skeptical), we can bring it back later.

Does it make sense to call it rwc2_popular to make the distinction?

It will be clear it is version 2.0 in the index version and the docs.

Would it make sense to stick with the abbreviations we used in the paper? RWC-P, RWC-C, ...?

For the loader name, you mean? E.g. initialize('rwc-p')?

The plan is to have a loader for each subcollection? There will be a lot of shared code since the annotations should all be in the same format across datasets. How will this be handled? Start with RWC-P and import those common functions into RWC-C?

This one is an open question. One loader per collection is how this is done now, and common functions are pulled from rwc_classical. We could go this way, or adapt a single loader to have the subcollections as "versions" of the dataset, so you could still load each of them separately, and maybe have a "full" version with all collections. I don't think we have a loader like this, but it can be done. I'm fine either way, but I do think that the collections should be accessible independently for simplicity (e.g. for users who only want to work with one of them). This "unified" loader could be called simply rwc.py, and you would specify the version as initialize('rwc', version='rwc-p') or something like that. We can discuss further if you think this is better.

When creating the index, are you using a specific tag/version from the annotation repository or always the most recent one?

Not sure I understand this. The index points to a particular version of the dataset, and you could have more than one version.

I guess the index is not yet released? Any ways of circumventing this?

It should be available now.

I have a question regarding annotations: RWC-P has a lot of annotations available that we currently pull from here and there (e.g. voice activity), but I saw that you have fewer annotations included in RWC2. Do you plan to include them in the future?

yujin-kimmm · 2026-03-03T20:29:15Z

When creating the index, are you using a specific tag/version from the annotation repository or always the most recent one?

Not sure I understand this. The index points to a particular version of the dataset, and you could have more than one version.

@magdalenafuentes I think this is about pointing the index at a specific version/commit of the annotation repository. Right now the rwc-popular index points to the annotation repository's master branch, which can cause checksum mismatches whenever a new commit lands on that branch. I remember we had a similar issue with the ESC50 dataset in Soundata, which was addressed in PR 209 of soundata. I can update the loader to point to a specific commit hash. @stefan-balke is this correct?
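
To illustrate the pinning idea, here is a minimal sketch that builds a GitHub archive URL for one fixed commit and checks a downloaded payload against a recorded MD5. The helper names and the all-zero placeholder commit are hypothetical; the repository name rwc-music/rwc-annotations is taken from this thread. Note that the top-level folder inside a commit archive is normally named after the commit rather than the branch, so index paths that start with a branch-based folder name would presumably need to change as well.

```python
import hashlib

# Placeholder only: replace with a real 40-character commit SHA.
EXAMPLE_COMMIT = "0" * 40


def pinned_archive_url(owner, repo, commit):
    """Return the GitHub zip-archive URL for one specific commit.

    Unlike a branch archive, the content behind this URL does not change
    when new commits are pushed, so a stored checksum stays valid.
    """
    return f"https://github.com/{owner}/{repo}/archive/{commit}.zip"


def checksum_matches(payload, expected_md5):
    """Compare the MD5 of downloaded bytes with the checksum on record."""
    return hashlib.md5(payload).hexdigest() == expected_md5


url = pinned_archive_url("rwc-music", "rwc-annotations", EXAMPLE_COMMIT)
```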

stefan-balke · 2026-03-04T07:26:17Z

The plan is to have a loader for each subcollection? There will be a lot of shared code since the annotations should all be in the same format across datasets. How will this be handled? Start with RWC-P and import those common functions into RWC-C?

This one is an open question. One loader per collection is how this is done now, and common functions are pulled from rwc_classical. We could go this way, or adapt a single loader to have the subcollections as "versions" of the dataset, so you could still load each of them separately, and maybe have a "full" version with all collections. I don't think we have a loader like this, but it can be done. I'm fine either way, but I do think that the collections should be accessible independently for simplicity (e.g. for users who only want to work with one of them). This "unified" loader could be called simply rwc.py, and you would specify the version as initialize('rwc', version='rwc-p') or something like that. We can discuss further if you think this is better.

I think it would be better to have separate datasets rwc_p, rwc_c, rwc_g, rwc_r.
Those are all separate classes/modules but share a common backbone.
This could be done via inheritance or something similar.
When you start allowing string-based parameters, it usually makes things more flexible but also harder to test in the long run.
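
A rough sketch of what such a shared backbone could look like with plain inheritance. Every class name, the `collection` attribute, and the path layout are hypothetical and only meant to show the shape of the design, not how mirdata or this PR implements it.

```python
class RWCTrackBase:
    """Hypothetical shared backbone for all RWC subcollections."""

    # Concrete subclasses override this with their collection prefix.
    collection = None

    def __init__(self, track_id):
        if self.collection is None:
            raise NotImplementedError("use a concrete subcollection class")
        self.track_id = track_id

    def annotation_path(self, kind):
        """Build a relative annotation path (assumed layout) shared by all."""
        return f"{self.collection}/{kind}/{self.track_id}.csv"


class RWCPopularTrack(RWCTrackBase):
    collection = "rwc_p"


class RWCClassicalTrack(RWCTrackBase):
    collection = "rwc_c"


popular = RWCPopularTrack("001")
classical = RWCClassicalTrack("001")
```

Each subcollection stays a separate, individually testable class, while shared logic such as path building or beat loading lives in one place.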

When creating the index, are you using a specific tag/version from the annotation repository or always the most recent one?

Not sure I understand this. The index points to a particular version of the dataset, and you could have more than one version.

As Yujin said, I mean specific versions of the annotations. This may also be a new concept: the community might update the annotations, which would lead to a new release on the annotation side.

I guess the index is not yet released? Any ways of circumventing this?

It should be available now.

Will give it a shot!

I have a question regarding annotations: RWC-P has a lot of annotations available that we currently pull from here and there (e.g. voice activity), but I saw that you have fewer annotations included in RWC2. Do you plan to include them in the future?

All annotations I am aware of are here:
https://github.com/rwc-music/rwc-annotations-archive

If you have more, let me know. This is kind of the landing zone for the annotations. Some have really strange formats.
Once converted, they land in https://github.com/rwc-music/rwc-annotations, where we have data integration tests in place to test the annotations further. That is work in progress and ideally also taken over by the community at some point :-)

Thanks for all this!

yujin-kimmm added 5 commits February 23, 2026 21:52

updating rwc_popular to version 2.0

f0b21eb

fixing error in index

da185c7

removing preview for index

61f50f3

fixing test folder path

0883e80

fixing test track id name

42e1403

adding test covering else in metadata

90fae9d

yujin-kimmm changed the title ~~[WIP]RWC_popular update to 2.0~~ RWC_popular update to 2.0 Feb 24, 2026

adding bibitex for vocal-inst activity, section annotation

e8b5165

stefan-balke reviewed Feb 26, 2026

yujin-kimmm wants to merge 7 commits into mir-dataset-loaders:master from yujin-kimmm:rwc

3 participants
