
RWC_popular update to 2.0 #699

Open
yujin-kimmm wants to merge 7 commits into mir-dataset-loaders:master from
yujin-kimmm:rwc

Conversation

@yujin-kimmm
Collaborator

@yujin-kimmm yujin-kimmm commented Feb 24, 2026

This PR updates the RWC-Popular dataset loader to version 2.0 (re-released in 2026).

  • The index for version 2.0 has been submitted to Zenodo and is pending approval.

Major Update

  • Audio is now open to the public and can be downloaded with the loader.
  • Annotation: Beat and chord annotations now use the new RWC 2.0 annotations. As a result, the load_beats function, which was previously shared with rwc_classical, now lives in rwc_popular. (Future loaders for other RWC collections can reuse load_beats from rwc_popular.)
  • Metadata: now uses the new metadata from 2.0.
  • F0 annotations added.

Dataset loaders checklist:

  • Create a script in scripts/, e.g. make_my_dataset_index.py, which generates an index file.
  • Run the script on the canonical version of the dataset and upload the index to Zenodo Audio Data Loaders community.
  • Create a sample version of the index with the necessary information for testing.
  • Create a module in mirdata, e.g. mirdata/my_dataset.py
  • Create tests for your loader in tests/datasets/, e.g. test_my_dataset.py
  • Add your module to docs/source/mirdata.rst and docs/source/table.rst
  • Run black, flake8 and mypy (see Running your tests locally).
  • Run tests/test_full_dataset.py on your dataset.
  • Check that codecov coverage does not decrease.

@codecov

codecov bot commented Feb 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.22%. Comparing base (b95bf38) to head (e8b5165).

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #699      +/-   ##
==========================================
+ Coverage   97.13%   97.22%   +0.08%     
==========================================
  Files          71       71              
  Lines        7825     7856      +31     
==========================================
+ Hits         7601     7638      +37     
+ Misses        224      218       -6     

@yujin-kimmm yujin-kimmm changed the title from "[WIP]RWC_popular update to 2.0" to "RWC_popular update to 2.0" Feb 24, 2026
@yujin-kimmm
Collaborator Author

yujin-kimmm commented Feb 24, 2026

@stefan-balke Hi Stefan! This is the PR for the new RWC 2.0 database loader. Right now it covers only rwc-popular, and we are still working on it. The PR is still a work in progress, but feel free to start taking a look if you want. Thanks!

(cc'd @magdalenafuentes )

@stefan-balke

Thanks for this, I will look into it!


@stefan-balke stefan-balke left a comment


Started to look into it. Maybe the main questions:

  • You have now replaced rwc_popular with RWC 2.0.
    Doesn't this break the interfaces for all people using the original data, or was this not used at all?
  • Does it make sense to call it rwc2_popular to make the distinction?
    Would it make sense to stick with the abbreviations we used in the paper? RWC-P, RWC-C, ...?
  • Is the plan to have a loader for each subcollection? There will be a lot of shared code, since the annotations should all be in the same format across datasets. How will this be handled? Start with RWC-P and import those common functions into RWC-C?
  • When creating the index, are you using a specific tag/version from the annotation repository or always the most recent one?

Testscript

import mirdata

rwc_popular = mirdata.initialize('rwc_popular')
rwc_popular.download()  # download the dataset
rwc_popular.validate()  # validate that all the expected files are there

example_track = rwc_popular.choice_track()  # choose a random example track
print(example_track)  # see the available data

Output:

╰─ python test_rwc.py                                                                                            (mirdata)
/Users/stefan/micromamba/envs/mirdata/lib/python3.12/site-packages/requests/__init__.py:113: RequestsDependencyWarning: urllib3 (2.6.3) or chardet (6.0.0.post1)/charset_normalizer (3.4.4) doesn't match a supported version!
  warnings.warn(
WARNING: Downloading ['audio', 'annotation', 'annotation_archive', 'index']. Index is being stored in /Users/stefan/dev/mirdata/mirdata/datasets/indexes, and the rest of files in /Users/stefan/mir_datasets/rwc_popular
WARNING: [audio] downloading RWC-P.zip.zip
3.79GB [08:08, 8.33MB/s]
WARNING: [annotation] downloading rwc-annotations-main.zip
5.59MB [00:01, 5.03MB/s]
WARNING: [annotation_archive] downloading rwc-annotations-archive-main.zip
24.2MB [00:05, 5.06MB/s]
WARNING: [index] downloading rwc_popular_index_2.0.json
0.00B [00:00, ?B/s]ERROR:
                            mirdata failed to download the dataset from https://zenodo.org/records/18751784/files/rwc_popular_index_2.0.json?download=1!
                            Please try again in a few minutes.
                            If this error persists, please raise an issue at
                            https://github.com/mir-dataset-loaders/mirdata,
                            and tag it with 'broken-link'.

0.00B [00:00, ?B/s]
Traceback (most recent call last):
  File "/Users/stefan/dev/mirdata/test_rwc.py", line 4, in <module>
    rwc_popular.download()  # download the dataset
    ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stefan/dev/mirdata/mirdata/core.py", line 454, in download
    download_utils.downloader(
  File "/Users/stefan/dev/mirdata/mirdata/download_utils.py", line 156, in downloader
    download_path = download_from_remote(
                    ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stefan/dev/mirdata/mirdata/download_utils.py", line 279, in download_from_remote
    raise exc
  File "/Users/stefan/dev/mirdata/mirdata/download_utils.py", line 264, in download_from_remote
    urllib.request.urlretrieve(
  File "/Users/stefan/micromamba/envs/mirdata/lib/python3.12/urllib/request.py", line 240, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
                            ^^^^^^^^^^^^^^^^^^
  File "/Users/stefan/micromamba/envs/mirdata/lib/python3.12/urllib/request.py", line 215, in urlopen
    return opener.open(url, data, timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stefan/micromamba/envs/mirdata/lib/python3.12/urllib/request.py", line 521, in open
    response = meth(req, response)
               ^^^^^^^^^^^^^^^^^^^
  File "/Users/stefan/micromamba/envs/mirdata/lib/python3.12/urllib/request.py", line 630, in http_response
    response = self.parent.error(
               ^^^^^^^^^^^^^^^^^^
  File "/Users/stefan/micromamba/envs/mirdata/lib/python3.12/urllib/request.py", line 559, in error
    return self._call_chain(*args)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stefan/micromamba/envs/mirdata/lib/python3.12/urllib/request.py", line 492, in _call_chain
    result = func(*args)
             ^^^^^^^^^^^
  File "/Users/stefan/micromamba/envs/mirdata/lib/python3.12/urllib/request.py", line 639, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: NOT FOUND

I guess the index is not yet released? Is there any way to work around this?


sections_rel = os.path.join(
"rwc-annotations-archive-main",
"AIST_RWC-MDB-P-2001_CHORUS",


I would not add annotations from the archive. Let us first switch to the preprocessed ones.


voca_inst_rel = os.path.join(
"rwc-annotations-archive-main",
"AIST_RWC-MDB-P-2001_VOCA_INST",


same here with the archive

voca_inst_idx, voca_inst_md5 = None, None

melody_rel = os.path.join(
"rwc-annotations-archive-main",


and here

year={2002},
series={ISMIR},
note={Cite this if using audio, beat or section annotations},
BIBTEX = """@inproceedings{GotoHNO02_RWC_ISMIR,


This would be the paper describing RWC2.0

@Article{BalkeZAMTGM26_RWCRevisited_TISMIR,
author = {Stefan Balke and Johannes Zeitler and Vlora Arifi-M{\"u}ller and Brian McFee and Tomoyasu Nakano and Masataka Goto and Meinard M{\"u}ller},
title = {{RWC} Revisited: {T}owards a Community-Driven {MIR} Corpus},
journal = {Transactions of the International Society for Music Information Retrieval},
volume = {9},
issue = {1},
pages = {21--35},
year = {2026},
doi = {10.5334/tismir.326}
}

modern Japanese popular music typical of songs on the Japanese hit charts in
the 1990s.

For more details, please visit: https://staff.aist.go.jp/m.goto/RWC-MDB/rwc-mdb-p.html


This website will be offline soon. Better to link to the Zenodo repository.

@magdalenafuentes
Collaborator

Hey @stefan-balke, thanks for taking a look, some thoughts:

Started to look into it. Maybe the main questions:

  • You have now replaced rwc_popular with RWC 2.0.
    Doesn't this break the interfaces for all people using the original data, or was this not used at all?

You're right that this is a breaking change. Although mirdata does have a way of supporting dataset versions (1.0, 2.0) within the same loader (i.e. downloading/loading from different sources/indexes), we strongly suspect that this particular loader is not being used, because the data wasn't available. So, at the risk of making a few users unhappy but in favor of cleaner code, we decided to replace the loader with the current, open version of RWC. Note that folks can still use the old loader by pinning mirdata to a version before this next release. If there are future requests to use the old, proprietary version of the data (again, I'm skeptical), we can bring it back later.

  • Does it make sense to call it rwc2_popular to make the distinction?

It will be clear it is version 2.0 in the index version and the docs.

Would it make sense to stick with the abbreviations we used in the paper? RWC-P, RWC-C, ...?

For the loader name, you mean? E.g. initialize('rwc-p')?

  • Is the plan to have a loader for each subcollection? There will be a lot of shared code, since the annotations should all be in the same format across datasets. How will this be handled? Start with RWC-P and import those common functions into RWC-C?

This one is an open question. One loader per collection is how this is done now, and common functions are pulled from rwc classical. We could go this way, or adapt a single loader to have the subcollections as "versions" of the dataset, so you could still load each of them separately, and maybe have a "full" version with all collections. I don't think we have a loader like this, but it can be done. I'm fine either way, but I do think that the collections should be accessible independently for simplicity (e.g. for users that only want to work with one of them). This "unified" loader could be called simply rwc.py, and you would specify the version as initialize('rwc', version='rwc-p') or something like that. We can discuss further if you think this is better.

  • When creating the index, are you using a specific tag/version from the annotation repository or always the most recent one?

Not sure I understand this. The index points to a particular version of the dataset, and you could have more than one version.

I guess the index is not yet released? Any ways of circumventing this?

It should be available now.

I have a question regarding annotations: RWC-P has a lot of annotations available that we currently pull from here and there (e.g. voice activity), but I saw that you have fewer annotations included in RWC2. Do you plan to include them in the future?

@yujin-kimmm
Collaborator Author

  • When creating the index, are you using a specific tag/version from the annotation repository or always the most recent one?

Not sure I understand this. The index points to a particular version of the dataset, and you could have more than one version.

@magdalenafuentes I think this is about pointing the index at a specific version/commit of the annotation repository. Right now the rwc-popular index points to the annotation repository's master branch, which could cause checksum mismatches whenever a new commit lands on that branch. I remember we had a similar issue with the ESC50 dataset in Soundata, which was addressed in Soundata PR 209. I can update the index to point to a specific commit hash. @stefan-balke is this correct?
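
For illustration, a minimal sketch of the difference between a branch reference and a commit-pinned reference, assuming the annotations are fetched as a GitHub zip archive. The helper names and the commit hash are hypothetical placeholders, not values from the actual index script; the branch name "main" is only inferred from the archive name rwc-annotations-main.zip in the download log above.

```python
import hashlib

# GitHub serves zip archives for branch names, tags, and commit hashes alike.
GITHUB_ARCHIVE = "https://github.com/{repo}/archive/{ref}.zip"


def archive_url(repo: str, ref: str) -> str:
    """Build a zip-archive URL for a branch name, tag, or commit hash."""
    return GITHUB_ARCHIVE.format(repo=repo, ref=ref)


def md5_of_bytes(data: bytes) -> str:
    """Return the MD5 checksum of some bytes as a hex string."""
    return hashlib.md5(data).hexdigest()


REPO = "rwc-music/rwc-annotations"

# Branch reference: the archive content changes with every new commit,
# so a checksum stored in the index can silently go stale.
branch_url = archive_url(REPO, "main")  # assumption: branch name

# Commit reference: the archive content is fixed, so the checksum stays valid.
PINNED_COMMIT = "0123456789abcdef0123456789abcdef01234567"  # hypothetical hash
pinned_url = archive_url(REPO, PINNED_COMMIT)

# Any change in content yields a different checksum.
assert md5_of_bytes(b"annotations v1") != md5_of_bytes(b"annotations v2")
```

With a pinned reference, updating the annotations becomes an explicit step: change the commit hash, regenerate the checksums, and release a new index.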

@stefan-balke

  • Is the plan to have a loader for each subcollection? There will be a lot of shared code, since the annotations should all be in the same format across datasets. How will this be handled? Start with RWC-P and import those common functions into RWC-C?

This one is an open question. One loader per collection is how this is done now, and common functions are pulled from rwc classical. We could go this way, or adapt a single loader to have the subcollections as "versions" of the dataset, so you could still load each of them separately, and maybe have a "full" version with all collections. I don't think we have a loader like this, but it can be done. I'm fine either way, but I do think that the collections should be accessible independently for simplicity (e.g. for users that only want to work with one of them). This "unified" loader could be called simply rwc.py, and you would specify the version as initialize('rwc', version='rwc-p') or something like that. We can discuss further if you think this is better.

I think it would be better to have separate datasets rwc_p, rwc_c, rwc_g, rwc_r.
Those are all separate classes/modules but share a common backbone.
This could be done via inheritance or something similar.
Once you start allowing string-based parameters, things usually become more flexible but also harder to test in the long run.
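
A rough sketch of what I mean by a common backbone, under stated assumptions: the class names, the folder names, and the one-CSV-per-track layout are all hypothetical and only serve to show the structure, not the actual mirdata or annotation-repository layout.

```python
import posixpath


class RWCBase:
    """Hypothetical backbone holding logic common to all RWC subcollections."""

    name = ""            # module-style dataset name, set by each subclass
    annotation_dir = ""  # illustrative folder name, set by each subclass

    def annotation_path(self, track_id: str, kind: str) -> str:
        # Shared path logic; assumes one CSV per track and annotation type.
        return posixpath.join(self.annotation_dir, kind, f"{track_id}.csv")


class RWCPopular(RWCBase):
    name = "rwc_p"
    annotation_dir = "rwc-p"


class RWCClassical(RWCBase):
    name = "rwc_c"
    annotation_dir = "rwc-c"


# Each subcollection stays a separate, individually testable class,
# while the shared behavior is written once in the base class.
LOADERS = {cls.name: cls for cls in (RWCPopular, RWCClassical)}

popular = LOADERS["rwc_p"]()
beats_path = popular.annotation_path("track01", "beats")
```

Each subclass only declares what differs, so a test for RWC-P never has to go through a string parameter that selects the collection.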

  • When creating the index, are you using a specific tag/version from the annotation repository or always the most recent one?

Not sure I understand this. The index points to a particular version of the dataset, and you could have more than one version.

As Yujin said, I mean specific versions of the annotations. This may also be a new concept here: the community might update the annotations, which would lead to a new release on the annotation side.

I guess the index is not yet released? Any ways of circumventing this?

It should be available now.

Will give it a shot!

I have a question regarding annotations: RWC-P has a lot of annotations available that we currently pull from here and there (e.g. voice activity), but I saw that you have fewer annotations included in RWC2. Do you plan to include them in the future?

All annotations I am aware of are here:
https://github.com/rwc-music/rwc-annotations-archive

If you have more, let me know. This is kind of the landing zone for the annotations. Some have really strange formats.
Once converted, they land in https://github.com/rwc-music/rwc-annotations, where we have data integration tests in place to test the annotations further. That is work in progress and ideally also taken over by the community at some point :-)

Thanks for all this!
