From 080791f7167631262e70897c3a23c904e89cf4c2 Mon Sep 17 00:00:00 2001 From: Youhan Lee Date: Tue, 2 Jun 2026 10:01:12 -0700 Subject: [PATCH 1/4] docs: add O2' post-processing guide and update contributing/acknowledgements - Add docs/post_processing.md describing how to strip and rebuild O2' atoms with Arena before all-atom LDDT evaluation - Replace CONTRIBUTING.md with standard NVIDIA fork/PR/DCO workflow - Acknowledge NVIDIA and Das Lab contributors in README Signed-off-by: Youhan Lee --- CONTRIBUTING.md | 89 +++++++++++-- README.md | 9 ++ docs/post_processing.md | 272 ++++++++++++++++++++++++++++++++++++++++ 3 files changed, 360 insertions(+), 10 deletions(-) create mode 100644 docs/post_processing.md diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 00ea36b..faace19 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -1,13 +1,82 @@ -# Contributing +# Contribution Rules -This repository is currently published as a read-only project. We do not accept external code or documentation contributions at this time. +## Issue Tracking -## Do -- Do open an Issue for reproducible bugs using the Bug Report template. -- Do include environment details, steps to reproduce, and minimal examples. -- Do cite this work in academic use. +* All requests for enhancements, bug fixes, or features must begin with the creation of an [issue](https://github.com/NVIDIA-BioNeMo/RNAPro/issues). + * The issue request will be reviewed by the NVIDIA team and approved prior to pull request integration and code review. -## Don't -- Don’t open Pull Requests — they will be auto-closed. -- Don’t file feature requests or support requests outside the Issue template scope. -- Don’t share proprietary data or confidential information in Issues. + +## Coding Guidelines + +- Avoid introducing unnecessary complexity into existing code so that maintainability and readability are preserved. + +- Try to keep pull requests (PRs) as concise as possible: + - Avoid committing commented-out code. 
+ - Wherever possible, each PR should address a single concern. If there are several otherwise-unrelated things that should be fixed to reach a desired endpoint, our recommendation is to open several PRs and indicate the dependencies in the description. The more complex the changes are in a single PR, the more time it will take to review those changes. + +- Make sure that you can contribute your work to open source (no license and/or patent conflict is introduced by your code). You will need to [`sign`](#signing-your-work) your commit. + +- Thanks in advance for your patience as we review your contributions; we do appreciate them! + + +## Pull Requests +Developer workflow for code contributions is as follows: + +1. Developers must first [fork](https://help.github.com/en/articles/fork-a-repo) the [upstream](https://github.com/NVIDIA-BioNeMo/RNAPro) RNAPro repository. + +2. Git clone the forked repository and push changes to the personal fork. + +```bash +git clone https://github.com/YOUR_USERNAME/YOUR_FORK.git RNAPro +# Checkout the targeted branch and commit changes +# Push the commits to a branch on the fork (remote). +git push -u origin : +``` + +3. Once the code changes are staged on the fork and ready for review, a [Pull Request](https://help.github.com/en/articles/about-pull-requests) (PR) can be [requested](https://help.github.com/en/articles/creating-a-pull-request) to merge the changes from a branch of the fork into a selected branch of upstream. + * Exercise caution when selecting the source and target branches for the PR. + * Creation of a PR kicks off the code review process. + * While under review, mark your PRs as work-in-progress by prefixing the PR title with [WIP]. + + +## Signing Your Work +We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license. 
+Any contribution which contains commits that are not Signed-Off will not be accepted. + +To sign off on a commit you simply use the `--signoff` (or `-s`) option when committing your changes: +```bash +$ git commit -s -m "Add cool feature." +``` + +This will append the following to your commit message: +``` +Signed-off-by: Your Name +``` + +#### Full text of the DCO + +``` +Developer Certificate of Origin +Version 1.1 + +Copyright (C) 2004, 2006 The Linux Foundation and its contributors. +1 Letterman Drive +Suite D4700 +San Francisco, CA, 94129 + +Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. +``` + +``` +Developer's Certificate of Origin 1.1 + +By making a contribution to this project, I certify that: + +(a) The contribution was created in whole or in part by me and I have the right to submit it under the open source license indicated in the file; or + +(b) The contribution is based upon previous work that, to the best of my knowledge, is covered under an appropriate open source license and I have the right under that license to submit that work with modifications, whether created in whole or in part by me, under the same open source license (unless I am permitted to submit under a different license), as indicated in the file; or + +(c) The contribution was provided directly to me by some other person who certified (a), (b) or (c) and I have not modified it. + +(d) I understand and agree that this project and the contribution are public and that a record of the contribution (including all personal information I submit with it, including my sign-off) is maintained indefinitely and may be redistributed consistent with this project or the open source license(s) involved. 
+``` diff --git a/README.md b/README.md index 0a41ef3..2c0dad9 100644 --- a/README.md +++ b/README.md @@ -249,6 +249,15 @@ RNAPro uses a frozen pre-trained RNA foundation model RibonanzaNet2 as an encode We thank Stanford Das Lab, HHMI, the co-hosts and winners of the Stanford RNA 3D Folding Kaggle competition for their collaboration in this research. +### Contributors + +The first release of RNAPro was developed through a collaboration between NVIDIA and the Das Lab: + +- **NVIDIA:** Youhan Lee, Christian Munley, Theo Viel, Emine Kucukbenli +- **Das Lab (Stanford):** Rhiju Das, Chaitanya K. Joshi (Cambridge during development; now Stanford Das Lab) + +We also thank everyone at NVIDIA and the Das Lab who contributed to the development and release. + ## License diff --git a/docs/post_processing.md b/docs/post_processing.md new file mode 100644 index 0000000..d80b988 --- /dev/null +++ b/docs/post_processing.md @@ -0,0 +1,272 @@ +### Post-processing: Fixing O2' geometry + +RNAPro predictions can have incorrect O2′ placement. Run the step below before all-atom LDDT or any analysis that needs correct ribose atoms. + +The script removes O2′ and rebuilds it with [Arena](https://github.com/pylelab/Arena). Backbone and base atoms are not moved. + +--- + +### Prerequisites + +1. **Python dependencies** (in addition to RNAPro's environment): + + ```bash + pip install biopython + ``` + +2. **Arena** — required for O2' reconstruction. Install and compile: + + ```bash + git clone https://github.com/pylelab/Arena.git + cd Arena + make Arena + ``` + + The script searches for the `Arena` binary in common locations (`~/src/Arena/Arena`, `./Arena/Arena`, etc.). Ensure the compiled binary is on that path, or place/symlink it where the script can find it. 
+ +--- + +### Script: `fix_o2prime.py` + +Save the following as `fix_o2prime.py` (or copy from your working directory): + +```python +#!/usr/bin/env python3 +""" +fix_o2prime.py — Strip O2' atoms from RNA structure files and rebuild them +using Arena. + +Bad O2' geometry (e.g. from early Protenix checkpoints) causes all-atom LDDT +to report 0.0; rebuilding from the remaining heavy atoms via Arena restores +meaningful LDDT while leaving backbone / base atoms untouched. + +Usage: + python3 fix_o2prime.py file1.cif file2.pdb ... + python3 fix_o2prime.py *.cif --outdir /tmp/fixed/ + +Output files are named {stem}.fixO2prime.{ext} (same format as input). + +Note: Arena requires RNA-only input. Use --strip-nonrna if the structure +contains protein or ligand chains. Those chains are removed before Arena and +are NOT restored in the output (the fixed file is suitable for LDDT scoring +but not for visualising the full complex). +""" + +import os +import sys +import tempfile +import argparse +import subprocess +from pathlib import Path + +from Bio.PDB import MMCIFParser, PDBParser, PDBIO, MMCIFIO + + +# Standard + common modified RNA residue names +RNA_RESIDUES = { + 'A', 'C', 'G', 'U', + 'PSU', '5MU', 'H2U', 'M2G', 'YG', 'OMG', 'OMC', 'OMU', + '2MG', '7MG', 'A2M', 'G7M', 'MA6', 'I', 'INO', +} + + +def look_for_arena(): + candidates = [ + os.path.expanduser("~/src/"), + os.path.expanduser("~/"), + "./", + "../", + "../../", + "../../../", + ] + + for candidate in candidates: + path = Path(candidate) / "Arena" / "Arena" + if os.path.exists(path): + return path + + cmds = [ + "git clone https://github.com/pylelab/Arena.git", + "cd Arena", + "make Arena", + ] + sys.exit("Arena not found, you can install it using:\n" + "\n".join(cmds)) + + +def strip_o2prime(structure): + removed = 0 + for model in structure: + for chain in model: + for residue in chain: + if "O2'" in residue: + residue.detach_child("O2'") + removed += 1 + return removed + + +def strip_nonrna(structure): + for 
model in structure: + empty_chains = [] + for chain in model: + non_rna = [r.id for r in chain if r.resname.strip() not in RNA_RESIDUES] + for rid in non_rna: + chain.detach_child(rid) + if not list(chain.get_residues()): + empty_chains.append(chain.id) + for cid in empty_chains: + model.detach_child(cid) + + +def load_structure(filepath: Path): + ext = filepath.suffix.lower() + if ext in ('.cif', '.mmcif'): + return MMCIFParser(QUIET=True).get_structure(filepath.stem, str(filepath)), True + return PDBParser(QUIET=True).get_structure(filepath.stem, str(filepath)), False + + +def save_structure(structure, outpath: Path, is_cif: bool): + if is_cif: + io = MMCIFIO() + else: + io = PDBIO() + io.set_structure(structure) + io.save(str(outpath)) + + +def run_arena(arena_path: Path, input_pdb: Path, output_pdb: Path, option: int = 5): + result = subprocess.run( + [arena_path, str(input_pdb), str(output_pdb), str(option)], + capture_output=True, text=True, + ) + if result.returncode != 0: + raise RuntimeError(f"Arena exited {result.returncode}: {result.stderr[:500]}") + if not output_pdb.exists(): + raise RuntimeError("Arena ran but produced no output file") + + +def process_file( + arena_path: Path, + filepath: Path, + outdir: Path | None, + do_strip_nonrna: bool, + arena_option: int, + verbose: bool = False, +) -> Path: + structure, is_cif = load_structure(filepath) + + if do_strip_nonrna: + strip_nonrna(structure) + + n_removed = strip_o2prime(structure) + if verbose: + print(f" {filepath.name}: stripped {n_removed} O2' atoms") + + with tempfile.TemporaryDirectory() as tmpdir: + tmp_in = Path(tmpdir) / "stripped.pdb" + tmp_out = Path(tmpdir) / "arena_out.pdb" + + # Write RNA-only PDB for Arena + pdbio = PDBIO() + pdbio.set_structure(structure) + pdbio.save(str(tmp_in)) + + run_arena(arena_path=arena_path, input_pdb=tmp_in, output_pdb=tmp_out, option=arena_option) + + fixed = PDBParser(QUIET=True).get_structure("fixed", str(tmp_out)) + + # Count how many O2' Arena added 
back + n_added = sum(1 for m in fixed for ch in m for r in ch if "O2'" in r) + if verbose: + print(f" {filepath.name}: Arena rebuilt {n_added} O2' atoms") + + out_name = filepath.stem + ".fixO2prime" + filepath.suffix + out_path = (outdir / out_name) if outdir else filepath.parent / out_name + save_structure(fixed, out_path, is_cif) + if verbose: + print(f" → {out_path}") + return out_path + + +def main(): + ap = argparse.ArgumentParser( + description=__doc__, + formatter_class=argparse.RawDescriptionHelpFormatter, + ) + ap.add_argument("files", nargs="+", help="CIF or PDB input files") + ap.add_argument("--outdir", default=None, + help="Output directory (default: same directory as each input file)") + ap.add_argument("--strip-nonrna", action="store_true", + help="Remove non-RNA chains before Arena (needed for RNA+protein targets)") + ap.add_argument("--arena-option", type=int, default=5, choices=[1, 2, 3, 4, 5, 7], + help="Arena reconstruction option (default 5)") + ap.add_argument("--verbose", action="store_true", help="Print verbose output") + args = ap.parse_args() + + arena_path = look_for_arena() + if args.verbose: + print(f"Found Arena in {arena_path}") + + outdir = Path(args.outdir) if args.outdir else None + if outdir: + outdir.mkdir(parents=True, exist_ok=True) + + if args.verbose: + print(f"Processing {len(args.files)} files in {outdir}") + + errors = 0 + for f in args.files: + try: + process_file(arena_path, Path(f).resolve(), outdir, args.strip_nonrna, args.arena_option, args.verbose) + except Exception as e: + print(f" ERROR {f}: {e}", file=sys.stderr) + errors += 1 + + if errors: + sys.exit(f"{errors} file(s) failed") + + +if __name__ == "__main__": + main() +``` + +--- + +### Usage + +After RNAPro inference, run the script on the output structure(s): + +```bash +# Single CIF from inference +python3 fix_o2prime.py path/to/prediction.cif --verbose + +# Multiple files, custom output directory +python3 fix_o2prime.py predictions/*.cif --outdir ./fixed/ 
--verbose + +# RNA + protein complex (strip non-RNA for Arena; output is RNA-only) +python3 fix_o2prime.py complex.pdb --strip-nonrna --verbose +``` + +**Output naming:** `{original_stem}.fixO2prime.{ext}` (same format as input, e.g. `.cif` or `.pdb`). + +**Recommended workflow:** + +1. Run RNAPro inference → obtain `.cif` / `.pdb` predictions. +2. Run `fix_o2prime.py` on those files (**required** if you evaluate all-atom LDDT or care about ribose geometry). +3. Use the `.fixO2prime.*` files for LDDT and other all-atom metrics. + +--- + +### Options + +| Flag | Description | +|------|-------------| +| `--outdir DIR` | Write fixed files to `DIR` instead of next to each input | +| `--strip-nonrna` | Remove protein/ligand chains before Arena (output will not contain them) | +| `--arena-option N` | Arena mode (default `5`; see [Arena README](https://github.com/pylelab/Arena)) | +| `--verbose` | Print strip/rebuild counts per file | + +--- + +### References + +- **Arena:** Zion R Perry, Anna Marie Pyle, Chengxin Zhang (2023). *Arena: rapid and accurate reconstruction of full atomic RNA structures from coarse-grained models.* Journal of Molecular Biology. [GitHub](https://github.com/pylelab/Arena) · [Zenodo](https://doi.org/10.5281/zenodo.7566670) From 4cffff6b8585e6d6e0dd48198f6b923dae4af2e5 Mon Sep 17 00:00:00 2001 From: Youhan Lee Date: Tue, 2 Jun 2026 10:05:56 -0700 Subject: [PATCH 2/4] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 2c0dad9..02484e0 100644 --- a/README.md +++ b/README.md @@ -253,7 +253,7 @@ We thank Stanford Das Lab, HHMI, the co-hosts and winners of the Stanford RNA 3D The first release of RNAPro was developed through a collaboration between NVIDIA and the Das Lab: -- **NVIDIA:** Youhan Lee, Christian Munley, Theo Viel, Emine Kucukbenli +- **NVIDIA:** Youhan Lee, Christian Munley, Theo Viel, Emine Küçükbenli - **Das Lab (Stanford):** Rhiju Das, Chaitanya K. 
Joshi (Cambridge during development; now Stanford Das Lab) We also thank everyone at NVIDIA and the Das Lab who contributed to the development and release. From 70bda1e64216fd88c3f0a98b82e2a82fdb19b8e6 Mon Sep 17 00:00:00 2001 From: Youhan Lee Date: Tue, 2 Jun 2026 11:59:10 -0700 Subject: [PATCH 3/4] docs: source Arena from Zenodo archive instead of GitHub The pylelab/Arena GitHub repo has no license file. The Zenodo archive (doi.org/10.5281/zenodo.7566670) ships the same source under CC BY 4.0, so point users there for a properly licensed download. Signed-off-by: Youhan Lee --- docs/post_processing.md | 18 ++++++++++++------ 1 file changed, 12 insertions(+), 6 deletions(-) diff --git a/docs/post_processing.md b/docs/post_processing.md index d80b988..a87c185 100644 --- a/docs/post_processing.md +++ b/docs/post_processing.md @@ -2,7 +2,7 @@ RNAPro predictions can have incorrect O2′ placement. Run the step below before all-atom LDDT or any analysis that needs correct ribose atoms. -The script removes O2′ and rebuilds it with [Arena](https://github.com/pylelab/Arena). Backbone and base atoms are not moved. +The script removes O2′ and rebuilds it with [Arena](https://doi.org/10.5281/zenodo.7566670). Backbone and base atoms are not moved. --- @@ -14,10 +14,14 @@ The script removes O2′ and rebuilds it with [Arena](https://github.com/pylelab pip install biopython ``` -2. **Arena** — required for O2' reconstruction. Install and compile: +2. **Arena** — required for O2' reconstruction. 
Download the source archive from + the [Zenodo record](https://doi.org/10.5281/zenodo.7566670) (released under the + Creative Commons Attribution 4.0 International license), extract it into a + folder named `Arena`, and compile: ```bash - git clone https://github.com/pylelab/Arena.git + # Download the archive from https://doi.org/10.5281/zenodo.7566670, then: + unzip Arena-*.zip -d Arena # adjust to the downloaded archive name cd Arena make Arena ``` @@ -86,7 +90,9 @@ def look_for_arena(): return path cmds = [ - "git clone https://github.com/pylelab/Arena.git", + "# Download the Arena source archive (CC BY 4.0) from:", + "# https://doi.org/10.5281/zenodo.7566670", + "unzip Arena-*.zip -d Arena # adjust to the downloaded archive name", "cd Arena", "make Arena", ] @@ -262,11 +268,11 @@ python3 fix_o2prime.py complex.pdb --strip-nonrna --verbose |------|-------------| | `--outdir DIR` | Write fixed files to `DIR` instead of next to each input | | `--strip-nonrna` | Remove protein/ligand chains before Arena (output will not contain them) | -| `--arena-option N` | Arena mode (default `5`; see [Arena README](https://github.com/pylelab/Arena)) | +| `--arena-option N` | Arena mode (default `5`; see the [Arena documentation](https://doi.org/10.5281/zenodo.7566670)) | | `--verbose` | Print strip/rebuild counts per file | --- ### References -- **Arena:** Zion R Perry, Anna Marie Pyle, Chengxin Zhang (2023). *Arena: rapid and accurate reconstruction of full atomic RNA structures from coarse-grained models.* Journal of Molecular Biology. [GitHub](https://github.com/pylelab/Arena) · [Zenodo](https://doi.org/10.5281/zenodo.7566670) +- **Arena:** Zion R Perry, Anna Marie Pyle, Chengxin Zhang (2023). *Arena: rapid and accurate reconstruction of full atomic RNA structures from coarse-grained models.* Journal of Molecular Biology. Obtain Arena from the [Zenodo archive](https://doi.org/10.5281/zenodo.7566670) (Creative Commons Attribution 4.0 International). 
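Reviewer note on the output naming in `docs/post_processing.md`: the guide states that fixed files are written as `{stem}.fixO2prime.{ext}`, either next to the input or under `--outdir`. The sketch below mirrors the expression used in `process_file` so the rule can be checked in isolation. The helper name `fixed_output_path` is hypothetical and is not part of `fix_o2prime.py`; it only restates the naming logic and does not call Arena.

```python
from pathlib import Path
from typing import Optional


def fixed_output_path(input_path: Path, outdir: Optional[Path] = None) -> Path:
    """Return the output path implied by the guide's naming rule.

    Hypothetical helper for illustration; it mirrors the expression in
    process_file: stem + ".fixO2prime" + suffix.
    """
    out_name = input_path.stem + ".fixO2prime" + input_path.suffix
    # With an output directory the file goes there; otherwise it is placed
    # next to the input file.
    return (outdir / out_name) if outdir else input_path.parent / out_name


if __name__ == "__main__":
    # Example inputs are made up for illustration.
    print(fixed_output_path(Path("predictions/target1.cif")))
    print(fixed_output_path(Path("complex.pdb"), Path("fixed")))
```

Only the final suffix is treated as the extension, so an input such as `model.v2.cif` would map to `model.v2.fixO2prime.cif` under this rule.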
From 56ee188e65b2b5840709ecb1a9ffd385b36c0b48 Mon Sep 17 00:00:00 2001 From: Youhan Lee Date: Thu, 4 Jun 2026 21:43:03 -0700 Subject: [PATCH 4/4] docs: clarify README credits and citation --- README.md | 26 +++++++++++++++++++++++--- 1 file changed, 23 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index 02484e0..090ce12 100644 --- a/README.md +++ b/README.md @@ -245,18 +245,38 @@ RNAPro uses a frozen pre-trained RNA foundation model RibonanzaNet2 as an encode ``` + ## Acknowledgements We thank Stanford Das Lab, HHMI, the co-hosts and winners of the Stanford RNA 3D Folding Kaggle competition for their collaboration in this research. -### Contributors +For the full list of authors and contributors to the research, please refer to the preprint. + +### Code Contributors -The first release of RNAPro was developed through a collaboration between NVIDIA and the Das Lab: +We specifically thank the following people for direct code contributions to RNAPro: - **NVIDIA:** Youhan Lee, Christian Munley, Theo Viel, Emine Küçükbenli - **Das Lab (Stanford):** Rhiju Das, Chaitanya K. Joshi (Cambridge during development; now Stanford Das Lab) -We also thank everyone at NVIDIA and the Das Lab who contributed to the development and release. + +## Citation + +If you use RNAPro, please cite the preprint: + +```bibtex +@article{Lee2025.12.30.696949, + author = {Lee, Youhan and He, Shujun and Oda, Toshiyuki and Rao, G. John and Kim, Yehyun and Kim, Raehyun and Kim, Hyunjin and Heng, Cher Keng and Kowerko, Danny and Li, Haowei and Nguyen, Hoa and Sampathkumar, Arunodhayan and Enrique G{\'o}mez, Ra{\'u}l and Chen, Meng and Yoshizawa, Atsushi and Kuraishi, Shun and Ogawa, Kenji and Zou, Shuxian and Paullier, Alejo and Zhao, Bingkang and Chen, Huey-Long and Hsu, Tsu-An and Hirano, Tatsuya and Chiu, Wah and Gezelle, Jeanine G. 
and Haack, Daniel and Hong, Yibao and Jadhav, Shekhar and Koirala, Deepak and Kretsch, Rachael C and Lewicka, Anna and Li, Shanshan and Marcia, Marco and Piccirilli, Joseph and Rudolfs, Boris and Srivastava, Yoshita and Steckelberg, Anna-Lena and Su, Zhaoming and Toor, Navtej and Wang, Liu and Yang, Zi and Zhang, Kaiming and Zou, Jian and Baker, David and Chen, Shi-Jie and Demkin, Maggie and Favor, Andrew and Hummer, Alissa M and Joshi, Chaitanya K. and Kryshtafovych, Andriy and Kucukbenli, Emine and Miao, Zhichao and Moult, John and Munley, Christian and Reade, Walter and Viel, Theo and Westhof, Eric and Zhang, Sicheng and Das, Rhiju}, + title = {Template-based RNA structure prediction advanced through a blind code competition}, + elocation-id = {2025.12.30.696949}, + year = {2025}, + doi = {10.64898/2025.12.30.696949}, + publisher = {Cold Spring Harbor Laboratory}, + URL = {https://www.biorxiv.org/content/early/2025/12/30/2025.12.30.696949}, + eprint = {https://www.biorxiv.org/content/early/2025/12/30/2025.12.30.696949.full.pdf}, + journal = {bioRxiv} +} +``` ## License