This repository contains the source code, plotting notebooks, and training data for the paper 'Mapping the combinatorial coding between olfactory receptors and perception with deep learning' (v2 in preparation).
A Zenodo release containing model weights, pre-computed ESM embeddings, and OR activation logits (for both HORDE and M2OR receptor sets) will accompany the v2 release. See data/datasets/ for the small CSV/FASTA artifacts checked into the repo; large .pt/.pth artifacts are kept out of git via .gitignore and should be downloaded from Zenodo and placed in data/datasets/.
For an example of running inference with the MolOR model over the HORDE set of receptor sequences (including pseudogene controls), refer to scripts/generate_OR_predictions_pseudogenes.py.
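Before running inference it can help to confirm that a receptor FASTA file parses as expected. The following is a minimal, standard-library sketch of a FASTA reader, not the loader used by the repository's scripts; the record names and sequences in the example are made up for illustration.

```python
from typing import Dict, Iterable, List, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record ID (first token after '>') to its concatenated sequence."""
    records: Dict[str, str] = {}
    current_id: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Flush the previous record before starting a new one.
            if current_id is not None:
                records[current_id] = "".join(chunks)
            header = line[1:].split()
            current_id = header[0] if header else ""
            chunks = []
        elif current_id is not None:
            chunks.append(line)
    if current_id is not None:
        records[current_id] = "".join(chunks)
    return records


# Hypothetical records; real files live under data/datasets/.
example = """>OR_A hypothetical functional receptor
MRENNQSS
TLEFILL
>OR_B_P hypothetical pseudogene
MKKENQSF
""".splitlines()

receptors = parse_fasta(example)
for name, seq in receptors.items():
    print(name, len(seq))
```

To read a file from disk, pass an open file handle instead of the example list, e.g. `parse_fasta(open(path))`, where `path` points at one of the FASTA files in data/datasets/.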
conda env create -f olfaction.yml
conda activate olfaction

- classification_ESM.py: trains odorant-receptor models (MolOR) with fused per-residue ESM embeddings and bidirectional cross-attention. The --model_encoder flag selects between GCN and MPNN molecular encoders; configs live under data/configures/M2OR_Pairs/ (e.g. MolOR_canonical.json, MolOR_MPNN_canonical.json). Requires ESM embeddings pre-computed on disk; first run will cache them.
- classification_OR_feat_ESM.py: trains odorant-percept models using predicted MolOR activations as input features (alongside the molecular GCN). Requires OR activation logits pre-computed on disk, or will run inference first to generate them for the given dataset.
- classification.py: basic GCN/MPNN classification baselines without ESM features.
- run_OR_percept_ablations_HORDE.sh: main paper ablation, which scales the number of HORDE OR activations used as input features for odorant-percept prediction. After downloading data from Zenodo into data/datasets/, run bash scripts/run_OR_percept_ablations_HORDE.sh.
- run_OR_percept_ablations.sh: equivalent ablation against the M2OR receptor set (1237 ORs).
- run_OR_percept_ablations_all_DBs.sh: ablation over the union of HORDE and M2OR ORs.
- generate_OR_predictions_pseudogenes.py: generates MolOR activation logits for HORDE receptors (functional and pseudogene splits).
- prepare_enzpred_data.py: produces the M2OR train/val/test splits used for the Goldman et al. (FFN+ESM) and PerceiverCPI baselines.
- blast_uniprot.py, get_gene_uniprot_IDs_blast.py, merge_blast_annotations.py, m2or_ed_distance_matrix.py, get_HORDE_metadata.ipynb: receptor annotation and pre-processing utilities.
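The ablation scripts vary how many OR activations are fed to the percept model. As a rough illustration of that idea only, the sketch below randomly keeps k receptor columns from an odorant-by-receptor logit matrix. The function name, the uniform random selection, and the toy matrix are all assumptions; the shell scripts may choose receptors differently.

```python
import random
from typing import List, Sequence, Tuple


def subsample_or_features(
    logits: Sequence[Sequence[float]], k: int, seed: int = 0
) -> Tuple[List[List[float]], List[int]]:
    """Keep k randomly chosen receptor columns (hypothetical helper).

    `logits` has one row per odorant and one column per receptor.
    Returns the reduced matrix and the sorted column indices that were kept.
    """
    n_receptors = len(logits[0])
    if not 0 < k <= n_receptors:
        raise ValueError(f"k must be in 1..{n_receptors}, got {k}")
    rng = random.Random(seed)  # seeded so each ablation level is reproducible
    keep = sorted(rng.sample(range(n_receptors), k))
    reduced = [[row[j] for j in keep] for row in logits]
    return reduced, keep


# Toy matrix: 3 odorants x 5 receptors (values are made up).
toy_logits = [
    [0.1, 0.2, 0.3, 0.4, 0.5],
    [1.1, 1.2, 1.3, 1.4, 1.5],
    [2.1, 2.2, 2.3, 2.4, 2.5],
]

for k in (1, 3, 5):
    reduced, keep = subsample_or_features(toy_logits, k, seed=42)
    print(k, keep, len(reduced), len(reduced[0]))
```

Fixing the seed per ablation level keeps the selected receptor subset stable across reruns, which makes comparisons between levels easier to interpret.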
- fig2_plots.ipynb, figures_OR_percept.ipynb, percept_OR_plots.ipynb: main-text figures.
- fig4_stat_tests.ipynb: statistical analyses including Benjamini–Hochberg-corrected ablation comparisons and the Jonckheere–Terpstra trend test reported in Table S1.
- nutty_receptor_analysis.ipynb, filtered_nutty_receptor_analysis.ipynb, OR_subfamily_analysis.ipynb, cross_task_stats.ipynb, percept_receptor_null_distribution.ipynb: per-percept and per-receptor analyses.
- test_OR_logits_shuffle.ipynb: shuffled-OR-logits control referenced in the revisions.
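For readers unfamiliar with the two procedures named above, here is a small pure-Python sketch of Benjamini–Hochberg adjusted p-values and of the Jonckheere–Terpstra test statistic. It is for orientation only: the notebook may rely on library implementations, the statistic here is not converted to a p-value, and the input numbers are invented.

```python
from typing import List, Sequence


def benjamini_hochberg(pvals: Sequence[float]) -> List[float]:
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def jonckheere_terpstra_statistic(groups: Sequence[Sequence[float]]) -> float:
    """Sum of pairwise Mann-Whitney counts over ordered group pairs.

    `groups` must be listed in the hypothesised order. For each pair of
    groups (i < j), count the pairs (x in i, y in j) with x < y; ties
    contribute 0.5.
    """
    stat = 0.0
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            for x in groups[i]:
                for y in groups[j]:
                    if x < y:
                        stat += 1.0
                    elif x == y:
                        stat += 0.5
    return stat


# Invented p-values and invented scores at three ablation levels.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
print(jonckheere_terpstra_statistic([[1, 2], [3, 4], [5, 6]]))
```

A large statistic relative to its null expectation suggests values increase along the group ordering; turning it into a p-value requires the null distribution (normal approximation or permutation), which is omitted here.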
Notebooks and utilities for preparing the M2OR pairwise dataset and computing receptor-level statistics (sequence-similarity matrix, BLAST-based annotations).
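As an illustration of what a receptor sequence-distance matrix looks like, the sketch below computes pairwise Levenshtein (edit) distances in pure Python. It assumes that edit distance is the quantity of interest, which the "ed" in m2or_ed_distance_matrix.py hints at but does not confirm; the repository's utilities may use a different or normalised measure, and the sequences shown are toy strings.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,        # deletion
                    current[j - 1] + 1,     # insertion
                    previous[j - 1] + cost  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def distance_matrix(seqs: Sequence[str]) -> List[List[int]]:
    """Symmetric matrix of pairwise edit distances."""
    n = len(seqs)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = levenshtein(seqs[i], seqs[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Toy sequences, not real receptors.
toy_seqs = ["MKT", "MKS", "MK"]
for row in distance_matrix(toy_seqs):
    print(row)
```

This is quadratic in both the number of sequences and their length, so for a receptor set of around a thousand sequences an optimised library would be a more practical choice.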
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.