PRIME Reproducibility Package

This package contains the scripts, data, and validated results required to regenerate the primary figures, tables, and statistical analyses presented in the manuscript "Characterizing Physicochemical Selection in Protein Evolution with Property-Informed Models (PRIME)".

Overview

The reproducibility package is organized into three modular sub-directories corresponding to the major sections of the manuscript:

  1. mammalian_analysis/: Proteome-wide characterization of 18,944 mammalian genes.
  2. simulation_study/: Statistical performance assessment (Power and False Positive Rate) using chimeric simulations.
  3. benchmark_tables/: Detailed technical comparisons and structural mapping across 24 benchmark datasets.

System Dependencies

To run the scripts in this package, ensure your environment meets the following requirements:

  • Python 3.8+
    • Libraries: pandas, numpy, matplotlib, scipy, biopython, logomaker (sqlite3 is also used; it ships with the Python standard library and needs no separate install)
  • R 4.0+
    • Libraries: ggplot2, dplyr, tidyr, grid (bundled with base R), cmcrameri (for scientific colormaps)
  • HyPhy 2.5.94+
    • Required only if you intend to regenerate raw model fits from original sequence alignments.
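Before running anything, it can help to confirm that the Python side of these requirements is in place. The following is a minimal sketch, not part of the package: the helper names missing_modules and python_is_supported are hypothetical, and the only package-specific knowledge it relies on is the library list above (note that biopython is imported under the name Bio). It does not check R or HyPhy.

```python
import importlib.util
import sys

# Import names for the Python libraries listed above.
# The biopython distribution is imported as "Bio"; sqlite3 is standard library.
REQUIRED_MODULES = [
    "pandas", "numpy", "matplotlib", "scipy", "sqlite3", "Bio", "logomaker",
]


def missing_modules(names):
    """Return the names that cannot be located on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_supported(minimum=(3, 8)):
    """True if the running interpreter is at least the given (major, minor)."""
    return sys.version_info[:2] >= minimum


if __name__ == "__main__":
    if not python_is_supported():
        print("Python 3.8 or newer is required")
    absent = missing_modules(REQUIRED_MODULES)
    if absent:
        print("Missing Python modules:", ", ".join(absent))
    else:
        print("All listed Python modules were found")
```

The check only locates modules without importing them, so it is fast but will not catch a broken installation.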

Quick Start (Regeneration Guide)

Run each module's scripts from that module's scripts/ directory, as shown below. Results are typically saved to the top-level verification/ directory for easy comparison.

1. Mammalian Analysis (FIG 1, 2, 3)

cd mammalian_analysis/scripts/
Rscript plot_genome_wide.R      # Generates FIG 1 (genome_wide_summary.pdf)
python3 analyze_property_patterns.py  # Generates statistics for FIG 2
Rscript plot_case_studies.R     # Generates FIG 3 (case_studies.pdf)

2. Simulation Study (FIG 6, 7)

cd simulation_study/scripts/
python3 plot_power_heatmap.py   # Generates FIG 6 (power_heatmap_6panel.png)
python3 plot_power_gene_scenarios.py # Generates FIG 7 (power_gene_scenarios.png)

3. Benchmark Comparisons (TAB 4-7, FIG 5, 8)

cd benchmark_tables/scripts/
python3 generate_table6.py      # Generates TAB 5 & 6
python3 plot_benchmark_heatmap.py # Generates FIG 5 (benchmark_heatmap.pdf)
python3 plot_h3n2_comparison_logos.py # Generates FIG 8 (h3n2_site_logos.pdf)
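
For convenience, the commands in the three sections above could be driven from one script run at the package root. This is only a sketch under stated assumptions: the STEPS list simply restates the commands shown above, the function name run_all is hypothetical, and no such driver is known to ship with the package. It defaults to a dry run that prints each command instead of executing it.

```python
import subprocess
from pathlib import Path

# (working directory, command) pairs copied from the Quick Start sections.
STEPS = [
    ("mammalian_analysis/scripts", ["Rscript", "plot_genome_wide.R"]),
    ("mammalian_analysis/scripts", ["python3", "analyze_property_patterns.py"]),
    ("mammalian_analysis/scripts", ["Rscript", "plot_case_studies.R"]),
    ("simulation_study/scripts", ["python3", "plot_power_heatmap.py"]),
    ("simulation_study/scripts", ["python3", "plot_power_gene_scenarios.py"]),
    ("benchmark_tables/scripts", ["python3", "generate_table6.py"]),
    ("benchmark_tables/scripts", ["python3", "plot_benchmark_heatmap.py"]),
    ("benchmark_tables/scripts", ["python3", "plot_h3n2_comparison_logos.py"]),
]


def run_all(root=".", dry_run=True):
    """Run (or just print) every step from its own scripts/ directory."""
    plan = []
    for workdir, command in STEPS:
        cwd = Path(root) / workdir
        if dry_run:
            print(f"[dry run] cd {cwd} && {' '.join(command)}")
        else:
            # check=True stops at the first failing script.
            subprocess.run(command, cwd=cwd, check=True)
        plan.append((str(cwd), command))
    return plan


if __name__ == "__main__":
    run_all(dry_run=True)
```

Calling run_all(dry_run=False) would execute the steps in order; each script still runs with its own scripts/ directory as the working directory, matching the cd commands above.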

Data and Database Schemas

All primary data is stored in SQLite databases or compressed CSV files. Detailed schema documentation for each module is provided in its respective README.md.

  • mammalian_analysis/data/genome_wide_summary.csv: Aggregated fits for 18,944 genes.
  • simulation_study/data/simulation_results.db: Site-level outcomes for 23,000 selective scenarios.
  • benchmark_tables/data/model_fits.db: Comparative statistics for the 24 benchmark alignments.
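
The table and column names inside the two SQLite databases are documented in the per-module README.md files and are not repeated here. As a hedged sketch, the schema can also be discovered directly with Python's built-in sqlite3 module. The helper name list_tables is hypothetical; only the database path comes from the list above.

```python
import os
import sqlite3


def list_tables(conn):
    """Return {table name: [column names]} for every table in the database."""
    tables = {}
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (name,) in rows:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{name}")').fetchall()
        tables[name] = [column[1] for column in columns]
    return tables


if __name__ == "__main__":
    path = "simulation_study/data/simulation_results.db"
    if os.path.exists(path):
        # Open read-only so the reference data cannot be modified by accident.
        conn = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
        for table, columns in list_tables(conn).items():
            print(table, columns)
        conn.close()
    else:
        print(f"{path} not found; run this from the package root")
```

The same function works for benchmark_tables/data/model_fits.db by changing the path.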

Verification and Validated Outputs

The verification/ directory contains a set of pre-generated outputs produced by the scripts in this package. These serve as a reference to ensure that your local environment is correctly configured and produces results identical to those presented in the manuscript.

Verified Assets:

  • genome_wide_summary.pdf (Matches Figure 1)
  • power_heatmap_6panel.png (Matches Figure 6)
  • structural_anatomy_boxplot.pdf (Matches Figure 4)
  • h3n2_site_logos.pdf (Matches Figure 8)
  • table5.tex (Matches Table 5)
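
One way to compare regenerated files against these references is by checksum. The sketch below assumes you have kept a copy of the reference outputs in one directory and your regenerated files in another; both directory arguments and the function names are hypothetical. A byte-level comparison is strict: PDF and PNG writers can embed timestamps or library versions, so a "differs" result for a figure does not by itself mean the plotted values changed, and visual inspection may still be needed.

```python
import hashlib
from pathlib import Path

# File names taken from the Verified Assets list above.
VERIFIED_ASSETS = [
    "genome_wide_summary.pdf",
    "power_heatmap_6panel.png",
    "structural_anatomy_boxplot.pdf",
    "h3n2_site_logos.pdf",
    "table5.tex",
]


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_outputs(reference_dir, local_dir, names=VERIFIED_ASSETS):
    """Map each name to 'identical', 'differs', or 'missing'."""
    report = {}
    for name in names:
        reference = Path(reference_dir) / name
        local = Path(local_dir) / name
        if not reference.is_file() or not local.is_file():
            report[name] = "missing"
        elif sha256_of(reference) == sha256_of(local):
            report[name] = "identical"
        else:
            report[name] = "differs"
    return report
```

Text outputs such as table5.tex are the most likely to match byte for byte and make a good first check.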

Code and Data Availability

The core PRIME inference engine is part of the HyPhy software package. The latest source code and installation instructions are available at https://github.com/veg/hyphy.

The full repository including raw sequence alignments and larger database files can be found at https://github.com/veg/PRIME.
