SCITE-RNA

This repository contains the code and data for the paper "Phylogenetic tree inference from single-cell RNA sequencing data". The code and datasets provided here let users replicate the experiments and figures presented in the paper, and run SCITE-RNA on new data.

Overview

We implement a new method for reconstructing phylogenetic trees from single-cell RNA sequencing data. SCITE-RNA selects single-nucleotide variants (SNVs) and reconstructs a phylogenetic tree of the sequenced cells. We maximize the likelihood of the inferred tree by alternating between the cell lineage and mutation tree spaces until convergence is achieved in both. This repository provides:

  1. Scripts to execute SCITE-RNA. The model is implemented in both C++ (src_cpp/) and Python (src_python/). The C++ code is significantly faster and is recommended, especially for large numbers of cells and SNVs. Trees inferred by the two implementations should be comparable, but because the method is stochastic they are unlikely to be identical.
  2. The data used in the paper, in the data_summary/ and data/ directories, which contain all files necessary to reproduce the figures.
  3. Visualization scripts/notebooks to generate plots as presented in the paper.

Repository Structure

SCITE-RNA
├── data_summary/ # Summary data files, as the raw output is quite large
├── data/ # Input data files and results
│   ├── input_data/ # Alternative and reference read counts, among other files
│   ├── results/ # Inferred trees of the cancer datasets and consensus tree results
│   └── simulated_data/ # Simulated data and inference results
├── generate_results_cpp/ # C++ scripts to run SCITE-RNA on various datasets
├── generate_results_python_r/ # Python and R scripts for simulating data, inferring trees and visualization
├── phylinsic_scripts/ # Slightly adapted code to run PhylinSic https://github.com/U54Bioinformatics/PhylinSic_Project on our simulated data
├── src_cpp/ # C++ source files for SCITE-RNA
├── src_python/ # Python source files for SCITE-RNA
├── config/ # Model parameters
├── CMakeLists.txt # Primary configuration file for CMake
├── README.md # Project overview and setup instructions
├── environment.yml # Conda environment file with required Python packages
├── Snakefile # Snakemake file to run the PhylinSic pipeline on simulated data
└── Snakefile_real_data.smk # Snakemake file to run the PhylinSic pipeline on cancer data

Installation

Requirements

Python Libraries:

  • numpy
  • pandas
  • scikit-learn
  • matplotlib
  • seaborn
  • scipy
  • numba
  • jupyter
  • pyyaml
  • python-graphviz
  • pygraphviz
  • dendropy

Using conda:

conda env create -f environment.yml
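
After creating and activating the environment, a quick way to confirm that the listed packages are importable is a short check like the one below. This is a minimal sketch, not part of the repository: the helper name `missing_modules` is hypothetical, and the list uses the import names of the packages above (for example `sklearn` for scikit-learn and `yaml` for pyyaml). `jupyter` is left out because it is typically used as a command-line tool rather than imported.

```python
import importlib.util

# Import names for the packages listed above (assumed mapping from package
# name to module name: scikit-learn -> sklearn, pyyaml -> yaml,
# python-graphviz -> graphviz).
REQUIRED_MODULES = [
    "numpy", "pandas", "sklearn", "matplotlib", "seaborn", "scipy",
    "numba", "yaml", "graphviz", "pygraphviz", "dendropy",
]


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

If the script reports missing modules, re-creating the environment from environment.yml is usually the simplest fix.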

C++ Requirements

Cloning the Repository

To set up the SCITE-RNA project locally:

git clone https://github.com/cbg-ethz/SCITE-RNA.git
cd SCITE-RNA

Running the Model

Set Model Parameters

If desired, adjust the model parameters in config/config.yaml.

Simulated Data

To generate new simulated data with various parameter settings and different numbers of clones, run:

generate_results_python_r/comparison_data_generation.py

The script lets you set the number of cells, SNVs, and simulated clones, and can also be used for tree inference. Alternatively, for tree inference only (not data generation), run the much faster C++ version:

generate_results_cpp/comparison_num_clones.cpp

To compare different strategies for switching between tree spaces, run:

generate_results_cpp/comparison_tree_spaces_switching.cpp

By default, the model alternates between cell lineage tree and mutation tree optimization, starting from a random cell lineage tree.
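
The general idea of alternating between two search spaces until neither yields an improvement can be illustrated with a toy example. The sketch below is not SCITE-RNA's implementation: the function names, the toy score, and the integer-grid state are all hypothetical stand-ins, where one step function plays the role of "optimize in the cell lineage tree space" and the other "optimize in the mutation tree space".

```python
def alternate_until_converged(state, step_a, step_b, score, max_rounds=100, tol=1e-9):
    """Alternate two optimizers until a full round brings no improvement.

    Generic illustration only; `step_a` and `step_b` stand in for searches in
    two different spaces, and `score` for a likelihood to be maximized.
    """
    best = score(state)
    for _ in range(max_rounds):
        improved = False
        for step in (step_a, step_b):
            candidate = step(state)
            candidate_score = score(candidate)
            if candidate_score > best + tol:
                state, best, improved = candidate, candidate_score, True
        if not improved:
            break  # neither space improved the score: converged in both
    return state, best


# Toy problem (hypothetical): maximize a score over integer pairs (x, y).
def toy_score(state):
    x, y = state
    return -((x - 3) ** 2) - ((y + 1) ** 2)


def step_x(state):
    """Try moving x by one unit; keep the best of the three options."""
    x, y = state
    return max([(x - 1, y), (x, y), (x + 1, y)], key=toy_score)


def step_y(state):
    """Try moving y by one unit; keep the best of the three options."""
    x, y = state
    return max([(x, y - 1), (x, y), (x, y + 1)], key=toy_score)


final_state, final_score = alternate_until_converged((0, 0), step_x, step_y, toy_score)
print(final_state, final_score)
```

The loop stops only after a complete round in which neither step improves the score, which mirrors the "convergence in both spaces" criterion described in the overview.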

All simulated results will be saved in data/simulated_data/.

Cancer Data

To run SCITE-RNA on the cancer datasets:

Run either

   generate_results_python_r/real_data_processing.py

or (recommended) run the faster C++ version:

   generate_results_cpp/real_data_processing.cpp

Results will be saved in data/results/.

Run on New Data

To use SCITE-RNA on new data:

  1. Prepare reference and alternative allele count files in .csv format. Use the format provided in data/input_data/ as a reference, where columns represent cells and rows represent SNVs.

  2. Set the number of bootstrap samples (optional) and run SCITE-RNA tree inference with the following script:

    generate_results_cpp/real_data_processing.cpp
    
  3. The results are saved in data/results/.
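
Before running inference on new data (step 1 above), it can help to check that the reference and alternative count files describe the same SNVs and cells. The following is a minimal sketch using only the standard library. It assumes, hypothetically, that each file has a header row of cell IDs and a first column of SNV identifiers; the files in data/input_data/ are the authoritative format, so adjust the parsing if they differ. The function names and the example data are illustrative only.

```python
import csv
import io


def read_count_matrix(handle):
    """Read a count table with rows = SNVs and columns = cells.

    Assumption: the first row is a header of cell IDs and the first column
    holds an SNV identifier.
    """
    reader = csv.reader(handle)
    header = next(reader)
    cells = header[1:]
    snvs, counts = [], []
    for row in reader:
        if not row:
            continue  # skip blank lines
        snvs.append(row[0])
        counts.append([int(value) for value in row[1:]])
    return cells, snvs, counts


def check_consistent(ref, alt):
    """Raise ValueError if the two tables disagree; return (n_snvs, n_cells)."""
    ref_cells, ref_snvs, ref_counts = ref
    alt_cells, alt_snvs, alt_counts = alt
    if ref_cells != alt_cells:
        raise ValueError("cell IDs differ between reference and alternative files")
    if ref_snvs != alt_snvs:
        raise ValueError("SNV IDs differ between reference and alternative files")
    for row in ref_counts + alt_counts:
        if len(row) != len(ref_cells):
            raise ValueError("a row does not have one count per cell")
    return len(ref_snvs), len(ref_cells)


# Made-up example data, held in memory instead of real files.
ref_csv = "snv,cell1,cell2,cell3\nchr1:100,5,0,2\nchr2:200,1,7,0\n"
alt_csv = "snv,cell1,cell2,cell3\nchr1:100,0,3,1\nchr2:200,4,0,0\n"

ref_table = read_count_matrix(io.StringIO(ref_csv))
alt_table = read_count_matrix(io.StringIO(alt_csv))
shape = check_consistent(ref_table, alt_table)
print(shape)
```

With real files, pass handles opened with `open(path, newline="")` in place of the `io.StringIO` objects.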

Generating Figures

Data Preparation

To reproduce the figures quickly, you can use the files provided in data/ and data_summary/. Because the raw output files are large and numerous, we provide summary statistics instead. To reproduce the plots presented in the paper, follow the instructions below:

  • Figure 3: Comparison of tree optimization strategies

    generate_results_python_r/comparison_tree_spaces_switching.py
    

    If you want to rerun the full analysis, first run the following with the desired number of cells and SNVs:

      generate_results_python_r/comparison_data_generation.py
      generate_results_cpp/comparison_tree_spaces_switching.cpp
      generate_results_cpp/space_switching_results_postprocessing.cpp
    

  • Figure 4: Comparison to SClineager, PhylinSic, and DENDRO, including runtimes

      generate_results_python_r/comparison_num_clones.ipynb
    

    If you want to rerun the full analysis, first run the following with the desired number of clones, SNVs, and cells:

        generate_results_python_r/comparison_data_generation.py
        generate_results_cpp/comparison_num_clones.cpp
        generate_results_python_r/comparison_clones_sclineager_dendro_sciterna.R       
    
  • Figure 5/6/7: Cancer datasets

        generate_results_python_r/results_real_data.ipynb
    

    To rerun the full analysis, with and without bootstrapping, run:

        generate_results_cpp/real_data_processing.cpp
        generate_results_python_r/generate_consensus_parent_vector.py
    

Figures will be saved in data/results/figures/.