This repository contains the code and data for the paper "Phylogenetic tree inference from single-cell RNA sequencing data". The code and datasets provided here let users replicate the experiments and figures presented in the paper and run SCITE-RNA on new data.
We implement a new method for reconstructing phylogenetic trees from single-cell RNA sequencing data. SCITE-RNA selects single-nucleotide variants (SNVs) and reconstructs a phylogenetic tree of the sequenced cells. We maximize the likelihood of the inferred tree by alternating between the cell lineage and mutation tree spaces until convergence is achieved in both. This repository provides:
- Scripts to execute SCITE-RNA. The model is split into C++ files (`src_cpp`) and Python files (`src_python`). For large numbers of cells and SNVs in particular, we recommend the C++ code, as it is significantly faster. The inferred trees should be comparable between the C++ and Python implementations, but because the method is stochastic, the two will likely not produce exactly the same tree.
- Data used in the paper are available in the `data_summary/` and `data/` directories, which contain all necessary files to reproduce the figures.
- Visualization scripts/notebooks to generate plots as presented in the paper.
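The idea of alternating between two search spaces until neither yields a further improvement can be illustrated with a toy example. The sketch below is not SCITE-RNA's implementation: the function names, the integer "state", and the quadratic score are all hypothetical stand-ins, and the two coordinates merely play the role that the cell lineage and mutation tree spaces play in the real method.

```python
def alternate_until_converged(state, optimizers, score, max_rounds=100):
    """Cycle through the optimizers until a full round brings no improvement."""
    best = score(state)
    for _ in range(max_rounds):
        improved = False
        for optimize in optimizers:
            candidate = optimize(state)
            candidate_score = score(candidate)
            if candidate_score > best:
                state, best = candidate, candidate_score
                improved = True
        if not improved:  # converged in every space
            break
    return state, best


def score(state):
    # Hypothetical stand-in for a tree likelihood; maximal at (3, -1).
    x, y = state
    return -(x - 3) ** 2 - (y + 1) ** 2


def step_x(state):
    # Local move in the first "space": try x - 1, x, x + 1 and keep the best.
    x, y = state
    return max([(x - 1, y), (x, y), (x + 1, y)], key=score)


def step_y(state):
    # Local move in the second "space": try y - 1, y, y + 1 and keep the best.
    x, y = state
    return max([(x, y - 1), (x, y), (x, y + 1)], key=score)


if __name__ == "__main__":
    print(alternate_until_converged((0, 0), [step_x, step_y], score))
```

The loop stops only after a complete pass in which no optimizer improved the score, which mirrors the "convergence in both spaces" criterion described above.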
SCITE-RNA
├── data_summary/ # Summary data files, as the raw output is quite large
├── data/ # Input data files and results
│   ├── input_data # Alternative and reference read counts among other files.
│   ├── results # Inferred trees of the cancer datasets and consensus tree results
│   └── simulated_data # Simulated data and inference results
├── generate_results_cpp/ # C++ scripts to run SCITE-RNA on various datasets
├── generate_results_python_r/ # Python and R scripts for simulating data, inferring trees and visualization
├── phylinsic_scripts/ # Slightly adapted code to run PhylinSic https://github.com/U54Bioinformatics/PhylinSic_Project on our simulated data
├── src_cpp/ # C++ source files for SCITE-RNA
├── src_python/ # Python source files for SCITE-RNA
├── config/ # Model parameters
├── CMakeLists.txt # Primary configuration file for CMake
├── README.md # Project overview and setup instructions
├── environment.yml # Conda environment file with required Python packages
├── Snakefile # Snakemake file to run the PhylinSic pipeline on simulated data
└── Snakefile_real_data.smk # Snakemake file to run the PhylinSic pipeline on cancer data
- numpy
- pandas
- scikit-learn
- matplotlib
- seaborn
- scipy
- numba
- jupyter
- pyyaml
- python-graphviz
- pygraphviz
- dendropy
Using conda:
conda env create -f environment.yml
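After creating and activating the environment, a quick check that the listed Python packages can be imported may save debugging time later. The snippet below is a minimal sketch and not part of the repository; the mapping uses the usual import names for each package (for example `sklearn` for scikit-learn and `yaml` for pyyaml), and `jupyter` is left out because it is mainly used as a command-line tool.

```python
from importlib.util import find_spec

# Package name as listed above -> module name used in `import` statements.
REQUIRED_MODULES = {
    "numpy": "numpy",
    "pandas": "pandas",
    "scikit-learn": "sklearn",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "scipy": "scipy",
    "numba": "numba",
    "pyyaml": "yaml",
    "python-graphviz": "graphviz",
    "pygraphviz": "pygraphviz",
    "dendropy": "dendropy",
}


def missing_packages(modules=REQUIRED_MODULES):
    """Return the sorted names of packages whose module cannot be found."""
    return sorted(pkg for pkg, mod in modules.items() if find_spec(mod) is None)


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All listed packages were found.")
```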
- CMake (>= 3.27)
- LBFGS https://github.com/yixuan/LBFGSpp
- Eigen https://gitlab.com/libeigen/eigen
To set up the SCITE-RNA project locally:
git clone https://github.com/cbg-ethz/SCITE-RNA.git
cd SCITE-RNA
If desired, adjust the model parameters in `config/config.yaml`.
To generate new simulated data with various parameter settings and different numbers of clones execute:
generate_results_python_r/comparison_data_generation.py
The script lets you set the number of cells, the number of SNVs, and the number of simulated clones. The same file can also be used for tree inference. Alternatively, run the C++ version:
generate_results_cpp/comparison_num_clones.cpp
The C++ version performs tree inference only (not data generation) and is much faster than the Python script.
To compare different strategies for switching between tree spaces during optimization, run:
generate_results_cpp/comparison_tree_spaces_switching.cpp
By default, the model alternates between cell lineage and mutation tree optimization, starting from a random cell lineage tree.
All simulated results will be saved in data/simulated_data/.
To run SCITE-RNA on the cancer datasets:
Run either
generate_results_python_r/real_data_processing.py
or (recommended) run the faster C++ version:
generate_results_cpp/real_data_processing.cpp
Results will be saved in data/results/.
To use SCITE-RNA on new data:
- Prepare reference and alternative allele count files in `.csv` format. Use the format provided in `data/input_data/` as a reference, where columns represent cells and rows represent SNVs.
- Set the number of bootstrap samples (optional) and run SCITE-RNA tree inference with the following script:
  generate_results_cpp/real_data_processing.cpp
- The results are saved in `data/results`.
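The count-matrix layout described above (rows are SNVs, columns are cells, one file for reference counts and one for alternative counts) can be sketched as follows. This is an illustration under assumptions, not the repository's loader: the file names `ref.csv` and `alt.csv` are hypothetical, and the sketch writes bare integer matrices without header rows or index columns. Check the files in `data/input_data/` for the exact layout SCITE-RNA expects before preparing real data.

```python
import csv
import tempfile
from pathlib import Path


def write_counts(path, counts):
    """Write a count matrix (rows = SNVs, columns = cells) as plain CSV."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(counts)


def read_counts(path):
    """Read a count matrix back as a list of integer rows."""
    with open(path, newline="") as handle:
        return [[int(value) for value in row] for row in csv.reader(handle)]


def check_counts(ref, alt):
    """Return (n_snvs, n_cells), or raise ValueError if the matrices disagree."""
    if len(ref) != len(alt):
        raise ValueError("ref and alt have different numbers of SNVs (rows)")
    n_cells = len(ref[0]) if ref else 0
    for ref_row, alt_row in zip(ref, alt):
        if len(ref_row) != n_cells or len(alt_row) != n_cells:
            raise ValueError("rows have inconsistent numbers of cells (columns)")
        if any(count < 0 for count in ref_row + alt_row):
            raise ValueError("read counts must be non-negative")
    return len(ref), n_cells


def demo():
    # Toy data: 3 SNVs (rows) x 4 cells (columns).
    ref = [[10, 0, 7, 3], [5, 5, 0, 2], [0, 8, 1, 9]]
    alt = [[0, 4, 1, 0], [2, 0, 6, 1], [3, 0, 0, 2]]
    with tempfile.TemporaryDirectory() as tmp:
        ref_path = Path(tmp) / "ref.csv"  # hypothetical file name
        alt_path = Path(tmp) / "alt.csv"  # hypothetical file name
        write_counts(ref_path, ref)
        write_counts(alt_path, alt)
        return check_counts(read_counts(ref_path), read_counts(alt_path))


if __name__ == "__main__":
    print(demo())
```

Validating that both matrices have the same shape before running inference catches the most common preparation mistake, a transposed or truncated file.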
To reproduce the figures quickly, you can use the files provided in `data` and `data_summary`.
Because the raw data files were large and numerous, we provide summary statistics instead. To reproduce the plots presented in the paper, follow the instructions below:
- Figure 3: Comparison of tree optimization strategies
  generate_results_python_r/comparison_tree_spaces_switching.py
  If you want to rerun the full analysis, first run the following with the desired number of cells and SNVs:
  generate_results_python_r/comparison_data_generation.py
  generate_results_cpp/comparison_tree_spaces_switching.cpp
  generate_results_cpp/space_switching_results_postprocessing.cpp
- Figure 4: Comparison to SClineager, PhylinSic and DENDRO including runtimes
  generate_results_python_r/comparison_num_clones.ipynb
  If you want to rerun the full analysis, first run the following with the desired number of clones, SNVs, and cells:
  generate_results_python_r/comparison_data_generation.py
  generate_results_cpp/comparison_num_clones.cpp
  generate_results_python_r/comparison_clones_sclineager_dendro_sciterna.R
- Figure 5/6/7: Cancer datasets
  generate_results_python_r/results_real_data.ipynb
  To rerun the full analysis, run the following with and without bootstrapping:
  generate_results_cpp/real_data_processing.cpp
  generate_results_python_r/generate_consensus_parent_vector.py
Figures will be saved in data/results/figures/.
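The script name `generate_consensus_parent_vector.py` refers to parent vectors, a common compact encoding of rooted trees in which entry `i` holds the index of the parent of node `i`. The sketch below illustrates that general data structure only. It is not taken from the repository, and both the function names and the use of `-1` to mark the root are assumptions; the repository's own files may use a different convention.

```python
def children_from_parent_vector(parents, root_marker=-1):
    """Return (root, children) where children maps each node to its child list."""
    children = {node: [] for node in range(len(parents))}
    root = None
    for node, parent in enumerate(parents):
        if parent == root_marker:  # assumed root convention
            root = node
        else:
            children[parent].append(node)
    return root, children


def depth(parents, node, root_marker=-1):
    """Count the edges between a node and the root by following parent links."""
    steps = 0
    while parents[node] != root_marker:
        node = parents[node]
        steps += 1
    return steps


if __name__ == "__main__":
    # Node 0 is the root; nodes 1 and 2 are its children; 3 and 4 hang off node 1.
    parents = [-1, 0, 0, 1, 1]
    root, children = children_from_parent_vector(parents)
    print(root, children, depth(parents, 4))
```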