DIT-HAP Streamlit Visualization

Interactive web application for visualizing and analyzing data from the DIT-HAP pipeline, a bioinformatics workflow for diploid transposon mutagenesis and haploid fitness analysis in Schizosaccharomyces pombe.

Overview

DIT-HAP Streamlit serves as the visualization and analysis component for the open-source DIT-HAP pipeline. This application provides researchers with an intuitive interface to explore transposon insertion sequencing data and analyze gene essentiality in fission yeast through:

  • Interactive gene visualization: Explore gene depletion curves and insertion patterns from DIT-HAP pipeline outputs
  • Feature space analysis: Visualize genes in multidimensional feature space using pipeline-generated statistics
  • Enrichment analysis: Gene Ontology, FYPO, and disease ontology enrichment for pipeline results
  • Real-time data exploration: Dynamic filtering and selection of genes from pipeline datasets

Connection to DIT-HAP Pipeline

This Streamlit application is designed to work directly with data generated by the DIT-HAP Snakemake pipeline. The pipeline processes raw sequencing data through multiple analysis stages to produce:

  • Insertion-level statistics: Base mean, log fold changes, fitting results for individual transposon insertions
  • Gene-level statistics: Aggregated statistics and depletion curves for each gene
  • Quality metrics: Imputation statistics and transformed weights for data quality assessment
  • Annotation data: Integration with PomBase gene annotations and ontologies

The visualization application reads these structured outputs and provides interactive tools for biological interpretation and hypothesis generation.
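
For readers new to the terminology, a log2 fold change compares the abundance of an insertion at a later timepoint with its initial abundance, and a depletion curve is that value tracked over time. The sketch below is a generic illustration with made-up read counts and an assumed pseudocount of 1. It is not the pipeline's estimator, which this README does not describe, and `log2_fold_change` is a hypothetical helper name.

```python
import math


def log2_fold_change(count_initial, count_later, pseudocount=1.0):
    """Generic log2 fold change between two read counts.

    The pseudocount (an assumption of this sketch) keeps the ratio
    finite when either count is zero.
    """
    return math.log2((count_later + pseudocount) / (count_initial + pseudocount))


# Made-up read counts for one insertion across four timepoints.
counts = [15, 7, 3, 1]

# Each point is measured relative to the first timepoint.
curve = [log2_fold_change(counts[0], c) for c in counts]
print(curve)  # [0.0, -1.0, -2.0, -3.0]
```

A steadily decreasing curve like this one is what depletion of an insertion from the population looks like on a log2 scale.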

Features

Pipeline Data Visualization

  • Gene-level depletion curve plotting from pipeline output files
  • Insertion-level statistical analysis and quality control visualizations
  • Combined visualizations using Altair charts for comprehensive analysis
  • Interactive gene structure visualization with genomic context

Feature Space Analysis

  • Multi-dimensional gene feature visualization from pipeline statistics
  • Interactive scatter plots with gene selection and clustering
  • Comparative analysis across different experimental conditions
  • Pattern identification in pipeline-generated feature matrices

Enrichment Analysis

  • Gene Ontology (GO) enrichment for pipeline-identified gene sets
  • FYPO (Fission Yeast Phenotype Ontology) analysis for phenotypic interpretations
  • MONDO disease ontology associations for translational insights
  • Statistical significance testing with multiple hypothesis correction
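
To make the last bullet concrete, here is a minimal standard-library sketch of the two steps that term enrichment usually involves: a one-sided hypergeometric test per term, followed by Benjamini-Hochberg correction across terms. This illustrates the general technique only. The application relies on goatools for GO enrichment, and this README does not say which correction method it applies, so the function names and the choice of Benjamini-Hochberg here are assumptions.

```python
from math import comb


def hypergeom_pvalue(k, population, annotated, sample):
    """One-sided enrichment p-value P(X >= k).

    population: number of background genes
    annotated:  background genes carrying the term
    sample:     size of the study gene set
    k:          study genes carrying the term
    """
    total = comb(population, sample)
    upper = min(annotated, sample)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(k, upper + 1)
    )
    return tail / total


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

For example, `hypergeom_pvalue(5, 10, 5, 5)` asks how likely it is that all five study genes carry a term that half of a ten-gene background carries.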

Pipeline Integration

  • Direct reading of standard DIT-HAP pipeline output formats
  • Support for multiple pipeline configurations (default, long timecourse, haploid)
  • Automatic validation of pipeline output file structure
  • Error handling for missing or incomplete pipeline data

Installation

Prerequisites

  • Python 3.11 or higher (the pinned numpy ≥2.3.0 and scipy ≥1.16.0 do not support older interpreters)
  • pip package manager
  • Data generated by the DIT-HAP pipeline

Setup

  1. Clone the repository

    git clone https://github.com/DIT-HAP/DIT_HAP_streamlit.git
    cd DIT_HAP_streamlit
  2. Create virtual environment

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies

    pip install -r requirements.txt
  4. Set up DIT-HAP pipeline data

    # Ensure you have pipeline output data organized in the expected structure
    # See "Data Requirements" section below for details

Usage

Running the Application

# Basic run
streamlit run DIT_HAP_app.py

# Run on specific port
streamlit run DIT_HAP_app.py --server.port 8501

# Development mode with auto-reload
streamlit run DIT_HAP_app.py --server.runOnSave true

Accessing the Application

Open your web browser and navigate to http://localhost:8501 (or the configured port).

Application Pages

  1. Curve Plot (depletion_data.py)

    • Select genes from your DIT-HAP pipeline results
    • View pipeline-generated depletion curves and statistical analyses
    • Explore insertion-level and gene-level statistics from the pipeline
  2. Feature Space (feature_space.py)

    • Visualize genes in feature space using pipeline-generated metrics
    • Interactive scatter plot exploration of pipeline results
    • Gene clustering and pattern identification
  3. Enrichment Analysis (enrichment_analysis.py)

    • Perform GO, FYPO, and MONDO enrichment on pipeline-identified gene sets
    • Statistical significance testing for pipeline results
    • Results visualization and export for downstream analysis

Data Requirements

The application expects pipeline output data organized in the following structure:

data/
├── raw/                           # DIT-HAP pipeline outputs
│   ├── HD_DIT_HAP/               # Standard pipeline results
│   │   ├── insertion_level/      # Insertion-level analysis results
│   │   │   ├── annotations.tsv.gz    # Insertion annotations from pipeline
│   │   │   ├── baseMean.tsv          # Base mean statistics
│   │   │   ├── LFC.tsv               # Log fold change values
│   │   │   ├── fitting_LFCs.tsv      # Fitted log fold changes
│   │   │   ├── fitting_results.tsv   # Model fitting results
│   │   │   └── transformed_weights.tsv # Statistical weights
│   │   └── gene_level/          # Gene-level analysis results
│   │       ├── LFC.tsv               # Gene-level log fold changes
│   │       ├── fitting_LFCs.tsv      # Gene-level fitted values
│   │       └── fitting_results.tsv   # Gene-level model results
│   ├── Long_timecourse_data/    # Long timecourse pipeline results
│   └── haploid_data/            # Haploid pipeline results
└── resource/                     # Reference data (downloaded separately)
    ├── Hayles_2013_OB_merged_categories_sysIDupdated.xlsx
    └── pombase_data/            # PomBase reference data
        └── 2025-10-01/
            ├── Gene_metadata/
            ├── genome_region/
            ├── ontologies_and_associations/
            └── RNA_metadata/
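
The files in this layout are tab-separated tables, some of them gzip-compressed. The following sketch shows one way to read either kind with the standard library. `read_tsv` is a hypothetical helper and not part of the application, whose own loading code (in src/data_manager.py) is not shown in this README.

```python
import csv
import gzip
from pathlib import Path


def read_tsv(path):
    """Read a tab-separated file, plain or gzip-compressed, into a list of dicts."""
    path = Path(path)
    # Pick the opener from the file extension; both accept text mode.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Only runs when the example file from the layout above is present.
example = Path("data/raw/HD_DIT_HAP/insertion_level/annotations.tsv.gz")
if example.is_file():
    rows = read_tsv(example)
    print(f"{len(rows)} rows loaded from {example.name}")
```

Every value comes back as a string, so numeric columns need converting before use.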

Required Pipeline Output Files

Insertion-level analysis results:

  • annotations.tsv.gz - Insertion annotations from pipeline preprocessing
  • baseMean.tsv - Base mean statistics per insertion
  • LFC.tsv - Log fold change calculations
  • fitting_LFCs.tsv - Statistical modeling of log fold changes
  • fitting_results.tsv - Model fitting quality metrics
  • transformed_weights.tsv - Weighted statistical analyses

Gene-level analysis results:

  • LFC.tsv - Aggregated gene-level log fold changes
  • fitting_LFCs.tsv - Gene-level statistical modeling
  • fitting_results.tsv - Gene-level model quality assessments

Reference data:

  • PomBase gene annotations and metadata
  • Genome intervals and genomic features
  • Ontology files (GO, FYPO, MONDO) for enrichment analysis
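
Before launching the app, it can help to confirm that a dataset directory contains every pipeline output file listed above. The sketch below is a standalone check, assuming the insertion_level/ and gene_level/ layout shown earlier. `find_missing_outputs` and `REQUIRED_OUTPUTS` are hypothetical names, and this is not the application's own `config.validate_all_paths()`.

```python
from pathlib import Path

# File names copied from the required-output lists in this README.
REQUIRED_OUTPUTS = {
    "insertion_level": [
        "annotations.tsv.gz",
        "baseMean.tsv",
        "LFC.tsv",
        "fitting_LFCs.tsv",
        "fitting_results.tsv",
        "transformed_weights.tsv",
    ],
    "gene_level": [
        "LFC.tsv",
        "fitting_LFCs.tsv",
        "fitting_results.tsv",
    ],
}


def find_missing_outputs(dataset_dir):
    """Return the relative paths of required files absent from dataset_dir."""
    dataset_dir = Path(dataset_dir)
    missing = []
    for level, names in REQUIRED_OUTPUTS.items():
        for name in names:
            candidate = dataset_dir / level / name
            if not candidate.is_file():
                missing.append(candidate.relative_to(dataset_dir).as_posix())
    return missing


# An empty list means all nine files were found.
print(find_missing_outputs("data/raw/HD_DIT_HAP"))
```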

Configuration

The application uses src/data_config.py for flexible configuration of different pipeline outputs:

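# Import the configuration helpers (assuming they are exposed by src/data_config.py)
from src.data_config import (
    get_default_config,
    get_long_timecourse_config,
    get_haploid_config,
    get_custom_config,
)

# Use default pipeline configuration
config = get_default_config()

# Use long timecourse pipeline data
config = get_long_timecourse_config()

# Use haploid pipeline data
config = get_haploid_config()

# Configure custom pipeline output locations
config = get_custom_config(custom_paths...)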

Development

Code Structure

├── DIT_HAP_app.py              # Main application entry point
├── pages/                      # Streamlit pages for different analyses
│   ├── depletion_data.py       # Pipeline depletion visualization
│   ├── feature_space.py        # Pipeline feature space analysis
│   └── enrichment_analysis.py  # Ontology enrichment of pipeline results
├── src/                        # Core functionality modules
│   ├── data_config.py          # Pipeline output file configuration
│   ├── data_manager.py         # Pipeline data loading and management
│   ├── preparation.py          # Data preparation utilities
│   ├── get_gene_data.py        # Pipeline data retrieval functions
│   ├── display_gene_data.py    # Visualization components
│   └── enrichment_functions.py # Enrichment analysis functions
└── requirements.txt            # Python dependencies

Development Setup

# Install development dependencies
pip install -r requirements.txt

# Run in development mode
streamlit run DIT_HAP_app.py --server.runOnSave true

# Code formatting (optional)
pip install black flake8
black src/ pages/
flake8 src/ pages/

Adding New Pipeline Analysis Types

  1. New Pages: Add to pages/ directory and register in DIT_HAP_app.py
  2. Data Sources: Update data_config.py with new pipeline output types
  3. Visualizations: Use Altair for consistent interactive charts
  4. Data Processing: Leverage existing caching patterns in data_manager.py
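
The caching pattern mentioned in the last item amounts to loading each file once and reusing the parsed result on later calls. The code in data_manager.py is not reproduced in this README, so the sketch below uses `functools.lru_cache` from the standard library as a stand-in. In a Streamlit app the analogous decorator is `st.cache_data`. `load_table` is a hypothetical name.

```python
import csv
from functools import lru_cache


@lru_cache(maxsize=None)
def load_table(path):
    """Read a TSV once; later calls with the same path reuse the parsed rows.

    Rows are returned as tuples so the cached value cannot be
    modified by accident between callers.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        return tuple(tuple(row) for row in csv.reader(handle, delimiter="\t"))
```

Calling `load_table` twice with the same path reads the file only the first time. `load_table.cache_info()` reports the hits and misses, and `load_table.cache_clear()` forces a reload after the underlying file changes.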

Dependencies

Key packages and their purposes for pipeline integration:

  • streamlit (≥1.51.0): Web application framework with caching and reactive UI
  • pandas (≥2.3.0): Data manipulation of pipeline outputs
  • numpy (≥2.3.0): Numerical computing for pipeline statistics
  • scipy (≥1.16.0): Statistical analysis of pipeline data
  • pydantic (≥2.11.7): Data validation for pipeline file formats using BaseModel
  • matplotlib (≥3.10.0): Static plotting capabilities
  • altair (≥5.5.0): Interactive visualizations of pipeline results

Bioinformatics Packages

  • goatools (≥1.5.2): Gene Ontology enrichment analysis
  • beautifulsoup4: XML/HTML parsing for ontology files
  • lxml: XML parsing library for bioinformatics data formats

Network Analysis

  • networkx: Network analysis and graph algorithms
  • ndex2: NDEx network data exchange integration
  • gocam: GO-CAM specific functionality
  • st-cytoscape: Streamlit component for network visualization

Data Processing & Utilities

  • openpyxl: Excel file handling for reference data
  • loguru (≥0.7.3): Advanced logging utilities
  • tqdm: Progress bars for data processing
  • stqdm: Streamlit integration for progress bars

Development Tools

  • pytest: Testing framework for code quality

Complete Dependency List

Install all dependencies with:

pip install -r requirements.txt

See requirements.txt for the complete list with version requirements.

Troubleshooting

Common Pipeline Integration Issues

  1. Pipeline Output Loading Errors

    • Verify pipeline output paths in data_config.py match actual locations
    • Ensure all required pipeline output files are present
    • Check that pipeline completed successfully with all analysis steps
  2. Pipeline Data Format Issues

    • Validate that pipeline output files match expected format
    • Check for missing or corrupted pipeline result files
    • Verify pipeline version compatibility with visualization code
  3. Memory Issues with Large Pipeline Datasets

    • Reduce dataset size for initial testing
    • Increase available system memory
    • Cache loaded tables (for example with Streamlit's st.cache_data) so large pipeline results are read only once
  4. Missing Reference Data

    • Download required PomBase reference files
    • Ensure ontology files are present for enrichment analysis
    • Verify reference data version compatibility

Getting Help

  • Check the application logs for detailed error messages from pipeline data loading
  • Validate pipeline output configuration using config.validate_all_paths()
  • Test with smaller pipeline result subsets first
  • Consult the DIT-HAP pipeline documentation for pipeline-specific issues

Pipeline Documentation

For detailed information about the DIT-HAP pipeline that generates the data visualized in this application, please refer to the pipeline repository: https://github.com/DIT-HAP/DIT_HAP_pipeline

Contributing

We welcome contributions to both the DIT-HAP pipeline and this visualization component! For this visualization application:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

Development Guidelines

  • Follow existing code style and patterns
  • Use type hints where appropriate
  • Add docstrings for new functions
  • Test with different pipeline output configurations
  • Ensure compatibility with standard DIT-HAP pipeline outputs

Citation

If you use this software in your research, please cite both the DIT-HAP pipeline and this visualization component:

DIT-HAP: Diploid Insertional Mutagenesis by Transposon and Haploid Analysis of Phenotype
[Year] - Comprehensive pipeline for transposon mutagenesis analysis in S. pombe
https://github.com/DIT-HAP/DIT_HAP_pipeline

DIT-HAP Streamlit Visualization
[Year] - Interactive visualization for DIT-HAP pipeline results
https://github.com/DIT-HAP/DIT_HAP_streamlit

License

[Add your license information here - may differ from main pipeline license]

Acknowledgments

  • The DIT-HAP pipeline development team for the core analysis workflow
  • PomBase for gene annotations and ontology data
  • The Streamlit team for the web framework
  • The broader bioinformatics community for tools and libraries

Contact

For questions, issues, or contributions specifically related to this visualization component:


Note: This application is designed to work specifically with data generated by the DIT-HAP Snakemake pipeline. Ensure your pipeline has completed successfully and produced all required output files before use.
