Interactive web application for visualizing and analyzing data from the DIT-HAP pipeline - a comprehensive bioinformatics workflow for diploid transposon mutagenesis and haploid fitness analysis in Schizosaccharomyces pombe.
DIT-HAP Streamlit serves as the visualization and analysis component for the open-source DIT-HAP pipeline. This application provides researchers with an intuitive interface to explore transposon insertion sequencing data and analyze gene essentiality in fission yeast through:
- Interactive gene visualization: Explore gene depletion curves and insertion patterns from DIT-HAP pipeline outputs
- Feature space analysis: Visualize genes in multidimensional feature space using pipeline-generated statistics
- Enrichment analysis: Gene Ontology, FYPO, and disease ontology enrichment for pipeline results
- Real-time data exploration: Dynamic filtering and selection of genes from pipeline datasets
This Streamlit application is designed to work directly with data generated by the DIT-HAP Snakemake pipeline. The pipeline processes raw sequencing data through multiple analysis stages to produce:
- Insertion-level statistics: Base mean, log fold changes, fitting results for individual transposon insertions
- Gene-level statistics: Aggregated statistics and depletion curves for each gene
- Quality metrics: Imputation statistics and transformed weights for data quality assessment
- Annotation data: Integration with PomBase gene annotations and ontologies
The visualization application reads these structured outputs and provides interactive tools for biological interpretation and hypothesis generation.
- Gene-level depletion curve plotting from pipeline output files
- Insertion-level statistical analysis and quality control visualizations
- Combined visualizations using Altair charts for comprehensive analysis
- Interactive gene structure visualization with genomic context
- Multi-dimensional gene feature visualization from pipeline statistics
- Interactive scatter plots with gene selection and clustering
- Comparative analysis across different experimental conditions
- Pattern identification in pipeline-generated feature matrices
- Gene Ontology (GO) enrichment for pipeline-identified gene sets
- FYPO (Fission Yeast Phenotype Ontology) analysis for phenotypic interpretations
- MONDO disease ontology associations for translational insights
- Statistical significance testing with multiple hypothesis correction
- Direct reading of standard DIT-HAP pipeline output formats
- Support for multiple pipeline configurations (default, long timecourse, haploid)
- Automatic validation of pipeline output file structure
- Error handling for missing or incomplete pipeline data
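The structure validation can be pictured as a simple existence check over the expected output layout. A hedged stdlib sketch (the real logic lives in `src/data_config.py`; `missing_outputs` is an illustrative name, not its actual API):

```python
from pathlib import Path

# Expected files per analysis level, mirroring the pipeline's output layout
EXPECTED = {
    "insertion_level": ["annotations.tsv.gz", "baseMean.tsv", "LFC.tsv",
                        "fitting_LFCs.tsv", "fitting_results.tsv",
                        "transformed_weights.tsv"],
    "gene_level": ["LFC.tsv", "fitting_LFCs.tsv", "fitting_results.tsv"],
}

def missing_outputs(root):
    """List expected pipeline files missing under root (e.g. data/raw/HD_DIT_HAP)."""
    root = Path(root)
    return [f"{level}/{name}"
            for level, names in EXPECTED.items()
            for name in names
            if not (root / level / name).is_file()]
```

An empty return value means the dataset is complete enough for the app to load.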
- Python 3.8 or higher
- pip package manager
- Data generated by the DIT-HAP pipeline
1. Clone the repository

   ```bash
   git clone https://github.com/DIT-HAP/DIT_HAP_streamlit.git
   cd DIT_HAP_streamlit
   ```

2. Create a virtual environment

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

4. Set up DIT-HAP pipeline data

   ```bash
   # Ensure your pipeline output data is organized in the expected structure
   # See the "Data Requirements" section below for details
   ```

Run the application:

```bash
# Basic run
streamlit run DIT_HAP_app.py

# Run on a specific port
streamlit run DIT_HAP_app.py --server.port 8501

# Development mode with auto-reload
streamlit run DIT_HAP_app.py --server.runOnSave true
```

Open your web browser and navigate to http://localhost:8501 (or the configured port).
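The server options above can also be set persistently in Streamlit's standard config file instead of on the command line:

```toml
# .streamlit/config.toml
[server]
port = 8501
runOnSave = true
```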
- Curve Plot (`depletion_data.py`)
  - Select genes from your DIT-HAP pipeline results
  - View pipeline-generated depletion curves and statistical analyses
  - Explore insertion-level and gene-level statistics from the pipeline
- Feature Space (`feature_space.py`)
  - Visualize genes in feature space using pipeline-generated metrics
  - Interactive scatter plot exploration of pipeline results
  - Gene clustering and pattern identification
- Enrichment Analysis (`enrichment_analysis.py`)
  - Perform GO, FYPO, and MONDO enrichment on pipeline-identified gene sets
  - Statistical significance testing for pipeline results
  - Results visualization and export for downstream analysis
The application expects pipeline output data organized in the following structure:
```
data/
├── raw/                                   # DIT-HAP pipeline outputs
│   ├── HD_DIT_HAP/                        # Standard pipeline results
│   │   ├── insertion_level/               # Insertion-level analysis results
│   │   │   ├── annotations.tsv.gz         # Insertion annotations from pipeline
│   │   │   ├── baseMean.tsv               # Base mean statistics
│   │   │   ├── LFC.tsv                    # Log fold change values
│   │   │   ├── fitting_LFCs.tsv           # Fitted log fold changes
│   │   │   ├── fitting_results.tsv        # Model fitting results
│   │   │   └── transformed_weights.tsv    # Statistical weights
│   │   └── gene_level/                    # Gene-level analysis results
│   │       ├── LFC.tsv                    # Gene-level log fold changes
│   │       ├── fitting_LFCs.tsv           # Gene-level fitted values
│   │       └── fitting_results.tsv        # Gene-level model results
│   ├── Long_timecourse_data/              # Long timecourse pipeline results
│   └── haploid_data/                      # Haploid pipeline results
└── resource/                              # Reference data (downloaded separately)
    ├── Hayles_2013_OB_merged_categories_sysIDupdated.xlsx
    └── pombase_data/                      # PomBase reference data
        └── 2025-10-01/
            ├── Gene_metadata/
            ├── genome_region/
            ├── ontologies_and_associations/
            └── RNA_metadata/
```
Insertion-level analysis results:
- `annotations.tsv.gz` - Insertion annotations from pipeline preprocessing
- `baseMean.tsv` - Base mean expression statistics
- `LFC.tsv` - Log fold change calculations
- `fitting_LFCs.tsv` - Statistical modeling of log fold changes
- `fitting_results.tsv` - Model fitting quality metrics
- `transformed_weights.tsv` - Weighted statistical analyses
Gene-level analysis results:
- `LFC.tsv` - Aggregated gene-level log fold changes
- `fitting_LFCs.tsv` - Gene-level statistical modeling
- `fitting_results.tsv` - Gene-level model quality assessments
Reference data:
- PomBase gene annotations and metadata
- Genome intervals and genomic features
- Ontology files (GO, FYPO, MONDO) for enrichment analysis
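All of the statistics tables above are plain TSVs keyed by an identifier column, so they load directly with pandas. A minimal sketch (the helper name is illustrative, and no column names beyond the index are assumed):

```python
import pandas as pd

def load_gene_level(gene_dir):
    """Read the three gene-level tables, indexed by their first column (gene ID)."""
    names = ("LFC", "fitting_LFCs", "fitting_results")
    return {name: pd.read_csv(f"{gene_dir}/{name}.tsv", sep="\t", index_col=0)
            for name in names}
```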
The application uses `src/data_config.py` for flexible configuration of different pipeline outputs:
```python
# Use default pipeline configuration
config = get_default_config()

# Use long timecourse pipeline data
config = get_long_timecourse_config()

# Use haploid pipeline data
config = get_haploid_config()

# Configure custom pipeline output locations
config = get_custom_config(custom_paths...)
```

Project structure:

```
├── DIT_HAP_app.py                 # Main application entry point
├── pages/                         # Streamlit pages for different analyses
│   ├── depletion_data.py          # Pipeline depletion visualization
│   ├── feature_space.py           # Pipeline feature space analysis
│   └── enrichment_analysis.py     # Ontology enrichment of pipeline results
├── src/                           # Core functionality modules
│   ├── data_config.py             # Pipeline output file configuration
│   ├── data_manager.py            # Pipeline data loading and management
│   ├── preparation.py             # Data preparation utilities
│   ├── get_gene_data.py           # Pipeline data retrieval functions
│   ├── display_gene_data.py       # Visualization components
│   └── enrichment_functions.py    # Enrichment analysis functions
└── requirements.txt               # Python dependencies
```
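One plausible shape for the `get_default_config()` family of helpers, sketched here with a plain dataclass for brevity (the real `src/data_config.py` validates with pydantic's `BaseModel`, and the field names are illustrative, not its actual API):

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class DataConfig:
    """Illustrative config object; the real fields live in src/data_config.py."""
    insertion_dir: Path
    gene_dir: Path

    def validate_all_paths(self) -> bool:
        """True only if every configured directory exists on disk."""
        return all(p.is_dir() for p in (self.insertion_dir, self.gene_dir))

def get_default_config(root="data/raw/HD_DIT_HAP"):
    """Point a config at the standard pipeline output layout."""
    root = Path(root)
    return DataConfig(insertion_dir=root / "insertion_level",
                      gene_dir=root / "gene_level")
```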
```bash
# Install development dependencies
pip install -r requirements.txt

# Run in development mode
streamlit run DIT_HAP_app.py --server.runOnSave true

# Code formatting (optional)
pip install black flake8
black src/ pages/
flake8 src/ pages/
```

Adding new features:

- New Pages: Add to the `pages/` directory and register in `DIT_HAP_app.py`
- Data Sources: Update `data_config.py` with new pipeline output types
- Visualizations: Use Altair for consistent interactive charts
- Data Processing: Leverage the existing caching patterns in `data_manager.py`
Key packages and their purposes for pipeline integration:
- streamlit (≥1.51.0): Web application framework with caching and reactive UI
- pandas (≥2.3.0): Data manipulation of pipeline outputs
- numpy (≥2.3.0): Numerical computing for pipeline statistics
- scipy (≥1.16.0): Statistical analysis of pipeline data
- pydantic (≥2.11.7): Data validation for pipeline file formats using BaseModel
- matplotlib (≥3.10.0): Static plotting capabilities
- altair (≥5.5.0): Interactive visualizations of pipeline results
- goatools (≥1.5.2): Gene Ontology enrichment analysis
- beautifulsoup4: XML/HTML parsing for ontology files
- lxml: XML parsing library for bioinformatics data formats
- networkx: Network analysis and graph algorithms
- ndex2: NDEx network data exchange integration
- gocam: GO-CAM specific functionality
- st-cytoscape: Streamlit component for network visualization
- openpyxl: Excel file handling for reference data
- loguru (≥0.7.3): Advanced logging utilities
- tqdm: Progress bars for data processing
- stqdm: Streamlit integration for progress bars
- pytest: Testing framework for code quality
Install all dependencies with:

```bash
pip install -r requirements.txt
```

See `requirements.txt` for the complete list with version requirements.
1. Pipeline Output Loading Errors
   - Verify that pipeline output paths in `data_config.py` match the actual locations
   - Ensure all required pipeline output files are present
   - Check that the pipeline completed successfully with all analysis steps

2. Pipeline Data Format Issues
   - Validate that pipeline output files match the expected format
   - Check for missing or corrupted pipeline result files
   - Verify pipeline version compatibility with the visualization code

3. Memory Issues with Large Pipeline Datasets
   - Reduce dataset size for initial testing
   - Increase available system memory
   - Use Streamlit caching effectively for large pipeline results

4. Missing Reference Data
   - Download the required PomBase reference files
   - Ensure ontology files are present for enrichment analysis
   - Verify reference data version compatibility
- Check the application logs for detailed error messages from pipeline data loading
- Validate the pipeline output configuration using `config.validate_all_paths()`
- Test with smaller pipeline result subsets first
- Consult the DIT-HAP pipeline documentation for pipeline-specific issues
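For the memory issues above, a quick way to smoke-test the app against a slice of a large pipeline table is to load only part of it; the helper below is hypothetical, but `nrows` and `usecols` are standard `pandas.read_csv` options:

```python
import pandas as pd

def load_subset(path, n_rows=1000, columns=None):
    """Read only the first n_rows of a large pipeline TSV for smoke testing.

    columns, if given, must include the index (first) column of the file.
    """
    return pd.read_csv(path, sep="\t", index_col=0,
                       nrows=n_rows, usecols=columns)
```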
For detailed information about the DIT-HAP pipeline that generates the data visualized in this application, please refer to:
- Main Pipeline Repository: https://github.com/DIT-HAP/DIT_HAP_pipeline
- Pipeline Documentation: Available in the repository wiki and README
- Pipeline Issues: Report pipeline-specific problems in the main repository
We welcome contributions to both the DIT-HAP pipeline and this visualization component! For this visualization application:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
- Follow existing code style and patterns
- Use type hints where appropriate
- Add docstrings for new functions
- Test with different pipeline output configurations
- Ensure compatibility with standard DIT-HAP pipeline outputs
If you use this software in your research, please cite both the DIT-HAP pipeline and this visualization component:
DIT-HAP: Diploid Insertional Mutagenesis by Transposon and Haploid Analysis of Phenotype
[Year] - Comprehensive pipeline for transposon mutagenesis analysis in S. pombe
https://github.com/DIT-HAP/DIT_HAP_pipeline
DIT-HAP Streamlit Visualization
[Year] - Interactive visualization for DIT-HAP pipeline results
https://github.com/DIT-HAP/DIT_HAP_streamlit
[Add your license information here - may differ from main pipeline license]
- The DIT-HAP pipeline development team for the core analysis workflow
- PomBase for gene annotations and ontology data
- The Streamlit team for the web framework
- The broader bioinformatics community for tools and libraries
For questions, issues, or contributions specifically related to this visualization component:
- Create an issue on this repository
- For pipeline-specific questions, use the main DIT-HAP pipeline repository
Note: This application is designed to work specifically with data generated by the DIT-HAP Snakemake pipeline. Ensure your pipeline has completed successfully and produced all required output files before use.