Interactive web application for visualizing and analyzing data from the DIT-HAP pipeline - a comprehensive bioinformatics workflow for diploid transposon mutagenesis and haploid fitness analysis in Schizosaccharomyces pombe.
DIT-HAP Streamlit serves as the visualization and analysis component for the open-source DIT-HAP pipeline. This application provides researchers with an intuitive interface to explore transposon insertion sequencing data and analyze gene essentiality in fission yeast through:
- Interactive gene visualization: Explore gene depletion curves and insertion patterns from DIT-HAP pipeline outputs
- Feature space analysis: Visualize genes in multidimensional feature space using pipeline-generated statistics
- Enrichment analysis: Gene Ontology, FYPO, and disease ontology enrichment for pipeline results
- Real-time data exploration: Dynamic filtering and selection of genes from pipeline datasets
This Streamlit application is designed to work directly with data generated by the DIT-HAP Snakemake pipeline. The pipeline processes raw sequencing data through multiple analysis stages to produce:
- Insertion-level statistics: Base mean, log fold changes, fitting results for individual transposon insertions
- Gene-level statistics: Aggregated statistics and depletion curves for each gene
- Quality metrics: Imputation statistics and transformed weights for data quality assessment
- Annotation data: Integration with PomBase gene annotations and ontologies
The visualization application reads these structured outputs and provides interactive tools for biological interpretation and hypothesis generation.
- Gene-level depletion curve plotting from pipeline output files
- Insertion-level statistical analysis and quality control visualizations
- Combined visualizations using Altair charts for comprehensive analysis
- Interactive gene structure visualization with genomic context
- Multi-dimensional gene feature visualization from pipeline statistics
- Interactive scatter plots with gene selection and clustering
- Comparative analysis across different experimental conditions
- Pattern identification in pipeline-generated feature matrices
- Gene Ontology (GO) enrichment for pipeline-identified gene sets
- FYPO (Fission Yeast Phenotype Ontology) analysis for phenotypic interpretations
- MONDO disease ontology associations for translational insights
- Statistical significance testing with multiple hypothesis correction
- Direct reading of standard DIT-HAP pipeline output formats
- Support for multiple pipeline configurations (default, long timecourse, haploid)
- Automatic validation of pipeline output file structure
- Error handling for missing or incomplete pipeline data
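The structure validation can be pictured as a simple existence check over the expected output layout. A hedged stdlib sketch (the real logic lives in `src/data_config.py`; `missing_outputs` is an illustrative name, not its actual API):

```python
from pathlib import Path

# Expected files per analysis level, mirroring the pipeline's output layout
EXPECTED = {
    "insertion_level": ["annotations.tsv.gz", "baseMean.tsv", "LFC.tsv",
                        "fitting_LFCs.tsv", "fitting_results.tsv",
                        "transformed_weights.tsv"],
    "gene_level": ["LFC.tsv", "fitting_LFCs.tsv", "fitting_results.tsv"],
}

def missing_outputs(root):
    """List expected pipeline files missing under root (e.g. data/raw/HD_DIT_HAP)."""
    root = Path(root)
    return [f"{level}/{name}"
            for level, names in EXPECTED.items()
            for name in names
            if not (root / level / name).is_file()]
```

An empty return value means the dataset is complete enough for the app to load.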
- Python 3.8 or higher
- pip package manager
- Data generated by the DIT-HAP pipeline
1. Clone the repository

   ```bash
   git clone https://github.com/DIT-HAP/DIT_HAP_streamlit.git
   cd DIT_HAP_streamlit
   ```

2. Create a virtual environment

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

4. Set up DIT-HAP pipeline data

   ```bash
   # Ensure your pipeline output data is organized in the expected structure
   # See the "Data Requirements" section below for details
   ```

Run the application:

```bash
# Basic run
streamlit run DIT_HAP_app.py

# Run on a specific port
streamlit run DIT_HAP_app.py --server.port 8501

# Development mode with auto-reload
streamlit run DIT_HAP_app.py --server.runOnSave true
```

Open your web browser and navigate to http://localhost:8501 (or the configured port).
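The server options above can also be set persistently in Streamlit's standard config file instead of on the command line:

```toml
# .streamlit/config.toml
[server]
port = 8501
runOnSave = true
```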
- Curve Plot (`depletion_data.py`)
  - Select genes from your DIT-HAP pipeline results
  - View pipeline-generated depletion curves and statistical analyses
  - Explore insertion-level and gene-level statistics from the pipeline
- Feature Space (`feature_space.py`)
  - Visualize genes in feature space using pipeline-generated metrics
  - Interactive scatter plot exploration of pipeline results
  - Gene clustering and pattern identification
- Enrichment Analysis (`enrichment_analysis.py`)
  - Perform GO, FYPO, and MONDO enrichment on pipeline-identified gene sets
  - Statistical significance testing for pipeline results
  - Results visualization and export for downstream analysis
The application expects pipeline output data organized in the following structure:
```
data/
├── raw/                                   # DIT-HAP pipeline outputs
│   ├── HD_DIT_HAP/                        # Standard pipeline results
│   │   ├── insertion_level/               # Insertion-level analysis results
│   │   │   ├── annotations.tsv.gz         # Insertion annotations from pipeline
│   │   │   ├── baseMean.tsv               # Base mean statistics
│   │   │   ├── LFC.tsv                    # Log fold change values
│   │   │   ├── fitting_LFCs.tsv           # Fitted log fold changes
│   │   │   ├── fitting_results.tsv        # Model fitting results
│   │   │   └── transformed_weights.tsv    # Statistical weights
│   │   └── gene_level/                    # Gene-level analysis results
│   │       ├── LFC.tsv                    # Gene-level log fold changes
│   │       ├── fitting_LFCs.tsv           # Gene-level fitted values
│   │       └── fitting_results.tsv        # Gene-level model results
│   ├── Long_timecourse_data/              # Long timecourse pipeline results
│   └── haploid_data/                      # Haploid pipeline results
└── resource/                              # Reference data (downloaded separately)
    ├── Hayles_2013_OB_merged_categories_sysIDupdated.xlsx
    └── pombase_data/                      # PomBase reference data
        └── 2025-10-01/
            ├── Gene_metadata/
            ├── genome_region/
            ├── ontologies_and_associations/
            └── RNA_metadata/
```
Insertion-level analysis results:
- `annotations.tsv.gz` - Insertion annotations from pipeline preprocessing
- `baseMean.tsv` - Base mean expression statistics
- `LFC.tsv` - Log fold change calculations
- `fitting_LFCs.tsv` - Statistical modeling of log fold changes
- `fitting_results.tsv` - Model fitting quality metrics
- `transformed_weights.tsv` - Weighted statistical analyses
Gene-level analysis results:
- `LFC.tsv` - Aggregated gene-level log fold changes
- `fitting_LFCs.tsv` - Gene-level statistical modeling
- `fitting_results.tsv` - Gene-level model quality assessments
Reference data:
- PomBase gene annotations and metadata
- Genome intervals and genomic features
- Ontology files (GO, FYPO, MONDO) for enrichment analysis
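All of the statistics tables above are plain TSVs keyed by an identifier column, so they load directly with pandas. A minimal sketch (the helper name is illustrative, and no column names beyond the index are assumed):

```python
import pandas as pd

def load_gene_level(gene_dir):
    """Read the three gene-level tables, indexed by their first column (gene ID)."""
    names = ("LFC", "fitting_LFCs", "fitting_results")
    return {name: pd.read_csv(f"{gene_dir}/{name}.tsv", sep="\t", index_col=0)
            for name in names}
```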
The application uses `src/data_config.py` for flexible configuration of different pipeline outputs:
```python
# Use default pipeline configuration
config = get_default_config()

# Use long timecourse pipeline data
config = get_long_timecourse_config()

# Use haploid pipeline data
config = get_haploid_config()

# Configure custom pipeline output locations
config = get_custom_config(custom_paths...)
```

Project structure:

```
├── DIT_HAP_app.py                 # Main application entry point
├── pages/                         # Streamlit pages for different analyses
│   ├── depletion_data.py          # Pipeline depletion visualization
│   ├── feature_space.py           # Pipeline feature space analysis
│   └── enrichment_analysis.py     # Ontology enrichment of pipeline results
├── src/                           # Core functionality modules
│   ├── data_config.py             # Pipeline output file configuration
│   ├── data_manager.py            # Pipeline data loading and management
│   ├── preparation.py             # Data preparation utilities
│   ├── get_gene_data.py           # Pipeline data retrieval functions
│   ├── display_gene_data.py       # Visualization components
│   └── enrichment_functions.py    # Enrichment analysis functions
└── requirements.txt               # Python dependencies
```
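One plausible shape for the `get_default_config()` family of helpers, sketched here with a plain dataclass for brevity (the real `src/data_config.py` validates with pydantic's `BaseModel`, and the field names are illustrative, not its actual API):

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class DataConfig:
    """Illustrative config object; the real fields live in src/data_config.py."""
    insertion_dir: Path
    gene_dir: Path

    def validate_all_paths(self) -> bool:
        """True only if every configured directory exists on disk."""
        return all(p.is_dir() for p in (self.insertion_dir, self.gene_dir))

def get_default_config(root="data/raw/HD_DIT_HAP"):
    """Point a config at the standard pipeline output layout."""
    root = Path(root)
    return DataConfig(insertion_dir=root / "insertion_level",
                      gene_dir=root / "gene_level")
```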
```bash
# Install development dependencies
pip install -r requirements.txt

# Run in development mode
streamlit run DIT_HAP_app.py --server.runOnSave true

# Code formatting (optional)
pip install black flake8
black src/ pages/
flake8 src/ pages/
```

Adding new features:

- New Pages: Add to the `pages/` directory and register in `DIT_HAP_app.py`
- Data Sources: Update `data_config.py` with new pipeline output types
- Visualizations: Use Altair for consistent interactive charts
- Data Processing: Leverage the existing caching patterns in `data_manager.py`
Key packages and their purposes for pipeline integration:
- streamlit (≥1.51.0): Web application framework with caching and reactive UI
- pandas (≥2.3.0): Data manipulation of pipeline outputs
- numpy (≥2.3.0): Numerical computing for pipeline statistics
- scipy (≥1.16.0): Statistical analysis of pipeline data
- pydantic (≥2.11.7): Data validation for pipeline file formats using BaseModel
- matplotlib (≥3.10.0): Static plotting capabilities
- altair (≥5.5.0): Interactive visualizations of pipeline results
- goatools (≥1.5.2): Gene Ontology enrichment analysis
- beautifulsoup4: XML/HTML parsing for ontology files
- lxml: XML parsing library for bioinformatics data formats
- networkx: Network analysis and graph algorithms
- ndex2: NDEx network data exchange integration
- gocam: GO-CAM specific functionality
- st-cytoscape: Streamlit component for network visualization
- openpyxl: Excel file handling for reference data
- loguru (≥0.7.3): Advanced logging utilities
- tqdm: Progress bars for data processing
- stqdm: Streamlit integration for progress bars
- pytest: Testing framework for code quality
Install all dependencies with:

```bash
pip install -r requirements.txt
```

See `requirements.txt` for the complete list with version requirements.
1. Pipeline Output Loading Errors
   - Verify that pipeline output paths in `data_config.py` match the actual locations
   - Ensure all required pipeline output files are present
   - Check that the pipeline completed successfully with all analysis steps

2. Pipeline Data Format Issues
   - Validate that pipeline output files match the expected format
   - Check for missing or corrupted pipeline result files
   - Verify pipeline version compatibility with the visualization code

3. Memory Issues with Large Pipeline Datasets
   - Reduce dataset size for initial testing
   - Increase available system memory
   - Use Streamlit caching effectively for large pipeline results

4. Missing Reference Data
   - Download the required PomBase reference files
   - Ensure ontology files are present for enrichment analysis
   - Verify reference data version compatibility
- Check the application logs for detailed error messages from pipeline data loading
- Validate the pipeline output configuration using `config.validate_all_paths()`
- Test with smaller pipeline result subsets first
- Consult the DIT-HAP pipeline documentation for pipeline-specific issues
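For the memory issues above, a quick way to smoke-test the app against a slice of a large pipeline table is to load only part of it; the helper below is hypothetical, but `nrows` and `usecols` are standard `pandas.read_csv` options:

```python
import pandas as pd

def load_subset(path, n_rows=1000, columns=None):
    """Read only the first n_rows of a large pipeline TSV for smoke testing.

    columns, if given, must include the index (first) column of the file.
    """
    return pd.read_csv(path, sep="\t", index_col=0,
                       nrows=n_rows, usecols=columns)
```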
For detailed information about the DIT-HAP pipeline that generates the data visualized in this application, please refer to:
- Main Pipeline Repository: https://github.com/DIT-HAP/DIT_HAP_pipeline
- Pipeline Documentation: Available in the repository wiki and README
- Pipeline Issues: Report pipeline-specific problems in the main repository
We welcome contributions to both the DIT-HAP pipeline and this visualization component! For this visualization application:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
- Follow existing code style and patterns
- Use type hints where appropriate
- Add docstrings for new functions
- Test with different pipeline output configurations
- Ensure compatibility with standard DIT-HAP pipeline outputs
If you use this software in your research, please cite both the DIT-HAP pipeline and this visualization component:
DIT-HAP: Diploid Insertional Mutagenesis by Transposon and Haploid Analysis of Phenotype
[Year] - Comprehensive pipeline for transposon mutagenesis analysis in S. pombe
https://github.com/DIT-HAP/DIT_HAP_pipeline
DIT-HAP Streamlit Visualization
[Year] - Interactive visualization for DIT-HAP pipeline results
https://github.com/DIT-HAP/DIT_HAP_streamlit
[Add your license information here - may differ from main pipeline license]
- The DIT-HAP pipeline development team for the core analysis workflow
- PomBase for gene annotations and ontology data
- The Streamlit team for the web framework
- The broader bioinformatics community for tools and libraries
For questions, issues, or contributions specifically related to this visualization component:
- Create an issue on this repository
- For pipeline-specific questions, use the main DIT-HAP pipeline repository
Note: This application is designed to work specifically with data generated by the DIT-HAP Snakemake pipeline. Ensure your pipeline has completed successfully and produced all required output files before use.