Production-ready pipeline for analyzing human genome functional characteristics using Gene Ontology (GO) data. Multiple implementations optimized for different use cases.
Data Source: https://functionome.geneontology.org/
Ingests GO ontology and human genome annotations, computes functional impact metrics, and provides interactive visualization for exploring gene functions.
High-performance columnar processing with operation fusion.
Advantages:
- 5-25x faster than pure Python
- Lower memory footprint (~30% reduction)
- Operation fusion eliminates intermediate allocations
- Portable (Linux, macOS, Windows)
Build:
cd cpp && mkdir build && cd build
cmake .. -DUSE_ARROW=OFF && make && make install
cd ../../src && python pipeline_dag.pyUse Case: Production workloads, local processing (10K-1M genes)
Distributed big data processing at cloud scale.
Advantages:
- Handles 100K-10M+ genes
- Delta Lake ACID transactions
- Cloud-native deployment
- Production data engineering patterns
Quick Start:
pip install -r requirements.txt
python src/main_spark.pyUse Case: Cloud deployment, massive datasets, team collaboration
Simple single-file implementation for development.
Advantages:
- Zero setup, instant dev
- Easy to understand and modify
- No compilation required
Quick Start:
pip install pandas requests dash plotly
python src/main.pyUse Case: Prototyping, learning, small datasets (<10K genes)
| Implementation | Speed | Memory | Setup | Best For |
|---|---|---|---|---|
| C++ DAG | ⚡⚡⚡⚡⚡ | 💾💾 | Medium | Local production |
| Spark/Databricks | ⚡⚡⚡⚡ | 💾💾💾 | Complex | Cloud scale |
| Pure Python | ⚡ | 💾💾💾💾 | Easy | Development |
- Validated Pipeline: Production-ready error handling and data validation
- Functional Impact Index: Weighted scoring combining pathway, function, and component annotations
- Interactive Dashboard: 5-tab Dash interface for exploration and export
- Multiple Architectures: Choose the right tool for your scale
# 1. Build C++ engine
cd cpp && mkdir build && cd build
cmake .. -DUSE_ARROW=OFF && make -j4 && make install
# 2. Run pipeline
cd ../../src
python pipeline_dag.py --export results.csv
# Output: Optimized DAG execution with fusion
# Speedup: ~10x over pure Pythoncpp/README.md- C++ DAG engine architecture and designcpp/BUILD.md- Build instructions for all platformsARCHITECTURE.md- Spark/Databricks architectureREADME_SPARK.md- Spark user guide and deployment
src/ # Python pipeline implementations
├── pipeline_dag.py # C++ DAG-based (recommended)
├── main_spark.py # Spark-based (cloud scale)
└── main.py # Pure Python (legacy)
cpp/ # C++ DAG engine
├── include/ # Headers (dag, ops, columnar)
├── src/ # Implementation
└── BUILD.md # Build instructions
config/ # Configuration
databricks/ # Cloud deployment notebook
Functional Impact Index = pathways_count + interactions_count
(Spark version uses weighted formula with evidence quality tiers)
These three implementations are deliberately separate and should NOT be merged.
Each serves a distinct purpose with different trade-offs. Attempting to unify them would:
- Create unnecessary abstraction layers and complexity
- Force compromises that hurt each use case
- Make the codebase harder to maintain and understand
Instead, we maintain clean separation with shared data contracts (GO JSON → DataFrame schema).
Use C++ DAG if:
- Processing 10K-1M genes locally
- Need production performance without cloud setup
- Want minimal dependencies
Use Spark/Databricks if:
- Processing 100K-10M+ genes
- Need cloud deployment
- Require enterprise data engineering features
Use Pure Python if:
- Learning the pipeline
- Prototyping new features
- Processing <10K genes