Gene Functional Abstraction Pipeline

Production-ready pipeline for analyzing functional characteristics of the human genome using Gene Ontology (GO) data. It ships in three implementations, each optimized for a different use case.

Data Source: https://functionome.geneontology.org/

Overview

Ingests the Gene Ontology and human genome annotations, computes functional impact metrics, and provides an interactive dashboard for exploring gene functions.

Implementation Options

1. C++ DAG Engine (Recommended for Production)

High-performance columnar processing with operation fusion.

Advantages:

5-25x faster than pure Python
Lower memory footprint (~30% reduction)
Operation fusion eliminates intermediate allocations
Portable (Linux, macOS, Windows)
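
Operation fusion generally means combining consecutive element-wise steps into a single pass, so that no intermediate column is materialized between them. The Python sketch below illustrates only the idea; it is not the C++ engine's code, and the names `run_unfused` and `run_fused` as well as the two sample operations are hypothetical.

```python
from typing import Callable, List, Sequence

Op = Callable[[float], float]


def run_unfused(column: Sequence[float], ops: Sequence[Op]) -> List[float]:
    """Apply each op in its own pass; every pass allocates a new list."""
    result = list(column)
    for op in ops:
        result = [op(x) for x in result]  # one intermediate column per op
    return result


def run_fused(column: Sequence[float], ops: Sequence[Op]) -> List[float]:
    """Apply all ops to each element in one pass; only the output is allocated."""
    def fused(x: float) -> float:
        for op in ops:
            x = op(x)
        return x
    return [fused(x) for x in column]


# Two hypothetical element-wise steps: shift, then scale.
ops = [lambda x: x + 1.0, lambda x: x * 2.0]
column = [0.0, 1.0, 2.0]
assert run_unfused(column, ops) == run_fused(column, ops)
```

Both functions return the same values; the fused version simply avoids building one list per operation, which is the kind of saving the bullet above refers to.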

Build:

cd cpp && mkdir -p build && cd build
cmake .. -DUSE_ARROW=OFF && make && make install
cd ../../src && python pipeline_dag.py

Use Case: Production workloads, local processing (10K-1M genes)

2. Apache Spark + Databricks

Distributed big data processing at cloud scale.

Advantages:

Handles 100K-10M+ genes
Delta Lake ACID transactions
Cloud-native deployment
Production data engineering patterns

Quick Start:

pip install -r requirements.txt
python src/main_spark.py

Use Case: Cloud deployment, massive datasets, team collaboration

3. Pure Python (Legacy)

Simple single-file implementation for development.

Advantages:

Minimal setup (a single pip install)
Easy to understand and modify
No compilation required

Quick Start:

pip install pandas requests dash plotly
python src/main.py

Use Case: Prototyping, learning, small datasets (<10K genes)

Performance Comparison

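| Implementation | Speed (more ⚡ = faster) | Memory (more 💾 = higher use) | Setup | Best For |
|---|---|---|---|---|
| C++ DAG | ⚡⚡⚡⚡⚡ | 💾💾 | Medium | Local production |
| Spark/Databricks | ⚡⚡⚡⚡ | 💾💾💾 | Complex | Cloud scale |
| Pure Python | ⚡ | 💾💾💾💾 | Easy | Development |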

Key Features

Validated Pipeline: Production-ready error handling and data validation
Functional Impact Index: Weighted scoring combining pathway, function, and component annotations
Interactive Dashboard: 5-tab Dash interface for exploration and export
Multiple Architectures: Choose the right tool for your scale
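
To make the data validation feature concrete, here is a minimal sketch of how annotation records might be screened before scoring. It assumes records are dicts with the hypothetical keys `gene`, `go_id`, and `aspect`; the repository's actual record layout is not shown in this README. Two conventions are taken from Gene Ontology itself: GO IDs are written as `GO:` followed by seven digits, and annotation files use the aspect codes P, F, and C.

```python
import re
from typing import Dict, List, Tuple

GO_ID_PATTERN = re.compile(r"GO:\d{7}")  # "GO:" followed by seven digits
VALID_ASPECTS = {"P", "F", "C"}          # process, function, component


def validate_annotations(
    records: List[Dict[str, str]],
) -> Tuple[List[Dict[str, str]], List[str]]:
    """Split records into valid rows and human-readable error messages."""
    valid, errors = [], []
    for i, rec in enumerate(records):
        if not rec.get("gene", "").strip():
            errors.append(f"row {i}: missing gene")
        elif not GO_ID_PATTERN.fullmatch(rec.get("go_id", "")):
            errors.append(f"row {i}: malformed GO ID {rec.get('go_id')!r}")
        elif rec.get("aspect") not in VALID_ASPECTS:
            errors.append(f"row {i}: unknown aspect {rec.get('aspect')!r}")
        else:
            valid.append(rec)
    return valid, errors


# Placeholder rows for illustration only, not real annotation data.
rows = [
    {"gene": "GENE_A", "go_id": "GO:0000001", "aspect": "P"},
    {"gene": "GENE_B", "go_id": "GO:12", "aspect": "F"},
    {"gene": "", "go_id": "GO:0000002", "aspect": "C"},
]
valid, errors = validate_annotations(rows)
```

Collecting errors instead of raising on the first bad row lets a run report every problem in one pass; whether the pipeline does this is an assumption.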

Quick Start (C++ DAG - Recommended)

# 1. Build C++ engine
cd cpp && mkdir -p build && cd build
cmake .. -DUSE_ARROW=OFF && make -j4 && make install

# 2. Run pipeline
cd ../../src
python pipeline_dag.py --export results.csv

# Output: Optimized DAG execution with fusion
# Speedup: ~10x over pure Python

Documentation

cpp/README.md - C++ DAG engine architecture and design
cpp/BUILD.md - Build instructions for all platforms
ARCHITECTURE.md - Spark/Databricks architecture
README_SPARK.md - Spark user guide and deployment

Project Structure

src/                    # Python pipeline implementations
├── pipeline_dag.py    # C++ DAG-based (recommended)
├── main_spark.py      # Spark-based (cloud scale)
└── main.py            # Pure Python (legacy)

cpp/                   # C++ DAG engine
├── include/           # Headers (dag, ops, columnar)
├── src/               # Implementation
└── BUILD.md           # Build instructions

config/                # Configuration
databricks/            # Cloud deployment notebook

Computed Metrics

Functional Impact Index = pathways_count + interactions_count
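
As a worked example of this formula, the sketch below scores two placeholder genes and ranks them. The gene labels and counts are made up for illustration, and the function name is hypothetical.

```python
from typing import Dict, Tuple


def functional_impact_index(pathways_count: int, interactions_count: int) -> int:
    """Unweighted index: the sum of the two per-gene counts."""
    return pathways_count + interactions_count


# Placeholder data: gene -> (pathways_count, interactions_count).
genes: Dict[str, Tuple[int, int]] = {"GENE_A": (3, 2), "GENE_B": (0, 7)}

scores = {g: functional_impact_index(p, i) for g, (p, i) in genes.items()}
ranked = sorted(scores, key=scores.get, reverse=True)  # highest index first
```

Here `GENE_A` scores 3 + 2 = 5 and `GENE_B` scores 0 + 7 = 7, so `GENE_B` ranks first.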

The Spark version instead uses a weighted formula based on evidence quality tiers.
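
The weighted formula itself is not spelled out in this README. The sketch below shows one plausible shape for such a score, summing a per-annotation weight chosen by evidence tier. The tier names, the weights, and the assignment of codes to tiers are all assumptions made for illustration; only the evidence codes themselves (EXP, IDA, IMP, ISS, TAS, IEA) are standard GO codes.

```python
from typing import Dict, Iterable

# Hypothetical tier weights; the real Spark implementation may differ.
TIER_WEIGHTS: Dict[str, float] = {
    "experimental": 1.0,
    "curated": 0.5,
    "electronic": 0.25,
}

# Hypothetical mapping from GO evidence codes to tiers.
CODE_TIER: Dict[str, str] = {
    "EXP": "experimental",
    "IDA": "experimental",
    "IMP": "experimental",
    "ISS": "curated",
    "TAS": "curated",
    "IEA": "electronic",
}


def weighted_impact_index(evidence_codes: Iterable[str]) -> float:
    """Sum the tier weight of each annotation; unknown codes contribute 0."""
    total = 0.0
    for code in evidence_codes:
        tier = CODE_TIER.get(code)
        total += TIER_WEIGHTS[tier] if tier is not None else 0.0
    return total


# One gene with three annotations, identified by their evidence codes.
score = weighted_impact_index(["IDA", "IEA", "ISS"])
```

With these assumed weights the example gene scores 1.0 + 0.25 + 0.5 = 1.75, so an experimentally supported annotation counts four times as much as an electronically inferred one.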

Design Philosophy: Intentional Separation

These three implementations are deliberately separate and should NOT be merged.

Each serves a distinct purpose with different trade-offs. Attempting to unify them would:

Create unnecessary abstraction layers and complexity
Force compromises that hurt each use case
Make the codebase harder to maintain and understand

Instead, we maintain clean separation with shared data contracts (GO JSON → DataFrame schema).
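
A shared data contract can be as small as an agreed set of column names and types that every implementation checks at its boundary. The sketch below assumes a hypothetical per-gene schema built from the two count columns named under Computed Metrics; the actual contract used in this repository is not reproduced here.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical contract: column name -> required Python type.
GENE_SCHEMA: Dict[str, type] = {
    "gene": str,
    "pathways_count": int,
    "interactions_count": int,
}


def schema_violations(
    rows: Iterable[Dict[str, Any]],
    schema: Dict[str, type] = GENE_SCHEMA,
) -> List[str]:
    """Return one message per contract violation; an empty list means conforming."""
    problems: List[str] = []
    for i, row in enumerate(rows):
        missing = sorted(schema.keys() - row.keys())
        extra = sorted(row.keys() - schema.keys())
        if missing:
            problems.append(f"row {i}: missing columns {missing}")
        if extra:
            problems.append(f"row {i}: unexpected columns {extra}")
        for col, expected in schema.items():
            if col in row and not isinstance(row[col], expected):
                problems.append(f"row {i}: {col} should be {expected.__name__}")
    return problems


ok_row = {"gene": "GENE_A", "pathways_count": 3, "interactions_count": 2}
bad_row = {"gene": "GENE_B", "pathways_count": "3"}
```

Because the check depends only on plain dicts, each implementation could run an equivalent test on its own output without sharing any code, which fits the separation described above.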

Choosing an Implementation

Use C++ DAG if:

Processing 10K-1M genes locally
Need production performance without cloud setup
Want minimal dependencies

Use Spark/Databricks if:

Processing 100K-10M+ genes
Need cloud deployment
Require enterprise data engineering features

Use Pure Python if:

Learning the pipeline
Prototyping new features
Processing <10K genes
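
The guidance above can be condensed into a small helper. The thresholds come from the lists in this section; the tie-break for the overlapping 100K-1M range (prefer Spark only when cloud deployment is needed) is an assumption, and the function name and return labels are hypothetical.

```python
def recommend_implementation(gene_count: int, need_cloud: bool = False) -> str:
    """Suggest an implementation from dataset size and deployment needs."""
    if gene_count < 10_000:
        return "pure-python"   # small datasets, prototyping
    if gene_count > 1_000_000:
        return "spark"         # beyond the local C++ range
    if need_cloud and gene_count >= 100_000:
        return "spark"         # overlap range, cloud required (assumed tie-break)
    return "cpp-dag"           # 10K-1M genes processed locally
```

For example, `recommend_implementation(500_000)` suggests the C++ DAG engine, while the same size with `need_cloud=True` suggests Spark.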
