
Imputation


Imputation is an archival snapshot of the missing-value imputation evaluation used in Chapter 2 of the doctoral thesis by Enes Kemal Ergin.

The repository preserves the Jupyter notebooks used to benchmark imputation strategies for peptide-level label-free quantification (LFQ) data, assess their downstream statistical consequences, and validate the findings via simulation. It is shared as a citable research artifact for understanding the thesis analysis — not as a generalized proteomics imputation tool or reusable software package.


What This Repository Is

  • A thesis-analysis snapshot of the missing-value imputation evaluation from Chapter 2.
  • A record of the statistical and visual analyses used to compare imputation methods on a real spike-in proteomics dataset.
  • A simulation framework for assessing method behaviour under controlled MNAR and MAR missingness patterns.
  • A small library of helper modules (pyFuncs/) that implement the imputation algorithms and support the notebook workflows.

What This Repository Is Not

  • A packaged software library or command-line tool.
  • A general-purpose pipeline for arbitrary missing-value handling in proteomics.
  • A fully self-contained rerun from a fresh clone — raw Spectronaut/DIA-NN export files and FASTA references are not distributed with this repository.
  • A validated or generalized framework for non-LFQ or non-Spectronaut workflows.

Scientific Overview

Missing values are pervasive in label-free quantification proteomics data and arise from two fundamentally different mechanisms: values missing at random (MCAR/MAR) due to stochastic sampling limitations, and values missing not at random (MNAR) because the peptide was genuinely absent or fell below the detection threshold. The choice of imputation strategy should reflect which mechanism dominates, yet in practice this is rarely verified. This repository evaluates a range of imputation approaches on both real and simulated data to quantify the consequences of that choice.
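The practical difference between the two mechanisms can be sketched in a few lines of NumPy (an illustrative simulation, not code from this repository): MCAR removes entries uniformly at random, while MNAR preferentially removes low-intensity entries, biasing the observed distribution upward.

```python
import numpy as np

rng = np.random.default_rng(0)

# Log2 intensity matrix: 1000 peptides x 6 samples
X = rng.normal(loc=25, scale=2, size=(1000, 6))

# MCAR-like: every entry has the same 10% chance of being missing
mcar_mask = rng.random(X.shape) < 0.10

# MNAR-like: missingness probability rises as intensity falls toward a
# detection threshold (logistic in intensity; parameters are illustrative)
p_mnar = 1.0 / (1.0 + np.exp(X - 22))  # low intensity -> p close to 1
mnar_mask = rng.random(X.shape) < p_mnar

X_mcar = np.where(mcar_mask, np.nan, X)
X_mnar = np.where(mnar_mask, np.nan, X)

# Under MNAR the observed mean is biased upward; under MCAR it is not
print(np.nanmean(X_mcar), np.nanmean(X_mnar))
```

An imputation method tuned for one regime (e.g. mean imputation under MCAR) can be badly miscalibrated in the other, which is the motivation for benchmarking both.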

1. Real-data benchmark (Fröhlich et al. 2022 spike-in dataset)

A published spike-in dataset in which defined ratios of E. coli peptides are spiked into a human lymph node background is used as a ground-truth benchmark. Peptide-level MaxLFQ intensities are loaded from DIA-NN output and mapped to a FASTA reference. Five sample groups with different E. coli spike-in ratios provide two biologically motivated comparisons:

| Comparison | Expected signal |
|------------|-----------------|
| D vs B | Defined log₂ fold change for E. coli peptides; human peptides unchanged |
| D vs A | E. coli peptides present in D, fully absent in A |

Eight imputation methods are applied and their downstream statistical performance evaluated with limma differential-expression testing (optionally with a missingness-informed weighting scheme). Performance is measured with ROC curves, precision-recall curves, AUC, F1 score, Matthews correlation coefficient (MCC), and accuracy.

2. Imputation methods benchmarked

| Category | Method |
|----------|--------|
| Single-value | Mean, Median |
| Distribution-based | Minimum-value-centred Gaussian, Downshifted Gaussian (global), Group-wise downshifted Gaussian |
| Neighbour-based | k-Nearest Neighbours (kNN) |
| Regression / tree | Linear Regression (IterativeImputer), Decision Tree (IterativeImputer) |
| Ensemble / matrix | Random Forest (MissForest), PCA (truncated SVD) |
| Baseline | Random sampling from observed values |
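As one concrete example from the distribution-based family, a global downshifted Gaussian replaces missing entries with draws from a Gaussian shifted below the observed intensity distribution. The sketch below uses the Perseus-style defaults commonly cited in the literature (shift of 1.8 SD, width of 0.3 SD); the parameters used in this repository may differ.

```python
import numpy as np

def downshift_gaussian_impute(X, shift=1.8, width=0.3, seed=0):
    """Replace NaNs with draws from a Gaussian shifted below the observed
    distribution (Perseus-style defaults; illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    X = X.copy()
    mu, sd = np.nanmean(X), np.nanstd(X)
    n_missing = int(np.isnan(X).sum())
    draws = rng.normal(loc=mu - shift * sd, scale=width * sd, size=n_missing)
    X[np.isnan(X)] = draws
    return X

X = np.array([[25.1, np.nan, 24.8],
              [np.nan, 22.3, 23.0]])
X_imp = downshift_gaussian_impute(X)
```

Because the draws sit well below the observed mean, this method encodes an explicit MNAR assumption: missing values are treated as low-abundance.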

3. Missingness-aware statistical weighting

A weight matrix is computed alongside imputed data to down-weight peptides where an entire condition is missing (i.e., MNAR-likely entries). This weighting is passed to limma's lmFit and its effect on statistical testing is compared against unweighted analysis.
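A minimal version of this idea might look as follows (an illustrative reimplementation, not the repository's exact scheme; `missingness_weights` and the `low_weight` value are hypothetical names):

```python
import numpy as np
import pandas as pd

def missingness_weights(df, groups, low_weight=0.1):
    """Return a weight matrix matching df: peptides (rows) where any
    condition is entirely missing (MNAR-likely) get low_weight, all
    other entries get 1.0. Illustrative sketch only."""
    W = pd.DataFrame(1.0, index=df.index, columns=df.columns)
    for group, cols in groups.items():
        fully_missing = df[cols].isna().all(axis=1)
        W.loc[fully_missing, :] = low_weight
    return W

df = pd.DataFrame({"A1": [1.0, np.nan], "A2": [2.0, np.nan],
                   "B1": [1.5, 3.0],   "B2": [np.nan, 2.8]})
W = missingness_weights(df, {"A": ["A1", "A2"], "B": ["B1", "B2"]})
```

A matrix like `W` can be passed to limma's `lmFit` via its `weights` argument, so that entries imputed into an entirely missing condition contribute less to the fitted variance.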

4. Simulation validation

A synthetic spike-in experiment is generated with 5 000 proteins across replicated conditions and defined log₂ fold changes. Controlled MNAR and MAR missing-value fractions (10–50%) are introduced and the full panel of imputation methods is applied. Performance is evaluated with NRMSE and the QuEStVar metric, which combines differential expression testing and equivalence testing into a single interpretable benchmark.
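NRMSE itself has several conventions; one common choice, assumed here for illustration, normalises the RMSE over the originally missing entries by the standard deviation of the true values:

```python
import numpy as np

def nrmse(X_true, X_imputed, mask):
    """NRMSE over the imputed (originally missing) entries, normalised by
    the standard deviation of the true values (one common convention)."""
    diff = X_imputed[mask] - X_true[mask]
    return np.sqrt(np.mean(diff ** 2)) / np.std(X_true)

X_true = np.array([[1.0, 2.0], [3.0, 4.0]])
X_imp  = np.array([[1.0, 2.5], [3.0, 4.0]])
mask   = np.array([[False, True], [False, False]])
print(nrmse(X_true, X_imp, mask))  # error of 0.5 on one entry, scaled
```

Normalising makes the error comparable across simulated datasets with different intensity scales.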


Repository Structure

Imputation/
├── 2022_Frohlich/              # Real-data benchmark notebooks
│   ├── 01-DataPreparation.ipynb
│   ├── 02-ImputationEffect.ipynb
│   ├── 03-Comparisons.ipynb
│   ├── 04-CombinedBenchmark.ipynb
│   └── runLimma.R
├── Simulation/                 # Simulation-based validation notebooks
│   ├── EstablishLogic.ipynb
│   ├── SingleMethod.ipynb
│   └── visualize_results.ipynb
├── pyFuncs/                    # Helper modules used by the notebooks
│   ├── __init__.py
│   ├── utils.py
│   ├── plots.py
│   └── questvar.py
├── LICENSE
├── CITATION.cff
└── README.md

Analysis Notebooks

2022_Frohlich/ — Real-data benchmark

All notebooks are preserved with frozen outputs and should be interpreted as archival research artifacts.

| Notebook | Description |
|----------|-------------|
| 01-DataPreparation.ipynb | Loads DIA-NN MaxLFQ output, maps peptides to a FASTA reference, constructs metadata, renames samples, median-centres intensities, and defines the two benchmark comparisons (D vs B and D vs A) |
| 02-ImputationEffect.ipynb | Applies downshifted Gaussian imputation and quantifies the resulting shifts in peptide-level means, log₂ fold changes, and CV distributions; highlights representative peptide-level examples |
| 03-Comparisons.ipynb | Runs limma (with and without missingness-informed weighting) across all eight imputation methods and evaluates each with ROC curves, PR curves, AUC, and statistical-status counts for both comparisons |
| 04-CombinedBenchmark.ipynb | Aggregates benchmark results across both comparisons; produces summary figures comparing F1, MCC, accuracy, ROC-AUC, and PR-AUC for all methods |
| runLimma.R | R script that executes limma differential-expression testing on imputed data tables exported by the Python notebooks and writes results back as Feather files |

Simulation/ — Simulation-based validation

| Notebook | Description |
|----------|-------------|
| EstablishLogic.ipynb | Builds the full simulation pipeline: generates correlated protein intensity matrices, creates spike-in metadata, adds biologically informed absence, introduces MNAR/MAR missingness, applies all imputation methods, and runs speed benchmarks |
| SingleMethod.ipynb | Evaluates each imputation method independently across a range of MV rates (10–50%) and MNAR/MAR mixing proportions; visualises imputed vs. original distributions across subsets |
| visualize_results.ipynb | Loads pre-computed benchmark outputs and produces summary visualisations comparing NRMSE and QuEStVar metrics across methods and MV scenarios |

Helper Modules (pyFuncs/)

| Module | Contents |
|--------|----------|
| utils.py | Imputation algorithm implementations (kNN, random forest, linear regression, decision tree, PCA/SVD, downshifted Gaussian, group-wise Gaussian, random sampling, mean/median), FASTA-to-DataFrame parser, CV calculation, missingness weight-matrix construction, limma result label generation |
| plots.py | Figure generation utilities: colour palette display, distribution plots, CV comparison figures, ROC/PR overlay helpers |
| questvar.py | Implementation of the QuEStVar benchmark metric combining differential and equivalence testing to evaluate imputation-induced changes in statistical conclusions |

Minimal Setup

Requires Python 3.8+ and R.

pip install -r requirements.txt

R dependencies used by runLimma.R: limma (Bioconductor), feather.

Important: The notebooks depend on local data/ inputs (DIA-NN and Spectronaut exports, FASTA reference files) that are not distributed with this repository. Cells will not execute from a fresh clone without these files.


Reproducibility Notes

  • The notebooks in 2022_Frohlich/ and Simulation/ are preserved with frozen outputs as part of the thesis record and can be read in full without re-executing.
  • Raw and processed DIA-NN/Spectronaut data are not included in this repository.
  • This repository is best understood as a documented analysis snapshot, not as a one-command reproducible pipeline.
  • Helper utilities in pyFuncs/ (imputation algorithms, QuEStVar, figure helpers) are self-contained enough to be reused independently of the thesis data.

Limitations

  • The benchmark is scoped to peptide-level LFQ data from Spectronaut and DIA-NN exports; other DIA software would require adapted column mapping.
  • Performance conclusions are specific to the Fröhlich et al. 2022 spike-in design and the simulated experiment parameters used here. Transferability to other sample types or acquisition settings is not established.
  • The full notebook execution requires the original DIA-NN output files and FASTA references that are not distributed.
  • This is not a peer-reviewed software artifact; it reflects the analytical decisions made at the time of thesis writing.

License

All content in this repository — including source code, notebooks, README prose, helper modules, and frozen outputs — is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.

See LICENSE for the full license text and terms.


Citation

If you draw on the analysis or conclusions from this work, please cite the doctoral thesis that this repository accompanies:

@phdthesis{ergin2024thesis,
  author      = {Ergin, Enes Kemal},
  title       = {{Computational interrogation of proteoform dynamics in pediatric cancer}},
  school      = {University of British Columbia},
  year        = {2024},
  url         = {https://open.library.ubc.ca/soa/cIRcle/collections/ubctheses/24/items/1.0448334}
}

If you specifically need to reference this repository as a code or notebook artifact:

@misc{ergin2024imputation,
  author       = {Ergin, Enes Kemal},
  title        = {{Imputation: Missing-Value Imputation Evaluation for Peptide-Level LFQ Proteomics Data}},
  year         = {2024},
  howpublished = {\url{https://github.com/eneskemalergin/Imputation}},
  note         = {Archival thesis analysis snapshot. License: CC BY-NC 4.0}
}


Values fade to white —
Gaps filled with a borrowed light,
Truth holds, barely, still.
