**Imputation** is an archival snapshot of the missing-value imputation evaluation used in Chapter 2 of the doctoral thesis by Enes Kemal Ergin.
The repository preserves the Jupyter notebooks used to benchmark imputation strategies for peptide-level label-free quantification (LFQ) data, assess their downstream statistical consequences, and validate the findings via simulation. It is shared as a citable research artifact for understanding the thesis analysis — not as a generalized proteomics imputation tool or reusable software package.
This repository is:

- A thesis-analysis snapshot of the missing-value imputation evaluation from Chapter 2.
- A record of the statistical and visual analyses used to compare imputation methods on a real spike-in proteomics dataset.
- A simulation framework for assessing method behaviour under controlled MNAR and MAR missingness patterns.
- A small library of helper modules (`pyFuncs/`) that implement the imputation algorithms and supported the notebook workflows.

It is not:

- A packaged software library or command-line tool.
- A general-purpose pipeline for arbitrary missing-value handling in proteomics.
- A fully self-contained rerun from a fresh clone — raw Spectronaut/DIA-NN export files and FASTA references are not distributed with this repository.
- A validated or generalized framework for non-LFQ or non-Spectronaut workflows.
Missing values are pervasive in label-free quantification proteomics data and arise from two fundamentally different mechanisms: values missing completely at random (MCAR/MAR) due to stochastic sampling limitations, and values missing not at random (MNAR) because the peptide was genuinely absent or below the detection threshold. The choice of imputation strategy should reflect which mechanism is dominant, yet in practice this is rarely verified. This repository evaluates a range of imputation approaches on both real and simulated data to quantify the consequences of that choice.
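The difference between the two mechanisms can be illustrated with a small simulation. This is a minimal sketch with made-up parameters, not the repository's simulation code: MCAR removes entries uniformly at random, while MNAR censors the low-intensity tail, which biases the observed mean upward.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated log2 peptide intensities: 1000 peptides x 6 samples.
intensities = rng.normal(loc=25.0, scale=2.0, size=(1000, 6))

# MCAR: every entry has the same probability of being missing.
mcar = intensities.copy()
mcar[rng.random(mcar.shape) < 0.10] = np.nan

# MNAR: low-intensity values are preferentially censored, mimicking
# peptides at or below the detection limit.
mnar = intensities.copy()
threshold = np.quantile(intensities, 0.10)  # censor the bottom 10%
mnar[mnar < threshold] = np.nan

# MNAR missingness concentrates in the low tail, so the observed mean
# is biased upward relative to the MCAR case.
print(np.nanmean(mcar), np.nanmean(mnar))
```

An imputation method tuned for one mechanism (e.g. drawing low values for MNAR gaps) will systematically distort values that are actually missing for MCAR/MAR reasons, which is why the mechanism matters.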
A published spike-in dataset in which defined ratios of E. coli peptides are spiked into a human lymph node background is used as a ground-truth benchmark. Peptide-level MaxLFQ intensities are loaded from DIA-NN output and mapped to a FASTA reference. Five sample groups with different E. coli spike-in ratios provide two biologically motivated comparisons:
| Comparison | Expected signal |
|---|---|
| D vs B | Defined log₂ fold change for E. coli peptides; human peptides unchanged |
| D vs A | E. coli peptides present in D, fully absent in A |
Eight imputation methods are applied and their downstream statistical performance evaluated with limma differential-expression testing (optionally with a missingness-informed weighting scheme). Performance is measured with ROC curves, precision-recall curves, AUC, F1 score, Matthews correlation coefficient (MCC), and accuracy.
| Category | Method |
|---|---|
| Single-value | Mean, Median |
| Distribution-based | Minimum-value-centred Gaussian, Downshifted Gaussian (global), Group-wise downshifted Gaussian |
| Neighbour-based | k-Nearest Neighbours (kNN) |
| Regression / tree | Linear Regression (IterativeImputer), Decision Tree (IterativeImputer) |
| Ensemble / matrix | Random Forest (MissForest), PCA (truncated SVD) |
| Baseline | Random sampling from observed values |
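As an illustration of the distribution-based category, here is a minimal global downshifted-Gaussian imputer. The `shift=1.8` / `width=0.3` defaults are the widely used Perseus-style values and are assumptions here, not necessarily the parameters used in the notebooks (the actual implementations live in `pyFuncs/utils.py`):

```python
import numpy as np

def downshift_gaussian_impute(X, shift=1.8, width=0.3, seed=0):
    """Fill missing entries by sampling from a Gaussian that is
    down-shifted relative to the observed intensity distribution.
    `shift` and `width` are in units of the observed standard
    deviation; 1.8 / 0.3 are common (Perseus-style) defaults, used
    here for illustration only."""
    rng = np.random.default_rng(seed)
    X = X.astype(float).copy()
    mu, sd = np.nanmean(X), np.nanstd(X)
    mask = np.isnan(X)
    X[mask] = rng.normal(mu - shift * sd, width * sd, size=mask.sum())
    return X

X = np.array([[24.1, np.nan, 25.3],
              [np.nan, 23.8, 24.9]])
print(downshift_gaussian_impute(X))
```

The group-wise variant applies the same idea per sample group instead of globally, so each condition's gaps are filled from its own down-shifted distribution.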
A weight matrix is computed alongside imputed data to down-weight peptides where an entire condition is missing (i.e., MNAR-likely entries). This weighting is passed to limma's lmFit and its effect on statistical testing is compared against unweighted analysis.
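A minimal sketch of such a weight matrix follows; the `missingness_weights` helper, the group definitions, and the 0.01/1.0 weight values are illustrative choices, not the repository's actual implementation:

```python
import numpy as np
import pandas as pd

def missingness_weights(df, groups, low=0.01, high=1.0):
    """Build a peptide x sample weight matrix for limma's lmFit.
    Entries belonging to a condition in which the peptide was entirely
    missing (MNAR-likely, so any imputed value there is unsupported by
    observations) get a small weight; all other entries get full
    weight. The 0.01 / 1.0 values are illustrative."""
    W = pd.DataFrame(high, index=df.index, columns=df.columns)
    for group, cols in groups.items():
        all_missing = df[cols].isna().all(axis=1)
        W.loc[all_missing, cols] = low
    return W

df = pd.DataFrame({"A1": [1.0, np.nan], "A2": [2.0, np.nan],
                   "B1": [np.nan, 3.0], "B2": [1.5, 2.5]})
groups = {"A": ["A1", "A2"], "B": ["B1", "B2"]}
print(missingness_weights(df, groups))
```

The second peptide is entirely missing in condition A, so its A-column weights are down-weighted, while partially observed entries (e.g. the first peptide in condition B) keep full weight.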
A synthetic spike-in experiment is generated with 5 000 proteins across replicated conditions and defined log₂ fold changes. Controlled MNAR and MAR missing-value fractions (10–50%) are introduced and the full panel of imputation methods is applied. Performance is evaluated with NRMSE and the QuEStVar metric, which combines differential expression testing and equivalence testing into a single interpretable benchmark.
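One common way to compute NRMSE over the imputed entries only (normalising by the standard deviation of the true values, so 0 is perfect recovery and about 1 is no better than constant imputation; the notebooks may use a different normalisation) is:

```python
import numpy as np

def nrmse(truth, imputed, mask):
    """Normalised root-mean-square error over the masked (imputed)
    entries: RMSE divided by the standard deviation of the true
    values at those entries. One of several common NRMSE variants."""
    err = imputed[mask] - truth[mask]
    return float(np.sqrt(np.mean(err ** 2)) / np.std(truth[mask]))

rng = np.random.default_rng(1)
truth = rng.normal(25.0, 2.0, size=(200, 6))
mask = rng.random(truth.shape) < 0.2       # 20% of entries "missing"

mean_filled = truth.copy()
mean_filled[mask] = truth[~mask].mean()    # naive mean imputation
print(nrmse(truth, mean_filled, mask))     # close to 1 by construction
```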
```
Imputation/
├── 2022_Frohlich/            # Real-data benchmark notebooks
│   ├── 01-DataPreparation.ipynb
│   ├── 02-ImputationEffect.ipynb
│   ├── 03-Comparisons.ipynb
│   ├── 04-CombinedBenchmark.ipynb
│   └── runLimma.R
├── Simulation/               # Simulation-based validation notebooks
│   ├── EstablishLogic.ipynb
│   ├── SingleMethod.ipynb
│   └── visualize_results.ipynb
├── pyFuncs/                  # Helper modules used by the notebooks
│   ├── __init__.py
│   ├── utils.py
│   ├── plots.py
│   └── questvar.py
├── LICENSE
├── CITATION.cff
└── README.md
```

All notebooks are preserved with frozen outputs and should be interpreted as archival research artifacts.
| Notebook | Description |
|---|---|
| `01-DataPreparation.ipynb` | Loads DIA-NN MaxLFQ output, maps peptides to a FASTA reference, constructs metadata, renames samples, median-centres intensities, and defines the two benchmark comparisons (D vs B and D vs A) |
| `02-ImputationEffect.ipynb` | Applies downshifted Gaussian imputation and quantifies the resulting shifts in peptide-level means, log₂ fold changes, and CV distributions; highlights representative peptide-level examples |
| `03-Comparisons.ipynb` | Runs limma (with and without missingness-informed weighting) across all eight imputation methods and evaluates each with ROC curves, PR curves, AUC, and statistical-status counts for both comparisons |
| `04-CombinedBenchmark.ipynb` | Aggregates benchmark results across both comparisons; produces summary figures comparing F1, MCC, accuracy, ROC-AUC, and PR-AUC for all methods |
| `runLimma.R` | R script that executes limma differential-expression testing on imputed data tables exported by the Python notebooks and writes results back as Feather files |
| Notebook | Description |
|---|---|
| `EstablishLogic.ipynb` | Builds the full simulation pipeline: generates correlated protein intensity matrices, creates spike-in metadata, adds biologically-informed absence, introduces MNAR/MAR missingness, applies all imputation methods, and runs speed benchmarks |
| `SingleMethod.ipynb` | Evaluates each imputation method independently across a range of MV rates (10–50%) and MNAR/MAR mixing proportions; visualises imputed vs. original distributions across subsets |
| `visualize_results.ipynb` | Loads pre-computed benchmark outputs and produces summary visualisations comparing NRMSE and QuEStVar metrics across methods and MV scenarios |
| Module | Contents |
|---|---|
| `utils.py` | Imputation algorithm implementations (kNN, random forest, linear regression, decision tree, PCA/SVD, downshifted Gaussian, group-wise Gaussian, random sampling, mean/median), FASTA-to-DataFrame parser, CV calculation, missingness weight-matrix construction, limma result label generation |
| `plots.py` | Figure generation utilities: colour palette display, distribution plots, CV comparison figures, ROC/PR overlay helpers |
| `questvar.py` | Implementation of the QuEStVar benchmark metric combining differential and equivalence testing to evaluate imputation-induced changes in statistical conclusions |
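For orientation, the kNN and regression/tree-based methods listed above correspond to standard scikit-learn imputers. The sketch below uses `KNNImputer` and `IterativeImputer` directly on a toy matrix; it is not a copy of the `utils.py` implementations:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import KNNImputer, IterativeImputer
from sklearn.linear_model import LinearRegression

# Toy peptide x sample intensity matrix with missing entries.
X = np.array([[24.0, 25.1, np.nan, 24.7],
              [23.5, np.nan, 24.2, 23.9],
              [24.8, 25.6, 25.0, np.nan],
              [np.nan, 24.9, 24.4, 24.1],
              [25.2, 25.8, 25.5, 25.3]])

# kNN imputation: fill each gap from the k most similar peptides.
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

# Linear-regression IterativeImputer: model each column as a function
# of the others and cycle until the fills stabilise.
X_lin = IterativeImputer(estimator=LinearRegression(),
                         random_state=0).fit_transform(X)

print(np.isnan(X_knn).any(), np.isnan(X_lin).any())
```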
Requires Python 3.8+ and R.
```
pip install -r requirements.txt
```

R dependencies used by `runLimma.R`: limma (Bioconductor), feather.
**Important:** The notebooks depend on local `data/` inputs (DIA-NN and Spectronaut export files) and FASTA references that are not distributed with this repository. Cells will not execute from a fresh clone without the corresponding data files.
- The notebooks in `2022_Frohlich/` and `Simulation/` are preserved with frozen outputs as part of the thesis record and can be read in full without re-executing.
- Raw and processed DIA-NN/Spectronaut data are not included in this repository.
- This repository is best understood as a documented analysis snapshot, not as a one-command reproducible pipeline.
- Helper utilities in `pyFuncs/` (imputation algorithms, QuEStVar, figure helpers) are self-contained enough to be reused independently of the thesis data.
- The benchmark is scoped to peptide-level LFQ data from Spectronaut and DIA-NN exports; other DIA software would require adapted column mapping.
- Performance conclusions are specific to the Fröhlich et al. 2022 spike-in design and the simulated experiment parameters used here. Transferability to other sample types or acquisition settings is not established.
- The full notebook execution requires the original DIA-NN output files and FASTA references that are not distributed.
- This is not a peer-reviewed software artifact; it reflects the analytical decisions made at the time of thesis writing.
All content in this repository — including source code, notebooks, README prose, helper modules, and frozen outputs — is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.
See LICENSE for the full license text and terms.
If you draw on the analysis or conclusions from this work, please cite the doctoral thesis that this repository accompanies:
```bibtex
@phdthesis{ergin2024thesis,
  author = {Ergin, Enes Kemal},
  title  = {{Computational interrogation of proteoform dynamics in pediatric cancer}},
  school = {University of British Columbia},
  year   = {2024},
  url    = {https://open.library.ubc.ca/soa/cIRcle/collections/ubctheses/24/items/1.0448334}
}
```

If you specifically need to reference this repository as a code or notebook artifact:
```bibtex
@misc{ergin2024imputation,
  author       = {Ergin, Enes Kemal},
  title        = {{Imputation: Missing-Value Imputation Evaluation for Peptide-Level LFQ Proteomics Data}},
  year         = {2024},
  howpublished = {\url{https://github.com/eneskemalergin/Imputation}},
  note         = {Archival thesis analysis snapshot. License: CC BY-NC 4.0}
}
```

- Thesis: Computational interrogation of proteoform dynamics in pediatric cancer
- ORCID: Enes Kemal Ergin
- Fröhlich et al. 2022 spike-in dataset: ProteomicsDB / PRIDE repository benchmark dataset
Gaps filled with a borrowed light,
Truth holds, barely, still.