**Imputation** is an archival snapshot of the missing-value imputation evaluation used in Chapter 2 of the doctoral thesis by Enes Kemal Ergin.
The repository preserves the Jupyter notebooks used to benchmark imputation strategies for peptide-level label-free quantification (LFQ) data, assess their downstream statistical consequences, and validate the findings via simulation. It is shared as a citable research artifact for understanding the thesis analysis — not as a generalized proteomics imputation tool or reusable software package.
This repository is:

- A thesis-analysis snapshot of the missing-value imputation evaluation from Chapter 2.
- A record of the statistical and visual analyses used to compare imputation methods on a real spike-in proteomics dataset.
- A simulation framework for assessing method behaviour under controlled MNAR and MAR missingness patterns.
- A small library of helper modules (`pyFuncs/`) that implement the imputation algorithms and supported the notebook workflows.

It is not:

- A packaged software library or command-line tool.
- A general-purpose pipeline for arbitrary missing-value handling in proteomics.
- A fully self-contained rerun from a fresh clone — raw Spectronaut/DIA-NN export files and FASTA references are not distributed with this repository.
- A validated or generalized framework for non-LFQ or non-Spectronaut workflows.
Missing values are pervasive in label-free quantification proteomics data and arise from two fundamentally different mechanisms: values missing completely at random (MCAR/MAR) due to stochastic sampling limitations, and values missing not at random (MNAR) because the peptide was genuinely absent or below the detection threshold. The choice of imputation strategy should reflect which mechanism is dominant, yet in practice this is rarely verified. This repository evaluates a range of imputation approaches on both real and simulated data to quantify the consequences of that choice.
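The difference between the two mechanisms can be illustrated with a small simulation. This is a minimal sketch with made-up parameters, not the repository's simulation code: MCAR removes entries uniformly at random, while MNAR censors the low-intensity tail, which biases the observed mean upward.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated log2 peptide intensities: 1000 peptides x 6 samples.
intensities = rng.normal(loc=25.0, scale=2.0, size=(1000, 6))

# MCAR: every entry has the same probability of being missing.
mcar = intensities.copy()
mcar[rng.random(mcar.shape) < 0.10] = np.nan

# MNAR: low-intensity values are preferentially censored, mimicking
# peptides at or below the detection limit.
mnar = intensities.copy()
threshold = np.quantile(intensities, 0.10)  # censor the bottom 10%
mnar[mnar < threshold] = np.nan

# MNAR missingness concentrates in the low tail, so the observed mean
# is biased upward relative to the MCAR case.
print(np.nanmean(mcar), np.nanmean(mnar))
```

An imputation method tuned for one mechanism (e.g. drawing low values for MNAR gaps) will systematically distort values that are actually missing for MCAR/MAR reasons, which is why the mechanism matters.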
A published spike-in dataset in which defined ratios of E. coli peptides are spiked into a human lymph node background is used as a ground-truth benchmark. Peptide-level MaxLFQ intensities are loaded from DIA-NN output and mapped to a FASTA reference. Five sample groups with different E. coli spike-in ratios provide two biologically motivated comparisons:
| Comparison | Expected signal |
|---|---|
| D vs B | Defined log₂ fold change for E. coli peptides; human peptides unchanged |
| D vs A | E. coli peptides present in D, fully absent in A |
Eight imputation methods are applied and their downstream statistical performance evaluated with limma differential-expression testing (optionally with a missingness-informed weighting scheme). Performance is measured with ROC curves, precision-recall curves, AUC, F1 score, Matthews correlation coefficient (MCC), and accuracy.
| Category | Method |
|---|---|
| Single-value | Mean, Median |
| Distribution-based | Minimum-value-centred Gaussian, Downshifted Gaussian (global), Group-wise downshifted Gaussian |
| Neighbour-based | k-Nearest Neighbours (kNN) |
| Regression / tree | Linear Regression (IterativeImputer), Decision Tree (IterativeImputer) |
| Ensemble / matrix | Random Forest (MissForest), PCA (truncated SVD) |
| Baseline | Random sampling from observed values |
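As an illustration of the distribution-based category, here is a minimal global downshifted-Gaussian imputer. The `shift=1.8` / `width=0.3` defaults are the widely used Perseus-style values and are assumptions here, not necessarily the parameters used in the notebooks (the actual implementations live in `pyFuncs/utils.py`):

```python
import numpy as np

def downshift_gaussian_impute(X, shift=1.8, width=0.3, seed=0):
    """Fill missing entries by sampling from a Gaussian that is
    down-shifted relative to the observed intensity distribution.
    `shift` and `width` are in units of the observed standard
    deviation; 1.8 / 0.3 are common (Perseus-style) defaults, used
    here for illustration only."""
    rng = np.random.default_rng(seed)
    X = X.astype(float).copy()
    mu, sd = np.nanmean(X), np.nanstd(X)
    mask = np.isnan(X)
    X[mask] = rng.normal(mu - shift * sd, width * sd, size=mask.sum())
    return X

X = np.array([[24.1, np.nan, 25.3],
              [np.nan, 23.8, 24.9]])
print(downshift_gaussian_impute(X))
```

The group-wise variant applies the same idea per sample group instead of globally, so each condition's gaps are filled from its own down-shifted distribution.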
A weight matrix is computed alongside imputed data to down-weight peptides where an entire condition is missing (i.e., MNAR-likely entries). This weighting is passed to limma's lmFit and its effect on statistical testing is compared against unweighted analysis.
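A minimal sketch of such a weight matrix follows; the `missingness_weights` helper, the group definitions, and the 0.01/1.0 weight values are illustrative choices, not the repository's actual implementation:

```python
import numpy as np
import pandas as pd

def missingness_weights(df, groups, low=0.01, high=1.0):
    """Build a peptide x sample weight matrix for limma's lmFit.
    Entries belonging to a condition in which the peptide was entirely
    missing (MNAR-likely, so any imputed value there is unsupported by
    observations) get a small weight; all other entries get full
    weight. The 0.01 / 1.0 values are illustrative."""
    W = pd.DataFrame(high, index=df.index, columns=df.columns)
    for group, cols in groups.items():
        all_missing = df[cols].isna().all(axis=1)
        W.loc[all_missing, cols] = low
    return W

df = pd.DataFrame({"A1": [1.0, np.nan], "A2": [2.0, np.nan],
                   "B1": [np.nan, 3.0], "B2": [1.5, 2.5]})
groups = {"A": ["A1", "A2"], "B": ["B1", "B2"]}
print(missingness_weights(df, groups))
```

The second peptide is entirely missing in condition A, so its A-column weights are down-weighted, while partially observed entries (e.g. the first peptide in condition B) keep full weight.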
A synthetic spike-in experiment is generated with 5 000 proteins across replicated conditions and defined log₂ fold changes. Controlled MNAR and MAR missing-value fractions (10–50%) are introduced and the full panel of imputation methods is applied. Performance is evaluated with NRMSE and the QuEStVar metric, which combines differential expression testing and equivalence testing into a single interpretable benchmark.
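One common way to compute NRMSE over the imputed entries only (normalising by the standard deviation of the true values, so 0 is perfect recovery and about 1 is no better than constant imputation; the notebooks may use a different normalisation) is:

```python
import numpy as np

def nrmse(truth, imputed, mask):
    """Normalised root-mean-square error over the masked (imputed)
    entries: RMSE divided by the standard deviation of the true
    values at those entries. One of several common NRMSE variants."""
    err = imputed[mask] - truth[mask]
    return float(np.sqrt(np.mean(err ** 2)) / np.std(truth[mask]))

rng = np.random.default_rng(1)
truth = rng.normal(25.0, 2.0, size=(200, 6))
mask = rng.random(truth.shape) < 0.2       # 20% of entries "missing"

mean_filled = truth.copy()
mean_filled[mask] = truth[~mask].mean()    # naive mean imputation
print(nrmse(truth, mean_filled, mask))     # close to 1 by construction
```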
```
Imputation/
├── 2022_Frohlich/            # Real-data benchmark notebooks
│   ├── 01-DataPreparation.ipynb
│   ├── 02-ImputationEffect.ipynb
│   ├── 03-Comparisons.ipynb
│   ├── 04-CombinedBenchmark.ipynb
│   └── runLimma.R
├── Simulation/               # Simulation-based validation notebooks
│   ├── EstablishLogic.ipynb
│   ├── SingleMethod.ipynb
│   └── visualize_results.ipynb
├── pyFuncs/                  # Helper modules used by the notebooks
│   ├── __init__.py
│   ├── utils.py
│   ├── plots.py
│   └── questvar.py
├── LICENSE
├── CITATION.cff
└── README.md
```

All notebooks are preserved with frozen outputs and should be interpreted as archival research artifacts.
| Notebook | Description |
|---|---|
| `01-DataPreparation.ipynb` | Loads DIA-NN MaxLFQ output, maps peptides to a FASTA reference, constructs metadata, renames samples, median-centres intensities, and defines the two benchmark comparisons (D vs B and D vs A) |
| `02-ImputationEffect.ipynb` | Applies downshifted Gaussian imputation and quantifies the resulting shifts in peptide-level means, log₂ fold changes, and CV distributions; highlights representative peptide-level examples |
| `03-Comparisons.ipynb` | Runs limma (with and without missingness-informed weighting) across all eight imputation methods and evaluates each with ROC curves, PR curves, AUC, and statistical-status counts for both comparisons |
| `04-CombinedBenchmark.ipynb` | Aggregates benchmark results across both comparisons; produces summary figures comparing F1, MCC, accuracy, ROC-AUC, and PR-AUC for all methods |
| `runLimma.R` | R script that executes limma differential-expression testing on imputed data tables exported by the Python notebooks and writes results back as Feather files |
| Notebook | Description |
|---|---|
| `EstablishLogic.ipynb` | Builds the full simulation pipeline: generates correlated protein intensity matrices, creates spike-in metadata, adds biologically-informed absence, introduces MNAR/MAR missingness, applies all imputation methods, and runs speed benchmarks |
| `SingleMethod.ipynb` | Evaluates each imputation method independently across a range of MV rates (10–50%) and MNAR/MAR mixing proportions; visualises imputed vs. original distributions across subsets |
| `visualize_results.ipynb` | Loads pre-computed benchmark outputs and produces summary visualisations comparing NRMSE and QuEStVar metrics across methods and MV scenarios |
| Module | Contents |
|---|---|
| `utils.py` | Imputation algorithm implementations (kNN, random forest, linear regression, decision tree, PCA/SVD, downshifted Gaussian, group-wise Gaussian, random sampling, mean/median), FASTA-to-DataFrame parser, CV calculation, missingness weight-matrix construction, limma result label generation |
| `plots.py` | Figure generation utilities: colour palette display, distribution plots, CV comparison figures, ROC/PR overlay helpers |
| `questvar.py` | Implementation of the QuEStVar benchmark metric combining differential and equivalence testing to evaluate imputation-induced changes in statistical conclusions |
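For orientation, the kNN and regression/tree-based methods listed above correspond to standard scikit-learn imputers. The sketch below uses `KNNImputer` and `IterativeImputer` directly on a toy matrix; it is not a copy of the `utils.py` implementations:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import KNNImputer, IterativeImputer
from sklearn.linear_model import LinearRegression

# Toy peptide x sample intensity matrix with missing entries.
X = np.array([[24.0, 25.1, np.nan, 24.7],
              [23.5, np.nan, 24.2, 23.9],
              [24.8, 25.6, 25.0, np.nan],
              [np.nan, 24.9, 24.4, 24.1],
              [25.2, 25.8, 25.5, 25.3]])

# kNN imputation: fill each gap from the k most similar peptides.
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

# Linear-regression IterativeImputer: model each column as a function
# of the others and cycle until the fills stabilise.
X_lin = IterativeImputer(estimator=LinearRegression(),
                         random_state=0).fit_transform(X)

print(np.isnan(X_knn).any(), np.isnan(X_lin).any())
```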
Requires Python 3.8+ and R.
```
pip install -r requirements.txt
```

R dependencies used by `runLimma.R`: limma (Bioconductor), feather.
**Important:** The notebooks depend on local `data/` inputs (DIA-NN and Spectronaut export files) and FASTA references that are not distributed with this repository. Cells will not execute from a fresh clone without the corresponding data files.
- The notebooks in `2022_Frohlich/` and `Simulation/` are preserved with frozen outputs as part of the thesis record and can be read in full without re-executing.
- Raw and processed DIA-NN/Spectronaut data are not included in this repository.
- This repository is best understood as a documented analysis snapshot, not as a one-command reproducible pipeline.
- Helper utilities in `pyFuncs/` (imputation algorithms, QuEStVar, figure helpers) are self-contained enough to be reused independently of the thesis data.
- The benchmark is scoped to peptide-level LFQ data from Spectronaut and DIA-NN exports; other DIA software would require adapted column mapping.
- Performance conclusions are specific to the Fröhlich et al. 2022 spike-in design and the simulated experiment parameters used here. Transferability to other sample types or acquisition settings is not established.
- The full notebook execution requires the original DIA-NN output files and FASTA references that are not distributed.
- This is not a peer-reviewed software artifact; it reflects the analytical decisions made at the time of thesis writing.
All content in this repository — including source code, notebooks, README prose, helper modules, and frozen outputs — is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.
See LICENSE for the full license text and terms.
If you draw on the analysis or conclusions from this work, please cite the doctoral thesis that this repository accompanies:
```bibtex
@phdthesis{ergin2024thesis,
  author = {Ergin, Enes Kemal},
  title  = {{Computational interrogation of proteoform dynamics in pediatric cancer}},
  school = {University of British Columbia},
  year   = {2024},
  url    = {https://open.library.ubc.ca/soa/cIRcle/collections/ubctheses/24/items/1.0448334}
}
```

If you specifically need to reference this repository as a code or notebook artifact:
```bibtex
@misc{ergin2024imputation,
  author       = {Ergin, Enes Kemal},
  title        = {{Imputation: Missing-Value Imputation Evaluation for Peptide-Level LFQ Proteomics Data}},
  year         = {2024},
  howpublished = {\url{https://github.com/eneskemalergin/Imputation}},
  note         = {Archival thesis analysis snapshot. License: CC BY-NC 4.0}
}
```

- Thesis: Computational interrogation of proteoform dynamics in pediatric cancer
- ORCID: Enes Kemal Ergin
- Fröhlich et al. 2022 spike-in dataset: ProteomicsDB / PRIDE repository benchmark dataset
Gaps filled with a borrowed light,
Truth holds, barely, still.