[FEATURE] Protein Grouping, Picked Group FDR & Fix of Picked Protein FDR by grosenberger-bruker · Pull Request #182 · lazear/sage

grosenberger-bruker · 2025-05-09T10:29:11Z

This PR introduces protein grouping with picked group FDR to Sage and fixes the picked protein FDR issue:

Fix of Picked Protein FDR
While Sage generally computes picked protein FDR correctly, we recently encountered an issue with shared peptides. For example, assume PEPTIDEAK belongs to protA, PEPTIDECK belongs to protC, and the shared peptide PEPTIDEDK belongs to both protA and protC. If PEPTIDEDK is confidently identified, it counts as a new "protein", protA/protC. With a canonical UniProtKB/Swiss-Prot DB, shared peptides typically constitute 5-10% and have similar properties to proteotypic peptides, so the effect on computing picked protein FDR is minor. However, the number of proteins in the Sage runtime log is artificially inflated (3 proteins instead of 2 in this example). The inflated count also appears in the report, but I assume most users filter these entries in downstream analysis and recount.

Solution: As proposed in the literature, we now use only proteotypic, unique, non-shared peptides to compute picked protein FDR. Shared peptides will still be reported but with protein FDR set to 1.0. This has a minor effect on canonical databases, providing more accurate numbers. For isoforms, this approach is not applicable, so we introduce protein grouping.
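
To make the counting issue concrete, here is a minimal sketch of the example above. It is not code from this PR: the function names and the peptide-to-protein map are hypothetical, and it only contrasts the naive count (every distinct parent-protein set is its own entry) with a count restricted to proteotypic peptides.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Naive count: every distinct set of parent proteins attached to a
/// peptide is treated as its own "protein" entry.
fn naive_protein_count(peptides: &BTreeMap<&str, Vec<&str>>) -> usize {
    let entries: BTreeSet<Vec<&str>> = peptides
        .values()
        .map(|proteins| {
            let mut sorted = proteins.clone();
            sorted.sort_unstable();
            sorted
        })
        .collect();
    entries.len()
}

/// Count only proteins supported by at least one proteotypic peptide,
/// i.e. a peptide that maps to exactly one protein.
fn proteotypic_protein_count(peptides: &BTreeMap<&str, Vec<&str>>) -> usize {
    let proteins: BTreeSet<&str> = peptides
        .values()
        .filter(|proteins| proteins.len() == 1)
        .map(|proteins| proteins[0])
        .collect();
    proteins.len()
}

/// The three peptides from the example (hypothetical helper).
fn example_peptides() -> BTreeMap<&'static str, Vec<&'static str>> {
    BTreeMap::from([
        ("PEPTIDEAK", vec!["protA"]),
        ("PEPTIDECK", vec!["protC"]),
        ("PEPTIDEDK", vec!["protA", "protC"]),
    ])
}
```

With the example data, the naive count yields 3 entries (protA, protC, protA/protC), while the proteotypic-only count yields 2.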

Protein Grouping with Picked Group FDR
This new module implements a protein grouping algorithm based on the IDPicker algorithm with extensions from the "Picked Group FDR approach." The Python implementation of CsoDIAq has been used as a template and for testing the IDPicker approach. Most functions are based on IDPicker, with generate_proteingroups() representing the "rescued subset grouping (rsG)" approach. Discarding shared peptides and picked FDR are implemented as part of the core Sage FDR routines.
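
For readers unfamiliar with the picked approach, the following is a generic sketch of a picked target-decoy step followed by q-value estimation. It is an illustration under assumptions, not the Sage implementation: the `Entry` struct, its `pair_id` field, and both function names are hypothetical, and the q-value here is the plain decoy/target ratio capped at 1.0.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
struct Entry {
    /// Identifier shared by a target and its decoy counterpart (hypothetical).
    pair_id: String,
    is_decoy: bool,
    score: f64,
}

/// Picked step: for each target/decoy pair keep only the higher-scoring
/// member, then sort the survivors by descending score.
fn pick(entries: &[Entry]) -> Vec<Entry> {
    let mut best: HashMap<&str, &Entry> = HashMap::new();
    for e in entries {
        best.entry(e.pair_id.as_str())
            .and_modify(|cur| {
                if e.score > cur.score {
                    *cur = e;
                }
            })
            .or_insert(e);
    }
    let mut picked: Vec<Entry> = best.into_values().cloned().collect();
    // Tie-break on pair_id so the order does not depend on hash iteration.
    picked.sort_by(|a, b| {
        b.score
            .total_cmp(&a.score)
            .then_with(|| a.pair_id.cmp(&b.pair_id))
    });
    picked
}

/// q-values over a list already sorted by descending score:
/// running decoys/targets, made monotone from the bottom up.
fn q_values(picked: &[Entry]) -> Vec<f64> {
    let mut targets = 0usize;
    let mut decoys = 0usize;
    let mut q: Vec<f64> = picked
        .iter()
        .map(|e| {
            if e.is_decoy {
                decoys += 1;
            } else {
                targets += 1;
            }
            decoys as f64 / targets.max(1) as f64
        })
        .collect();
    let mut min_so_far = 1.0f64;
    for v in q.iter_mut().rev() {
        min_so_far = min_so_far.min(*v);
        *v = min_so_far;
    }
    q
}
```

The same two steps apply whether the entries are proteins or protein groups; only the definition of a target/decoy pair changes.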

In our experience, picked group FDR with IDPicker is a simple, robust, and scalable approach. Compared to standard IDPicker, it performs better under tricky boundary conditions, albeit with some computational expense. For 10 dda-PASEF mixed proteome samples, the runtime overhead in my benchmark was 45 seconds. We have optimized some IDPicker components, but there may still be potential for further improvement.

We believe this PR will be useful for Sage, extending the current established and accepted principles to protein groups.
Best regards,
@grosenberger-bruker, @vijay-gnanasambandan-bruker, @sander-willems-bruker

References

Zhang, B., Chambers, M. C., & Tabb, D. L. (2007). Proteomic parsimony through bipartite graph analysis improves accuracy and transparency. Journal of proteome research, 6(9), 3549-3557. https://doi.org/10.1021/pr070230d
The, M., Samaras, P., Kuster, B., & Wilhelm, M. (2022). Reanalysis of ProteomicsDB using an accurate, sensitive, and scalable false discovery rate estimation approach for protein groups. Molecular & Cellular Proteomics, 21(12), 100437. https://doi.org/10.1016/j.mcpro.2022.100437
https://github.com/dg310012/CsoDIAq/blob/68abaa713eb719b488967cb34a876a71657827bd/idpicker.py
Cranney, C. W., & Meyer, J. G. (2021). CsoDIAq software for direct infusion shotgun proteome analysis. Analytical Chemistry, 93(36), 12312-12319. https://doi.org/10.1021/acs.analchem.1c02021

lazear · 2025-05-12T20:49:55Z

Hi guys,

Thanks for another valuable PR - especially one that addresses one of the big shortcomings in Sage. From a quick read-through, it looks good; I will probably just rename a couple of fields and otherwise use it as-is for now. We can tackle some performance issues here, as well as in some other places in Sage, in a future update.

lazear · 2025-05-12T20:51:34Z

crates/sage/src/idpicker.rs

+    data: Vec<(Arc<String>, Arc<String>)>,
+) -> FxHashSet<(Vec<Arc<String>>, Arc<String>)> {
+    // Phase 1: Group proteins by peptide
+    let group_by_peptides: FxHashMap<Arc<String>, Vec<Arc<String>>> = {


Is there a reason we can't use PeptideIx instead of Arc<String> to represent peptides?
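
One way to read this suggestion, sketched under assumptions: replace reference-counted strings with small integer indices so that hashing and comparison operate on a `u32`. The `PepIx` newtype and `Interner` below are hypothetical stand-ins written for this illustration; they do not reproduce Sage's `PeptideIx` or how Sage assigns indices.

```rust
use std::collections::HashMap;

/// Hypothetical index type standing in for an integer peptide identifier.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
struct PepIx(u32);

/// Maps each distinct peptide string to a stable integer index.
#[derive(Default)]
struct Interner {
    lookup: HashMap<String, PepIx>,
    names: Vec<String>,
}

impl Interner {
    /// Return the existing index for `s`, or assign the next free one.
    fn intern(&mut self, s: &str) -> PepIx {
        if let Some(&ix) = self.lookup.get(s) {
            return ix;
        }
        let ix = PepIx(self.names.len() as u32);
        self.names.push(s.to_string());
        self.lookup.insert(s.to_string(), ix);
        ix
    }

    /// Recover the original string for reporting.
    fn resolve(&self, ix: PepIx) -> &str {
        &self.names[ix.0 as usize]
    }
}

/// Group protein names by peptide index; keys are cheap to hash and copy.
fn group_by_peptide(data: &[(PepIx, &str)]) -> HashMap<PepIx, Vec<String>> {
    let mut map: HashMap<PepIx, Vec<String>> = HashMap::new();
    for (pep, prot) in data {
        map.entry(*pep).or_default().push(prot.to_string());
    }
    map
}
```

The grouping logic is unchanged by the key type; strings are only needed again when results are written out.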

lazear · 2025-05-12T20:58:47Z

crates/sage/src/idpicker.rs

+            println!("Protein  ->{:?} ,  Protein Group-> {:?} ", i, j);
+        }
+
+        assert_eq!(5, protein_map.len());


Can we add an assertion for the exact mapping? Let's make sure that it is deterministic and reproducible across invocations.
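
A sketch of what such an assertion could look like, assuming the grouping returns an ordered map. The `group_identical` function is a hypothetical stand-in, not the function under test in this PR, and the (peptide, protein) tuple order is an assumption. It uses `BTreeMap`/`BTreeSet` throughout so that group labels and iteration order do not depend on hash seeds or input order.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Map each protein to a group label; proteins with identical peptide
/// evidence share one group. Input tuples are (peptide, protein).
fn group_identical(pairs: &[(&str, &str)]) -> BTreeMap<String, String> {
    // protein -> ordered set of its peptides
    let mut evidence: BTreeMap<&str, BTreeSet<&str>> = BTreeMap::new();
    for &(peptide, protein) in pairs {
        evidence.entry(protein).or_default().insert(peptide);
    }
    // peptide set -> proteins sharing exactly that set (in name order)
    let mut groups: BTreeMap<&BTreeSet<&str>, Vec<&str>> = BTreeMap::new();
    for (protein, peptides) in &evidence {
        groups.entry(peptides).or_default().push(*protein);
    }
    // protein -> label built from the sorted member names
    let mut mapping = BTreeMap::new();
    for members in groups.values() {
        let label = members.join(";");
        for protein in members {
            mapping.insert(protein.to_string(), label.clone());
        }
    }
    mapping
}
```

A test can then compare the full map against a literal expectation, and repeat the call with the input reordered to check that the result does not change.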

grosenberger-bruker · 2025-05-14T07:06:50Z

@sander-willems-bruker further tweaked performance; protein grouping is now down to 28s from 45s.

grosenberger-bruker · 2025-05-20T06:51:04Z

We are making some further tweaks to lower memory consumption and runtime over the next few days.

grosenberger-bruker · 2025-05-23T13:12:59Z

Brief update: We are completely changing the IDPicker implementation, leading to substantial improvements. The PR will be updated within the next few days and is back at the draft stage until then.

lazear · 2025-05-23T15:06:22Z

OK, just let me know when it's ready to review!

grosenberger-bruker · 2025-05-27T07:56:28Z

@sander-willems-bruker has now re-implemented protein grouping and inference, bringing the overhead down to 0.3s from 28s. A major change was a modification to the original IDPicker approach, replacing the method used to find the minimum protein set. We believe this is now ready for review.

grosenberger-bruker · 2025-05-27T10:28:23Z

crates/sage/src/protein_grouping.rs

+//! ## Main Features
+//! - Groups proteins when peptide evidence for proteins is identical.
+//! - Infers (almost) minimal protein-group covers using bipartite graph algorithms.
+//! - Supports different protein inference strategies (e.g., "All", "Slim").


@lazear This is a parameter that is not yet exposed to the config, but it could be an option for applications where users are not interested in finding the parsimonious solution, but want to retain all alternatives while still conducting protein grouping.
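
To illustrate what an "(almost) minimal" cover means, here is a generic greedy set-cover sketch. It is a textbook heuristic chosen for illustration and is not the algorithm implemented in protein_grouping.rs; the function name and input shape are hypothetical. It repeatedly selects the protein that explains the most still-unexplained peptides, so proteins whose evidence is fully covered by others are left out.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Greedy cover: `evidence` maps protein -> set of its peptides.
/// Returns the selected proteins in selection order.
fn greedy_cover(evidence: &BTreeMap<&str, BTreeSet<&str>>) -> Vec<String> {
    let mut uncovered: BTreeSet<&str> =
        evidence.values().flatten().copied().collect();
    let mut cover = Vec::new();
    while !uncovered.is_empty() {
        // Find the protein with the largest gain; on ties the first
        // protein in name order wins because of the strict comparison.
        let mut best: Option<(&str, usize)> = None;
        for (protein, peptides) in evidence {
            let gain = peptides.iter().filter(|p| uncovered.contains(*p)).count();
            if gain > best.map_or(0, |(_, g)| g) {
                best = Some((*protein, gain));
            }
        }
        match best {
            Some((protein, _)) => {
                for peptide in &evidence[protein] {
                    uncovered.remove(peptide);
                }
                cover.push(protein.to_string());
            }
            None => break,
        }
    }
    cover
}
```

A greedy cover is not guaranteed to be minimal, which is one reason a cover might be described as "almost" minimal. A strategy that retains all alternatives would skip this selection step and report every group.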

Implement IDPicker-based protein grouping with picked group FDR and fix picked protein FDR to use only proteotypic (unique, non-shared) peptides.

- Add protein_grouping module with bipartite graph cover algorithm
- Add picked_proteingroup FDR computation in fdr module
- Fix picked_protein to discard shared peptides (protein FDR = 1.0)
- Add proteingroups, num_proteingroups, proteingroup_q to output
- Add protein_grouping and protein_grouping_peptide_fdr config options
- Support parquet and TSV output for new fields

Co-Authored-By: vijay-gnanasambandan-bruker <69693791+vijay-gnanasambandan-bruker@users.noreply.github.com>
Co-Authored-By: grosenberger-bruker <121885934+grosenberger-bruker@users.noreply.github.com>
Co-Authored-By: sander-willems-bruker <sander.willems@bruker.com>
Co-Authored-By: gadi-armony-brkr <gadi.armony@bruker.com>

lazear reviewed May 12, 2025

grosenberger-bruker marked this pull request as draft May 23, 2025 13:11

grosenberger-bruker marked this pull request as ready for review May 27, 2025 07:56

grosenberger-bruker commented May 27, 2025

lazear mentioned this pull request Jun 30, 2025

Sage seems to have MBR turned on by default - how do I turn it off #189

Closed

sander-willems-bruker mentioned this pull request Aug 21, 2025

Report all proteins in the results.sage.tsv #190

Closed

lazear force-pushed the feature/id_picker branch from db3cdee to 7edc27b on March 12, 2026 20:07

lazear force-pushed the feature/id_picker branch from 7edc27b to 4028d78 on March 12, 2026 20:10

lazear added 4 commits March 12, 2026 13:17

nit: naming

af7ef5f

nitpicks

fc51f4a

fix: clippy

33d42e4

clippy lints, nits

125263e

lazear merged commit 125263e into lazear:master Mar 12, 2026
1 check passed

lazear merged 5 commits into lazear:master from grosenberger-bruker:feature/id_picker
