
[FEATURE] Protein Grouping, Picked Group FDR & Fix of Picked Protein FDR #182

Merged: lazear merged 5 commits into lazear:master from grosenberger-bruker:feature/id_picker on Mar 12, 2026

@grosenberger-bruker
Contributor

Hi @lazear,

This PR introduces protein grouping with picked group FDR to Sage and fixes the picked protein FDR issue:

Fix of Picked Protein FDR
While Sage generally computes picked protein FDR correctly, we recently encountered an issue with shared peptides. For example, let's assume PEPTIDEAK belongs to protA, PEPTIDECK belongs to protC, and the shared peptide PEPTIDEDK belongs to both protA and protC. If PEPTIDEDK is confidently identified, it is counted as an additional "protein", protA/protC. With a canonical UniProtKB/Swiss-Prot DB, shared peptides typically constitute 5-10% and have similar properties to proteotypic peptides, so the effect on computing picked protein FDR is minor. However, the number of proteins in the Sage runtime log is artificially inflated (3 proteins instead of 2). This also appears in the report, but I assume most users will filter these entries in downstream analysis and recount.
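
To make the counting issue concrete, here is a minimal standalone sketch of the example above. It is not Sage code: the function names and the `;`-joined label are assumptions chosen only for illustration. It contrasts counting every distinct protein list as a "protein" with counting only proteins that have proteotypic evidence.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Count distinct "protein" labels when each peptide's full protein list is
/// treated as one label, so a shared peptide creates a combined label.
/// (Hypothetical helper for illustration, not part of Sage.)
fn count_protein_labels(peptides: &BTreeMap<&str, Vec<&str>>) -> usize {
    let labels: BTreeSet<String> = peptides
        .values()
        .map(|proteins| {
            let mut sorted = proteins.clone();
            sorted.sort();
            sorted.join(";")
        })
        .collect();
    labels.len()
}

/// Count proteins supported by at least one proteotypic peptide,
/// i.e. a peptide that maps to exactly one protein.
fn count_proteins_with_unique_evidence(peptides: &BTreeMap<&str, Vec<&str>>) -> usize {
    let proteins: BTreeSet<&str> = peptides
        .values()
        .filter(|proteins| proteins.len() == 1)
        .map(|proteins| proteins[0])
        .collect();
    proteins.len()
}

/// The three peptides from the example in the PR description.
fn example_peptides() -> BTreeMap<&'static str, Vec<&'static str>> {
    BTreeMap::from([
        ("PEPTIDEAK", vec!["protA"]),
        ("PEPTIDECK", vec!["protC"]),
        ("PEPTIDEDK", vec!["protA", "protC"]),
    ])
}
```

With the example data, the first count is 3 (protA, protC, and the combined protA;protC label), while the second is 2.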

Solution: As proposed in the literature, we now use only proteotypic (unique, non-shared) peptides to compute picked protein FDR. Shared peptides are still reported, but with protein FDR set to 1.0. This has a minor effect on canonical databases and yields more accurate protein counts. For isoforms, this approach is not applicable, so we introduce protein grouping.
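
For readers unfamiliar with the picked approach, the sketch below shows the general idea under stated assumptions: each target protein is paired with its decoy, only the higher-scoring member of a pair survives, and q-values are then estimated from the surviving list. The `ProteinScore` struct, the `pair_key` convention, and the decoys/targets estimator are assumptions for illustration and may differ from Sage's actual FDR routines. Scores are assumed to come from proteotypic peptides only.

```rust
use std::collections::HashMap;

/// One protein-level entry. `pair_key` links a target to its decoy
/// counterpart (hypothetical convention: accession without decoy prefix).
#[derive(Debug, Clone)]
struct ProteinScore {
    pair_key: String,
    is_decoy: bool,
    score: f64,
}

/// Picked target-decoy competition followed by q-value estimation.
/// Returns (pair_key, is_decoy, q_value), best score first.
fn picked_q_values(entries: &[ProteinScore]) -> Vec<(String, bool, f64)> {
    // Step 1: per target/decoy pair, keep only the higher-scoring entry.
    let mut winners: HashMap<&str, &ProteinScore> = HashMap::new();
    for entry in entries {
        winners
            .entry(entry.pair_key.as_str())
            .and_modify(|best| {
                if entry.score > best.score {
                    *best = entry;
                }
            })
            .or_insert(entry);
    }

    // Step 2: sort winners by descending score; ties are broken by key
    // so the result does not depend on hash map iteration order.
    let mut ranked: Vec<&ProteinScore> = winners.into_values().collect();
    ranked.sort_by(|a, b| {
        b.score
            .total_cmp(&a.score)
            .then_with(|| a.pair_key.cmp(&b.pair_key))
    });

    // Step 3: running FDR estimate = decoys / targets at each rank.
    let mut decoys = 0usize;
    let mut targets = 0usize;
    let mut fdr: Vec<f64> = Vec::with_capacity(ranked.len());
    for p in &ranked {
        if p.is_decoy {
            decoys += 1;
        } else {
            targets += 1;
        }
        fdr.push(decoys as f64 / targets.max(1) as f64);
    }

    // Step 4: q-value = minimum FDR at this rank or any lower-scoring rank,
    // capped at 1.0.
    let mut q_min = f64::INFINITY;
    for i in (0..fdr.len()).rev() {
        q_min = q_min.min(fdr[i]);
        fdr[i] = q_min.min(1.0);
    }

    ranked
        .iter()
        .zip(fdr)
        .map(|(p, q)| (p.pair_key.clone(), p.is_decoy, q))
        .collect()
}
```

Because only one member of each pair survives, low-scoring decoys that lose to a confidently identified target no longer inflate the decoy count, which is the main motivation for the picked strategy.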

Protein Grouping with Picked Group FDR
This new module implements a protein grouping algorithm based on the IDPicker algorithm with extensions from the "Picked Group FDR approach." The Python implementation of CsoDIAq has been used as a template and for testing the IDPicker approach. Most functions are based on IDPicker, with generate_proteingroups() representing the "rescued subset grouping (rsG)" approach. Discarding shared peptides and picked FDR are implemented as part of the core Sage FDR routines.
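
The first step of IDPicker-style grouping, merging proteins whose identified peptide sets are identical, can be sketched as follows. This is a simplified illustration using only the standard library, not the code in this PR; `group_identical_evidence` is a hypothetical name, and the rescued subset grouping step is not shown.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Merge proteins whose identified peptide sets are identical.
/// Input: (peptide, protein) pairs.
/// Output: (protein accessions, peptide set) per group, in sorted order.
/// (Hypothetical helper for illustration, not the PR's implementation.)
fn group_identical_evidence(pairs: &[(&str, &str)]) -> Vec<(Vec<String>, Vec<String>)> {
    // Collect the peptide set observed for each protein.
    let mut by_protein: BTreeMap<&str, BTreeSet<&str>> = BTreeMap::new();
    for &(peptide, protein) in pairs {
        by_protein.entry(protein).or_default().insert(peptide);
    }

    // Proteins with the same peptide set land under the same key.
    let mut by_evidence: BTreeMap<BTreeSet<&str>, Vec<&str>> = BTreeMap::new();
    for (protein, peptides) in by_protein {
        by_evidence.entry(peptides).or_default().push(protein);
    }

    by_evidence
        .into_iter()
        .map(|(peptides, proteins)| {
            (
                proteins.into_iter().map(String::from).collect(),
                peptides.into_iter().map(String::from).collect(),
            )
        })
        .collect()
}
```

Using ordered maps here keeps the output stable across runs, at the cost of some speed compared with hash-based containers.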

In our experience, picked group FDR with IDPicker is a simple, robust, and scalable approach. Compared to standard IDPicker, it performs better under tricky boundary conditions, albeit with some computational expense. For 10 dda-PASEF mixed proteome samples, the runtime overhead in my benchmark was 45 seconds. We have optimized some IDPicker components, but there may still be potential for further improvement.

We believe this PR will be useful for Sage, extending the current established and accepted principles to protein groups.
Best regards,
@grosenberger-bruker, @vijay-gnanasambandan-bruker, @sander-willems-bruker

References

  1. Zhang, B., Chambers, M. C., & Tabb, D. L. (2007). Proteomic parsimony through bipartite graph analysis improves accuracy and transparency. Journal of Proteome Research, 6(9), 3549-3557. https://doi.org/10.1021/pr070230d
  2. The, M., Samaras, P., Kuster, B., & Wilhelm, M. (2022). Reanalysis of ProteomicsDB using an accurate, sensitive, and scalable false discovery rate estimation approach for protein groups. Molecular & Cellular Proteomics, 21(12), 100437. https://doi.org/10.1016/j.mcpro.2022.100437
  3. https://github.com/dg310012/CsoDIAq/blob/68abaa713eb719b488967cb34a876a71657827bd/idpicker.py
  4. Cranney, C. W., & Meyer, J. G. (2021). CsoDIAq software for direct infusion shotgun proteome analysis. Analytical Chemistry, 93(36), 12312-12319. https://doi.org/10.1021/acs.analchem.1c02021

@lazear
Owner

lazear commented May 12, 2025

Hi guys,

Thanks for another valuable PR - especially one that addresses one of the big shortcomings in Sage. From a quick readover, it looks good; I will probably just rename a couple of fields and otherwise use it as-is for now. We can tackle some performance issues here, as well as in some other places in Sage, in a future update.

data: Vec<(Arc<String>, Arc<String>)>,
) -> FxHashSet<(Vec<Arc<String>>, Arc<String>)> {
// Phase 1: Group proteins by peptide
let group_by_peptides: FxHashMap<Arc<String>, Vec<Arc<String>>> = {
Owner

Is there a reason we can't use PeptideIx instead of Arc<String> to represent peptides?
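
As background for this question, the sketch below shows the general idea of replacing reference-counted strings with small integer handles. `PepId` and `PeptideInterner` are hypothetical stand-ins written for this illustration; they are not Sage's `PeptideIx` type, which may be defined and populated differently.

```rust
use std::collections::HashMap;

/// Hypothetical compact peptide identifier (a stand-in, not Sage's `PeptideIx`).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
struct PepId(u32);

/// Maps each distinct peptide sequence to a dense integer id.
#[derive(Default)]
struct PeptideInterner {
    ids: HashMap<String, PepId>,
    sequences: Vec<String>,
}

impl PeptideInterner {
    /// Return the existing id for `sequence`, or assign the next free one.
    fn intern(&mut self, sequence: &str) -> PepId {
        if let Some(&id) = self.ids.get(sequence) {
            return id;
        }
        let id = PepId(self.sequences.len() as u32);
        self.ids.insert(sequence.to_string(), id);
        self.sequences.push(sequence.to_string());
        id
    }

    /// Look up the sequence that belongs to an id.
    fn resolve(&self, id: PepId) -> &str {
        &self.sequences[id.0 as usize]
    }
}
```

Integer ids are cheap to copy, hash, and compare, so grouping code that works on ids avoids repeated string hashing and reference-count updates.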

println!("Protein ->{:?} , Protein Group-> {:?} ", i, j);
}

assert_eq!(5, protein_map.len());
Owner

Can we add an assertion for the exact mapping? Let's make sure that it is deterministic and reproducible across invocations.
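
One way to make such an assertion robust is to normalize the mapping before comparing it, so the check does not depend on hash map iteration order. The sketch below assumes the mapping has a shape like `HashMap<String, Vec<String>>` (protein to group members); the real type of `protein_map` in the test may differ, and `canonical_mapping` is a hypothetical helper.

```rust
use std::collections::{BTreeMap, HashMap};

/// Normalize a protein -> group-members mapping for exact comparison:
/// proteins are ordered by the BTreeMap, and the members of each group
/// are sorted, deduplicated, and joined into one label.
/// (Hypothetical helper; the map shape is an assumption.)
fn canonical_mapping(map: &HashMap<String, Vec<String>>) -> BTreeMap<String, String> {
    map.iter()
        .map(|(protein, group)| {
            let mut members = group.clone();
            members.sort();
            members.dedup();
            (protein.clone(), members.join("/"))
        })
        .collect()
}
```

A test can then compare `canonical_mapping(&protein_map)` against a hand-written `BTreeMap` of expected labels, which fails on any change in group membership rather than only on a change in the number of entries.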

@grosenberger-bruker
Contributor Author

@sander-willems-bruker further tweaked performance; protein grouping is now down to 28s from 45s.

@grosenberger-bruker
Contributor Author

We are doing some further tweaks to lower memory consumption and runtime over the next few days.

@grosenberger-bruker marked this pull request as draft on May 23, 2025, 13:11
@grosenberger-bruker
Contributor Author

Brief update: We are completely changing the IDPicker implementation, leading to substantial improvements. The PR will be updated within the next few days and is back at the draft stage until then.

@lazear
Owner

lazear commented May 23, 2025

OK, just let me know when it's ready to review!

@grosenberger-bruker
Contributor Author

@sander-willems-bruker has now re-implemented protein grouping and inference, bringing the overhead down from 28s to 0.3s. A major change was a modification to the original IDPicker approach: the method for finding the minimum protein set was replaced. We believe this is now ready for review.

@grosenberger-bruker marked this pull request as ready for review on May 27, 2025, 07:56
//! ## Main Features
//! - Groups proteins when peptide evidence for proteins is identical.
//! - Infers (almost) minimal protein-group covers using bipartite graph algorithms.
//! - Supports different protein inference strategies (e.g., "All", "Slim").
Contributor Author

@lazear This is a parameter that is not yet exposed to the config, but it could be an option for applications where users are not interested in finding the parsimonious solution, but want to retain all alternatives while still conducting protein grouping.
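
To illustrate the difference between the two strategies, the sketch below uses a classic greedy set cover for the parsimonious case and returns every group otherwise. This is an assumption-laden illustration: the `Inference` enum and `infer_groups` function are hypothetical, the greedy heuristic is a textbook technique and not necessarily the cover algorithm implemented in this PR, and the exact semantics of "All" and "Slim" in the module may differ.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical inference modes mirroring the "All" / "Slim" idea.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Inference {
    All,
    Slim,
}

/// Select protein groups that explain all peptides.
/// `groups` maps a group label to its peptide set.
/// `All` keeps every group; `Slim` runs a greedy set cover.
fn infer_groups(groups: &BTreeMap<String, BTreeSet<String>>, mode: Inference) -> Vec<String> {
    if mode == Inference::All {
        return groups.keys().cloned().collect();
    }

    // Every peptide starts out unexplained.
    let mut uncovered: BTreeSet<&str> = groups
        .values()
        .flatten()
        .map(String::as_str)
        .collect();
    let mut selected = Vec::new();

    while !uncovered.is_empty() {
        // Pick the group explaining the most uncovered peptides.
        // Ties go to the first label in sorted order (strict `>` below).
        let mut best: Option<(&String, usize)> = None;
        for (label, peptides) in groups {
            let gain = peptides
                .iter()
                .filter(|p| uncovered.contains(p.as_str()))
                .count();
            if gain > best.map_or(0, |(_, g)| g) {
                best = Some((label, gain));
            }
        }
        let (label, _) = best.expect("every uncovered peptide belongs to some group");
        for p in &groups[label] {
            uncovered.remove(p.as_str());
        }
        selected.push(label.clone());
    }
    selected
}
```

A greedy cover is not guaranteed to be minimal, which matches the "(almost) minimal" wording in the module documentation only in spirit; the deterministic tie-breaking is a design choice to keep results reproducible.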

Implement IDPicker-based protein grouping with picked group FDR and fix
picked protein FDR to use only proteotypic (unique, non-shared) peptides.

- Add protein_grouping module with bipartite graph cover algorithm
- Add picked_proteingroup FDR computation in fdr module
- Fix picked_protein to discard shared peptides (protein FDR = 1.0)
- Add proteingroups, num_proteingroups, proteingroup_q to output
- Add protein_grouping and protein_grouping_peptide_fdr config options
- Support parquet and TSV output for new fields

Co-Authored-By: vijay-gnanasambandan-bruker <69693791+vijay-gnanasambandan-bruker@users.noreply.github.com>
Co-Authored-By: grosenberger-bruker <121885934+grosenberger-bruker@users.noreply.github.com>
Co-Authored-By: sander-willems-bruker <sander.willems@bruker.com>
Co-Authored-By: gadi-armony-brkr <gadi.armony@bruker.com>
@lazear force-pushed the feature/id_picker branch from 7edc27b to 4028d78 on March 12, 2026, 20:10
@lazear merged commit 125263e into lazear:master on Mar 12, 2026
1 check passed