[FEATURE] Protein Grouping, Picked Group FDR & Fix of Picked Protein FDR #182
Hi guys, thanks for another valuable PR, especially one that addresses one of the big shortcomings in Sage. From a quick read-through, it looks good; I will probably just rename a couple of fields and otherwise use it as-is for now. We can tackle some performance issues here, as well as in some other places in Sage, in a future update.
crates/sage/src/idpicker.rs
Outdated
    data: Vec<(Arc<String>, Arc<String>)>,
) -> FxHashSet<(Vec<Arc<String>>, Arc<String>)> {
    // Phase 1: Group proteins by peptide
    let group_by_peptides: FxHashMap<Arc<String>, Vec<Arc<String>>> = {
Is there a reason we can't use PeptideIx instead of Arc<String> to represent peptides?
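As a minimal sketch of what this suggestion could look like, peptide strings can be interned once and handled as small integer indices afterwards. The `PeptideIx` below is a local stand-in defined only for illustration, and `PeptideInterner` is a hypothetical helper; neither is taken from the PR or from Sage's actual types.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Local stand-in for an integer peptide index (an assumption for this sketch).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct PeptideIx(pub u32);

/// Hypothetical helper that hands out one index per distinct peptide sequence.
#[derive(Default)]
pub struct PeptideInterner {
    lookup: HashMap<Arc<String>, PeptideIx>,
    sequences: Vec<Arc<String>>,
}

impl PeptideInterner {
    /// Returns the existing index for a known peptide, or assigns the next free one.
    pub fn intern(&mut self, peptide: &Arc<String>) -> PeptideIx {
        if let Some(&ix) = self.lookup.get(peptide) {
            return ix;
        }
        let ix = PeptideIx(self.sequences.len() as u32);
        self.sequences.push(Arc::clone(peptide));
        self.lookup.insert(Arc::clone(peptide), ix);
        ix
    }

    /// Maps an index back to its peptide sequence.
    pub fn resolve(&self, ix: PeptideIx) -> &str {
        self.sequences[ix.0 as usize].as_str()
    }
}
```

Hashing and comparing a `u32` is cheaper than doing the same on a string, which is the main appeal of grouping on indices rather than on `Arc<String>`.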
crates/sage/src/idpicker.rs
Outdated
        println!("Protein ->{:?} , Protein Group-> {:?} ", i, j);
    }

    assert_eq!(5, protein_map.len());
Can we add an assertion for the exact mapping? Let's make sure that it is deterministic and reproducible across invocations.
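One way to make such an assertion possible, sketched under assumptions: build the mapping from ordered containers so that group membership and labels do not depend on hash iteration order or on input order. The function name `canonical_protein_map`, the `(protein, peptide)` input layout, and the `;` separator are all hypothetical choices for this example, not the PR's API.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical helper: maps each protein to a canonical group label.
/// Proteins with identical peptide evidence share one label, and the
/// label lists the group members in sorted order.
pub fn canonical_protein_map(pairs: &[(&str, &str)]) -> BTreeMap<String, String> {
    // Collect the peptide evidence of every protein.
    let mut evidence: BTreeMap<&str, BTreeSet<&str>> = BTreeMap::new();
    for &(protein, peptide) in pairs {
        evidence.entry(protein).or_default().insert(peptide);
    }

    // Proteins with identical evidence fall into the same group.
    // Iterating the ordered map keeps the members of each group sorted.
    let mut groups: BTreeMap<&BTreeSet<&str>, Vec<&str>> = BTreeMap::new();
    for (protein, peptides) in &evidence {
        groups.entry(peptides).or_default().push(*protein);
    }

    // Emit protein -> group label.
    let mut map = BTreeMap::new();
    for members in groups.values() {
        let label = members.join(";");
        for protein in members {
            map.insert(protein.to_string(), label.clone());
        }
    }
    map
}
```

With a mapping like this, a test can compare against the exact expected pairs and also check that shuffling the input yields the same result.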
@sander-willems-bruker further tweaked performance; protein grouping is now down to 28s from 45s.

We are doing some further tweaks to lower memory consumption and runtime over the next few days.

Brief update: We are completely changing the IDPicker implementation, leading to substantial improvements. The PR will be updated within the next few days and is back at the draft stage until then.

OK, just let me know when it's ready to review!

@sander-willems-bruker has now re-implemented protein grouping and inference, bringing the overhead down to 0.3s from 28s. A major change was a modification to the original IDPicker approach, replacing the method used to find the minimum protein set. We believe this is now ready for review.
//! ## Main Features
//! - Groups proteins when peptide evidence for proteins is identical.
//! - Infers (almost) minimal protein-group covers using bipartite graph algorithms.
//! - Supports different protein inference strategies (e.g., "All", "Slim").
@lazear This is a parameter that is not yet exposed to the config, but it could be an option for applications where users are not interested in finding the parsimonious solution, but want to retain all alternatives while still conducting protein grouping.
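To illustrate what an "(almost) minimal" cover means, here is a plain greedy set cover over protein groups. It is a generic textbook heuristic, named as such, and is not the bipartite graph algorithm used in this PR; the function name `greedy_cover` and the input layout (group name mapped to its peptide set) are assumptions for the sketch.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Greedy set cover (hypothetical helper): repeatedly selects the protein
/// group that explains the most peptides not yet covered.
pub fn greedy_cover(groups: &BTreeMap<String, BTreeSet<String>>) -> Vec<String> {
    // Every peptide that appears in at least one group must be explained.
    let mut uncovered: BTreeSet<&String> = groups.values().flatten().collect();
    let mut cover = Vec::new();

    while !uncovered.is_empty() {
        // Find the group with the largest gain; ties go to the first
        // group name in sorted order, which keeps the result deterministic.
        let mut best: Option<(&String, usize)> = None;
        for (name, peptides) in groups {
            let gain = peptides.iter().filter(|p| uncovered.contains(*p)).count();
            if gain > best.map_or(0, |(_, g)| g) {
                best = Some((name, gain));
            }
        }
        // Defensive exit; cannot trigger while uncovered peptides remain.
        let Some((name, _)) = best else { break };
        for peptide in &groups[name] {
            uncovered.remove(peptide);
        }
        cover.push(name.clone());
    }
    cover
}
```

A strategy that retains all alternatives would skip this selection step and report every group; a parsimonious strategy would report only the selected cover.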
db3cdee to 7edc27b
Implement IDPicker-based protein grouping with picked group FDR and fix picked protein FDR to use only proteotypic (unique, non-shared) peptides.

- Add protein_grouping module with bipartite graph cover algorithm
- Add picked_proteingroup FDR computation in fdr module
- Fix picked_protein to discard shared peptides (protein FDR = 1.0)
- Add proteingroups, num_proteingroups, proteingroup_q to output
- Add protein_grouping and protein_grouping_peptide_fdr config options
- Support parquet and TSV output for new fields

Co-Authored-By: vijay-gnanasambandan-bruker <69693791+vijay-gnanasambandan-bruker@users.noreply.github.com>
Co-Authored-By: grosenberger-bruker <121885934+grosenberger-bruker@users.noreply.github.com>
Co-Authored-By: sander-willems-bruker <sander.willems@bruker.com>
Co-Authored-By: gadi-armony-brkr <gadi.armony@bruker.com>
7edc27b to 4028d78
Hi @lazear,
This PR introduces protein grouping with picked group FDR to Sage and fixes the picked protein FDR issue:
Fix of Picked Protein FDR
While Sage generally computes picked protein FDR correctly, we recently encountered an issue with shared peptides. For example, let's assume PEPTIDEAK belongs to protA, PEPTIDECK belongs to protC, and the shared PEPTIDEDK belongs to both protA and protC. If PEPTIDEDK is confidently identified, it counts as a new "protein" protA/protC. With a canonical UniProtKB/Swiss-Prot DB, shared peptides typically constitute 5-10% and have similar properties to proteotypic peptides, so the effect on computing picked protein FDR is minor. However, the number of proteins in the Sage runtime log is artificially inflated (3 proteins in this example: protA, protC, and protA/protC, instead of 2). This also appears in the report, but I assume most users will filter these entries in downstream analysis and recount.
Solution: As proposed in the literature, we now use only proteotypic (unique, non-shared) peptides to compute picked protein FDR. Shared peptides are still reported, but with protein FDR set to 1.0. This has a minor effect on canonical databases and provides more accurate protein counts. For isoforms, this approach is not applicable, so we introduce protein grouping.
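The effect of this rule on the example above can be sketched as follows. The struct `PeptideHit`, its field names, and the function `mask_shared_peptides` are hypothetical and exist only for this illustration; they are not the types or functions in Sage's FDR module.

```rust
use std::collections::BTreeSet;

/// Hypothetical record for one identified peptide.
pub struct PeptideHit {
    pub peptide: String,
    pub proteins: Vec<String>,
    pub protein_q: f32,
}

/// Sets protein_q = 1.0 for shared peptides (more than one protein) and
/// returns the number of distinct proteins supported by proteotypic peptides.
pub fn mask_shared_peptides(hits: &mut [PeptideHit]) -> usize {
    let mut proteins = BTreeSet::new();
    for hit in hits.iter_mut() {
        if hit.proteins.len() == 1 {
            // Proteotypic peptide: contributes evidence for exactly one protein.
            proteins.insert(hit.proteins[0].clone());
        } else {
            // Shared (or unassigned) peptide: still reported, but excluded
            // from protein-level FDR.
            hit.protein_q = 1.0;
        }
    }
    proteins.len()
}
```

For PEPTIDEAK (protA), PEPTIDECK (protC), and PEPTIDEDK (protA and protC), this counts two proteins, and only the PEPTIDEDK entry has its protein-level value set to 1.0.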
Protein Grouping with Picked Group FDR
This new module implements a protein grouping algorithm based on the IDPicker algorithm with extensions from the "Picked Group FDR approach." The Python implementation of CsoDIAq has been used as a template and for testing the IDPicker approach. Most functions are based on IDPicker, with generate_proteingroups() representing the "rescued subset grouping (rsG)" approach. Discarding shared peptides and picked FDR are implemented as part of the core Sage FDR routines.
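For readers unfamiliar with the "picked" idea, here is a generic sketch of picked target-decoy q-values applied to groups. It assumes each entry already carries the best score of its target and of its paired decoy, and that higher scores are better; `ScorePair` and `picked_q_values` are hypothetical names, and this is not the PR's implementation.

```rust
/// Hypothetical input: best score of a group's target and of its paired decoy.
pub struct ScorePair {
    pub name: String,
    pub target: Option<f64>,
    pub decoy: Option<f64>,
}

/// Returns (name, is_decoy, q_value) for the picked winners, best score first.
pub fn picked_q_values(pairs: &[ScorePair]) -> Vec<(String, bool, f64)> {
    // "Picked" step: of each target/decoy pair, keep only the higher-scoring one.
    let mut picked: Vec<(&str, bool, f64)> = pairs
        .iter()
        .filter_map(|p| match (p.target, p.decoy) {
            (Some(t), Some(d)) if t >= d => Some((p.name.as_str(), false, t)),
            (Some(_), Some(d)) => Some((p.name.as_str(), true, d)),
            (Some(t), None) => Some((p.name.as_str(), false, t)),
            (None, Some(d)) => Some((p.name.as_str(), true, d)),
            (None, None) => None,
        })
        .collect();
    picked.sort_by(|a, b| b.2.total_cmp(&a.2));

    // Running FDR estimate: decoys seen so far divided by targets seen so far.
    let (mut targets, mut decoys) = (0usize, 0usize);
    let mut out: Vec<(String, bool, f64)> = picked
        .iter()
        .map(|&(name, is_decoy, _)| {
            if is_decoy { decoys += 1 } else { targets += 1 }
            (name.to_string(), is_decoy, decoys as f64 / targets.max(1) as f64)
        })
        .collect();

    // q-value: the smallest FDR at this score or any lower score, capped at 1.0.
    let mut running_min = 1.0_f64;
    for entry in out.iter_mut().rev() {
        running_min = running_min.min(entry.2);
        entry.2 = running_min;
    }
    out
}
```

Because only one member of each target/decoy pair survives the picking step, a strong target removes its decoy from the count, which is what distinguishes the picked approach from classic target-decoy counting.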
In our experience, picked group FDR with IDPicker is a simple, robust, and scalable approach. Compared to standard IDPicker, it performs better under tricky boundary conditions, albeit with some computational expense. For 10 dda-PASEF mixed proteome samples, the runtime overhead in my benchmark was 45 seconds. We have optimized some IDPicker components, but there may still be potential for further improvement.
We believe this PR will be useful for Sage, extending the current established and accepted principles to protein groups.
Best regards,
@grosenberger-bruker, @vijay-gnanasambandan-bruker, @sander-willems-bruker
References