Phase 3: DuckDB backend for out-of-core TDC confidence estimation by wsnoble · Pull Request #28 · Noble-Lab/crema

wsnoble · 2026-06-20T02:15:46Z

Summary

Adds backend='duckdb' parameter to PsmDataset.assign_confidence() and the module-level assign_confidence() — explicit opt-in, no behavior change for existing code
New crema/backends/duckdb.py implements the full TDC pipeline (spectrum competition, peptide competition, protein aggregation, FDR + q-values) using DuckDB SQL window functions; handles datasets larger than RAM via out-of-core execution
Q-value SQL uses the same tied-score-aware backward-monotone-minimum logic as qvalues.tdc() for numerical equivalence
New DuckdbTdcConfidence class is a drop-in subclass of Confidence; to_txt() and _prettify_tables() work unchanged
setup.cfg: adds [fast] (pyarrow) and [large] (duckdb≥0.8.0) optional dependency groups
18 new unit tests in tests/unit_tests/test_duckdb.py; all skipped automatically when duckdb is not installed

Limitations

Only supports method='tdc'; backend='duckdb' with method='mixmax' raises ValueError
Protein-group estimation is not implemented in the DuckDB backend (deferred)

Test plan

pytest tests/unit_tests/ passes with 91 tests (duckdb tests skipped if package absent)
pip install crema[large] installs duckdb; duckdb tests then run and pass
psms.assign_confidence(score_column="score", backend="duckdb") produces same accept counts as pandas path
to_txt() on a DuckdbTdcConfidence object writes valid output files

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Added DuckDB backend support for TDC confidence scoring at PSM, peptide, and protein levels.
- Introduced backend parameter to confidence assignment functions (default: "pandas").
- New optional dependency group large for DuckDB support.
Tests
- Added comprehensive unit tests for the DuckDB backend with output validation and numerical accuracy checks.

coderabbitai · 2026-06-20T02:15:55Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6b8a6ab8-439d-44f5-849d-1af463dc146a

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

✅ Review completed - (🔄 Check again to review again)

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch scale-phase-3

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Adds an explicit opt-in backend='duckdb' to assign_confidence() that implements the full TDC pipeline (competition, FDR, q-values) using DuckDB SQL window functions, enabling datasets larger than RAM. - crema/backends/duckdb.py: new module with _compete_into(), _fetch_qvalues(), and run_tdc(); the q-value SQL uses the same tied-score-aware backward-monotone-minimum logic as qvalues.tdc() - crema/confidence.py: add DuckdbTdcConfidence class (subclass of Confidence) and backend param to module-level assign_confidence() - crema/dataset.py: add backend param to PsmDataset.assign_confidence(); routes backend='duckdb' to DuckdbTdcConfidence, raises ValueError for incompatible method='mixmax' - crema/__init__.py: export DuckdbTdcConfidence - setup.cfg: add [large] extras_require with duckdb>=0.8.0; add [fast] extras for pyarrow>=8.0.0 - tests/unit_tests/test_duckdb.py: 18 tests covering API, output structure, numerical agreement with pandas path, and to_txt compat; all skipped automatically when duckdb is not installed Protein-group estimation is not implemented in the DuckDB backend. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crema/backends/duckdb.py`:
- Around line 50-64: The CREATE TABLE SQL statement in the con.execute call and
other code blocks in the file do not conform to Black's formatting standards,
causing linting failures. Run the Black code formatter on the
crema/backends/duckdb.py file to automatically apply the correct formatting to
the SQL query string and all other affected code blocks referenced in the
comment (lines 95-127, 236-244, 267-273, 284-295). This will ensure the code
passes the CI linting checks.
- Around line 49-60: Define a helper function _qid that takes a string name and
returns it wrapped in double quotes with any internal double quotes escaped by
doubling them (quote becomes two quotes). Then systematically replace all direct
identifier interpolations throughout the file that currently use the pattern
"{variable_name}" with _qid(variable_name) instead. This includes the partition
variable construction on line 49, all score_col variable references, target_col
references, pep_col references, prot_col references, and score_column references
across the various SQL query construction sections in the file. The goal is to
ensure that column names containing double quote characters are properly escaped
and do not cause SQL parse failures.

In `@crema/dataset.py`:
- Around line 218-233: The current code silently falls back to pandas when an
unrecognized backend value is provided (e.g., a typo like "duckbd"), which can
mask configuration mistakes. After the if statement that checks for backend ==
"duckdb", replace the else clause with explicit validation that either accepts
known backend values like "pandas" or raises a ValueError rejecting unknown
backends. This ensures invalid backend values are caught immediately with a
clear error message rather than silently executing with an unintended backend.

In `@tests/unit_tests/test_duckdb.py`:
- Line 12: Apply Black code formatting to the pytest.importorskip call on line
12. The line exceeds Black's default line length limit and needs to be
reformatted, likely by breaking the pytest.importorskip function call with the
"duckdb" argument and reason parameter across multiple lines to meet Black's
formatting standards and unblock the CI checks.
- Around line 181-201: The test test_duckdb_psm_qvalues_match_pandas is
comparing the score column "x" instead of the actual q-values. Replace the
assertion that currently compares p_df["x"] and d_df["x"] with a comparison of
the appropriate q-value column from both dataframes. The test should validate
that the q-values calculated by the DuckDB path match those from the pandas
path, not the original scores. Identify the correct q-value column name from the
confidence_estimates dictionary and update the pd.testing.assert_series_equal
call accordingly.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9c4e1a4e-45b5-4cca-9c10-ee6598ff06a9

📥 Commits

Reviewing files that changed from the base of the PR and between 1e6fb7c and 8c59b6c.

📒 Files selected for processing (7)

crema/__init__.py
crema/backends/__init__.py
crema/backends/duckdb.py
crema/confidence.py
crema/dataset.py
setup.cfg
tests/unit_tests/test_duckdb.py

- crema/backends/duckdb.py: reformat multiline f-string calls to match Black 26.x style - tests/unit_tests/test_duckdb.py: same Black reformatting; remove noqa: E402 directives (E402 is not in ruff select, so they triggered RUF100 unused-noqa warnings) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- backends/duckdb.py: add _qid() helper to safely escape double-quote characters in column name identifiers; replace all inline f'"{col}"' interpolations throughout SQL queries - dataset.py: replace silent else fallback with explicit elif 'pandas' + else raise ValueError for unknown backend values - tests/unit_tests/test_duckdb.py: fix test_duckdb_psm_qvalues_match_pandas to compare 'crema q-value' column instead of the raw score column Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot

Copilot reviewed 6 out of 7 changed files in this pull request and generated no comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

wsnoble force-pushed the scale-phase-3 branch from 8c59b6c to 9d22d26 Compare June 20, 2026 02:21

wsnoble force-pushed the scale-phase-2 branch from 1e6fb7c to 04a4c44 Compare June 20, 2026 02:21

coderabbitai Bot reviewed Jun 20, 2026

View reviewed changes

Comment thread crema/backends/duckdb.py Outdated

Comment thread crema/backends/duckdb.py Outdated

Comment thread crema/dataset.py

Comment thread tests/unit_tests/test_duckdb.py Outdated

Comment thread tests/unit_tests/test_duckdb.py

wsnoble and others added 2 commits June 19, 2026 19:22

wsnoble requested a review from Copilot June 20, 2026 02:26

Copilot started reviewing on behalf of wsnoble June 20, 2026 02:26 View session

Copilot AI reviewed Jun 20, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Phase 3: DuckDB backend for out-of-core TDC confidence estimation#28

Phase 3: DuckDB backend for out-of-core TDC confidence estimation#28
wsnoble wants to merge 3 commits into
scale-phase-2from
scale-phase-3

wsnoble commented Jun 20, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 20, 2026 •

edited

Loading

Review skipped

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

wsnoble commented Jun 20, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Limitations

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

wsnoble commented Jun 20, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 20, 2026 •

edited

Loading