Skip to content

Phase 3: DuckDB backend for out-of-core TDC confidence estimation#28

Open
wsnoble wants to merge 3 commits into
scale-phase-2from
scale-phase-3
Open

Phase 3: DuckDB backend for out-of-core TDC confidence estimation#28
wsnoble wants to merge 3 commits into
scale-phase-2from
scale-phase-3

Conversation

@wsnoble

@wsnoble wsnoble commented Jun 20, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Adds backend='duckdb' parameter to PsmDataset.assign_confidence() and the module-level assign_confidence() — explicit opt-in, no behavior change for existing code
  • New crema/backends/duckdb.py implements the full TDC pipeline (spectrum competition, peptide competition, protein aggregation, FDR + q-values) using DuckDB SQL window functions; handles datasets larger than RAM via out-of-core execution
  • Q-value SQL uses the same tied-score-aware backward-monotone-minimum logic as qvalues.tdc() for numerical equivalence
  • New DuckdbTdcConfidence class is a drop-in subclass of Confidence; to_txt() and _prettify_tables() work unchanged
  • setup.cfg: adds [fast] (pyarrow) and [large] (duckdb≥0.8.0) optional dependency groups
  • 18 new unit tests in tests/unit_tests/test_duckdb.py; all skipped automatically when duckdb is not installed

Limitations

  • Only supports method='tdc'; backend='duckdb' with method='mixmax' raises ValueError
  • Protein-group estimation is not implemented in the DuckDB backend (deferred)

Test plan

  • pytest tests/unit_tests/ passes with 91 tests (duckdb tests skipped if package absent)
  • pip install crema[large] installs duckdb; duckdb tests then run and pass
  • psms.assign_confidence(score_column="score", backend="duckdb") produces same accept counts as pandas path
  • to_txt() on a DuckdbTdcConfidence object writes valid output files

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added DuckDB backend support for TDC confidence scoring at PSM, peptide, and protein levels.
    • Introduced backend parameter to confidence assignment functions (default: "pandas").
    • New optional dependency group large for DuckDB support.
  • Tests

    • Added comprehensive unit tests for the DuckDB backend with output validation and numerical accuracy checks.

@coderabbitai

coderabbitai Bot commented Jun 20, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6b8a6ab8-439d-44f5-849d-1af463dc146a

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • ✅ Review completed - (🔄 Check again to review again)
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch scale-phase-3

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Adds an explicit opt-in backend='duckdb' to assign_confidence() that
implements the full TDC pipeline (competition, FDR, q-values) using
DuckDB SQL window functions, enabling datasets larger than RAM.

- crema/backends/duckdb.py: new module with _compete_into(),
  _fetch_qvalues(), and run_tdc(); the q-value SQL uses the same
  tied-score-aware backward-monotone-minimum logic as qvalues.tdc()
- crema/confidence.py: add DuckdbTdcConfidence class (subclass of
  Confidence) and backend param to module-level assign_confidence()
- crema/dataset.py: add backend param to PsmDataset.assign_confidence();
  routes backend='duckdb' to DuckdbTdcConfidence, raises ValueError for
  incompatible method='mixmax'
- crema/__init__.py: export DuckdbTdcConfidence
- setup.cfg: add [large] extras_require with duckdb>=0.8.0; add [fast]
  extras for pyarrow>=8.0.0
- tests/unit_tests/test_duckdb.py: 18 tests covering API, output
  structure, numerical agreement with pandas path, and to_txt compat;
  all skipped automatically when duckdb is not installed

Protein-group estimation is not implemented in the DuckDB backend.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crema/backends/duckdb.py`:
- Around line 50-64: The CREATE TABLE SQL statement in the con.execute call and
other code blocks in the file do not conform to Black's formatting standards,
causing linting failures. Run the Black code formatter on the
crema/backends/duckdb.py file to automatically apply the correct formatting to
the SQL query string and all other affected code blocks referenced in the
comment (lines 95-127, 236-244, 267-273, 284-295). This will ensure the code
passes the CI linting checks.
- Around line 49-60: Define a helper function _qid that takes a string name and
returns it wrapped in double quotes with any internal double quotes escaped by
doubling them (quote becomes two quotes). Then systematically replace all direct
identifier interpolations throughout the file that currently use the pattern
"{variable_name}" with _qid(variable_name) instead. This includes the partition
variable construction on line 49, all score_col variable references, target_col
references, pep_col references, prot_col references, and score_column references
across the various SQL query construction sections in the file. The goal is to
ensure that column names containing double quote characters are properly escaped
and do not cause SQL parse failures.

In `@crema/dataset.py`:
- Around line 218-233: The current code silently falls back to pandas when an
unrecognized backend value is provided (e.g., a typo like "duckbd"), which can
mask configuration mistakes. After the if statement that checks for backend ==
"duckdb", replace the else clause with explicit validation that either accepts
known backend values like "pandas" or raises a ValueError rejecting unknown
backends. This ensures invalid backend values are caught immediately with a
clear error message rather than silently executing with an unintended backend.

In `@tests/unit_tests/test_duckdb.py`:
- Line 12: Apply Black code formatting to the pytest.importorskip call on line
12. The line exceeds Black's default line length limit and needs to be
reformatted, likely by breaking the pytest.importorskip function call with the
"duckdb" argument and reason parameter across multiple lines to meet Black's
formatting standards and unblock the CI checks.
- Around line 181-201: The test test_duckdb_psm_qvalues_match_pandas is
comparing the score column "x" instead of the actual q-values. Replace the
assertion that currently compares p_df["x"] and d_df["x"] with a comparison of
the appropriate q-value column from both dataframes. The test should validate
that the q-values calculated by the DuckDB path match those from the pandas
path, not the original scores. Identify the correct q-value column name from the
confidence_estimates dictionary and update the pd.testing.assert_series_equal
call accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9c4e1a4e-45b5-4cca-9c10-ee6598ff06a9

📥 Commits

Reviewing files that changed from the base of the PR and between 1e6fb7c and 8c59b6c.

📒 Files selected for processing (7)
  • crema/__init__.py
  • crema/backends/__init__.py
  • crema/backends/duckdb.py
  • crema/confidence.py
  • crema/dataset.py
  • setup.cfg
  • tests/unit_tests/test_duckdb.py

Comment thread crema/backends/duckdb.py Outdated
Comment thread crema/backends/duckdb.py Outdated
Comment thread crema/dataset.py
Comment thread tests/unit_tests/test_duckdb.py Outdated
Comment thread tests/unit_tests/test_duckdb.py
wsnoble and others added 2 commits June 19, 2026 19:22
- crema/backends/duckdb.py: reformat multiline f-string calls to match
  Black 26.x style
- tests/unit_tests/test_duckdb.py: same Black reformatting; remove
  noqa: E402 directives (E402 is not in ruff select, so they triggered
  RUF100 unused-noqa warnings)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- backends/duckdb.py: add _qid() helper to safely escape double-quote
  characters in column name identifiers; replace all inline f'"{col}"'
  interpolations throughout SQL queries
- dataset.py: replace silent else fallback with explicit elif 'pandas'
  + else raise ValueError for unknown backend values
- tests/unit_tests/test_duckdb.py: fix test_duckdb_psm_qvalues_match_pandas
  to compare 'crema q-value' column instead of the raw score column

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 6 out of 7 changed files in this pull request and generated no comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants