Phase 2: streaming writer and chunked text parser by wsnoble · Pull Request #27 · Noble-Lab/crema

wsnoble · 2026-06-19T23:25:12Z

Summary

writers/txt.py: Eliminate pd.concat() before writing. Each confidence-estimate DataFrame is now written directly to disk with mode='a', so peak memory during output no longer scales with dataset size.
parsers/txt.py: Add chunk_size parameter to read_txt(). When set, returns a generator yielding DataFrames of chunk_size rows (target column already converted to bool). Non-chunked path is unchanged.
utils.py: Add chunk_size parameter to parse_psms_txt() (used by Tide/Comet/MSGF+/etc. parsers). Returns a pandas TextFileReader iterator when set, allowing row-batch ingestion of large files.

This is Phase 2 of SCALING_PLAN.md. The chunked parser is the ingestion layer for the Phase 3 DuckDB backend.

Not implemented in this PR (deferred to Phase 3):

PsmDatasetStream — requires DuckDB
mzTab iterative parsing — pyteomics has no streaming API
pepXML — already streams via lxml.etree.iterparse

Test plan

CI passes (90 unit tests + lint across 3 OS × 3 Python versions)
Manually verify to_txt() output is identical before and after the writer change
Verify read_txt(..., chunk_size=10000) yields DataFrames with correct dtypes and row counts

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

New Features
- Text file reading now supports streaming mode: set chunk_size parameter in read_txt() to receive data as chunks via a generator.
Refactor
- Optimized text file writing implementation.

coderabbitai · 2026-06-19T23:25:19Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f5003dbd-38ab-4161-8405-3874473218ea

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch scale-phase-2

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Copilot

Pull request overview

Implements Phase 2 of the scaling plan by adding chunked/streaming ingestion for large TXT inputs and reducing peak memory usage when writing TXT outputs.

Changes:

Stream TXT writer output by appending each confidence-estimate DataFrame directly to disk instead of concatenating in memory.
Add chunk_size support to read_txt() to optionally return a generator of DataFrame chunks.
Add chunk_size support to utils.parse_psms_txt() to optionally return a TextFileReader iterator.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File	Description
crema/writers/txt.py	Switches TXT output writing to iterative `to_csv(..., mode='a')` to reduce peak memory during output.
crema/utils.py	Adds optional chunked CSV reading via `chunksize` for PSM TXT parsing.
crema/parsers/txt.py	Adds optional streaming/chunked mode to return DataFrame chunks instead of constructing a `PsmDataset`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+        first = True
+        for df in qval_list:
+            df.to_csv(
+                out_file,
+                sep=sep,
+                index=False,
+                float_format=f"%.{precision}f",
+                header=first,
+                mode="w" if first else "a",
+            )
+            first = False


+    # Streaming mode: yield chunks without constructing a PsmDataset
+    if chunk_size is not None and not isinstance(txt_files, pd.DataFrame):
+        return _read_txt_chunked(
+            utils.listify(txt_files), sep, fields, target_column, chunk_size
+        )
+


+        first = True
+        for df in qval_list:
+            df.to_csv(
+                out_file,
+                sep=sep,


coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crema/parsers/txt.py`:
- Around line 61-77: The docstring for the chunk_size parameter (lines 61-77)
documents that a generator is returned when chunk_size is set, but the
implementation at line 88 silently skips streaming mode when txt_files is a
pandas.DataFrame. Fix this inconsistency by either updating the docstring to
explicitly document that DataFrame inputs do not support chunked mode and will
always return a PsmDataset, or by adding a ValueError check in the
implementation that raises a clear error message when chunk_size is provided
with a DataFrame input to prevent unexpected behavior.

In `@crema/writers/txt.py`:
- Around line 65-75: The iterative append loop that calls to_csv() on each
DataFrame in qval_list can result in DataFrames with different column orders
being written to the same file, corrupting the output. Before the to_csv() call
for each df in the loop, reindex each DataFrame to match a consistent column
order (such as the column order of the first DataFrame in the list) to ensure
all appends align columns correctly regardless of how each individual Confidence
object's data was prettified.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f300f634-3eba-43ed-9a65-628684465c92

📥 Commits

Reviewing files that changed from the base of the PR and between b523659 and bdf6828.

📒 Files selected for processing (3)

crema/parsers/txt.py
crema/utils.py
crema/writers/txt.py

coderabbitai · 2026-06-19T23:40:53Z

+    chunk_size : int or None, optional
+        When set, the parser operates in streaming mode: instead of loading all
+        files into memory at once, it returns a generator that yields one
+        :py:class:`pandas.DataFrame` per chunk of ``chunk_size`` rows. This is
+        useful for feeding data into a DuckDB relation or a Parquet writer
+        without ever holding the full dataset in RAM. When ``None`` (default),
+        the existing behaviour is preserved and a
+        :py:class:`~crema.dataset.PsmDataset` is returned.

    Returns
    -------
-    PsmDataset
-        A :py:class:`~crema.dataset.PsmDataset` object containing the parsed
-        PSMs.
+    PsmDataset or generator of pandas.DataFrame
+        When ``chunk_size`` is ``None``, a
+        :py:class:`~crema.dataset.PsmDataset` object containing the parsed
+        PSMs.  When ``chunk_size`` is set, a generator of
+        :py:class:`pandas.DataFrame` chunks (each with the target column
+        already converted to bool).


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

chunk_size contract is inconsistent for DataFrame inputs.

Line 88 skips streaming when txt_files is a pandas.DataFrame, but Lines 61-77 document chunked mode as returning a generator when chunk_size is set. Please either (a) document the DataFrame exception explicitly, or (b) raise a clear ValueError when chunk_size is used with DataFrame input to avoid type surprises.

Also applies to: 87-92

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crema/parsers/txt.py` around lines 61 - 77, The docstring for the chunk_size parameter (lines 61-77) documents that a generator is returned when chunk_size is set, but the implementation at line 88 silently skips streaming mode when txt_files is a pandas.DataFrame. Fix this inconsistency by either updating the docstring to explicitly document that DataFrame inputs do not support chunked mode and will always return a PsmDataset, or by adding a ValueError check in the implementation that raises a clear error message when chunk_size is provided with a DataFrame input to prevent unexpected behavior.

coderabbitai · 2026-06-19T23:40:53Z

+        first = True
+        for df in qval_list:
+            df.to_csv(
+                out_file,
+                sep=sep,
+                index=False,
+                float_format=f"%.{precision}f",
+                header=first,
+                mode="w" if first else "a",
+            )
+            first = False


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Demonstrates the failure mode: appending differently-shaped DataFrames # without reindexing can mis-map values under a fixed header. python - <<'PY' import tempfile from pathlib import Path import pandas as pd p = Path(tempfile.mkdtemp()) / "demo.tsv" df1 = pd.DataFrame({"a":[1], "b":[2]}) df2 = pd.DataFrame({"b":[3], "a":[4], "c":[5]}) first = True for df in (df1, df2): df.to_csv(p, sep="\t", index=False, header=first, mode="w" if first else "a") first = False print("Written file:") print(p.read_text()) print("Round-tripped DataFrame:") print(pd.read_csv(p, sep="\t")) PY

Repository: Noble-Lab/crema

Length of output: 178

🏁 Script executed:

# First, locate and examine the file fd -t f "txt.py" | grep -i writer

Repository: Noble-Lab/crema

Length of output: 80

🏁 Script executed:

# Read the file around lines 65-75 cat -n crema/writers/txt.py | sed -n '50,85p'

Repository: Noble-Lab/crema

Length of output: 1259

🏁 Script executed:

# Check the broader context to understand qval_list construction cat -n crema/writers/txt.py | sed -n '1,80p'

Repository: Noble-Lab/crema

Length of output: 3158

🏁 Script executed:

# Examine _get_level_data function cat -n crema/writers/txt.py | sed -n '81,150p'

Repository: Noble-Lab/crema

Length of output: 983

🏁 Script executed:

# Find the Confidence class definition fd -t f "\.py$" -path "*confidence*" | head -5

Repository: Noble-Lab/crema

Length of output: 228

🏁 Script executed:

# Search for how DataFrames are created in the codebase rg "DataFrame|to_csv" --type py -A 2 -B 2 | head -100

Repository: Noble-Lab/crema

Length of output: 5647

🏁 Script executed:

# Find the Confidence class definition fd -t f "confidence.py"

Repository: Noble-Lab/crema

Length of output: 115

🏁 Script executed:

# Search for _prettify_tables or DataFrame construction rg "_prettify_tables|confidence_estimates\s*=" --type py -B 2 -A 5 | head -80

Repository: Noble-Lab/crema

Length of output: 2455

🌐 Web query:

pandas DataFrame.to_csv append mode column order mismatch behavior

💡 Result:

When using pandas DataFrame.to_csv with mode='a' (append mode), pandas does not validate or check the column order or consistency between the new data being appended and the existing content of the CSV file [1][2]. The to_csv method performs a blind append of the raw data [3]. Consequently, if your DataFrame columns are in a different order than the CSV file, or if they contain a different set of columns entirely, pandas will write the data as-is, leading to data misalignment or an inconsistent CSV structure [4][1][3]. To safely append data while maintaining correct column order, you should implement validation or reordering logic before calling to_csv [2]: 1. Read the header of the existing file to determine the expected column order [1][3]. 2. Reorder or reindex your new DataFrame to match that order using the DataFrame.reindex method [4][3]. 3. Append the reindexed DataFrame, ensuring header=False is set to prevent writing a redundant header [2][5]. Example of safe append: import pandas as pd import os # 1. Determine existing column order if os.path.exists('file.csv'): # Read only the header to get correct order expected_columns = pd.read_csv('file.csv', nrows=0).columns # 2. Reindex new data to match existing order new_data = new_data.reindex(columns=expected_columns) # 3. Append to file new_data.to_csv('file.csv', mode='a', index=False, header=False) else: # First write new_data.to_csv('file.csv', index=False) This behavior is intentional as pandas does not assume the file being appended to necessarily has a header or follows a specific schema, and it avoids the performance overhead of reading and validating potentially large files [1].

Citations:

1: why not check the column order and type consistent before use ‘appand’ mode in to_csv or equivalent ops such as to_sql? pandas-dev/pandas#38656

2: https://softhints.com/how-to-append-data-to-existing-csv-file-in-pandas/

3: https://stackoverflow.com/questions/76345249/pandas-to-csv-method-in-append-mode-with-dataframes-with-different-columns

4: https://stackoverflow.com/questions/51648524/pandas-to-csv-append-mode-with-preserved-columns-order

5: https://www.statology.org/pandas-to-csv-append/

🏁 Script executed:

# Examine _prettify_tables to see if it standardizes column order cat -n crema/confidence.py | sed -n '1,50p'

Repository: Noble-Lab/crema

Length of output: 1959

🏁 Script executed:

# Find and read _prettify_tables method rg "_prettify_tables" -A 40 crema/confidence.py

Repository: Noble-Lab/crema

Length of output: 1903

🏁 Script executed:

# Continue reading _prettify_tables for PSM and peptide levels rg "_prettify_tables" -A 60 crema/confidence.py | tail -30

Repository: Noble-Lab/crema

Length of output: 1319

🏁 Script executed:

# Check if there are any tests that verify aggregation of multiple Confidence objects rg "results\[level\] \+=" -B 5 -A 5 crema/

Repository: Noble-Lab/crema

Length of output: 640

🏁 Script executed:

# Look for any variations in how DataFrames are created that might cause column differences rg "confidence_estimates\[" -B 3 -A 3 crema/confidence.py

Repository: Noble-Lab/crema

Length of output: 2036

Appending DataFrames with inconsistent column order corrupts the output file.

When multiple Confidence objects are aggregated in results[level] (lines 57-60), each object's DataFrames are reordered by _prettify_tables() based on their individual dataset configuration. If different Confidence objects use different dataset settings (spectrum columns, score column, etc.), the resulting DataFrames can have different column sets or orders. The iterative append at lines 67-74 writes each DataFrame as-is without realigning columns, causing values to be recorded under incorrect headers.

The original pd.concat path aligned schemas before writing. Reindexing each DataFrame before append prevents this corruption:

Proposed fix: ensure consistent column order before append

out_files = [] for level, qval_list in results.items(): out_file = str(file_base) + f".{level}.txt" + all_cols = [] + for df in qval_list: + for col in df.columns: + if col not in all_cols: + all_cols.append(col) + first = True for df in qval_list: - df.to_csv( + df.reindex(columns=all_cols).to_csv( out_file, sep=sep, index=False, float_format=f"%.{precision}f", header=first, mode="w" if first else "a", ) first = False out_files.append(out_file)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crema/writers/txt.py` around lines 65 - 75, The iterative append loop that calls to_csv() on each DataFrame in qval_list can result in DataFrames with different column orders being written to the same file, corrupting the output. Before the to_csv() call for each df in the loop, reindex each DataFrame to match a consistent column order (such as the column order of the first DataFrame in the list) to ensure all appends align columns correctly regardless of how each individual Confidence object's data was prettified.

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

crema/utils.py:72

parse_psms_txt(..., chunk_size=...) adds a new return type (TextFileReader) and chunked-reading behavior, but there are currently no unit tests covering this function (including skip_line handling + usecols filtering in chunked mode). Please add tests to validate the iterator path yields the expected columns and total rows.

def parse_psms_txt(txt_file, cols, skip_line, chunk_size=None):
    """Parse a single tab-delimited file

    Parameters
    ----------

+        for df in qval_list:
+            df.to_csv(
+                out_file,
+                sep=sep,
+                index=False,


+    # Streaming mode: yield chunks without constructing a PsmDataset
+    if chunk_size is not None and not isinstance(txt_files, pd.DataFrame):
+        return _read_txt_chunked(
+            utils.listify(txt_files), sep, fields, target_column, chunk_size
+        )
+


    sep="\t",
    pairing_file_name=None,
    copy_data=False,
+    chunk_size=None,
 ):


Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

+    if chunk_size is not None:
+        if isinstance(txt_files, pd.DataFrame):
+            raise ValueError(
+                "chunk_size is not supported when txt_files is a DataFrame. "
+                "Pass file path(s) instead to use streaming mode."
+            )
+        return _read_txt_chunked(
+            utils.listify(txt_files), sep, fields, target_column, chunk_size
+        )


writers/txt.py: - Stream each confidence-estimate DataFrame directly to CSV instead of materialising the full concatenated result first. Uses mode='a' + header=False for all but the first chunk, eliminating the peak-memory spike from pd.concat() before the write. parsers/txt.py: - Add chunk_size parameter to read_txt(). When set, returns a generator that yields one pandas DataFrame per chunk_size rows, with the target column already converted to bool. The non-chunked path is unchanged. Chunked mode is intended as the ingestion layer for the Phase 3 DuckDB backend. utils.py: - Add chunk_size parameter to parse_psms_txt(). When set, returns a pandas TextFileReader iterator instead of a full DataFrame, allowing callers (e.g. Tide/Comet/MSGF+ parsers) to read large files row-batch by row-batch. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

writers/txt.py: - Compute the union of all columns across DataFrames before writing so that appended rows from different Confidence objects are always aligned to the same header, regardless of column order or set differences. parsers/txt.py: - Raise ValueError when chunk_size is passed with a DataFrame input instead of silently ignoring it. DataFrames are already in memory so streaming mode does not apply. tests/unit_tests/test_writers.py: - Add test_to_txt_multi_confidence_row_count: verifies that writing a tuple of two identical Confidence objects produces exactly 2× the rows of a single write, covering the multi-object append path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

wsnoble requested a review from Copilot June 19, 2026 23:35

Copilot started reviewing on behalf of wsnoble June 19, 2026 23:35 View session

Copilot AI reviewed Jun 19, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 19, 2026

View reviewed changes

wsnoble force-pushed the scale-phase-2 branch from bdf6828 to 9a71619 Compare June 19, 2026 23:44

wsnoble requested a review from Copilot June 19, 2026 23:53

Copilot started reviewing on behalf of wsnoble June 19, 2026 23:54 View session

Copilot AI reviewed Jun 19, 2026

View reviewed changes

wsnoble requested a review from Copilot June 19, 2026 23:59

Copilot started reviewing on behalf of wsnoble June 20, 2026 00:00 View session

Copilot AI reviewed Jun 20, 2026

View reviewed changes

wsnoble force-pushed the scale-phase-1-2 branch from 07540af to fdcf8ea Compare June 20, 2026 00:43

wsnoble force-pushed the scale-phase-2 branch from 77ec7ab to 1e6fb7c Compare June 20, 2026 00:43

wsnoble and others added 2 commits June 19, 2026 19:20

wsnoble force-pushed the scale-phase-1-2 branch from fdcf8ea to 4386e8a Compare June 20, 2026 02:21

wsnoble force-pushed the scale-phase-2 branch from 1e6fb7c to 04a4c44 Compare June 20, 2026 02:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Phase 2: streaming writer and chunked text parser#27

Phase 2: streaming writer and chunked text parser#27
wsnoble wants to merge 2 commits into
scale-phase-1-2from
scale-phase-2

wsnoble commented Jun 19, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 19, 2026 •

edited

Loading

Review skipped

Uh oh!

Copilot AI left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 19, 2026

Uh oh!

coderabbitai Bot Jun 19, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

wsnoble commented Jun 19, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

wsnoble commented Jun 19, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 19, 2026 •

edited

Loading