Skip to content

Phase 2: streaming writer and chunked text parser#27

Open
wsnoble wants to merge 2 commits into
scale-phase-1-2from
scale-phase-2
Open

Phase 2: streaming writer and chunked text parser#27
wsnoble wants to merge 2 commits into
scale-phase-1-2from
scale-phase-2

Conversation

@wsnoble

@wsnoble wsnoble commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Summary

  • writers/txt.py: Eliminate pd.concat() before writing. Each confidence-estimate DataFrame is now written directly to disk with mode='a', so peak memory during output no longer scales with dataset size.
  • parsers/txt.py: Add chunk_size parameter to read_txt(). When set, returns a generator yielding DataFrames of chunk_size rows (target column already converted to bool). Non-chunked path is unchanged.
  • utils.py: Add chunk_size parameter to parse_psms_txt() (used by Tide/Comet/MSGF+/etc. parsers). Returns a pandas TextFileReader iterator when set, allowing row-batch ingestion of large files.

This is Phase 2 of SCALING_PLAN.md. The chunked parser is the ingestion layer for the Phase 3 DuckDB backend.

Not implemented in this PR (deferred to Phase 3):

  • PsmDatasetStream — requires DuckDB
  • mzTab iterative parsing — pyteomics has no streaming API
  • pepXML — already streams via lxml.etree.iterparse

Test plan

  • CI passes (90 unit tests + lint across 3 OS × 3 Python versions)
  • Manually verify to_txt() output is identical before and after the writer change
  • Verify read_txt(..., chunk_size=10000) yields DataFrames with correct dtypes and row counts

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features

    • Text file reading now supports streaming mode: set chunk_size parameter in read_txt() to receive data as chunks via a generator.
  • Refactor

    • Optimized text file writing implementation.

@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f5003dbd-38ab-4161-8405-3874473218ea

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch scale-phase-2

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Implements Phase 2 of the scaling plan by adding chunked/streaming ingestion for large TXT inputs and reducing peak memory usage when writing TXT outputs.

Changes:

  • Stream TXT writer output by appending each confidence-estimate DataFrame directly to disk instead of concatenating in memory.
  • Add chunk_size support to read_txt() to optionally return a generator of DataFrame chunks.
  • Add chunk_size support to utils.parse_psms_txt() to optionally return a TextFileReader iterator.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
crema/writers/txt.py Switches TXT output writing to iterative to_csv(..., mode='a') to reduce peak memory during output.
crema/utils.py Adds optional chunked CSV reading via chunksize for PSM TXT parsing.
crema/parsers/txt.py Adds optional streaming/chunked mode to return DataFrame chunks instead of constructing a PsmDataset.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread crema/writers/txt.py
Comment on lines +65 to +75
first = True
for df in qval_list:
df.to_csv(
out_file,
sep=sep,
index=False,
float_format=f"%.{precision}f",
header=first,
mode="w" if first else "a",
)
first = False
Comment thread crema/parsers/txt.py Outdated
Comment on lines +87 to +92
# Streaming mode: yield chunks without constructing a PsmDataset
if chunk_size is not None and not isinstance(txt_files, pd.DataFrame):
return _read_txt_chunked(
utils.listify(txt_files), sep, fields, target_column, chunk_size
)

Comment thread crema/writers/txt.py
Comment on lines +65 to +69
first = True
for df in qval_list:
df.to_csv(
out_file,
sep=sep,

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crema/parsers/txt.py`:
- Around line 61-77: The docstring for the chunk_size parameter (lines 61-77)
documents that a generator is returned when chunk_size is set, but the
implementation at line 88 silently skips streaming mode when txt_files is a
pandas.DataFrame. Fix this inconsistency by either updating the docstring to
explicitly document that DataFrame inputs do not support chunked mode and will
always return a PsmDataset, or by adding a ValueError check in the
implementation that raises a clear error message when chunk_size is provided
with a DataFrame input to prevent unexpected behavior.

In `@crema/writers/txt.py`:
- Around line 65-75: The iterative append loop that calls to_csv() on each
DataFrame in qval_list can result in DataFrames with different column orders
being written to the same file, corrupting the output. Before the to_csv() call
for each df in the loop, reindex each DataFrame to match a consistent column
order (such as the column order of the first DataFrame in the list) to ensure
all appends align columns correctly regardless of how each individual Confidence
object's data was prettified.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f300f634-3eba-43ed-9a65-628684465c92

📥 Commits

Reviewing files that changed from the base of the PR and between b523659 and bdf6828.

📒 Files selected for processing (3)
  • crema/parsers/txt.py
  • crema/utils.py
  • crema/writers/txt.py

Comment thread crema/parsers/txt.py
Comment on lines +61 to +77
chunk_size : int or None, optional
When set, the parser operates in streaming mode: instead of loading all
files into memory at once, it returns a generator that yields one
:py:class:`pandas.DataFrame` per chunk of ``chunk_size`` rows. This is
useful for feeding data into a DuckDB relation or a Parquet writer
without ever holding the full dataset in RAM. When ``None`` (default),
the existing behaviour is preserved and a
:py:class:`~crema.dataset.PsmDataset` is returned.

Returns
-------
PsmDataset
A :py:class:`~crema.dataset.PsmDataset` object containing the parsed
PSMs.
PsmDataset or generator of pandas.DataFrame
When ``chunk_size`` is ``None``, a
:py:class:`~crema.dataset.PsmDataset` object containing the parsed
PSMs. When ``chunk_size`` is set, a generator of
:py:class:`pandas.DataFrame` chunks (each with the target column
already converted to bool).

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

chunk_size contract is inconsistent for DataFrame inputs.

Line 88 skips streaming when txt_files is a pandas.DataFrame, but Lines 61-77 document chunked mode as returning a generator when chunk_size is set. Please either (a) document the DataFrame exception explicitly, or (b) raise a clear ValueError when chunk_size is used with DataFrame input to avoid type surprises.

Also applies to: 87-92

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crema/parsers/txt.py` around lines 61 - 77, The docstring for the chunk_size
parameter (lines 61-77) documents that a generator is returned when chunk_size
is set, but the implementation at line 88 silently skips streaming mode when
txt_files is a pandas.DataFrame. Fix this inconsistency by either updating the
docstring to explicitly document that DataFrame inputs do not support chunked
mode and will always return a PsmDataset, or by adding a ValueError check in the
implementation that raises a clear error message when chunk_size is provided
with a DataFrame input to prevent unexpected behavior.

Comment thread crema/writers/txt.py
Comment on lines +65 to +75
first = True
for df in qval_list:
df.to_csv(
out_file,
sep=sep,
index=False,
float_format=f"%.{precision}f",
header=first,
mode="w" if first else "a",
)
first = False

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Demonstrates the failure mode: appending differently-shaped DataFrames
# without reindexing can mis-map values under a fixed header.
python - <<'PY'
import tempfile
from pathlib import Path
import pandas as pd

p = Path(tempfile.mkdtemp()) / "demo.tsv"

df1 = pd.DataFrame({"a":[1], "b":[2]})
df2 = pd.DataFrame({"b":[3], "a":[4], "c":[5]})

first = True
for df in (df1, df2):
    df.to_csv(p, sep="\t", index=False, header=first, mode="w" if first else "a")
    first = False

print("Written file:")
print(p.read_text())
print("Round-tripped DataFrame:")
print(pd.read_csv(p, sep="\t"))
PY

Repository: Noble-Lab/crema

Length of output: 178


🏁 Script executed:

# First, locate and examine the file
fd -t f "txt.py" | grep -i writer

Repository: Noble-Lab/crema

Length of output: 80


🏁 Script executed:

# Read the file around lines 65-75
cat -n crema/writers/txt.py | sed -n '50,85p'

Repository: Noble-Lab/crema

Length of output: 1259


🏁 Script executed:

# Check the broader context to understand qval_list construction
cat -n crema/writers/txt.py | sed -n '1,80p'

Repository: Noble-Lab/crema

Length of output: 3158


🏁 Script executed:

# Examine _get_level_data function
cat -n crema/writers/txt.py | sed -n '81,150p'

Repository: Noble-Lab/crema

Length of output: 983


🏁 Script executed:

# Find the Confidence class definition
fd -t f "\.py$" -path "*confidence*" | head -5

Repository: Noble-Lab/crema

Length of output: 228


🏁 Script executed:

# Search for how DataFrames are created in the codebase
rg "DataFrame|to_csv" --type py -A 2 -B 2 | head -100

Repository: Noble-Lab/crema

Length of output: 5647


🏁 Script executed:

# Find the Confidence class definition
fd -t f "confidence.py"

Repository: Noble-Lab/crema

Length of output: 115


🏁 Script executed:

# Search for _prettify_tables or DataFrame construction
rg "_prettify_tables|confidence_estimates\s*=" --type py -B 2 -A 5 | head -80

Repository: Noble-Lab/crema

Length of output: 2455


🌐 Web query:

pandas DataFrame.to_csv append mode column order mismatch behavior

💡 Result:

When using pandas DataFrame.to_csv with mode='a' (append mode), pandas does not validate or check the column order or consistency between the new data being appended and the existing content of the CSV file [1][2]. The to_csv method performs a blind append of the raw data [3]. Consequently, if your DataFrame columns are in a different order than the CSV file, or if they contain a different set of columns entirely, pandas will write the data as-is, leading to data misalignment or an inconsistent CSV structure [4][1][3]. To safely append data while maintaining correct column order, you should implement validation or reordering logic before calling to_csv [2]: 1. Read the header of the existing file to determine the expected column order [1][3]. 2. Reorder or reindex your new DataFrame to match that order using the DataFrame.reindex method [4][3]. 3. Append the reindexed DataFrame, ensuring header=False is set to prevent writing a redundant header [2][5]. Example of safe append: import pandas as pd import os # 1. Determine existing column order if os.path.exists('file.csv'): # Read only the header to get correct order expected_columns = pd.read_csv('file.csv', nrows=0).columns # 2. Reindex new data to match existing order new_data = new_data.reindex(columns=expected_columns) # 3. Append to file new_data.to_csv('file.csv', mode='a', index=False, header=False) else: # First write new_data.to_csv('file.csv', index=False) This behavior is intentional as pandas does not assume the file being appended to necessarily has a header or follows a specific schema, and it avoids the performance overhead of reading and validating potentially large files [1].

Citations:


🏁 Script executed:

# Examine _prettify_tables to see if it standardizes column order
cat -n crema/confidence.py | sed -n '1,50p'

Repository: Noble-Lab/crema

Length of output: 1959


🏁 Script executed:

# Find and read _prettify_tables method
rg "_prettify_tables" -A 40 crema/confidence.py

Repository: Noble-Lab/crema

Length of output: 1903


🏁 Script executed:

# Continue reading _prettify_tables for PSM and peptide levels
rg "_prettify_tables" -A 60 crema/confidence.py | tail -30

Repository: Noble-Lab/crema

Length of output: 1319


🏁 Script executed:

# Check if there are any tests that verify aggregation of multiple Confidence objects
rg "results\[level\] \+=" -B 5 -A 5 crema/

Repository: Noble-Lab/crema

Length of output: 640


🏁 Script executed:

# Look for any variations in how DataFrames are created that might cause column differences
rg "confidence_estimates\[" -B 3 -A 3 crema/confidence.py

Repository: Noble-Lab/crema

Length of output: 2036


Appending DataFrames with inconsistent column order corrupts the output file.

When multiple Confidence objects are aggregated in results[level] (lines 57-60), each object's DataFrames are reordered by _prettify_tables() based on their individual dataset configuration. If different Confidence objects use different dataset settings (spectrum columns, score column, etc.), the resulting DataFrames can have different column sets or orders. The iterative append at lines 67-74 writes each DataFrame as-is without realigning columns, causing values to be recorded under incorrect headers.

The original pd.concat path aligned schemas before writing. Reindexing each DataFrame before append prevents this corruption:

Proposed fix: ensure consistent column order before append
     out_files = []
     for level, qval_list in results.items():
         out_file = str(file_base) + f".{level}.txt"
+        all_cols = []
+        for df in qval_list:
+            for col in df.columns:
+                if col not in all_cols:
+                    all_cols.append(col)
+
         first = True
         for df in qval_list:
-            df.to_csv(
+            df.reindex(columns=all_cols).to_csv(
                 out_file,
                 sep=sep,
                 index=False,
                 float_format=f"%.{precision}f",
                 header=first,
                 mode="w" if first else "a",
             )
             first = False
         out_files.append(out_file)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crema/writers/txt.py` around lines 65 - 75, The iterative append loop that
calls to_csv() on each DataFrame in qval_list can result in DataFrames with
different column orders being written to the same file, corrupting the output.
Before the to_csv() call for each df in the loop, reindex each DataFrame to
match a consistent column order (such as the column order of the first DataFrame
in the list) to ensure all appends align columns correctly regardless of how
each individual Confidence object's data was prettified.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

crema/utils.py:72

  • parse_psms_txt(..., chunk_size=...) adds a new return type (TextFileReader) and chunked-reading behavior, but there are currently no unit tests covering this function (including skip_line handling + usecols filtering in chunked mode). Please add tests to validate the iterator path yields the expected columns and total rows.
def parse_psms_txt(txt_file, cols, skip_line, chunk_size=None):
    """Parse a single tab-delimited file

    Parameters
    ----------

Comment thread crema/writers/txt.py
Comment on lines +66 to +70
for df in qval_list:
df.to_csv(
out_file,
sep=sep,
index=False,
Comment thread crema/parsers/txt.py Outdated
Comment on lines +87 to +92
# Streaming mode: yield chunks without constructing a PsmDataset
if chunk_size is not None and not isinstance(txt_files, pd.DataFrame):
return _read_txt_chunked(
utils.listify(txt_files), sep, fields, target_column, chunk_size
)

Comment thread crema/parsers/txt.py
Comment on lines 20 to 24
sep="\t",
pairing_file_name=None,
copy_data=False,
chunk_size=None,
):

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Comment thread crema/parsers/txt.py
Comment on lines +90 to +98
if chunk_size is not None:
if isinstance(txt_files, pd.DataFrame):
raise ValueError(
"chunk_size is not supported when txt_files is a DataFrame. "
"Pass file path(s) instead to use streaming mode."
)
return _read_txt_chunked(
utils.listify(txt_files), sep, fields, target_column, chunk_size
)
wsnoble and others added 2 commits June 19, 2026 19:20
writers/txt.py:
- Stream each confidence-estimate DataFrame directly to CSV instead of
  materialising the full concatenated result first.  Uses mode='a' +
  header=False for all but the first chunk, eliminating the peak-memory
  spike from pd.concat() before the write.

parsers/txt.py:
- Add chunk_size parameter to read_txt().  When set, returns a generator
  that yields one pandas DataFrame per chunk_size rows, with the target
  column already converted to bool.  The non-chunked path is unchanged.
  Chunked mode is intended as the ingestion layer for the Phase 3 DuckDB
  backend.

utils.py:
- Add chunk_size parameter to parse_psms_txt().  When set, returns a
  pandas TextFileReader iterator instead of a full DataFrame, allowing
  callers (e.g. Tide/Comet/MSGF+ parsers) to read large files row-batch
  by row-batch.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
writers/txt.py:
- Compute the union of all columns across DataFrames before writing so
  that appended rows from different Confidence objects are always aligned
  to the same header, regardless of column order or set differences.

parsers/txt.py:
- Raise ValueError when chunk_size is passed with a DataFrame input
  instead of silently ignoring it. DataFrames are already in memory so
  streaming mode does not apply.

tests/unit_tests/test_writers.py:
- Add test_to_txt_multi_confidence_row_count: verifies that writing a
  tuple of two identical Confidence objects produces exactly 2× the rows
  of a single write, covering the multi-object append path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants