Phase 2: streaming writer and chunked text parser#27
Conversation
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
Implements Phase 2 of the scaling plan by adding chunked/streaming ingestion for large TXT inputs and reducing peak memory usage when writing TXT outputs.
Changes:
- Stream TXT writer output by appending each confidence-estimate DataFrame directly to disk instead of concatenating in memory.
- Add
chunk_sizesupport toread_txt()to optionally return a generator of DataFrame chunks. - Add
chunk_sizesupport toutils.parse_psms_txt()to optionally return aTextFileReaderiterator.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| crema/writers/txt.py | Switches TXT output writing to iterative to_csv(..., mode='a') to reduce peak memory during output. |
| crema/utils.py | Adds optional chunked CSV reading via chunksize for PSM TXT parsing. |
| crema/parsers/txt.py | Adds optional streaming/chunked mode to return DataFrame chunks instead of constructing a PsmDataset. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| first = True | ||
| for df in qval_list: | ||
| df.to_csv( | ||
| out_file, | ||
| sep=sep, | ||
| index=False, | ||
| float_format=f"%.{precision}f", | ||
| header=first, | ||
| mode="w" if first else "a", | ||
| ) | ||
| first = False |
| # Streaming mode: yield chunks without constructing a PsmDataset | ||
| if chunk_size is not None and not isinstance(txt_files, pd.DataFrame): | ||
| return _read_txt_chunked( | ||
| utils.listify(txt_files), sep, fields, target_column, chunk_size | ||
| ) | ||
|
|
| first = True | ||
| for df in qval_list: | ||
| df.to_csv( | ||
| out_file, | ||
| sep=sep, |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crema/parsers/txt.py`:
- Around line 61-77: The docstring for the chunk_size parameter (lines 61-77)
documents that a generator is returned when chunk_size is set, but the
implementation at line 88 silently skips streaming mode when txt_files is a
pandas.DataFrame. Fix this inconsistency by either updating the docstring to
explicitly document that DataFrame inputs do not support chunked mode and will
always return a PsmDataset, or by adding a ValueError check in the
implementation that raises a clear error message when chunk_size is provided
with a DataFrame input to prevent unexpected behavior.
In `@crema/writers/txt.py`:
- Around line 65-75: The iterative append loop that calls to_csv() on each
DataFrame in qval_list can result in DataFrames with different column orders
being written to the same file, corrupting the output. Before the to_csv() call
for each df in the loop, reindex each DataFrame to match a consistent column
order (such as the column order of the first DataFrame in the list) to ensure
all appends align columns correctly regardless of how each individual Confidence
object's data was prettified.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f300f634-3eba-43ed-9a65-628684465c92
📒 Files selected for processing (3)
crema/parsers/txt.pycrema/utils.pycrema/writers/txt.py
| chunk_size : int or None, optional | ||
| When set, the parser operates in streaming mode: instead of loading all | ||
| files into memory at once, it returns a generator that yields one | ||
| :py:class:`pandas.DataFrame` per chunk of ``chunk_size`` rows. This is | ||
| useful for feeding data into a DuckDB relation or a Parquet writer | ||
| without ever holding the full dataset in RAM. When ``None`` (default), | ||
| the existing behaviour is preserved and a | ||
| :py:class:`~crema.dataset.PsmDataset` is returned. | ||
|
|
||
| Returns | ||
| ------- | ||
| PsmDataset | ||
| A :py:class:`~crema.dataset.PsmDataset` object containing the parsed | ||
| PSMs. | ||
| PsmDataset or generator of pandas.DataFrame | ||
| When ``chunk_size`` is ``None``, a | ||
| :py:class:`~crema.dataset.PsmDataset` object containing the parsed | ||
| PSMs. When ``chunk_size`` is set, a generator of | ||
| :py:class:`pandas.DataFrame` chunks (each with the target column | ||
| already converted to bool). |
There was a problem hiding this comment.
chunk_size contract is inconsistent for DataFrame inputs.
Line 88 skips streaming when txt_files is a pandas.DataFrame, but Lines 61-77 document chunked mode as returning a generator when chunk_size is set. Please either (a) document the DataFrame exception explicitly, or (b) raise a clear ValueError when chunk_size is used with DataFrame input to avoid type surprises.
Also applies to: 87-92
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crema/parsers/txt.py` around lines 61 - 77, The docstring for the chunk_size
parameter (lines 61-77) documents that a generator is returned when chunk_size
is set, but the implementation at line 88 silently skips streaming mode when
txt_files is a pandas.DataFrame. Fix this inconsistency by either updating the
docstring to explicitly document that DataFrame inputs do not support chunked
mode and will always return a PsmDataset, or by adding a ValueError check in the
implementation that raises a clear error message when chunk_size is provided
with a DataFrame input to prevent unexpected behavior.
| first = True | ||
| for df in qval_list: | ||
| df.to_csv( | ||
| out_file, | ||
| sep=sep, | ||
| index=False, | ||
| float_format=f"%.{precision}f", | ||
| header=first, | ||
| mode="w" if first else "a", | ||
| ) | ||
| first = False |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Demonstrates the failure mode: appending differently-shaped DataFrames
# without reindexing can mis-map values under a fixed header.
python - <<'PY'
import tempfile
from pathlib import Path
import pandas as pd
p = Path(tempfile.mkdtemp()) / "demo.tsv"
df1 = pd.DataFrame({"a":[1], "b":[2]})
df2 = pd.DataFrame({"b":[3], "a":[4], "c":[5]})
first = True
for df in (df1, df2):
df.to_csv(p, sep="\t", index=False, header=first, mode="w" if first else "a")
first = False
print("Written file:")
print(p.read_text())
print("Round-tripped DataFrame:")
print(pd.read_csv(p, sep="\t"))
PYRepository: Noble-Lab/crema
Length of output: 178
🏁 Script executed:
# First, locate and examine the file
fd -t f "txt.py" | grep -i writerRepository: Noble-Lab/crema
Length of output: 80
🏁 Script executed:
# Read the file around lines 65-75
cat -n crema/writers/txt.py | sed -n '50,85p'Repository: Noble-Lab/crema
Length of output: 1259
🏁 Script executed:
# Check the broader context to understand qval_list construction
cat -n crema/writers/txt.py | sed -n '1,80p'Repository: Noble-Lab/crema
Length of output: 3158
🏁 Script executed:
# Examine _get_level_data function
cat -n crema/writers/txt.py | sed -n '81,150p'Repository: Noble-Lab/crema
Length of output: 983
🏁 Script executed:
# Find the Confidence class definition
fd -t f "\.py$" -path "*confidence*" | head -5Repository: Noble-Lab/crema
Length of output: 228
🏁 Script executed:
# Search for how DataFrames are created in the codebase
rg "DataFrame|to_csv" --type py -A 2 -B 2 | head -100Repository: Noble-Lab/crema
Length of output: 5647
🏁 Script executed:
# Find the Confidence class definition
fd -t f "confidence.py"Repository: Noble-Lab/crema
Length of output: 115
🏁 Script executed:
# Search for _prettify_tables or DataFrame construction
rg "_prettify_tables|confidence_estimates\s*=" --type py -B 2 -A 5 | head -80Repository: Noble-Lab/crema
Length of output: 2455
🌐 Web query:
pandas DataFrame.to_csv append mode column order mismatch behavior
💡 Result:
When using pandas DataFrame.to_csv with mode='a' (append mode), pandas does not validate or check the column order or consistency between the new data being appended and the existing content of the CSV file [1][2]. The to_csv method performs a blind append of the raw data [3]. Consequently, if your DataFrame columns are in a different order than the CSV file, or if they contain a different set of columns entirely, pandas will write the data as-is, leading to data misalignment or an inconsistent CSV structure [4][1][3]. To safely append data while maintaining correct column order, you should implement validation or reordering logic before calling to_csv [2]: 1. Read the header of the existing file to determine the expected column order [1][3]. 2. Reorder or reindex your new DataFrame to match that order using the DataFrame.reindex method [4][3]. 3. Append the reindexed DataFrame, ensuring header=False is set to prevent writing a redundant header [2][5]. Example of safe append: import pandas as pd import os # 1. Determine existing column order if os.path.exists('file.csv'): # Read only the header to get correct order expected_columns = pd.read_csv('file.csv', nrows=0).columns # 2. Reindex new data to match existing order new_data = new_data.reindex(columns=expected_columns) # 3. Append to file new_data.to_csv('file.csv', mode='a', index=False, header=False) else: # First write new_data.to_csv('file.csv', index=False) This behavior is intentional as pandas does not assume the file being appended to necessarily has a header or follows a specific schema, and it avoids the performance overhead of reading and validating potentially large files [1].
Citations:
- 1: why not check the column order and type consistent before use ‘appand’ mode in to_csv or equivalent ops such as to_sql? pandas-dev/pandas#38656
- 2: https://softhints.com/how-to-append-data-to-existing-csv-file-in-pandas/
- 3: https://stackoverflow.com/questions/76345249/pandas-to-csv-method-in-append-mode-with-dataframes-with-different-columns
- 4: https://stackoverflow.com/questions/51648524/pandas-to-csv-append-mode-with-preserved-columns-order
- 5: https://www.statology.org/pandas-to-csv-append/
🏁 Script executed:
# Examine _prettify_tables to see if it standardizes column order
cat -n crema/confidence.py | sed -n '1,50p'Repository: Noble-Lab/crema
Length of output: 1959
🏁 Script executed:
# Find and read _prettify_tables method
rg "_prettify_tables" -A 40 crema/confidence.pyRepository: Noble-Lab/crema
Length of output: 1903
🏁 Script executed:
# Continue reading _prettify_tables for PSM and peptide levels
rg "_prettify_tables" -A 60 crema/confidence.py | tail -30Repository: Noble-Lab/crema
Length of output: 1319
🏁 Script executed:
# Check if there are any tests that verify aggregation of multiple Confidence objects
rg "results\[level\] \+=" -B 5 -A 5 crema/Repository: Noble-Lab/crema
Length of output: 640
🏁 Script executed:
# Look for any variations in how DataFrames are created that might cause column differences
rg "confidence_estimates\[" -B 3 -A 3 crema/confidence.pyRepository: Noble-Lab/crema
Length of output: 2036
Appending DataFrames with inconsistent column order corrupts the output file.
When multiple Confidence objects are aggregated in results[level] (lines 57-60), each object's DataFrames are reordered by _prettify_tables() based on their individual dataset configuration. If different Confidence objects use different dataset settings (spectrum columns, score column, etc.), the resulting DataFrames can have different column sets or orders. The iterative append at lines 67-74 writes each DataFrame as-is without realigning columns, causing values to be recorded under incorrect headers.
The original pd.concat path aligned schemas before writing. Reindexing each DataFrame before append prevents this corruption:
Proposed fix: ensure consistent column order before append
out_files = []
for level, qval_list in results.items():
out_file = str(file_base) + f".{level}.txt"
+ all_cols = []
+ for df in qval_list:
+ for col in df.columns:
+ if col not in all_cols:
+ all_cols.append(col)
+
first = True
for df in qval_list:
- df.to_csv(
+ df.reindex(columns=all_cols).to_csv(
out_file,
sep=sep,
index=False,
float_format=f"%.{precision}f",
header=first,
mode="w" if first else "a",
)
first = False
out_files.append(out_file)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crema/writers/txt.py` around lines 65 - 75, The iterative append loop that
calls to_csv() on each DataFrame in qval_list can result in DataFrames with
different column orders being written to the same file, corrupting the output.
Before the to_csv() call for each df in the loop, reindex each DataFrame to
match a consistent column order (such as the column order of the first DataFrame
in the list) to ensure all appends align columns correctly regardless of how
each individual Confidence object's data was prettified.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
crema/utils.py:72
parse_psms_txt(..., chunk_size=...)adds a new return type (TextFileReader) and chunked-reading behavior, but there are currently no unit tests covering this function (includingskip_linehandling +usecolsfiltering in chunked mode). Please add tests to validate the iterator path yields the expected columns and total rows.
def parse_psms_txt(txt_file, cols, skip_line, chunk_size=None):
"""Parse a single tab-delimited file
Parameters
----------
| for df in qval_list: | ||
| df.to_csv( | ||
| out_file, | ||
| sep=sep, | ||
| index=False, |
| # Streaming mode: yield chunks without constructing a PsmDataset | ||
| if chunk_size is not None and not isinstance(txt_files, pd.DataFrame): | ||
| return _read_txt_chunked( | ||
| utils.listify(txt_files), sep, fields, target_column, chunk_size | ||
| ) | ||
|
|
| sep="\t", | ||
| pairing_file_name=None, | ||
| copy_data=False, | ||
| chunk_size=None, | ||
| ): |
| if chunk_size is not None: | ||
| if isinstance(txt_files, pd.DataFrame): | ||
| raise ValueError( | ||
| "chunk_size is not supported when txt_files is a DataFrame. " | ||
| "Pass file path(s) instead to use streaming mode." | ||
| ) | ||
| return _read_txt_chunked( | ||
| utils.listify(txt_files), sep, fields, target_column, chunk_size | ||
| ) |
writers/txt.py: - Stream each confidence-estimate DataFrame directly to CSV instead of materialising the full concatenated result first. Uses mode='a' + header=False for all but the first chunk, eliminating the peak-memory spike from pd.concat() before the write. parsers/txt.py: - Add chunk_size parameter to read_txt(). When set, returns a generator that yields one pandas DataFrame per chunk_size rows, with the target column already converted to bool. The non-chunked path is unchanged. Chunked mode is intended as the ingestion layer for the Phase 3 DuckDB backend. utils.py: - Add chunk_size parameter to parse_psms_txt(). When set, returns a pandas TextFileReader iterator instead of a full DataFrame, allowing callers (e.g. Tide/Comet/MSGF+ parsers) to read large files row-batch by row-batch. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
writers/txt.py: - Compute the union of all columns across DataFrames before writing so that appended rows from different Confidence objects are always aligned to the same header, regardless of column order or set differences. parsers/txt.py: - Raise ValueError when chunk_size is passed with a DataFrame input instead of silently ignoring it. DataFrames are already in memory so streaming mode does not apply. tests/unit_tests/test_writers.py: - Add test_to_txt_multi_confidence_row_count: verifies that writing a tuple of two identical Confidence objects produces exactly 2× the rows of a single write, covering the multi-object append path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
writers/txt.py: Eliminatepd.concat()before writing. Each confidence-estimate DataFrame is now written directly to disk withmode='a', so peak memory during output no longer scales with dataset size.parsers/txt.py: Addchunk_sizeparameter toread_txt(). When set, returns a generator yielding DataFrames ofchunk_sizerows (target column already converted to bool). Non-chunked path is unchanged.utils.py: Addchunk_sizeparameter toparse_psms_txt()(used by Tide/Comet/MSGF+/etc. parsers). Returns a pandasTextFileReaderiterator when set, allowing row-batch ingestion of large files.This is Phase 2 of
SCALING_PLAN.md. The chunked parser is the ingestion layer for the Phase 3 DuckDB backend.Not implemented in this PR (deferred to Phase 3):
PsmDatasetStream— requires DuckDBlxml.etree.iterparseTest plan
to_txt()output is identical before and after the writer changeread_txt(..., chunk_size=10000)yields DataFrames with correct dtypes and row counts🤖 Generated with Claude Code
Summary by CodeRabbit
Release Notes
New Features
chunk_sizeparameter inread_txt()to receive data as chunks via a generator.Refactor