Fix segfault and out-of-bounds reads in file loaders on malformed rows#2453
Conversation
load_edges_from_file() and load_labels_from_file() build their COPY parser with only format=csv and header=false, so COPY uses its default comma delimiter. A file delimited by anything else (or a malformed row) then parses with an unexpected column count, and the loaders indexed the parsed fields without validating that count: - process_edge_row() reads the four fixed fields fields[0..3] unconditionally. A non-comma-delimited edge file parses as a single column, so fields[1..3] are out of bounds -> segfault (issue apache#2449). - create_agtype_from_list()/_i() pair header[i] with fields[i] for all i < nfields, so a row with more fields than the header reads header[i] out of bounds. Add bounds validation that turns these into clear errors: - Edge header must have >= 4 columns; a smaller count almost always means the wrong delimiter, so the error carries a hint. - Each edge row must have >= 4 columns and no more than the header's. - Each label row must have no more than the header's column count. Rows with fewer trailing columns than the header remain allowed, matching existing behavior (exercised by the existing conversion tests). This closes the segfault and out-of-bounds reads. The silent mis-parsing of a non-comma file whose header and rows share the same (wrong) column count is not detectable here; adding a delimiter option to the load functions is a separate follow-up. Adds a regression test in age_load using a pipe-delimited edge file. Addresses apache#2449.
|
@jrgemignani CI is green across all five workflows (Build/Regression + all four drivers). Ready for review whenever you have a minute. This closes the segfault + out-of-bounds reads in #2449: the loaders indexed COPY-parsed fields without validating the column count, so a non-comma-delimited edge file (parsed as one column) read Scope note: this closes the crash/OOB half; the silent mis-parse of a non-comma file (the label-corruption half) needs a delimiter option, which I'll do as a follow-up — so #2449 stays open for that. Small enough to consider for 1.8.0 if you want the crash fix in. |
There was a problem hiding this comment.
Pull request overview
This PR hardens the CSV-based label/edge file loaders against malformed or mis-delimited input to prevent server crashes (segfault) and out-of-bounds reads, turning them into clear user-facing errors. It addresses the crash/UB described in issue #2449 by validating parsed column counts before indexing parsed COPY fields.
Changes:
- Add column-count validation for label rows to prevent
header[i]out-of-bounds reads when a row has more columns than the header. - Add header and row column-count validation for edge files to prevent
fields[0..3]andheader[i]out-of-bounds reads. - Add a regression test and fixture for a pipe-delimited edge file to assert a clean error (previously segfaulted).
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| src/backend/utils/load/ag_load_labels.c | Adds row-vs-header bounds validation to prevent OOB reads in property construction. |
| src/backend/utils/load/ag_load_edges.c | Adds header and row column-count validation to prevent segfault/OOB reads on malformed input. |
| regress/sql/age_load.sql | Adds a regression case for mis-delimited edge input (pipe-delimited). |
| regress/expected/age_load.out | Captures expected ERROR/HINT output for the new regression case. |
| regress/age_load/data/bad_delim_edges.csv | Adds a pipe-delimited edge CSV fixture used by the regression test. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@gregfelice please see above Copilot suggestions. |
Address review feedback on the nfields guards:
- Error messages now say "the header's %d columns" (was "the header's %d"),
making the count's unit explicit.
- Add regression cases exercising the per-row guards, which previously only
had coverage for the mis-delimited-header path:
* an edge row with fewer than 4 columns
* an edge row with more columns than the header
* a label row with more columns than the header
Each asserts a clean ERROR (these were the out-of-bounds reads the guards
now catch).
|
@jrgemignani addressed the Copilot notes in a431fc5:
Each asserts a clean ERROR — these are exactly the out-of-bounds reads the guards now catch. Full regression suite green locally (41/41 on PG18). |
Summary
Fixes the segfault reported in #2449 (and a related out-of-bounds read) in
load_edges_from_file()/load_labels_from_file().Both loaders build their COPY parser with only
format=csv+header=false, so COPY uses its default comma delimiter. A file delimited by anything else — or any malformed row — then parses with an unexpected column count, and the loaders indexed the parsed fields without validating that count:process_edge_row()reads the four fixed fieldsfields[0..3]unconditionally. A non-comma-delimited edge file parses as a single column, sofields[1..3]are out of bounds → segfault.create_agtype_from_list()/_i()pairheader[i]withfields[i]for alli < nfields, so a row with more fields than the header readsheader[i]out of bounds.Fix
Bounds validation that turns these into clear errors instead of a crash/UB:
errhintpointing at the delimiter.Rows with fewer trailing columns than the header remain allowed (that's safe — the property loop only iterates
nfields— and is exercised by the existing conversion tests).Example, before vs after, on a pipe-delimited edge file:
Scope / follow-up
This closes the segfault and out-of-bounds reads. The silent mis-parsing of a non-comma file whose header and rows share the same (wrong) column count (the label-corruption half of #2449) isn't detectable here — the real fix for that is a delimiter option on the load functions, which I'll do as a separate follow-up. Leaving #2449 open for that.
Tests
Adds a regression case to
age_loadusing a pipe-delimited edge file (asserts the clean error). Full regression suite passes locally against PostgreSQL 18 (41/41).