Skip to content

Fix segfault and out-of-bounds reads in file loaders on malformed rows#2453

Merged
jrgemignani merged 2 commits into
apache:masterfrom
gregfelice:fix/loader-delimiter-nfields-guard
Jul 2, 2026
Merged

Fix segfault and out-of-bounds reads in file loaders on malformed rows#2453
jrgemignani merged 2 commits into
apache:masterfrom
gregfelice:fix/loader-delimiter-nfields-guard

Conversation

@gregfelice

Copy link
Copy Markdown
Contributor

Summary

Fixes the segfault reported in #2449 (and a related out-of-bounds read) in load_edges_from_file() / load_labels_from_file().

Both loaders build their COPY parser with only format=csv + header=false, so COPY uses its default comma delimiter. A file delimited by anything else — or any malformed row — then parses with an unexpected column count, and the loaders indexed the parsed fields without validating that count:

  • process_edge_row() reads the four fixed fields fields[0..3] unconditionally. A non-comma-delimited edge file parses as a single column, so fields[1..3] are out of bounds → segfault.
  • create_agtype_from_list()/_i() pair header[i] with fields[i] for all i < nfields, so a row with more fields than the header reads header[i] out of bounds.

Fix

Bounds validation that turns these into clear errors instead of a crash/UB:

  • Edge header must have ≥ 4 columns; a smaller count almost always means the wrong delimiter, so the error carries an errhint pointing at the delimiter.
  • Each edge row must have ≥ 4 columns and no more than the header's count.
  • Each label row must have no more than the header's count.

Rows with fewer trailing columns than the header remain allowed (that's safe — the property loop only iterates nfields — and is exercised by the existing conversion tests).

Example, before vs after, on a pipe-delimited edge file:

-- before: server terminated by signal 11 (segfault)
-- after:
ERROR:  edge file must have at least 4 columns (start_id, start_vertex_type, end_id, end_vertex_type), but the header has 1
HINT:  load_edges_from_file expects a comma-delimited CSV; check the file's delimiter.

Scope / follow-up

This closes the segfault and out-of-bounds reads. The silent mis-parsing of a non-comma file whose header and rows share the same (wrong) column count (the label-corruption half of #2449) isn't detectable here — the real fix for that is a delimiter option on the load functions, which I'll do as a separate follow-up. Leaving #2449 open for that.

Tests

Adds a regression case to age_load using a pipe-delimited edge file (asserts the clean error). Full regression suite passes locally against PostgreSQL 18 (41/41).

load_edges_from_file() and load_labels_from_file() build their COPY
parser with only format=csv and header=false, so COPY uses its default
comma delimiter. A file delimited by anything else (or a malformed row)
then parses with an unexpected column count, and the loaders indexed the
parsed fields without validating that count:

- process_edge_row() reads the four fixed fields fields[0..3]
  unconditionally. A non-comma-delimited edge file parses as a single
  column, so fields[1..3] are out of bounds -> segfault (issue apache#2449).
- create_agtype_from_list()/_i() pair header[i] with fields[i] for all
  i < nfields, so a row with more fields than the header reads header[i]
  out of bounds.

Add bounds validation that turns these into clear errors:

- Edge header must have >= 4 columns; a smaller count almost always
  means the wrong delimiter, so the error carries a hint.
- Each edge row must have >= 4 columns and no more than the header's.
- Each label row must have no more than the header's column count.

Rows with fewer trailing columns than the header remain allowed, matching
existing behavior (exercised by the existing conversion tests).

This closes the segfault and out-of-bounds reads. The silent mis-parsing
of a non-comma file whose header and rows share the same (wrong) column
count is not detectable here; adding a delimiter option to the load
functions is a separate follow-up.

Adds a regression test in age_load using a pipe-delimited edge file.

Addresses apache#2449.
@gregfelice

Copy link
Copy Markdown
Contributor Author

@jrgemignani CI is green across all five workflows (Build/Regression + all four drivers). Ready for review whenever you have a minute.

This closes the segfault + out-of-bounds reads in #2449: the loaders indexed COPY-parsed fields without validating the column count, so a non-comma-delimited edge file (parsed as one column) read fields[1..3] out of bounds and crashed. The fix adds bounds guards that turn it into a clear error (with a hint about the delimiter). Small, in-lane, regression test included in age_load.

Scope note: this closes the crash/OOB half; the silent mis-parse of a non-comma file (the label-corruption half) needs a delimiter option, which I'll do as a follow-up — so #2449 stays open for that. Small enough to consider for 1.8.0 if you want the crash fix in.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens the CSV-based label/edge file loaders against malformed or mis-delimited input to prevent server crashes (segfault) and out-of-bounds reads, turning them into clear user-facing errors. It addresses the crash/UB described in issue #2449 by validating parsed column counts before indexing parsed COPY fields.

Changes:

  • Add column-count validation for label rows to prevent header[i] out-of-bounds reads when a row has more columns than the header.
  • Add header and row column-count validation for edge files to prevent fields[0..3] and header[i] out-of-bounds reads.
  • Add a regression test and fixture for a pipe-delimited edge file to assert a clean error (previously segfaulted).

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/backend/utils/load/ag_load_labels.c Adds row-vs-header bounds validation to prevent OOB reads in property construction.
src/backend/utils/load/ag_load_edges.c Adds header and row column-count validation to prevent segfault/OOB reads on malformed input.
regress/sql/age_load.sql Adds a regression case for mis-delimited edge input (pipe-delimited).
regress/expected/age_load.out Captures expected ERROR/HINT output for the new regression case.
regress/age_load/data/bad_delim_edges.csv Adds a pipe-delimited edge CSV fixture used by the regression test.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/backend/utils/load/ag_load_labels.c Outdated
Comment thread src/backend/utils/load/ag_load_edges.c
Comment thread src/backend/utils/load/ag_load_labels.c
Comment thread src/backend/utils/load/ag_load_edges.c
@jrgemignani

Copy link
Copy Markdown
Contributor

@gregfelice please see above Copilot suggestions.

Address review feedback on the nfields guards:

- Error messages now say "the header's %d columns" (was "the header's %d"),
  making the count's unit explicit.
- Add regression cases exercising the per-row guards, which previously only
  had coverage for the mis-delimited-header path:
    * an edge row with fewer than 4 columns
    * an edge row with more columns than the header
    * a label row with more columns than the header
  Each asserts a clean ERROR (these were the out-of-bounds reads the guards
  now catch).
@gregfelice

Copy link
Copy Markdown
Contributor Author

@jrgemignani addressed the Copilot notes in a431fc5:

  • Error wording: both per-row messages now say "the header's %d columns" so the count's unit is explicit.
  • Missing regression coverage: added cases for the per-row guards (previously only the mis-delimited-header path was asserted):
    • an edge row with fewer than 4 columns → edge file row has 2 columns; expected at least 4 and no more than the header's 4 columns
    • an edge row with more columns than the header → ...has 5 columns; ...header's 4 columns
    • a label row with more columns than the header → label file row has 3 columns, more than the header's 2 columns

Each asserts a clean ERROR — these are exactly the out-of-bounds reads the guards now catch. Full regression suite green locally (41/41 on PG18).

@jrgemignani jrgemignani merged commit 09fce2c into apache:master Jul 2, 2026
6 checks passed
@gregfelice gregfelice deleted the fix/loader-delimiter-nfields-guard branch July 2, 2026 17:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants