Fix segfault and out-of-bounds reads in file loaders on malformed rows by gregfelice · Pull Request #2453 · apache/age

gregfelice · 2026-07-02T13:27:11Z

Summary

Fixes the segfault reported in #2449 (and a related out-of-bounds read) in load_edges_from_file() / load_labels_from_file().

Both loaders build their COPY parser with only format=csv + header=false, so COPY uses its default comma delimiter. A file delimited by anything else — or any malformed row — then parses with an unexpected column count, and the loaders indexed the parsed fields without validating that count:

process_edge_row() reads the four fixed fields fields[0..3] unconditionally. A non-comma-delimited edge file parses as a single column, so fields[1..3] are out of bounds → segfault.
create_agtype_from_list()/_i() pair header[i] with fields[i] for all i < nfields, so a row with more fields than the header reads header[i] out of bounds.

Fix

Bounds validation that turns these into clear errors instead of a crash/UB:

Edge header must have ≥ 4 columns; a smaller count almost always means the wrong delimiter, so the error carries an errhint pointing at the delimiter.
Each edge row must have ≥ 4 columns and no more than the header's count.
Each label row must have no more than the header's count.

Rows with fewer trailing columns than the header remain allowed (that's safe — the property loop only iterates nfields — and is exercised by the existing conversion tests).

Example, before vs after, on a pipe-delimited edge file:

-- before: server terminated by signal 11 (segfault)
-- after:
ERROR:  edge file must have at least 4 columns (start_id, start_vertex_type, end_id, end_vertex_type), but the header has 1
HINT:  load_edges_from_file expects a comma-delimited CSV; check the file's delimiter.

Scope / follow-up

This closes the segfault and out-of-bounds reads. The silent mis-parsing of a non-comma file whose header and rows share the same (wrong) column count (the label-corruption half of #2449) isn't detectable here — the real fix for that is a delimiter option on the load functions, which I'll do as a separate follow-up. Leaving #2449 open for that.

Tests

Adds a regression case to age_load using a pipe-delimited edge file (asserts the clean error). Full regression suite passes locally against PostgreSQL 18 (41/41).

load_edges_from_file() and load_labels_from_file() build their COPY parser with only format=csv and header=false, so COPY uses its default comma delimiter. A file delimited by anything else (or a malformed row) then parses with an unexpected column count, and the loaders indexed the parsed fields without validating that count: - process_edge_row() reads the four fixed fields fields[0..3] unconditionally. A non-comma-delimited edge file parses as a single column, so fields[1..3] are out of bounds -> segfault (issue apache#2449). - create_agtype_from_list()/_i() pair header[i] with fields[i] for all i < nfields, so a row with more fields than the header reads header[i] out of bounds. Add bounds validation that turns these into clear errors: - Edge header must have >= 4 columns; a smaller count almost always means the wrong delimiter, so the error carries a hint. - Each edge row must have >= 4 columns and no more than the header's. - Each label row must have no more than the header's column count. Rows with fewer trailing columns than the header remain allowed, matching existing behavior (exercised by the existing conversion tests). This closes the segfault and out-of-bounds reads. The silent mis-parsing of a non-comma file whose header and rows share the same (wrong) column count is not detectable here; adding a delimiter option to the load functions is a separate follow-up. Adds a regression test in age_load using a pipe-delimited edge file. Addresses apache#2449.

gregfelice · 2026-07-02T13:37:21Z

@jrgemignani CI is green across all five workflows (Build/Regression + all four drivers). Ready for review whenever you have a minute.

This closes the segfault + out-of-bounds reads in #2449: the loaders indexed COPY-parsed fields without validating the column count, so a non-comma-delimited edge file (parsed as one column) read fields[1..3] out of bounds and crashed. The fix adds bounds guards that turn it into a clear error (with a hint about the delimiter). Small, in-lane, regression test included in age_load.

Scope note: this closes the crash/OOB half; the silent mis-parse of a non-comma file (the label-corruption half) needs a delimiter option, which I'll do as a follow-up — so #2449 stays open for that. Small enough to consider for 1.8.0 if you want the crash fix in.

Copilot

Pull request overview

This PR hardens the CSV-based label/edge file loaders against malformed or mis-delimited input to prevent server crashes (segfault) and out-of-bounds reads, turning them into clear user-facing errors. It addresses the crash/UB described in issue #2449 by validating parsed column counts before indexing parsed COPY fields.

Changes:

Add column-count validation for label rows to prevent header[i] out-of-bounds reads when a row has more columns than the header.
Add header and row column-count validation for edge files to prevent fields[0..3] and header[i] out-of-bounds reads.
Add a regression test and fixture for a pipe-delimited edge file to assert a clean error (previously segfaulted).

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
src/backend/utils/load/ag_load_labels.c	Adds row-vs-header bounds validation to prevent OOB reads in property construction.
src/backend/utils/load/ag_load_edges.c	Adds header and row column-count validation to prevent segfault/OOB reads on malformed input.
regress/sql/age_load.sql	Adds a regression case for mis-delimited edge input (pipe-delimited).
regress/expected/age_load.out	Captures expected ERROR/HINT output for the new regression case.
regress/age_load/data/bad_delim_edges.csv	Adds a pipe-delimited edge CSV fixture used by the regression test.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

jrgemignani · 2026-07-02T14:43:46Z

@gregfelice please see above Copilot suggestions.

Address review feedback on the nfields guards: - Error messages now say "the header's %d columns" (was "the header's %d"), making the count's unit explicit. - Add regression cases exercising the per-row guards, which previously only had coverage for the mis-delimited-header path: * an edge row with fewer than 4 columns * an edge row with more columns than the header * a label row with more columns than the header Each asserts a clean ERROR (these were the out-of-bounds reads the guards now catch).

gregfelice · 2026-07-02T16:45:08Z

@jrgemignani addressed the Copilot notes in a431fc5:

Error wording: both per-row messages now say "the header's %d columns" so the count's unit is explicit.
Missing regression coverage: added cases for the per-row guards (previously only the mis-delimited-header path was asserted):
- an edge row with fewer than 4 columns → edge file row has 2 columns; expected at least 4 and no more than the header's 4 columns
- an edge row with more columns than the header → ...has 5 columns; ...header's 4 columns
- a label row with more columns than the header → label file row has 3 columns, more than the header's 2 columns

Each asserts a clean ERROR — these are exactly the out-of-bounds reads the guards now catch. Full regression suite green locally (41/41 on PG18).

jrgemignani requested a review from Copilot July 2, 2026 14:30

jrgemignani added priority 1.8.0 labels Jul 2, 2026

Copilot started reviewing on behalf of jrgemignani July 2, 2026 14:38 View session

Copilot AI reviewed Jul 2, 2026

View reviewed changes

Comment thread src/backend/utils/load/ag_load_labels.c Outdated

Comment thread src/backend/utils/load/ag_load_edges.c

Comment thread src/backend/utils/load/ag_load_labels.c

Comment thread src/backend/utils/load/ag_load_edges.c

jrgemignani approved these changes Jul 2, 2026

View reviewed changes

jrgemignani merged commit 09fce2c into apache:master Jul 2, 2026
6 checks passed

gregfelice deleted the fix/loader-delimiter-nfields-guard branch July 2, 2026 17:06

gregfelice mentioned this pull request Jul 2, 2026

Support outer references in reduce() fold bodies #2448

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix segfault and out-of-bounds reads in file loaders on malformed rows#2453

Fix segfault and out-of-bounds reads in file loaders on malformed rows#2453
jrgemignani merged 2 commits into
apache:masterfrom
gregfelice:fix/loader-delimiter-nfields-guard

gregfelice commented Jul 2, 2026

Uh oh!

gregfelice commented Jul 2, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jrgemignani commented Jul 2, 2026

Uh oh!

gregfelice commented Jul 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

gregfelice commented Jul 2, 2026

Summary

Fix

Scope / follow-up

Tests

Uh oh!

gregfelice commented Jul 2, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jrgemignani commented Jul 2, 2026

Uh oh!

gregfelice commented Jul 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants