[SPARK-57544][SQL] Rework column ID validation for nested fields in DSv2#56796

Closed
aokolnychyi wants to merge 1 commit into apache:branch-4.2 from aokolnychyi:spark-57544-4.2

@aokolnychyi

Contributor

What changes were proposed in this pull request?

This PR reworks column ID validation for nested fields in DSv2.

Why are the changes needed?

The original implementation detected dropped-and-re-added columns by comparing top-level `Column.id()` strings in a dedicated `validateColumnIds` pass, but this approach had no visibility into nested struct fields, array elements, or map keys/values. To work around this limitation, connectors had to encode nested field IDs into the top-level ID string (as demonstrated by `ComposedColumnIdTableCatalog`), placing an unreasonable burden on connector authors and making the feature fragile by design.
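
To illustrate the limitation, here is a minimal standalone sketch, not Spark code. It assumes each top-level column maps to a single opaque ID string; the class `TopLevelIdCheckSketch` and the method `hasReaddedColumn` are hypothetical names chosen for this example. A top-level drop-and-re-add changes the ID and is caught, while a nested one leaves the parent column's ID untouched unless the connector composes nested IDs into it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not Spark code: compares only top-level column IDs.
final class TopLevelIdCheckSketch {

    /** Returns true if a column present in both snapshots has a different ID. */
    static boolean hasReaddedColumn(Map<String, String> before, Map<String, String> after) {
        for (Map.Entry<String, String> entry : before.entrySet()) {
            String newId = after.get(entry.getKey());
            if (newId != null && !newId.equals(entry.getValue())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, String> before = new LinkedHashMap<>();
        before.put("id", "1");
        before.put("address", "2"); // assumed to be a struct with nested fields

        // Top-level column "id" was dropped and re-added, so its ID changed.
        Map<String, String> topLevelReadd = new LinkedHashMap<>(before);
        topLevelReadd.put("id", "7");
        System.out.println(hasReaddedColumn(before, topLevelReadd)); // prints "true"

        // A field inside "address" was dropped and re-added, but the ID of
        // "address" itself is unchanged, so this check cannot see it.
        Map<String, String> nestedReadd = new LinkedHashMap<>(before);
        System.out.println(hasReaddedColumn(before, nestedReadd)); // prints "false"
    }
}
```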

The new mechanism stores field IDs in `StructField` metadata and validates within `validateSchemaCompatibility`.
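
The idea of per-field IDs checked during a recursive schema walk can be sketched as follows. This is a standalone illustration under stated assumptions, not the Spark implementation: `Field`, `findIdMismatches`, and the use of numeric IDs are hypothetical, and the sketch models struct nesting only. The metadata key and the actual logic inside `validateSchemaCompatibility` are not described in this PR text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not Spark code: every field carries its own ID.
final class NestedFieldIdCheckSketch {

    /** A named field with an assumed numeric ID and optional nested struct fields. */
    static final class Field {
        final String name;
        final long id;
        final List<Field> children;

        Field(String name, long id, Field... children) {
            this.name = name;
            this.id = id;
            this.children = Arrays.asList(children);
        }
    }

    /** Returns dotted paths of fields whose name survived but whose ID changed. */
    static List<String> findIdMismatches(List<Field> before, List<Field> after) {
        List<String> mismatches = new ArrayList<>();
        collect("", before, after, mismatches);
        return mismatches;
    }

    private static void collect(String prefix, List<Field> before, List<Field> after,
                                List<String> out) {
        for (Field oldField : before) {
            for (Field newField : after) {
                if (!oldField.name.equals(newField.name)) {
                    continue;
                }
                String path = prefix + oldField.name;
                if (oldField.id != newField.id) {
                    // Same name, different ID: treated as dropped and re-added.
                    out.add(path);
                } else {
                    // Same field: descend into nested fields.
                    collect(path + ".", oldField.children, newField.children, out);
                }
            }
        }
    }

    public static void main(String[] args) {
        List<Field> before = Arrays.asList(
            new Field("id", 1),
            new Field("address", 2, new Field("city", 3), new Field("zip", 4)));

        // "address.zip" was dropped and re-added, so it received a new ID.
        List<Field> after = Arrays.asList(
            new Field("id", 1),
            new Field("address", 2, new Field("city", 3), new Field("zip", 9)));

        System.out.println(findIdMismatches(before, after)); // prints "[address.zip]"
    }
}
```

Because the ID lives on each field rather than on the top-level column, the connector in this sketch no longer needs to fold nested IDs into one string.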

Does this PR introduce any user-facing change?

Yes, but it targets unreleased functionality and must be cherry-picked to 4.2.

How was this patch tested?

Existing and new tests.

Was this patch authored or co-authored using generative AI tooling?

Claude Code v2.1.183

Closes apache#56619 from aokolnychyi/spark-57544.

Authored-by: Anton Okolnychyi <aokolnychyi@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
@gengliangwang

Member

@aokolnychyi Thanks for the backport. Merging to 4.2.
cc @huaxingao

gengliangwang pushed a commit that referenced this pull request Jun 26, 2026
Closes #56796 from aokolnychyi/spark-57544-4.2.

Authored-by: Anton Okolnychyi <aokolnychyi@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>