[fix](iceberg)fix core when read a iceberg table that have schema change and equality delete.#59984
hubgeter wants to merge 4 commits into apache:master
54149f6 to 10863fa
TPC-H: Total hot run time: 31578 ms
TPC-DS: Total hot run time: 173933 ms
ClickBench: Total hot run time: 27.86 s
TPC-H: Total hot run time: 31924 ms
TPC-DS: Total hot run time: 174572 ms
ClickBench: Total hot run time: 27.85 s
TPC-H: Total hot run time: 32029 ms
TPC-DS: Total hot run time: 173146 ms
ClickBench: Total hot run time: 27.65 s
TPC-H: Total hot run time: 31893 ms
TPC-DS: Total hot run time: 173991 ms
ClickBench: Total hot run time: 27.78 s
Pull request overview
This PR fixes critical issues with reading Iceberg tables that have schema changes combined with equality delete files, particularly for tables migrated from Hive. The implementation correctly uses Iceberg field IDs instead of column names for matching, adds support for name mappings from table properties, and optimizes metadata transmission by pruning unnecessary columns.
Changes:
- Implements Iceberg field ID-based column matching for equality deletes instead of name-based matching to handle schema evolution correctly
- Adds support for reading Hive-migrated Iceberg tables using the `schema.name-mapping.default` table property for files missing field IDs
- Optimizes Thrift metadata transmission by pruning schema info to only include required columns and nested fields
Reviewed changes
Copilot reviewed 87 out of 297 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| fe/fe-core/src/main/java/org/apache/doris/datasource/iceberg/source/IcebergScanNode.java | Adds name mapping extraction from Iceberg table properties and implements schema pruning based on required columns |
| gensrc/thrift/ExternalTableSchema.thrift | Adds optional name_mapping field to TField for supporting missing column IDs |
| be/src/vec/exec/format/table/table_format_reader.h | Updates method signatures to pass exist_field_id as parameter instead of out parameter |
| be/src/vec/exec/format/table/iceberg_reader.h | Refactors equality delete handling to support multiple delete file groups based on field IDs |
| be/src/vec/exec/format/table/equality_delete.h | Updates equality delete interfaces to accept field IDs and id-to-column-name mapping |
| be/src/vec/exec/format/parquet/schema_desc.h | Changes get_column parameter type from int to size_t |
| regression-test/suites/external_table_p0/iceberg/test_gen_iceberg_by_api.groovy | Updates test to expect error when reading missing field ID files without name mapping |
| docker/thirdparties/docker-compose/iceberg/scripts/preinstalled_data/iceberg/* | Adds comprehensive test data files (Parquet, ORC, metadata) for testing equality delete with schema changes |
| docker/thirdparties/docker-compose/iceberg/iceberg.yaml.tpl | Mounts preinstalled test data directory in containers |
| docker/thirdparties/docker-compose/iceberg/entrypoint.sh.tpl | Adds execution of iceberg_load SQL scripts to initialize test data |
    if (delete.content() == FileContent.POSITION_DELETES) {
        filters.add(IcebergDeleteFileFilter.createPositionDelete(delete));
    } else if (delete.content() == FileContent.EQUALITY_DELETES) {
        // spitTask.schema().findField()
Remove this commented-out code. It appears to be a leftover from debugging or development and should be cleaned up before merging.
774c62d to bf0e72e
TPC-H: Total hot run time: 29365 ms
TPC-DS: Total hot run time: 184548 ms
TPC-H: Total hot run time: 29313 ms
TPC-DS: Total hot run time: 184196 ms
What problem does this PR solve?
Problem Summary:
This PR addresses three major issues related to reading Iceberg tables:
Fix a crash when equality deletes meet schema changes. When reading an Iceberg table that contains equality delete files, a crash may occur if schema changes have been applied to the affected columns.
The root cause is that the previous implementation matched columns by name when applying equality deletes. Iceberg, however, requires matching by field ID so that the same column is identified correctly across schema evolution.
This PR updates the equality delete logic to resolve columns by Iceberg field ID instead of by column name.
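To make the difference between name-based and ID-based matching concrete, here is a minimal sketch. It is not the code from this PR: `EqualityDeleteColumnResolver` and `resolveByFieldId` are hypothetical names, and the sketch assumes the reader already has a field-ID-to-column-name map for the data file being read.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not a class from Doris or Iceberg.
final class EqualityDeleteColumnResolver {
    /**
     * Maps the equality field IDs recorded for a delete file to the column
     * names used in the data file currently being read.
     */
    static List<String> resolveByFieldId(List<Integer> equalityFieldIds,
                                         Map<Integer, String> dataFileIdToName) {
        List<String> columns = new ArrayList<>();
        for (Integer fieldId : equalityFieldIds) {
            String name = dataFileIdToName.get(fieldId);
            if (name == null) {
                // Assumption of this sketch: an unknown ID is reported
                // rather than guessed from a column name.
                throw new IllegalStateException("No column with field ID " + fieldId);
            }
            columns.add(name);
        }
        return columns;
    }
}
```

Suppose a delete file was written while field ID 1 was still called `id`, and the column was later renamed to `user_id`. Looking up `id` by name in a newer data file finds nothing, whereas `resolveByFieldId(List.of(1), Map.of(1, "user_id", 2, "name"))` still resolves to the renamed column.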
Support Hive-migrated tables whose data files lack field IDs. For Iceberg tables migrated from Hive, legacy data files do not contain Iceberg field IDs.
In this case, columns must be resolved using the table property `schema.name-mapping.default`.
This PR adds support for using the name mapping defined in table properties to correctly match columns when reading such data files.
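The fallback can be sketched roughly as follows. In Iceberg, the `schema.name-mapping.default` property holds a JSON list of entries, each pairing a `field-id` with a list of `names`. The sketch assumes that JSON has already been parsed into a map from field ID to names; `NameMappingFallback` and `assignFieldIds` are hypothetical names, and leaving unmapped columns out is an assumption of this sketch, not a statement about what the PR does.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not a class from Doris or Iceberg.
final class NameMappingFallback {
    /**
     * Assigns a field ID to each column of a data file that carries no IDs,
     * using an already-parsed name mapping (field ID -> accepted names).
     * Columns with no matching entry are left out of the result.
     */
    static Map<String, Integer> assignFieldIds(List<String> fileColumnNames,
                                               Map<Integer, List<String>> nameMapping) {
        Map<String, Integer> assigned = new LinkedHashMap<>();
        for (String column : fileColumnNames) {
            for (Map.Entry<Integer, List<String>> entry : nameMapping.entrySet()) {
                if (entry.getValue().contains(column)) {
                    assigned.put(column, entry.getKey());
                    break; // first matching entry wins in this sketch
                }
            }
        }
        return assigned;
    }
}
```

With a mapping of `1 -> ["id"]` and `2 -> ["name", "full_name"]`, a legacy file with columns `id`, `full_name`, and `extra` gets IDs 1 and 2 for the first two columns, while `extra` stays unmapped. Once IDs are assigned this way, the rest of the read path can match columns by field ID as it does for native Iceberg files.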
Reduce Thrift metadata overhead during table reads. To optimize read performance, this PR reduces the amount of Thrift metadata transmitted when reading Iceberg tables.
`History_schema_info` now includes only the columns required for the read, including pruning of nested fields within complex types.
Release note
- Handle Iceberg data files with missing field IDs.
- Fix equality delete handling for Iceberg tables with schema changes.
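As a rough illustration of the nested-field pruning mentioned in the problem summary, the sketch below trims a schema tree down to the fields a read needs. It is not the PR's implementation: `SchemaPruner`, its `Field` type, and the choice to keep a whole subtree once its root is required are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not a class from Doris or Iceberg.
final class SchemaPruner {
    /** A schema node: a field ID, a name, and child fields for complex types. */
    static final class Field {
        final int id;
        final String name;
        final List<Field> children;

        Field(int id, String name, List<Field> children) {
            this.id = id;
            this.name = name;
            this.children = children;
        }
    }

    /**
     * Returns a pruned copy of the field, or null when neither the field
     * nor any of its descendants is required.
     */
    static Field prune(Field field, Set<Integer> requiredIds) {
        if (requiredIds.contains(field.id)) {
            return field; // the field itself is required: keep its whole subtree
        }
        List<Field> kept = new ArrayList<>();
        for (Field child : field.children) {
            Field pruned = prune(child, requiredIds);
            if (pruned != null) {
                kept.add(pruned);
            }
        }
        return kept.isEmpty() ? null : new Field(field.id, field.name, kept);
    }
}
```

For a struct `address` with children `city` (ID 4) and `zip` (ID 5), pruning with required IDs `{4}` keeps `address` with only `city` beneath it, so the `zip` metadata never has to be serialized.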