[wip]feat: support column default values (initial-default / write-default)#731
Draft
huan233usc wants to merge 1 commit into
bb5f2d0 to
1b2be81
Compare
huan233usc
commented
Jun 12, 2026
      return std::cref(*write_default_);
    }

    SchemaField SchemaField::WithInitialDefault(Literal initial_default) const {
Author
An alternative to these `With*` methods is to set the defaults in the constructor.
f234320 to
e46342a
Compare
Implements Iceberg v3 column default values (apache#730, item 2 of apache#637):
- Schema model: SchemaField carries optional `initial-default` and `write-default` literals, with validation that defaults are primitive and match the field type.
- JSON serde: parse and write the two fields using single-value serialization (resolves the TODO in struct field serialization).
- Read path: Project() maps a missing column with an initial-default to FieldProjection::Kind::kDefault carrying the literal (for both required and optional columns, per spec), and the Parquet and Avro readers materialize it as a constant column via a new Literal-to-Arrow helper. This resolves the default-value TODO in schema projection.
- Schema evolution: Add*Column accept an optional default value (used as both initial-default and write-default); a required column with a default no longer needs AllowIncompatibleChanges(); RequireColumn() accepts columns added with a default in the same update (resolves the defaulted-add TODO); UpdateColumnDefault() updates the write-default of an existing column; doc/rename/type-promotion updates preserve defaults (promotion casts them to the new type).
- Format version gating: Schema::Validate() rejects schemas with default values below v3, using the existing kMinFormatVersionDefaultValues.

Writers consume complete Arrow arrays, so applying write-default to omitted columns remains the engine's responsibility (as in Java); the library stores, validates, and serializes it.
e46342a to
1613523
Compare
Closes #730 (item 2 of #637).
Implements Iceberg v3 column default values: `initial-default` / `write-default` on the schema, JSON serde, read-path application, schema-evolution support, and format-version validation. Semantics follow the Java reference implementation (`Types.NestedField` / `SchemaParser` / `SchemaUpdate` / `Schema.checkCompatibility`).
What changed
Schema model
- `SchemaField` carries `initial_default` / `write_default` literals, stored as `std::shared_ptr<const Literal>`: an immutable payload shared across field copies, the same pattern as the adjacent `type_` member (and the C++ analog of Java's `final Literal<?>` reference). `literal.h` cannot be included from `schema_field.h` due to the `literal.h -> type.h -> schema_field.h` include cycle, and sharing keeps `SchemaField` copies cheap.
- Accessors return `std::optional<std::reference_wrapper<const Literal>>`, the same optional-reference idiom as `Schema::FindFieldByName`, mirroring Java's nullable `Literal<?>` on `Types.NestedField`; the `WithInitialDefault` / `WithWriteDefault` copy-modifiers follow the style of `AsRequired` / `AsOptional`.
- `SchemaField::Validate()` checks that defaults are primitive literals matching the field type (Java: `castDefault` rejects defaults on nested types); `Schema::Validate(format_version)` rejects `initial-default` below v3 with the same message wording as Java's `Schema.checkCompatibility`, using the previously-unused `TableMetadata::kMinFormatVersionDefaultValues` (resolves the TODO there).
JSON serde
- `FieldFromJson` / `ToJson(SchemaField)` parse and write `initial-default` / `write-default` using the existing single-value serialization (`LiteralFromJson(json, type)`), mirroring Java's `SchemaParser` + `SingleValueParser`, and resolving the `add default values` TODO in struct serialization. All primitive types are supported (incl. decimal, fixed, uuid, temporal).
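The storage and accessor pattern described under Schema model can be illustrated with a minimal, self-contained sketch. `ToyLiteral` and `ToyField` are hypothetical stand-ins invented for this example; they are not the library's `Literal` or `SchemaField`, and only the general idiom (shared immutable payload, optional-reference accessor, copy-modifier) is taken from the description.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <optional>
#include <string>
#include <utility>
#include <variant>

// Hypothetical stand-in for the library's Literal; only two value types here.
using ToyLiteral = std::variant<long long, std::string>;

// Hypothetical stand-in for SchemaField, reduced to a name and one default.
class ToyField {
 public:
  using DefaultRef = std::optional<std::reference_wrapper<const ToyLiteral>>;

  explicit ToyField(std::string name) : name_(std::move(name)) {}

  const std::string& name() const { return name_; }

  // Optional-reference accessor: empty when no default is set.
  DefaultRef initial_default() const {
    if (!initial_default_) return std::nullopt;
    return std::cref(*initial_default_);
  }

  // Copy-modifier: returns a new field and leaves *this untouched.
  ToyField WithInitialDefault(ToyLiteral value) const {
    ToyField copy = *this;
    copy.initial_default_ = std::make_shared<ToyLiteral>(std::move(value));
    return copy;
  }

  // True when both fields hold the same payload pointer (or both hold none).
  bool SharesDefaultWith(const ToyField& other) const {
    return initial_default_ == other.initial_default_;
  }

 private:
  std::string name_;
  // Immutable payload shared across copies, so copying a field stays cheap.
  std::shared_ptr<const ToyLiteral> initial_default_;
};
```

Copying a `ToyField` copies only the pointer, which is the property the description relies on to keep field copies cheap.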
Read path (`initial-default`)
- `Project()` maps a column missing from a data file to `FieldProjection::Kind::kDefault` carrying the literal when the field has an `initial-default`, for both required and optional fields, per the spec. All four projection/decoding paths are covered: `schema_util.cc` (resolving its default-value TODO), `parquet_schema_util.cc` (used by the Parquet reader), `avro_schema_util.cc`, and the Avro direct decoder.
- `iceberg::arrow` helpers (`literal_util`) convert a `Literal` to an Arrow scalar / constant array; the Parquet reader materializes `kDefault` via `MakeDefaultArray` and the Avro reader via `AppendDefaultToBuilder`.
Schema evolution (`write-default`)
- `AddColumn` / `AddRequiredColumn` accept an optional `default_value`, used as both the `initial-default` and `write-default` of the new column. Like Java's `Types.NestedField.castDefault`, the value is cast to the column type rather than rejected (failing only if it cannot be cast). A required column with a default does not need `AllowIncompatibleChanges()`.
- `RequireColumn()` accepts a column added with a default in the same update (resolves the defaulted-add TODO in `UpdateColumnRequirementInternal`).
- `UpdateColumnDefault()` updates the `write-default` of an existing column (`initial-default` stays fixed once the column exists), matching Java's `updateColumnDefault`.
- `UpdateColumnDoc` / `RenameColumn` / `UpdateColumn` and the `ApplyChangesVisitor` preserve defaults when reconstructing fields; type promotion casts the defaults to the new type.
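To make the interplay between the two defaults concrete, here is a small toy model of the rules listed above: adding a column with a default sets both values, a later default update touches only the write-default, and a reader substitutes the initial-default for files that predate the column. All names (`ToyColumn`, `AddColumnWithDefault`, `ResolveOnRead`) are hypothetical and use plain `int` values; this is a sketch of the semantics, not of the library's API.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Hypothetical toy model; none of these names are the library's API.
struct ToyColumn {
  std::string name;
  std::optional<int> initial_default;  // fixed once the column exists
  std::optional<int> write_default;    // may be changed later
};

// Adding a column with a default sets both defaults to the same value.
ToyColumn AddColumnWithDefault(std::string name, int value) {
  return ToyColumn{std::move(name), value, value};
}

// Updating the default touches only write_default.
ToyColumn UpdateColumnDefault(ToyColumn column, int new_value) {
  column.write_default = new_value;
  return column;
}

// Read-side resolution for one row: a file that predates the column has no
// stored value, so the reader substitutes initial_default (null if unset).
std::optional<int> ResolveOnRead(const ToyColumn& column,
                                 const std::optional<int>& stored,
                                 bool file_has_column) {
  if (file_has_column) return stored;
  return column.initial_default;
}
```

In this model, rows from old files keep reading the original default even after the write-default changes, which is why the initial-default has to stay fixed.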
Scope note: write-path application
Writers in this library consume complete Arrow arrays, so filling omitted columns with `write-default` at write time remains the engine's responsibility, as in Java. The library's role (storing, validating, and serializing the defaults, and exposing them through schema evolution) is covered here.
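As an illustration of what that engine-side responsibility might look like, the sketch below fills any column the caller omitted before the batch is handed to a writer. It is an assumption-laden toy: plain `std::vector` columns stand in for Arrow arrays, and `ToyBatch`, `ToyWriteColumn`, and `FillOmittedColumns` are hypothetical names that do not exist in this library.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical engine-side types; plain vectors stand in for Arrow arrays.
using ToyBatch = std::map<std::string, std::vector<std::optional<int>>>;

struct ToyWriteColumn {
  std::string name;
  std::optional<int> write_default;
};

// Returns a batch that has every schema column. Columns the caller omitted are
// filled with write_default, or with nulls when no write_default is set.
ToyBatch FillOmittedColumns(ToyBatch batch, std::size_t num_rows,
                            const std::vector<ToyWriteColumn>& schema) {
  for (const ToyWriteColumn& column : schema) {
    if (batch.count(column.name) == 0) {
      batch[column.name] =
          std::vector<std::optional<int>>(num_rows, column.write_default);
    }
  }
  return batch;
}
```

Columns already present in the batch are passed through unchanged; only the missing ones are synthesized.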
Testing
- `literal_util`: dedicated unit tests converting every primitive type to Arrow scalars (incl. negative decimals, uuid, all timestamp variants), constant-array materialization, builder append, and rejection of null/sentinel literals.
- `Literal::CastTo` signals narrowing with `AboveMax` / `BelowMin` sentinels, which are rejected when setting a default (e.g. a long default too large for an int column).
- `Schema::Validate`: v2 rejects `initial-default`, v3 accepts; mismatched default type rejected.
- `initial-default` -> `kDefault`; present fields ignore `initial-default`.
- `ProjectRecordBatch` and Avro `AppendDatumToBuilder`: missing columns materialize the default at top level and in nested structs.
- back through the real readers with an evolved schema carrying initial-defaults, and verify every row is filled (`ReadMissingFieldsWithDefaults` in both reader tests).
- `UpdateSchema`: add with default (both defaults set), required-with-default without `AllowIncompatibleChanges()`, uncastable default rejected, `UpdateColumnDefault` on added and pre-existing columns (incl. int->long cast), `RequireColumn` on defaulted add, doc-update preservation, type-promotion casting, and v2 rejection at `Apply()`; new `TableMetadataV3Valid.json` test resource.
- `file_io_test` is unrelated and fails only in my local environment due to a Homebrew AWS-SDK ABI issue; not touched by this change).
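The narrowing case mentioned in the tests (a long default too large for an int column) can be sketched as follows. The `CastOutcome` enum and both functions are hypothetical and only mimic the idea of above-max / below-min sentinels; the real `Literal::CastTo` signature and return type are not shown in this description and may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical result of a narrowing cast; the real Literal::CastTo may differ.
enum class CastOutcome { kValue, kAboveMax, kBelowMin };

struct CastResult {
  CastOutcome outcome;
  std::int32_t value;  // meaningful only when outcome == kValue
};

// Narrow a long to an int, reporting overflow with a sentinel outcome.
CastResult CastLongToInt(std::int64_t v) {
  if (v > std::numeric_limits<std::int32_t>::max()) {
    return {CastOutcome::kAboveMax, 0};
  }
  if (v < std::numeric_limits<std::int32_t>::min()) {
    return {CastOutcome::kBelowMin, 0};
  }
  return {CastOutcome::kValue, static_cast<std::int32_t>(v)};
}

// A default is accepted only when the cast produced a real value.
std::optional<std::int32_t> AcceptIntDefault(std::int64_t candidate) {
  CastResult r = CastLongToInt(candidate);
  if (r.outcome != CastOutcome::kValue) return std::nullopt;
  return r.value;
}
```

Treating a sentinel as a rejection, rather than clamping to the type's bounds, keeps an out-of-range default from silently changing value.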
This pull request and its description were written by Isaac.