[wip]feat: support column default values (initial-default / write-default) #731

Draft
huan233usc wants to merge 1 commit into apache:main from huan233usc:feat/column-default-values

Conversation

@huan233usc huan233usc commented Jun 11, 2026

Closes #730 (item 2 of #637).

Implements Iceberg v3 column default values: initial-default / write-default on the
schema, JSON serde, read-path application, schema-evolution support, and format-version
validation. Semantics follow the Java reference implementation
(Types.NestedField / SchemaParser / SchemaUpdate / Schema.checkCompatibility).

What changed

Schema model

  • SchemaField carries initial_default / write_default literals, stored as
    std::shared_ptr<const Literal> — an immutable payload shared across field copies,
    the same pattern as the adjacent type_ member (and the C++ analog of Java's
    final Literal<?> reference). literal.h cannot be included from schema_field.h
    due to the literal.h -> type.h -> schema_field.h include cycle, and sharing keeps
    SchemaField copies cheap. Accessors return
    std::optional<std::reference_wrapper<const Literal>>, the same optional-reference
    idiom as Schema::FindFieldByName, mirroring Java's nullable Literal<?> on
    Types.NestedField; WithInitialDefault / WithWriteDefault copy-modifiers follow
    the style of AsRequired/AsOptional.
  • SchemaField::Validate() checks that defaults are primitive literals matching the
    field type (Java: castDefault rejects defaults on nested types);
    Schema::Validate(format_version) rejects initial-default below v3 with the same
    message wording as Java's Schema.checkCompatibility, using the previously-unused
    TableMetadata::kMinFormatVersionDefaultValues (resolves the TODO there).
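
For readers unfamiliar with the optional-reference idiom, here is a minimal sketch of the storage and accessor shape described above. It is not the code in this PR: `Literal` is a hypothetical one-field stand-in, `FieldSketch` is an illustrative name, and `initial_default_ptr()` exists only so the sharing can be observed.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for iceberg::Literal; the real class is much richer.
struct Literal {
  long value;
};

// Illustrative field type (not the real SchemaField).
class FieldSketch {
 public:
  FieldSketch(int id, std::string name) : id_(id), name_(std::move(name)) {}

  int id() const { return id_; }
  const std::string& name() const { return name_; }

  // Optional-reference accessor: empty when no default is set.
  std::optional<std::reference_wrapper<const Literal>> initial_default() const {
    if (!initial_default_) return std::nullopt;
    return std::cref(*initial_default_);
  }

  // Copy-modifier: returns a new field and leaves the original untouched.
  FieldSketch WithInitialDefault(Literal value) const {
    FieldSketch copy = *this;
    copy.initial_default_ = std::make_shared<const Literal>(std::move(value));
    return copy;
  }

  // Exposed only so this sketch can show that copies share one payload.
  const Literal* initial_default_ptr() const { return initial_default_.get(); }

 private:
  int id_;
  std::string name_;
  std::shared_ptr<const Literal> initial_default_;  // immutable, shared by copies
};
```

Copying a `FieldSketch` copies only the pointer, so a field copy and its source refer to the same immutable literal.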

JSON serde

  • FieldFromJson / ToJson(SchemaField) parse and write initial-default /
    write-default using the existing single-value serialization
    (LiteralFromJson(json, type)), mirroring Java's SchemaParser +
    SingleValueParser, and resolving the "add default values" TODO in struct
    serialization. All primitive types are supported (incl. decimal, fixed, uuid,
    temporal).

Read path (initial-default)

  • Project() maps a column missing from a data file to
    FieldProjection::Kind::kDefault carrying the literal when the field has an
    initial-default — for required and optional fields, per the spec. All four
    projection/decoding paths are covered: schema_util.cc (resolving its default-value
    TODO), parquet_schema_util.cc (used by the Parquet reader), avro_schema_util.cc,
    and the Avro direct decoder.
  • New iceberg::arrow helpers (literal_util) convert a Literal to an Arrow scalar /
    constant array; the Parquet reader materializes kDefault via MakeDefaultArray and
    the Avro reader via AppendDefaultToBuilder.
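
The projection decision for a single field can be sketched roughly as follows. All types here are hypothetical simplifications (the default is a plain integer rather than a Literal, and `ProjectionKind` / `ProjectFieldSketch` are illustrative names). The handling of a missing field without a default (null for optional, error for required) is an assumption made for illustration, not something this description states.

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-ins, not the library's declarations.
enum class ProjectionKind { kProjected, kDefault, kNull };

struct ExpectedField {
  int id;
  bool optional;
  std::optional<long> initial_default;  // stand-in for a Literal
};

struct ProjectionSketch {
  ProjectionKind kind;
  std::optional<long> default_value;  // set only for kDefault
};

// Decide how one expected field is sourced from a data file that contains
// the given field ids.
ProjectionSketch ProjectFieldSketch(const ExpectedField& expected,
                                    const std::vector<int>& file_field_ids) {
  for (int id : file_field_ids) {
    // A field present in the file is read from it; initial-default is ignored.
    if (id == expected.id) return {ProjectionKind::kProjected, std::nullopt};
  }
  // Missing with an initial-default: required and optional behave the same.
  if (expected.initial_default) {
    return {ProjectionKind::kDefault, expected.initial_default};
  }
  // Assumption: a missing optional field without a default reads as null.
  if (expected.optional) return {ProjectionKind::kNull, std::nullopt};
  // Assumption: a missing required field without a default is an error.
  throw std::invalid_argument("missing required field " +
                              std::to_string(expected.id));
}
```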

Schema evolution (write-default)

  • AddColumn / AddRequiredColumn accept an optional default_value, used as both the
    initial-default and write-default of the new column. Like Java's
    Types.NestedField.castDefault, the value is cast to the column type rather than
    rejected (failing only if it cannot be cast). A required column with a default does
    not need AllowIncompatibleChanges().
  • RequireColumn() accepts a column added with a default in the same update (resolves
    the defaulted-add TODO in UpdateColumnRequirementInternal).
  • New UpdateColumnDefault() updates the write-default of an existing column
    (initial-default stays fixed once the column exists), matching Java's
    updateColumnDefault.
  • UpdateColumnDoc / RenameColumn / UpdateColumn and the ApplyChangesVisitor
    preserve defaults when reconstructing fields; type promotion casts the defaults to
    the new type.
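
The relationship between the two defaults during evolution can be illustrated with a small sketch. `ColumnSketch`, `AddColumnWithDefault`, and `UpdateWriteDefault` are hypothetical names, and the defaults are plain integers. The real AddColumn / UpdateColumnDefault operate on a pending schema update rather than on a bare struct.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Hypothetical, simplified column; defaults are plain integers here.
struct ColumnSketch {
  std::string name;
  bool required;
  std::optional<long> initial_default;
  std::optional<long> write_default;
};

// Adding a column with a default seeds both defaults with the same value.
ColumnSketch AddColumnWithDefault(std::string name, bool required,
                                  long default_value) {
  return ColumnSketch{std::move(name), required, default_value, default_value};
}

// Updating the default of an existing column changes only write-default;
// initial-default stays fixed once the column exists.
ColumnSketch UpdateWriteDefault(ColumnSketch column, long new_default) {
  column.write_default = new_default;
  return column;
}
```

Under this model, files written before the column existed keep reading the original initial-default, while new writes pick up the updated write-default.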

Scope note: write-path application

Writers in this library consume complete Arrow arrays, so filling omitted columns with
write-default at write time remains the engine's responsibility, as in Java. The
library's role — storing, validating, serializing the defaults, and exposing them
through schema evolution — is covered here.
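
As a rough illustration of the engine-side responsibility, the sketch below fills an omitted column with a constant before handing data to a writer. `Int64Column` and `CompleteColumn` are hypothetical, and a real engine would build an Arrow array rather than a `std::vector`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical engine-side column: one int64 value per row.
using Int64Column = std::vector<int64_t>;

// If the incoming batch omitted the column, build a constant column from the
// write-default so the writer still receives a complete array.
Int64Column CompleteColumn(const std::optional<Int64Column>& provided,
                           int64_t write_default, std::size_t num_rows) {
  if (provided) return *provided;
  return Int64Column(num_rows, write_default);
}
```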

Testing

  • literal_util: dedicated unit tests converting every primitive type to Arrow
    scalars (incl. negative decimals, uuid, all timestamp variants), constant-array
    materialization, builder append, and rejection of null/sentinel literals.
  • Defaults are validated to be non-null, non-sentinel values: Literal::CastTo
    signals narrowing with AboveMax/BelowMin sentinels, which are rejected when
    setting a default (e.g. a long default too large for an int column).
  • Schema serde round-trips (top-level + nested struct fields, mismatch rejection).
  • Schema::Validate: v2 rejects initial-default, v3 accepts; mismatched default
    type rejected.
  • Projection: missing required/optional fields with initial-default -> kDefault;
    present fields ignore initial-default.
  • Parquet ProjectRecordBatch and Avro AppendDatumToBuilder: missing columns
    materialize the default at top level and in nested structs.
  • File-level end-to-end: write a Parquet/Avro file with the original schema, read it
    back through the real readers with an evolved schema carrying initial-defaults, and
    verify every row is filled (ReadMissingFieldsWithDefaults in both reader tests).
  • UpdateSchema: add with default (both defaults set), required-with-default without
    AllowIncompatibleChanges(), uncastable default rejected, UpdateColumnDefault on
    added and pre-existing columns (incl. int->long cast), RequireColumn on defaulted
    add, doc-update preservation, type-promotion casting, and v2 rejection at Apply();
    new TableMetadataV3Valid.json test resource.
  • Full suite passing locally (the pre-existing S3 file_io_test is unrelated and fails
    only in my local environment due to a Homebrew AWS-SDK ABI issue; not touched by this
    change).
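
The narrowing check mentioned in the testing notes (a long default too large for an int column) can be sketched as below. This is an illustrative model only: `CastStatus`, `CastResult`, and both functions are hypothetical, and the library is described as signalling the same conditions through AboveMax/BelowMin sentinel literals from Literal::CastTo rather than a status enum.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical status for a narrowing cast (the library uses sentinel
// literals instead of an enum).
enum class CastStatus { kOk, kAboveMax, kBelowMin };

struct CastResult {
  CastStatus status;
  int32_t value;  // meaningful only when status == kOk
};

// Cast a long value to int, reporting out-of-range input instead of wrapping.
CastResult CastLongToInt(int64_t v) {
  if (v > std::numeric_limits<int32_t>::max()) return {CastStatus::kAboveMax, 0};
  if (v < std::numeric_limits<int32_t>::min()) return {CastStatus::kBelowMin, 0};
  return {CastStatus::kOk, static_cast<int32_t>(v)};
}

// Accept a default for an int column only if it casts to an in-range value.
std::optional<int32_t> DefaultForIntColumn(int64_t requested) {
  CastResult r = CastLongToInt(requested);
  if (r.status != CastStatus::kOk) return std::nullopt;
  return r.value;
}
```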

This pull request and its description were written by Isaac.

@huan233usc huan233usc marked this pull request as draft June 11, 2026 21:07
@huan233usc huan233usc changed the title feat: support column default values (initial-default / write-default) [wip]feat: support column default values (initial-default / write-default) Jun 11, 2026
@huan233usc huan233usc force-pushed the feat/column-default-values branch 5 times, most recently from bb5f2d0 to 1b2be81 Compare June 12, 2026 07:24
return std::cref(*write_default_);
}

SchemaField SchemaField::WithInitialDefault(Literal initial_default) const {

An alternative to these With* methods is to set the defaults in the constructor.

@huan233usc huan233usc force-pushed the feat/column-default-values branch 6 times, most recently from f234320 to e46342a Compare June 12, 2026 20:15
Implements Iceberg v3 column default values (apache#730, item 2 of apache#637):

- Schema model: SchemaField carries optional `initial-default` and
  `write-default` literals, with validation that defaults are primitive
  and match the field type.
- JSON serde: parse and write the two fields using single-value
  serialization (resolves the TODO in struct field serialization).
- Read path: Project() maps a missing column with an initial-default to
  FieldProjection::Kind::kDefault carrying the literal (for both required
  and optional columns, per spec), and the Parquet and Avro readers
  materialize it as a constant column via a new Literal-to-Arrow helper.
  This resolves the default-value TODO in schema projection.
- Schema evolution: Add*Column accept an optional default value (used as
  both initial-default and write-default); a required column with a
  default no longer needs AllowIncompatibleChanges(); RequireColumn()
  accepts columns added with a default in the same update (resolves the
  defaulted-add TODO); UpdateColumnDefault() updates the write-default of
  an existing column; doc/rename/type-promotion updates preserve defaults
  (promotion casts them to the new type).
- Format version gating: Schema::Validate() rejects schemas with default
  values below v3, using the existing kMinFormatVersionDefaultValues.

Writers consume complete Arrow arrays, so applying write-default to
omitted columns remains the engine's responsibility (as in Java); the
library stores, validates, and serializes it.
@huan233usc huan233usc force-pushed the feat/column-default-values branch from e46342a to 1613523 Compare June 12, 2026 20:21
Development

Successfully merging this pull request may close these issues.

Support column default values (initial-default / write-default)
