parquet_encode: add default_timestamp_unit field#4294

Open
ankit481 wants to merge 1 commit into redpanda-data:main from ankit481:feat/parquet-encode-timestamp-unit
Conversation

@ankit481
Summary

Adds a new default_timestamp_unit configuration field to the parquet_encode processor, accepting NANOSECOND (default, preserves existing behaviour), MICROSECOND, or MILLISECOND. The unit is applied to both the static schema path and the dynamic schema_metadata path (used by CDC inputs such as mysql_cdc).

Resolves the long-standing TODO referenced at #3570.
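A minimal configuration sketch of the new field (the surrounding pipeline and schema entries are illustrative; only the `parquet_encode` processor name and the `default_timestamp_unit` field come from this change):

```yaml
pipeline:
  processors:
    - parquet_encode:
        # New field added by this PR; omitting it keeps the existing
        # NANOSECOND behaviour.
        default_timestamp_unit: MICROSECOND
        schema:
          # Hypothetical column for illustration.
          - name: created_at
            type: TIMESTAMP
```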

Motivation

parquet_encode currently hardcodes INT64 (TIMESTAMP(NANOS, true)) for every TIMESTAMP column, in both code paths (processor_encode.go:159 and :369). This produces files that Apache Spark, Databricks, AWS Athena and DuckDB cannot read — they fail with:

[PARQUET_TYPE_ILLEGAL] Illegal Parquet type: INT64 (TIMESTAMP(NANOS,true))

Spark's legacy compatibility flags for this were removed in recent versions, so there is no reader-side workaround on modern Databricks runtimes. The dynamic-schema path is particularly impacted because mysql_cdc sets schema_metadata immutably, and Bloblang cannot mutate the schema node passed into parquet_encode — leaving affected users with no in-pipeline escape hatch.

Note: MySQL DATETIME/TIMESTAMP source precision is microseconds, so nanosecond encoding is structurally unnecessary for MySQL CDC pipelines — no data is lost by downcasting.

Design

  • Default remains NANOSECOND to preserve existing behaviour. Zero impact on existing users.
  • New field is Advanced()-tagged and applies uniformly to both schema construction paths.
  • The existing Iceberg integration (internal/impl/iceberg/icebergx/parquet.go:146-148) already uses parquet.Timestamp(parquet.Microsecond); this change brings the general encode processor to parity with it.
Test plan

  • Existing tests pass (go test ./internal/impl/parquet/...)
  • New TestParquetEncodeTimestampUnit verifies all three units on the static schema path, inspecting the encoded file's schema string.
  • New TestParquetEncodeTimestampUnitDynamicSchema verifies the unit flows through the schema_metadata path used by CDC inputs.
  • New TestParquetEncodeTimestampUnitInvalid verifies config validation rejects unknown values.
  • go build ./... succeeds.
  • docs_gen re-run; generated .adoc updated.
  • CHANGELOG.md updated under Unreleased.

Adds a new `default_timestamp_unit` configuration field to the
`parquet_encode` processor, accepting `NANOSECOND` (default, preserves
existing behaviour), `MICROSECOND`, or `MILLISECOND`.

The unit is applied to both the static schema path and the dynamic
`schema_metadata` path (used by CDC inputs such as `mysql_cdc`).

`TIMESTAMP(NANOS)` is not readable by Apache Spark / Databricks, AWS
Athena or DuckDB; this field unblocks those consumers without requiring
a pre-encoding transform. MySQL sources additionally cannot exceed
microsecond precision, so `MICROSECOND` is lossless for CDC pipelines.

Resolves the long-standing TODO referenced at
redpanda-data#3570.
@CLAassistant
CLAassistant commented Apr 17, 2026

CLA assistant check
All committers have signed the CLA.
