parquet_encode: add default_timestamp_unit field #4294
Open
ankit481 wants to merge 1 commit into redpanda-data:main
Adds a new `default_timestamp_unit` configuration field to the `parquet_encode` processor, accepting `NANOSECOND` (default, preserves existing behaviour), `MICROSECOND`, or `MILLISECOND`. The unit is applied to both the static schema path and the dynamic `schema_metadata` path (used by CDC inputs such as `mysql_cdc`). `TIMESTAMP(NANOS)` is not readable by Apache Spark / Databricks, AWS Athena or DuckDB; this field unblocks those consumers without requiring a pre-encoding transform. MySQL sources additionally cannot exceed microsecond precision, so `MICROSECOND` is lossless for CDC pipelines. Resolves the long-standing TODO referenced at redpanda-data#3570.
Summary
Adds a new `default_timestamp_unit` configuration field to the `parquet_encode` processor, accepting `NANOSECOND` (default, preserves existing behaviour), `MICROSECOND`, or `MILLISECOND`. The unit is applied to both the static schema path and the dynamic `schema_metadata` path (used by CDC inputs such as `mysql_cdc`).

Resolves the long-standing TODO referenced at #3570.
Motivation
`parquet_encode` currently hardcodes `INT64 (TIMESTAMP(NANOS, true))` for every TIMESTAMP column, in both code paths (`processor_encode.go:159` and `:369`). This produces files that Apache Spark, Databricks, AWS Athena and DuckDB cannot read.

Spark's legacy compatibility flags for this were removed in recent versions, so there is no reader-side workaround on modern Databricks runtimes. The dynamic-schema path is particularly impacted because `mysql_cdc` sets `schema_metadata` immutably, and Bloblang cannot mutate the schema node passed into `parquet_encode`, leaving affected users with no in-pipeline escape hatch.

Note: MySQL `DATETIME`/`TIMESTAMP` source precision is microseconds, so nanosecond encoding is structurally unnecessary for MySQL CDC pipelines; no data is lost by downcasting.

Design
- Defaults to `NANOSECOND` to preserve existing behaviour. Zero impact on existing users.
- The field is `Advanced()`-tagged and applies uniformly to both schema construction paths.
- The Iceberg code (`internal/impl/iceberg/icebergx/parquet.go:146-148`) already uses `parquet.Timestamp(parquet.Microsecond)`; this change lets the general encode processor match it.

Test plan
- Unit tests (`go test ./internal/impl/parquet/...`)
- `TestParquetEncodeTimestampUnit` verifies all three units on the static schema path, inspecting the encoded file's schema string.
- `TestParquetEncodeTimestampUnitDynamicSchema` verifies the unit flows through the `schema_metadata` path used by CDC inputs.
- `TestParquetEncodeTimestampUnitInvalid` verifies config validation rejects unknown values.
- `go build ./...` succeeds.
- `docs_gen` re-run; generated `.adoc` updated.
- `CHANGELOG.md` updated under Unreleased.