parquet_encode: add default_timestamp_unit field#4294

Open
ankit481 wants to merge 1 commit into redpanda-data:main from ankit481:feat/parquet-encode-timestamp-unit
Conversation

@ankit481
Summary

Adds a new default_timestamp_unit configuration field to the parquet_encode processor, accepting NANOSECOND (default, preserves existing behaviour), MICROSECOND, or MILLISECOND. The unit is applied to both the static schema path and the dynamic schema_metadata path (used by CDC inputs such as mysql_cdc).

Resolves the long-standing TODO referenced at #3570.
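A minimal configuration sketch of the new field (the surrounding pipeline and schema entries are illustrative; only the `parquet_encode` processor name and the `default_timestamp_unit` field come from this change):

```yaml
pipeline:
  processors:
    - parquet_encode:
        # New field added by this PR; omitting it keeps the existing
        # NANOSECOND behaviour.
        default_timestamp_unit: MICROSECOND
        schema:
          # Hypothetical column for illustration.
          - name: created_at
            type: TIMESTAMP
```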

Motivation

parquet_encode currently hardcodes INT64 (TIMESTAMP(NANOS, true)) for every TIMESTAMP column, in both code paths (processor_encode.go:159 and :369). This produces files that Apache Spark, Databricks, AWS Athena and DuckDB cannot read — they fail with:

[PARQUET_TYPE_ILLEGAL] Illegal Parquet type: INT64 (TIMESTAMP(NANOS,true))

Spark's legacy compatibility flags for this were removed in recent versions, so there is no reader-side workaround on modern Databricks runtimes. The dynamic-schema path is particularly impacted because mysql_cdc sets schema_metadata immutably, and Bloblang cannot mutate the schema node passed into parquet_encode — leaving affected users with no in-pipeline escape hatch.

Note: MySQL DATETIME/TIMESTAMP source precision is microseconds, so nanosecond encoding is structurally unnecessary for MySQL CDC pipelines — no data is lost by downcasting.

Design

  • Default remains NANOSECOND to preserve existing behaviour. Zero impact on existing users.
  • New field is Advanced()-tagged and applies uniformly to both schema construction paths.
  • The existing Iceberg integration (internal/impl/iceberg/icebergx/parquet.go:146-148) already uses parquet.Timestamp(parquet.Microsecond); this change brings the general encode processor to parity with it.
Test plan

  • Existing tests pass (go test ./internal/impl/parquet/...)
  • New TestParquetEncodeTimestampUnit verifies all three units on the static schema path, inspecting the encoded file's schema string.
  • New TestParquetEncodeTimestampUnitDynamicSchema verifies the unit flows through the schema_metadata path used by CDC inputs.
  • New TestParquetEncodeTimestampUnitInvalid verifies config validation rejects unknown values.
  • go build ./... succeeds.
  • docs_gen re-run; generated .adoc updated.
  • CHANGELOG.md updated under Unreleased.

Adds a new `default_timestamp_unit` configuration field to the
`parquet_encode` processor, accepting `NANOSECOND` (default, preserves
existing behaviour), `MICROSECOND`, or `MILLISECOND`.

The unit is applied to both the static schema path and the dynamic
`schema_metadata` path (used by CDC inputs such as `mysql_cdc`).

`TIMESTAMP(NANOS)` is not readable by Apache Spark / Databricks, AWS
Athena or DuckDB; this field unblocks those consumers without requiring
a pre-encoding transform. MySQL sources additionally cannot exceed
microsecond precision, so `MICROSECOND` is lossless for CDC pipelines.

Resolves the long-standing TODO referenced at
redpanda-data#3570.
@CLAassistant
CLAassistant commented Apr 17, 2026

CLA assistant check
All committers have signed the CLA.
