[SPARK-57102][SQL] Support nanosecond-precision timestamps in the Parquet data source by stevomitric · Pull Request #56407 · apache/spark

stevomitric · 2026-06-09T14:37:25Z

What changes were proposed in this pull request?

This PR adds read and write support for the nanosecond-capable timestamp types TimestampNTZNanosType(p) / TimestampLTZNanosType(p) (precision p in [7, 9], from the SPIP [SPARK-56822]) in the built-in Parquet data source, gated behind the existing preview flag spark.sql.timestampNanosTypes.enabled.

- Schema conversion (ParquetSchemaConverter, both directions):
  - Write: TimestampLTZNanosType / TimestampNTZNanosType -> INT64 annotated TIMESTAMP(NANOS, isAdjustedToUTC) (isAdjustedToUTC = true for LTZ, false for NTZ).
  - Read: INT64 + TIMESTAMP(NANOS, ...) -> TimestampLTZNanosType(9) / TimestampNTZNanosType(9). Parquet's NANOS unit carries no precision parameter, so reads mint the canonical precision 9. The legacy spark.sql.legacy.parquet.nanosAsLong path keeps precedence and is unchanged.
- Read values (non-vectorized / row-based reader, ParquetRowConverter): an INT64 epoch-nanoseconds value is split into epochMicros = floorDiv(v, 1000) and nanosWithinMicro = floorMod(v, 1000) and stored as TimestampNanosVal. TIMESTAMP(NANOS) values are exempt from datetime rebasing on both read and write: the NANOS unit postdates the legacy hybrid-calendar writers, so such files are always proleptic Gregorian (the spark.sql.parquet.datetimeRebaseModeIn{Read,Write} configs only cover DATE / TIMESTAMP_MILLIS / TIMESTAMP_MICROS).
- Write values (ParquetWriteSupport): a TimestampNanosVal is written as INT64 epoch-nanoseconds using exact arithmetic (Math.addExact(Math.multiplyExact(epochMicros, 1000), nanosWithinMicro)); values outside the representable INT64 epoch-nanosecond range (~1677-09-21 .. 2262-04-11) fail instead of silently wrapping.
- The Parquet supportDataType guards (V1 ParquetFileFormat and V2 ParquetTable) are relaxed to accept the nanos types, and the feature flag is propagated to the read Hadoop configuration in both the V1 and V2 paths.
- The nanos types are excluded from ParquetUtils.isBatchReadSupported, so columnar reads transparently fall back to the row-based reader. Vectorized-reader support is a follow-up.
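
The read/write value arithmetic described above can be sketched in plain Scala. This is a minimal illustration, not the code in the PR: `NanosVal` is a hypothetical stand-in for Spark's TimestampNanosVal, and the object and method names are made up for the example. Only the `java.lang.Math` and `java.time.Instant` calls are real APIs.

```scala
import java.time.Instant

// Illustrative sketch only; NanosVal is a hypothetical stand-in for TimestampNanosVal.
object EpochNanosSketch {
  // Microseconds since the epoch plus the extra nanoseconds within that microsecond.
  final case class NanosVal(epochMicros: Long, nanosWithinMicro: Int)

  // Read direction: split an INT64 epoch-nanoseconds value with floor semantics,
  // so the sub-microsecond part stays in [0, 999] even for pre-epoch instants.
  def fromEpochNanos(v: Long): NanosVal =
    NanosVal(Math.floorDiv(v, 1000L), Math.floorMod(v, 1000L).toInt)

  // Write direction: recombine with exact arithmetic so that an out-of-range value
  // throws ArithmeticException instead of silently wrapping around.
  def toEpochNanos(t: NanosVal): Long =
    Math.addExact(Math.multiplyExact(t.epochMicros, 1000L), t.nanosWithinMicro.toLong)

  // Interpret an epoch-nanoseconds value as an Instant, e.g. to inspect the
  // boundaries of the representable range (Long.MinValue and Long.MaxValue).
  def toInstant(epochNanos: Long): Instant = Instant.ofEpochSecond(0L, epochNanos)
}
```

With floor semantics, an input of -1 ns splits into epochMicros = -1 and nanosWithinMicro = 999, whereas truncating division would yield 0 and -1. Calling `toInstant` on `Long.MinValue` and `Long.MaxValue` gives instants on 1677-09-21 and 2262-04-11, matching the range quoted above.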

Spark-written files round-trip the exact type (including precision) via the Spark schema stored in the Parquet key-value metadata; "foreign" files with no Spark metadata (e.g. produced by Trino/DuckDB/pandas) derive the nanos type from the Parquet annotation.
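
As a rough model of the type-resolution rule in the paragraph above, the following sketch picks the precision from Spark metadata when it is present and falls back to the canonical precision 9 otherwise. Everything here is hypothetical (the `LtzNanos` / `NtzNanos` case classes, `readType`, and the `Option[Int]` parameter); the real reader works on full schemas rather than a single optional precision.

```scala
// Simplified, hypothetical model of how the read type could be chosen.
object NanosReadTypeSketch {
  sealed trait NanosType { def precision: Int }
  final case class LtzNanos(precision: Int) extends NanosType
  final case class NtzNanos(precision: Int) extends NanosType

  // Parquet's NANOS unit carries no precision, so foreign files get precision 9.
  val CanonicalPrecision = 9

  // sparkMetadataPrecision is Some(p) for Spark-written files (assumed to carry
  // the exact type in key-value metadata) and None for foreign files.
  def readType(isAdjustedToUTC: Boolean, sparkMetadataPrecision: Option[Int]): NanosType = {
    val p = sparkMetadataPrecision.getOrElse(CanonicalPrecision)
    require(p >= 7 && p <= 9, s"precision must be in [7, 9], got $p")
    if (isAdjustedToUTC) LtzNanos(p) else NtzNanos(p)
  }
}
```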

Why are the changes needed?

Nanosecond-precision timestamps are common in data produced by pandas/PyArrow, Trino, ClickHouse, DuckDB, and similar systems. Spark currently rejects Parquet INT64 TIMESTAMP(NANOS) (PARQUET_TYPE_ILLEGAL), or, with spark.sql.legacy.parquet.nanosAsLong=true, reads it as a raw LongType that drops all timestamp and time-zone semantics. This PR lets Spark read and write such data as first-class nanosecond timestamp types, as part of the SPIP [SPARK-56822] "Timestamps with nanosecond precision".

Does this PR introduce any user-facing change?

Yes, behind the preview flag spark.sql.timestampNanosTypes.enabled (default off in production). When the flag is enabled:

- Parquet files with INT64 TIMESTAMP(NANOS, isAdjustedToUTC=true/false) are read as TimestampLTZNanosType(9) / TimestampNTZNanosType(9) instead of being rejected.
- Columns of these types can be written to Parquet (as INT64 TIMESTAMP(NANOS)).

When the flag is off, behavior is unchanged, including the legacy spark.sql.legacy.parquet.nanosAsLong escape hatch.

How was this patch tested?

New ParquetTimestampNanosSuite covering:
- Spark write/read round-trip preserving value and precision at p = 7, 8, 9 (vectorized reader on and off);
- reading "foreign" TIMESTAMP(NANOS) files written directly via parquet-mr for both NTZ and LTZ, including nulls and a pre-epoch (negative) instant that exercises floor semantics;
- nanosAsLong precedence;
- the disabled-feature error pinned via checkError (PARQUET_TYPE_ILLEGAL, both isAdjustedToUTC values);
- an out-of-INT64-range write failing for both NTZ and LTZ;
- rebase-mode invariance (a foreign pre-1883 TIMESTAMP(NANOS) file reads identically under EXCEPTION / CORRECTED / LEGACY);
- a nested (array) column round-trip;
- a V2 file-source round-trip.

Schema-level conversion cases added to ParquetSchemaSuite (round-trip for both nanos types at p = 9, write direction at p = 7, and nanosAsLong taking precedence over the nanos types; the testParquetToCatalyst/testSchema helpers gained a timestampNanosTypesEnabled flag).

Existing tests updated: SPARK-40819 (pin the feature off to keep asserting the legacy reject path) and SPARK-57166 (drop Parquet, which is now supported).

scalastyle passes.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Anthropic Claude Opus 4.8)

…and fix affected tests

Extends Parquet nanosecond-timestamp support to the V2 file source (relax the ParquetTable type guard and propagate spark.sql.timestampNanosTypes.enabled in ParquetScan), and updates two existing tests the feature changes:
- SPARK-40819 pins the feature off to keep asserting the legacy TIMESTAMP(NANOS) reject path (the flag defaults on under tests);
- SPARK-57166 drops Parquet from the unsupported-datasource list.

Adds a V2 round-trip test to ParquetTimestampNanosSuite.

Co-authored-by: Isaac

stevomitric · 2026-06-09T16:03:35Z

cc @MaxGekk PTAL.

MaxGekk

1 blocking, 3 non-blocking, 1 nit.

Design / architecture (1): write/read rebase asymmetry for TimestampLTZNanosType — see inline comment on ParquetWriteSupport.scala:278.
Correctness (2): weak disabled-feature assertion; overflow test covers NTZ only — see inline comments.
Suggestions (1): add schema-unit tests to ParquetSchemaSuite — see inline comment.
Nits (1): displaced inShredded comment — see inline comment.

… on read, strengthen tests Co-authored-by: Isaac

MaxGekk

5 addressed, 0 remaining, 2 new. (2 late catches — my own round-1 misses, not regressions.)
All round-1 findings resolved cleanly. The rebase exemption is now symmetric on read and write, documented inline, and pinned by the 3-mode invariance test — I independently verified the 1800-01-01 test value and the rebase-config coverage claim against SQLConf.

Suggestions (1)

ParquetWriteSupport.scala:192: out-of-range write fails as raw ArithmeticException: long overflow — the existing DATETIME_OVERFLOW error condition would fit; see inline

Nits: 1 minor item (see inline comment).

stevomitric · 2026-06-14T17:24:03Z

cc @uros-b PTAL when you get a chance.

uros-b · 2026-06-14T17:55:53Z

      schema.forall(f => isBatchReadSupported(sqlConf, f.dataType))

  def isBatchReadSupported(sqlConf: SQLConf, dt: DataType): Boolean = dt match {
+    case _: TimestampNTZNanosType | _: TimestampLTZNanosType =>


Just leaving a performance note here: because isBatchReadSupported returns false for these types and isBatchReadSupportedForSchema uses forall, a single nanos column disables vectorized reads for the entire file/row group. That's fine and the vectorized follow-up is acknowledged.
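
The effect described in this note can be seen in a small self-contained model. The column types below are hypothetical placeholders rather than Spark's DataType hierarchy, and the SQLConf parameter is omitted; only the shape of the logic (a per-type check combined with `forall`) follows the quoted diff.

```scala
// Hypothetical model: one unsupported column disables batch reads for the whole schema.
object BatchReadSketch {
  sealed trait ColType
  case object IntCol extends ColType
  case object StringCol extends ColType
  case object NanosTimestampCol extends ColType

  // Per-type check: the nanos timestamp type opts out of the vectorized reader.
  def isBatchReadSupported(dt: ColType): Boolean = dt match {
    case NanosTimestampCol => false
    case _                 => true
  }

  // Schema-level check: forall means every column must be supported.
  def isBatchReadSupportedForSchema(schema: Seq[ColType]): Boolean =
    schema.forall(isBatchReadSupported)
}
```

In this model, `isBatchReadSupportedForSchema(Seq(IntCol, StringCol))` is true, while adding a single `NanosTimestampCol` makes it false, so the whole scan would take the row-based path.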

uros-b · 2026-06-16T09:30:27Z

+      case _: TimestampLTZNanosType =>
+        Types.primitive(INT64, repetition)
+          .as(LogicalTypeAnnotation.timestampType(true, TimeUnit.NANOS)).named(field.name)
+
+      case _: TimestampNTZNanosType =>
+        Types.primitive(INT64, repetition)
+          .as(LogicalTypeAnnotation.timestampType(false, TimeUnit.NANOS)).named(field.name)


It seems that the TimestampLTZNanosType and TimestampNTZNanosType cases have byte-for-byte identical bodies and guards. These two could collapse into one case, e.g.

case (_: TimestampLTZNanosType | _: TimestampNTZNanosType) if isNanosTimestamp(parquetType) => ...

with the verbose annotation check extracted into a small predicate, mirroring the existing canReadAsTimestampNTZ helper.

They differ in isAdjustedToUTC: LTZ writes timestampType(true, NANOS) and NTZ writes timestampType(false, NANOS).

uros-b · 2026-06-16T09:34:07Z

        (row: SpecializedGetters, ordinal: Int) => recordConsumer.addLong(row.getLong(ordinal))

+      // TIMESTAMP(NANOS) values are always proleptic Gregorian and are exempt from datetime
+      // rebasing; see the TIMESTAMP(NANOS) converters in `ParquetRowConverter` for details.


A bit of a nit: both converter cases are guarded on the Parquet annotation being TIMESTAMP(NANOS). If a user supplies an explicit read schema with a nanos type over a column whose Parquet annotation is not NANOS, both guards fail and the match falls through to the generic handling.

Let's just confirm that this produces a clear error rather than a confusing one. Schema clipping should normally prevent the situation, but a quick check (or an explicit unguarded case _: TimestampLTZNanosType => that throws a descriptive error) would make the contract explicit.

Please see how other similar types work, and let's consider whether we need to take care of this or not.

Confirmed it produces a clear error, so I don't think a dedicated case is needed. When the annotation isn't NANOS the nanos guards fail and makeConverter falls through to its default case t => throw cannotCreateParquetConverterForDataTypeError(t, parquetType), which raises PARQUET_CONVERSION_FAILURE.UNSUPPORTED naming both the requested Spark type and the actual Parquet type. This is the same fall-through the existing types use - TimestampType/TimestampNTZTyp
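
The fall-through behaviour discussed in this thread can be illustrated with a simplified pattern match. All names here are hypothetical (`SparkType`, `ParquetTs`, the string results, and the use of IllegalArgumentException); the sketch assumes the guards check only the time unit and stands in for the real makeConverter and its error helper.

```scala
// Hypothetical model of guarded cases falling through to a descriptive default error.
object ConverterFallThroughSketch {
  sealed trait SparkType
  case object LtzNanos extends SparkType
  case object NtzNanos extends SparkType

  // Minimal stand-in for a Parquet timestamp annotation.
  final case class ParquetTs(unit: String, isAdjustedToUTC: Boolean)

  def makeConverter(requested: SparkType, parquet: ParquetTs): String = requested match {
    // Guarded cases: only taken when the file column really is TIMESTAMP(NANOS).
    case LtzNanos if parquet.unit == "NANOS" => "ltz-nanos converter"
    case NtzNanos if parquet.unit == "NANOS" => "ntz-nanos converter"
    // Default case: names both the requested type and the actual Parquet type.
    case t =>
      throw new IllegalArgumentException(
        s"Cannot create a Parquet converter for $t over $parquet")
  }
}
```

Requesting `LtzNanos` over a column annotated with MICROS fails both guards and reaches the default case, whose message mentions both sides of the mismatch.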

uros-b · 2026-06-16T09:35:24Z

  }

+  def parquetTimestampNanosOverflowError(
+      value: TimestampNanosVal, isNtz: Boolean): ArithmeticException = {


Why don't we declare SparkArithmeticException here? IIUC, that is what we throw below.

Done - changed the return type to SparkArithmeticException.
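
A small sketch of the pattern under discussion: translate the raw overflow into a dedicated exception and declare the precise subtype as the helper's return type. `NanosOverflowException`, `nanosOverflowError`, and the message text are hypothetical and only stand in for SparkArithmeticException and the error helper in the PR.

```scala
// Hypothetical sketch: wrap the raw "long overflow" in a descriptive, specific exception.
object OverflowErrorSketch {
  // Stand-in for a Spark-specific subtype of ArithmeticException.
  final class NanosOverflowException(message: String) extends ArithmeticException(message)

  // Returning the precise subtype documents exactly what callers will throw.
  def nanosOverflowError(epochMicros: Long, nanosWithinMicro: Int): NanosOverflowException =
    new NanosOverflowException(
      s"Timestamp ($epochMicros us + $nanosWithinMicro ns) does not fit in INT64 epoch-nanoseconds")

  // Exact arithmetic detects the overflow; the catch block replaces the generic error.
  def toEpochNanos(epochMicros: Long, nanosWithinMicro: Int): Long =
    try Math.addExact(Math.multiplyExact(epochMicros, 1000L), nanosWithinMicro.toLong)
    catch {
      case _: ArithmeticException => throw nanosOverflowError(epochMicros, nanosWithinMicro)
    }
}
```

Because the helper's declared type is the subtype, a caller that only needs ArithmeticException still compiles, while tests can assert on the more specific class.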

uros-b

Thank you @stevomitric and @MaxGekk! I left a few more comments, but otherwise LGTM.

stevomitric · 2026-06-16T14:23:05Z

Thanks @uros-b for the review. cc @MaxGekk, could you please do another round? Thanks.

Resolve conflicts in the Parquet data source type-support checks.

While this branch was open, master added a temporary guard `case _: AnyTimestampNanoType => false` ("not yet supported by this datasource") to ParquetFileFormat.supportDataType and ParquetTable.supportsDataType. This PR adds exactly that support, so the guard is dropped on both, letting nanosecond-capable timestamp types fall through to the AtomicType cases. This matches the PR's FileBasedDataSourceSuite change, which moves Parquet out of the "unsupported nanos" list.

Also fix pre-existing CI failures on the branch (independent of the merge):
- ParquetTimestampNanosSuite referenced SQLConf.TYPES_FRAMEWORK_ENABLED, which does not exist on master; gate solely on TIMESTAMP_NANOS_TYPES_ENABLED (consistent with the rest of the suite).
- Reformat SparkDateTimeUtils.scala per scalafmt; the signature exceeded the 98-column limit after `private` became `private[sql]`.

Verified locally: sql/core compiles; ParquetTimestampNanosSuite and ParquetSchemaSuite (147 tests) and FileBasedDataSourceSuite SPARK-57166 pass; scalafmt and scalastyle are clean.

Co-authored-by: Isaac

initial commit

f890a56

stevomitric changed the title ~~[SPARK-57102][SQL] Support nanosecond-precision timestamps in the Parquet data source~~ [WIP][SPARK-57102][SQL] Support nanosecond-precision timestamps in the Parquet data source Jun 9, 2026

stevomitric changed the title ~~[WIP][SPARK-57102][SQL] Support nanosecond-precision timestamps in the Parquet data source~~ [SPARK-57102][SQL] Support nanosecond-precision timestamps in the Parquet data source Jun 9, 2026

MaxGekk reviewed Jun 9, 2026

Address review comments: exempt TIMESTAMP(NANOS) from datetime rebase…

7c89933

… on read, strengthen tests Co-authored-by: Isaac

stevomitric requested a review from MaxGekk June 10, 2026 11:22

MaxGekk reviewed Jun 11, 2026

Comment thread .../src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala Outdated

Comment thread sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala Outdated

resolve comments

bc5e7cf

stevomitric requested a review from MaxGekk June 12, 2026 14:08

uros-b reviewed Jun 14, 2026

Comment thread .../src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala

uros-b reviewed Jun 14, 2026

Comment thread .../src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala

uros-b reviewed Jun 14, 2026

Comment thread sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala Outdated

uros-b reviewed Jun 14, 2026

resolve comments

c67611e

stevomitric requested a review from uros-b June 16, 2026 09:06

uros-b reviewed Jun 16, 2026

uros-b approved these changes Jun 16, 2026

stevomitric added 2 commits June 16, 2026 14:24

resolve comments

2e48786

3 participants
