
[SPARK-57102][SQL] Support nanosecond-precision timestamps in the Parquet data source #56407

Open
stevomitric wants to merge 7 commits into apache:master from stevomitric:stevomitric/SPARK-57102-parquet-nanos

Conversation

@stevomitric

@stevomitric stevomitric commented Jun 9, 2026

Contributor

What changes were proposed in this pull request?

This PR adds read and write support for the nanosecond-capable timestamp types TimestampNTZNanosType(p) / TimestampLTZNanosType(p) (precision p in [7, 9], from the SPIP [SPARK-56822]) in the built-in Parquet data source, gated behind the existing preview flag spark.sql.timestampNanosTypes.enabled.

  • Schema conversion (ParquetSchemaConverter, both directions):
    • Write: TimestampLTZNanosType / TimestampNTZNanosType -> INT64 annotated TIMESTAMP(NANOS, isAdjustedToUTC) (isAdjustedToUTC = true for LTZ, false for NTZ).
    • Read: INT64 + TIMESTAMP(NANOS, ...) -> TimestampLTZNanosType(9) / TimestampNTZNanosType(9). Parquet's NANOS unit carries no precision parameter, so reads mint the canonical precision 9. The legacy spark.sql.legacy.parquet.nanosAsLong path keeps precedence and is unchanged.
  • Read values (non-vectorized / row-based reader, ParquetRowConverter): an INT64 epoch-nanoseconds value is split into epochMicros = floorDiv(v, 1000) and nanosWithinMicro = floorMod(v, 1000) and stored as TimestampNanosVal. TIMESTAMP(NANOS) values are exempt from datetime rebasing on both read and write: the NANOS unit postdates the legacy hybrid-calendar writers, so such files are always proleptic Gregorian (the spark.sql.parquet.datetimeRebaseModeIn{Read,Write} configs only cover DATE / TIMESTAMP_MILLIS / TIMESTAMP_MICROS).
  • Write values (ParquetWriteSupport): a TimestampNanosVal is written as INT64 epoch-nanoseconds using exact arithmetic (Math.addExact(Math.multiplyExact(epochMicros, 1000), nanosWithinMicro)); values outside the representable INT64 epoch-nanosecond range (~1677-09-21 .. 2262-04-11) fail instead of silently wrapping.
  • The Parquet supportDataType guards (V1 ParquetFileFormat and V2 ParquetTable) are relaxed to accept the nanos types, and the feature flag is propagated to the read Hadoop configuration in both the V1 and V2 paths.
  • The nanos types are excluded from ParquetUtils.isBatchReadSupported, so columnar reads transparently fall back to the row-based reader. Vectorized-reader support is a follow-up.
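The read and write value conversions described in the list above can be sketched in isolation. This is a minimal illustration rather than the PR's code: `NanosVal` is a hypothetical stand-in for `TimestampNanosVal` (its real field types are not shown in this description), and `NanosCodec` is an invented name.

```scala
// Hypothetical stand-in for Spark's TimestampNanosVal; field names follow the PR description.
final case class NanosVal(epochMicros: Long, nanosWithinMicro: Int)

object NanosCodec {
  private val NanosPerMicro = 1000L

  // Read direction: split INT64 epoch-nanoseconds into whole microseconds
  // plus a remainder that floorMod keeps in the range 0..999.
  def fromEpochNanos(v: Long): NanosVal =
    NanosVal(Math.floorDiv(v, NanosPerMicro), Math.floorMod(v, NanosPerMicro).toInt)

  // Write direction: recombine with exact arithmetic so that an out-of-range
  // value throws ArithmeticException instead of silently wrapping.
  def toEpochNanos(t: NanosVal): Long =
    Math.addExact(Math.multiplyExact(t.epochMicros, NanosPerMicro), t.nanosWithinMicro.toLong)
}
```

In this sketch, `NanosCodec.fromEpochNanos(1234567L)` yields `NanosVal(1234, 567)`, and `toEpochNanos` inverts it. One property of this particular formula: for the 808 lowest INT64 values the split succeeds but the recombination overflows in the multiply step, because the floored microsecond count times 1000 falls below `Long.MinValue`. How the PR treats that edge is not stated in the description.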

Spark-written files round-trip the exact type (including precision) via the Spark schema stored in the Parquet key-value metadata; "foreign" files with no Spark metadata (e.g. produced by Trino/DuckDB/pandas) derive the nanos type from the Parquet annotation.

Why are the changes needed?

Nanosecond-precision timestamps are common in data produced by pandas/PyArrow, Trino, ClickHouse, DuckDB, and similar systems. Spark currently rejects Parquet INT64 TIMESTAMP(NANOS) (PARQUET_TYPE_ILLEGAL), or, with spark.sql.legacy.parquet.nanosAsLong=true, reads it as a raw LongType that drops all timestamp and time-zone semantics. This PR lets Spark read and write such data as first-class nanosecond timestamp types, as part of the SPIP [SPARK-56822] "Timestamps with nanosecond precision".

Does this PR introduce any user-facing change?

Yes, behind the preview flag spark.sql.timestampNanosTypes.enabled (default off in production). When the flag is enabled:

  • Parquet files with INT64 TIMESTAMP(NANOS, isAdjustedToUTC=true/false) are read as TimestampLTZNanosType(9) / TimestampNTZNanosType(9) instead of being rejected.
  • Columns of these types can be written to Parquet (as INT64 TIMESTAMP(NANOS)).

When the flag is off, behavior is unchanged, including the legacy spark.sql.legacy.parquet.nanosAsLong escape hatch.
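The interaction between the two settings can be modelled as a small decision function. This is a toy sketch under the assumptions stated in this description; the type names and the `resolve` function are hypothetical and do not mirror `ParquetSchemaConverter`.

```scala
object NanosReadResolution {
  // Toy stand-ins for the Spark types a TIMESTAMP(NANOS) column may resolve to.
  sealed trait ResolvedType
  case object LongType extends ResolvedType
  final case class TimestampLTZNanos(precision: Int) extends ResolvedType
  final case class TimestampNTZNanos(precision: Int) extends ResolvedType

  // Resolve an INT64 TIMESTAMP(NANOS, isAdjustedToUTC) column.
  // Left carries the error condition name, Right the resolved type.
  def resolve(
      isAdjustedToUTC: Boolean,
      nanosAsLong: Boolean,
      nanosTypesEnabled: Boolean): Either[String, ResolvedType] = {
    if (nanosAsLong) Right(LongType) // the legacy escape hatch keeps precedence
    else if (!nanosTypesEnabled) Left("PARQUET_TYPE_ILLEGAL") // unchanged reject path
    else if (isAdjustedToUTC) Right(TimestampLTZNanos(9)) // NANOS has no precision, so 9
    else Right(TimestampNTZNanos(9))
  }
}
```

For example, `resolve(isAdjustedToUTC = true, nanosAsLong = true, nanosTypesEnabled = true)` returns `Right(LongType)`, which reflects the stated precedence of the legacy flag over the new types.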

How was this patch tested?

New ParquetTimestampNanosSuite covering:

  • Spark write/read round-trip preserving value and precision at p = 7, 8, 9 (vectorized reader on and off).
  • Reading "foreign" TIMESTAMP(NANOS) files written directly via parquet-mr for both NTZ and LTZ, including a pre-epoch (negative) instant that exercises floor semantics, and nulls.
  • nanosAsLong precedence.
  • The disabled-feature error pinned via checkError (PARQUET_TYPE_ILLEGAL, both isAdjustedToUTC values).
  • An out-of-INT64-range write failing for both NTZ and LTZ.
  • Rebase-mode invariance (a foreign pre-1883 TIMESTAMP(NANOS) file reads identically under EXCEPTION / CORRECTED / LEGACY).
  • A nested (array) column round-trip.
  • A V2 file-source round-trip.

Schema-level conversion cases added to ParquetSchemaSuite (round-trip for both nanos types at p = 9, write direction at p = 7, and nanosAsLong taking precedence over the nanos types; the testParquetToCatalyst/testSchema helpers gained a timestampNanosTypesEnabled flag).

Existing tests updated: SPARK-40819 (pin the feature off to keep asserting the legacy reject path) and SPARK-57166 (drop Parquet, which is now supported).

scalastyle passes.
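Why a pre-epoch instant is a useful test value can be seen by comparing truncating and floor arithmetic on a negative input. A small sketch follows; the object and value names are illustrative only.

```scala
object FloorVsTruncate {
  // One nanosecond before the epoch: 1969-12-31T23:59:59.999999999Z.
  val preEpochNanos: Long = -1L

  // The / and % operators round toward zero and yield a negative remainder.
  val truncated: (Long, Long) = (preEpochNanos / 1000L, preEpochNanos % 1000L)

  // floorDiv and floorMod round toward negative infinity, so the remainder
  // stays within 0..999 and the microsecond part lands before the epoch.
  val floored: (Long, Long) =
    (Math.floorDiv(preEpochNanos, 1000L), Math.floorMod(preEpochNanos, 1000L))
}
```

Here `truncated` is `(0, -1)` while `floored` is `(-1, 999)`. Only the floored pair satisfies the "non-negative nanoseconds within the microsecond" shape that the split in this PR relies on.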

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Anthropic Claude Opus 4.8)

@stevomitric stevomitric changed the title [SPARK-57102][SQL] Support nanosecond-precision timestamps in the Parquet data source [WIP][SPARK-57102][SQL] Support nanosecond-precision timestamps in the Parquet data source Jun 9, 2026
…and fix affected tests

Extends Parquet nanosecond-timestamp support to the V2 file source (relax the ParquetTable type guard and propagate spark.sql.timestampNanosTypes.enabled in ParquetScan), and updates two existing tests the feature changes: SPARK-40819 pins the feature off to keep asserting the legacy TIMESTAMP(NANOS) reject path (the flag defaults on under tests); SPARK-57166 drops Parquet from the unsupported-datasource list. Adds a V2 round-trip test to ParquetTimestampNanosSuite.

Co-authored-by: Isaac
@stevomitric stevomitric changed the title [WIP][SPARK-57102][SQL] Support nanosecond-precision timestamps in the Parquet data source [SPARK-57102][SQL] Support nanosecond-precision timestamps in the Parquet data source Jun 9, 2026
@stevomitric

Contributor Author

cc @MaxGekk PTAL.

@MaxGekk MaxGekk left a comment

Member

1 blocking, 3 non-blocking, 1 nit.

Design / architecture (1): write/read rebase asymmetry for TimestampLTZNanosType — see inline comment on ParquetWriteSupport.scala:278.
Correctness (2): weak disabled-feature assertion; overflow test covers NTZ only — see inline comments.
Suggestions (1): add schema-unit tests to ParquetSchemaSuite — see inline comment.
Nits (1): displaced inShredded comment — see inline comment.

… on read, strengthen tests

Co-authored-by: Isaac
@stevomitric stevomitric requested a review from MaxGekk June 10, 2026 11:22

@MaxGekk MaxGekk left a comment

Member

5 addressed, 0 remaining, 2 new. (2 late catches — my own round-1 misses, not regressions.)
All round-1 findings resolved cleanly. The rebase exemption is now symmetric on read and write, documented inline, and pinned by the 3-mode invariance test — I independently verified the 1800-01-01 test value and the rebase-config coverage claim against SQLConf.

Suggestions (1)

  • ParquetWriteSupport.scala:192: out-of-range write fails as raw ArithmeticException: long overflow — the existing DATETIME_OVERFLOW error condition would fit; see inline

Nits: 1 minor item (see inline comment).
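The suggestion about the raw ArithmeticException can be illustrated with a sketch that wraps the overflow in a descriptive error and confirms the calendar bounds of the INT64 epoch-nanosecond range. `NanosOverflowException` and `toEpochNanosOrFail` are hypothetical names; Spark's error framework and the DATETIME_OVERFLOW condition are not modelled here.

```scala
import java.time.{Instant, LocalDate, ZoneOffset}

// Hypothetical error type for illustration only.
final class NanosOverflowException(message: String, cause: Throwable)
  extends ArithmeticException(message) {
  initCause(cause)
}

object NanosOverflow {
  // Recombine with exact arithmetic, translating a raw overflow into an
  // error that names the offending value.
  def toEpochNanosOrFail(epochMicros: Long, nanosWithinMicro: Int): Long =
    try {
      Math.addExact(Math.multiplyExact(epochMicros, 1000L), nanosWithinMicro.toLong)
    } catch {
      case e: ArithmeticException =>
        throw new NanosOverflowException(
          s"Timestamp (epochMicros=$epochMicros, nanosWithinMicro=$nanosWithinMicro) " +
            "is outside the INT64 epoch-nanosecond range", e)
    }

  // UTC calendar date of an epoch-nanosecond value.
  private def utcDate(epochNanos: Long): LocalDate =
    Instant
      .ofEpochSecond(
        Math.floorDiv(epochNanos, 1000000000L),
        Math.floorMod(epochNanos, 1000000000L))
      .atZone(ZoneOffset.UTC)
      .toLocalDate

  // Bounds of the representable range, matching the dates quoted in the PR description.
  val minDate: LocalDate = utcDate(Long.MinValue)
  val maxDate: LocalDate = utcDate(Long.MaxValue)
}
```

Keeping the original exception as the cause preserves the stack trace of the arithmetic failure while the message carries the value that could not be written.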

Comment thread sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala Outdated
@stevomitric stevomitric requested a review from MaxGekk June 12, 2026 14:08
@stevomitric

stevomitric commented Jun 14, 2026

Contributor Author

cc @uros-b PTAL when you get a chance.

schema.forall(f => isBatchReadSupported(sqlConf, f.dataType))

def isBatchReadSupported(sqlConf: SQLConf, dt: DataType): Boolean = dt match {
  case _: TimestampNTZNanosType | _: TimestampLTZNanosType =>

Member

Just leaving a performance note here: because isBatchReadSupported returns false for these types and isBatchReadSupportedForSchema uses forall, a single nanos column disables vectorized reads for the entire file/row group. That's fine and the vectorized follow-up is acknowledged.
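The effect of `forall` noted here can be shown with a toy model. The column types below are illustrative stand-ins, not Spark's `DataType` hierarchy, and the `SQLConf` parameter is omitted.

```scala
object BatchReadSketch {
  // Toy stand-ins for Spark data types; names are illustrative only.
  sealed trait ColType
  case object IntCol extends ColType
  case object StringCol extends ColType
  case object NanosTimestampCol extends ColType

  // Per-type check: the nanos type opts out of the vectorized path.
  def isBatchReadSupported(dt: ColType): Boolean = dt match {
    case NanosTimestampCol => false
    case _ => true
  }

  // Schema-level check: forall requires every column to qualify, so a
  // single unsupported column disables batch reads for the whole schema.
  def isBatchReadSupportedForSchema(schema: Seq[ColType]): Boolean =
    schema.forall(isBatchReadSupported)
}
```

With this model, `isBatchReadSupportedForSchema(Seq(IntCol, StringCol))` is true, while adding `NanosTimestampCol` to the same schema makes it false.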

@stevomitric stevomitric requested a review from uros-b June 16, 2026 09:06
Comment on lines +750 to +756
case _: TimestampLTZNanosType =>
  Types.primitive(INT64, repetition)
    .as(LogicalTypeAnnotation.timestampType(true, TimeUnit.NANOS)).named(field.name)

case _: TimestampNTZNanosType =>
  Types.primitive(INT64, repetition)
    .as(LogicalTypeAnnotation.timestampType(false, TimeUnit.NANOS)).named(field.name)

Member

It seems that the TimestampLTZNanosType and TimestampNTZNanosType cases have byte-for-byte identical bodies and guards. These two could collapse into one case, e.g.

case (_: TimestampLTZNanosType | _: TimestampNTZNanosType) if isNanosTimestamp(parquetType) =>
...

with the verbose annotation check extracted into a small predicate, mirroring the existing canReadAsTimestampNTZ helper.

Contributor Author

They differ in isAdjustedToUTC: LTZ writes timestampType(true, NANOS) and NTZ writes timestampType(false, NANOS).
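For readers following this thread, the single point of difference can be modelled with toy types. This is only a sketch of the write-direction mapping, with hypothetical names throughout; it is not a proposal for how the PR's cases should be structured.

```scala
object NanosWriteAnnotation {
  // Toy stand-ins for the two Spark nanos timestamp types.
  sealed trait NanosType { def precision: Int }
  final case class LtzNanos(precision: Int) extends NanosType
  final case class NtzNanos(precision: Int) extends NanosType

  // Toy stand-in for Parquet's TIMESTAMP(NANOS, isAdjustedToUTC) annotation.
  final case class TimestampNanosAnnotation(isAdjustedToUTC: Boolean)

  // The only value that differs between the two write branches.
  def isAdjustedToUTC(t: NanosType): Boolean = t match {
    case _: LtzNanos => true
    case _: NtzNanos => false
  }

  // The annotation carries no precision, so p = 7, 8, 9 all map to the same result.
  def annotationFor(t: NanosType): TimestampNanosAnnotation =
    TimestampNanosAnnotation(isAdjustedToUTC(t))
}
```

Because the annotation drops the precision, `annotationFor(LtzNanos(7))` and `annotationFor(LtzNanos(9))` are equal in this model, which is consistent with the description's note that exact precision round-trips through the Spark schema metadata rather than the Parquet annotation.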

(row: SpecializedGetters, ordinal: Int) => recordConsumer.addLong(row.getLong(ordinal))

// TIMESTAMP(NANOS) values are always proleptic Gregorian and are exempt from datetime
// rebasing; see the TIMESTAMP(NANOS) converters in `ParquetRowConverter` for details.

Member

A bit of a nit: both converter cases are guarded on the Parquet annotation being TIMESTAMP(NANOS). If a user supplies an explicit read schema with a nanos type over a column whose Parquet annotation is not NANOS, both guards fail and the match falls through to the generic handling.

Let's just confirm that this produces a clear error rather than a confusing one. Schema clipping should normally prevent the situation, but a quick check (or an explicit unguarded case _: TimestampLTZNanosType => that throws a descriptive error) would make the contract explicit.

Please see how other similar types work, and let's consider whether we need to take care of this or not.

Contributor Author

Confirmed it produces a clear error, so I don't think a dedicated case is needed. When the annotation isn't NANOS the nanos guards fail and makeConverter falls through to its default case t => throw cannotCreateParquetConverterForDataTypeError(t, parquetType), which raises PARQUET_CONVERSION_FAILURE.UNSUPPORTED naming both the requested Spark type and the actual Parquet type. This is the same fall-through the existing types use - TimestampType/TimestampNTZTyp

}

def parquetTimestampNanosOverflowError(
    value: TimestampNanosVal, isNtz: Boolean): ArithmeticException = {

Member

Why don't we declare SparkArithmeticException here? IIUC, that is what we throw below.

Contributor Author

Done - changed the return type to SparkArithmeticException.

@uros-b uros-b left a comment

Member

Thank you @stevomitric and @MaxGekk! I left a few more comments, but otherwise LGTM.

@stevomitric

stevomitric commented Jun 16, 2026

Contributor Author

Thanks @uros-b for the review. cc @MaxGekk, can you please do another round? Thanks.

Resolve conflicts in the Parquet data source type-support checks.
While this branch was open, master added a temporary guard
`case _: AnyTimestampNanoType => false` ("not yet supported by this
datasource") to ParquetFileFormat.supportDataType and
ParquetTable.supportsDataType. This PR adds exactly that support, so the
guard is dropped on both, letting nanosecond-capable timestamp types fall
through to the AtomicType cases. This matches the PR's FileBasedDataSourceSuite
change, which moves Parquet out of the "unsupported nanos" list.

Also fix pre-existing CI failures on the branch (independent of the merge):
- ParquetTimestampNanosSuite referenced SQLConf.TYPES_FRAMEWORK_ENABLED,
  which does not exist on master; gate solely on TIMESTAMP_NANOS_TYPES_ENABLED
  (consistent with the rest of the suite).
- Reformat SparkDateTimeUtils.scala per scalafmt; the signature exceeded the
  98-column limit after `private` became `private[sql]`.

Verified locally: sql/core compiles; ParquetTimestampNanosSuite and
ParquetSchemaSuite (147 tests) and FileBasedDataSourceSuite SPARK-57166 pass;
scalafmt and scalastyle are clean.

Co-authored-by: Isaac

3 participants