
Support Delta Lake 4.1 on Spark 4.1 [databricks] #14646

Open

firestarman wants to merge 15 commits into NVIDIA:main from firestarman:b-delta-lake

Conversation

@firestarman
Collaborator

@firestarman firestarman commented Apr 22, 2026

Fixes #14461.

Description

  • Add a dedicated delta-41x module and wire release411 to the real Delta 4.1 runtime so Spark 4.1 uses the correct provider, catalog, MDC, and write-path APIs instead of the stub implementation.
  • Move shared Delta logic into new delta-33x-41x and delta-40x-41x common layers so Spark 4.0 and 4.1 keep their version-specific hooks while reducing duplicate code across the Delta command stack.
  • Isolate the remaining Delta 4.1 API differences in thin 41x shims so commit metadata, create-table dependency checks, MDC logging, and auto-compaction behavior stay compatible with Delta 4.1 without regressing Delta 4.0 and 3.3 lines.
  • Add Spark 4.1 Delta coverage in DeltaLakeQuerySuite so supported Spark/Delta combinations are documented and the new paths are exercised.
  • Validate the change with full Scala 2.13 package builds for buildver=356, buildver=400, and buildver=411 so the shared refactor compiles across Delta-enabled Spark lines.
  • Validate Delta integration coverage with serial ./run_pyspark_from_build.sh -m delta_lake --delta_lake runs on spark356, spark400, and spark411, yielding 701 selected / 558 passed / 143 skipped / 0 failed on both spark356 and spark400, and 701 selected / 556 passed / 145 skipped / 0 failed on spark411.

What changes in Delta 41x

  • DeltaOperations.Write extends commit metadata with dynamic partition overwrite and schema flags, so the write path needs a dedicated 4.1 wrapper (a minimal sketch of this version dispatch follows this list).
  • UniversalFormat dependency checks now require table descriptors during create-table flows, so the Spark 4.1 create-table path needs separate wiring.
  • Spark 4.1 switches Delta MDC logging to the newer MDC(logKey, value) API, so logging shims must diverge from Spark 4.0.
  • Auto compaction now runs from committed-transaction hooks instead of live transaction hooks, so the 4.1 hook integration cannot reuse the 4.0 implementation unchanged.
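A minimal Scala sketch of the version-dispatch pattern these points describe: the shared write layer asks the concrete shim for its commit operation, and only the 4.1 shim carries the extra flags. WriteOp and its field names are hypothetical stand-ins, not the real DeltaOperations.Write signature; only the buildWriteOperation hook name matches this PR.

```scala
// Illustrative only: WriteOp and its fields are stand-ins for the Delta commit
// operation types; buildWriteOperation mirrors the hook name used by the
// shared write base in this PR.
final case class WriteOp(
    mode: String,
    dynamicPartitionOverwrite: Option[Boolean] = None, // assumed 4.1-only flag
    overwriteSchema: Option[Boolean] = None)           // assumed 4.1-only flag

trait GpuWriteShimSketch {
  // Each Delta line supplies its own commit operation, so shared code never
  // references fields that exist in only one Delta version.
  protected def buildWriteOperation: WriteOp
}

object Delta40xWriteSketch extends GpuWriteShimSketch {
  override protected def buildWriteOperation: WriteOp = WriteOp(mode = "Append")
}

object Delta41xWriteSketch extends GpuWriteShimSketch {
  override protected def buildWriteOperation: WriteOp =
    WriteOp(mode = "Overwrite",
      dynamicPartitionOverwrite = Some(true),
      overwriteSchema = Some(false))
}
```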

Delta NDS perf validation (runs=3)

Ran the local Delta NDS benchmark 3 times on the same machine, same dataset, same RAPIDS build, and same config for both Spark 4.0.0 and Spark 4.1.1.

Environment

  • Dataset: /bigdata/tpcds_data/delta_sf30_float
  • Mode: local[12]
  • RAPIDS: 26.06.0-SNAPSHOT (cuda13)
  • CPU: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
  • GPU: NVIDIA RTX 5880 Ada Generation

Delta NDS Total Time (runs=3)

Unit: seconds (s)

| Run | Spark 4.0.0 Total Time (s) | Spark 4.1.1 Total Time (s) |
| --- | --- | --- |
| Run 1 | 876.695 | 827.712 |
| Run 2 | 868.151 | 825.713 |
| Run 3 | 788.435 | 839.011 |
| Average | 844.427 | 830.812 |

Based on these 3 runs, no obvious performance regression is observed for Spark 4.1.1 versus Spark 4.0.0, and the observed difference is within normal run-to-run noise.
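As a quick sanity check on the numbers above, a few lines of Scala reproduce the quoted averages and show that the mean gap (about 1.6%) is smaller than the run-to-run spread within Spark 4.0.0 alone. Values are copied directly from the table; nothing else is assumed.

```scala
// Values copied from the table above.
val spark400 = Seq(876.695, 868.151, 788.435)
val spark411 = Seq(827.712, 825.713, 839.011)

val avg400 = spark400.sum / spark400.size       // ≈ 844.427 s
val avg411 = spark411.sum / spark411.size       // ≈ 830.812 s
val gapPct = (avg400 - avg411) / avg400 * 100   // ≈ 1.6% in favor of 4.1.1
val spread400 = spark400.max - spark400.min     // ≈ 88.3 s of run-to-run spread on 4.0.0
```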

Checklists

Documentation

  • Updated for new or modified user-facing features or behaviors
  • No user-facing change

Testing

  • Added or modified tests to cover new code paths
  • Covered by existing tests
    (Please provide the names of the existing tests in the PR description.)
  • Not required

Performance

  • Tests ran and results are added in the PR description
  • Issue filed with a link in the PR description
  • Not required

Made with Cursor

- add a dedicated delta-41x shim and wire release411 to the real Delta runtime
- split shared Delta code into 33x-41x and 40x-41x layers to reduce duplication
- isolate Spark 4.1 MDC, catalog, and write compatibility differences in the 41x shim
- add spark411 Delta coverage and validate shim411 plus shim356 Delta regression runs

What changes in Delta 41x:
- Delta 4.1 extends DeltaOperations.Write commit metadata with dynamic overwrite and
  schema flags
- UniversalFormat dependency checks now require table descriptors in create-table
  flows
- Spark 4.1 switches Delta MDC logging to the newer MDC(logKey, value) API
- auto compaction now runs from committed-transaction hooks instead of live
  transaction hooks

Made-with: Cursor
Signed-off-by: Firestarman <[email protected]>
@firestarman firestarman requested a review from a team as a code owner April 22, 2026 09:07
Ignore local Cursor metadata so editor state does not show up as untracked changes on this branch.

Signed-off-by: Firestarman <[email protected]>
Made-with: Cursor
@firestarman firestarman changed the title from "Support Delta Lake 4.1 on Spark 4.1" to "Support Delta Lake 4.1 on Spark 4.1 [databricks]" Apr 22, 2026
@firestarman firestarman requested a review from nartal1 April 22, 2026 09:16
@greptile-apps
Contributor

greptile-apps Bot commented Apr 22, 2026

Greptile Summary

This PR adds a dedicated delta-41x module that wires Spark 4.1 to the real Delta 4.1 runtime (provider, catalog, MDC, write-path, auto-compact) and introduces delta-33x-41x / delta-40x-41x common layers to de-duplicate the shared Delta command stack across versions. The previously raised P1 concerns—metrics.head throwing on an empty optimize result and Try silently swallowing invalid partition-overwrite mode errors—have both been resolved in this revision.

Confidence Score: 5/5

Safe to merge — all previously reported P1 issues are resolved and no new blocking issues were found.

Both prior P1 findings (metrics.head on empty optimize result and Try swallowing invalid partition-overwrite mode) are fixed. The shared layer refactor is correctly scoped with version-specific shims for MDC, commit metadata, UniversalFormat dependency checks, and auto-compaction. Build validation across 356/400/411 and integration test runs (0 failures) give high confidence. Remaining observation is a P2 style suggestion on the test assertion.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| delta-lake/delta-41x/src/main/scala/com/nvidia/spark/rapids/delta/delta41x/Delta41xProvider.scala | New Delta 4.1 provider wiring DML commands, file formats, and staged table support; mirrors Delta40xProvider structure correctly. |
| delta-lake/delta-41x/src/main/scala/org/apache/spark/sql/delta/rapids/delta41x/Delta41xRuntimeShim.scala | New runtime shim for Delta 4.1, wires GpuDeltaCatalog4x with the 41x CreateTableCommand factory and GpuOptimisticTransaction. |
| delta-lake/delta-41x/src/main/scala/org/apache/spark/sql/delta/hooks/GpuAutoCompact.scala | Delta 4.1 auto-compact hook uses CommittedTransaction API; previously reported shouldSkipAutoCompact overload concern confirmed valid by upstream jar check. |
| delta-lake/delta-41x/src/main/scala/org/apache/spark/sql/delta/rapids/GpuWriteIntoDelta.scala | Thin 41x override of GpuWriteIntoDeltaBase; adds Delta 4.1 extended DeltaOperations.Write with DPO and schema flags; isDynamicPartitionOverwriteMode now called directly (Try wrapper fix applied). |
| delta-lake/delta-41x/src/main/scala/org/apache/spark/sql/delta/rapids/delta41x/GpuCreateDeltaTableCommand.scala | 41x create-table command overrides both 3-arg and 4-arg enforceDependenciesInConfiguration to supply CatalogTable descriptor to Delta 4.1 UniversalFormat API. |
| delta-lake/common/src/main/delta-33x-41x/scala/org/apache/spark/sql/delta/hooks/GpuAutoCompact.scala | Shared GpuAutoCompactBase with headOption.foreach guard on metrics (previous P1 fix applied); GpuTransactionalAutoCompactBase for 3.3/4.0 live-txn hook signature. |
| delta-lake/common/src/main/delta-40x-41x/scala/org/apache/spark/sql/delta/rapids/GpuWriteIntoDeltaBase.scala | New shared write base extracted from delta-40x; contains full write path logic with correct DPO / CDF / replaceWhere / identity column handling. |
| delta-lake/common/src/main/delta-33x-41x/scala/org/apache/spark/sql/delta/rapids/GpuCreateDeltaTableCommandBase.scala | Large shared create-table command base; handles CTAS/replace logic across all three Delta versions with delegated version-specific overrides. |
| tests/src/test/spark411/scala/com/nvidia/spark/rapids/DeltaLakeQuerySuite.scala | Basic provider-resolution smoke test for Spark 4.1 Delta; verifies non-stub provider loads but doesn't exercise write/read or DPO GPU paths. |
| scala2.13/delta-lake/delta-41x/pom.xml | Scala 2.13 POM for delta-41x; correctly includes delta-33x-41x and delta-40x-41x common source paths and delta41x Maven properties. |
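A minimal sketch of the headOption guard the table refers to, assuming a hypothetical emitAutoCompactEvent helper; the real method and metric types in the PR differ.

```scala
object AutoCompactGuardSketch {
  // Hypothetical event helper, named only for illustration.
  private def emitAutoCompactEvent(metric: Long): Unit =
    println(s"auto-compact metric: $metric")

  // metrics.head would throw NoSuchElementException when optimize returns no
  // metrics; headOption.foreach simply skips event emission in that case.
  def reportIfAny(metrics: Seq[Long]): Unit =
    metrics.headOption.foreach(emitAutoCompactEvent)
}
```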

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Spark 4.1 write request] --> B[Delta41xProvider]
    B --> C[AppendDataExecV1 / GpuAppendDataExecV1]
    C --> D[GpuDeltaCatalog4x]
    D --> E[GpuWriteIntoDelta 41x\nextends GpuWriteIntoDeltaBase]
    E --> F[GpuOptimisticTransaction]
    F --> G[txn.commitIfNeeded\nDeltaOperations.Write\n+DPO +overwriteSchema +mergeSchema]
    A --> H[CREATE TABLE / CTAS]
    H --> D
    D --> I[GpuCreateDeltaTableCommand 41x]
    I --> J{UniversalFormat\n.enforceDependenciesInConfiguration\nwith CatalogTable}
    G --> K[Post-commit CommittedTransaction hook]
    K --> L[GpuAutoCompact 41x\nextends GpuAutoCompactBase]
    L --> M[GpuOptimizeExecutor.optimize\nheadOption.foreach metrics]

Reviews (13): Last reviewed commit: "Revert temporary Databricks smoke test d..."

Keep the Spark 4.1 build and tests on the real Delta 4.1 module so the generated Scala 2.13 poms stay in sync instead of reverting to the stub path.

Signed-off-by: Firestarman <[email protected]>
Made-with: Cursor
Keep the shared Delta delete shim aligned with the repository's import ordering checks so the Delta 4.1 build no longer fails on this style error.

Signed-off-by: Firestarman <[email protected]>
Made-with: Cursor
@nartal1
Collaborator

nartal1 commented Apr 22, 2026

build

@nartal1 nartal1 requested a review from jihoonson April 22, 2026 18:04
Collaborator

@nartal1 nartal1 left a comment

Thanks @firestarman! Overall LGTM with a few nits and questions.

* 1. Casting the SparkSession parameter inside run() method body
* 2. As parameter type for shim methods (toOperationSparkSession, recacheByPlan)
*
*
Collaborator

Nit: fix indentation

Collaborator Author

Nice catch, updated

DeltaTableUtils,
IdentityColumn,
OptimisticTransaction
}
Collaborator

Nit: Can we move this to single/ 2 lines?

Collaborator Author

Done.

VersionUtils.cmpSparkVersion(4, 0, 0) < 0) {
"org.apache.spark.sql.delta.rapids.delta33x.Delta33xRuntimeShim"
} else if (VersionUtils.cmpSparkVersion(4, 1, 0) >= 0) {
"org.apache.spark.sql.delta.rapids.delta41x.Delta41xRuntimeShim"
Collaborator

Nit: Can we move this after the 4.0.0 comparison? Just to keep it in sequential order.

Collaborator Author

Nice catch, updated

MergeIntoCommandMeta,
OptimizeTableCommandMeta,
UpdateCommandMeta
}
Collaborator

Nit: Single/ 2 lines here and in Delta40xRuntimeShim.scala.

Collaborator Author

done

@@ -0,0 +1,128 @@
/*
* Copyright (c) 2025-2026, NVIDIA CORPORATION.
Collaborator

Copyright header should be 2026 only, here and in all other new files under the delta-41x/ directory.

Collaborator Author

Nice catch, updated

Comment thread: delta-lake/README.md (Outdated)
Delta Lake is not supported on all Spark versions, and for Spark versions where it is not
supported the `delta-stub` project is used.

## Spark 4.1 Status
Collaborator

This section can be updated. I think we don't have to explicitly specify that we are building against delta-41x instead of delta-stub.

Also, regarding the caveats: is it the same for Delta 4.0? @jihoonson do you know?

Collaborator

The caveat about metadata being processed on the CPU is the same for all Delta versions. This is already documented in our user doc.

Collaborator

@firestarman I'm not quite sure why you are suggesting to add this status section in the readme doc. We should rather use the issue tracker to track the current status and remaining issues.

Collaborator Author

This was generated by Cursor, and I wasn't sure whether it was needed in the delta-lake docs, so I kept it.
I have now removed this section; thanks for the information.

Collaborator

I see, thanks for updating the README. The README file usually contains only the introduction to the project, such as what the project is, what is the use case, how the project is structured, etc.

@sameerz sameerz added the feature request New feature or request label Apr 22, 2026
Tighten the Delta 4.x shim formatting and documentation to match review feedback, keep the runtime shim checks in sequential order, and align the new delta-41x file headers with the repository's copyright convention.

Signed-off-by: Firestarman <[email protected]>
Made-with: Cursor
Keep every file newly added by the Delta 4.1 PR on a 2026-only copyright header, including the generated Scala 2.13 Delta module pom.

Signed-off-by: Firestarman <[email protected]>
Made-with: Cursor
@firestarman
Collaborator Author

build

Force the temporary upstream Spark smoke test to use the matching shim so integration tests load jars that match the overridden SPARK_HOME.

Signed-off-by: Firestarman <[email protected]>
Made-with: Cursor
@firestarman
Collaborator Author

build

Comment thread: pom.xml
Avoid failing Delta auto-compaction when optimize returns no metrics, so the hook
skips event emission instead of throwing on `metrics.head`.

Signed-off-by: Firestarman <[email protected]>
Made-with: Cursor
@firestarman
Collaborator Author

build

Collaborator

@jihoonson jihoonson left a comment

Thanks @firestarman for the PR. I left some comments.

Validate Delta integration coverage with serial ./run_pyspark_from_build.sh -m delta_lake --delta_lake runs on spark356, spark400, and spark411, yielding 701 selected / 558 passed / 143 skipped / 0 failed for spark356, 701 selected / 558 passed / 143 skipped / 0 failed for spark400, and 701 selected / 556 passed / 145 skipped / 0 failed for spark411.

What are the two tests skipped with spark411 that ran and passed with spark356 and spark400? We should make sure they don't skip unnecessarily.

Comment on lines +83 to +91
if (clazz == null) {
None
} else {
try {
Some(clazz.getDeclaredField("isUnityCatalog"))
} catch {
case _: NoSuchFieldException => findField(clazz.getSuperclass)
}
}
Collaborator

Is this just a sanity check? Or is there some version that actually this field is missing? If it's the latter, we should add a new shim instead.

Collaborator Author

This change came from a real Spark 4.1 / Delta 4.1 failure rather than a purely defensive cleanup. We hit NoSuchFieldException: isUnityCatalog in GpuDeltaCatalogBase, so the fix walks the class hierarchy because the field is no longer guaranteed to be declared on the concrete DeltaCatalog runtime class. Since the logical flag is still the same, I kept this as shared handling instead of adding a dedicated 4.1-only shim.
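A self-contained sketch of the superclass walk described here; it uses plain java.lang.Class reflection with a parameterized field name so it stands alone, whereas the PR hard-codes isUnityCatalog inside the Delta catalog class.

```scala
import java.lang.reflect.Field

// Walk the class hierarchy until the field is found, returning None if no
// ancestor declares it (instead of letting NoSuchFieldException escape).
def findDeclaredField(clazz: Class[_], name: String): Option[Field] = {
  if (clazz == null) {
    None
  } else {
    try {
      Some(clazz.getDeclaredField(name))
    } catch {
      case _: NoSuchFieldException => findDeclaredField(clazz.getSuperclass, name)
    }
  }
}

// Example (hypothetical usage): findDeclaredField(catalog.getClass, "isUnityCatalog")
```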

* Uses ClassicSparkSession and the Spark-side column conversion helpers.
*/
trait Delta40xCommandShims extends DeltaCommandShims {
trait ClassicSparkCommandShims extends DeltaCommandShims {
Collaborator

What are the classic and non-classic spark commands? These terms can change and have different meanings over time. We should use clearer terms.

Collaborator Author

Agreed. I kept the file in place to preserve review context, but renamed the shim trait/object to ClassicSessionDeltaCommandShims so it refers explicitly to Spark's classic session runtime instead of the more ambiguous classic/non-classic wording.

Comment on lines +72 to +74
protected def buildWriteOperation: DeltaOperations.Operation

protected def copyWithCpuWrite(newCpuWrite: WriteIntoDelta): WriteIntoDeltaLike
Collaborator

Please add some docs to explain what these functions do.

Collaborator Author

Added docs for buildWriteOperation and copyWithCpuWrite to explain what is version-specific in the shared Delta 4.x write path.


protected def buildWriteOperation: DeltaOperations.Operation

protected def copyWithCpuWrite(newCpuWrite: WriteIntoDelta): WriteIntoDeltaLike
Collaborator

Should it return GpuWriteIntoDeltaBase instead of WriteIntoDeltaLike?

Collaborator Author

Agreed. I tightened copyWithCpuWrite to return GpuWriteIntoDeltaBase instead of WriteIntoDeltaLike.

* Thin wrapper delegating to the shared Parquet format implementation.
*/
case class GpuDelta40xParquetFileFormat(
case class GpuDeltaParquetFileFormat(
Collaborator

I understand this class is under delta-lake/common/src/main/delta-40x-41x, but this does not appear in either the package path or the class name, so it is easy to miss that this class is only for Delta 4.0 and 4.1. It can easily be mistaken for a common class shared by every Delta version. Can we add the version to the class name? Maybe GpuDelta4xParquetFileFormat.

Collaborator Author

Agreed. I renamed it to GpuDelta4xParquetFileFormat so the Delta 4.0/4.1 scope is visible from the class name.

import org.apache.spark.sql.delta.rapids.GpuWriteIntoDelta
import org.apache.spark.sql.delta.rapids.delta41x.GpuCreateDeltaTableCommand

class GpuDeltaCatalog(
Collaborator

This looks the same as the one for Delta 4.0. Can we share that class instead of adding this?

Collaborator Author

Agreed. I removed the duplicate 4.0/4.1 wrappers and shared them through GpuDeltaCatalog4x.

Comment on lines +45 to +53
override def run(
spark: SparkSession,
txn: GpuOptimisticTransactionBase,
committedVersion: Long,
postCommitSnapshot: Snapshot,
actions: Seq[Action]): Unit = {
throw new UnsupportedOperationException(
"Spark 4.1 Delta auto-compaction uses the committed transaction hook")
}
Collaborator

This function does not exist in 4.1. We may need a new GpuAutoCompactBase for 4.1 that does not define this function.

Collaborator Author

Agreed. I split the transactional hook signature into GpuTransactionalAutoCompactBase for Delta 3.3/4.0 only, and kept the Delta 4.1 shim on GpuAutoCompactBase so the unsupported transactional run(...) method is no longer part of the 4.1 path.
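A rough Scala sketch of the split described here; the parameter types are simplified stand-ins for the real Delta transaction and snapshot types, and only the base-class names mirror the PR.

```scala
// Shared base: Delta 4.1 drives this from the committed-transaction hook.
trait GpuAutoCompactBaseSketch {
  def compactIfNeeded(tablePath: String): Unit =
    println(s"auto-compact check for $tablePath")
}

// Only Delta 3.3/4.0 keep the live-transaction run(...) signature, so a Delta
// 4.1 shim extending the shared base no longer inherits an unsupported method.
trait GpuTransactionalAutoCompactBaseSketch extends GpuAutoCompactBaseSketch {
  def run(tablePath: String, committedVersion: Long): Unit =
    compactIfNeeded(tablePath)
}
```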

Comment thread: jenkins/databricks/test.sh (Outdated)
# Override the Databricks-specific shim for this upstream Spark smoke test so
# run_pyspark_from_build.sh selects jars consistent with the temporary SPARK_HOME.
SPARK_HOME=$HOME/spark-${UPSTREAM_SPARK_VERSION}-bin-hadoop3${UPSTREAM_SPARK_SCALA_SUFFIX} \
SPARK_SHIM_VER=${UPSTREAM_SHIM_VER} \
Collaborator

Why this change in this PR? Are you adding Delta support for databricks as well?

Collaborator

I assume you didn't intend to add Databricks support. We should do it in separate PRs. #14420 is the issue for it.

Collaborator Author

Agreed. That Databricks smoke-test change was unrelated to Delta 4.1 support, so I reverted it from this PR.

}
}

test("delta provider resolves to a real implementation on spark 411") {
Collaborator

👍

assert(provider.getClass.getName.contains("Delta41xProvider"))
}

test("delta read and write execute on spark 411") {
Collaborator

This test seems to duplicate the read/write integration tests, doesn't it?

Collaborator Author

Agreed. I removed the duplicated Spark 4.1 read/write smoke test and kept only the provider resolution check. The read/write coverage remains in the integration tests.

Share the Delta 4.x catalog wrapper, clarify 4.x-specific shim names and docs,
and split the 4.1 auto-compact hook from the transactional base.

Drop the unrelated Databricks smoke-test tweak and the redundant Spark 4.1
Delta smoke test so this PR stays focused on Delta runtime support.

Made-with: Cursor
Signed-off-by: Firestarman <[email protected]>
Keep the Delta 4.0/4.1 provider imports aligned with repository grouping
checks so style CI no longer fails on the shared catalog import order.

Made-with: Cursor
Signed-off-by: Firestarman <[email protected]>
Use the Delta 4.1 overwrite mode value directly so invalid
partitionOverwriteMode settings keep surfacing the CPU-side exception.

Made-with: Cursor
Signed-off-by: Firestarman <[email protected]>
@firestarman
Collaborator Author

build

Keep spark-shell stderr in CI logs and avoid masking remote test failures with
follow-on report copy errors so DBR smoke test regressions stay diagnosable.

Signed-off-by: Firestarman <[email protected]>
Made-with: Cursor
@firestarman
Collaborator Author

build

@firestarman
Collaborator Author

Converting to draft to debug a failure on the Databricks runtime.

@firestarman firestarman marked this pull request as draft April 24, 2026 06:25
Capture standalone worker and executor stdout/stderr when the Databricks
spark-shell smoke test fails so CI exposes the root cause without manual SSH.

Signed-off-by: Firestarman <[email protected]>
Made-with: Cursor
@firestarman
Collaborator Author

build

Extract matched executor errors and include both the head and tail of standalone
worker logs so Databricks smoke test failures expose the root cause quickly.

Signed-off-by: Firestarman <[email protected]>
Made-with: Cursor
@firestarman
Collaborator Author

build

@firestarman
Collaborator Author

firestarman commented Apr 24, 2026

I investigated the Databricks smoke-test failures, and they do not look related to the Delta 4.1 changes in this PR.
The failing step is the Databricks two-shim spark-shell smoke test that runs before the actual Databricks integration suite. The query being executed is just:
spark.range(100).agg(Map("id" -> "sum")).collect()
so no Delta code is exercised on this path. I reproduced the same class of failure on both:

  • DBR 17.3 (BASE_SPARK_VERSION=4.0.0, upstream smoke shim spark350)
  • DBR 14.3 (BASE_SPARK_VERSION=3.5.0, upstream smoke shim spark330)

After adding extra executor log capture, the executor-side stack trace shows the failure happens during RAPIDS executor plugin initialization, before the query can actually run:

RapidsExecutorPlugin: Initializing memory from Executor Plugin
Exception in the executor plugin, shutting down!
ai.rapids.cudf.CudaFatalException: Fatal CUDA error encountered ... cudaErrorUnknown unknown error
  at ai.rapids.cudf.Cuda.getDeviceCount(Native Method)
  at com.nvidia.spark.rapids.GpuDeviceManager$.findGpuAndAcquire(...)
  at com.nvidia.spark.rapids.GpuDeviceManager$.initializeMemory(...)
  at com.nvidia.spark.rapids.GpuDeviceManager$.initializeGpuAndMemory(...)
  at com.nvidia.spark.rapids.RapidsExecutorPlugin.init(...)

I don't know how to bypass it.
More info below:

# Some errors in the output of "nvidia-smi -q" :

Product Name : NVIDIA A10-24Q
Addressing Mode : Unknown Error

@firestarman firestarman marked this pull request as ready for review April 24, 2026 08:13
The extra spark-shell and executor logging was only added to debug the Databricks failure and is no longer needed after confirming the issue is outside this PR.

Signed-off-by: Firestarman <[email protected]>
Made-with: Cursor
@firestarman
Collaborator Author

build


Labels

feature request New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEA] Add support for Delta Lake 4.1.x

5 participants