Skip to content

Add footer cache for iceberg#14645

Open
liurenjie1024 wants to merge 5 commits intoNVIDIA:mainfrom
liurenjie1024:ray/14064
Open

Add footer cache for iceberg#14645
liurenjie1024 wants to merge 5 commits intoNVIDIA:mainfrom
liurenjie1024:ray/14064

Conversation

@liurenjie1024
Copy link
Copy Markdown
Collaborator

@liurenjie1024 liurenjie1024 commented Apr 22, 2026

Fixes #14064

Description

Allow iceberg's parquet reader to utilize footer cache. Since in older version of parquet, the ParquetFileReader constructor with ParquetMetadata is not available, we actually ignored the footer cache in iceberg 1.6.x and 1.9.x shim, and it's only available in iceberg 1.10 shim.

Checklists

Documentation

  • Updated for new or modified user-facing features or behaviors
  • No user-facing change

Testing

  • Added or modified tests to cover new code paths
  • Covered by existing tests
    (Please provide the names of the existing tests in the PR description.)
  • Not required

Performance

  • Tests ran and results are added in the PR description
  • Issue filed with a link in the PR description
  • Not required

For the reason mentioned in pr description, we should only do perf test against iceberg 1.10 shim

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 22, 2026

Greptile Summary

This PR introduces footer-cache support for the Iceberg Parquet reader by extracting shared footer-read/cache logic into a new ParquetFooterUtils object and adding per-version GpuParquetIOShim implementations. The 1.10.x shim injects a pre-parsed ParquetMetadata (read from cache or file) into ParquetFileReader via its 4-arg constructor, while the 1.6.x and 1.9.x shims fall back to uncached reads but correctly bump the miss counter. The GpuParquetScan inline cache logic is cleanly removed in favour of the new shared utility, and verifyParquetMagic / encrypted-parquet detection are likewise consolidated there.

Confidence Score: 5/5

Safe to merge — all resource-management patterns are correct and the shim split is complete across all three iceberg versions.

No P0/P1 findings. ARM patterns (withResource/closeOnExcept) are applied correctly throughout. By-name thunk in getFooterBuffer ensures readFooterBuffer is only evaluated on a cache miss. Shims added for all three iceberg versions (C4). The refactoring in GpuParquetScan preserves existing behaviour, including the encrypted-parquet detection path.

No files require special attention.

Important Files Changed

Filename Overview
sql-plugin/src/main/scala/com/nvidia/spark/rapids/parquet/ParquetFooterUtils.scala New utility object factoring out shared footer read/cache logic; ARM patterns correct, by-name thunk used properly for lazy miss evaluation
iceberg/iceberg-1-10-x/src/main/scala/org/apache/iceberg/parquet/GpuParquetIOShim.scala New 1.10.x shim: reads footer via FileCache using 4-arg ParquetFileReader constructor; ARM patterns (withResource/closeOnExcept) correctly applied
iceberg/iceberg-1-6-x/src/main/scala/org/apache/iceberg/parquet/GpuParquetIOShim.scala New 1.6.x shim: caching unsupported, falls back to plain open() and bumps miss counter for metric visibility
iceberg/iceberg-1-9-x/src/main/scala/org/apache/iceberg/parquet/GpuParquetIOShim.scala New 1.9.x shim: identical to 1.6.x shim — caching unsupported, plain open() with miss counter bump
iceberg/common/src/main/scala/com/nvidia/spark/rapids/iceberg/parquet/converter/ToIcebergShaded.scala New adapter bridging non-shaded parquet InputFile/SeekableInputStream to iceberg-shaded equivalents; all stream delegates correctly forwarded
sql-plugin/src/main/scala/com/nvidia/spark/rapids/parquet/GpuParquetScan.scala Inline footer-cache logic removed and delegated to ParquetFooterUtils; verifyParquetMagic also extracted; behaviour preserved
iceberg/common/src/main/scala/org/apache/iceberg/parquet/GpuParquetIO.scala Extended with openReader() that dispatches to per-version shim; thin delegation layer, no resource-management concerns
iceberg/common/src/main/scala/com/nvidia/spark/rapids/iceberg/parquet/reader.scala newReader() updated to accept optional metrics map and delegate to GpuParquetIO.openReader for footer-cache awareness

Sequence Diagram

sequenceDiagram
    participant Reader as IcebergPartitionedFile.newReader()
    participant GpuParquetIO as GpuParquetIO.openReader()
    participant Shim110 as GpuParquetIOShim (1.10.x)
    participant Shim16_19 as GpuParquetIOShim (1.6/1.9.x)
    participant PFU as ParquetFooterUtils
    participant FC as FileCache
    participant PFR as ParquetFileReader (shaded)

    Reader->>GpuParquetIO: openReader(inputFile, path, opts, metrics)
    GpuParquetIO->>Shim110: openReader(...) [1.10.x]
    GpuParquetIO->>Shim16_19: openReader(...) [1.6/1.9.x]

    alt Iceberg 1.10.x
        Shim110->>PFU: getFooterBuffer(inputFile, metrics, readFooterBufferFromInputFile)
        PFU->>FC: getFooter(inputFile)
        alt Cache HIT
            FC-->>PFU: HMB (cached)
            PFU-->>Shim110: HMB (metrics: hit+size)
        else Cache MISS
            PFU->>PFU: readFooterBufferFromInputFile(inputFile, path)
            PFU->>FC: startFooterCache / complete
            PFU-->>Shim110: HMB (metrics: miss+size)
        end
        Shim110->>Shim110: ToIcebergShaded.shade(HMBInputFile)
        Shim110->>PFR: readFooter(shadedHmbFile, opts, hmbStream) → metadata
        Shim110->>PFR: new ParquetFileReader(realFile, metadata, opts, stream)
    else Iceberg 1.6.x / 1.9.x
        Shim16_19->>Shim16_19: metrics += FILECACHE_FOOTER_MISSES
        Shim16_19->>PFR: ParquetFileReader.open(file, opts)
    end
Loading

Reviews (5): Last reviewed commit: "Fix buid break" | Re-trigger Greptile

Comment thread iceberg/common/src/main/scala/org/apache/iceberg/parquet/GpuParquetIO.scala Outdated
@liurenjie1024 liurenjie1024 marked this pull request as draft April 22, 2026 10:04
@liurenjie1024 liurenjie1024 marked this pull request as ready for review April 23, 2026 06:05
Signed-off-by: Ray Liu <liurenjie2008@gmail.com>
@sameerz sameerz added the performance A performance related task/issue label Apr 23, 2026
Copy link
Copy Markdown
Collaborator

@gerashegalov gerashegalov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few non-blocking review comments inline. One additional note on PR description scope (#7):

The title says "Add footer cache for iceberg" but only iceberg 1.10.x actually gets footer caching; the 1.6.x and 1.9.x shims always go through the plain ParquetFileReader.open(...) path because the shaded ParquetFileReader in those versions has no public API to inject pre-parsed metadata. Worth calling that limitation out explicitly in the PR description (and probably in the Performance section), so users on the older iceberg shims don't expect a perf win and reviewers don't go looking for tests asserting footer-cache hits there.

throw new RuntimeException(s"$filePath is not a Parquet file (too small length: $fileLen )")
}
val footerLengthIndex = fileLen - FooterLengthSize - MAGIC.length
withResource(inputFile.open()) { inputStream =>
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lost NVTX range on the iceberg footer-read path.

The original readFooterBufUsingHadoop in GpuParquetScan.scala wraps the actual seek + copy loop in NvtxRegistry.PARQUET_READ_FOOTER_BYTES. This new helper drops it, so iceberg footer reads (which now flow through readFooterBufferFromInputFile from the 1.10.x shim) will silently disappear from NVTX traces and won't be comparable to non-iceberg footer reads.

Suggest wrapping the withResource(inputFile.open()) { ... } block (or just the seek/read loop inside it) in NvtxRegistry.PARQUET_READ_FOOTER_BYTES { ... } for parity.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

IOUtils.readFully(inputStream, magic, 0, magic.length)
verifyParquetMagic(filePath, magic)
val fIdx = footerIndex(filePath, fileLen, footerLength, FooterLengthSize)
val tailBytes = (fileLen - fIdx).toInt
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same Long -> Int truncation concern Greptile flagged on the cache-hit path applies here on the raw-read path as well: if fileLen - fIdx ever exceeds 2 GB this silently overflows to a negative value and HostMemoryBuffer.allocate(tailBytes + MAGIC.length, false) will fail with a confusing message (or worse).

Astronomically unlikely for footers, but Math.toIntExact(fileLen - fIdx) would at least give a deterministic, descriptive failure if it ever happens. Cheap to add.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

// buffer includes header and trailing length and magic, stripped here
hmb.slice(MAGIC.length, hmb.getLength - Integer.BYTES - MAGIC.length)
}
withResource(ParquetFooterUtils.getFooterBuffer(inputFile, metrics,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice consolidation here. Follow-up suggestion: the other footer-cache hit/miss block in this same file, readAndSimpleFilterFooter (around line 631), still has its own near-identical FileCache.getFooter / startFooterCache / try { ... token.complete } finally { token.cancel } ladder. That's exactly the pattern you just centralized in ParquetFooterUtils.getFooterBuffer.

It could be reduced to roughly:

withResource(ParquetFooterUtils.getFooterBuffer(inputFile, metrics,
    readFooterBuffer(fileIO, filePath, conf))) { hmb =>
  ParquetFileReader.readFooter(new HMBInputFile(hmb),
    ParquetMetadataConverter.range(file.start, file.start + file.length))
}

Happy for it to be a separate PR if you want to keep this one minimal, but the duplication is exactly what motivated the extraction, and centralizing also means the resource-leak fix (the try/finally token.cancel() you correctly added in getFooterBuffer) lives in only one place.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@liurenjie1024
Copy link
Copy Markdown
Collaborator Author

liurenjie1024 commented Apr 24, 2026

A few non-blocking review comments inline. One additional note on PR description scope (#7):

The title says "Add footer cache for iceberg" but only iceberg 1.10.x actually gets footer caching; the 1.6.x and 1.9.x shims always go through the plain ParquetFileReader.open(...) path because the shaded ParquetFileReader in those versions has no public API to inject pre-parsed metadata. Worth calling that limitation out explicitly in the PR description (and probably in the Performance section), so users on the older iceberg shims don't expect a perf win and reviewers don't go looking for tests asserting footer-cache hits there.

cc @gerashegalov I've updated the pr description part.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 24, 2026

Want your agent to iterate on Greptile's feedback? Try greploops.

@liurenjie1024
Copy link
Copy Markdown
Collaborator Author

build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance A performance related task/issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEA] Iceberg parquet reader should use file cache for parquet footers.

4 participants