feat: Collect Parquet NaN metrics during writes by WZhuo · Pull Request #727 · apache/iceberg-cpp

WZhuo · 2026-06-11T03:11:52Z

Collects NaN value counts for float and double columns during Parquet writes, since the Parquet footer statistics do not track NaN counts.

Changes

Write-side NaN metric collection (FieldMetricsCollector): A visitor that walks each record batch before writing, accumulating value counts, null counts, NaN counts, and NaN-excluding lower/upper bounds for float/double fields.
MetricsConfig-aware skipping: Fields whose MetricsMode is kNone are skipped entirely, avoiding wasted work.
Integration with existing footer metrics: Write-side FieldMetrics take precedence over footer statistics in ParquetMetrics::GetMetrics, so NaN counts are populated while counts/bounds still fall back to footer stats when write-side data isn't available.
Tests: ParquetMetricsTest now overrides ReportsNanCounts() to true, and existing NaN test cases verify NaN counts alongside existing value/null count assertions.

Behavior alignment with Java

Fields nested inside lists/maps do not get NaN metrics (both Java and C++ agree — Java collects then discards; C++ skips collection entirely).
NaN values are excluded from lower/upper bounds in both implementations.
Float/double fields with all-NaN values correctly set nan_value_count without setting bounds.

wgtmac · 2026-06-15T05:28:04Z

+  return {};
+}
+
+std::optional<std::vector<uint8_t>> BuildValidRows(const ::arrow::Array& array,


Non-blocking: this could be made cheaper by carrying Arrow validity bitmaps instead of materializing a std::vector<uint8_t> and calling array.IsNull(i) for every row at each struct level. The main detail is that this needs to preserve the current parent-validity behavior: child arrays of a null StructArray do not necessarily have those parent nulls reflected in their own null bitmap, so the effective validity should be parent_validity AND array.null_bitmap().

One possible approach is to keep a bitmap/buffer plus offset for the current effective validity, and when descending into a field, use Arrow bitmap utilities to AND it with the child array validity bitmap (or just reuse the parent bitmap when the child has null_count() == 0). Then the float/double collector can iterate only the effective bitmap bits instead of indexing a byte vector. This should avoid per-level byte-vector allocation and reduce the per-row validity checks for wide/deep struct writes.

feat: Collect Parquet NaN metrics during writes

983ca75

wgtmac approved these changes Jun 15, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Collect Parquet NaN metrics during writes#727

feat: Collect Parquet NaN metrics during writes#727
WZhuo wants to merge 1 commit into
apache:mainfrom
WZhuo:parquet_metric

WZhuo commented Jun 11, 2026 •

edited

Loading

Uh oh!

wgtmac Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

WZhuo commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes

Behavior alignment with Java

Uh oh!

wgtmac Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

WZhuo commented Jun 11, 2026 •

edited

Loading