Skip to content

feat: Collect Parquet NaN metrics during writes#727

Open
WZhuo wants to merge 1 commit into
apache:mainfrom
WZhuo:parquet_metric
Open

feat: Collect Parquet NaN metrics during writes#727
WZhuo wants to merge 1 commit into
apache:mainfrom
WZhuo:parquet_metric

Conversation

@WZhuo

@WZhuo WZhuo commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Collects NaN value counts for float and double columns during Parquet writes, since the Parquet footer statistics do not track NaN counts.

Changes

  • Write-side NaN metric collection (FieldMetricsCollector): A visitor that walks each record batch before writing, accumulating value counts, null counts, NaN counts, and NaN-excluding lower/upper bounds for float/double fields.
  • MetricsConfig-aware skipping: Fields whose MetricsMode is kNone are skipped entirely, avoiding wasted work.
  • Integration with existing footer metrics: Write-side FieldMetrics take precedence over footer statistics in ParquetMetrics::GetMetrics, so NaN counts are populated while counts/bounds still fall back to footer stats when write-side data isn't available.
  • Tests: ParquetMetricsTest now overrides ReportsNanCounts() to true, and existing NaN test cases verify NaN counts alongside existing value/null count assertions.

Behavior alignment with Java

  • Fields nested inside lists/maps do not get NaN metrics (both Java and C++ agree — Java collects then discards; C++ skips collection entirely).
  • NaN values are excluded from lower/upper bounds in both implementations.
  • Float/double fields with all-NaN values correctly set nan_value_count without setting bounds.

return {};
}

std::optional<std::vector<uint8_t>> BuildValidRows(const ::arrow::Array& array,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking: this could be made cheaper by carrying Arrow validity bitmaps instead of materializing a std::vector<uint8_t> and calling array.IsNull(i) for every row at each struct level. The main detail is that this needs to preserve the current parent-validity behavior: child arrays of a null StructArray do not necessarily have those parent nulls reflected in their own null bitmap, so the effective validity should be parent_validity AND array.null_bitmap().

One possible approach is to keep a bitmap/buffer plus offset for the current effective validity, and when descending into a field, use Arrow bitmap utilities to AND it with the child array validity bitmap (or just reuse the parent bitmap when the child has null_count() == 0). Then the float/double collector can iterate only the effective bitmap bits instead of indexing a byte vector. This should avoid per-level byte-vector allocation and reduce the per-row validity checks for wide/deep struct writes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants