feat: Collect Parquet NaN metrics during writes #727
  return {};
}

std::optional<std::vector<uint8_t>> BuildValidRows(const ::arrow::Array& array,
Non-blocking: this could be made cheaper by carrying Arrow validity bitmaps instead of materializing a std::vector<uint8_t> and calling array.IsNull(i) for every row at each struct level. The main detail is that this needs to preserve the current parent-validity behavior: child arrays of a null StructArray do not necessarily have those parent nulls reflected in their own null bitmap, so the effective validity should be parent_validity AND array.null_bitmap().
One possible approach is to keep a bitmap/buffer plus offset for the current effective validity, and when descending into a field, use Arrow bitmap utilities to AND it with the child array validity bitmap (or just reuse the parent bitmap when the child has null_count() == 0). Then the float/double collector can iterate only the effective bitmap bits instead of indexing a byte vector. This should avoid per-level byte-vector allocation and reduce the per-row validity checks for wide/deep struct writes.
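To make the suggestion concrete, here is a minimal standard-library-only sketch of the effective-validity idea. It is not the PR's code and does not call Arrow: the helpers `GetBit`, `EffectiveValidity`, and `CountNaNs` are hypothetical names, bitmaps are modeled as LSB-first byte vectors (the bit order Arrow validity bitmaps use), and array offsets are assumed to be zero. A real implementation would presumably lean on Arrow's own bitmap utilities, which also handle offsets.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: bit i (LSB-first within each byte) says whether
// row i is valid.
inline bool GetBit(const std::vector<uint8_t>& bitmap, std::size_t i) {
  return ((bitmap[i / 8] >> (i % 8)) & 1) != 0;
}

// Effective validity = parent validity AND child validity.
// Assumption for this sketch: an empty bitmap models "no nulls at this
// level", so the other side's bitmap is reused without copying bit by bit.
std::vector<uint8_t> EffectiveValidity(const std::vector<uint8_t>& parent,
                                       const std::vector<uint8_t>& child) {
  if (child.empty()) return parent;
  if (parent.empty()) return child;
  std::vector<uint8_t> out(parent.size());
  for (std::size_t b = 0; b < parent.size(); ++b) {
    out[b] = static_cast<uint8_t>(parent[b] & child[b]);  // byte-wise AND
  }
  return out;
}

// Counts NaNs only among rows whose effective validity bit is set.
// An empty validity bitmap means every row is valid.
int64_t CountNaNs(const std::vector<double>& values,
                  const std::vector<uint8_t>& validity) {
  int64_t count = 0;
  for (std::size_t i = 0; i < values.size(); ++i) {
    const bool valid = validity.empty() || GetBit(validity, i);
    if (valid && std::isnan(values[i])) ++count;
  }
  return count;
}
```

The point of the sketch is that a NaN sitting under a null parent struct is never counted, even when the child's own bitmap marks that row as valid, and that no per-row `IsNull(i)` call is needed at each struct level.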
Collects NaN value counts for float and double columns during Parquet writes, since the Parquet footer statistics do not track NaN counts.
Changes
- FieldMetricsCollector: A visitor that walks each record batch before writing, accumulating value counts, null counts, NaN counts, and NaN-excluding lower/upper bounds for float/double fields.
- Fields whose MetricsMode is kNone are skipped entirely, avoiding wasted work.
- FieldMetrics take precedence over footer statistics in ParquetMetrics::GetMetrics, so NaN counts are populated while counts/bounds still fall back to footer stats when write-side data isn't available.
- ParquetMetricsTest now overrides ReportsNanCounts() to true, and existing NaN test cases verify NaN counts alongside existing value/null count assertions.
Behavior alignment with Java
- nan_value_count without setting bounds.
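The accumulation and precedence rules described above can be sketched as follows. This is a minimal illustration and not the PR's implementation: `FloatFieldStats`, `Accumulate`, and `PreferWriteSide` are hypothetical names, rows are modeled as `std::optional<double>` instead of Arrow arrays, and the sketch assumes that the value count includes null and NaN rows.

```cpp
#include <cmath>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical per-field accumulator for one float/double column.
struct FloatFieldStats {
  int64_t value_count = 0;  // assumption: counts every row, nulls and NaNs included
  int64_t null_count = 0;
  int64_t nan_count = 0;
  std::optional<double> lower;  // unset until a non-null, non-NaN value is seen
  std::optional<double> upper;
};

// Folds one batch of rows into the running stats.
// std::nullopt models a null row.
void Accumulate(const std::vector<std::optional<double>>& rows,
                FloatFieldStats& stats) {
  for (const auto& row : rows) {
    ++stats.value_count;
    if (!row.has_value()) {
      ++stats.null_count;
      continue;
    }
    if (std::isnan(*row)) {
      ++stats.nan_count;  // NaNs are counted but never touch the bounds
      continue;
    }
    if (!stats.lower || *row < *stats.lower) stats.lower = *row;
    if (!stats.upper || *row > *stats.upper) stats.upper = *row;
  }
}

// Write-side value wins when present; otherwise fall back to the footer value.
template <typename T>
std::optional<T> PreferWriteSide(const std::optional<T>& write_side,
                                 const std::optional<T>& footer) {
  return write_side.has_value() ? write_side : footer;
}
```

With this shape, a column containing only NaN values ends up with a non-zero `nan_count` and both bounds left unset, and a metric missing on the write side still resolves to whatever the footer reported.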