feat(parquet): fuse level encoding passes and compact level representation #9653

Open
HippoBaro wants to merge 6 commits into apache:main from HippoBaro:faster_sparse_columns_encoding

Conversation

@HippoBaro
Contributor

Which issue does this PR close?

Rationale for this change

See issue for details. The Parquet column writer currently does per-value work during level encoding regardless of data sparsity, even though the output encoding (RLE) is proportional to the number of runs.

What changes are included in this PR?

Three incremental commits, each building on the previous:

  1. Fuse level encoding with counting and histogram updates. write_mini_batch() previously made three separate passes over each level array: count non-nulls, update the level histogram, and RLE-encode. Now all three happen in a single pass via an observer callback on LevelEncoder. When the RLE encoder enters accumulation mode, the loop scans ahead for the full run length and batches the observer call. This makes counting and histogram updates O(1) per run.

  2. Batch consecutive null/empty rows in write_list. Consecutive null or empty list entries are now collapsed into a single visit_leaves() call that bulk-extends all leaf level buffers, instead of one tree traversal per null row. Mirrors the approach already used by write_struct().

  3. Short-circuit entirely-null columns. When every element in an array is null, skip Vec<i16> level-buffer materialization entirely and store a compact (def_value, rep_value, count) tuple. The writer encodes this via RleEncoder::put_n() in O(1) amortized time, bypassing the normal mini-batch loop.
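The fusion in (1) can be sketched as follows. This is a simplified, hypothetical illustration, not the crate's actual `LevelEncoder`/`RleEncoder` API: it returns `(value, run_length)` pairs instead of RLE bytes, but shows how an observer callback lets counting and histogram updates happen once per run rather than once per value.

```rust
/// Simplified sketch: "encode" i16 levels as (value, run_length) pairs while
/// invoking an observer once per run, so callers can fuse counting and
/// histogram updates into the same pass at O(1) cost per run.
fn encode_levels_with_observer(
    levels: &[i16],
    mut observer: impl FnMut(i16, usize),
) -> Vec<(i16, usize)> {
    let mut runs = Vec::new();
    let mut i = 0;
    while i < levels.len() {
        let value = levels[i];
        // Scan ahead for the full run length, mirroring the scan-ahead the
        // PR performs once the RLE encoder enters accumulation mode.
        let mut j = i + 1;
        while j < levels.len() && levels[j] == value {
            j += 1;
        }
        let run = j - i;
        observer(value, run); // one callback per run, not per value
        runs.push((value, run));
        i = j;
    }
    runs
}
```

A caller can then count non-null values and bump the level histogram inside the observer, replacing the three separate passes described above.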

Are these changes tested?

All tests pass. I added benchmarks to exercise the heavy and all-null code paths, alongside the existing 25% sparseness benchmarks:

Name                                 Before      After      Delta
primitive_all_null/default           37.5 ms     0.20 ms    (−99.5%)
primitive_all_null/zstd              37.1 ms     0.30 ms    (−99.2%)
primitive_sparse_99pct_null/default  42.5 ms     15.7 ms    (−62.9%)
primitive_sparse_99pct_null/p2       42.4 ms     15.9 ms    (−62.4%)
list_prim_sparse_99pct_null/default  40.8 ms     11.2 ms    (−72.4%)
list_prim_sparse_99pct_null/p2       40.8 ms     10.7 ms    (−73.8%)
bool/default                         12.7 ms     10.3 ms    (−18.7%)
primitive/default                   124.1 ms    104.6 ms    (−15.6%)
string_and_binary_view/default       46.3 ms     41.6 ms    (−10.1%)
list_primitive/default              253.9 ms    235.3 ms    (−7.4%)
string_dictionary/default            46.2 ms     43.8 ms    (−5.3%)

Non-nullable column benchmarks are within noise, as expected since they have no definition levels to optimize.

Are there any user-facing changes?

None.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Apr 2, 2026
@HippoBaro HippoBaro force-pushed the faster_sparse_columns_encoding branch from 335fb81 to 44dae05 Compare April 2, 2026 05:05
@HippoBaro
Contributor Author

This is a continuation of the work done in #9447 to improve runtime performance around sparse and/or highly uniform columns. As such this may be of interest to @alamb and @etseidl.

5a1d3d7 adds three benchmarks that exercise the code path this series optimizes. I created a PR (#9654) to merge those separately if needed so the benchmark bot can have a baseline to compare against.

Thanks!

/// to add more repetitions without per-element overhead.
#[inline]
pub fn is_accumulating(&self, value: u64) -> bool {
    self.repeat_count > 8 && self.current_value == value
}
Contributor

should this be '>= 8'?

Also, given the discussion in #7739, I think it's time to at least replace the magic 8 with a constant.

Contributor Author

should this be '>= 8'?

The RLE encoder transitions to accumulation mode after the 8th value has been buffered and flush_buffered_values() has committed the RLE decision.

Also, given the discussion in #7739, I think it's time to at least replace the magic 8 with a constant.

I agree! Happy to add that at the end of this series.

Contributor

The RLE encoder transitions to accumulation mode after the 8th value has been buffered and flush_buffered_values() has committed the RLE decision.

Here's my understanding: a repeated value is added with put(). The repeat_count is incremented, and it reaches 8. This does not trigger the return branch, and continues on. num_buffered_values is currently 7, the value is added to the buffered_values array, and num_buffered_values is incremented to 8. This triggers flush_buffered_values(). flush_buffered_values() sees that repeat_count is 8, so it simply sets num_buffered_values to 0, potentially ends a previous bit-packed run by writing the run length indicator, and returns. We then return from put() with repeat_count still 8 and num_buffered_values = 0, and we're now in accumulating mode. If is_accumulating() is called after this put() (which seems to always be the case), I think `>= 8` is correct.

Contributor Author

Thank you for pushing back. You are absolutely right: `>` happened to work in practice but was misleading; `>=` is exactly right. It's fixed in the latest version, and I added a unit test for good measure.

Contributor Author

@HippoBaro HippoBaro Apr 8, 2026

Also, c891c35 introduces a new constant BIT_PACK_GROUP_SIZE as requested, and replaces the leftover literals that referred to the number of bits in a byte with u8::BITS from the standard library.

@etseidl
Contributor

etseidl commented Apr 2, 2026

Thanks @HippoBaro, this looks impressive. I'm still looking, but haven't found any obvious problems yet.

Gads, every time I delve this deep into parquet I go a little mad 😵‍💫. I think the RLE encoder could use a little refactoring/comment improvements to make the flow a little more obvious. Not as part of this PR though.

Contributor

@etseidl etseidl left a comment

Flushing a few comments. More tomorrow.

Comment on lines 722 to 723
let iter = std::iter::repeat_n(info.max_def_level, len);
def_levels.extend(iter);
Contributor

Could this case (which I think is nullable but no nulls) also make use of the uniform levels?

Contributor Author

See ab9a7bc

let mut values_to_write = 0usize;
let max_def = self.descr.max_def_level();
self.def_levels_encoder
    .put_with_observer(levels, |level, count| {
Contributor

❤️ When I added the histograms I wasn't happy with the redundancy here. Nice fix!

Comment on lines +940 to +943
/// Bulk-emit `count` uniform null def/rep levels. If the level Vecs are
/// still empty, stores a compact `uniform_levels` tuple instead of
/// materializing the Vecs. Otherwise falls back to extending them.
fn extend_uniform_null_levels(&mut self, def_val: i16, rep_val: i16, count: usize) {
Contributor

I'll preface with the admission I'm not all that familiar with this part of the code. So if this is called after some levels have been added to the vecs, it will extend the vecs. What happens in the reverse case, after setting uniform_levels an attempt is made to extend the level vecs?

Contributor Author

Yep, that was a footgun! The latest patch introduces a better state machine where the transition from uniform to dense is explicit: appending a run with a different value to a Uniform now materializes it into a vec first.

/// When set, all def/rep levels are a single repeated value and the
/// Vec fields above are empty. Tuple: (def_value, rep_value, count).
/// This avoids materializing large Vecs for entirely-null columns.
uniform_levels: Option<(i16, i16, usize)>,
Contributor

I wonder if the logic around these optionals and extend_uniform_null_levels could be made clearer with an enum. The None case for def_levels/rep_levels also seems similar to a uniform value of 0. So maybe it could look something like

enum LevelData {
    Vec(Vec<i16>),
    Uniform(i16, usize),
}

struct ArrayLevels {
    def_levels: LevelData,
    rep_levels: LevelData,
    ...
}

Contributor Author

See #9653 (comment), thanks!

Add `is_accumulating()` and `extend_run()` methods to `RleEncoder`
that allow callers to detect when the encoder is in RLE accumulation
mode and bulk-extend runs without per-element overhead.

Add `put_with_observer()` to `LevelEncoder` that calls an
`FnMut(i16, usize)` observer for each run of identical values during
encoding. This allows callers to piggyback counting and histogram
updates into the encoding pass without extra iterations over the level
buffer. Refactor `put()` to delegate to it with a no-op observer.

Previously, `write_mini_batch()` made 3 separate passes over each level
array: one to count non-null values or row boundaries, one to update
the level histogram, and one to RLE-encode. Now all three operations
happen in a single pass via the observer closure.

Remove the separate `update_definition_level_histogram()` and
`update_repetition_level_histogram()` methods from PageMetrics. Add
`LevelHistogram::update_n()` for batch histogram updates.

The encoding loop now checks if the encoder entered RLE accumulation
mode after a call to `RleEncoder::put()`. When it does, it scans ahead
for the rest of the run and batches the observer call with the full run
length, enabling O(1) histogram and counting updates per RLE run.

Benchmark results (vs baseline):

  primitive_sparse_99pct_null/default          15.2 ms  (was 40.3 ms, −62%)
  primitive_sparse_99pct_null/parquet_2        16.1 ms  (was 43.5 ms, −63%)
  primitive_sparse_99pct_null/zstd_parquet_2   17.0 ms  (was 44.4 ms, −62%)
  list_primitive_sparse_99pct_null/default     17.4 ms  (was 39.9 ms, −56%)
  list_primitive_sparse_99pct_null/parquet_2   16.7 ms  (was 39.9 ms, −58%)
  list_primitive_sparse_99pct_null/zstd_p2     16.8 ms  (was 40.7 ms, −59%)
  primitive_all_null/default                    8.8 ms  (was 38.0 ms, −77%)
  primitive_all_null/parquet_2                  8.8 ms  (was 36.9 ms, −76%)
  primitive_all_null/zstd_parquet_2             8.9 ms  (was 36.1 ms, −75%)

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
Restructure `write_list()` to accumulate consecutive null and empty rows
and flush them in a single `visit_leaves()` call using
`extend(repeat_n(...))`, instead of calling `visit_leaves()` per row.

With sparse data (99% nulls), a 4096-row batch previously triggered
~4000 individual tree traversals, each pushing a single value per leaf.
Now consecutive null/empty runs are collapsed into one traversal that
extends all leaf level buffers in bulk.

This follows the same pattern already used by `write_struct()`. The
`write_non_null_slice` path is unchanged since each non-null row has
different offsets and cannot be batched.

Benchmark results (vs previous commit):

  list_primitive_sparse_99pct_null/default     10.5 ms  (was 17.4 ms, −40%)
  list_primitive_sparse_99pct_null/parquet_2   10.5 ms  (was 16.7 ms, −37%)
  list_primitive_sparse_99pct_null/zstd_p2     10.6 ms  (was 16.8 ms, −37%)

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
When every element in a list, struct, or fixed-size list array is null,
short-circuit level building before the row loop and store a compact
`(def_value, rep_value, count)` tuple on `ArrayLevels` instead of
materializing `Vec<i16>` buffers. The same fast path applies at the leaf
level in `write_levels()` when `logical_nulls` covers every row.

On the write side, `ArrowColumnWriter` detects the `uniform_levels`
tuple and calls a dedicated `write_uniform_null_batch()` that encodes
def/rep levels via `RleEncoder::put_n()` in O(1) amortized time,
bypassing the normal mini-batch chunking and per-element iteration. A
new `LevelEncoder::put_n_with_observer()` fuses encoding with histogram
and counting updates in a single call. `write_uniform_null_batch` chunks
at the configured page row count limit to respect page boundaries.

Also defers `non_null_indices.reserve()` to branches that actually
populate it, avoiding an unnecessary allocation for all-null arrays.
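The O(1) cost follows from the hybrid RLE format itself: an RLE run is a ULEB128 header (`run_len << 1`, LSB 0 marking an RLE run) followed by a single copy of the value in `ceil(bit_width / 8)` bytes, so a uniform run of any length encodes in a handful of bytes. A minimal sketch of what a `put_n`-style bulk emit does (illustrative, not the crate's `RleEncoder` internals):

```rust
/// Encode one uniform run as a single RLE run: ULEB128 header, then the
/// repeated value once, in ceil(bit_width / 8) little-endian bytes.
fn encode_uniform_run(value: u64, count: usize, bit_width: u8, out: &mut Vec<u8>) {
    // ULEB128-encode the RLE run header (run length shifted left by one).
    let mut header = (count as u64) << 1;
    loop {
        let byte = (header & 0x7f) as u8;
        header >>= 7;
        if header == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80); // continuation bit
    }
    // The repeated value, stored once at fixed width.
    let value_bytes = (bit_width as usize).div_ceil(u8::BITS as usize);
    out.extend_from_slice(&value.to_le_bytes()[..value_bytes]);
}
```

At bit width 1, a 4096-value all-null run encodes to just three bytes, which is consistent with the all-null benchmarks below dropping from milliseconds to microseconds.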

Benchmark results (vs previous commit):

  primitive_all_null/default                  192 µs    (was  8.8 ms, −97.8%)
  primitive_all_null/parquet_2                193 µs    (was  8.8 ms, −97.8%)
  primitive_all_null/zstd_parquet_2           250 µs    (was  8.9 ms, −97.2%)

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
Replace the ad hoc level and non-null index vectors with `LevelData` and
`ValueSelection`, so the writer can represent absent, uniform, dense, and
sparse cases directly instead of always materializing the worst-case
shape. This keeps the common paths cheap, removes the dedicated uniform
null fast path by folding it into the generic semantic writer, and
preserves the old all-null throughput by keeping page-sized chunking for
uniform batches.

Extends the compact Uniform/Dense representations (introduced for
all-null columns in the previous commit) to non-null columns, yielding
the same allocation, batching, and encoding benefits for the common
non-null case.

Benchmark results (vs previous commit):

  primitive_non_null/default               57.8 ms  (was 63.4 ms, −9%)
  primitive_non_null/parquet_2             78.0 ms  (was 85.1 ms, −8%)
  struct_non_null/default                  27.3 ms  (was 29.9 ms, −9%)
  struct_non_null/parquet_2                36.1 ms  (was 38.2 ms, −6%)
  struct_non_null/zstd_parquet_2           47.3 ms  (was 50.9 ms, −7%)

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
The literal `8` appeared in two distinct roles throughout `RleEncoder`,
`RleDecoder`, and their tests. Replacing each with a named constant
makes the intent explicit and prevents the two meanings from being
confused.

`BIT_PACK_GROUP_SIZE = 8`
  The Parquet RLE/bit-packing hybrid format always bit-packs values in
  multiples of this count (spec: "we always bit-pack a multiple of 8
  values at a time"). Every occurrence related to the staging buffer
  size, the repeat-count threshold that triggers the RLE decision, and
  the group-count arithmetic in bit-packed headers now uses this name.

`u8::BITS` (= 8, from std)
  Used wherever a bit-count is divided by 8 to obtain a byte-count
  (e.g. `ceil(bit_width, u8::BITS as usize)`). This is a bits-per-byte
  conversion, a fundamentally different concept from the packing-group
  size.
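A sketch of the two roles side by side (illustrative, using the constant name from the commit):

```rust
/// Parquet bit-packs values in groups of 8 (a format property).
const BIT_PACK_GROUP_SIZE: usize = 8;

/// Bytes needed for one bit-packed group of `bit_width`-bit values:
/// the group size is a packing concept, while u8::BITS is the
/// bits-per-byte conversion -- two different meanings of "8".
fn group_byte_len(bit_width: usize) -> usize {
    (BIT_PACK_GROUP_SIZE * bit_width).div_ceil(u8::BITS as usize)
}
```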

No behaviour change.

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
@HippoBaro HippoBaro force-pushed the faster_sparse_columns_encoding branch from 7902e69 to c891c35 Compare April 8, 2026 21:16
@HippoBaro
Contributor Author

HippoBaro commented Apr 8, 2026

Thanks for the reviews! I've reworked the branch to address all feedback. Sorry for the delay, it took me a while to experiment.

The main structural change is a LevelData enum refactor suggested by @jhorstmann. Thank you for the excellent suggestion. As I am primarily concerned with the performance of very sparse data, I hadn't considered the possibility of also speeding up the non-null-but-uniform code path.

The Option<Vec<i16>> + uniform_levels: Option<(i16, i16, usize)> tuple is replaced by a single enum:

  enum LevelData {
      Absent,
      Materialized(Vec<i16>),
      Uniform { value: i16, count: usize },
  }

Absent replaces the previous None case, Uniform captures any column whose levels are a single repeated value (all-null, or nullable with no nulls), and Materialized is the normal vec path. This unifies the three states into one type and makes transitions between them easy to follow. This yields a nice performance improvement documented in ab9a7bc.
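The transition logic described above can be sketched as follows (illustrative names, not the PR's exact API): appending a run keeps the compact Uniform form while values match, and materializes a Vec only when a different value arrives.

```rust
/// Compact level representation: no levels, a single repeated value,
/// or a fully materialized buffer.
enum LevelData {
    Absent,
    Uniform { value: i16, count: usize },
    Materialized(Vec<i16>),
}

impl LevelData {
    /// Append a run of `run` copies of `value`, staying compact when possible.
    fn extend_run(&mut self, value: i16, run: usize) {
        match self {
            LevelData::Absent => *self = LevelData::Uniform { value, count: run },
            LevelData::Uniform { value: v, count } if *v == value => *count += run,
            LevelData::Uniform { value: v, count } => {
                // Different value: materialize the existing uniform run first.
                let mut vec = vec![*v; *count];
                vec.extend(std::iter::repeat_n(value, run));
                *self = LevelData::Materialized(vec);
            }
            LevelData::Materialized(vec) => vec.extend(std::iter::repeat_n(value, run)),
        }
    }
}
```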

The resulting refactor has a larger LoC footprint, but the API is arguably much cleaner and more robust.

Also, rebased as per #9656 (review)

@etseidl
Contributor

etseidl commented Apr 8, 2026

Thanks @HippoBaro. I'll try to make some time to review the changes. Probably not today but hopefully tomorrow... 🤞

@HippoBaro HippoBaro changed the title feat(parquet): fuse level encoding passes and batch null runs in column writer feat(parquet): fuse level encoding passes and compact level representation Apr 8, 2026
@HippoBaro HippoBaro requested review from etseidl and jhorstmann April 8, 2026 22:29
@alamb
Contributor

alamb commented Apr 9, 2026

run benchmark arrow_writer

@adriangbot

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4213853435-1013-r8sfq 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing faster_sparse_columns_encoding (c891c35) to aac969d (merge-base) diff
BENCH_NAME=arrow_writer
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench arrow_writer
BENCH_FILTER=
Results will be posted here when complete


File an issue against this benchmark runner

@HippoBaro
Contributor Author

@alamb The above results will include only parts of the benchmarks this code improves on. The rest are in #9679

alamb pushed a commit that referenced this pull request Apr 9, 2026
# Which issue does this PR close?

- None, but relates to #9653

# Rationale for this change

#9653 introduces optimizations related to non-null uniform workloads.
This adds benchmarks so we can quantify them.

# What changes are included in this PR?

Add three new benchmark cases to the arrow_writer benchmark suite for
evaluating write performance on struct columns at varying null
densities:

* `struct_non_null`: a nullable struct with 0% null rows and
non-nullable primitive children;
* `struct_sparse_99pct_null`: a nullable struct with 99% null rows,
exercising null batching through one level of struct nesting;
* `struct_all_null`: a nullable struct with 100% null rows, exercising
the uniform-null path through struct nesting.

Baseline results (Apple M1 Max):
```
  struct_non_null/default              29.9 ms
  struct_non_null/parquet_2            38.2 ms
  struct_non_null/zstd_parquet_2       50.9 ms
  struct_sparse_99pct_null/default      7.2 ms
  struct_sparse_99pct_null/parquet_2    7.3 ms
  struct_sparse_99pct_null/zstd_p2      8.1 ms
  struct_all_null/default              83.3 µs
  struct_all_null/parquet_2            82.5 µs
  struct_all_null/zstd_parquet_2      106.6 µs
```

# Are these changes tested?

N/A

# Are there any user-facing changes?

None

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
@alamb
Contributor

alamb commented Apr 9, 2026

@alamb The above results will include only parts of the benchmarks this code improves on. The rest are in #9679

I merged it in and merged up from main and will rerun the benchmarks

@alamb
Contributor

alamb commented Apr 9, 2026

run benchmark arrow_writer

@adriangbot

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4214179774-1017-8wm74 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu): identical to the previous run.

Comparing faster_sparse_columns_encoding (6c73ac7) to adf9308 (merge-base) diff
BENCH_NAME=arrow_writer
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench arrow_writer
BENCH_FILTER=
Results will be posted here when complete


File an issue against this benchmark runner

@adriangbot

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu): identical to the previous run.

group                                              faster_sparse_columns_encoding         main
-----                                              ------------------------------         ----
bool/bloom_filter                                  1.00     13.6±0.04ms    18.4 MB/sec    1.02     13.8±0.05ms    18.1 MB/sec
bool/default                                       1.00     11.3±0.03ms    22.0 MB/sec    1.03     11.7±0.04ms    21.4 MB/sec
bool/parquet_2                                     1.00     14.5±0.04ms    17.3 MB/sec    1.01     14.7±0.05ms    17.0 MB/sec
bool/zstd                                          1.00     11.9±0.04ms    21.0 MB/sec    1.03     12.2±0.05ms    20.4 MB/sec
bool/zstd_parquet_2                                1.00     14.8±0.03ms    16.9 MB/sec    1.02     15.1±0.05ms    16.6 MB/sec
bool_non_null/bloom_filter                         1.02      7.1±0.02ms    17.6 MB/sec    1.00      7.0±0.03ms    17.9 MB/sec
bool_non_null/default                              1.00      4.2±0.02ms    30.0 MB/sec    1.01      4.2±0.02ms    29.8 MB/sec
bool_non_null/parquet_2                            1.02      8.2±0.03ms    15.3 MB/sec    1.00      8.0±0.02ms    15.6 MB/sec
bool_non_null/zstd                                 1.00      4.5±0.15ms    27.6 MB/sec    1.01      4.6±0.02ms    27.4 MB/sec
bool_non_null/zstd_parquet_2                       1.01      8.6±0.03ms    14.6 MB/sec    1.00      8.5±0.02ms    14.8 MB/sec
float_with_nans/bloom_filter                       1.00     91.3±0.41ms   153.3 MB/sec    1.03     94.5±0.25ms   148.2 MB/sec
float_with_nans/default                            1.00     72.8±0.22ms   192.4 MB/sec    1.04     75.4±0.39ms   185.8 MB/sec
float_with_nans/parquet_2                          1.00     94.5±0.24ms   148.2 MB/sec    1.03     97.4±0.26ms   143.8 MB/sec
float_with_nans/zstd                               1.00    110.5±0.18ms   126.7 MB/sec    1.02    113.0±0.17ms   123.9 MB/sec
float_with_nans/zstd_parquet_2                     1.00    131.9±0.20ms   106.2 MB/sec    1.02    134.9±0.24ms   103.8 MB/sec
list_primitive/bloom_filter                        1.06    386.7±3.83ms  1410.3 MB/sec    1.00    364.1±4.62ms  1497.7 MB/sec
list_primitive/default                             1.06    307.1±3.41ms  1776.1 MB/sec    1.00    288.7±2.08ms  1889.2 MB/sec
list_primitive/parquet_2                           1.14    323.5±1.33ms  1685.9 MB/sec    1.00   282.9±10.11ms  1928.0 MB/sec
list_primitive/zstd                                1.05    545.2±4.05ms  1000.3 MB/sec    1.00   518.7±10.83ms  1051.4 MB/sec
list_primitive/zstd_parquet_2                      1.03    523.3±1.78ms  1042.1 MB/sec    1.00    509.5±0.86ms  1070.3 MB/sec
list_primitive_non_null/bloom_filter               1.00   439.1±18.83ms  1239.4 MB/sec    1.04   458.2±14.37ms  1187.8 MB/sec
list_primitive_non_null/default                    1.00   312.0±12.54ms  1744.6 MB/sec    1.06   329.8±20.07ms  1650.3 MB/sec
list_primitive_non_null/parquet_2                  1.00   321.3±23.12ms  1694.0 MB/sec    1.15    368.4±0.92ms  1477.2 MB/sec
list_primitive_non_null/zstd                       1.01   734.4±16.25ms   741.1 MB/sec    1.00   728.5±18.97ms   747.1 MB/sec
list_primitive_non_null/zstd_parquet_2             1.05    719.4±0.75ms   756.5 MB/sec    1.00    683.5±1.62ms   796.3 MB/sec
list_primitive_sparse_99pct_null/bloom_filter      1.00     12.3±0.06ms     3.0 GB/sec    3.53     43.4±0.53ms   860.8 MB/sec
list_primitive_sparse_99pct_null/default           1.00     12.0±0.06ms     3.0 GB/sec    3.59     43.0±0.57ms   869.1 MB/sec
list_primitive_sparse_99pct_null/parquet_2         1.00     11.9±0.06ms     3.1 GB/sec    3.61     43.1±0.54ms   867.1 MB/sec
list_primitive_sparse_99pct_null/zstd              1.00     13.8±0.08ms     2.6 GB/sec    3.25     44.9±0.56ms   831.5 MB/sec
list_primitive_sparse_99pct_null/zstd_parquet_2    1.00     12.0±0.04ms     3.0 GB/sec    3.59     43.2±0.56ms   865.4 MB/sec
primitive/bloom_filter                             1.00    151.7±0.50ms   295.9 MB/sec    1.03    156.7±0.40ms   286.4 MB/sec
primitive/default                                  1.00    120.3±0.29ms   373.1 MB/sec    1.04    124.7±0.28ms   359.7 MB/sec
primitive/parquet_2                                1.00    136.1±0.27ms   329.7 MB/sec    1.03    139.5±0.26ms   321.6 MB/sec
primitive/zstd                                     1.00    150.1±0.34ms   299.0 MB/sec    1.03    154.3±0.28ms   290.8 MB/sec
primitive/zstd_parquet_2                           1.00    169.1±0.50ms   265.4 MB/sec    1.02    173.2±0.25ms   259.0 MB/sec
primitive_all_null/bloom_filter                    1.00    893.4±2.99µs    49.1 GB/sec    43.78    39.1±0.02ms  1147.3 MB/sec
primitive_all_null/default                         1.00    274.9±2.10µs   159.4 GB/sec    139.39    38.3±0.02ms  1170.9 MB/sec
primitive_all_null/parquet_2                       1.00    275.6±1.50µs   159.0 GB/sec    139.02    38.3±0.03ms  1171.2 MB/sec
primitive_all_null/zstd                            1.00    385.3±1.79µs   113.8 GB/sec    99.82    38.5±0.03ms  1166.9 MB/sec
primitive_all_null/zstd_parquet_2                  1.00    350.5±1.40µs   125.0 GB/sec    109.54    38.4±0.03ms  1169.0 MB/sec
primitive_non_null/bloom_filter                    1.00    100.3±0.51ms   438.8 MB/sec    1.08    108.5±0.60ms   405.5 MB/sec
primitive_non_null/default                         1.00     61.4±0.16ms   716.9 MB/sec    1.13     69.7±0.28ms   631.7 MB/sec
primitive_non_null/parquet_2                       1.00     83.6±0.84ms   526.5 MB/sec    1.09     91.3±0.21ms   482.0 MB/sec
primitive_non_null/zstd                            1.00     92.2±0.18ms   477.1 MB/sec    1.14    105.5±2.36ms   416.9 MB/sec
primitive_non_null/zstd_parquet_2                  1.00    117.2±0.17ms   375.4 MB/sec    1.13    131.9±1.58ms   333.6 MB/sec
primitive_sparse_99pct_null/bloom_filter           1.00     18.6±0.11ms     2.4 GB/sec    2.50     46.7±0.15ms   961.9 MB/sec
primitive_sparse_99pct_null/default                1.00     17.5±0.11ms     2.5 GB/sec    2.56     44.7±0.07ms  1003.6 MB/sec
primitive_sparse_99pct_null/parquet_2              1.00     17.2±0.04ms     2.6 GB/sec    2.61     44.8±0.15ms  1001.2 MB/sec
primitive_sparse_99pct_null/zstd                   1.00     20.5±0.07ms     2.1 GB/sec    2.34     47.8±0.09ms   938.5 MB/sec
primitive_sparse_99pct_null/zstd_parquet_2         1.00     19.3±0.06ms     2.3 GB/sec    2.40     46.4±0.13ms   967.1 MB/sec
string/bloom_filter                                1.04   229.5±21.60ms     2.2 GB/sec    1.00   220.2±17.52ms     2.3 GB/sec
string/default                                     1.18   141.1±20.33ms     3.6 GB/sec    1.00    119.7±4.94ms     4.3 GB/sec
string/parquet_2                                   1.63    183.0±1.16ms     2.8 GB/sec    1.00    112.1±6.29ms     4.6 GB/sec
string/zstd                                        1.09   466.5±18.44ms  1123.7 MB/sec    1.00    428.2±3.02ms  1224.2 MB/sec
string/zstd_parquet_2                              1.00    397.8±1.05ms  1317.7 MB/sec    1.00    396.8±1.24ms  1321.2 MB/sec
string_and_binary_view/bloom_filter                1.00     66.7±0.33ms   483.4 MB/sec    1.01     67.4±0.16ms   478.5 MB/sec
string_and_binary_view/default                     1.00     49.3±0.19ms   653.8 MB/sec    1.02     50.5±0.26ms   638.8 MB/sec
string_and_binary_view/parquet_2                   1.00     60.5±0.18ms   533.3 MB/sec    1.01     61.0±0.09ms   528.6 MB/sec
string_and_binary_view/zstd                        1.00     85.9±0.23ms   375.3 MB/sec    1.01     86.9±0.14ms   371.2 MB/sec
string_and_binary_view/zstd_parquet_2              1.00     74.4±0.12ms   433.3 MB/sec    1.01     74.9±0.09ms   430.7 MB/sec
string_dictionary/bloom_filter                     1.05     95.3±1.19ms     2.7 GB/sec    1.00     91.1±1.10ms     2.8 GB/sec
string_dictionary/default                          1.02     50.3±0.86ms     5.1 GB/sec    1.00     49.6±1.41ms     5.2 GB/sec
string_dictionary/parquet_2                        1.55     85.9±0.30ms     3.0 GB/sec    1.00     55.4±0.18ms     4.7 GB/sec
string_dictionary/zstd                             1.21    254.8±1.16ms  1036.6 MB/sec    1.00    211.4±1.58ms  1249.4 MB/sec
string_dictionary/zstd_parquet_2                   1.17    233.6±0.69ms  1130.5 MB/sec    1.00    199.5±0.32ms  1324.0 MB/sec
string_non_null/bloom_filter                       1.05   266.9±16.35ms  1963.1 MB/sec    1.00   253.8±12.45ms     2.0 GB/sec
string_non_null/default                            1.04   144.9±19.34ms     3.5 GB/sec    1.00   138.6±12.16ms     3.7 GB/sec
string_non_null/parquet_2                          1.00    140.8±7.72ms     3.6 GB/sec    1.01    142.5±2.88ms     3.6 GB/sec
string_non_null/zstd                               1.00   574.2±11.63ms   912.5 MB/sec    1.00   575.6±20.73ms   910.3 MB/sec
string_non_null/zstd_parquet_2                     1.00    502.5±1.97ms  1042.8 MB/sec    1.03    519.5±5.83ms  1008.7 MB/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 1544.1s
Peak memory 4.6 GiB
Avg memory 4.4 GiB
CPU user 1467.7s
CPU sys 76.2s
Peak spill 0 B

branch

Metric Value
Wall time 1526.2s
Peak memory 4.6 GiB
Avg memory 4.4 GiB
CPU user 1425.0s
CPU sys 101.0s
Peak spill 0 B

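For context, the fused single-pass level encoding described in the PR summary — one scan that detects each RLE run, then updates the non-null count and level histogram once per run — might look roughly like this. This is a simplified sketch; `fused_pass` is a hypothetical stand-in for the observer callback on `LevelEncoder`, not the actual writer code:

```rust
// Hypothetical sketch of the fused pass from the PR summary: scan each run
// of identical definition levels once, then do the counting and histogram
// bookkeeping in O(1) per run instead of O(1) per value.
fn fused_pass(levels: &[i16], max_def: i16) -> (usize, Vec<u64>) {
    let mut non_null = 0usize;
    let mut histogram = vec![0u64; (max_def + 1) as usize];
    let mut i = 0;
    while i < levels.len() {
        let value = levels[i];
        // Scan ahead for the full run of identical values.
        let mut j = i + 1;
        while j < levels.len() && levels[j] == value {
            j += 1;
        }
        let run = (j - i) as u64;
        histogram[value as usize] += run; // one histogram update per run
        if value == max_def {
            non_null += run as usize; // one non-null count update per run
        }
        i = j;
    }
    (non_null, histogram)
}
```

For sparse data the number of runs is far smaller than the number of values, which is why the `*_sparse_99pct_null` rows above improve the most.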

@HippoBaro
Contributor Author

I am surprised by a few of the regressions above, such as:

string_dictionary/parquet_2                        1.55     85.9±0.30ms     3.0 GB/sec    1.00     55.4±0.18ms     4.7 GB/sec

I can't reproduce these locally. I get:

string_dictionary/parquet_2
                        time:   [53.024 ms 53.574 ms 54.565 ms]
                        thrpt:  [4.7271 GiB/s 4.8146 GiB/s 4.8646 GiB/s]
                 change:
                        time:   [−3.0644% −1.9407% −0.1309%] (p = 0.01 < 0.05)
                        thrpt:  [+0.1311% +1.9791% +3.1613%]
                        Change within noise threshold.

Are these known to be noisy?

@etseidl
Contributor

etseidl commented Apr 9, 2026

> I am surprised by a few of the regressions above, such as:
>
> string_dictionary/parquet_2                        1.55     85.9±0.30ms     3.0 GB/sec    1.00     55.4±0.18ms     4.7 GB/sec
>
> I can't reproduce these locally. I get:
>
> Are these known to be noisy?

Yes. They are extremely twitchy. I always take them with a grain of salt or ten. 😅

@etseidl
Contributor

etseidl commented Apr 9, 2026

I've now run multiple passes of the arrow_writer bench on my workstation and there appear to be no regressions due to this PR. And the speed ups are quite impressive 😄

Details
group                                              levels                                 main
-----                                              ------                                 ----
bool/bloom_filter                                  1.00     12.9±0.09ms    19.4 MB/sec    1.00     12.8±0.12ms    19.5 MB/sec
bool/default                                       1.00      8.5±0.06ms    29.3 MB/sec    1.00      8.5±0.09ms    29.3 MB/sec
bool/parquet_2                                     1.01     11.2±0.18ms    22.2 MB/sec    1.00     11.1±0.17ms    22.5 MB/sec
bool/zstd                                          1.00      9.0±0.10ms    27.8 MB/sec    1.00      9.0±0.10ms    27.9 MB/sec
bool/zstd_parquet_2                                1.01     11.5±0.08ms    21.7 MB/sec    1.00     11.4±0.10ms    21.9 MB/sec
bool_non_null/bloom_filter                         1.02      8.6±0.04ms    14.6 MB/sec    1.00      8.4±0.03ms    14.8 MB/sec
bool_non_null/default                              1.05      2.9±0.01ms    42.4 MB/sec    1.00      2.8±0.04ms    44.4 MB/sec
bool_non_null/parquet_2                            1.02      6.2±0.04ms    20.1 MB/sec    1.00      6.1±0.03ms    20.6 MB/sec
bool_non_null/zstd                                 1.05      3.3±0.04ms    38.2 MB/sec    1.00      3.1±0.06ms    40.1 MB/sec
bool_non_null/zstd_parquet_2                       1.02      6.5±0.06ms    19.1 MB/sec    1.00      6.4±0.04ms    19.5 MB/sec
float_with_nans/bloom_filter                       1.00     81.2±0.69ms   172.4 MB/sec    1.08     87.7±0.42ms   159.7 MB/sec
float_with_nans/default                            1.00     58.0±0.86ms   241.4 MB/sec    1.08     62.8±0.28ms   222.9 MB/sec
float_with_nans/parquet_2                          1.00     71.6±1.10ms   195.6 MB/sec    1.07     76.9±0.49ms   182.2 MB/sec
float_with_nans/zstd                               1.00     88.6±0.36ms   158.0 MB/sec    1.07     94.6±0.36ms   148.0 MB/sec
float_with_nans/zstd_parquet_2                     1.00    101.4±0.80ms   138.1 MB/sec    1.06    107.9±0.96ms   129.7 MB/sec
list_primitive/bloom_filter                        1.06    319.5±1.83ms  1707.2 MB/sec    1.00    302.6±2.73ms  1802.0 MB/sec
list_primitive/default                             1.07    260.7±1.76ms     2.0 GB/sec    1.00    242.8±1.50ms     2.2 GB/sec
list_primitive/parquet_2                           1.00    257.0±1.68ms     2.1 GB/sec    1.00    257.5±3.19ms     2.1 GB/sec
list_primitive/zstd                                1.01    390.4±2.65ms  1397.1 MB/sec    1.00    388.3±3.31ms  1404.6 MB/sec
list_primitive/zstd_parquet_2                      1.03    387.2±2.82ms  1408.4 MB/sec    1.00    374.4±4.46ms  1456.7 MB/sec
list_primitive_non_null/bloom_filter               1.00    354.2±6.61ms  1536.5 MB/sec    1.02    360.1±4.36ms  1511.5 MB/sec
list_primitive_non_null/default                    1.00    262.5±7.11ms     2.0 GB/sec    1.01    265.3±5.08ms     2.0 GB/sec
list_primitive_non_null/parquet_2                  1.00    264.3±4.69ms     2.0 GB/sec    1.07    283.5±7.82ms  1919.6 MB/sec
list_primitive_non_null/zstd                       1.01   527.5±10.36ms  1031.7 MB/sec    1.00   520.9±19.26ms  1044.7 MB/sec
list_primitive_non_null/zstd_parquet_2             1.00    510.5±7.07ms  1066.1 MB/sec    1.00   509.9±13.27ms  1067.4 MB/sec
list_primitive_sparse_99pct_null/bloom_filter      1.00      9.2±0.06ms     4.0 GB/sec    3.15     29.0±0.24ms  1288.8 MB/sec
list_primitive_sparse_99pct_null/default           1.00      8.7±0.08ms     4.2 GB/sec    3.30     28.6±0.64ms  1304.7 MB/sec
list_primitive_sparse_99pct_null/parquet_2         1.00      8.7±0.07ms     4.2 GB/sec    3.28     28.5±0.40ms  1310.8 MB/sec
list_primitive_sparse_99pct_null/zstd              1.00     10.3±0.10ms     3.5 GB/sec    2.91     29.9±0.21ms  1248.5 MB/sec
list_primitive_sparse_99pct_null/zstd_parquet_2    1.00      8.8±0.10ms     4.1 GB/sec    3.22     28.4±0.25ms  1315.2 MB/sec
primitive/bloom_filter                             1.00    128.9±0.80ms   348.2 MB/sec    1.02    132.0±1.05ms   339.9 MB/sec
primitive/default                                  1.00     84.9±1.59ms   528.8 MB/sec    1.02     86.7±0.67ms   517.5 MB/sec
primitive/parquet_2                                1.00     94.6±1.36ms   474.4 MB/sec    1.02     96.9±0.76ms   463.2 MB/sec
primitive/zstd                                     1.00    104.0±0.78ms   431.6 MB/sec    1.03    107.2±1.27ms   418.5 MB/sec
primitive/zstd_parquet_2                           1.00    117.0±1.62ms   383.4 MB/sec    1.03    120.0±0.74ms   373.9 MB/sec
primitive_all_null/bloom_filter                    1.00   1058.5±6.49µs    41.4 GB/sec    18.25    19.3±0.10ms     2.3 GB/sec
primitive_all_null/default                         1.00    198.3±1.38µs   221.0 GB/sec    92.92    18.4±0.06ms     2.4 GB/sec
primitive_all_null/parquet_2                       1.00    200.9±1.97µs   218.2 GB/sec    91.94    18.5±0.09ms     2.4 GB/sec
primitive_all_null/zstd                            1.00    341.9±1.60µs   128.2 GB/sec    54.27    18.6±0.07ms     2.4 GB/sec
primitive_all_null/zstd_parquet_2                  1.00    317.2±1.37µs   138.2 GB/sec    58.48    18.5±0.08ms     2.4 GB/sec  
primitive_non_null/bloom_filter                    1.00     94.8±1.16ms   464.1 MB/sec    1.10    103.9±0.44ms   423.5 MB/sec
primitive_non_null/default                         1.00     38.5±0.22ms  1141.6 MB/sec    1.16     44.8±0.22ms   982.8 MB/sec
primitive_non_null/parquet_2                       1.00     52.7±0.51ms   834.4 MB/sec    1.13     59.4±1.01ms   740.3 MB/sec
primitive_non_null/zstd                            1.00     59.2±0.37ms   743.6 MB/sec    1.13     66.8±0.62ms   658.6 MB/sec
primitive_non_null/zstd_parquet_2                  1.00     76.0±0.98ms   579.1 MB/sec    1.11     84.1±1.49ms   523.2 MB/sec
primitive_sparse_99pct_null/bloom_filter           1.00     12.9±0.27ms     3.4 GB/sec    2.23     28.8±0.70ms  1557.2 MB/sec
primitive_sparse_99pct_null/default                1.00     11.3±1.85ms     3.9 GB/sec    2.35     26.6±0.32ms  1686.3 MB/sec
primitive_sparse_99pct_null/parquet_2              1.00     11.6±1.71ms     3.8 GB/sec    2.30     26.8±0.28ms  1672.7 MB/sec
primitive_sparse_99pct_null/zstd                   1.00     13.8±0.14ms     3.2 GB/sec    2.13     29.4±0.29ms  1528.3 MB/sec
primitive_sparse_99pct_null/zstd_parquet_2         1.00     12.4±0.06ms     3.5 GB/sec    2.27     28.1±0.28ms  1595.2 MB/sec
string/bloom_filter                                1.00   169.3±11.30ms     3.0 GB/sec    1.05   178.1±13.47ms     2.9 GB/sec
string/default                                     1.05   121.8±12.92ms     4.2 GB/sec    1.00    116.3±3.32ms     4.4 GB/sec
string/parquet_2                                   1.03    120.8±6.66ms     4.2 GB/sec    1.00    117.6±1.10ms     4.4 GB/sec
string/zstd                                        1.00    308.2±4.22ms  1701.1 MB/sec    1.03   317.4±13.62ms  1651.6 MB/sec
string/zstd_parquet_2                              1.01    287.9±2.18ms  1821.2 MB/sec    1.00    284.1±1.61ms  1845.6 MB/sec
string_and_binary_view/bloom_filter                1.00     48.8±0.29ms   661.4 MB/sec    1.01     49.3±0.35ms   654.5 MB/sec
string_and_binary_view/default                     1.00     34.6±0.27ms   932.2 MB/sec    1.00     34.5±0.32ms   934.9 MB/sec
string_and_binary_view/parquet_2                   1.01     43.9±0.28ms   734.1 MB/sec    1.00     43.7±0.31ms   738.4 MB/sec
string_and_binary_view/zstd                        1.00     61.1±0.34ms   528.0 MB/sec    1.00     61.3±1.04ms   526.1 MB/sec
string_and_binary_view/zstd_parquet_2              1.00     53.6±0.63ms   601.6 MB/sec    1.00     53.6±0.58ms   602.2 MB/sec
string_dictionary/bloom_filter                     1.00     76.4±0.63ms     3.4 GB/sec    1.42    108.4±0.44ms     2.4 GB/sec
string_dictionary/default                          1.00     51.7±0.24ms     5.0 GB/sec    1.58     81.8±0.34ms     3.2 GB/sec
string_dictionary/parquet_2                        1.00     55.5±0.66ms     4.6 GB/sec    1.50     83.5±0.55ms     3.1 GB/sec
string_dictionary/zstd                             1.00    150.0±1.17ms  1760.5 MB/sec    1.08    162.3±7.72ms  1627.8 MB/sec
string_dictionary/zstd_parquet_2                   1.00    142.7±0.88ms  1850.4 MB/sec    1.00    142.7±1.09ms  1850.5 MB/sec
string_non_null/bloom_filter                       1.00    191.4±1.91ms     2.7 GB/sec    1.09    208.4±8.39ms     2.5 GB/sec
string_non_null/default                            1.00    126.2±1.83ms     4.1 GB/sec    1.13    142.0±7.93ms     3.6 GB/sec
string_non_null/parquet_2                          1.00    137.1±2.30ms     3.7 GB/sec    1.00    137.7±1.85ms     3.7 GB/sec
string_non_null/zstd                               1.00    378.5±1.99ms  1384.4 MB/sec    1.06    400.3±7.49ms  1309.0 MB/sec
string_non_null/zstd_parquet_2                     1.00    359.4±2.26ms  1458.0 MB/sec    1.04    372.0±7.03ms  1408.5 MB/sec
struct_all_null/bloom_filter                       1.00    452.8±3.14µs    34.8 GB/sec    17.39     7.9±0.04ms  2047.7 MB/sec
struct_all_null/default                            1.00     85.5±0.63µs   184.1 GB/sec    87.80     7.5±0.04ms     2.1 GB/sec
struct_all_null/parquet_2                          1.00     86.5±1.38µs   182.0 GB/sec    86.71     7.5±0.03ms     2.1 GB/sec
struct_all_null/zstd                               1.00    146.8±1.12µs   107.3 GB/sec    51.77     7.6±0.09ms     2.1 GB/sec
struct_all_null/zstd_parquet_2                     1.00    136.4±1.14µs   115.4 GB/sec    55.50     7.6±0.06ms     2.1 GB/sec
struct_non_null/bloom_filter                       1.00     41.0±0.59ms   390.6 MB/sec    1.29     53.0±0.27ms   301.8 MB/sec
struct_non_null/default                            1.00     17.7±0.12ms   901.8 MB/sec    1.59     28.2±0.16ms   567.4 MB/sec
struct_non_null/parquet_2                          1.00     23.3±0.13ms   686.6 MB/sec    1.46     34.1±0.20ms   469.3 MB/sec
struct_non_null/zstd                               1.00     24.3±0.13ms   658.0 MB/sec    1.44     35.1±0.22ms   455.8 MB/sec
struct_non_null/zstd_parquet_2                     1.00     33.6±0.19ms   476.6 MB/sec    1.31     44.1±0.46ms   363.0 MB/sec
struct_sparse_99pct_null/bloom_filter              1.00      5.9±0.04ms     2.7 GB/sec    2.11     12.4±0.15ms  1303.5 MB/sec
struct_sparse_99pct_null/default                   1.00      5.0±0.04ms     3.2 GB/sec    2.32     11.6±0.11ms  1393.1 MB/sec
struct_sparse_99pct_null/parquet_2                 1.00      5.0±0.03ms     3.2 GB/sec    2.32     11.6±0.13ms  1393.7 MB/sec
struct_sparse_99pct_null/zstd                      1.00      6.2±0.04ms     2.6 GB/sec    2.07     12.8±0.19ms  1264.3 MB/sec
struct_sparse_99pct_null/zstd_parquet_2            1.00      5.6±0.03ms     2.8 GB/sec    2.16     12.1±0.13ms  1330.1 MB/sec
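The `primitive_all_null` and `struct_all_null` rows above exercise the short-circuit path, which records a compact `(def_value, rep_value, count)` tuple and emits a single RLE run instead of materializing per-value level buffers. A minimal sketch of why one run is O(1) to encode, assuming a level bit-width of at most 8 so the repeated value fits in one byte (`encode_rle_run` is illustrative only, not the crate's `RleEncoder::put_n`):

```rust
// Hypothetical sketch of emitting one Parquet RLE run. In the hybrid
// RLE/bit-packed encoding, an RLE run is a ULEB128 varint header of
// (run_length << 1) — the low bit 0 marks an RLE run — followed by the
// repeated value, so n identical levels cost O(1) work and a few bytes.
fn encode_rle_run(value: u8, count: u64, out: &mut Vec<u8>) {
    // ULEB128-encode the run header.
    let mut header = count << 1;
    loop {
        let byte = (header & 0x7f) as u8;
        header >>= 7;
        if header == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80);
    }
    // The repeated level value, one byte for bit-width <= 8.
    out.push(value);
}
```

An all-null column of a million rows therefore encodes to a handful of bytes, which matches the microsecond-scale `primitive_all_null` timings above.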

@etseidl
Contributor

etseidl commented Apr 9, 2026

@kszucs do you have time to look at this PR? It touches on your CDC code.


Labels

parquet Changes to the parquet crate

Development

Successfully merging this pull request may close these issues.

Parquet: level encoding cost should be proportional to RLE output size

5 participants