Skip to content

Speed up packed varint decoding by validating the region up front#153

Open
kkaefer wants to merge 1 commit into
masterfrom
kk/optimize-varint-iterators
Open

Speed up packed varint decoding by validating the region up front#153
kkaefer wants to merge 1 commit into
masterfrom
kk/optimize-varint-iterators

Conversation

@kkaefer

@kkaefer kkaefer commented Jun 30, 2026

Copy link
Copy Markdown
Member

Decoding packed varint fields is now 30-150% faster, depending on varint length. The speedup comes from the new decode_varint_unchecked() / skip_varint_unchecked() helpers, which decode a varint without checking against an end pointer on every byte. A side effect is that our iterator size is now a single pointer, aligning it with the fixed iterators.

This is safe because a packed varint region is valid only if its last byte has the continuation bit clear, i.e. it ends exactly on a varint boundary. That guarantees a decode starting anywhere inside the region always terminates at or before the final byte, so no end pointer is needed. The packed varint iterators (const_varint_iterator, const_svarint_iterator) check this once in their constructor, and as a result no longer need to store the end pointer at all: they now hold a single pointer instead of two. That halves their size, but on its own it doesn't measurably change decoding speed (overlong varints are still detected via the byte-count limit).

Behavior change: a packed varint field whose trailing varint is truncated now throws end_of_buffer_exception when the iterator range is created (in get_packed_*()) instead of later during iteration. Arguably this doesn't matter because a truncated varint in a packed field is illegal anyway. The iterators' (data, end) constructor is also no longer noexcept.

Decoding packed varint fields is now 30-150% faster, depending on varint length. The speedup comes from the new `decode_varint_unchecked()` / `skip_varint_unchecked()` helpers, which decode a varint without checking against an end pointer on every byte. A side effect is that our iterator size is now a single pointer, aligning it with the fixed iterators.

This is safe because a packed varint region is valid only if its last byte has the continuation bit clear, i.e. it ends exactly on a varint boundary. That guarantees a decode starting anywhere inside the region always terminates at or before the final byte, so no end pointer is needed. The packed varint iterators (`const_varint_iterator`, `const_svarint_iterator`) check this once in their constructor, and as a result no longer need to store the end pointer at all: they now hold a single pointer instead of two. That halves their size, but on its own it doesn't measurably change decoding speed (overlong varints are still detected via the byte-count limit).

Behavior change: a packed varint field whose trailing varint is truncated now throws `end_of_buffer_exception` when the iterator range is created (in `get_packed_*()`) instead of later during iteration. Arguably this doesn't matter because a truncated varint in a packed field is illegal anyway. The iterators' `(data, end)` constructor is also no longer `noexcept`.
@kkaefer kkaefer requested a review from a team as a code owner June 30, 2026 20:06
@kkaefer kkaefer requested a review from joto June 30, 2026 20:07
@kkaefer

kkaefer commented Jun 30, 2026

Copy link
Copy Markdown
Member Author

Here are more benchmark results:

Magnitude Elements master (ns) This PR (ns) Speedup
bytes1 10 13 8 1.65x
bytes1 100 131 70 1.87x
bytes1 1000 1258 634 1.98x
bytes1 10000 12544 6280 2.00x
bytes1 100000 125293 62685 2.00x
bytes2 10 23 8 2.74x
bytes2 100 168 85 1.98x
bytes2 1000 1577 789 2.00x
bytes2 10000 15659 7827 2.00x
bytes2 100000 157517 78690 2.00x
bytes3 10 35 12 3.06x
bytes3 100 270 115 2.34x
bytes3 1000 2594 1101 2.36x
bytes3 10000 25869 10934 2.37x
bytes3 100000 257822 109451 2.36x
bytes5 10 31 18 1.69x
bytes5 100 298 178 1.67x
bytes5 1000 3139 1730 1.81x
bytes5 10000 28505 17245 1.65x
bytes5 100000 306460 172060 1.78x
bytes10 10 45 37 1.21x
bytes10 100 456 350 1.30x
bytes10 1000 4457 3751 1.19x
bytes10 10000 44513 34380 1.29x
bytes10 100000 446080 343361 1.30x
mixed 10 44 33 1.32x
mixed 100 452 290 1.56x
mixed 1000 4057 2937 1.38x
mixed 10000 83866 42600 1.97x
mixed 100000 1091575 740620 1.47x
mixed_byte_length 10 42 29 1.48x
mixed_byte_length 100 409 303 1.35x
mixed_byte_length 1000 4161 3086 1.35x
mixed_byte_length 10000 84034 48212 1.74x
mixed_byte_length 100000 1270616 741718 1.71x

Legend:

  • bytesX: Varints that have exactly this many bytes.
  • mixed: Pick a random bit width in [0, 63], then a value of that width.
  • mixed_byte_length: Pick a byte length uniformly in [1, 10], then a value of that length.

The speedup varies based on the characteristics of the packed varint buffer, but we can see a clear performance increase in all cases, even if we only have a small handful of varints in that buffer, and also if those varints are really short (1-2 bytes).

@kkaefer

kkaefer commented Jun 30, 2026

Copy link
Copy Markdown
Member Author

I guess one side effect of this change is that if you accidentally call operator++ on an iterator that is at/beyond the end of the buffer it keeps incrementing, whereas previously, we would have thrown an execpetion. However, STL iterators do this as well.

// The decoders used by this iterator run without an explicit end
// pointer. This is only safe if the region ends exactly on a varint
// boundary, i.e. the last byte does not have its continuation bit set.
if (data != end && (static_cast<unsigned char>(*(end - 1)) & 0x80U) != 0) {

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we do data < end here?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the data != end makes it clearer that this check is there to cover the case where the constructor is used to construct the "end iterator". With data < end it seems that we want to defend against the case where data > end which we don't.

But there is another change we could make here: Add a different constructor with a single const char* end parameter to construct the "end iterator". This could save one "if" check for the end iterator case (but we'd still need that check in this constructor to be backwards compatible).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a different constructor with a single const char* end parameter to construct the "end iterator"

I considered this as well, but decided to not do this to avoid accidental creation of an iterator without doing the last-byte check. With the current shape, it's pretty hard to misuse, and I don't think end iterator creation is a bottleneck.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense.

@joto

joto commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator

This looks pretty good. The only thing I am concerned about is that we don't have tests for the new decode_varint_unchecked() and skip_varint_unchecked functions. They are mostly like the original functions, but different enough that they should get tests in their own right. Maybe the tests can be rewritten so that they can be applied to both the original and the _unchecked() versions?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants