Commit a857204

Merge pull request #510 from aalexand/fast-87-etc

abseil.io/fast: Publish episodes 87, 88, 90, 93, plus a few updates.

2 parents: 473053b + 9f83742
19 files changed: +929 −43 lines

_posts/2023-03-02-fast-21.md

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #21 on January 16, 2020
 
 *By [Paul Wankadia](mailto:junyer@google.com) and [Darryl Gove](mailto:djgove@google.com)*
 
-Updated 2023-03-02
+Updated 2024-10-21
 
 Quicklink: [abseil.io/fast/21](https://abseil.io/fast/21)
 

_posts/2023-03-02-fast-39.md

Lines changed: 12 additions & 4 deletions
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #39 on January 22, 2021
 
 *By [Chris Kennelly](mailto:ckennelly@google.com) and [Alkis Evlogimenos](mailto:alkis@evlogimenos.com)*
 
-Updated 2023-10-10
+Updated 2025-03-24
 
 Quicklink: [abseil.io/fast/39](https://abseil.io/fast/39)
 
@@ -146,14 +146,14 @@ would "reduce" the
 [data center tax](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/44271.pdf),
 but we would actually hurt [application productivity](/fast/7)-per-CPU. Time we
 spend in malloc is
-[less important than application performance](https://research.google/pubs/pub50370.pdf).
+[less important than application performance](https://storage.googleapis.com/gweb-research2023-media/pubtools/6170.pdf).
 
 Trace-driven simulations with hardware-validated architectural simulators showed
 the prefetched data was frequently used. Additionally, it is better to stall on
 a TLB miss at the prefetch site--which has no dependencies, than to stall at the
 point of use.
 
-## Pitfalls
+## Pitfalls {#pitfalls}
 
 There are a number of things that commonly go wrong when writing benchmarks. The
 following is a non-exhaustive list:
@@ -175,15 +175,23 @@ following is a non-exhaustive list:
   [Stabilizer (by Berger, et. al.)](https://people.cs.umass.edu/~emery/pubs/stabilizer-asplos13.pdf)
   deliberately perturb these parameters to improve benchmarking statistical
   quality.
+* Sensitivity to stack alignment. Changes anywhere in the stack--added/removed
+  variables, better (or worse) spilling due to compiler optimizations,
+  etc.--can affect the alignment at the start of the function-under-test. This
+  has been seen to produce 20% performance swings.
 * Representative data. The data in the benchmark needs to be "similar" to the
   data in production - for example, imagine having short strings in the
   benchmark, and long strings in the fleet. This also extends to the code
   paths in the benchmarks being similar to the code paths that the application
-  exercises.
+  exercises. This is a common pain point for macrobenchmarks too. A loadtest
+  may cover certain request types, rather than all of those seen by production
+  servers.
+
 * Benchmarking the right code. It's very easy to introduce code into the
   benchmark that's not present in the real workload. For example, using a
   random number generator's cost for a benchmark could exceed the cost of the
   work being benchmarked.
+
 * Being aware of steady state vs dynamic behaviour. For more complex
   benchmarks it's easy to produce something that converges to a steady state -
   for example if it has a constant arrival rate and service time. Production

_posts/2023-03-02-fast-53.md

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #53 on October 14, 2021
 
 *By [Mircea Trofin](mailto:mtrofin@google.com)*
 
-Updated 2023-09-04
+Updated 2024-11-19
 
 Quicklink: [abseil.io/fast/53](https://abseil.io/fast/53)
 

_posts/2023-03-02-fast-9.md

Lines changed: 2 additions & 2 deletions
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #9 on June 24, 2019
 
 *By [Chris Kennelly](mailto:ckennelly@google.com)*
 
-Updated 2023-10-10
+Updated 2025-03-27
 
 Quicklink: [abseil.io/fast/9](https://abseil.io/fast/9)
 
@@ -64,7 +64,7 @@ Prior to cleanups, the implementations weren't the same.
   working around a
   [false dependency bug](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011)
   in some processors.
-* When the compiler builtin is used (the "slow" version), we actually end up
+* When the compiler built-in is used (the "slow" version), we actually end up
   with a better sequence of machine code and can perform stronger
   optimizations at compile-time around constant folding.
 

_posts/2023-09-14-fast-7.md

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #7 on June 6, 2019
 
 *By [Chris Kennelly](mailto:ckennelly@google.com)*
 
-Updated 2023-10-31
+Updated 2025-03-25
 
 Quicklink: [abseil.io/fast/7](https://abseil.io/fast/7)
 

_posts/2023-09-30-fast-52.md

Lines changed: 10 additions & 9 deletions
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #52 on September 30, 2021
 
 *By [Chris Kennelly](mailto:ckennelly@google.com)*
 
-Updated 2023-09-30
+Updated 2025-03-24
 
 Quicklink: [abseil.io/fast/52](https://abseil.io/fast/52)
 
@@ -130,13 +130,14 @@ test, and successfully land new features in production. Beyond just optimizing
 Extra complexity that delays an improvement to product experiences is a
 non-obvious externality.
 
-For example, TCMalloc has a number of tuning options and customization points,
-but ultimately, several optimizations came from sanding away extra configuration
-complexity. The rarely used malloc hooks API required careful structuring of
-TCMalloc's fast path to allow users who didn't use hooks--most users--to not pay
-for their possible presence. In another case, removing the `sbrk` allocator
-allowed TCMalloc to structure its virtual address space carefully, enabling
-several enhancements.
+For example, TCMalloc has a number of
+[tuning options](https://github.com/google/tcmalloc/blob/master/docs/tuning.md)
+and customization points, but ultimately, several optimizations came from
+sanding away extra configuration complexity. The rarely used malloc hooks API
+required careful structuring of TCMalloc's fast path to allow users who didn't
+use hooks--most users--to not pay for their possible presence. In another case,
+removing the `sbrk` allocator allowed TCMalloc to structure its virtual address
+space carefully, enabling several enhancements.
 
 ## Beyond knobs
 
@@ -147,7 +148,7 @@ An existing library, *X*, might be inadequate or insufficiently expressive,
 which can motivate building a "better" alternative, *Y*, along some dimensions.
 Realizing the benefit of using *Y* is dependent on users both discovering *Y*
 and picking between *X* and *Y* *correctly*--and in the case of a long-lived
-code base, keeping that choice optimal over time.
+codebase, keeping that choice optimal over time.
 
 For some uses, this strategy is infeasible. `my::super_fast_string` will
 probably never replace `std::string` because the latter is so entrenched and the

_posts/2023-10-10-fast-64.md

Lines changed: 12 additions & 7 deletions
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #64 on October 21, 2022
 
 *By [Chris Kennelly](mailto:ckennelly@google.com)*
 
-Updated 2023-10-10
+Updated 2025-03-24
 
 Quicklink: [abseil.io/fast/64](https://abseil.io/fast/64)
 
@@ -42,10 +42,10 @@ preconditions to unlock the best possible performance. We need to document and
 test these sharp edges. Future debugging has an opportunity cost: When we spend
 time tracking down and fixing bugs, we are not developing new optimizations. We
 can use assertions for preconditions, especially in debug/sanitizer builds, to
-double-check contracts and *enforce* them. Testing robots never sleep, while
-humans are fallible. Randomized implementation behaviors provide a useful
-bulwark against Hyrum's Law from creeping in to implicitly expand the contract
-of an interface.
+double-check contracts and *enforce* them. Testing
+[robots never sleep](/fast/93), while humans are fallible. Randomized
+implementation behaviors provide a useful bulwark against Hyrum's Law from
+creeping in to implicitly expand the contract of an interface.
 
 ## Express intents

@@ -128,7 +128,7 @@ There are situations where the benefits of duplicate APIs outweight the costs.
 
 The Abseil hash containers
 ([SwissMap](https://abseil.io/about/design/swisstables)) added new hashtable
-implementations to the code base, which at first glance, appear redundant with
+implementations to the codebase, which at first glance, appear redundant with
 the ones in the C++ standard library. This apparent duplication allowed us to
 have a more efficient set of containers which match the standard library API,
 but adhere to a weaker set of constraints.
@@ -140,6 +140,11 @@ to a node-based implementation that requires data indirections and constrains
 performance. Given `std::unordered_map`'s widespread usage, it was not feasible
 to relax these guarantees all at once.
 
+Node-based containers necessitate implementation overheads, but they come with a
+direct benefit: They actively facilitate migration while allowing weaker
+containers to be available. Making a guarantee stronger without an accompanying
+benefit is undesirable.
+
 The migration was a replacement path for the legacy containers, not an
 alternative. The superior performance characteristics meant that users could
 "just use SwissMap" without tedious benchmarking on a case-by-case basis.
@@ -218,7 +223,7 @@ constrain future implementations by creating sharp performance edges.
 higher-level intent--transferring pointer ownership, "lending" a submessage
 to another one, etc.
 
-### Concluding remarks
+## Concluding remarks
 
 Good performance should be available by default, not an optional feature. While
 [feature flags and knobs can be useful for testing and initial rollout](/fast/52),

_posts/2023-10-15-fast-60.md

Lines changed: 6 additions & 2 deletions
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #60 on June 6, 2022
 
 *By [Chris Kennelly](mailto:ckennelly@google.com)*
 
-Updated 2023-10-15
+Updated 2025-03-24
 
 Quicklink: [abseil.io/fast/60](https://abseil.io/fast/60)
 
@@ -73,7 +73,11 @@ provided. A key driver for hashtable-specific profiling is that the CPU profiles
 of a hashtable with a
 [bad hash function look similar to those](https://youtu.be/JZE3_0qvrMg?t=1864)
 with a good hash function. The added information collected for stuck bits helps
-us drive optimization decisions we wouldn't have been able to make.
+us drive optimization decisions we wouldn't have been able to make. The capacity
+information collected during hashtable-profiling is incidental to the profiler's
+richer, hashtable-specific details, but wouldn't be a particularly compelling
+reason to collect it on its own given the redundant information available from
+ordinary heap profiles.
 
 ## Sampling strategies
 

_posts/2023-10-20-fast-70.md

Lines changed: 2 additions & 2 deletions
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #70 on June 26, 2023
 
 *By [Chris Kennelly](mailto:ckennelly@google.com)*
 
-Updated 2023-10-20
+Updated 2025-03-25
 
 Quicklink: [abseil.io/fast/70](https://abseil.io/fast/70)
 
@@ -120,7 +120,7 @@ and improvement in microbenchmark times help validate that the optimization is
 working according to our mental model of the code being optimized. We avoid
 false positives by doing so: Changing the
 [font color of a webpage to green](https://xkcd.com/882/) and running a loadtest
-*might* give a positive result
+*might* give a [positive result](/fast/88)
 [purely by chance](https://en.wikipedia.org/wiki/Bonferroni_correction), not due
 to a causal effect.
 

_posts/2023-11-10-fast-74.md

Lines changed: 20 additions & 2 deletions
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #74 on September 29, 2023
 
 *By [Chris Kennelly](mailto:ckennelly@google.com) and [Matt Kulukundis](mailto:kfm@google.com)*
 
-Updated 2023-11-10
+Updated 2025-03-25
 
 Quicklink: [abseil.io/fast/74](https://abseil.io/fast/74)
 
@@ -98,6 +98,10 @@ one process to another invalidates caches and
 manifest as misses and stalls for ordinary user code, completely disconnected
 from the context switch itself.
 
+We see a similar effect from changing scheduling parameters. Preferring to keep
+threads on the same core will improve cache locality, even though it may
+increase apparent kernel scheduler latency.
+
 ### Sweeping away protocol buffers
 
 Consider an extreme example. When our hashtable profiler for Abseil's hashtables
@@ -161,7 +165,7 @@ long-lived `Cord`s.
 Similarly, consider when we embed type `A` as a data member in type `B`.
 Changing `sizeof(A)` indirectly changes `sizeof(B)` and the memory we allocate
 when we type `new B()` or `std::vector<B>`. Small types and memory paddings are
-peanut-buttered across the code base, but in aggregate can consume large amounts
+peanut-buttered across the codebase, but in aggregate can consume large amounts
 of memory for commonly used types.
 
 ### Improving data placement
@@ -198,6 +202,20 @@ calls. Even though `rep movsb` can be outperformed by hand-optimized
 implementations in microbenchmarks, this strategy can reduce code cache pressure
 and external overheads.
 
+### Making effective prefetches
+
+During the A/B experiment evaluation of adaptive prefetching, we
+[observed](https://research.google/pubs/limoncello-prefetchers-for-scale/) that
+while topline [application performance broadly improved](/fast/7), individual
+functions sometimes regressed. If a function generally benefited from the HW
+prefetcher, it often regressed. If a function were antagonized by memory
+bandwidth saturation, it could improve.
+
+This data identified an opportunity to add software prefetches to functions with
+simple access patterns. This allowed us to optimize for the best of both worlds:
+keeping effective prefetches intact, while ablating often ineffective hardware
+prefetchers under saturation.
+
 ## Closing thoughts
 
 When improving anything (performance, quality, even reliability) do not mistake
