Commit a857204

Merge pull request #510 from aalexand/fast-87-etc

abseil.io/fast: Publish episodes 87, 88, 90, 93, plus a few updates.

2 parents: 473053b + 9f83742
19 files changed: +929 −43 lines

_posts/2023-03-02-fast-21.md

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #21 on January 16, 2020
 
 *By [Paul Wankadia](mailto:junyer@google.com) and [Darryl Gove](mailto:djgove@google.com)*
 
-Updated 2023-03-02
+Updated 2024-10-21
 
 Quicklink: [abseil.io/fast/21](https://abseil.io/fast/21)
 

_posts/2023-03-02-fast-39.md

Lines changed: 12 additions & 4 deletions
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #39 on January 22, 2021
 
 *By [Chris Kennelly](mailto:ckennelly@google.com) and [Alkis Evlogimenos](mailto:alkis@evlogimenos.com)*
 
-Updated 2023-10-10
+Updated 2025-03-24
 
 Quicklink: [abseil.io/fast/39](https://abseil.io/fast/39)
 
@@ -146,14 +146,14 @@ would "reduce" the
 [data center tax](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/44271.pdf),
 but we would actually hurt [application productivity](/fast/7)-per-CPU. Time we
 spend in malloc is
-[less important than application performance](https://research.google/pubs/pub50370.pdf).
+[less important than application performance](https://storage.googleapis.com/gweb-research2023-media/pubtools/6170.pdf).
 
 Trace-driven simulations with hardware-validated architectural simulators showed
 the prefetched data was frequently used. Additionally, it is better to stall on
 a TLB miss at the prefetch site--which has no dependencies, than to stall at the
 point of use.
 
-## Pitfalls
+## Pitfalls {#pitfalls}
 
 There are a number of things that commonly go wrong when writing benchmarks. The
 following is a non-exhaustive list:
@@ -175,15 +175,23 @@ following is a non-exhaustive list:
   [Stabilizer (by Berger, et. al.)](https://people.cs.umass.edu/~emery/pubs/stabilizer-asplos13.pdf)
   deliberately perturb these parameters to improve benchmarking statistical
   quality.
+* Sensitivity to stack alignment. Changes anywhere in the stack--added/removed
+  variables, better (or worse) spilling due to compiler optimizations,
+  etc.--can affect the alignment at the start of the function-under-test. This
+  has been seen to produce 20% performance swings.
 * Representative data. The data in the benchmark needs to be "similar" to the
   data in production - for example, imagine having short strings in the
   benchmark, and long strings in the fleet. This also extends to the code
   paths in the benchmarks being similar to the code paths that the application
-  exercises.
+  exercises. This is a common pain point for macrobenchmarks too. A loadtest
+  may cover certain request types, rather than all of those seen by production
+  servers.
+
 * Benchmarking the right code. It's very easy to introduce code into the
   benchmark that's not present in the real workload. For example, using a
   random number generator's cost for a benchmark could exceed the cost of the
   work being benchmarked.
+
 * Being aware of steady state vs dynamic behaviour. For more complex
   benchmarks it's easy to produce something that converges to a steady state -
   for example if it has a constant arrival rate and service time. Production

_posts/2023-03-02-fast-53.md

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #53 on October 14, 2021
 
 *By [Mircea Trofin](mailto:mtrofin@google.com)*
 
-Updated 2023-09-04
+Updated 2024-11-19
 
 Quicklink: [abseil.io/fast/53](https://abseil.io/fast/53)
 

_posts/2023-03-02-fast-9.md

Lines changed: 2 additions & 2 deletions
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #9 on June 24, 2019
 
 *By [Chris Kennelly](mailto:ckennelly@google.com)*
 
-Updated 2023-10-10
+Updated 2025-03-27
 
 Quicklink: [abseil.io/fast/9](https://abseil.io/fast/9)
 
@@ -64,7 +64,7 @@ Prior to cleanups, the implementations weren't the same.
   working around a
   [false dependency bug](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011)
   in some processors.
-* When the compiler builtin is used (the "slow" version), we actually end up
+* When the compiler built-in is used (the "slow" version), we actually end up
   with a better sequence of machine code and can perform stronger
   optimizations at compile-time around constant folding.
 

_posts/2023-09-14-fast-7.md

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #7 on June 6, 2019
 
 *By [Chris Kennelly](mailto:ckennelly@google.com)*
 
-Updated 2023-10-31
+Updated 2025-03-25
 
 Quicklink: [abseil.io/fast/7](https://abseil.io/fast/7)
 

_posts/2023-09-30-fast-52.md

Lines changed: 10 additions & 9 deletions
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #52 on September 30, 2021
 
 *By [Chris Kennelly](mailto:ckennelly@google.com)*
 
-Updated 2023-09-30
+Updated 2025-03-24
 
 Quicklink: [abseil.io/fast/52](https://abseil.io/fast/52)
 
@@ -130,13 +130,14 @@ test, and successfully land new features in production. Beyond just optimizing
 Extra complexity that delays an improvement to product experiences is a
 non-obvious externality.
 
-For example, TCMalloc has a number of tuning options and customization points,
-but ultimately, several optimizations came from sanding away extra configuration
-complexity. The rarely used malloc hooks API required careful structuring of
-TCMalloc's fast path to allow users who didn't use hooks--most users--to not pay
-for their possible presence. In another case, removing the `sbrk` allocator
-allowed TCMalloc to structure its virtual address space carefully, enabling
-several enhancements.
+For example, TCMalloc has a number of
+[tuning options](https://github.com/google/tcmalloc/blob/master/docs/tuning.md)
+and customization points, but ultimately, several optimizations came from
+sanding away extra configuration complexity. The rarely used malloc hooks API
+required careful structuring of TCMalloc's fast path to allow users who didn't
+use hooks--most users--to not pay for their possible presence. In another case,
+removing the `sbrk` allocator allowed TCMalloc to structure its virtual address
+space carefully, enabling several enhancements.
 
 ## Beyond knobs
 
@@ -147,7 +148,7 @@ An existing library, *X*, might be inadequate or insufficiently expressive,
 which can motivate building a "better" alternative, *Y*, along some dimensions.
 Realizing the benefit of using *Y* is dependent on users both discovering *Y*
 and picking between *X* and *Y* *correctly*--and in the case of a long-lived
-code base, keeping that choice optimal over time.
+codebase, keeping that choice optimal over time.
 
 For some uses, this strategy is infeasible. `my::super_fast_string` will
 probably never replace `std::string` because the latter is so entrenched and the

_posts/2023-10-10-fast-64.md

Lines changed: 12 additions & 7 deletions
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #64 on October 21, 2022
 
 *By [Chris Kennelly](mailto:ckennelly@google.com)*
 
-Updated 2023-10-10
+Updated 2025-03-24
 
 Quicklink: [abseil.io/fast/64](https://abseil.io/fast/64)
 
@@ -42,10 +42,10 @@ preconditions to unlock the best possible performance. We need to document and
 test these sharp edges. Future debugging has an opportunity cost: When we spend
 time tracking down and fixing bugs, we are not developing new optimizations. We
 can use assertions for preconditions, especially in debug/sanitizer builds, to
-double-check contracts and *enforce* them. Testing robots never sleep, while
-humans are fallible. Randomized implementation behaviors provide a useful
-bulwark against Hyrum's Law from creeping in to implicitly expand the contract
-of an interface.
+double-check contracts and *enforce* them. Testing
+[robots never sleep](/fast/93), while humans are fallible. Randomized
+implementation behaviors provide a useful bulwark against Hyrum's Law from
+creeping in to implicitly expand the contract of an interface.
 
 ## Express intents

@@ -128,7 +128,7 @@ There are situations where the benefits of duplicate APIs outweight the costs.
 
 The Abseil hash containers
 ([SwissMap](https://abseil.io/about/design/swisstables)) added new hashtable
-implementations to the code base, which at first glance, appear redundant with
+implementations to the codebase, which at first glance, appear redundant with
 the ones in the C++ standard library. This apparent duplication allowed us to
 have a more efficient set of containers which match the standard library API,
 but adhere to a weaker set of constraints.
@@ -140,6 +140,11 @@ to a node-based implementation that requires data indirections and constrains
 performance. Given `std::unordered_map`'s widespread usage, it was not feasible
 to relax these guarantees all at once.
 
+Node-based containers necessitate implementation overheads, but they come with a
+direct benefit: They actively facilitate migration while allowing weaker
+containers to be available. Making a guarantee stronger without an accompanying
+benefit is undesirable.
+
 The migration was a replacement path for the legacy containers, not an
 alternative. The superior performance characteristics meant that users could
 "just use SwissMap" without tedious benchmarking on a case-by-case basis.
@@ -218,7 +223,7 @@ constrain future implementations by creating sharp performance edges.
 higher-level intent--transferring pointer ownership, "lending" a submessage
 to another one, etc.
 
-### Concluding remarks
+## Concluding remarks
 
 Good performance should be available by default, not an optional feature. While
 [feature flags and knobs can be useful for testing and initial rollout](/fast/52),

_posts/2023-10-15-fast-60.md

Lines changed: 6 additions & 2 deletions
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #60 on June 6, 2022
 
 *By [Chris Kennelly](mailto:ckennelly@google.com)*
 
-Updated 2023-10-15
+Updated 2025-03-24
 
 Quicklink: [abseil.io/fast/60](https://abseil.io/fast/60)
 
@@ -73,7 +73,11 @@ provided. A key driver for hashtable-specific profiling is that the CPU profiles
 of a hashtable with a
 [bad hash function look similar to those](https://youtu.be/JZE3_0qvrMg?t=1864)
 with a good hash function. The added information collected for stuck bits helps
-us drive optimization decisions we wouldn't have been able to make.
+us drive optimization decisions we wouldn't have been able to make. The capacity
+information collected during hashtable-profiling is incidental to the profiler's
+richer, hashtable-specific details, but wouldn't be a particularly compelling
+reason to collect it on its own given the redundant information available from
+ordinary heap profiles.
 
 ## Sampling strategies
 

_posts/2023-10-20-fast-70.md

Lines changed: 2 additions & 2 deletions
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #70 on June 26, 2023
 
 *By [Chris Kennelly](mailto:ckennelly@google.com)*
 
-Updated 2023-10-20
+Updated 2025-03-25
 
 Quicklink: [abseil.io/fast/70](https://abseil.io/fast/70)
 
@@ -120,7 +120,7 @@ and improvement in microbenchmark times help validate that the optimization is
 working according to our mental model of the code being optimized. We avoid
 false positives by doing so: Changing the
 [font color of a webpage to green](https://xkcd.com/882/) and running a loadtest
-*might* give a positive result
+*might* give a [positive result](/fast/88)
 [purely by chance](https://en.wikipedia.org/wiki/Bonferroni_correction), not due
 to a causal effect.
 

_posts/2023-11-10-fast-74.md

Lines changed: 20 additions & 2 deletions
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #74 on September 29, 2023
 
 *By [Chris Kennelly](mailto:ckennelly@google.com) and [Matt Kulukundis](mailto:kfm@google.com)*
 
-Updated 2023-11-10
+Updated 2025-03-25
 
 Quicklink: [abseil.io/fast/74](https://abseil.io/fast/74)
 
@@ -98,6 +98,10 @@ one process to another invalidates caches and
 manifest as misses and stalls for ordinary user code, completely disconnected
 from the context switch itself.
 
+We see a similar effect from changing scheduling parameters. Preferring to keep
+threads on the same core will improve cache locality, even though it may
+increase apparent kernel scheduler latency.
+
 ### Sweeping away protocol buffers
 
 Consider an extreme example. When our hashtable profiler for Abseil's hashtables
@@ -161,7 +165,7 @@ long-lived `Cord`s.
 Similarly, consider when we embed type `A` as a data member in type `B`.
 Changing `sizeof(A)` indirectly changes `sizeof(B)` and the memory we allocate
 when we type `new B()` or `std::vector<B>`. Small types and memory paddings are
-peanut-buttered across the code base, but in aggregate can consume large amounts
+peanut-buttered across the codebase, but in aggregate can consume large amounts
 of memory for commonly used types.
 
 ### Improving data placement
@@ -198,6 +202,20 @@ calls. Even though `rep movsb` can be outperformed by hand-optimized
 implementations in microbenchmarks, this strategy can reduce code cache pressure
 and external overheads.
 
+### Making effective prefetches
+
+During the A/B experiment evaluation of adaptive prefetching, we
+[observed](https://research.google/pubs/limoncello-prefetchers-for-scale/) that
+while topline [application performance broadly improved](/fast/7), individual
+functions sometimes regressed. If a function generally benefited from the HW
+prefetcher, it often regressed. If a function were antagonized by memory
+bandwidth saturation, it could improve.
+
+This data identified an opportunity to add software prefetches to functions with
+simple access patterns. This allowed us to optimize for the best of both worlds:
+keeping effective prefetches intact, while ablating often ineffective hardware
+prefetchers under saturation.
+
 ## Closing thoughts
 
 When improving anything (performance, quality, even reliability) do not mistake
