
VPA: add performance benchmarking #9192

Merged
k8s-ci-robot merged 2 commits into kubernetes:master from maxcao13:vpa-benchmark
Feb 19, 2026

Conversation

@maxcao13
Member

@maxcao13 maxcao13 commented Feb 6, 2026

What type of PR is this?

/kind cleanup
/kind documentation
/area vertical-pod-autoscaler

What this PR does / why we need it:

With many changes coming to the VPA project, it would be beneficial to know whether a change causes a performance regression, or whether it actually improves performance. For example, a large change like adding the InPlace update mode, or a performance enhancement such as the work done in #8807. This PR adds a new benchmarking tool to the VPA which can measure the latency of VPA components under load in a real cluster (kind), not a simulated one. We can then compare the results on a pull request against the main branch.

Ideally, this would run as a GitHub Action or a Prow job in CI, giving reproducible and consistent environments for each change to the VPA, which we could then easily compare against main/master. For now, this is just a tool to be run locally; the CI integration would ideally come in a follow-up PR.

Details about the benchmark itself, internals, quick-start, etcetera, are in the README.md in the vertical-pod-autoscaler/benchmark directory.

Which issue(s) this PR fixes:

Partially fixes #8866

Special notes for your reviewer:

This PR was done in tandem with Cursor Agent mode using the Cursor IDE.

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. kind/documentation Categorizes issue or PR as related to documentation. area/vertical-pod-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 6, 2026
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Feb 6, 2026
@maxcao13
Member Author

maxcao13 commented Feb 6, 2026

Here's how I tested this locally:

# in vertical-pod-autoscaler/
$ KIND_VERSION="kindest/node:v1.35.0@sha256:452d707d4862f52530247495d180205e029056831160e22870e37e3f6c1ac31f"
$ kind delete cluster --name kind
$ kind create cluster --image=${KIND_VERSION} --config=benchmark/kind-config.yaml
$ ./hack/deploy-for-e2e-locally.sh full-vpa
$ go build -C benchmark -o ../bin/vpa-benchmark .
$ bin/vpa-benchmark --profile=xxlarge
... results

It is possible that my local setup is more or less powerful than anyone else's trying this, which might skew the timings of the sleeps, so I'd love some feedback on how to make this better if it doesn't work for everyone. And yes, loops can actually take longer than the allotted updater-interval or recommender-interval if there is too much going on in the cluster, like evictions. I had to manually increase the intervals so that wouldn't happen, but I'm assuming that on someone's setup a loop could still take too long and not finish, or could be very fast and not need all that time, since Docker shares the local machine's resources.

FWIW: I did try this out on Omer's PR, and I'm not sure there was a notable difference. The bulk of the time is spent in EvictPods, most likely because those eviction calls are actual API calls to the apiserver every time, instead of reads from the informer cache.

# before
========== Results ==========
┌───────────────┬───────────────────┐
│     STEP      │ XXLARGE  ( 1000 ) │
├───────────────┼───────────────────┤
│ AdmissionInit │ 0.0004s           │
│ EvictPods     │ 98.6963s          │
│ FilterPods    │ 0.0925s           │
│ ListPods      │ 0.0025s           │
│ ListVPAs      │ 0.0027s           │
│ total         │ 98.7945s          │
└───────────────┴───────────────────┘

# after
========== Results ==========
┌───────────────┬───────────────────┐
│     STEP      │ XXLARGE  ( 1000 ) │
├───────────────┼───────────────────┤
│ AdmissionInit │ 0.0003s           │
│ EvictPods     │ 99.1713s          │
│ FilterPods    │ 0.0922s           │
│ ListPods      │ 0.0020s           │
│ ListVPAs      │ 0.0026s           │
│ total         │ 99.2683s          │
└───────────────┴───────────────────┘

@maxcao13
Member Author

maxcao13 commented Feb 6, 2026

/cc @omerap12 @adrianmoisey

Member

@omerap12 omerap12 left a comment

I’m planning to go through the code in depth, but overall I like it. Especially the use of KWOK.
There’s one bug you can reproduce by running:

$ ./bin/vpa-benchmark 
=== VPA Benchmark with KWOK ===
Profiles: [small], Runs per profile: 1

Installing KWOK...
  KWOK already installed
Creating KWOK fake node...
  KWOK node already exists
Configuring VPA deployments...
  Configured VPA deployments with increased QPS/burst and updater interval

========== Profile: small (25 VPAs) ==========
Scaling down VPA components...
  Waiting for 6 VPA pods to terminate...
  Waiting for 2 VPA pods to terminate...
  VPA components scaled down
Deleting all VPA checkpoints...
  Deleted 0 VPA checkpoints
Cleaning up existing benchmark resources...
Creating 25 ReplicaSets (2 pods each, 50 total)...
Creating 25 VPAs...
Waiting for 50 KWOK pods to be running...
  Pods: 0/50
  Pods: 0/50
  Pods: 0/50
  Pods: 0/50
  Pods: 0/50
  Pods: 0/50
  Pods: 0/50
  Pods: 0/50
  Pods: 0/50
  Pods: 0/50
  Pods: 0/50
  Pods: 0/50
  Pods: 0/50
...
W0206 13:21:42.996430   69510 main.go:174] Run 1 failed: timeout waiting for pods: context deadline exceeded
No successful runs for profile small!
Final cleanup...

Benchmark completed successfully.

I guess this is because I didn't enter a profile: https://github.com/kubernetes/autoscaler/pull/9192/files#diff-95216a5af7dfd121672dd3633edd84664820b5e5390eff6902f95c2bb52ede51R69

Moreover, should we also use this https://kwok.sigs.k8s.io/docs/user/resource-usage-configuration/ ?

Comment on lines 337 to 340
stagesURL := fmt.Sprintf("https://github.com/kubernetes-sigs/kwok/releases/download/%s/stage-fast.yaml", kwokVersion)
cmd = exec.CommandContext(ctx, "kubectl", "apply", "-f", stagesURL)
if output, err := cmd.CombinedOutput(); err != nil {
return fmt.Errorf("kubectl apply stage-fast.yaml failed: %v\n%s", err, output)
Member

I’m debating whether we want the Go code to handle the infrastructure setup as well.
We’ll probably build a bash script on top of it, so I’m wondering if the script should install all required tools (e.g., KWOK, yq, etc.), while the Go code focuses solely on benchmarking.

Member Author

Yeah, I was debating this as well.

I think having a bash script outside makes sense to keep the binary focused on benchmarking. I was mostly thinking about keeping all the code contained in one program, but that makes more sense to me. Thanks for the feedback 👍

@maxcao13
Member Author

maxcao13 commented Feb 6, 2026

I'll look into the bug, thanks!

EDIT: Hmm, I cannot recreate the bug; there might be something going on in your environment. For example, is the kwok-controller not marking those pods as ready?

Moreover, should we also use this https://kwok.sigs.k8s.io/docs/user/resource-usage-configuration/ ?

Hm... I'm wondering how we could benefit from controlling the resource usage, and what other situations we could performance test using this CRD?

I guess it would help to be explicit about what resource usage the recommendations are generated from, so that we can always trigger updates/evictions from the updater. I will take a look to see if I can add ClusterResourceUsage to simulate usage outside the initial resource requests of each pod. For now, I think the recommender is just hitting minCPU and minMemory and recommending those, which triggers evictions anyway.

Member

@iamzili iamzili left a comment

Good work Max! The PR seems reasonable to me, I have just left a few comments.

Furthermore, based on the code, the main goal of this binary is to check the duration of each phase in the updater within RunOnce. Are there any plans to do the same for the recommender, or even for the admission controller?


// configureVPADeployments modifies VPA deployments with increased QPS/burst and updater interval.
// Uses kubectl get/yq/kubectl apply.
func configureVPADeployments(ctx context.Context) error {
Member

just a side note related to this function: I think it would be great to make the ./hack/deploy-for-e2e-locally.sh script more versatile so that it can deploy the VPA components with custom flags, which would make this function obsolete. I proposed such an improvement in the PR whose goal is to incorporate the helm chart into ./hack/deploy-for-e2e-locally.sh:

#8990 (comment)

Member Author

Ah, I wasn't aware of that PR. I'll keep it in mind, thanks! I want to get rid of the setup in the Go code anyway, so I'll wait for that to resolve.

Comment on lines +310 to +314
// Step 10: Wait for updater's first loop
// The updater uses time.Tick which waits the full interval before the first tick
// We set --updater-interval=2m, so wait 2 minutes for the first loop to start
fmt.Println("Waiting 2 minutes for updater's first loop...")
time.Sleep(2 * time.Minute)
Member

Let me ask: what is the reason for this sleep? I'm asking because we could rely entirely on the mechanism implemented in the waitForAndScrapeMetrics function (i.e. polling until the vpa_updater_execution_latency_seconds_sum{step="total"} metric is present).

Member Author

The reason is that we know for sure the first tick of the updater and recommender happens at the interval time, not at t=0, so I'm sleeping to avoid wasting execution on polling. I could in theory just add more time to the timeout of waitForAndScrapeMetrics, but it's guaranteed we will have to wait that long anyway.

@omerap12
Member

omerap12 commented Feb 8, 2026

I'll look into the bug, thanks!

EDIT: Hmm, I cannot recreate the bug, there might be something going in your environment. For example, the kwok-controller not changing those pods to be ready?

Moreover, should we also use this https://kwok.sigs.k8s.io/docs/user/resource-usage-configuration/ ?

Hm... I'm wondering how we could benefit from controlling the resource usage, and what other situations we could performance test using this CRD?

I guess it would help to be explicit in what resource usage we want the recommendations to generate from, so that we can always trigger updates/evictions from the updater. I will take a look to see if I can add ClusterResourceUsage for simulation that is outside of the initial resource requests of each pods. For now, I think the recommender is just hitting the minCPU and minMemory and recommending those which will trigger evictions anyways.

Yeah, it's just something I came across. I'm not sure whether we should use it yet, but it might be worth considering as the benchmark evolves.

@maxcao13
Member Author

Are there any plans to do the same for the recommender

Yes! Hopefully, if this is approved, a follow-up to add scraping the recommender and displaying its metrics will come.

Yeah, It’s just something I came across. I’m not sure whether we should use it yet, but it might be worth considering as the benchmark evolves.

Sounds good, I don't think we need it yet. AFAICT it can just simulate metrics, which would be good for benchmarking the recommender, but not for the updater.

@maxcao13
Member Author

maxcao13 commented Feb 13, 2026

I've added a commit here to add the benchmark as a GitHub Action, and it's running right now. I can split it out into a new PR if we like once all this is approved, but I put it in here to show it off, and also to see if the GitHub infra will agree with what I've done :-)

EDIT: https://github.com/kubernetes/autoscaler/actions/runs/22002078972/job/63576978582?pr=9192
See the end of the Build and run benchmark step.

Member

@omerap12 omerap12 left a comment

I really really really like it.
I have some comments but I think this is good to go.
@adrianmoisey , I know you found it super useful as well so let's push that forward.

Comment on lines +31 to +46
echo " Configuring vpa-recommender (QPS=100, burst=200, memory-saver=true)..."
kubectl get deployment vpa-recommender -n "${VPA_NAMESPACE}" -o yaml | \
yq '.spec.template.spec.containers[0].args = ["--kube-api-qps=100", "--kube-api-burst=200", "--memory-saver=true"]' | \
kubectl apply -f -

echo " Configuring vpa-updater (QPS=100, burst=200, updater-interval=2m)..."
kubectl get deployment vpa-updater -n "${VPA_NAMESPACE}" -o yaml | \
yq '.spec.template.spec.containers[0].args = ["--kube-api-qps=100", "--kube-api-burst=200", "--updater-interval=2m"]' | \
kubectl apply -f -

echo " Configuring vpa-admission-controller (QPS=100, burst=200)..."
kubectl get deployment vpa-admission-controller -n "${VPA_NAMESPACE}" -o yaml | \
yq '.spec.template.spec.containers[0].args = ["--kube-api-qps=100", "--kube-api-burst=200"]' | \
kubectl apply -f -

echo "=== VPA configuration complete ==="
Member

Can we use our helm chart instead?

Member Author

Wondering if I should wait on #8990 to resolve, so I can re-use that PR's deployment implementation, and then I can override with a custom set of helm values?

Member

We don't know when #8990 will be merged so I say no.
/cc @adrianmoisey

Member

My opinion is that this code isn't production code. It's easy to change and is risk free. Let's merge it as is and fix it later.
We know that we plan on using the helm chart in various places, and we know it's not quite ready yet.

Member

I agree

Comment on lines +46 to +57
kubeadmConfigPatches:
- |
kind: ClusterConfiguration
apiServer:
extraArgs:
max-requests-inflight: "2000"
max-mutating-requests-inflight: "1000"
controllerManager:
extraArgs:
concurrent-replicaset-syncs: "500"
kube-api-qps: "500"
kube-api-burst: "1000"
Member

We already have a kind cluster configuration at: https://github.com/kubernetes/autoscaler/blob/master/.github/kind-config.yaml
Can we use that instead?

Member Author

I'm actually re-using that one, by copying it over to a temp file and then appending to it. I don't necessarily think these kubeadmConfigPatches should be the default, though.


Comment on lines 33 to 43
if kubectl get deployment kwok-controller -n "${KWOK_NAMESPACE}" &>/dev/null; then
echo " KWOK already installed, skipping"
else
echo " Applying KWOK manifests..."
kubectl apply -f "https://github.com/kubernetes-sigs/kwok/releases/download/${KWOK_VERSION}/kwok.yaml"

echo " Applying KWOK stage-fast manifests..."
kubectl apply -f "https://github.com/kubernetes-sigs/kwok/releases/download/${KWOK_VERSION}/stage-fast.yaml"

echo " Waiting for KWOK controller to be ready..."
kubectl wait --for=condition=Available deployment/kwok-controller -n "${KWOK_NAMESPACE}" --timeout=60s
Member

KWOK has its own chart: https://github.com/kubernetes-sigs/kwok/tree/main/charts/kwok
Can we use it instead?

Member Author

Pushed. Just FYI, I will be gone on vacation for 2 weeks, so if I am ignoring anything that is why 😁

/cc @adrianmoisey

Member

Have fun!

@adrianmoisey
Member

I really really really like it. I have some comments but I think this is good to go. @adrianmoisey , I know you found it super useful as well so let's push that forward.

Yup, I haven't had time to review, but the idea seems sound.

I have some wish list items (which don't need to be in this PR, we can figure this out later).

  1. It would be nice to be able to make some pods that don't match a VPA, to closer replicate a real cluster
  2. Tracking memory could also be super useful

@omerap12
Member

I really really really like it. I have some comments but I think this is good to go. @adrianmoisey , I know you found it super useful as well so let's push that forward.

Yup, I haven't had time to review, but the idea seems sound.

I have some wish list items (which don't need to be in this PR, we can figure this out later).

  1. It would be nice to be able to make some pods that don't match a VPA, to closer replicate a real cluster
  2. Tracking memory could also be super useful

I believe we can do 1 and 2 in follow-up PRs (maybe let other folks help out?)

@adrianmoisey
Member

Oh, another wishlist item: output the metrics as junit files and move the benchmark into prow, that way we can graph it over time

I'll go write my wishlist items on the issue linked from here so they aren't lost

This commit adds VPA benchmarking to the project.
It adds code for a benchmark binary which measures the internal latency of the VPA updater loop.
This allows users to test changes and how it will affect the performance of the VPA.
The code and documentation lives in the vertical-pod-autoscaler/benchmark directory.

Signed-off-by: Max Cao <macao@redhat.com>
Adds a new GitHub action which runs the vpa-benchmark e2e.

Signed-off-by: Max Cao <macao@redhat.com>
Member

@omerap12 omerap12 left a comment

/lgtm
/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Feb 19, 2026
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 19, 2026
@omerap12
Member

/assign @adrianmoisey

@adrianmoisey
Member

/approve

Thanks! This is a great start!

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adrianmoisey, maxcao13

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 19, 2026
@k8s-ci-robot k8s-ci-robot merged commit c1154aa into kubernetes:master Feb 19, 2026
14 checks passed
Choraden pushed a commit to Choraden/autoscaler that referenced this pull request Feb 20, 2026
* ci(perf): add VPA performance benchmarking

This commit adds VPA benchmarking to the project.
It adds code for a benchmark binary which measures the internal latency of the VPA updater loop.
This allows users to test changes and how it will affect the performance of the VPA.
The code and documentation lives in the vertical-pod-autoscaler/benchmark directory.

Signed-off-by: Max Cao <macao@redhat.com>

* ci(perf): add vpa-benchmark github action

Adds a new GitHub action which runs the vpa-benchmark e2e.

Signed-off-by: Max Cao <macao@redhat.com>

---------

Signed-off-by: Max Cao <macao@redhat.com>
Development

Successfully merging this pull request may close these issues.

Add Performance Benchmarking for VPA

5 participants