Conversation
Preview deployment ready! Preview URL: https://pr-9030.d18coufmbnnaag.amplifyapp.com

force-pushed from f23151b to 0421ea8
nathangeology
left a comment
Placement groups are a great feature to support. Left a few comments, I hope they are helpful!
```go
// expandPartitionOfferings expands each offering into N offerings (one per partition) for partition placement groups.
// This enables the scheduler to use TopologySpreadConstraints with the partition topology key.
func (p *DefaultProvider) expandPartitionOfferings(offerings cloudprovider.Offerings, nodeClass NodeClass) cloudprovider.Offerings {
```
expandPartitionOfferings takes every offering and multiplies it by partitionCount, and EC2 supports up to 7 partitions. With, say, 500 instance types × 3 AZs × 2 capacity types × 7 partitions, you're looking at 21,000 offerings instead of 3,000. That could meaningfully impact scheduler simulation performance and memory, so it's worth benchmarking, or at least confirming the scheduler handles this volume gracefully.

Also, the partition expansion happens after the offering cache lookup, so the cached offerings get expanded on every call; the expansion work itself isn't cached (I think).

In my head, I do wonder whether we need to expand the offerings here at this stage, or whether this can be handled another way that doesn't put a direct multiplier on the offering count. I'll have to think more about this, but let me know if you have any ideas. At least setting skew on up to 7 topology domains shouldn't strain the placement sim too hard (unlike something like hostname spread).
Good call out, I'll also need to think about this more and look into benchmarking.
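To make the multiplier and the uncached-expansion concern concrete, here's a minimal Go sketch. All types and names (`Offering`, `expander`, the cache key) are hypothetical stand-ins for the real cloudprovider types, and the memoization is just one possible shape, not the PR's actual design:

```go
package main

import "fmt"

// Offering is a minimal stand-in for a cloudprovider offering.
type Offering struct {
	Zone      string
	Partition int
}

// expandPartitions multiplies each offering by the partition count,
// mirroring the multiplier described in the comment above.
func expandPartitions(offerings []Offering, partitions int) []Offering {
	out := make([]Offering, 0, len(offerings)*partitions)
	for _, o := range offerings {
		for p := 0; p < partitions; p++ {
			o.Partition = p
			out = append(out, o)
		}
	}
	return out
}

// expander memoizes the expansion per cache key so the work is done once
// per cached offering set instead of on every call.
type expander struct {
	cache map[string][]Offering
}

func (e *expander) expanded(key string, offerings []Offering, partitions int) []Offering {
	if v, ok := e.cache[key]; ok {
		return v
	}
	v := expandPartitions(offerings, partitions)
	e.cache[key] = v
	return v
}

func main() {
	// 500 instance types x 3 AZs x 2 capacity types = 3,000 base offerings.
	base := make([]Offering, 3000)
	e := &expander{cache: map[string][]Offering{}}
	fmt.Println(len(e.expanded("nodeclass-a", base, 7))) // 21000
}
```

Note the memoized slice is shared between callers, so this only works if the scheduler treats offerings as read-only.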
```go
// Get the instance details, skipping cache to get fresh partition data
inst, err := h.instanceProvider.Get(ctx, instanceID, instance.SkipCache)
```
This makes an API call to EC2 on every registration attempt, which could generate a lot of API traffic, especially when things are just a little slow or a lot of instances are being provisioned at once. There's no back-off or retry logic around it, and it looks like it just re-queues the request indefinitely.

Worth considering some bounded retry logic with back-off in case things go sideways with the API. It might also be good to have a timeout after which this fails, and to think through how to handle that failure in a way the customer would be happy with.
Ok, we talked about this one. It sounds like the overall reconciler has a retry/back-off that should catch this.
```go
// Placement group labels
if nodeClass != nil && len(nodeClass.Status.PlacementGroup) > 0 {
	pg := nodeClass.Status.PlacementGroup[0]
	labels[v1.LabelPlacementGroupID] = pg.ID
```
The instanceToNodeClaim function sets LabelPlacementGroupID from the NodeClass status, but doesn't set LabelPlacementGroupPartition. The partition label is only set by the registration hook after a DescribeInstances call. This means there's a window where the NodeClaim has the PG ID label but not the partition label. If the registration hook fails or the controller restarts during this window, the partition label might never get set. Is there a reconciliation path that re-runs the hook? Should I be worried about this gap or do we not see a code path where this could go wrong on us?
IIUC if the node registration hook fails, we should still retry since the node isn't registered yet. If the controller restarts, we'll see that the node hasn't been registered yet at start up and continue on trying to register it.
```go
pg := &out.PlacementGroups[0]
p.cache.SetDefault(q.CacheKey(), pg)
```
The provider caches the placement group result with SetDefault (using the default TTL). If a placement group transitions from available to deleting while cached, Karpenter will keep using the stale available result until the cache expires. The DescribePlacementGroupsInput filters by state=available, so a re-query would correctly return nothing, but the cache prevents that re-query. The cache TTL isn't visible in this diff and I'm not sure what it is off-hand, so it's worth checking what it is and whether it's appropriate for a resource that could be deleted.

Also worth noting that the "not found" case (line 69, return nil, nil) is not cached, so repeated lookups for a nonexistent PG will hit EC2 every time, while a PG that was found and later gets deleted will keep returning the stale cached result.
The TTL is defined here: https://github.com/aws/karpenter-provider-aws/pull/9030/changes/BASE..0421ea8d944b715942e79f5cdf70eafdd50b7ea0#diff-964e351ee2375d359c78d69e514c4edc42577219761c4475f391ed2daf715e51R36. I currently have it set to 24 hours, but maybe something more like an hour, or maybe 6, is more reasonable? I'll also add a delete for the not-found case.
pkg/apis/v1/ec2nodeclass_status.go
Outdated
```go
CapacityReservations []CapacityReservation `json:"capacityReservations,omitempty"`
// PlacementGroup contains the current placement group values that are available to this NodeClass.
// +optional
PlacementGroup []PlacementGroup `json:"placementGroup,omitempty"`
```
The status field is PlacementGroup []PlacementGroup, but the reconciler only ever sets it to nil or a single-element slice, and every consumer does len(pgs) > 0 then pgs[0]. Using a pointer *PlacementGroup would be a cleaner API and avoid the repeated length-check-then-index pattern throughout the codebase. Anyway, just a design nit; ignore if there's a good reason to use a slice here.
force-pushed from c44d044 to d0e24ad
```go
CapacityReservations []CapacityReservation `json:"capacityReservations,omitempty"`
// PlacementGroups contains the placement group values that are available to this NodeClass.
// +optional
PlacementGroups []PlacementGroup `json:"placementGroups,omitempty"`
```
The design states "Each EC2NodeClass maps to exactly one placement group," and the selector enforces this (single name or ID, not a list of terms) unlike subnets/security groups/capacity reservations where selectors can match multiple resources. Why not use a *PlacementGroup instead of []PlacementGroup?
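For illustration, a sketch of what the pointer-shaped status could look like. The types are simplified stand-ins, not the actual EC2NodeClass API, and `labelID` is a hypothetical consumer:

```go
package main

import "fmt"

// PlacementGroup mirrors the status type in shape only; simplified here.
type PlacementGroup struct {
	ID   string
	Name string
}

// Status uses a pointer: nil means "no placement group", which replaces the
// len(pgs) > 0 then pgs[0] pattern with a single nil check.
type Status struct {
	PlacementGroup *PlacementGroup `json:"placementGroup,omitempty"`
}

// labelID shows the simplified access pattern a consumer would use.
func labelID(s Status) (string, bool) {
	if s.PlacementGroup == nil {
		return "", false
	}
	return s.PlacementGroup.ID, true
}

func main() {
	var s Status
	_, ok := labelID(s)
	fmt.Println(ok) // false: no placement group selected
	s.PlacementGroup = &PlacementGroup{ID: "pg-0123", Name: "cluster"}
	id, ok := labelID(s)
	fmt.Println(id, ok) // pg-0123 true
}
```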
pkg/apis/v1/ec2nodeclass_status.go
Outdated
```go
ConditionTypeSecurityGroupsReady,
ConditionTypeInstanceProfileReady,
ConditionTypeValidationSucceeded,
ConditionTypePlacementGroupReady,
```
Should PlacementGroupReady only be included in StatusConditions() when spec.placementGroupSelector is set? Currently it's added unconditionally, which means every EC2NodeClass, even those without a placement group, gets this condition initialized to Unknown, keeping the root Ready condition Unknown until the reconciler runs and sets it to True. If the spec doesn't select a PG, it doesn't make sense for the status to carry the condition, in my opinion.
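A minimal sketch of conditionally including the condition. Condition names are abbreviated strings and the helper is hypothetical, not the real StatusConditions() signature:

```go
package main

import "fmt"

// statusConditions returns the condition types for a node class, including
// PlacementGroupReady only when a placement group selector is set.
func statusConditions(hasPGSelector bool) []string {
	conds := []string{
		"SubnetsReady",
		"SecurityGroupsReady",
		"InstanceProfileReady",
		"ValidationSucceeded",
	}
	if hasPGSelector {
		conds = append(conds, "PlacementGroupReady")
	}
	return conds
}

func main() {
	// Without a selector there is no PlacementGroupReady condition to sit at
	// Unknown and hold the root Ready condition back.
	fmt.Println(len(statusConditions(false))) // 4
	fmt.Println(len(statusConditions(true)))  // 5
}
```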
```go
func PlacementGroupStateFromEC2(state ec2types.PlacementGroupState) (PlacementGroupState, error) {
	if state == "" {
```
Why does this default to available when the state is empty? Unlike PlacementGroupStrategyFromEC2, which errors on empty input, this silently assumes the PG is ready to use when we don't actually know its state. If EC2 returns an empty state, shouldn't we treat that as an error rather than optimistically defaulting to available? An empty state may be unlikely, but assuming it means available may not be the right choice.
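For comparison, a sketch that treats the empty state as an error. This is a hypothetical stand-in, not the actual PlacementGroupStateFromEC2; the state strings are EC2's documented placement group states:

```go
package main

import (
	"errors"
	"fmt"
)

type PlacementGroupState string

// stateFromEC2 converts an EC2 placement group state string, treating an
// empty state as an error instead of defaulting to available.
func stateFromEC2(state string) (PlacementGroupState, error) {
	switch state {
	case "":
		return "", errors.New("placement group state is empty")
	case "pending", "available", "deleting", "deleted":
		return PlacementGroupState(state), nil
	default:
		return "", fmt.Errorf("unknown placement group state %q", state)
	}
}

func main() {
	if _, err := stateFromEC2(""); err != nil {
		fmt.Println("empty state rejected:", err)
	}
	s, _ := stateFromEC2("available")
	fmt.Println(s) // available
}
```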
force-pushed from d0e24ad to 190e85e
joshuakguo
left a comment
/karpenter snapshot
Snapshot successfully published.
Fixes #3324
Description
Adds the labels `karpenter.k8s.aws/placement-group-id` and `karpenter.k8s.aws/placement-group-partition` (only for partition placement groups) for instances launched in placement groups.

How was this change tested?
Does this change impact docs?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.