
feat: Placement Group Support #9030

Open
joshuakguo wants to merge 1 commit into aws:main from joshuakguo:placement-groups-support

Conversation

@joshuakguo
Contributor

Fixes #3324

Description

  • add support for configuring EC2 Placement Groups on the EC2NodeClass
spec:
  placementGroupSelector:
    name: string
    id: string
  • add node labels karpenter.k8s.aws/placement-group-id and karpenter.k8s.aws/placement-group-partition (only for partition placement groups) for instances launched in placement groups

How was this change tested?

  • unit tests
  • manual testing

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.


github-actions bot commented Mar 23, 2026

Preview deployment ready!

Preview URL: https://pr-9030.d18coufmbnnaag.amplifyapp.com

Built from commit 190e85efc9d157df8a60bbe5da752f65be0764b3

@joshuakguo joshuakguo force-pushed the placement-groups-support branch from f23151b to 0421ea8 Compare March 24, 2026 20:33

@nathangeology nathangeology left a comment


Placement groups are a great feature to support. Left a few comments, I hope they are helpful!


// expandPartitionOfferings expands each offering into N offerings (one per partition) for partition placement groups.
// This enables the scheduler to use TopologySpreadConstraints with the partition topology key.
func (p *DefaultProvider) expandPartitionOfferings(offerings cloudprovider.Offerings, nodeClass NodeClass) cloudprovider.Offerings {

expandPartitionOfferings takes every offering and multiplies it by partitionCount, and EC2 supports up to 7 partitions. With, say, 500 instance types × 3 AZs × 2 capacity types × 7 partitions, you're looking at 21,000 offerings instead of 3,000. That could meaningfully impact scheduler simulation performance and memory, so it's worth benchmarking, or at least confirming the scheduler handles this volume gracefully.

Also, the partition expansion happens after the offering cache lookup, so the cached offerings get expanded on every call; the expansion work itself isn't cached (I think). I do wonder whether we need to expand the offerings at this stage, or whether this can be handled another way that doesn't put a direct multiplier on the offerings count. I'll have to think more about this, but let me know if you have any ideas there. At least setting skew on up to 7 topology domains shouldn't strain the placement sim too hard (unlike something like hostname spread).
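The multiplier is easy to sanity-check with simple arithmetic. A toy sketch (the instance-type, AZ, and capacity-type counts here are illustrative, not measurements from a real cluster):

```go
package main

import "fmt"

// offeringCount returns the number of offerings the scheduler would see when
// each (instance type, AZ, capacity type) combination is expanded into one
// offering per partition.
func offeringCount(instanceTypes, azs, capacityTypes, partitions int) int {
	return instanceTypes * azs * capacityTypes * partitions
}

func main() {
	base := offeringCount(500, 3, 2, 1)     // no partition expansion
	expanded := offeringCount(500, 3, 2, 7) // max partitions for a partition PG
	fmt.Printf("base=%d expanded=%d (%dx)\n", base, expanded, expanded/base)
	// prints: base=3000 expanded=21000 (7x)
}
```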

Contributor Author


Good call out, I'll also need to think about this more and look into benchmarking.

}

// Get the instance details, skipping cache to get fresh partition data
inst, err := h.instanceProvider.Get(ctx, instanceID, instance.SkipCache)

This is making an API call to EC2 on every registration attempt, which could generate a lot of API traffic, especially when things are just a little slow or a lot of instances are being provisioned at once. There's no back-off or retry logic around it, and it looks like it just re-queues the request indefinitely. Worth considering some bounded retry logic with back-off in case things go sideways with the API. It might also be good to add a timeout after which this fails, and to think through how to handle that failure in a way the customer would be happy with.


Ok, we talked about this one. It sounds like the overall reconciler has a retry/back-off that should catch this.

// Placement group labels
if nodeClass != nil && len(nodeClass.Status.PlacementGroup) > 0 {
pg := nodeClass.Status.PlacementGroup[0]
labels[v1.LabelPlacementGroupID] = pg.ID

@nathangeology nathangeology Mar 25, 2026


The instanceToNodeClaim function sets LabelPlacementGroupID from the NodeClass status, but doesn't set LabelPlacementGroupPartition. The partition label is only set by the registration hook after a DescribeInstances call. This means there's a window where the NodeClaim has the PG ID label but not the partition label. If the registration hook fails or the controller restarts during this window, the partition label might never get set. Is there a reconciliation path that re-runs the hook? Should I be worried about this gap or do we not see a code path where this could go wrong on us?

Contributor Author


IIUC if the node registration hook fails, we should still retry since the node isn't registered yet. If the controller restarts, we'll see that the node hasn't been registered yet at start up and continue on trying to register it.

}

pg := &out.PlacementGroups[0]
p.cache.SetDefault(q.CacheKey(), pg)

The provider caches the placement group result with SetDefault (using the default TTL). If a placement group transitions from available to deleting while cached, Karpenter will keep using the stale available result until the cache expires. The DescribePlacementGroupsInput filters by state=available, so a re-query would correctly return nothing, but the cache prevents that re-query. The cache TTL isn't visible in this diff and I'm not sure what it is off-hand, so it's worth checking what it is and whether it's appropriate for a resource that can be deleted.

Also worth noting: the "not found" case (line 69, return nil, nil) is not cached, so repeated lookups for a nonexistent PG will hit EC2 every time, while a PG that was found and later deleted will keep returning the stale cached result.

Contributor Author


The TTL is defined here: https://github.com/aws/karpenter-provider-aws/pull/9030/changes/BASE..0421ea8d944b715942e79f5cdf70eafdd50b7ea0#diff-964e351ee2375d359c78d69e514c4edc42577219761c4475f391ed2daf715e51R36. I currently have it set to 24 hours, but maybe something more like 1 hour, or maybe 6, is more reasonable? I'll also add a delete for the not-found case.

CapacityReservations []CapacityReservation `json:"capacityReservations,omitempty"`
// PlacementGroup contains the current placement group values that are available to this NodeClass.
// +optional
PlacementGroup []PlacementGroup `json:"placementGroup,omitempty"`

The status field is PlacementGroup []PlacementGroup but the reconciler only ever sets it to nil or a single-element slice. Every consumer does len(pgs) > 0 then pgs[0]. Using a pointer *PlacementGroup would be a cleaner API and avoid the repeated length-check-then-index pattern throughout the codebase. Anyway, just a design nit, ignore if there's another good reason to use a slice here.

@joshuakguo joshuakguo force-pushed the placement-groups-support branch 2 times, most recently from c44d044 to d0e24ad Compare March 26, 2026 01:25
CapacityReservations []CapacityReservation `json:"capacityReservations,omitempty"`
// PlacementGroups contains the placement group values that are available to this NodeClass.
// +optional
PlacementGroups []PlacementGroup `json:"placementGroups,omitempty"`
Contributor


The design states "Each EC2NodeClass maps to exactly one placement group," and the selector enforces this (single name or ID, not a list of terms) unlike subnets/security groups/capacity reservations where selectors can match multiple resources. Why not use a *PlacementGroup instead of []PlacementGroup?

ConditionTypeSecurityGroupsReady,
ConditionTypeInstanceProfileReady,
ConditionTypeValidationSucceeded,
ConditionTypePlacementGroupReady,
Contributor


Should PlacementGroupReady only be included in StatusConditions() when spec.placementGroupSelector is set? Currently it's added unconditionally, which means every EC2NodeClass, even those without a placement group, gets this condition initialized to Unknown, keeping the root Ready condition Unknown until the reconciler runs and sets it to True. If the spec doesn't select a PG, it doesn't make sense for the status to carry the condition, in my opinion.
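A sketch of the conditional inclusion being asked about. The types and condition set here are simplified stand-ins, not the PR's exact code: the placement-group condition only joins the set when the spec actually selects one.

```go
package main

import "fmt"

const (
	ConditionTypeSubnetsReady        = "SubnetsReady"
	ConditionTypePlacementGroupReady = "PlacementGroupReady"
)

// PlacementGroupSelector is a stand-in for the spec field (name or ID).
type PlacementGroupSelector struct{ Name, ID string }

// EC2NodeClass is reduced to the one field this sketch needs.
type EC2NodeClass struct {
	PlacementGroupSelector *PlacementGroupSelector
}

// statusConditions includes PlacementGroupReady only when the spec selects
// a placement group, so nodeclasses without one never sit at Ready=Unknown
// waiting on an irrelevant condition.
func statusConditions(nc *EC2NodeClass) []string {
	conds := []string{ConditionTypeSubnetsReady}
	if nc.PlacementGroupSelector != nil {
		conds = append(conds, ConditionTypePlacementGroupReady)
	}
	return conds
}

func main() {
	fmt.Println(statusConditions(&EC2NodeClass{}))
	fmt.Println(statusConditions(&EC2NodeClass{
		PlacementGroupSelector: &PlacementGroupSelector{Name: "my-pg"},
	}))
	// prints: [SubnetsReady]
	//         [SubnetsReady PlacementGroupReady]
}
```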

}

func PlacementGroupStateFromEC2(state ec2types.PlacementGroupState) (PlacementGroupState, error) {
if state == "" {
Contributor


Why does this default to available when the state is empty? Unlike PlacementGroupStrategyFromEC2 which errors on empty input, this silently assumes the PG is ready to use when we don't actually know its state. If EC2 returns an empty state, shouldn't we treat that as an error rather than optimistically defaulting to available? It could be unlikely but assuming it as available may not be the right choice.
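What the stricter behavior could look like. This is a sketch of the suggestion, not the PR's code: empty state is an error, same as an unrecognized state, rather than an optimistic default to available. The function name and state list are assumptions.

```go
package main

import "fmt"

type PlacementGroupState string

const PlacementGroupStateAvailable PlacementGroupState = "available"

// placementGroupStateFromEC2 rejects an empty state instead of defaulting
// to available: if EC2 didn't tell us the state, we don't know it.
func placementGroupStateFromEC2(state string) (PlacementGroupState, error) {
	switch state {
	case "":
		return "", fmt.Errorf("placement group state is empty")
	case "available", "pending", "deleting", "deleted":
		return PlacementGroupState(state), nil
	default:
		return "", fmt.Errorf("unknown placement group state %q", state)
	}
}

func main() {
	_, err := placementGroupStateFromEC2("")
	fmt.Println("empty state is an error:", err != nil) // prints: empty state is an error: true

	s, _ := placementGroupStateFromEC2("available")
	fmt.Println("state:", s) // prints: state: available
}
```

This also makes the function symmetric with PlacementGroupStrategyFromEC2, which already errors on empty input.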

@joshuakguo joshuakguo force-pushed the placement-groups-support branch from d0e24ad to 190e85e Compare March 27, 2026 21:05
@joshuakguo joshuakguo marked this pull request as ready for review March 27, 2026 21:25
@joshuakguo joshuakguo requested a review from a team as a code owner March 27, 2026 21:25
@joshuakguo joshuakguo requested a review from jigisha620 March 27, 2026 21:25
Contributor Author

@joshuakguo joshuakguo left a comment


/karpenter snapshot

@github-actions

Snapshot successfully published to oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter:0-190e85efc9d157df8a60bbe5da752f65be0764b3.
To install you must login to the ECR repo with an AWS account:

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 021119463062.dkr.ecr.us-east-1.amazonaws.com

helm upgrade --install karpenter oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter --version "0-190e85efc9d157df8a60bbe5da752f65be0764b3" --namespace "kube-system" --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait
