Conversation
Preview deployment ready! Preview URL: https://pr-9030.d18coufmbnnaag.amplifyapp.com

force-pushed from f23151b to 0421ea8
nathangeology
left a comment
Placement groups are a great feature to support. Left a few comments, I hope they are helpful!
```go
// expandPartitionOfferings expands each offering into N offerings (one per partition) for partition placement groups.
// This enables the scheduler to use TopologySpreadConstraints with the partition topology key.
func (p *DefaultProvider) expandPartitionOfferings(offerings cloudprovider.Offerings, nodeClass NodeClass) cloudprovider.Offerings {
```
expandPartitionOfferings takes every offering and multiplies it by partitionCount, and EC2 supports up to 7 partitions. With, say, 500 instance types × 3 AZs × 2 capacity types × 7 partitions, you're looking at 21,000 offerings instead of 3,000. That could meaningfully impact scheduler simulation performance and memory, so it's worth benchmarking, or at least confirming the scheduler handles this volume gracefully.

Also, the partition expansion happens after the offering cache lookup, so the cached offerings get expanded on every call; the expansion work itself isn't cached (I think).

In my head, I do wonder whether we need to expand the offerings here at this stage, or whether this can be handled another way that doesn't put a direct multiplier on the offering count. I'll have to think more about this, but let me know if you have any ideas. At least setting skew on up to 7 topology domains shouldn't strain the placement sim too hard (unlike something like hostname spread).
Good call out, I'll also need to think about this more and look into benchmarking.
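To make the multiplier and the uncached-expansion concern concrete, here's a minimal Go sketch. All types and names (`Offering`, `expander`, the cache key) are hypothetical stand-ins for the real cloudprovider types, and the memoization is just one possible shape, not the PR's actual design:

```go
package main

import "fmt"

// Offering is a minimal stand-in for a cloudprovider offering.
type Offering struct {
	Zone      string
	Partition int
}

// expandPartitions multiplies each offering by the partition count,
// mirroring the multiplier described in the comment above.
func expandPartitions(offerings []Offering, partitions int) []Offering {
	out := make([]Offering, 0, len(offerings)*partitions)
	for _, o := range offerings {
		for p := 0; p < partitions; p++ {
			o.Partition = p
			out = append(out, o)
		}
	}
	return out
}

// expander memoizes the expansion per cache key so the work is done once
// per cached offering set instead of on every call.
type expander struct {
	cache map[string][]Offering
}

func (e *expander) expanded(key string, offerings []Offering, partitions int) []Offering {
	if v, ok := e.cache[key]; ok {
		return v
	}
	v := expandPartitions(offerings, partitions)
	e.cache[key] = v
	return v
}

func main() {
	// 500 instance types x 3 AZs x 2 capacity types = 3,000 base offerings.
	base := make([]Offering, 3000)
	e := &expander{cache: map[string][]Offering{}}
	fmt.Println(len(e.expanded("nodeclass-a", base, 7))) // 21000
}
```

Note the memoized slice is shared between callers, so this only works if the scheduler treats offerings as read-only.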
```go
// Get the instance details, skipping cache to get fresh partition data
inst, err := h.instanceProvider.Get(ctx, instanceID, instance.SkipCache)
```
This makes an API call to EC2 on every registration attempt, which could generate a lot of API traffic, especially when things are just a little slow or a lot of instances are being provisioned at once. There's no back-off or retry logic around it, and it looks like it just re-queues the request indefinitely.

Worth considering some bounded retry logic with back-off in case things go sideways with the API. It might also be good to have a timeout after which this fails, and to think through how to handle that failure in a way the customer would be happy with.
Ok, we talked about this one. It sounds like the overall reconciler has a retry/back-off that should catch this.
```go
// Placement group labels
if nodeClass != nil && len(nodeClass.Status.PlacementGroup) > 0 {
	pg := nodeClass.Status.PlacementGroup[0]
	labels[v1.LabelPlacementGroupID] = pg.ID
```
The instanceToNodeClaim function sets LabelPlacementGroupID from the NodeClass status, but doesn't set LabelPlacementGroupPartition. The partition label is only set by the registration hook after a DescribeInstances call. This means there's a window where the NodeClaim has the PG ID label but not the partition label. If the registration hook fails or the controller restarts during this window, the partition label might never get set. Is there a reconciliation path that re-runs the hook? Should I be worried about this gap or do we not see a code path where this could go wrong on us?
IIUC if the node registration hook fails, we should still retry since the node isn't registered yet. If the controller restarts, we'll see that the node hasn't been registered yet at start up and continue on trying to register it.
```go
pg := &out.PlacementGroups[0]
p.cache.SetDefault(q.CacheKey(), pg)
```
The provider caches the placement group result with SetDefault (using the default TTL). If a placement group transitions from available to deleting while cached, Karpenter will keep using the stale available result until the cache expires. The DescribePlacementGroupsInput filters by state=available, so a re-query would correctly return nothing, but the cache prevents that re-query. The cache TTL isn't visible in this diff and I'm not sure what it is off-hand, so it's worth checking what it is and whether it's appropriate for a resource that could be deleted.

Also worth noting that the "not found" case (line 69, return nil, nil) is not cached, so repeated lookups for a nonexistent PG will hit EC2 every time, while a PG that was found and later gets deleted will keep returning the stale cached result.
The TTL is defined here: https://github.com/aws/karpenter-provider-aws/pull/9030/changes/BASE..0421ea8d944b715942e79f5cdf70eafdd50b7ea0#diff-964e351ee2375d359c78d69e514c4edc42577219761c4475f391ed2daf715e51R36. I currently have it set to 24 hours, but maybe something more like an hour, or maybe 6, is more reasonable? I'll also add a delete for the not-found case.
pkg/apis/v1/ec2nodeclass_status.go
Outdated
```go
CapacityReservations []CapacityReservation `json:"capacityReservations,omitempty"`
// PlacementGroup contains the current placement group values that are available to this NodeClass.
// +optional
PlacementGroup []PlacementGroup `json:"placementGroup,omitempty"`
```
The status field is PlacementGroup []PlacementGroup, but the reconciler only ever sets it to nil or a single-element slice, and every consumer does len(pgs) > 0 then pgs[0]. Using a pointer *PlacementGroup would be a cleaner API and avoid the repeated length-check-then-index pattern throughout the codebase. Anyway, just a design nit; ignore if there's a good reason to use a slice here.
force-pushed from c44d044 to d0e24ad
```go
CapacityReservations []CapacityReservation `json:"capacityReservations,omitempty"`
// PlacementGroups contains the placement group values that are available to this NodeClass.
// +optional
PlacementGroups []PlacementGroup `json:"placementGroups,omitempty"`
```
The design states "Each EC2NodeClass maps to exactly one placement group," and the selector enforces this (single name or ID, not a list of terms) unlike subnets/security groups/capacity reservations where selectors can match multiple resources. Why not use a *PlacementGroup instead of []PlacementGroup?
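For illustration, a sketch of what the pointer-shaped status could look like. The types are simplified stand-ins, not the actual EC2NodeClass API, and `labelID` is a hypothetical consumer:

```go
package main

import "fmt"

// PlacementGroup mirrors the status type in shape only; simplified here.
type PlacementGroup struct {
	ID   string
	Name string
}

// Status uses a pointer: nil means "no placement group", which replaces the
// len(pgs) > 0 then pgs[0] pattern with a single nil check.
type Status struct {
	PlacementGroup *PlacementGroup `json:"placementGroup,omitempty"`
}

// labelID shows the simplified access pattern a consumer would use.
func labelID(s Status) (string, bool) {
	if s.PlacementGroup == nil {
		return "", false
	}
	return s.PlacementGroup.ID, true
}

func main() {
	var s Status
	_, ok := labelID(s)
	fmt.Println(ok) // false: no placement group selected
	s.PlacementGroup = &PlacementGroup{ID: "pg-0123", Name: "cluster"}
	id, ok := labelID(s)
	fmt.Println(id, ok) // pg-0123 true
}
```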
pkg/apis/v1/ec2nodeclass_status.go
Outdated
```go
ConditionTypeSecurityGroupsReady,
ConditionTypeInstanceProfileReady,
ConditionTypeValidationSucceeded,
ConditionTypePlacementGroupReady,
```
Should PlacementGroupReady only be included in StatusConditions() when spec.placementGroupSelector is set? Currently it's added unconditionally, which means every EC2NodeClass, even those without a placement group, gets this condition initialized to Unknown, keeping the root Ready condition Unknown until the reconciler runs and sets it to True. If the spec doesn't select a PG, it doesn't make sense for the status to carry the condition, in my opinion.
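A minimal sketch of conditionally including the condition. Condition names are abbreviated strings and the helper is hypothetical, not the real StatusConditions() signature:

```go
package main

import "fmt"

// statusConditions returns the condition types for a node class, including
// PlacementGroupReady only when a placement group selector is set.
func statusConditions(hasPGSelector bool) []string {
	conds := []string{
		"SubnetsReady",
		"SecurityGroupsReady",
		"InstanceProfileReady",
		"ValidationSucceeded",
	}
	if hasPGSelector {
		conds = append(conds, "PlacementGroupReady")
	}
	return conds
}

func main() {
	// Without a selector there is no PlacementGroupReady condition to sit at
	// Unknown and hold the root Ready condition back.
	fmt.Println(len(statusConditions(false))) // 4
	fmt.Println(len(statusConditions(true)))  // 5
}
```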
```go
func PlacementGroupStateFromEC2(state ec2types.PlacementGroupState) (PlacementGroupState, error) {
	if state == "" {
```
Why does this default to available when the state is empty? Unlike PlacementGroupStrategyFromEC2, which errors on empty input, this silently assumes the PG is ready to use when we don't actually know its state. If EC2 returns an empty state, shouldn't we treat that as an error rather than optimistically defaulting to available? An empty state may be unlikely, but assuming it means available may not be the right choice.
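For comparison, a sketch that treats the empty state as an error. This is a hypothetical stand-in, not the actual PlacementGroupStateFromEC2; the state strings are EC2's documented placement group states:

```go
package main

import (
	"errors"
	"fmt"
)

type PlacementGroupState string

// stateFromEC2 converts an EC2 placement group state string, treating an
// empty state as an error instead of defaulting to available.
func stateFromEC2(state string) (PlacementGroupState, error) {
	switch state {
	case "":
		return "", errors.New("placement group state is empty")
	case "pending", "available", "deleting", "deleted":
		return PlacementGroupState(state), nil
	default:
		return "", fmt.Errorf("unknown placement group state %q", state)
	}
}

func main() {
	if _, err := stateFromEC2(""); err != nil {
		fmt.Println("empty state rejected:", err)
	}
	s, _ := stateFromEC2("available")
	fmt.Println(s) // available
}
```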
force-pushed from d0e24ad to 190e85e
joshuakguo
left a comment
/karpenter snapshot
Snapshot successfully published.
Fixes #3324
Description
Adds the labels `karpenter.k8s.aws/placement-group-id` and `karpenter.k8s.aws/placement-group-partition` (only for partition placement groups) for instances launched in placement groups.

How was this change tested?
Does this change impact docs?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.