feat: support network interface configuration by ryan-mist · Pull Request #9027 · aws/karpenter-provider-aws

ryan-mist · 2026-03-23T09:02:50Z

Fixes #N/A

Description

Adds support for configuring network interfaces on the EC2NodeClass:

spec:
  networkInterfaces:
    - networkCardIndex: int
      deviceIndex: int
      interfaceType: "interface" | "efa-only"

Adds the Node label karpenter.k8s.aws/instance-efa-count, which is set when instances are launched with EFA devices.
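As a rough illustration of how the label value relates to the spec, the sketch below counts the "efa-only" entries in a NodeClass-style interface list. The NetworkInterface struct and the efaCount helper are hypothetical stand-ins written for this description, not the types added by the PR, and the one-device-per-entry rule is an assumption drawn from the manual tests below.

```go
package main

import "fmt"

// NetworkInterface mirrors the fields shown in the spec above.
// It is a stand-in for illustration, not the type added by this PR.
type NetworkInterface struct {
	NetworkCardIndex int
	DeviceIndex      int
	InterfaceType    string // "interface" or "efa-only"
}

// efaCount returns the number of EFA devices implied by the configuration,
// assuming each "efa-only" entry contributes exactly one EFA device.
func efaCount(nis []NetworkInterface) int {
	n := 0
	for _, ni := range nis {
		if ni.InterfaceType == "efa-only" {
			n++
		}
	}
	return n
}

func main() {
	nis := []NetworkInterface{
		{NetworkCardIndex: 0, DeviceIndex: 0, InterfaceType: "interface"},
		{NetworkCardIndex: 0, DeviceIndex: 1, InterfaceType: "efa-only"},
	}
	fmt.Printf("karpenter.k8s.aws/instance-efa-count=%d\n", efaCount(nis))
}
```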

How was this change tested?

unit tests
manual testing

Manual Tests

Drift

# Apply NodeClass with no network interface configuration
ryanmist@c889f3b6ff52 efa % k apply -f nodeclass.yaml
ec2nodeclass.karpenter.k8s.aws/efa-1 created
ryanmist@c889f3b6ff52 efa % k get ec2nodeclass
NAME    READY   AGE
efa-1   True    25s
ryanmist@c889f3b6ff52 efa % k apply -f nodepool.yaml
nodepool.karpenter.sh/i4g-16xlarge-pool created

# Create Node; Launched with No EFA Configurations
ryanmist@c889f3b6ff52 efa % k apply -f pod.yaml
deployment.apps/non-efa-deployment configured
ryanmist@c889f3b6ff52 efa % aws ec2 describe-instances --instance-ids i-030f2ec9b6b7104fa  --query "Reservations[].Instances[].NetworkInterfaces[].InterfaceType"
[
    "interface"
]

# Change NodeClass Configuration 
ryanmist@c889f3b6ff52 efa % k apply -f nodeclass.yaml
ec2nodeclass.karpenter.k8s.aws/efa-1 configured

# Node is Drifted
    Last Transition Time:  2026-03-23T07:34:36Z
    Message:               NodeClassDrift
    Observed Generation:   1
    Reason:                NodeClassDrift
    Status:                True
    Type:                  Drifted

# New Node Launched with EFA
ryanmist@c889f3b6ff52 efa % aws ec2 describe-instances --instance-ids i-0d2a8ae49f5ab9137  --query "Reservations[].Instances[].NetworkInterfaces[].InterfaceType"
[
    "efa-only",
    "interface"
]

Static Provisioning

# Apply NodeClass And Static NodePool
# Multiple NodeClaims provisioned
NAME                      TYPE           CAPACITY   ZONE         NODE   READY     AGE
i4g-16xlarge-pool-dk5sb   i4g.16xlarge   spot       us-west-2d          Unknown   16s
i4g-16xlarge-pool-r5t6h   i4g.16xlarge   spot       us-west-2d          Unknown   30s
# With Correct EFA Label and Allocatable / Capacity 
Labels:       karpenter.k8s.aws/ec2nodeclass=efa-1
              karpenter.k8s.aws/instance-efa-count=1
Status:
  Allocatable:
    vpc.amazonaws.com/efa:      1

Dynamic Provisioning

# When pods do not request EFA - instance is still launched with NodeClass config
ryanmist@c889f3b6ff52 efa % aws ec2 describe-instances --instance-ids i-00bca795a6864fbf7  --query "Reservations[].Instances[].NetworkInterfaces[].InterfaceType"
[
    "interface",
    "efa-only",
    "efa-only",
    "efa-only",
    "efa-only"
]
Labels:       karpenter.k8s.aws/ec2nodeclass=efa-1
              karpenter.k8s.aws/instance-efa-count=4
Allocatable:
    vpc.amazonaws.com/efa:      4

# When pods do request EFA - instance is launched with NodeClass config
ryanmist@c889f3b6ff52 efa % k apply -f efa-pod.yaml
deployment.apps/efa-deployment configured
Labels:       karpenter.k8s.aws/ec2nodeclass=efa-1
              karpenter.k8s.aws/instance-efa-count=4
Resources:
  Requests:
    Cpu:                    1150m
    Memory:                 100Mi
    Pods:                   4
    vpc.amazonaws.com/efa:  1
Status:
  Allocatable:
    Cpu:                        191450m
    Ephemeral - Storage:        17Gi
    Memory:                     1937871Mi
    nvidia.com/gpu:             8
    Pods:                       149
    vpc.amazonaws.com/efa:      4

# When a pod requests more EFA resources than the NodeClass provides
ryanmist@c889f3b6ff52 efa % k apply -f efa-pod.yaml
deployment.apps/five-efa-deployment created

From Karpenter Logs
{"level":"ERROR","time":"2026-03-23T08:47:09.824Z","logger":"controller","caller":"scheduling/scheduler.go:257","message":"could not schedule pod","commit":"cfc53f3-dirty","controller":"provisioner","namespace":"","name":"","reconcileID":"38b08750-bb1b-4ed3-9929-d5687cfd95f2","Pod":{"name":"five-efa-deployment-769b5785f6-7nqrs","namespace":"default"},"error":"no instance type has enough resources, requirements=karpenter.k8s.aws/ec2nodeclass In [efa-1], karpenter.sh/nodepool In [efa-pool], kubernetes.io/os In [linux], resources={\"cpu\":\"1150m\",\"memory\":\"100Mi\",\"pods\":\"4\",\"vpc.amazonaws.com/efa\":\"5\"}"}
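The scheduling failure above comes down to a per-resource comparison: the pod asks for 5 vpc.amazonaws.com/efa while the NodeClass configuration yields 4. A minimal sketch of that comparison follows. The fits helper is hypothetical, and quantities are simplified to plain integers; it is not Karpenter's scheduler code.

```go
package main

import "fmt"

// fits reports whether every requested resource is available in at least
// the requested quantity. Quantities are plain integers here; real
// Kubernetes quantities (e.g. "1150m") need proper parsing.
func fits(requests, allocatable map[string]int64) bool {
	for name, want := range requests {
		// A resource missing from allocatable reads as 0 and fails the check.
		if allocatable[name] < want {
			return false
		}
	}
	return true
}

func main() {
	allocatable := map[string]int64{"vpc.amazonaws.com/efa": 4}
	fmt.Println(fits(map[string]int64{"vpc.amazonaws.com/efa": 1}, allocatable))
	fmt.Println(fits(map[string]int64{"vpc.amazonaws.com/efa": 5}, allocatable))
}
```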

# When a pod requests EFA with a NodeClass that has no network interface configuration
ryanmist@c889f3b6ff52 efa % k apply -f efa-pod.yaml
deployment.apps/efa-deployment configured
ryanmist@c889f3b6ff52 efa % aws ec2 describe-instances --instance-ids i-035054426a7520613  --query "Reservations[].Instances[].NetworkInterfaces[].InterfaceType"
[
    "efa",
    "efa",
    "efa",
    "efa",
    "efa",
    "efa",
    "efa",
    "efa",
    "efa",
    "efa",
    "efa",
    "efa",
    "efa",
    "efa",
    "efa",
    "efa"
]

Does this change impact docs?

Yes, PR includes docs updates
Yes, issue opened: #
No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

nathangeology

Hope these comments help. Let me know if you have any questions.

pkg/providers/instance/instance.go

pkg/providers/instancetype/offering/offering.go

pkg/providers/launchtemplate/launchtemplate.go

pkg/controllers/nodeclass/validation.go

pkg/providers/instancetype/compatibility/compatibility.go

nathangeology · 2026-03-24T22:03:53Z

pkg/apis/v1/labels.go

 	LabelInstanceAcceleratorCount             = apis.Group + "/instance-accelerator-count"
 	LabelNodeClass                            = apis.Group + "/ec2nodeclass"
 	LabelInstanceTenancy                      = apis.Group + "/instance-tenancy"
+	LabelEFACount                             = apis.Group + "/instance-efa-count"


The new label karpenter.k8s.aws/instance-efa-count is added to WellKnownLabels and excluded from the nodepool CEL validation tests (users can't set it as a requirement directly). But in computeRequirements(), it's initially set to DoesNotExist and then conditionally overwritten with the actual count when networkInterfaces != nil. This means when a pod requests EFA resources without NodeClass network interface config, the label won't be set via requirements — it only gets set on the node via instanceToNodeClaim in cloudprovider.go. That asymmetry between scheduling requirements and actual node labels could cause issues if someone tries to use the label as a scheduling constraint.

It's a fair point. What I ran into was that there is no way for Karpenter to know about EFA configurations for instances provisioned without NodeClass network interface configurations (EFAs based on pod resource requests are decided at scheduling time).

The way I approached this was to support the scheduling label for static EFA configurations (i.e. preconfigured network interface configuration). Although, now that I'm thinking about it a bit more, I think it makes sense to drop support for this as a scheduling label in Karpenter and just keep it around as an informative label. Thoughts @jmdeal?

For some context, the reason we are adding this label is to provide a path forward for fixing this issue - aws/eks-charts#1239

What if we made the presence of the label trigger the dynamic EFA provisioning path? We would handle syncing the label to the NodeClaim in a similar way to how we handle syncing the capacity reservation labels - we'd only apply the label if it was explicitly requested via the NodeClaim requirements. This works around the existing limitation where we can't represent that an instance type may have a label with a given set of values, or it may not have that label.

That makes a lot of sense, I think that's probably the best way to do it. Changed.
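A minimal sketch of the approach suggested in this thread: the EFA count label is copied onto the NodeClaim only when the NodeClaim's requirements explicitly reference it. Requirement and syncEFALabel are hypothetical names used for illustration; the actual change works with Karpenter's own requirement types and may differ in detail.

```go
package main

import (
	"fmt"
	"strconv"
)

const labelEFACount = "karpenter.k8s.aws/instance-efa-count"

// Requirement is a simplified stand-in for a NodeClaim requirement.
type Requirement struct {
	Key    string
	Values []string
}

// syncEFALabel sets the EFA count label on labels only if some requirement
// explicitly references it; otherwise labels are left untouched.
func syncEFALabel(labels map[string]string, reqs []Requirement, efaCount int) {
	for _, r := range reqs {
		if r.Key == labelEFACount {
			labels[labelEFACount] = strconv.Itoa(efaCount)
			return
		}
	}
}

func main() {
	// The label was requested, so it is synced.
	requested := map[string]string{}
	syncEFALabel(requested, []Requirement{{Key: labelEFACount, Values: []string{"4"}}}, 4)
	fmt.Println(requested)

	// The label was not requested, so the map stays empty.
	notRequested := map[string]string{}
	syncEFALabel(notRequested, []Requirement{{Key: "kubernetes.io/os", Values: []string{"linux"}}}, 4)
	fmt.Println(notRequested)
}
```

Gating on the requirement sidesteps the limitation mentioned above, where an instance type cannot be described as "may or may not carry this label".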

pkg/cloudprovider/cloudprovider.go

pkg/apis/v1/ec2nodeclass.go

jmdeal · 2026-03-25T09:17:57Z

pkg/apis/v1/labels.go

 	LabelInstanceAcceleratorCount             = apis.Group + "/instance-accelerator-count"
 	LabelNodeClass                            = apis.Group + "/ec2nodeclass"
 	LabelInstanceTenancy                      = apis.Group + "/instance-tenancy"
+	LabelEFACount                             = apis.Group + "/instance-efa-count"


What if we made the presence of the label trigger the dynamic EFA provisioning path? We would handle syncing the label to the NodeClaim in a similar way to how we handle syncing the capacity reservation labels - we'd only apply the label if it was explicitly requested via the NodeClaim requirements. This works around the existing limitation where we can't represent that an instance type may have a label with a given set of values, or it may not have that label.

pkg/providers/instance/instance.go

pkg/providers/instancetype/compatibility/compatibility.go

pkg/providers/instancetype/offering/offering.go

pkg/apis/v1/ec2nodeclass.go

pkg/controllers/nodeclass/validation_test.go

github-actions · 2026-03-26T01:27:27Z

Preview deployment ready!

Preview URL: https://pr-9027.d18coufmbnnaag.amplifyapp.com

Built from commit ff174646bcfc676c072f846dfe3bff5eb52945aa

ryan-mist

/karpenter snapshot

github-actions · 2026-03-26T03:39:30Z

Snapshot successfully published to oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter:0-936b08d1884d1646c503ae66f451c7eb2a87d0c9.
To install you must login to the ECR repo with an AWS account:

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 021119463062.dkr.ecr.us-east-1.amazonaws.com

helm upgrade --install karpenter oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter --version "0-936b08d1884d1646c503ae66f451c7eb2a87d0c9" --namespace "kube-system" --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait

ryan-mist

/karpenter snapshot

github-actions · 2026-03-26T21:25:05Z

Snapshot successfully published to oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter:0-bbcdb60da8d671b2491de2e789ad84fd476a76ff.
To install you must login to the ECR repo with an AWS account:

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 021119463062.dkr.ecr.us-east-1.amazonaws.com

helm upgrade --install karpenter oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter --version "0-bbcdb60da8d671b2491de2e789ad84fd476a76ff" --namespace "kube-system" --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait

jmdeal · 2026-03-26T23:32:06Z

pkg/apis/v1/ec2nodeclass.go

 	InstanceStorePolicy *InstanceStorePolicy `json:"instanceStorePolicy,omitempty"`
+	// NetworkInterfaces specifies the network interface configurations to be attached to provisioned instances.
+	// +kubebuilder:validation:XValidation:message="networkInterfaces must not have duplicate networkCardIndex and deviceIndex pairs",rule="self.all(x, self.filter(y, x.networkCardIndex == y.networkCardIndex && x.deviceIndex == y.deviceIndex).size() == 1)"
+	// +kubebuilder:validation:XValidation:message="networkInterfaces must include a primary interface with interfaceType='interface'",rule="self.size() == 0 || self.exists(x, x.deviceIndex == 0 && x.networkCardIndex == 0 && x.interfaceType == 'interface')"


Does it always need to be device index 0 or does it just need to be on NC 0?

Yeah, it has to be the primary network interface (NC=0, DI=0).

From the docs (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html), there's a table with a row "Can be used as primary network interface for instance", and for EFA-only the answer is no. The docs also state:

A primary network interface has a device index of 0.
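For readers less familiar with CEL, the two validation rules in the diff above can be restated as a small Go check. This is only a sketch of the same logic under a simplified NetworkInterface stand-in type; the PR itself enforces these rules through the kubebuilder XValidation markers, not through code like this.

```go
package main

import (
	"errors"
	"fmt"
)

// NetworkInterface is a simplified stand-in for the NodeClass field type.
type NetworkInterface struct {
	NetworkCardIndex int
	DeviceIndex      int
	InterfaceType    string // "interface" or "efa-only"
}

// validate mirrors the two CEL rules: no duplicate (card, device) pairs, and
// a non-empty list must contain a primary interface at NC=0, DI=0 of type
// "interface".
func validate(nis []NetworkInterface) error {
	type key struct{ card, device int }
	seen := map[key]bool{}
	hasPrimary := false
	for _, ni := range nis {
		k := key{ni.NetworkCardIndex, ni.DeviceIndex}
		if seen[k] {
			return errors.New("networkInterfaces must not have duplicate networkCardIndex and deviceIndex pairs")
		}
		seen[k] = true
		if ni.NetworkCardIndex == 0 && ni.DeviceIndex == 0 && ni.InterfaceType == "interface" {
			hasPrimary = true
		}
	}
	if len(nis) > 0 && !hasPrimary {
		return errors.New("networkInterfaces must include a primary interface with interfaceType='interface'")
	}
	return nil
}

func main() {
	ok := []NetworkInterface{{0, 0, "interface"}, {1, 0, "efa-only"}}
	fmt.Println(validate(ok))

	// An efa-only interface cannot serve as the primary interface.
	bad := []NetworkInterface{{0, 0, "efa-only"}}
	fmt.Println(validate(bad))
}
```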

ryan-mist

/karpenter snapshot

github-actions · 2026-03-27T01:39:57Z

Snapshot successfully published to oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter:0-27fffb37a48dbadb9cbcc8e57c81bbd2e4da8b0e.
To install you must login to the ECR repo with an AWS account:

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 021119463062.dkr.ecr.us-east-1.amazonaws.com

helm upgrade --install karpenter oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter --version "0-27fffb37a48dbadb9cbcc8e57c81bbd2e4da8b0e" --namespace "kube-system" --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait

ryan-mist

/karpenter snapshot

github-actions · 2026-03-27T21:19:49Z

Snapshot successfully published to oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter:0-ff174646bcfc676c072f846dfe3bff5eb52945aa.
To install you must login to the ECR repo with an AWS account:

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 021119463062.dkr.ecr.us-east-1.amazonaws.com

helm upgrade --install karpenter oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter --version "0-ff174646bcfc676c072f846dfe3bff5eb52945aa" --namespace "kube-system" --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait

jamesmt-aws · 2026-03-28T04:33:57Z

pkg/providers/instancetype/compatibility/compatibility.go

+			return false
+		}
+		// (3) the configured number of device indices for a network card is greater than what the instance offers
+		if lo.FromPtr(info.NetworkInfo.NetworkCards[nci].MaximumNetworkInterfaces) <= networkInterface.DeviceIndex {


Can we look this up by field instead of by position, maybe with lo.Find()? I think the AWS API returns an index field, and I assume that's intentional. This is the kind of thing that would be a nightmare to debug on an exotic instance type.
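A sketch of the lookup-by-field idea, assuming the API's per-card info carries its own index. NetworkCardInfo here is a trimmed local stand-in, and findCard and deviceIndexFits are hypothetical helpers; a plain loop is used so the example has no dependencies, though lo.Find would express the same search.

```go
package main

import "fmt"

// NetworkCardInfo is a trimmed stand-in for the EC2 API's per-card info.
type NetworkCardInfo struct {
	NetworkCardIndex         *int32
	MaximumNetworkInterfaces *int32
}

// findCard searches by the card's own index field rather than trusting the
// slice position to match the card index.
func findCard(cards []NetworkCardInfo, index int32) (NetworkCardInfo, bool) {
	for _, c := range cards {
		if c.NetworkCardIndex != nil && *c.NetworkCardIndex == index {
			return c, true
		}
	}
	return NetworkCardInfo{}, false
}

// deviceIndexFits reports whether deviceIndex is within the card's limit.
// An unknown card or a missing limit is treated as incompatible.
func deviceIndexFits(cards []NetworkCardInfo, cardIndex, deviceIndex int32) bool {
	card, ok := findCard(cards, cardIndex)
	if !ok || card.MaximumNetworkInterfaces == nil {
		return false
	}
	return deviceIndex < *card.MaximumNetworkInterfaces
}

func ptr(v int32) *int32 { return &v }

func main() {
	// Cards deliberately listed out of order to show why position is unsafe.
	cards := []NetworkCardInfo{
		{NetworkCardIndex: ptr(1), MaximumNetworkInterfaces: ptr(2)},
		{NetworkCardIndex: ptr(0), MaximumNetworkInterfaces: ptr(4)},
	}
	fmt.Println(deviceIndexFits(cards, 0, 3))
	fmt.Println(deviceIndexFits(cards, 1, 3))
}
```

With positional indexing, the out-of-order slice above would compare device index 3 on card 0 against card 1's limit of 2 and wrongly reject it.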

jamesmt-aws · 2026-03-28T04:43:05Z

pkg/cloudprovider/suite_test.go

+			Expect(cloudProviderNodeClaim.Labels).To(HaveKey(v1.LabelEFACount))
 		})
-		It("shouldn't include vpc.amazonaws.com/efa on a nodeclaim if it doesn't request it", func() {
+		It("shouldn't include vpc.amazonaws.com/efa on a nodeclaim if it doesn't request it", func() { // FAILS


Do we just need separate labels for "has the capability at all" and "the provisioned count of EFA adapters"? I'm guessing most pods fill up the nodes entirely, so it won't come up much, but it would be good to make the design super clear about this.

feat: support network interface configuration

ecb1e4f

ryan-mist force-pushed the network-interface-support branch from cfc53f3 to ecb1e4f Compare March 23, 2026 15:52

ryan-mist requested a review from a team as a code owner March 23, 2026 15:52

ryan-mist requested a review from azishabibi March 23, 2026 15:52

nathangeology reviewed Mar 24, 2026

validation fix

04bca56

jmdeal reviewed Mar 25, 2026

ryan-mist added 2 commits March 25, 2026 10:31

fixes based on comments

76c505b

efa count label triggers EFA provisioning

ced7850

Merge branch 'main' into network-interface-support

8a4ebc3

ryan-mist force-pushed the network-interface-support branch from de1c210 to 8a4ebc3 Compare March 26, 2026 01:33

e2e test

936b08d

ryan-mist commented Mar 26, 2026

docs

6eed7cb

ryan-mist commented Mar 26, 2026

jmdeal reviewed Mar 26, 2026

bump e2e test time

27fffb3

ryan-mist force-pushed the network-interface-support branch from bbcdb60 to 27fffb3 Compare March 27, 2026 01:33

ryan-mist commented Mar 27, 2026

ryan-mist added 2 commits March 27, 2026 13:37

Merge branch 'main' into network-interface-support

757845a

fix merging

ff17464

ryan-mist commented Mar 27, 2026

jamesmt-aws reviewed Mar 28, 2026
