api error VPCIdNotSpecified: No default VPC for this user #7834

Closed
clevandowski opened this issue Mar 4, 2025 · 14 comments · Fixed by #7844
Assignees
Labels
bug Something isn't working priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. triage/needs-investigation Issues that need to be investigated before triaging

Comments

@clevandowski

Description

Observed Behavior:

On AWS EKS, I just upgraded Karpenter from v1.2.2 to v1.3.0, and the controller now logs the following error every minute (one line per nodeclass):

[pod/karpenter-659b59cc4-tqqm5/controller] {"level":"ERROR","time":"2025-03-04T08:32:01.706Z","logger":"controller","message":"Reconciler error","commit":"ff59416","controller":"nodeclass","controllerGroup":"karpenter.k8s.aws","controllerKind":"EC2NodeClass","EC2NodeClass":{"name":"stable"},"namespace":"","name":"stable","reconcileID":"5d38cd09-d39c-4ef7-a57d-7e2986000246","error":"validating ec2:RunInstances authorization, operation error EC2: RunInstances, https response error StatusCode: 400, RequestID: 3696d466-36c7-4cbf-820d-faeb3639b92c, api error VPCIdNotSpecified: No default VPC for this user. GroupName is only supported for EC2-Classic and default VPC."}
[pod/karpenter-659b59cc4-tqqm5/controller] {"level":"ERROR","time":"2025-03-04T08:32:02.190Z","logger":"controller","message":"Reconciler error","commit":"ff59416","controller":"nodeclass","controllerGroup":"karpenter.k8s.aws","controllerKind":"EC2NodeClass","EC2NodeClass":{"name":"ephemeral"},"namespace":"","name":"ephemeral","reconcileID":"bd5fef0b-e5e5-4de8-b2ff-42660716c633","error":"validating ec2:RunInstances authorization, operation error EC2: RunInstances, https response error StatusCode: 400, RequestID: 9a602b01-0f78-4831-b733-0679646c27d2, api error VPCIdNotSpecified: No default VPC for this user. GroupName is only supported for EC2-Classic and default VPC."}

Did I forget to update an IAM permission, or did I miss a tag or something else in the security group configuration?

Expected Behavior:

Controller does not log ERRORs

Reproduction Steps (Please include YAML):

Versions:

  • Chart Version: 1.3.0
  • Kubernetes Version (kubectl version):
Client Version: v1.31.6
Kustomize Version: v5.4.2
Server Version: v1.31.5-eks-8cce635
@clevandowski clevandowski added bug Something isn't working needs-triage Issues that need to be triaged labels Mar 4, 2025
@pszczypta-autopay

For me it worked in one environment after upgrading 1.2.2 -> 1.3.0, but it didn't work in another after a fresh installation.

@bhagtrajaram

bhagtrajaram commented Mar 4, 2025

Hi, I have the same experience: after upgrading from 1.2.2 to 1.3.0 it seems to work, but after a clean install it doesn't.

My ec2nodeclass looks like:

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: workers
spec:
  role: ${node_iam_role_name}
  amiFamily: ${ami_family}
  amiSelectorTerms:
    - id: ${ami_selected_id}
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 2
    httpTokens: required
  subnetSelectorTerms:
  - tags:
      karpenter.sh/discovery: ${cluster_name}
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: ${cluster_name}
  - id: ${cluster_primary_security_group_id}

The security group referenced by cluster_primary_security_group_id is associated with the VPC.

CloudTrail also shows the error. Maybe I'm missing something in the node class.

@mikkeloscar

mikkeloscar commented Mar 4, 2025

We see this and similar issues after upgrading from v1.2.0 to v1.3.0.

The error seems to come from new validation logic introduced in v1.3.0 in this PR: #7624

We also see this similar error:

{"level":"ERROR","time":"2025-03-04T13:14:25.758Z","logger":"controller","caller":"controller/controller.go:288","message":"Reconciler error","commit":"ff59416-dirty","controller":"nodeclass","controllerGroup":"karpenter.k8s.aws","controllerKind":"EC2NodeClass","EC2NodeClass":{"name":"karpenter-gpu"},"namespace":"","name":"karpenter-gpu","reconcileID":"bf5af574-b18f-4179-abf9-ad130eeeaecc","error":"validating ec2:RunInstances authorization, operation error EC2: RunInstances, https response error StatusCode: 400, RequestID: 27b29b9f-8a51-4afe-b452-3fd8ed41c5ed, api error MissingInput: No subnets found for the default VPC 'vpc-abc'. Please specify a subnet."}

@younsl

younsl commented Mar 5, 2025

I also encountered the same issue after upgrading the Karpenter chart from v1.2.1 to v1.3.0.

The karpenter controller pod logs show multiple VPCIdNotSpecified errors when trying to create instances. It seems like Karpenter is trying to use GroupName, which is only supported in EC2-Classic or the default VPC. However, our AWS account does not have a default VPC.

Karpenter pod error log:

kubectl logs -l app.kubernetes.io/name=karpenter -n kube-system
{"level":"ERROR","time":"2025-03-05T09:07:52.015Z","logger":"controller","message":"Reconciler error","commit":"ff59416","controller":"nodeclass","controllerGroup":"karpenter.k8s.aws","controllerKind":"EC2NodeClass","EC2NodeClass":{"name":"<REDACTED>"},"namespace":"","name":"<REDACTED>","reconcileID":"fff90e63-0b13-4ab8-a6b6-0a94461a2e26","error":"validating ec2:RunInstances authorization, operation error EC2: RunInstances, https response error StatusCode: 400, RequestID: e9ce7351-e9ad-43fd-ba6f-bc87279bfa97, api error VPCIdNotSpecified: No default VPC for this user. GroupName is only supported for EC2-Classic and default VPC."}
{"level":"ERROR","time":"2025-03-05T09:07:52.204Z","logger":"controller","message":"Reconciler error","commit":"ff59416","controller":"nodeclass","controllerGroup":"karpenter.k8s.aws","controllerKind":"EC2NodeClass","EC2NodeClass":{"name":"<REDACTED>"},"namespace":"","name":"default","reconcileID":"d1960fd9-270c-41e8-a0c4-9163ec7f8bb6","error":"validating ec2:RunInstances authorization, operation error EC2: RunInstances, https response error StatusCode: 400, RequestID: e7ee63c6-2e7b-4cf0-86bd-ea420389372a, api error VPCIdNotSpecified: No default VPC for this user. GroupName is only supported for EC2-Classic and default VPC."}
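For anyone triaging the same logs, here is a minimal Python sketch (not part of Karpenter) that tallies AWS API error codes per EC2NodeClass from the controller's JSON log lines. The function name `summarize` is hypothetical; the field names (`level`, `error`, `EC2NodeClass.name`) are taken from the log lines quoted in this thread.

```python
import json
import re
from collections import Counter

# Matches the AWS error code in messages such as
# "... api error VPCIdNotSpecified: No default VPC for this user."
API_ERROR = re.compile(r"api error (\w+):")


def summarize(lines):
    """Count (EC2NodeClass name, AWS error code) pairs in controller log lines."""
    counts = Counter()
    for line in lines:
        start = line.find("{")  # skip any "[pod/.../controller] " prefix
        if start < 0:
            continue  # no JSON object on this line
        try:
            entry = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # not a structured log line
        if entry.get("level") != "ERROR":
            continue
        match = API_ERROR.search(entry.get("error", ""))
        if match:
            nodeclass = entry.get("EC2NodeClass", {}).get("name", "?")
            counts[(nodeclass, match.group(1))] += 1
    return counts
```

Feeding it the output of the `kubectl logs` command above, for example via `summarize(sys.stdin)` after `import sys`, gives one count per node class and error code.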

@mikkeloscar

It looks like setting a security group is missing in the call to ec2:RunInstances.

@sidick

sidick commented Mar 5, 2025

I wonder, does this only affect AWS accounts that have no default VPC set up? It happened to me going from 1.2.2 to 1.3.0 too.

@MKnichal

MKnichal commented Mar 5, 2025

Tested on two accounts: one with a default VPC and a second without. The one with a default VPC works fine, without errors; the one without a default VPC produces a lot of errors.

{"level":"ERROR","time":"2025-03-05T11:36:33.400Z","logger":"controller","message":"Reconciler error","commit":"ff59416","controller":"nodeclass","controllerGroup":"karpenter.k8s.aws","controllerKind":"EC2NodeClass","EC2NodeClass":{"name":"eng-20gb"},"namespace":"","name":"eng-20gb","reconcileID":"db41c45c-162d-447b-89f7-78c34c506145","error":"validating ec2:RunInstances authorization, operation error EC2: RunInstances, https response error StatusCode: 400, RequestID: 2fd2431d-b512-46b7-875b-b9a36bb3bfe6, api error VPCIdNotSpecified: No default VPC for this user. GroupName is only supported for EC2-Classic and default VPC."}

@clevandowski
Author

I see exactly the same behavior as MKnichal, but in two different regions of the same account.
In the region with a default VPC, Karpenter does not log any errors, even though the nodeclass configuration is the same.
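
Based on the reports above, whether a region has a default VPC seems to decide whether the error appears, so it may help to check that directly. The sketch below assumes the JSON shape returned by the EC2 DescribeVpcs API (a `Vpcs` list whose entries carry `VpcId` and `IsDefault`), for example as printed by `aws ec2 describe-vpcs --region <region>`. The function name is hypothetical and the sample response is hand-written.

```python
import json


def default_vpc_ids(describe_vpcs_response):
    """Return the IDs of VPCs flagged as default in a DescribeVpcs response."""
    return [
        vpc["VpcId"]
        for vpc in describe_vpcs_response.get("Vpcs", [])
        if vpc.get("IsDefault")
    ]


# Hand-written sample; a real response contains many more fields per VPC.
sample = json.loads('{"Vpcs": [{"VpcId": "vpc-abc", "IsDefault": false}]}')
print(default_vpc_ids(sample) or "no default VPC in this region")
```

An empty result for a region would match the accounts and regions in this thread that log the error.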

@rschalo
Contributor

rschalo commented Mar 5, 2025

Thanks all for the reports, we'll get someone looking at this today.

/triage needs-investigation
/priority critical-urgent

@rschalo rschalo added triage/needs-investigation Issues that need to be investigated before triaging priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed needs-triage Issues that need to be triaged labels Mar 5, 2025
@rschalo
Contributor

rschalo commented Mar 5, 2025

/assign @jonathan-innis

@jonathan-innis
Contributor

Update: Looks like a miss in our authorization-check validation logic: it doesn't specify the subnets or security groups that are normally passed in through CreateFleet. I'm validating the fix now, but it should just be a matter of passing these in. I'm also updating our CI testing accounts so they don't have a default VPC, since that would have caught this ahead of time.
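
To illustrate the idea behind the fix, here is a rough Python sketch. It is not Karpenter's code (Karpenter is written in Go) and all function names are hypothetical. The parameter names follow the EC2 RunInstances API. As far as I understand that API, a request with no subnet is resolved against the default VPC, and a dry run reports `DryRunOperation` when the caller is authorized and `UnauthorizedOperation` when it is not.

```python
def dry_run_params(image_id, instance_type, subnet_id=None, security_group_ids=None):
    """Build keyword arguments for an EC2 RunInstances dry-run request."""
    params = {
        "DryRun": True,  # ask EC2 to check permissions without launching anything
        "ImageId": image_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
    }
    if subnet_id:
        params["SubnetId"] = subnet_id
    if security_group_ids:
        params["SecurityGroupIds"] = list(security_group_ids)
    return params


def targets_default_vpc(params):
    """True when the request names no subnet, so EC2 falls back to the default VPC."""
    return "SubnetId" not in params


def interpret_dry_run(error_code):
    """Map the error code of a dry-run response to an authorization verdict."""
    if error_code == "DryRunOperation":
        return "authorized"
    if error_code == "UnauthorizedOperation":
        return "unauthorized"
    # Codes such as VPCIdNotSpecified or MissingInput say nothing about IAM.
    return "inconclusive"
```

With boto3 installed, a dictionary like this could be passed as `boto3.client("ec2").run_instances(**params)`. Passing the subnet and security group IDs resolved from the EC2NodeClass keeps the check from depending on a default VPC.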

@drduker

drduker commented Mar 5, 2025

Hoping I can replace it with a new public image.

@drduker

drduker commented Mar 5, 2025

This error is gone for me now with the latest image from the PR that was merged, but I'm not able to get nodes up. I just see a bunch of "Starting Controller/etc" messages.

@elitistphoenix

Could we please get a comment when the new image is available? Still getting the issue as of now.
