EKS: Support tagging node group's underlying ASG #2884

Merged

Conversation

richardchen331
Contributor

@richardchen331 richardchen331 commented Oct 27, 2021

What type of PR is this?
/kind feature

What this PR does / why we need it:
There is an open issue on EKS where EKS doesn't support tagging the underlying ASGs. This matters to many organizations for compliance and cost tracking. This PR adds that support in CAPA: users can specify an optional AdditionalAsgTags field in AWSManagedMachinePool, and the node group controller reconciles the underlying ASG tags.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #2881

Special notes for your reviewer:

Checklist:

  • squashed commits
  • includes documentation
  • adds unit tests
  • adds or updates e2e tests

Release note:

Support tagging EKS node group's underlying ASG

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Oct 27, 2021
@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added needs-priority cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Oct 27, 2021
@k8s-ci-robot
Contributor

@richardchen331: This issue is currently awaiting triage.

If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 27, 2021
@k8s-ci-robot
Contributor

Welcome @richardchen331!

It looks like this is your first PR to kubernetes-sigs/cluster-api-provider-aws 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/cluster-api-provider-aws has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @richardchen331. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 27, 2021
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Oct 27, 2021
@richardchen331 richardchen331 changed the title Support tagging node group's underlying ASG EKS: Support tagging node group's underlying ASG Oct 27, 2021
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Oct 27, 2021
for k, v := range s.scope.ManagedMachinePool.Spec.AdditionalAsgTags {
// The k/vCopy is used to address the "Implicit memory aliasing in for loop" issue
// https://stackoverflow.com/questions/62446118/implicit-memory-aliasing-in-for-loop
kCopy := k
Contributor

Can we index the ranged map as given in the stackoverflow link?

for key := range s.scope.ManagedMachinePool.Spec.AdditionalAsgTags {
	key := key // can remove this and check whether it still throws the error
	input.Tags = append(input.Tags, &autoscaling.Tag{
		Key:               &key,
		PropagateAtLaunch: &trueVal,
		ResourceId:        asg.Name,
		ResourceType:      &asgType,
		Value:             &s.scope.ManagedMachinePool.Spec.AdditionalAsgTags[key],
	})
}
if _, err := s.AutoscalingClient.CreateOrUpdateTags(input); err != nil {
	return errors.Wrap(err, "failed to reconcile AutoScalingGroup tags for nodegroup")
}

Contributor Author

I tried changing to this but got the error `invalid operation: cannot take address of s.scope.ManagedMachinePool.Spec.AdditionalAsgTags[key] (map index expression of type string)`, so I'll keep it as is for now.
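
The compiler error quoted in this reply arises because Go map elements are not addressable, so `&m[k]` is rejected. A minimal, self-contained sketch of the copy-based workaround follows. The `tag` type and `buildTags` function are hypothetical stand-ins for illustration, not the PR's code or the AWS SDK's types.

```go
package main

import "fmt"

// tag is a hypothetical stand-in for an SDK tag type with pointer fields.
type tag struct {
	Key   *string
	Value *string
}

// buildTags converts a map of tags into a slice of tag structs.
func buildTags(tags map[string]string) []tag {
	out := make([]tag, 0, len(tags))
	for k, v := range tags {
		// &tags[k] does not compile: map elements are not addressable.
		// Copying into fresh per-iteration variables gives every tag its
		// own storage, and also avoids aliasing the loop variables on Go
		// versions before 1.22.
		kCopy, vCopy := k, v
		out = append(out, tag{Key: &kCopy, Value: &vCopy})
	}
	return out
}

func main() {
	// Map iteration order is unspecified, so the print order may vary.
	for _, t := range buildTags(map[string]string{"team": "infra", "env": "prod"}) {
		fmt.Printf("%s=%s\n", *t.Key, *t.Value)
	}
}
```

Each appended tag points at its own copies, so later iterations cannot overwrite earlier entries.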

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 1, 2021
@k8s-ci-robot k8s-ci-robot removed the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 8, 2021
@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Nov 8, 2021
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Nov 8, 2021
@richardchen331
Contributor Author

Updated code to remove unspecified ASG tags as well
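
Removing unspecified tags amounts to diffing the desired tag set against the tags currently on the ASG. The sketch below illustrates that idea only. `diffTags` is a hypothetical helper, not the function in this PR, and a real reconciler would likely also need to leave alone tags it does not own.

```go
package main

import (
	"fmt"
	"sort"
)

// diffTags compares desired tags against the tags currently on a resource.
// It returns the tags to create or update and the keys to delete.
func diffTags(desired, current map[string]string) (upsert map[string]string, remove []string) {
	upsert = map[string]string{}
	for k, v := range desired {
		// Missing or changed tags need to be written.
		if cur, ok := current[k]; !ok || cur != v {
			upsert[k] = v
		}
	}
	for k := range current {
		// Tags present on the resource but no longer specified get removed.
		if _, ok := desired[k]; !ok {
			remove = append(remove, k)
		}
	}
	sort.Strings(remove) // deterministic order for logging and tests
	return upsert, remove
}

func main() {
	desired := map[string]string{"team": "infra", "env": "prod"}
	current := map[string]string{"team": "platform", "stale": "yes"}
	upsert, remove := diffTags(desired, current)
	fmt.Println("create/update:", upsert)
	fmt.Println("delete:", remove)
}
```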

@richardcase
Member

Same issue. I will investigate

@richardchen331
Contributor Author

e2e tests passed after rebasing the PR. @richardcase do you mind taking a final look at this PR? Thanks!

@richardcase
Member

/test pull-cluster-api-provider-aws-e2e-eks

@richardcase
Member

Probably a flake, so:

/test pull-cluster-api-provider-aws-e2e-eks

@richardcase
Member

@richardchen331 - Looking at the CAPA logs for the e2e failure, I see:

I0117 11:20:31.049787       1 awsmanagedmachinepool_controller.go:175]  "msg"="Reconciling AWSManagedMachinePool"  
I0117 11:20:31.907542       1 tags.go:126]  "msg"="Reconciling ASG tags"  "cluster-name"="cluster-m075u4" "nodegroup-name"="eks-nodes-3eljk4_cluster-m075u4-pool-0"
I0117 11:20:32.057082       1 awsmanagedmachinepool_controller.go:195]  "msg"="Reconciling deletion of AWSManagedMachinePool" 

From the logs it looks like it starts reconciling the ASG tags but then either:

  • fails during reconciling
  • or, more likely, hits a timeout that causes the cluster to be deleted while the ASG tags are being reconciled

The timeout is the more likely cause, given this in the logs:

   Timed out after 1800.000s.
      
  Expected
      <int>: 1
  to equal
      <int>: 2 

@richardcase
Member

I've increased the timeout by 5 mins to see if it's purely unfortunate timing or if the extra time reveals any errors.

@richardcase
Member

/test pull-cluster-api-provider-aws-e2e-eks

@richardchen331
Contributor Author

@richardcase Looking at the e2e test log, it looks like:

  • The test timed out at "Waiting for the machine pool workload nodes to exist" after "Scaling the machine pool up" (log here, test implementation here)
  • It timed out because MachinePool.Status.ReadyReplicas (1) != MachinePool.Spec.Replicas (2)
  • Looking at the MP resource and the AWSMMP resource, the cloud resource did scale up, but one of the nodes is not ready.
  • This doesn't seem to be related to the code change. I'm going to rerun the test and see what happens.
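
The failure mode described in the bullets above, a poll that never sees ready replicas reach the desired count before the deadline, can be sketched as follows. This is only an illustration: `waitForReplicas` is a hypothetical helper, not the e2e framework's actual wait implementation.

```go
package main

import (
	"fmt"
	"time"
)

// waitForReplicas polls getReady until it reports the desired count or the
// timeout elapses. It is a hypothetical stand-in for an e2e wait helper.
func waitForReplicas(getReady func() int, desired int, interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		ready := getReady()
		if ready == desired {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("timed out: ready replicas %d != desired %d", ready, desired)
		}
		time.Sleep(interval)
	}
}

func main() {
	// One node never becomes ready, so the wait ends in a timeout error.
	stuck := func() int { return 1 }
	err := waitForReplicas(stuck, 2, 10*time.Millisecond, 50*time.Millisecond)
	fmt.Println(err)
}
```

With a node that cannot join the cluster, extending the timeout does not help. The ready count stays below the desired count however long the poll runs.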

@richardchen331
Contributor Author

/test pull-cluster-api-provider-aws-e2e-eks

@richardcase
Member

This is strange, as the periodic e2e tests for EKS are passing every night.

I can't see why this change would cause the e2e to fail.

I'll run the e2e locally on your branch to see if it highlights anything.

@richardchen331
Contributor Author

@richardcase I think I found the culprit. EKS nodes rely on the "kubernetes.io/cluster/cluster-name:owned" tag in order to join the cluster (https://docs.aws.amazon.com/eks/latest/userguide/troubleshooting.html#worker-node-fail). However, we are incorrectly updating the ASG tags because ManagedMachinePoolScope.ClusterName() returns the Cluster CR name instead of the actual EKS cluster name. I updated the PR; let's see if the tests pass.
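
A small sketch of why the name mix-up matters: the ownership tag key embeds the cluster name, so building it from the wrong name yields a different key. `clusterTag` and both example names are hypothetical and chosen only for illustration.

```go
package main

import "fmt"

// clusterTag returns the ownership tag key and value for a cluster name,
// following the "kubernetes.io/cluster/<cluster-name>: owned" convention.
func clusterTag(clusterName string) (key, value string) {
	return "kubernetes.io/cluster/" + clusterName, "owned"
}

func main() {
	crName := "my-cluster"                        // hypothetical Cluster CR name
	eksName := "default_my-cluster-control-plane" // hypothetical EKS cluster name

	wrongKey, _ := clusterTag(crName)
	rightKey, value := clusterTag(eksName)

	// The two keys differ, so a tag built from the CR name does not match
	// the tag that nodes of the EKS cluster are expected to carry.
	fmt.Println("from CR name: ", wrongKey, value)
	fmt.Println("from EKS name:", rightKey, value)
	fmt.Println("keys match:", wrongKey == rightKey)
}
```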

@richardcase
Member

Ah, we've had that issue in other places. Fingers crossed it works and the e2e tests pass. If not, I'll run them locally this morning.

@richardcase
Member

/test pull-cluster-api-provider-aws-e2e-eks

@richardchen331
Contributor Author

/test pull-cluster-api-provider-aws-e2e-eks

@richardcase
Member

/test pull-cluster-api-provider-aws-e2e-eks

@richardcase
Member

That's good, the EKS e2e passed.

@richardcase
Member

/test pull-cluster-api-provider-aws-test

@richardcase
Member

The tests are passing. I think this is good to go:

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 22, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: richardcase

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 22, 2022
@k8s-ci-robot k8s-ci-robot merged commit 8996bce into kubernetes-sigs:main Jan 22, 2022
@k8s-ci-robot k8s-ci-robot modified the milestones: v1.3.0, v1.x Jan 22, 2022
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

EKS: Support tagging node group's underlying ASG
5 participants