
feat: Added configurable delay before addons creation #3214

Conversation


@vchepkov vchepkov commented Nov 25, 2024

Description

Currently, even though the cluster status is active, the addon API is not completely ready yet. This causes retries and significant delays during addon creation, potentially negating the before_compute delay.

Motivation and Context

When implementing VPC CNI custom networking, the addon fails to create within the configured dataplane_wait_duration interval, causing pods to use the host network. I opened a case with AWS support and they informed me that there is a roughly 3-minute delay between the cluster being marked ACTIVE and the addon API becoming ready; without a delay, VPC CNI creation (or any add-on, but VPC CNI is the critical one) can take up to 20 minutes.

I left the default at 0s for those who do not use custom networking, but in my test suite I set it to 3 minutes and each time the addon was created in under 20 seconds.
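
For context, a delay like this is typically implemented with the time_sleep resource from the hashicorp/time provider, which is the same pattern the module already uses for dataplane_wait_duration. Below is a minimal sketch of that pattern; the variable name addon_wait_duration and the resource addresses are illustrative, not necessarily what this PR uses.

variable "addon_wait_duration" {
  description = "Illustrative: how long to wait after the cluster is ACTIVE before creating addons"
  type        = string
  default     = "0s"
}

resource "time_sleep" "addons" {
  # The timer starts once the cluster resource has been created (status ACTIVE)
  create_duration = var.addon_wait_duration

  depends_on = [aws_eks_cluster.this]
}

resource "aws_eks_addon" "this" {
  cluster_name = aws_eks_cluster.this.name
  addon_name   = "vpc-cni"

  # Addon creation only begins after the configured delay has elapsed
  depends_on = [time_sleep.addons]
}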

How Has This Been Tested?

  • I have updated at least one of the examples/* to demonstrate and validate my change(s)
  • I have tested and validated these changes using one or more of the provided examples/* projects
  • I have executed pre-commit run -a on my pull request

@vchepkov vchepkov force-pushed the addon_delay_duration branch from 5373c51 to 3832b40 on November 26, 2024 00:00
@vchepkov vchepkov changed the title from "Add configurable delay before addons creation" to "featdd configurable delay before addons creation" Nov 26, 2024
@vchepkov vchepkov changed the title from "featdd configurable delay before addons creation" to "feat: add configurable delay before addons creation" Nov 26, 2024
@bryantbiggs
Member

I'm not following - are you facing this issue at cluster creation, or after the cluster has been created?

For cluster creation, simply set: before_compute = true in the VPC CNI addon with the relevant configuration values for custom networking

For existing clusters, you will need to set the appropriate values on the VPC CNI and then cycle the nodes

@vchepkov vchepkov force-pushed the addon_delay_duration branch from 3832b40 to c7bc123 on November 26, 2024 00:05
@vchepkov vchepkov force-pushed the addon_delay_duration branch from c7bc123 to 7d35a22 on November 26, 2024 00:06
@vchepkov vchepkov changed the title from "feat: add configurable delay before addons creation" to "feat: Added configurable delay before addons creation" Nov 26, 2024
@vchepkov
Author

vchepkov commented Nov 26, 2024

At cluster creation.
Several months ago some changes were introduced and addons now take up to 20 minutes to create.
I have been chasing this problem ever since.

@vchepkov
Author

I attached the PR to my support case; hopefully the AWS engineers will chime in.

@bryantbiggs
Member

At cluster creation.

This is how you would do that today

module "eks" {
  source  = "terraform-aws-modules/eks/aws"

  # Truncated for brevity
  ...

  cluster_addons = {
     vpc-cni = {
        before_compute       = true
        configuration_values = jsonencode({
        env = {
           AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG = "true"
           ENI_CONFIG_LABEL_DEF               = "topology.kubernetes.io/zone"
        })
     }
  }

...
}

Several months ago some changes were introduced and addons now take up to 20 minutes to create.
I have been chasing this problem ever since.

Can you share a reproduction?

@vchepkov
Author

Our code is slightly more complicated, but it's essentially the same

 vpc-cni = (var.enable_custom_networking ? {
      before_compute              = "true"
      addon_version               = "v${var.cni_version}"
      resolve_conflicts_on_create = "NONE"
      resolve_conflicts_on_update = "OVERWRITE"
      service_account_role_arn    = module.vpc_cni_irsa.iam_role_arn
      configuration_values = jsonencode({
        env = {
          AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG = "true"
          ENI_CONFIG_LABEL_DEF               = "topology.kubernetes.io/zone"
        }
        enableNetworkPolicy = tostring(var.enable_network_policy)
        enableWindowsIpam   = tostring(var.enable_windows)
      })
      } : {
      addon_version               = "v${var.cni_version}"
      resolve_conflicts_on_create = "NONE"
      resolve_conflicts_on_update = "OVERWRITE"
      service_account_role_arn    = module.vpc_cni_irsa.iam_role_arn
      configuration_values = jsonencode({
        enableNetworkPolicy = tostring(var.enable_network_policy)
        enableWindowsIpam   = tostring(var.enable_windows)
      })
    })

I can't reproduce it locally, but in our GitLab environment, where many clusters are created and deleted to test the code, it happens about half of the time. Adding the delay was recommended by AWS Support.

@antonbabenko
Member

This issue has been resolved in version 20.30.0 🎉

@bryantbiggs
Member

@vchepkov would you mind updating to the latest version, v20.30.0, which was just released, and reporting back on whether that resolves the issues you are seeing?

@bryantbiggs
Member

Also, do check out the test case I added, which shows configurations that should be used to help improve this situation. In the next breaking change of the module these will be baked in as the defaults, but for now users can set them to reach the desired outcome (a combined sketch follows the list below):

  • # Disable the default self-managed addons to avoid the penalty of adopting them later
    bootstrap_self_managed_addons = false
  • # Use subnet tags to avoid the need to inject the ENIConfig,
    # which requires a live API server endpoint and creates a dependency chain of:
    # Control plane -> API request to create ENIConfig -> VPC CNI addon -> nodes/compute
    # With the subnet discovery feature, we can avoid this dependency:
    # Control plane -> VPC CNI addon -> nodes/compute
    ENABLE_SUBNET_DISCOVERY = "true"
  • # Tag for subnet discovery
    "kubernetes.io/role/cni" = 1

@vchepkov
Author

Definitely, I will test. But I am a bit skeptical, because my tests match what the support folks said: you need a 3-minute delay, and the timing in your MRs is quite different. Maybe it's the combination with disabling the bootstrap addons? I will report back shortly.

@bryantbiggs
Member

I have the internal ticket between support and the service team 😉 and am quite optimistic that this will help the situation

@vchepkov
Author

Oh, I don't need the ENIConfig anymore? That is nice - 🤞

@vchepkov
Author

vchepkov commented Nov 29, 2024

@bryantbiggs, I tried the new v20.30.1 on Wednesday and it didn't work, but I hadn't implemented subnet discovery yet.
So today I added it and the result is the same - none of the addons are being created, so they will time out in 20 minutes:

module.eks.module.eks-cluster.aws_eks_addon.this["coredns"]: Still creating... [3m20s elapsed]
module.eks.module.eks-cluster.aws_eks_addon.before_compute["vpc-cni"]: Still creating... [4m30s elapsed]
module.eks.module.eks-cluster.aws_eks_addon.this["kube-proxy"]: Still creating... [3m30s elapsed]

From what I can tell, the only difference in our code is that we use self_managed_node_groups. Is it possible that disabling bootstrap_self_managed_addons requires EKS managed node groups?

@bryantbiggs
Member

Can you show more of the apply sequence of events?

@vchepkov
Author

Removed GitLab's coloring:

clusters-output-cleaned.txt

@vchepkov
Author

I added a managed node group just to rule it out - I copied the example as-is, and it fails to create too:

module.eks.module.eks-cluster.module.eks_managed_node_group["example"].aws_eks_node_group.this[0]: Still creating... [20m20s elapsed]
module.eks.module.eks-cluster.module.eks_managed_node_group["example"].aws_eks_node_group.this[0]: Still creating... [20m30s elapsed]
module.eks.module.eks-cluster.module.eks_managed_node_group["example"].aws_eks_node_group.this[0]: Still creating... [20m40s elapsed]
module.eks.module.eks-cluster.module.eks_managed_node_group["example"].aws_eks_node_group.this[0]: Still creating... [20m50s elapsed]

@bryantbiggs
Member

Thank you for sharing. I'm still digging through it, but I wanted to comment to better understand a few things:

  1. I don't believe it would have any influence on the issue you are experiencing, but why the use of self-managed node groups instead of EKS managed node groups?
  2. We do support EKS Pod Identity on EKS addons today, and we have a module to make that a bit easier, https://github.com/terraform-aws-modules/terraform-aws-eks-pod-identity, instead of using IRSA. (Note: VPC CNI IRSA doesn't make a lot of sense unless you are removing the permissions needed by the VPC CNI after the node group has been created and nodes have joined the cluster. You can create clusters without the VPC CNI IRSA role - the pods inherit the permissions of the node - but you can't create clusters without those permissions on the node IAM role when using VPC CNI with IRSA.)
  3. Any reason for continuing to use the aws-auth ConfigMap instead of switching to cluster access entries? Access entries help avoid chaining providers, since Terraform has to authenticate and reach into the cluster to create/update the ConfigMap. Right now it looks like you are adding the same roles both in the ConfigMap and as access entries (a sketch of the access-entries-only approach follows this list).
  4. Are you defining your node groups within the EKS module itself, like this - or are you defining them as standalone, external modules, like this?
  5. Did you get a chance to look at the logs for the VPC CNI and CoreDNS to see why they aren't reaching a healthy state?
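
On point 3, a minimal sketch of what the access-entries-only approach might look like with this module; the role reference and policy choice below are illustrative:

module "eks" {
  source  = "terraform-aws-modules/eks/aws"

  # Truncated for brevity
  ...

  # "API" drops the aws-auth ConfigMap entirely; "API_AND_CONFIG_MAP" keeps both
  authentication_mode = "API"

  access_entries = {
    # Illustrative role - replace with the roles currently duplicated in aws-auth
    cluster_admin = {
      principal_arn = aws_iam_role.cluster_admin.arn

      policy_associations = {
        admin = {
          policy_arn = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy"
          access_scope = {
            type = "cluster"
          }
        }
      }
    }
  }
}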

@vchepkov
Author

vchepkov commented Nov 29, 2024

It's my pleasure; thank you for looking into it. I can upload all my code to the support ticket if you like, if that helps.

  1. There are some security tools that we need to install on the host (security wants them)
  2. We have been using that IRSA module for quite a while; I think it was just following the principle of least privilege. If that's causing an issue, we can refactor it
  3. Some other apps we use still rely on the classic aws-auth ConfigMap, so we enable it. Technically, nothing relies on it in this module and I can move it elsewhere, but I don't think it's causing an issue.
  4. We are using the same module and set self_managed_node_group_defaults and self_managed_node_groups

@bryantbiggs
Member

Hi @vchepkov - can you please share a minimal reproduction of this issue? Here is my minimal reproduction, which demonstrates that the functionality is working as intended: https://github.com/terraform-aws-modules/terraform-aws-eks/tree/master/tests/fast-addons

@vchepkov
Author

vchepkov commented Dec 3, 2024

I will try to use your example with our roles and deployment pipeline; I am at a loss now.
I have stripped out IRSA and aws-auth, and it didn't make any difference.

@vchepkov
Author

vchepkov commented Dec 9, 2024

@bryantbiggs, I wasn't able to reproduce the issue exactly yet, but two changes make the addons behave unexpectedly.

$ git diff
diff --git a/tests/fast-addons/main.tf b/tests/fast-addons/main.tf
index 12d2a41..cd1242b 100644
--- a/tests/fast-addons/main.tf
+++ b/tests/fast-addons/main.tf
@@ -3,9 +3,10 @@ provider "aws" {
 }
 
 locals {
-  name            = "ex-${basename(path.cwd)}"
+  name            = "vchepkov-${basename(path.cwd)}"
   cluster_version = "1.31"
-  region          = "eu-west-1"
+  region          = "us-east-1"
+  cni_version     = "v1.19.0-eksbuild.1"
 
   tags = {
     Test       = local.name
@@ -43,7 +44,7 @@ module "eks" {
       most_recent = true
     }
     vpc-cni = {
-      most_recent    = true
+      addon_version  = local.cni_version
       before_compute = true
       configuration_values = jsonencode({
         env = {
@@ -61,7 +62,7 @@ module "eks" {
   vpc_id     = module.vpc.vpc_id
   subnet_ids = module.vpc.private_subnets
 
-  eks_managed_node_groups = {
+  self_managed_node_groups = {
     example = {
       instance_types = ["m6i.large"]

As I mentioned before, we do use self_managed_node_groups and we pin addon_version because we had an outage in the past during an addon upgrade. These two changes make coredns consistently take 15 minutes to create.

@bryantbiggs
Member

Sorry, which two changes make the addons behave differently?

@vchepkov
Author

vchepkov commented Dec 9, 2024

I posted a diff in the comment?

-      most_recent    = true
+      addon_version  = local.cni_version
...
-  eks_managed_node_groups = {
+  self_managed_node_groups = {

@bryantbiggs
Member

Ah, OK, those two - I wasn't sure about the changes to the region, etc.

@vchepkov
Author

Ah sorry, yes, those are not important


github-actions bot commented Jan 9, 2025

I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 9, 2025