
feat: Added configurable delay before addons creation #3214

Conversation


@vchepkov vchepkov commented Nov 25, 2024

Description

Currently, even though the cluster status is active, the addon API is not completely ready yet. This causes retries and significant delays during addon creation, potentially negating the before_compute delay.

Motivation and Context

When implementing VPC CNI custom networking, the addon fails to create within the configured dataplane_wait_duration interval, causing pods to use the host network. I opened a case with AWS support and they informed me that there is a roughly 3-minute delay between the cluster being marked ACTIVE and the addon API becoming ready; without a delay, VPC CNI creation (or any add-on, but VPC CNI is the critical one) can take up to 20 minutes.

I left the default at 0s for those who do not use custom networking, but in my test suite I set it to 3 minutes and each time the addon was created in under 20 seconds.
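
For context, a delay like this is typically implemented with the time_sleep resource from the hashicorp/time provider, which is the same pattern the module already uses for dataplane_wait_duration. Below is a minimal sketch of that pattern; the variable name addon_wait_duration and the resource addresses are illustrative, not necessarily what this PR uses.

variable "addon_wait_duration" {
  description = "Illustrative: how long to wait after the cluster is ACTIVE before creating addons"
  type        = string
  default     = "0s"
}

resource "time_sleep" "addons" {
  # The timer starts once the cluster resource has been created (status ACTIVE)
  create_duration = var.addon_wait_duration

  depends_on = [aws_eks_cluster.this]
}

resource "aws_eks_addon" "this" {
  cluster_name = aws_eks_cluster.this.name
  addon_name   = "vpc-cni"

  # Addon creation only begins after the configured delay has elapsed
  depends_on = [time_sleep.addons]
}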

How Has This Been Tested?

  • I have updated at least one of the examples/* to demonstrate and validate my change(s)
  • I have tested and validated these changes using one or more of the provided examples/* projects
  • I have executed pre-commit run -a on my pull request

@vchepkov vchepkov force-pushed the addon_delay_duration branch from 5373c51 to 3832b40 on November 26, 2024 00:00
@vchepkov vchepkov changed the title from "Add configurable delay before addons creation" to "featdd configurable delay before addons creation" Nov 26, 2024
@vchepkov vchepkov changed the title from "featdd configurable delay before addons creation" to "feat: add configurable delay before addons creation" Nov 26, 2024
@bryantbiggs
Member

I'm not following - are you facing this issue at cluster creation, or after the cluster has been created?

For cluster creation, simply set: before_compute = true in the VPC CNI addon with the relevant configuration values for custom networking

For existing clusters, you will need to set the appropriate values on the VPC CNI and then cycle the nodes

@vchepkov vchepkov force-pushed the addon_delay_duration branch from 3832b40 to c7bc123 on November 26, 2024 00:05
@vchepkov vchepkov force-pushed the addon_delay_duration branch from c7bc123 to 7d35a22 on November 26, 2024 00:06
@vchepkov vchepkov changed the title from "feat: add configurable delay before addons creation" to "feat: Added configurable delay before addons creation" Nov 26, 2024
@vchepkov
Author

vchepkov commented Nov 26, 2024

At cluster creation.
Several months ago some changes were introduced and addons now take up to 20 minutes to create.
I have been chasing this problem ever since.

@vchepkov
Author

I attached the PR to my support case; hopefully the AWS engineers will chime in.

@bryantbiggs
Member

At cluster creation.

This is how you would do that today

module "eks" {
  source  = "terraform-aws-modules/eks/aws"

  # Truncated for brevity
  ...

  cluster_addons = {
     vpc-cni = {
        before_compute       = true
        configuration_values = jsonencode({
        env = {
           AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG = "true"
           ENI_CONFIG_LABEL_DEF               = "topology.kubernetes.io/zone"
        })
     }
  }

...
}

Several months ago some changes were introduced and addons now take up to 20 minutes to create.
I have been chasing this problem ever since.

Can you share a reproduction?

@vchepkov
Author

Our code is slightly more complicated, but it's essentially the same

 vpc-cni = (var.enable_custom_networking ? {
      before_compute              = "true"
      addon_version               = "v${var.cni_version}"
      resolve_conflicts_on_create = "NONE"
      resolve_conflicts_on_update = "OVERWRITE"
      service_account_role_arn    = module.vpc_cni_irsa.iam_role_arn
      configuration_values = jsonencode({
        env = {
          AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG = "true"
          ENI_CONFIG_LABEL_DEF               = "topology.kubernetes.io/zone"
        }
        enableNetworkPolicy = tostring(var.enable_network_policy)
        enableWindowsIpam   = tostring(var.enable_windows)
      })
      } : {
      addon_version               = "v${var.cni_version}"
      resolve_conflicts_on_create = "NONE"
      resolve_conflicts_on_update = "OVERWRITE"
      service_account_role_arn    = module.vpc_cni_irsa.iam_role_arn
      configuration_values = jsonencode({
        enableNetworkPolicy = tostring(var.enable_network_policy)
        enableWindowsIpam   = tostring(var.enable_windows)
      })
    })

I can't reproduce it locally, but in our GitLab environment, where many clusters are created and deleted to test the code, it happens about half of the time. Adding the delay was recommended by AWS Support.

@antonbabenko
Member

This issue has been resolved in version 20.30.0 🎉

@bryantbiggs
Member

@vchepkov would you mind updating to the latest version, v20.30.0, which was just released, and reporting back on whether that resolves the issues you are seeing?

@bryantbiggs
Member

Also, do check out the test case I added, which shows configurations that should be used to help improve this situation. In the next breaking change of the module these will be baked in as the defaults, but for now users can set them to reach the desired outcome (a combined sketch follows the list below):

  • # Disable the default self-managed addons to avoid the penalty of adopting them later
    bootstrap_self_managed_addons = false
  • # Use subnet tags to avoid the need to inject the ENIConfig,
    # which requires a live API server endpoint and creates a dependency chain of:
    # Control plane -> API request to create ENIConfig -> VPC CNI addon -> nodes/compute
    # With the subnet discovery feature, we can avoid this dependency:
    # Control plane -> VPC CNI addon -> nodes/compute
    ENABLE_SUBNET_DISCOVERY = "true"
  • # Tag for subnet discovery
    "kubernetes.io/role/cni" = 1

@vchepkov
Author

Definitely, I will test. But I am a bit skeptical, because my tests match what the support folks said: you need a 3-minute delay, and the timing in your MRs is quite different. Maybe it's the combination with disabling the bootstrap addons? I will report back shortly.

@bryantbiggs
Member

I have the internal ticket between support and the service team 😉 and am quite optimistic that this will help the situation

@vchepkov
Author

Oh, I don't need the ENIConfig anymore? That is nice - 🤞

@vchepkov
Author

vchepkov commented Nov 29, 2024

@bryantbiggs, I tried the new v20.30.1 on Wednesday and it didn't work, but I hadn't implemented subnet discovery yet.
So today I added it and the result is the same - none of the addons are being created, so they will time out in 20 minutes:

module.eks.module.eks-cluster.aws_eks_addon.this["coredns"]: Still creating... [3m20s elapsed]
module.eks.module.eks-cluster.aws_eks_addon.before_compute["vpc-cni"]: Still creating... [4m30s elapsed]
module.eks.module.eks-cluster.aws_eks_addon.this["kube-proxy"]: Still creating... [3m30s elapsed]

From what I can tell, the only difference in our code is that we use self_managed_node_groups. Is it possible that disabling bootstrap_self_managed_addons requires EKS managed node groups?

@bryantbiggs
Member

Can you show more of the apply sequence of events?

@vchepkov
Author

Removed GitLab's coloring:

clusters-output-cleaned.txt

@vchepkov
Author

I added a managed node group just to rule it out - I copied the example as-is, and it fails to create too:

module.eks.module.eks-cluster.module.eks_managed_node_group["example"].aws_eks_node_group.this[0]: Still creating... [20m20s elapsed]
module.eks.module.eks-cluster.module.eks_managed_node_group["example"].aws_eks_node_group.this[0]: Still creating... [20m30s elapsed]
module.eks.module.eks-cluster.module.eks_managed_node_group["example"].aws_eks_node_group.this[0]: Still creating... [20m40s elapsed]
module.eks.module.eks-cluster.module.eks_managed_node_group["example"].aws_eks_node_group.this[0]: Still creating... [20m50s elapsed]

@bryantbiggs
Member

Thank you for sharing. I'm still digging through it, but I wanted to comment to better understand a few things:

  1. I don't believe it would have any influence on the issue you are experiencing, but why the use of self-managed node groups instead of EKS managed node groups?
  2. We do support EKS Pod Identity on EKS addons today, and we have a module to make that a bit easier, https://github.com/terraform-aws-modules/terraform-aws-eks-pod-identity, instead of using IRSA. (Note: VPC CNI IRSA doesn't make a lot of sense unless you are removing the permissions needed by the VPC CNI after the node group has been created and nodes have joined the cluster. You can create clusters without the VPC CNI IRSA role - the pods inherit the permissions of the node - but you can't create clusters without those permissions on the node IAM role when using VPC CNI with IRSA.)
  3. Any reason for continuing to use the aws-auth ConfigMap instead of switching to cluster access entries? Access entries help avoid chaining providers, since Terraform has to authenticate and reach into the cluster to create/update the ConfigMap. Right now it looks like you are adding the same roles both in the ConfigMap and as access entries (a sketch of the access-entries-only approach follows this list).
  4. Are you defining your node groups within the EKS module itself, like this - or are you defining them as standalone, external modules, like this?
  5. Did you get a chance to look at the logs for the VPC CNI and CoreDNS to see why they aren't reaching a healthy state?
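
On point 3, a minimal sketch of what the access-entries-only approach might look like with this module; the role reference and policy choice below are illustrative:

module "eks" {
  source  = "terraform-aws-modules/eks/aws"

  # Truncated for brevity
  ...

  # "API" drops the aws-auth ConfigMap entirely; "API_AND_CONFIG_MAP" keeps both
  authentication_mode = "API"

  access_entries = {
    # Illustrative role - replace with the roles currently duplicated in aws-auth
    cluster_admin = {
      principal_arn = aws_iam_role.cluster_admin.arn

      policy_associations = {
        admin = {
          policy_arn = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy"
          access_scope = {
            type = "cluster"
          }
        }
      }
    }
  }
}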

@vchepkov
Author

vchepkov commented Nov 29, 2024

It's my pleasure; thank you for looking into it. I can upload all my code to the support ticket if you like, if that helps.

  1. There are some security tools that we need to install on the host (security wants them)
  2. We have been using that IRSA module for quite a while; I think it was just following the principle of least privilege. If that's causing an issue, we can refactor it
  3. Some other apps we use still rely on the classic aws-auth ConfigMap, so we enable it. Technically, nothing relies on it in this module and I can move it elsewhere, but I don't think it's causing an issue.
  4. We are using the same module and set self_managed_node_group_defaults and self_managed_node_groups

@bryantbiggs
Member

Hi @vchepkov - can you please share a minimal reproduction of this issue? Here is my minimal reproduction, which demonstrates that the functionality is working as intended: https://github.com/terraform-aws-modules/terraform-aws-eks/tree/master/tests/fast-addons

@vchepkov
Author

vchepkov commented Dec 3, 2024

I will try to use your example with our roles and deployment pipeline; I am at a loss now.
I have stripped out IRSA and aws-auth, and it didn't make any difference.

@vchepkov
Author

vchepkov commented Dec 9, 2024

@bryantbiggs, I wasn't able to reproduce the issue exactly yet, but two changes make the addons behave unexpectedly.

$ git diff
diff --git a/tests/fast-addons/main.tf b/tests/fast-addons/main.tf
index 12d2a41..cd1242b 100644
--- a/tests/fast-addons/main.tf
+++ b/tests/fast-addons/main.tf
@@ -3,9 +3,10 @@ provider "aws" {
 }
 
 locals {
-  name            = "ex-${basename(path.cwd)}"
+  name            = "vchepkov-${basename(path.cwd)}"
   cluster_version = "1.31"
-  region          = "eu-west-1"
+  region          = "us-east-1"
+  cni_version     = "v1.19.0-eksbuild.1"
 
   tags = {
     Test       = local.name
@@ -43,7 +44,7 @@ module "eks" {
       most_recent = true
     }
     vpc-cni = {
-      most_recent    = true
+      addon_version  = local.cni_version
       before_compute = true
       configuration_values = jsonencode({
         env = {
@@ -61,7 +62,7 @@ module "eks" {
   vpc_id     = module.vpc.vpc_id
   subnet_ids = module.vpc.private_subnets
 
-  eks_managed_node_groups = {
+  self_managed_node_groups = {
     example = {
       instance_types = ["m6i.large"]

As I mentioned before, we do use self_managed_node_groups and we pin addon_version because we had an outage in the past during an addon upgrade. These two changes make coredns consistently take 15 minutes to create.

@bryantbiggs
Member

Sorry, which two changes make the addons behave differently?

@vchepkov
Author

vchepkov commented Dec 9, 2024

I posted a diff in the comment?

-      most_recent    = true
+      addon_version  = local.cni_version
...
-  eks_managed_node_groups = {
+  self_managed_node_groups = {

@bryantbiggs
Member

Ah, OK, those two - I wasn't sure about the changes to the region, etc.

@vchepkov
Author

Ah sorry, yes, those are not important


github-actions bot commented Jan 9, 2025

I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 9, 2025