feat: Added configurable delay before addons creation #3214
I'm not following. Are you facing the issue at cluster creation, or after the cluster has been created?
For cluster creation, simply set:
For existing clusters, you will need to set the appropriate values on the VPC CNI and then cycle the nodes.
Currently, even though the cluster status is active, the addon API is not completely ready yet. This causes retries and significant delays during addon creation, potentially negating the before_compute delay.
At cluster creation.
I attached this PR to my support case; hopefully AWS engineers will chime in.
This is how you would do that today:

    module "eks" {
      source = "terraform-aws-modules/eks/aws"

      # Truncated for brevity
      # ...

      cluster_addons = {
        vpc-cni = {
          # Create the addon before any compute (node groups) is provisioned
          before_compute = true
          configuration_values = jsonencode({
            env = {
              AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG = "true"
              ENI_CONFIG_LABEL_DEF               = "topology.kubernetes.io/zone"
            }
          })
        }
      }

      # ...
    }

Can you share a reproduction?
Our code is slightly more complicated, but it's essentially the same.
I can't reproduce it locally, but in our GitLab environment, where many clusters are created and deleted to test the code, it happens half of the time. Adding the delay was recommended by AWS Support.
This issue has been resolved in version 20.30.0 🎉
@vchepkov would you mind updating to the latest? Also, do check out the test case I added, which shows configurations that should be used to help improve this situation. In the next breaking change of the module, these will be baked in as the defaults, but for now users can set them to reach the desired outcome:
Definitely, will test. But I am a bit skeptical: my tests match what the support folks said, that you need a 3-minute delay, and the timing in your MRs is way different. Maybe it's the combination with disabling bootstrap addons? I will report back shortly.
I have the internal ticket between support and the service team 😉 and am quite optimistic that this will help the situation.
Oh, I don't need ENIConfig anymore? That is nice, and 🤞
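The configuration referenced in the comments above is not reproduced in this thread. As a rough sketch only, the settings mentioned in the surrounding discussion (disabling the bootstrap addons and relying on subnet discovery instead of ENIConfig) might look like the following. The version constraint and the choice of addon shown are assumptions; the tests/fast-addons example in the module repository is the authoritative source.

```hcl
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.30" # assumption: any release that contains the fix mentioned above

  # Truncated for brevity
  # ...

  # Do not install the default self-managed addons at cluster creation;
  # the addons below are created through the EKS addon API instead
  bootstrap_self_managed_addons = false

  cluster_addons = {
    vpc-cni = {
      # Create the addon before any compute (node groups) is provisioned
      before_compute = true
      configuration_values = jsonencode({
        env = {
          # Assumption: subnet discovery is used in place of the ENIConfig-based setup
          ENABLE_SUBNET_DISCOVERY = "true"
        }
      })
    }
  }
}
```

Subnet discovery depends on how the pod subnets are tagged, so the VPC CNI documentation should be checked before adopting this.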
@bryantbiggs, I tried the new v20.30.1 on Wednesday and it didn't work, but I haven't implemented subnet discovery yet.
From what I can tell, the only difference in our code is that we use self_managed_node_groups. Is it possible that disabling bootstrap_self_managed_addons requires EKS managed nodes?
Can you show more of the apply sequence of events?
Removed GitLab's coloring.
I added a managed group just to rule it out; I copied the example as is, and it fails to create too.
Thank you for sharing. I'm still digging through it, but wanted to comment and better understand a few things:
It's my pleasure, thank you for looking into it. I can upload all my code to the support ticket if that helps.
Hi @vchepkov, can you please share a minimum reproduction of this issue? Here is my minimum reproduction that demonstrates the functionality is working as intended: https://github.com/terraform-aws-modules/terraform-aws-eks/tree/master/tests/fast-addons
I will try to use your example with our roles and deployment pipeline; I am at a loss now.
@bryantbiggs, I wasn't able to exactly reproduce the issue yet, but two changes make the addons behave unexpectedly.
As I mentioned before, we do use self_managed_node_groups and we pin addon_version, because we had an outage in the past with an addon upgrade. These two changes make coredns consistently take 15 minutes to create.
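The diff itself is not shown in this thread. As an illustration only, the two changes described here (self-managed node groups and a pinned addon version) might look roughly like this. The node group name, its sizes and instance type, and the var.coredns_version variable are hypothetical placeholders.

```hcl
# Hypothetical variable; set it to an addon version you have validated
variable "coredns_version" {
  type = string
}

module "eks" {
  source = "terraform-aws-modules/eks/aws"

  # Truncated for brevity
  # ...

  cluster_addons = {
    coredns = {
      # Pinned version instead of most_recent = true
      addon_version = var.coredns_version
    }
  }

  # Self-managed node groups instead of EKS managed node groups
  self_managed_node_groups = {
    default = {
      instance_type = "m5.large" # placeholder
      min_size      = 1
      max_size      = 3
      desired_size  = 2
    }
  }
}
```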
Sorry, which two changes make the addons behave differently?
I posted a diff in the comment?
Ah ok, those two. I wasn't sure about the changes to region, etc.
Ah sorry, yes, those are not important.
I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
Description
Currently, even though the cluster status is active, the addon API is not completely ready yet. This causes retries and significant delays during addon creation, potentially negating the before_compute delay.
Motivation and Context
When implementing vpc-cni custom networking, the addon fails to be created within the configured dataplane_wait_duration interval, causing pods to use the host network. I opened a case with AWS Support, and they informed me that there is a 3-minute delay between the cluster being marked ACTIVE and the addon API being ready. Without a delay, vpc-cni creation (or that of any addon, but vpc-cni is critical) can take up to 20 minutes.
I left the default at 0s for those who do not use custom networking, but in my test suite I set it to 3 minutes, and each time the addon was created in under 20 seconds.
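A minimal sketch of the idea, not the code from this PR: the hashicorp/time provider's time_sleep resource can hold addon creation back for a configurable duration after the cluster has been created. The variable name addons_wait_duration and all resource names are assumptions made for illustration.

```hcl
terraform {
  required_providers {
    aws  = { source = "hashicorp/aws" }
    time = { source = "hashicorp/time" }
  }
}

variable "cluster_name" { type = string }
variable "cluster_role_arn" { type = string }
variable "subnet_ids" { type = list(string) }

# Hypothetical input name; "0s" effectively disables the wait
variable "addons_wait_duration" {
  type    = string
  default = "0s"
}

resource "aws_eks_cluster" "this" {
  name     = var.cluster_name
  role_arn = var.cluster_role_arn

  vpc_config {
    subnet_ids = var.subnet_ids
  }
}

# The sleep starts only after the cluster exists, because its trigger
# value is read from the cluster resource
resource "time_sleep" "before_addons" {
  create_duration = var.addons_wait_duration

  triggers = {
    cluster_name = aws_eks_cluster.this.name
  }
}

resource "aws_eks_addon" "vpc_cni" {
  # Reading the name through the sleep makes the addon wait for it to finish
  cluster_name = time_sleep.before_addons.triggers["cluster_name"]
  addon_name   = "vpc-cni"
}
```

Setting addons_wait_duration = "3m" would mirror the 3-minute value used in the test suite described above.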
How Has This Been Tested?
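- I have updated at least one of the examples/* to demonstrate and validate my change(s)
- I have tested and validated these changes using one or more of the provided examples/* projects
- I have executed pre-commit run -a on my pull request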