Provisioning of VM extension 'vmssCSE' has timed out #1860
👋 Thanks for opening your first issue here! If you're reporting a 🐞 bug, please make sure you include steps to reproduce it. |
@vijaygos are you sure these are logs from an instance with a failure? These logs look like a successful run to me. |
@CecileRobertMichon , I can see the VMSS status as "Failed". However, I am not sure how to determine which is the "failed instance". Is there any way to find the failed instance from the portal? |
@vijaygos I don't think the portal shows which instance failed unfortunately. We have an issue open to improve extension logs to print the hostname but in the meantime there's no easy way to get the instance ID that I know of. See #1496. If you are able to repro the issue with scaling a few instances at a time that might be the easiest way to know which instance to get the logs from. |
Ah! Thanks @CecileRobertMichon for pointing me to that. This is rather strange. I have looked at the status for all the 31 VMs in our cluster and they all show "ProvisioningState/succeeded". However, the VMSS Overview page shows a Failed status. |
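For reference, per-instance provisioning state can also be pulled with the Azure CLI rather than the portal; a sketch of that check (resource group and scale set name are placeholders):

```bash
# Hypothetical sketch: list each VMSS instance with its provisioning state
# to spot the one(s) that actually failed.
az vmss list-instances \
  --resource-group k8s-my-cluster-rg \
  --name k8s-linuxpool1-12345678-vmss \
  --query '[].{instance: name, state: provisioningState}' \
  -o table
```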
We are experiencing the exact same thing. Any updates on what could be wrong? |
Also experiencing the same issue after updating ServicePrincipal credentials |
Also getting this issue today though no changes from cluster perspective. We were getting a rate limit issue earlier. |
Is there a common pattern in terms of cluster operations? Are these new cluster buildouts? Or a result of scale events on existing clusters? |
Answering for @sylus - it's an existing cluster. We haven't made changes to the scale set ourselves - the last scale operation was last week. We found the issue when we had teams reporting that pods with disks weren't coming up. It now seems that it's unable to mount the disks because it can't unmount them from the old nodes. We seemed to hit a subscription write limit earlier today - though I'm not sure if that's related to this issue or not (if it was retrying too often). |
We experience the same issue. Existing cluster, no changes regarding scale sets, but pods can't mount data disks and keep hanging in 'Init'. |
Same here; it worked fine last evening and today no VM is able to attach any disks, displaying the |
We filed an Azure support ticket and supposed to get a call back this morning will post back any info. |
I was able to restart our VMSS nodes gracefully by running the following az cli command (previously, restarting the vmss node through the GUI had also resulted in the same failed state).

Command:

After running it, the cluster status resolved: all hung disks freed themselves and reattached without an issue. |
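This may not be the exact command used, but a graceful per-instance restart with the Azure CLI generally looks like the sketch below (resource group, scale set name, and instance IDs are placeholders):

```bash
# Hypothetical sketch: restart specific VMSS instances with the Azure CLI.
az vmss restart \
  --resource-group k8s-my-cluster-rg \
  --name k8s-linuxpool1-12345678-vmss \
  --instance-ids 3 7
```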
Unfortunately, in my case, the outcome of the above wasn't any different from the GUI-triggered process:
|
We have the same issue, except it affects both the vmssCSE and RunCommandLinux extensions. We didn't enable or disable any extensions. I tried to run the update command but it never completes successfully, failing with this:
I tried to reinstall the extension with force and it also failed. This is a critical bug and it is affecting our production. How has this not been addressed? We don't need this extension. Nobody needs to run Linux commands via a CLI extension, and it should not be there by default. It's bad design, period, to force this on people. |
How many folks on this thread can correlate these failures with active Azure API rate limiting events? @rsingh612, were you able to determine operational differences between your original failure events, and your successful (if manual) |
@jackfrancis we saw this:
I assumed what happened was that a VM extension auto-updated, leading to the VMSS entering a failed state. While in a failed state, disk attach/detach operations fail but keep retrying. My hypothesis was that, because of the failed state, the disk retries triggered the API limit, but maybe that is incorrect. |
We are now getting this on a completely different cluster that has been up for months, after resolving the issue in another cluster in a separate subscription. Are we 100% sure there was no rollout change? It seems suspicious with all the people here having the same problem, and over in:

Obviously something has had to change; is Microsoft looking into this? It is almost as if all the extension statuses for the VMSS have been lost all of a sudden.

Failed to upgrade virtual machine instance 'k8s-linuxpool1-13820493-vmss_1'. Error: Provisioning of VM extension 'vmssCSE' has timed out. Extension installation may be taking too long, or extension status could not be obtained.

On the instance itself, however, it shows this:

vmssCSE (Microsoft.Azure.Extensions.CustomScript, 2.0.7)
Provisioning succeeded
Info
ProvisioningState/succeeded

Can't even enable boot diagnostics; going to try to slowly replace each instance. |
We were able to add a node and that process seems to have fixed the scaleset at least for now. ^_^ |
We are actively working on a resolution for this issue. |
Adding a node didn't work for us sadly. @devigned Any idea on when you might expect a resolution to this and whether or not it will involve upgrading or migrating to a new aks / nodepool? |
It seems like most of these issues are related to Azure API request limits being exceeded. It would be extremely helpful if everyone experiencing this issue could ping back with the following:
- Number of nodes in your cluster?
- Number of disks attached to those nodes?
- What version of K8s?

Thank you all. We will do our best to root cause this and get a fix out as soon as possible. |
@sharkymcdongles give this a run please:
Please report back any interesting rate limiting logs you find. |
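One way to spot Azure API throttling is to search the kube-controller-manager logs for throttling signatures; the sketch below shows the general idea (the label selector and grep pattern are assumptions, and this may not be the command that was suggested):

```bash
# Hypothetical sketch: look for Azure API throttling signatures in the
# kube-controller-manager logs of a self-managed (aks-engine) cluster.
kubectl -n kube-system logs -l component=kube-controller-manager --tail=5000 \
  | grep -iE 'TooManyRequests|rate.?limit|throttl'
```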
@devigned this is AKS, I don't have access to the masters. This is the affected vmss in Azure:

and AKS:
|
@jackfrancis We upgraded to 1.15.5 with aks-engine 0.42.2 and enabled the v2 backoff. But it seems that when it hits a rate limit scenario, it just hammers the Azure API, and the only way to recover is to turn off the controller-manager for a while to clear it. For example:
(We also hit the HighCostGetVMScaleSet3Min one too) |
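A sketch of temporarily stopping the controller manager on an aks-engine master, assuming it runs as a static pod under the usual manifest directory (the path is an assumption):

```bash
# Hypothetical sketch: temporarily stop kube-controller-manager on a master node
# by moving its static pod manifest aside, then restore it once throttling clears.
sudo mv /etc/kubernetes/manifests/kube-controller-manager.yaml /tmp/
# ... wait for the Azure API rate-limit window to reset (e.g. 30 minutes) ...
sudo mv /tmp/kube-controller-manager.yaml /etc/kubernetes/manifests/
```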
Will the fixes being addressed over at the AKS issue be applicable here? |
Yeah, as @zachomedia said, it appears worse and we barely get 15-20 minutes of proper Kubernetes operation before being rate limited again. @devigned @jackfrancis, can we arbitrarily increase our limit? Our clients are getting pretty insistent (lol) about the state of things this week, and it is putting a lot of pressure on us. We're really worried this will result in us having to move workloads somewhere else, maybe not with VMSS, but we love the rolling upgrades so we don't particularly want to. We have been stalled for the past 2-3 days. We can maybe escalate our support ticket to priority. Note: we do seem to be able to find the disk potentially causing the problem, but there are a lot of moving parts so we are trying to further isolate it. |
/cc @aramase for insight into the Cloud Provider for Azure which is what's issuing the calls to ARM. |
For folks who are having issues w/ VMSS CSE timeouts (not necessarily related to throttling), there has been an identified CSE bug being triaged. This bug correlates with folks experiencing this issue last week. (This CSE bug has nothing to do w/ AKS Engine's CSE script(s).) If you have one or more VMSS instances in this state, please try manually re-imaging the instance. We've seen that workaround help restore nodes for folks. And please report back if that unblocks you. And apologies. :( |
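For reference, re-imaging a single instance with the Azure CLI looks roughly like this sketch (resource group, scale set name, and instance ID are placeholders):

```bash
# Hypothetical sketch: re-image one VMSS instance that is stuck in the failed CSE state.
az vmss reimage \
  --resource-group k8s-my-cluster-rg \
  --name k8s-linuxpool1-12345678-vmss \
  --instance-id 5
```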
Ok, so I managed to get our cluster back into a good state for now. It seems that running most operations that cause a disk unmount can trigger the problem again (@sylus can fill in more there). Basically, to find out which disk was stuck, I looked in the controller-manager logs and saw a bunch of:

To fix it, I ran:

Once it was detached, the cluster slowly started to recover. |
Kubernetes Version: v1.15.5 (1 master, 4 nodes)

Re: the comment above. The disk (un)mount does seem to be the problem. I was able to reproduce a base case by just deleting a few pods and/or running a helm upgrade of a deployment, and that started to trigger the following errors related to a disk unmount almost right away.

I1023 23:49:18.644247 1 pv_controller.go:1270] isVolumeReleased[pvc-b2220d20-9e95-4973-923d-95cc6e49ff4c]: volume is released
I1023 23:49:18.644258 1 pv_controller.go:1270] isVolumeReleased[pvc-653fdf5a-7408-460e-89d3-fdd0a6dd5fdf]: volume is released
E1023 23:49:19.821907 1 goroutinemap.go:150] Operation for "delete-pvc-b2220d20-9e95-4973-923d-95cc6e49ff4c[52a16d3e-b5dd-4cc1-a64e-b03f6d61948b]" failed. No retries permitted until 2019-10-23 23:51:21.821868366 +0000 UTC m=+12222.629161185 (durationBeforeRetry 2m2s). Error: "compute.DisksClient#Delete: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code=\"OperationNotAllowed\" Message=\"Disk k8s-cancentral-01-development-dyna-pvc-b2220d20-9e95-4973-923d-95cc6e49ff4c is attached to VM /subscriptions/$SUBSCRIPTION/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/virtualMachineScaleSets/k8s-linuxpool1-28391316-vmss/virtualMachines/k8s-linuxpool1-28391316-vmss_12.\""
E1023 23:49:19.826714 1 goroutinemap.go:150] Operation for "delete-pvc-653fdf5a-7408-460e-89d3-fdd0a6dd5fdf[4c641582-1171-4c98-8189-29185623fc1c]" failed. No retries permitted until 2019-10-23 23:51:21.826677075 +0000 UTC m=+12222.633969894 (durationBeforeRetry 2m2s). Error: "compute.DisksClient#Delete: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code=\"OperationNotAllowed\" Message=\"Disk k8s-cancentral-01-development-dyna-pvc-653fdf5a-7408-460e-89d3-fdd0a6dd5fdf is attached to VM /subscriptions/$SUBSCRIPTION/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/virtualMachineScaleSets/k8s-linuxpool1-28391316-vmss/virtualMachines/k8s-linuxpool1-28391316-vmss_12.\""

The suspicion is that, with multiple teams (re)deploying their apps, enough of these disk failures eventually make us hit the rate limits set by Azure, so other operations against the VMSS no longer succeed once it gets into this state. Then, as mentioned above, we need to stop the controller-manager pod for a while to clear the rate limit, as illustrated below.

The server rejected the request because too many requests have been received for this subscription. (Code: OperationNotAllowed) {"operationgroup":"HighCostGetVMScaleSet30Min","starttime":"2019-10-23T14:18:11.960853+00:00","endtime":"2019-10-23T14:33:11.960853+00:00","allowedrequestcount":900,"measuredrequestcount":3157} (Code: TooManyRequests, Target: HighCostGetVMScaleSet30Min)

The controller-manager logs list the PVC and the instance ID of the disk that can't be detached, and we can use the PVC name to find the LUN with the command below.

az vmss list-instances --resource-group k8s-cancentral-01-dev-rg --name k8s-linuxpool1-12345678-vmss --query '[].[name, storageProfile.dataDisks[]]'

We then have to run the following for ALL disks listed in the controller-manager logs that have this problem.

az vmss disk detach --resource-group k8s-cancentral-01-dev-rg --vmss-name k8s-linuxpool1-12345678-vmss --instance-id $ID --lun $LUN

The cluster is then back in a working state until the next deployment, which triggers the PVC issue again; rinse and repeat ^_^

Related Issues

a) All this is explained in further detail over at the AKS issue by Microsoft: Azure/AKS#1278 (comment)

b) Additionally, this looks directly related as well, although we have an order of magnitude smaller cluster: kubernetes-sigs/cloud-provider-azure#247 |
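A sketch that ties the two commands above together (find the instance ID and LUN for a stuck PVC disk, then detach it) might look like this, assuming the Azure CLI is logged in and jq is available; the resource names are placeholders:

```bash
# Hypothetical helper: find the instance ID and LUN of the disk backing a stuck PVC,
# then detach it from the VMSS instance it is still attached to.
RG=k8s-cancentral-01-dev-rg
VMSS=k8s-linuxpool1-12345678-vmss
PVC=pvc-b2220d20-9e95-4973-923d-95cc6e49ff4c

# List every instance with its data disks, keep the ones whose disk name references the PVC,
# and detach each match.
az vmss list-instances --resource-group "$RG" --name "$VMSS" \
  --query '[].{id: instanceId, disks: storageProfile.dataDisks}' -o json \
| jq -r --arg pvc "$PVC" \
    '.[] | .id as $id | .disks[]? | select(.name | contains($pvc)) | "\($id) \(.lun)"' \
| while read -r ID LUN; do
    az vmss disk detach --resource-group "$RG" --vmss-name "$VMSS" \
      --instance-id "$ID" --lun "$LUN"
  done
```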
@sylus @zachomedia in your failure scenarios, are you ever encountering this error:
And if so, are you observing that the disk it's complaining about seems, in fact, to be totally unattached? We encountered this with another customer, and were able to follow your guidance to manually detach the offending disk (even though it wasn't attached to anything! — we detached it from the vmss instance id that it was trying to attach itself to 🤷♂ ). In any event, FYI for folks continuing to struggle with this. |
@jackfrancis Yeah, we've seen that error in our logs too. I don't think we've ever checked if it was actually attached or not, usually we just detach it through the cli. We did have a weird state today where apparently one of our instances had a disk attached that no longer existed so all other disk operations failed. Once we removed that attachment, it recovered. |
@zachomedia How did you get the lun number in the case where the disk is not actually attached to any VMSS instances? In our troubleshooting the following command didn't yield the lun in such a scenario:
|
@jackfrancis Oh, I see, for all of our cases the disk was in the list. So I guess that means it was attached. |
@sylus @zachomedia do you think this is an appropriate repro?:
(note the ratio of nodes-replicas is 1:1)
I wonder if the above will induce the weird zombie detach state we're seeing. |
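A minimal sketch of that kind of repro workload (one Azure-disk-backed PVC per replica, replicas matched 1:1 to node count) could look like the following; the storage class, image, replica count, and sizes are assumptions and may differ from the spec the comment refers to:

```bash
# Hypothetical repro: a StatefulSet with one Azure-disk-backed PVC per replica,
# scaled to match the node count, then deleted and recreated to force
# disk detach/attach churn.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: disk-churn
spec:
  serviceName: disk-churn
  replicas: 5                 # assumed: one replica per node
  selector:
    matchLabels:
      app: disk-churn
  template:
    metadata:
      labels:
        app: disk-churn
    spec:
      containers:
      - name: pause
        image: k8s.gcr.io/pause:3.1
        volumeMounts:
        - name: data
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: managed-premium   # assumed Azure managed-disk storage class
      resources:
        requests:
          storage: 5Gi
EOF
```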
I would say that you should have a pretty reasonable chance at reproducing the issue with that setup. Our cluster is much smaller (about 5 nodes) and usually just a couple of pods with disks being deleted can trigger it. |
@zachomedia Node count is static? I.e., disk re-attachment operations aren't happening as a result of underlying VMSS instances disappearing, re-appearing, etc? |
@jackfrancis That's correct, node count is static. |
(So far unable to repro, but will keep trying.) It's also possible that disk detach/attach operations during throttle events are the edge case causing this behavior (my test cluster is not being actively throttled atm). |
@jackfrancis So something you can try: it seems most of our problems now stem from PVCs being deleted (one of our teams deletes their deployments and re-creates them right now). We seem to get two things:
|
@zachomedia You need to drain |
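For reference, cordoning and draining a node with kubectl typically looks like the sketch below (the node name is a placeholder and the flags reflect kubectl of that era):

```bash
# Hypothetical sketch: drain a node so its pods (and their disks) are rescheduled
# elsewhere before operating on the underlying VMSS instance.
kubectl drain k8s-linuxpool1-12345678-vmss000003 \
  --ignore-daemonsets \
  --delete-local-data \
  --grace-period=60
```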
Problems again today, with 5 of our 6 VMs in a failed state, and we can't even reimage because we get the error below. I also launched a few AKS clusters over the weekend, and as soon as they were turned off overnight they all ended up in a failed state with disk issues. Really hoping a fix is forthcoming; this is plainly reproducible.

Failed to reimage virtual machine instances k8s-linuxpool1-12345678-vmss_12, k8s-linuxpool1-12345678-vmss_10, k8s-linuxpool1-12345678-vmss_9, k8s-linuxpool1-12345678-vmss_11. Error: The processing of VM 'k8s-linuxpool1-12345678-vmss_10' is halted because of one or more disk processing errors encountered by VM 'k8s-linuxpool1-12345678-vmss_12' in the same Availability Set. Please resolve the error with VM 'k8s-linuxpool1-12345678-vmss_12' before retrying the operation. |
@sylus is the StatefulSet spec here not a viable repro input for inducing this symptom on a test cluster? As the issue describes I was able to witness some badness (described in the issue w/ the working remediation steps I came up with at the time), but I haven't been able to reliably repeat all the badness so many folks are seeing now. Would love to get that repro process so that we can more effectively help drive fixes. Thanks for hanging in there. :/ |
A VMSS bug in the update on Oct 17 was identified and remediated globally over Oct 28 and 29. Disks should no longer be stuck in the 'detaching' state in VMSS, and so any Kubernetes operations should now be able to proceed without running into this issue. If you observe any new instances of this same problem please reopen this bug and I'll work to determine the cause. |
We're getting this error with version 1.22.6: "Message: Provisioning of VM extension vmssCSE has timed out. Extension provisioning has taken too long to complete. The extension last reported "Plugin enabled". More information on troubleshooting is available at https://aka.ms/VMExtensionCSELinuxTroubleshoot" |
I'm facing the same error while nothing has been changed on the cluster; I really need some advice on this. |
Advice is to switch to a better cloud provider. Our entire company had to switch because of this issue 3 years ago. Surprised they still have these problems. |
Hi all, I suggest opening a new issue in https://github.com/Azure/AKS/issues with details of the problem/error you are facing. I want to make sure you're getting the help you need. This is a closed issue from 3 years ago in a deprecated project (https://github.com/Azure/aks-engine#project-status) so commenting on here likely won't get the right people to look into it. |
What happened:
VMSS status is set to 'failed' with error message - "Provisioning of VM extension 'vmssCSE' has timed out. Extension installation may be taking too long, or extension status could not be obtained."
As a result of this, SLB does not allow binding of the service (type LoadBalancer) to a public IP resource. The service status is always Pending:
What you expected to happen:
No CSE errors and the service should bind to a given public IP resource with no errors.
How to reproduce it (as minimally and precisely as possible):
No real steps. Happens at random when the cluster attempts to scale and add a new VM
Anything else we need to know?:
AKS engine version is 0.28.1
Environment:
Kubernetes version (use kubectl version): 1.12.2

While I am tempted to say this looks like a duplicate of #802, I would appreciate another look.