Skip to content
This repository has been archived by the owner on Oct 24, 2023. It is now read-only.

Provisioning of VM extension 'vmssCSE' has timed out #1860

Closed
vijaygos opened this issue Aug 26, 2019 · 82 comments
Closed

Provisioning of VM extension 'vmssCSE' has timed out #1860

vijaygos opened this issue Aug 26, 2019 · 82 comments
Labels
bug Something isn't working

Comments

@vijaygos
Copy link

What happened:
VMSS status is set to 'failed' with error message - "Provisioning of VM extension 'vmssCSE' has timed out. Extension installation may be taking too long, or extension status could not be obtained."

As a result of this, SLB does not allow for binding of service (type Load balancer) with a public IP resource. The service status is always Pending:

NAME...................TYPE..................... CLUSTER-IP.......EXTERNAL-IP......PORT(S) ........................................AGE

my-service-1---ClusterIP -------10.0.52.61-----<none> ---------8125/UDP -------------------57m

my-service-2---LoadBalancer---10.0.165.41 ---<pending>----80:31330/TCP,443:30354/TCP ---46m

What you expected to happen:
No CSE errors and the service should bind to a given public IP resource with no errors.
How to reproduce it (as minimally and precisely as possible):
No real steps. Happens at random when the cluster attempts to scale and add a new VM

Anything else we need to know?:
AKS engine version is 0.28.1
Environment:

  • Kubernetes version (use kubectl version): 1.12.2
  • Size of cluster (how many worker nodes are in the cluster?) 31
  • General description of workloads in the cluster (e.g. HTTP microservices, Java app, Ruby on Rails, machine learning, etc.)
  • Others: Azure US Government environment

While I am tempted to say this looks like a duplicate of #802, I would appreciate another look.

@vijaygos vijaygos added the bug Something isn't working label Aug 26, 2019
@welcome
Copy link

welcome bot commented Aug 26, 2019

👋 Thanks for opening your first issue here! If you're reporting a 🐞 bug, please make sure you include steps to reproduce it.

@vijaygos

This comment has been minimized.

@CecileRobertMichon
Copy link
Contributor

@vijaygos are you sure these are logs from an instance with a failure? These logs look like a successful run to me. (Custom script finished successfully)

@vijaygos
Copy link
Author

@CecileRobertMichon , I can see the VMSS status as "Failed". However, I am not sure how to determine which is the "failed instance". Is there any way to find the failed instance from the portal?

@CecileRobertMichon
Copy link
Contributor

@vijaygos I don't think the portal shows which instance failed unfortunately. We have an issue open to improve extension logs to print the hostname but in the meantime there's no easy way to get the instance ID that I know of. See #1496. If you are able to repro the issue with scaling a few instances at a time that might be the easiest way to know which instance to get the logs from.

@CecileRobertMichon
Copy link
Contributor

Apparently you can also get to the extension status in the portal from the instance status:

MicrosoftTeams-image

@vijaygos
Copy link
Author

Ah! Thanks @CecileRobertMichon for pointing me to that. This is rather strange. I have looked at the status for all the 31 VMs in our cluster and they all show "ProvisioningState/succeeded". However, the VMSS Overview page shows a Failed status.
image

@chpspiir
Copy link

We are experiencing the exact same thing. Any updates on what could be wrong?

@rsingh612
Copy link

Also, Experiencing the same issue after updating ServicePrincipal credentials
'Provisioning of VM extension 'vmssCSE' has timed out. Extension installation may be taking too long, or extension status could not be obtained.'

@sylus
Copy link
Contributor

sylus commented Oct 16, 2019

Also getting this issue today though no changes from cluster perspective. We were getting a rate limit issue earlier.

@jackfrancis
Copy link
Member

Is there a common pattern in terms of cluster operations? Are these new cluster buildouts? Or a result of scale events on existing clusters?

@zachomedia
Copy link
Contributor

Answering for @sylus - it's an existing cluster. We haven't made changes to the scale set ourselves - the last scale operation was last week. We found the issue when we had teams reporting that pods with disks weren't coming up.

It now seems that it's unable to mount the disks because it can't unmount them from the old nodes.

We seemed to hit a subscription write limit earlier today - though I'm not sure if that's related to this issue or not (if it was retrying too often).

@MoergJ
Copy link

MoergJ commented Oct 17, 2019

We experience the same issue. Existing cluster, no changes regarding scale sets, but pods can't mount data disks and keep hanging in 'Init'.

@lorenzo-biava
Copy link
Member

Same here; it worked fine last evening and today no VM is able to attach any disks, displaying the Provisioning of VM extension 'vmssCSE' has timed out. failure state in the VMSS (the AKS cluster is in westeurope).

@sylus
Copy link
Contributor

sylus commented Oct 17, 2019

We filed an Azure support ticket and supposed to get a call back this morning will post back any info.

@rsingh612
Copy link

I was able to restart our VMSS nodes gracefully by running the following az cli command. Previously restarting the vmss node using the GUI also resulted in Provisioning of VM extension 'vmssCSE' has timed out.

Command:
az vmss update-instances -g {RESOUCE_GROUP} --name {VMSS_AGENTPOOL} --instance-id {INSTANCE_ID}

And the Cluster status resolved, All hung up Disks freed themselves and reattach without an issue.

@jarobar435
Copy link

Unfortunately, in my case, the outcome of the above wasn't any different than the GUI triggered process:

$ az vmss update-instances -g {group} --name {vmss} --instance-id {id}
Deployment failed. Correlation ID: {id}. Provisioning of VM extension 'vmssCSE' has timed out. Extension installation may be taking too long, or extension status could not be obtained.

@sharkymcdongles
Copy link

We have the same issue except it affects both vmssCSE and RunCommandLinux extensions. We didn't enable or disable any extensions. I tried to do the update command but it never completes successfully failing with this:

cli.azure.cli.core.util : Deployment failed. Correlation ID: abeff543-0a78-4c26-aa0f-dc2d01b71222. Provisioning of VM extension 'RunCommandLinux' has timed out. Extension installation may be taking too long, or extension status could not be obtained.
Deployment failed. Correlation ID: abeff543-0a78-4c26-aa0f-dc2d01b71222. Provisioning of VM extension 'RunCommandLinux' has timed out. Extension installation may be taking too long, or extension status could not be obtained.

I tried to reinstall the extension with force and it also failed. This is a critical bug and is affecting our production. How has this not been addressed? We don't need this extension. Nobody needs to run linux commands via a cli extension and this should not be there by default. It's a bad design period to force this on people.

@jackfrancis
Copy link
Member

How many folks on this thread can correlate these failures with active Azure API rate limiting events?

@rsingh612, were you able to determine operational differences between your original failure events, and your successful (if manual) az vmss update-instances remediation?

@sharkymcdongles
Copy link

@jackfrancis we saw this:

failed with failed to get azure instance id for node "aks-e16sv3-79416110-vmss00000w" (compute.VirtualMachineScaleSetVMsClient#Get: Failure responding to request: StatusCode=429 -- Original Error: autorest/azure: Service returned an error. Status=429 Code="OperationNotAllowed" Message="The server rejected the request because too many requests have been received for this subscription." Details=[{"code":"TooManyRequests","message":"{\"operationGroup\":\"GetVMScaleSetVM30Min\",\"startTime\":\"2019-10-18T08:32:36.6723441+00:00\",\"endTime\":\"2019-10-18T08:47:36.6723441+00:00\",\"allowedRequestCount\":2500,\"measuredRequestCount\":2843}","target":"GetVMScaleSetVM30Min"}] InnerError={"internalErrorCode":"TooManyRequestsReceived"})

I assumed what happened was a vm extension autoupdated leading to the vmss enterring a failed state. While in a failed state disk attach detach operations fail but will keep retrying. My hypothesis was maybe because of the failed state the disks retried triggering the api limit, but maybe that is an incorrect hypothesis.

@jackfrancis
Copy link
Member

@feiskyer @khenidak will cloudprovider backoff v2 help recover from these events?

@sylus
Copy link
Contributor

sylus commented Oct 18, 2019

We are now getting this on a completely different cluster that has been up for months after resolving issue in another cluster in separate subscription. Are we 100% there was no roll out change? It seems suspicious with all the people here having same problem and over in:

Azure/AKS#1278

Obviously something has had to change, is Microsoft looking into this?

It almost is if all the extension status for the VMSS have been lost all of a sudden.

Failed to upgrade virtual machine instance 'k8s-linuxpool1-13820493-vmss_1'. Error: Provisioning of VM extension 'vmssCSE' has timed out. Extension installation may be taking too long, or extension status could not be obtained.

On the instance itself however it shows this:

vmssCSE (Microsoft.Azure.Extensions.CustomScript, 2.0.7)
Provisioning succeeded
Info
ProvisioningState/succeeded

Can't even enable boot diagnostics, going to try to slowly replace each instance.

@sylus
Copy link
Contributor

sylus commented Oct 18, 2019

We were able to add a node and that process seems to have fixed the scaleset at least for now. ^_^

@devigned
Copy link
Member

We are actively working on a resolution for this issue.

@sharkymcdongles
Copy link

Adding a node didn't work for us sadly.

@devigned Any idea on when you might expect a resolution to this and whether or not it will involve upgrading or migrating to a new aks / nodepool?

@devigned
Copy link
Member

It seems like most of these issue are related to Azure API request limits being exceeded. It would be extremely helpful if everyone experiencing this issue could ping back with the following:

  • Number of nodes in your cluster?
  • Number of disks attached to those nodes?
  • Is cluster auto scaler enabled?
  • What operation did you run when this issue was triggered?
  • What version of aks-engine?
  • What version of K8s?

Thank you all. We will do our best to root cause this and get a fix out as soon as possible.

@sharkymcdongles
Copy link

Number of nodes in your cluster?
30

Number of disks attached to those nodes?
varies but we have 131 pvcs with 5 or so a nod
Is cluster auto scaler enabled?
no it is disabled
What operation did you run when this issue was triggered?
we didn't do anything. the cluster just stopped mounting pvcs and said there was an api limit. we then checked and saw the vmss was in a failed state due to vm extension RunCommandLinux and vmssCSE
What version of aks-engine?
aksEngineVersion : v0.39.2-aks

What version of K8s?
1.14.6

@devigned
Copy link
Member

devigned commented Oct 18, 2019

@sharkymcdongles give this a run please:

kubectl logs {{ your master named like: kube-controller-manager-k8s-master-XXXXXXX-0 }} \
-n kube-system | grep azure

Please report back any interesting rate limiting logs you find.

@sharkymcdongles
Copy link

@devigned this is aks, I dont have access to the masters.

this is the affected vmss in azure:

/subscriptions/63ffcd66-455e-4369-87e6-a4a55983f44e/resourceGroups/MC_prod-eu-mip-aks-rg_euprod_westeurope/providers/Microsoft.Compute/virtualMachineScaleSets/aks-e16sv3-79416110-vmss

and aks:

/subscriptions/63ffcd66-455e-4369-87e6-a4a55983f44e/resourcegroups/prod-eu-mip-aks-rg/providers/Microsoft.ContainerService/managedClusters/euprod

@zachomedia
Copy link
Contributor

@jackfrancis We upgraded to 1.15.5 with aks-engine 0.42.2 and enabled the v2 backoff. But it seems that when it hits a rate limit scenario it just hammers the Azure API and the only way to recover is to turn off controller-manager for a while to clear it

For example:

The server rejected the request because too many requests have been received for this subscription. (Code: OperationNotAllowed) {"operationgroup":"HighCostGetVMScaleSet30Min","starttime":"2019-10-23T14:18:11.960853+00:00","endtime":"2019-10-23T14:33:11.960853+00:00","allowedrequestcount":900,"measuredrequestcount":3157} (Code: TooManyRequests, Target: HighCostGetVMScaleSet30Min)

(We also hit the HighCostGetVMScaleSet3Min one too)

@AustinSmart
Copy link

Will the fixes being addressed over at the AKS issue be applicable here?

@sylus
Copy link
Contributor

sylus commented Oct 23, 2019

Yeah as @zachomedia said it appears worse and we barely get 15-20 mins of proper Kubernetes operation before rate limited again. @devigned @jackfrancis can we arbitratrily increase our limit?

Our clients are getting pretty insistent lol at the state of things this week and is putting a lot of pressure on us. Really worried will result in us having to move workloads somewhere else maybe not with VMSS but love the rolling upgrades so don't particularly want. Have been stalled for the past 2-3 days. We can maybe increase our support ticket to priority.

Note: Do seem to be able to find the disk potentially causing problem but lot of moving parts so trying to further isolate.

@devigned
Copy link
Member

/cc @aramase for insight into the Cloud Provider for Azure which is what's issuing the calls to ARM.

@jackfrancis
Copy link
Member

For folks who are having issues w/ VMSS CSE timeouts (not necessarily related to throttling), there has been an identified CSE bug being triaged. This bug correlates with folks experiencing this issue last week. (This CSE bug has nothing to do w/ AKS Engine's CSE script(s).)

If you have one or more VMSS instances in this state, please try manually re-imaging the instance. We've seen that workaround help restore nodes for folks.

And please report back if that unblocks you.

And apologies. :(

@zachomedia
Copy link
Contributor

Ok, so I managed to get our cluster back into a good state for now. Seems that running most operations that cause a disk unmount can trigger the problem again (@sylus can fill in more there)

Basically to find out which disk is stuck, I looked into the controller-manager logs and saw a bunch of:

I1023 20:02:29.782144       1 attacher.go:89] Attach volume "/subscriptions/$SUBSCRIPTION/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/disks/k8s-cancentral-01-development-dyna-pvc-$PVC_NAME" to instance "k8s-linuxpool1-12345678-vmss00000c" failed with disk(/subscriptions/$SUBSCRIPTION/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/disks/k8s-cancentral-01-development-dyna-pvc-$PVC_NAME) already attached to node(/subscriptions/$SUBSCRIPTION/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/virtualMachineScaleSets/k8s-linuxpool1-12345678-vmss/virtualMachines/k8s-linuxpool1-12345678-vmss_10), could not be attached to node(k8s-linuxpool1-12345678-vmss00000c)
I1023 20:02:29.782187       1 actual_state_of_world.go:322] Volume "kubernetes.io/azure-disk//subscriptions/$SUBSCRIPTION/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/disks/k8s-cancentral-01-development-dyna-pvc-$PVC_NAME" is already added to attachedVolume list to node "k8s-linuxpool1-12345678-vmss_10", update device path ""
E1023 20:02:29.782390       1 nestedpendingoperations.go:270] Operation for "\"kubernetes.io/azure-disk//subscriptions/$SUBSCRIPTION/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/disks/k8s-cancentral-01-development-dyna-pvc-$PVC_NAME\"" failed. No retries permitted until 2019-10-23 20:02:30.282260017 +0000 UTC m=+391.358857249 (durationBeforeRetry 500ms). Error: "AttachVolume.Attach failed for volume \"pvc-$PVC_NAME\" (UniqueName: \"kubernetes.io/azure-disk//subscriptions/$SUBSCRIPTION/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/disks/k8s-cancentral-01-development-dyna-pvc-$PVC_NAME\") from node \"k8s-linuxpool1-12345678-vmss00000c\" : disk(/subscriptions/$SUBSCRIPTION/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/disks/k8s-cancentral-01-development-dyna-pvc-$PVC_NAME) already attached to node(/subscriptions/$SUBSCRIPTION/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/virtualMachineScaleSets/k8s-linuxpool1-12345678-vmss/virtualMachines/k8s-linuxpool1-12345678-vmss_10), could not be attached to node(k8s-linuxpool1-12345678-vmss00000c)"
I1023 20:02:29.782447       1 event.go:258] Event(v1.ObjectReference{Kind:"Pod", Namespace:"adp", Name:"appname-7587887f5d-8pndl", UID:"2b777c28-859e-45cd-984f-af2734a436a5", APIVersion:"v1", ResourceVersion:"10305952", FieldPath:""}): type: 'Warning' reason: 'FailedAttachVolume' AttachVolume.Attach failed for volume "pvc-$PVC_NAME" : disk(/subscriptions/$SUBSCRIPTION/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/disks/k8s-cancentral-01-development-dyna-pvc-$PVC_NAME) already attached to node(/subscriptions/$SUBSCRIPTION/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/virtualMachineScaleSets/k8s-linuxpool1-12345678-vmss/virtualMachines/k8s-linuxpool1-12345678-vmss_10), could not be
attached to node(k8s-linuxpool1-12345678-vmss00000c)

To fix it, I ran az vmss list-instances --resource-group k8s-cancentral-01-dev-rg --name k8s-linuxpool1-12345678-vmss --query '[].[name, storageProfile.dataDisks[]]' | less, found the disk and got the lun number. I then ran az vmss disk detach --resource-group k8s-cancentral-01-dev-rg --vmss-name k8s-linuxpool1-12345678-vmss --instance-id 12 --lun 3 to detach it.

Once it was detached, the cluster slowly started to recover.

@sylus
Copy link
Contributor

sylus commented Oct 24, 2019

Kubernetes Version: v1.15.5 (1 master, 4 nodes)
AKS Engine: v0.42.2
Kernel Version: 5.0.0-1023-azure
OS Image: Ubuntu 18.04.3 LTS
Date Issue Started: Last week (Oct 15, 2019) on clusters that running w/out issue for 3 months

Note: We don't seem to have the CSE issue anymore but still keeping this here for the moment since all seems related. We did also add as @jackfrancis indicated cloudProviderBackoffMode: v2 and removed the cloudProviderBackoffExponent and cloudProviderBackoffJitter.

Re: The comment above. The disk (un)mount does seem to be the problem. I was able to reproduce a base case by just deleting a few pods and / or running a helm upgrade deployment and then this started to trigger the following errors related to a disk unmount almost right away.

I1023 23:49:18.644247       1 pv_controller.go:1270] isVolumeReleased[pvc-b2220d20-9e95-4973-923d-95cc6e49ff4c]: volume is released
I1023 23:49:18.644258       1 pv_controller.go:1270] isVolumeReleased[pvc-653fdf5a-7408-460e-89d3-fdd0a6dd5fdf]: volume is released
E1023 23:49:19.821907       1 goroutinemap.go:150] Operation for "delete-pvc-b2220d20-9e95-4973-923d-95cc6e49ff4c[52a16d3e-b5dd-4cc1-a64e-b03f6d61948b]" failed. No retries permitted until 2019-10-23 23:51:21.821868366 +0000 UTC m=+12222.629161185 (durationBeforeRetry 2m2s). Error: "compute.DisksClient#Delete: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code=\"OperationNotAllowed\" Message=\"Disk k8s-cancentral-01-development-dyna-pvc-b2220d20-9e95-4973-923d-95cc6e49ff4c is attached to VM /subscriptions/$SUBSCRIPTION/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/virtualMachineScaleSets/k8s-linuxpool1-28391316-vmss/virtualMachines/k8s-linuxpool1-28391316-vmss_12.\""
E1023 23:49:19.826714       1 goroutinemap.go:150] Operation for "delete-pvc-653fdf5a-7408-460e-89d3-fdd0a6dd5fdf[4c641582-1171-4c98-8189-29185623fc1c]" failed. No retries permitted until 2019-10-23 23:51:21.826677075 +0000 UTC m=+12222.633969894 (durationBeforeRetry 2m2s). Error: "compute.DisksClient#Delete: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code=\"OperationNotAllowed\" Message=\"Disk k8s-cancentral-01-development-dyna-pvc-653fdf5a-7408-460e-89d3-fdd0a6dd5fdf is attached to VM /subscriptions/$SUBSCRIPTION/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/virtualMachineScaleSets/k8s-linuxpool1-28391316-vmss/virtualMachines/k8s-linuxpool1-28391316-vmss_12.\""

The suspicion is as we have multiple teams (re)-deploying their apps etc that enough of these disk failure issues eventually make us hit the rate limits set by Azure. Therefore other operations against the VMSS don't succeed once it gets into this state. Then as mentioned above we need to stop the controller-manager pod for a while to clear the rate limit as illustrated below.

The server rejected the request because too many requests have been received for this subscription. (Code: OperationNotAllowed) {"operationgroup":"HighCostGetVMScaleSet30Min","starttime":"2019-10-23T14:18:11.960853+00:00","endtime":"2019-10-23T14:33:11.960853+00:00","allowedrequestcount":900,"measuredrequestcount":3157} (Code: TooManyRequests, Target: HighCostGetVMScaleSet30Min)

In the controller-manager it will list the PVC and instance ID of the disk that can't be unattached and we can use the PVC number to find the LUN using the command below.

az vmss list-instances --resource-group k8s-cancentral-01-dev-rg --name k8s-linuxpool1-12345678-vmss --query '[].[name, storageProfile.dataDisks[]]' 

We then have to for ALL disks listed in controller manager that have this problem need to run the following.

az vmss disk detach --resource-group k8s-cancentral-01-dev-rg --vmss-name k8s-linuxpool1-12345678-vmss --instance-id $ID --lun $LUN

The cluster is then now back in a working state until the next deployment which will trigger the PVC issue and then rinse / repeat ^_^

Related Issues

a) All this is explained in further detail over at AKS issue by Microsoft: Azure/AKS#1278 (comment)

b) Additionally this looks directly related as well although we have a order of magnitude smaller cluster: kubernetes-sigs/cloud-provider-azure#247

@jackfrancis
Copy link
Member

@sylus @zachomedia in your failure scenarios, are you ever encountering this error:

Cannot attach data disk '<disk_id>' to VM '<vmss_id>' because the disk is currently being detached or the last detach operation failed. Please wait until the disk is completely detached and then try again or delete/detach the disk explicitly again."

And if so, are you observing that the disk it's complaining about seems, in fact, to be totally unattached?

We encountered this with another customer, and were able to follow your guidance to manually detach the offending disk (even though it wasn't attached to anything! — we detached it from the vmss instance id that it was trying to attach itself to 🤷‍♂ ).

In any event, FYI for folks continuing to struggle with this.

@zachomedia
Copy link
Contributor

@jackfrancis Yeah, we've seen that error in our logs too. I don't think we've ever checked if it was actually attached or not, usually we just detach it through the cli.

We did have a weird state today where apparently one of our instances had a disk attached that no longer existed so all other disk operations failed. Once we removed that attachment, it recovered.

@jackfrancis
Copy link
Member

@zachomedia How did you get the lun number in the case where the disk is not actually attached to any VMSS instances?

In our troubleshooting the following command didn't yield the lun in such a scenario:

az vmss list-instances --resource-group <rg> --name <vmss_name> --query '[].[name, storageProfile.dataDisks[]]'

@zachomedia
Copy link
Contributor

@jackfrancis Oh, I see, for all of our cases the disk was in the list. So I guess that means it was attached.

@jackfrancis
Copy link
Member

@sylus @zachomedia do you think this is an appropriate repro?:

  1. Install this statefulset on a cluster w/ 50 VMSS nodes:
$ cat statefulset.yaml 
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: nginx
  serviceName: "nginx"
  podManagementPolicy: "Parallel"
  replicas: 50
  template:
    metadata:
      labels:
        app: nginx
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "managed-standard"
      resources:
        requests:
          storage: 1Gi

(note the ratio of nodes-replicas is 1:1)

  1. When pods come online, start deleting them
  2. Statefulset reconciliation will spin up more pods to fulfill replica count

I wonder if the above will induce the weird zombie detach state we're seeing.

@zachomedia
Copy link
Contributor

@jackfrancis

I would say that you should have a pretty reasonable chance at reproducing the issue with that setup. Our cluster is much smaller (about 5 nodes) and usually just a couple of pods with disks being deleted can trigger it.

@jackfrancis
Copy link
Member

@zachomedia Node count is static? I.e., disk re-attachment operations aren't happening as a result of underlying VMSS instances disappearing, re-appearing, etc?

@zachomedia
Copy link
Contributor

@jackfrancis That's correct, node count is static.

@jackfrancis
Copy link
Member

(So far unable to repro, but will keep trying.)

It's also possible that disk detach/attach operations during throttle events are the edge case causing this behavior (my test cluster is not being actively throttled atm).

@zachomedia
Copy link
Contributor

@jackfrancis So something you can try.. Seems most of our problems now stem from PVCs being deleted (one of our teams deletes their deployments and re-recreates it right now). We seem to get two things:

  1. A disk is still attached to a VMSS instance but apparently no longer exists. No disk operations succeed until we detach that disk (the error appears when we attempt to manually detach a different disk that was failing - sorry, I forgot to save the message)
  2. Some deletes seem to never unmount the disk (maybe it passed some timeouts and gave up trying to unmount). See a bunch of:
I1025 15:21:21.911563       1 pv_controller.go:1270] isVolumeReleased[pvc-PVC1]: volume is released
I1025 15:21:21.911914       1 pv_controller.go:1270] isVolumeReleased[pvc-PVC2]: volume is released
I1025 15:21:21.913827       1 pv_controller.go:1270] isVolumeReleased[pvc-PVC3]: volume is released
E1025 15:21:23.028284       1 goroutinemap.go:150] Operation for "delete-pvc-PVC2[5dcfd7e1-e61b-4206-9cac-6e7758e3366c]" failed. No retries permitted until 2019-10-25 15:23:25.028194036 +0000 UTC m=+154545.835486755 (durationBeforeRetry 2m2s). Error: "compute.DisksClient#Delete: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code=\"OperationNotAllowed\" Message=\"Disk k8s-cancentral-01-development-dyna-pvc-PVC2 is attached to VM /subscriptions/<subscription>/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/virtualMachineScaleSets/k8s-linuxpool1-12345678-vmss/virtualMachines/k8s-linuxpool1-12345678-vmss_12.\""
E1025 15:21:23.029986       1 goroutinemap.go:150] Operation for "delete-pvc-PVC3[3507cbda-868e-4198-b771-620234d258b5]" failed. No retries permitted until 2019-10-25 15:23:25.029908242 +0000 UTC m=+154545.837201061 (durationBeforeRetry 2m2s). Error: "compute.DisksClient#Delete: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code=\"OperationNotAllowed\" Message=\"Disk k8s-cancentral-01-development-dyna-pvc-PVC3 is attached to VM /subscriptions/<subscription>/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/virtualMachineScaleSets/k8s-linuxpool1-12345678-vmss/virtualMachines/k8s-linuxpool1-12345678-vmss_12.\""
E1025 15:21:23.069595       1 goroutinemap.go:150] Operation for "delete-pvc-PVC1[4d9dbfa3-1dba-4478-9eae-f475d033257a]" failed. No retries permitted until 2019-10-25 15:23:25.069557061 +0000 UTC m=+154545.876849880 (durationBeforeRetry 2m2s). Error: "compute.DisksClient#Delete: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code=\"OperationNotAllowed\" Message=\"Disk k8s-cancentral-01-development-dyna-pvc-PVC1 is attached to VM /subscriptions/<subscription>/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/virtualMachineScaleSets/k8s-linuxpool1-12345678-vmss/virtualMachines/k8s-linuxpool1-12345678-vmss_12.\""

@andyzhangx
Copy link
Contributor

@zachomedia You need to drain k8s-linuxpool1-12345678-vmss_12 and then reimage k8s-linuxpool1-12345678-vmss_12 to clean that vmss instance state as a workaround.

@sylus
Copy link
Contributor

sylus commented Oct 29, 2019

Problems again today with 5 of our 6 VM's in a failed state and can't even reimage as get this error. I also did launch a few AKS clusters over the weekend and as soon as turned off over night, they all are in failed state with disk issues. Really hoping a fix is forthcoming, this is plainly reproducible.

Failed to reimage virtual machine instances k8s-linuxpool1-12345678-vmss_12, k8s-linuxpool1-12345678-vmss_10, k8s-linuxpool1-12345678-vmss_9, k8s-linuxpool1-12345678-vmss_11. Error: The processing of VM 'k8s-linuxpool1-12345678-vmss_10' is halted because of one or more disk processing errors encountered by VM 'k8s-linuxpool1-12345678-vmss_12' in the same Availability Set. Please resolve the error with VM 'k8s-linuxpool1-12345678-vmss_12' before retrying the operation.

@jackfrancis
Copy link
Member

@sylus is the StatefulSet spec here not a viable repro input for inducing this symptom on a test cluster?

#2221

As the issue describes I was able to witness some badness (described in the issue w/ the working remediation steps I came up with at the time), but I haven't been able to reliably repeat all the badness so many folks are seeing now. Would love to get that repro process so that we can more effectively help drive fixes.

Thanks for hanging in there. :/

@craiglpeters
Copy link
Contributor

A VMSS bug in the update on Oct 17 was identified and remediated globally over Oct 28 and 29. Disks should no longer be stuck in the 'detaching' state in VMSS, and so any Kubernetes operations should now be able to proceed without running into this issue. If you observe any new instances of this same problem please reopen this bug and I'll work to determine the cause.

@mbarry-msdn
Copy link

We're getting error with version 1.22.6: "MessageProvisioning of VM extension vmssCSE has timed out. Extension provisioning has taken too long to complete. The extension last reported "Plugin enabled". More information on troubleshooting is available at https://aka.ms/VMExtensionCSELinuxTroubleshootTimeMonday,"

@Lu1234
Copy link

Lu1234 commented Nov 3, 2022

We're getting error with version 1.22.6: "MessageProvisioning of VM extension vmssCSE has timed out. Extension provisioning has taken too long to complete. The extension last reported "Plugin enabled". More information on troubleshooting is available at https://aka.ms/VMExtensionCSELinuxTroubleshootTimeMonday,"

I'm facing the same error while nothing being changed to the cluster,really need some advice with this.

@sharkymcdongles
Copy link

sharkymcdongles commented Nov 3, 2022 via email

@CecileRobertMichon
Copy link
Contributor

Hi all, I suggest opening a new issue in https://github.com/Azure/AKS/issues with details of the problem/error you are facing.

I want to make sure you're getting the help you need. This is a closed issue from 3 years ago in a deprecated project (https://github.com/Azure/aks-engine#project-status) so commenting on here likely won't get the right people to look into it.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests