Disk attachment/mounting problems, all pods with PVCs stuck in ContainerCreating #1278
Comments
I'm also seeing this in west-europe. I can occasionally get them to connect over time, but it's hit or miss. Also using a VMSS backed AKS cluster |
I have a similar problem. |
I'm using a VMSS cluster, 1.14.6, australia south east. I have this same issue trying to start the redis helm chart. I made a simple yaml and also get the same problem.
|
The same problem appears in europe-north, after PVCs were deleted and then reinstantiated |
Same issue: |
Also have this issue. Discussed there: Azure/aks-engine#1860 but was sent back here. Number of nodes in your cluster? Number of disks attached to those nodes? What version of K8s? |
This could be a VMSS issue due to rate limiting; could you try updating the scale set VM manually:
|
Even after running the command, it's the same issue. |
@jayakishorereddy could you |
@andyzhangx I tried this and it led to an internal execution error and couldn't complete. The cluster has only one VMSS, and only 4 nodes show this error and are in a failed state. When pods are moved off the failed nodes, the other nodes cannot mount the disks: Azure throws a multi-attach error and times out, saying the disk is still attached to the failed node, even though the portal shows no such attachment anywhere and the node is now deallocated. After deallocating, I am unable to start the node again, as it fails with an internal execution error. |
Here is the
|
Please lemme know ASAP because our cluster is in a bad state because of this. |
I deleted the failed nodes now manually and was able to at least get the vmss update command to run through. Disks that were on the drained and deleted nodes still show as attached to them despite them not existing. I was able to fix a few by running:
|
Have the same issue with Prometheus. But the solution to delete the PVC doesn't help.
With PVC everything is ok. |
To update on the issue I was facing, first I wanted to mention that I noticed that all error messages involved a single volume.
This led me to understand that there was a single disk culprit that I should try to detach manually. I couldn't find this GUID anywhere (looking in
Between each step, I tried killing the two remaining
I suspect it was the second Azure-portal upgrade of the VM instance that helped - either that or scaling back up after scaling down (I started doing so hoping to drain the original node, but ended up not needing to). One weird thing that happened with respect to the upgrade is that after the first and second upgrade, Azure portal reported that the sole instance of the VMSS (
I would conclude that a workaround for this issue, for my case, might be (This is all still voodoo):
I'm sure some steps here are extraneous, and I don't know if it'll really work the next time I encounter this problem, but it's worth writing it down in hopes that it will help me or someone else in the future... This, of course, doesn't solve the issue, as it doesn't explain how we got here in the first place. And, truthfully, having to scale down + back up is very uncomfortable. Better than losing the PVCs, but still not good enough. Would be happy to receive any updates regarding this issue (will upgrading to the newer 1.15 preview version of Kubernetes work?). |
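The observation that every error message pointed at a single volume can be checked mechanically rather than by eye. Below is a minimal Python sketch, assuming the pod events use the Azure wording reported in this thread ("Cannot attach data disk '<disk>' to VM '<vm>' ..."). The function name and the sample messages are hypothetical placeholders, not real cluster output.

```python
import re
from collections import Counter

# Pattern based on the Azure error wording reported in this thread.
ATTACH_ERROR = re.compile(
    r"Cannot attach data disk '(?P<disk>[^']+)' to VM '(?P<vm>[^']+)'"
)


def disks_in_attach_errors(event_messages):
    """Count how often each disk name appears in failed-attach messages."""
    counts = Counter()
    for message in event_messages:
        match = ATTACH_ERROR.search(message)
        if match:
            counts[match.group("disk")] += 1
    return counts


# Hypothetical sample messages; disk and VM names are placeholders.
sample_events = [
    "Cannot attach data disk 'disk-aaaa' to VM 'aks-nodepool1-vmss_0' "
    "because the disk is currently being detached or the last detach operation failed.",
    "Cannot attach data disk 'disk-aaaa' to VM 'aks-nodepool1-vmss_1' "
    "because the disk is currently being detached or the last detach operation failed.",
    "some unrelated event message",
]

if __name__ == "__main__":
    for disk, count in disks_in_attach_errors(sample_events).most_common():
        print(f"{disk}: {count} failed attach attempts")
```

If only one disk name comes back, that disk is the first candidate for a manual detach.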
@kwikwag sounds super simple and straightforward and 100% production ready.
:|
…On Sun, 20 Oct 2019, 6:04 PM kwikwag ***@***.***> wrote:
To update on the issue I was facing, first I wanted to mention that I
noticed that all error messages involved a single volume.
Cannot attach data disk 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' to VM 'aks-nodepool1-xxxxxxxx-vmss_0' because the disk is currently being detached or the last detach operation failed. Please wait until the disk is completely detached and then try again or delete/detach the disk explicitly again.
This led me to understand that there was a single disk culprit that I
should try to detach manually. I couldn't find this GUID anywhere (looking
in az disk list -g MC_xxx and az vmss show -g MC_xxx -n
aks-nodepool1-xxxxxxxx-vmss --instance-id 0 --query storageProfile.dataDisks;
BTW the command from the docs
<https://github.com/MicrosoftDocs/azure-docs/blob/master/articles/virtual-machine-scale-sets/tutorial-use-disks-cli.md#list-attached-disks>,
gave me an empty list, until I explicitly queried the instance with
--instance-id). However, two disks belonging to the Zookeeper (identified
by their Azure tags) showed up as Attached in the MC_ resource group.
Since all StatefulSets were scaled down (except for one, which wasn't
Zookeeper, and whose pods were still stuck on ContainerCreating), I
figured detaching them manually would be safe (and might help). That didn't
do the trick, but it got me one step forward (finally, something finished
successfully!) and set me on the right path. Here's a log of what I did
after, with (approximate) times:
- 15:42 detach disks with Azure CLI
- 16:14 manual VMSS upgrade Azure CLI
- 16:25 restart VMSS via the Azure portal
- 16:28 cluster same-version upgrade
- 16:31 manual VMSS VM instance upgrade Azure portal
- 16:44 scaled up K8S cluster from 1 to 2
Between each step, I tried killing the two remaining StatefulSet-related
pods to allow them to re-attach. At 16:47, the pods finally came
out of ContainerCreating and I saw Running for the first time in ages...
After scaling up all StatefulSets, everything slowly started going back to
normal.
I suspect it was the second Azure-portal upgrade of the VM
instance that helped - either that or scaling back up after scaling down (I
started doing so hoping to drain the original node, but ended up not needing
to). One weird thing that happened with respect to the upgrade is that
after the first and second upgrade, Azure portal reported the sole
instance of the VMSS (Standard_DS3_v2 size) to be running the "Latest
model", but after things started running again (possibly only after
scaling?) "Latest model" showed "No".
I would conclude that a workaround for this issue, for my case, might be
(This is all still voodoo):
1. Scale down all StatefulSets to 0 (kubectl -n namespace scale
--all=true statefulset --replicas=0 for each namespace)
2. Scale down to 1 node (az aks scale -g xxx --name xxx-k8s
--node-count 1)
3. Ensure all VMSS disks are detached:
3.1. List attached volumes with az disk list -g MC_xxx --query
"[?diskState=='Attached'].name"
3.2. Cross-reference the LUNs with az vmss show -g MC_xxx -n
aks-nodepool1-xxxxxxxx-vmss --instance-id 0 --query
"storageProfile.dataDisks[].{name: name, lun: lun}"
3.3. Detach them with az vmss disk detach -g MC_xxx -n aks-nodepool1
--instance-id 0 --lun x (for each LUN).
4. Update the node again (az vmss update-instances -g MC_xxx --name
aks-nodepool1-xxxxxxxx-vmss --instance-id 0)
5. Perform a forced same-version upgrade (az aks upgrade -g xxx --name
xxx-k8s --kubernetes-version 1.14.6)
6. Update this node again (az vmss update-instances -g MC_xxx --name
aks-nodepool1-xxxxxxxx-vmss --instance-id 0)
7. Scale the StatefulSets back up (kubectl -n namespace scale --all=true
statefulset --replicas=x)
I'm sure some steps here are extraneous, and I don't know if it'll really
work the next time I encounter this problem, but it's worth writing it down
in hopes that it will help me or someone else in the future...
This, of course, doesn't solve the issue, as it doesn't explain how we got
here in the first place. And, truthfully, having to scale down + back up is
very uncomfortable. Better than losing the PVCs, but still not good enough.
Would be happy to receive any updates regarding this issue (will upgrading
to the newer 1.15 preview version of Kubernetes work?).
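Steps 3.1 to 3.3 above amount to joining two lists: the disk names Azure reports as Attached, and the name/LUN pairs on the VMSS instance. The Python sketch below assumes you have saved the JSON printed by the two az commands from those steps. The function name and the sample values are hypothetical. It only prints the detach command from step 3.3 and runs nothing.

```python
import json


def cross_reference(attached_names, instance_data_disks):
    """Return (LUNs to detach on this instance, attached disks not found on it)."""
    lun_by_name = {disk["name"]: disk["lun"] for disk in instance_data_disks}
    luns = sorted(lun_by_name[name] for name in attached_names if name in lun_by_name)
    unmatched = sorted(name for name in attached_names if name not in lun_by_name)
    return luns, unmatched


# Hypothetical stand-ins for the JSON printed by the commands in steps 3.1 and 3.2.
ATTACHED_NAMES_JSON = '["pvc-zk-0", "pvc-zk-1", "pvc-other"]'
INSTANCE_DISKS_JSON = '[{"name": "pvc-zk-0", "lun": 0}, {"name": "pvc-zk-1", "lun": 2}]'

if __name__ == "__main__":
    luns, unmatched = cross_reference(
        json.loads(ATTACHED_NAMES_JSON), json.loads(INSTANCE_DISKS_JSON)
    )
    for lun in luns:
        # Echoes the step 3.3 command with the LUN filled in; nothing is executed.
        print(f"az vmss disk detach -g MC_xxx -n aks-nodepool1 --instance-id 0 --lun {lun}")
    for name in unmatched:
        print(f"reported Attached but not listed on this instance: {name}")
```

Disks that land in the second list are the ones worth raising with support: they are reported as Attached but do not appear on the instance you queried.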
|
Yeah I'm about to put a bunch of aks into prod. Not inspiring any confidence. Support ticket every week with some random failure |
There are a few different possible issues in play here. A few of these clusters are being throttled by the VMSS API (see below), which can cause this; you might need to update to the latest k8s patch versions. I see some of you have opened tickets already; could you share the numbers? Jorge.Palma [at] microsoft.com
Details=[{"code":"TooManyRequests","message":"{"operationGroup":"GetVMScaleSetVM30Min","startTime":"2019-10-18T08:32:36.6723441+00:00","endTime":"2019-10-18T08:47:36.6723441+00:00","allowedRequestCount":2500,"measuredRequestCount":2843}","target":"GetVMScaleSetVM30Min"}] InnerError={"internalErrorCode":"TooManyRequestsReceived"}) |
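To see quickly whether a cluster is hitting the same throttling, the request counts can be pulled out of an error like the one above. This is a minimal Python sketch, assuming the error text carries the operationGroup, allowedRequestCount and measuredRequestCount fields exactly as shown in this comment. The function name is hypothetical, and the sample is an abridged copy of that error.

```python
import re


def parse_throttle_details(error_text):
    """Extract the throttling counters from a TooManyRequests error string.

    Returns None when the counters are not present. Quotes may or may not be
    backslash-escaped in the raw log, so both forms are accepted.
    """
    def int_field(name):
        match = re.search(r'\\?"' + re.escape(name) + r'\\?"\s*:\s*(\d+)', error_text)
        return int(match.group(1)) if match else None

    allowed = int_field("allowedRequestCount")
    measured = int_field("measuredRequestCount")
    if allowed is None or measured is None:
        return None

    group = re.search(r'\\?"operationGroup\\?"\s*:\s*\\?"([^"\\]+)', error_text)
    return {
        "operation_group": group.group(1) if group else None,
        "allowed": allowed,
        "measured": measured,
        "over_limit_by": measured - allowed,
    }


# Abridged copy of the error quoted in this comment.
SAMPLE = (
    'Details=[{"code":"TooManyRequests","message":"{"operationGroup":"GetVMScaleSetVM30Min",'
    '"allowedRequestCount":2500,"measuredRequestCount":2843}"}]'
)

if __name__ == "__main__":
    print(parse_throttle_details(SAMPLE))
```

A regular expression is used instead of json.loads because the message is JSON nested inside a string, and the quoting is not reliable once it has passed through logs or a web page.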
@palma21 119101823000010 |
@CecileRobertMichon have you noticed any useful info from the above
|
Detaching the problem PVC disk is always OK; AKS will retry attaching it in a loop when the pod is scheduled on the new node. |
After migrating the entirety of production via velero to a 1.15.4 cluster it seems to be okay for now in the new 1.15.4 cluster. No idea how long this will last though. It'd be nice to know if we can expect this to happen again or if a resolution is found. I am tempted to set all vmextensions for my vmss to autoupdate false after this as I suspect that was the root cause. |
Can we get some sort of statement from Azure on what is going on? I think this issue is happening to enough people to warrant some sort of explanation. I am more than a bit concerned about this, though obviously I know you guys are busy ^_^ |
It is pretty clear that scale sets are completely broken. We have now had this issue on every cluster across different subscriptions, and it even affects VMSS with Windows nodes, not just Linux. I am super worried that it has been about a week now with no clear answer as to what is going on or what changed. Obviously I know the Azure folks are doing their best, I'm just a bit concerned :) |
I provided a preliminary explanation above; without knowing your clusters and looking into them, I can't make a definitive one. Could you reach out (options provided above) with your cluster details/case numbers so I can check whether that is happening in your case as well and provide the needed recommendations/fixes? |
I still have both broken clusters provisioned in my subscription if anyone wants to take a look. @palma21 |
I do, please open a ticket and send me the ticket number and I will take a look. |
Closing now that all fixes from #1278 (comment) are rolled out and the issue has been pinned for a week, to avoid mixing in any new questions/issues. |
Doesn't seem to be resolved on my AKS cluster v1.15.5.
Is there any follow-up issue since this one was closed? |
pls file a support ticket for this issue, per the events provided, I don't see there is |
@andyzhangx After waiting an hour or so, this is the output:
The whole cluster currently only has 3 PVCs, while 1 is unmounted (just used by a cronjob), and two should be mounted by a Pod each, but both are stuck. |
Coming from 1.14.6, I upgraded two of my (luckily test) clusters today to 1.15.7 and started having these errors. No changes were made to the deployed manifests.
|
@stromvirvel could you run the following command? It looks like one of your VMSS instances is in a limbo state:
|
@skinny looks like there is disk detach issue when upgrade from 1.14.6 to 1.15.7, could you find which vm is |
Surprisingly the disk shows no attachment to a VM in the Azure portal. I did try to run the I have tried it several times by deleting the pod, scaling the Statefulset and waiting for periods between a few minutes and over an hour but still the same result |
@skinny then could you run following command to update
|
Ok, ran it for all four nodes in the cluster :
No direct change visible, but do I need to redeploy the Pod ? Current pod events :
|
@skinny you could wait, it would retry attach, and if you want to make it faster, redeploy would be better, thanks. |
Unfortunately the same timeout error appears some two minutes after redeployment. Will leave it at this for another bit. Last week I had the same kind of issues on another (1.14.x) cluster which I resolved with a lot of these manual steps. I hoped upgrading to 1.15.7 would finally solve these issues but this time even the manual steps are not helping |
@skinny I was a little wrong, this
Update: it could not even get to the disk attach process due to that issue. Are you trying to run two pods with the same disk PVC? Could you file a new issue and paste your |
@andyzhangx That Multi-Attach error is always followed by a timeout message a couple of minutes later. Is there a way to clean this up ? |
@skinny |
@andyzhangx thanks, I will delete the deployment for now and let it "rest" for a bit. Tonight I'll deploy the statefulset again and see if it works then.
|
@skinny I suspect one of your node status of
Update: if you found one volume should not be in the |
@andyzhangx It's getting weirder by the minute
|
I missed your answer - anyway meanwhile I decided to destroy the whole cluster and recreate it from scratch, because there were a lot of disks (PVCs, not OS disks) still shown in the portal, sometimes in attached, sometimes in detached state, even though I entirely removed all PVCs in all namespaces and all PVs. |
@skinny if it's already attached to one node successfully, why there are second attach which lead to |
@andyzhangx nope, deleted all deployments/pod and only left the PVC intact. The grep for attached disks comes up empty before I try to use the PVC. Then for a few moments the disk actually is showing in the “attachedVolumes” output before disappearing again and showing the multi attach error in the events history |
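The check described here, watching which node reports a volume as attached, can be scripted against the output of kubectl get nodes --output json. Below is a minimal Python sketch that reads the status.volumesInUse and status.volumesAttached fields of each Node object and flags any volume that more than one node claims. The function names and the sample data are hypothetical placeholders.

```python
import json
from collections import defaultdict


def volumes_by_node(nodes_json):
    """Map node name -> volumes in use and volumes attached, from a NodeList JSON."""
    report = {}
    for node in json.loads(nodes_json)["items"]:
        status = node.get("status", {})
        report[node["metadata"]["name"]] = {
            # Both fields may be missing or null when nothing is mounted.
            "in_use": list(status.get("volumesInUse") or []),
            "attached": [v["name"] for v in status.get("volumesAttached") or []],
        }
    return report


def multiply_attached(report):
    """Return volumes that more than one node reports as attached."""
    owners = defaultdict(list)
    for node_name, volumes in report.items():
        for volume in volumes["attached"]:
            owners[volume].append(node_name)
    return {volume: nodes for volume, nodes in owners.items() if len(nodes) > 1}


# Hypothetical sample shaped like `kubectl get nodes --output json`.
SAMPLE_NODES_JSON = json.dumps({
    "items": [
        {"metadata": {"name": "node-0"},
         "status": {"volumesInUse": None,
                    "volumesAttached": [{"name": "disk-a", "devicePath": "0"}]}},
        {"metadata": {"name": "node-1"},
         "status": {"volumesInUse": ["disk-a"],
                    "volumesAttached": [{"name": "disk-a", "devicePath": "1"}]}},
        {"metadata": {"name": "node-2"}, "status": {}},
    ]
})

if __name__ == "__main__":
    report = volumes_by_node(SAMPLE_NODES_JSON)
    for volume, nodes in multiply_attached(report).items():
        print(f"{volume} is reported attached on: {', '.join(nodes)}")
```

Running this in a loop while redeploying the pod would show whether the disk briefly appears on one node and then vanishes, as described above.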
We are also encountering these issues listed above. (aks 1.14.8 with vm scaleset) Support request: 120022522001011 |
I had exactly the same problem, but was able to solve it with the following steps:
Cheers |
Hi, this leads to deletion of data, right? I'm facing the same issue in version 1.16.9 and need the data on those disks back. Please suggest any other option. |
Yes, it will delete all the data in your PVC. Cheers, |
Please provide your k8s version and pod events; the original disk attaching issue on AKS was fixed around Dec. 2019. |
Currently running in 1.16.9 |
What happened:
Pods with PVCs are stuck in ContainerCreating state, due to a problem with attachment/mounting.
I am using a VMSS-backed westus-located K8S (1.14.6; aksEngineVersion : v0.40.2-aks) cluster. Following a crash of the Kafka pods (using Confluent helm charts v5.3.1; see configuration below, under Environment), 2 of the 3 got stuck in the ContainerCreating state. The dashboard seems to show that all the PVCs are failing to mount because of one volume that has not been detached properly:
Running kubectl get pvc shows the PVC in Bound state (full YAML-JSON from Dashboard below in Environment):
I tried scaling the Kafka StatefulSet down to 0, then waiting a long while, then scaling back to 3, but they didn't recover.
Then I tried to scale all Deployments and StatefulSets down to 0, and do a same-version upgrade of the K8S cluster. Unfortunately, because of a problem (reported here) with the VMAccessForLinux extension I installed on the VMSS (following this guide to update SSH credentials on the nodes), the upgrade failed, 2.5 hours later, and the cluster remained in a Failed state. Now all of the pods with PVCs got stuck in ContainerCreating. I tried adding a second nodepool successfully, but pods placed on the new nodes still reported the same error, so I removed the second nodepool and scaled down the first nodepool to 1. I then tried to reboot the node using the Azure portal and from within an SSH connection. They all fail because of the issue with the extension. I then tried to gradually scale down all StatefulSets (I had to uninstall the prometheus-operator helm since it insisted on scaling the alertmanager StatefulSet back up), and enable only the logging StatefulSets, as they are smaller. It didn't help.
After taking down all StatefulSets, when running kubectl get nodes --output json | jq '.items[].status.volumesInUse' I get null.
What you expected to happen:
Pods with PVCs should start normally, and if mounting fails, it should eventually (and somewhat quickly) retry and succeed.
How to reproduce it (as minimally and precisely as possible):
I have no idea. This happens randomly.
Up to now, we have worked around it by removing our PVCs, but I don't want to do this any more, I need a solution.
Anything else we need to know?:
This is similar to the following issues, reported on Kubernetes and AKS. All of them have been closed, but none with a real solution AFAIK.
I replaced the GUIDs to anonymize the logs, but did so in a way that keeps distinct GUIDs distinct.
Environment:
kubectl version): VMSS-backed westus-located K8S (1.14.6; aksEngineVersion : v0.40.2-aks)
kubectl get pvc xxx --output json):