Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: detach azure disk issue using dangling error #81266

Merged
merged 2 commits into from
Aug 20, 2019

Conversation

andyzhangx
Copy link
Member

@andyzhangx andyzhangx commented Aug 11, 2019

What type of PR is this?
/kind bug

What this PR does / why we need it:
fix: detach azure disk issue using dangling error

I think this PR could fix most of the disk attach issue due to detach disk not succeed(could be due to many reasons), this PR could do the recover automatically, I will cherry pick the fix to all old releases.

Actually this PR fixed two issues:

  • disk not found issue when detaching an azure disk
    This is fixed by compare disk URI with case insensitive (strings.EqualFold), error logs are like following(without this PR):
azure_controller_standard.go:134] detach azure disk: disk  not found, diskURI: /subscriptions/xxx/resourceGroups/andy-mg1160alpha3/providers/Microsoft.Compute/disks/xxx-dynamic-pvc-41a31580-f5b9-4f08-b0ea-0adcba15b6db
  • azure disk could not be attached to another node since disk is still attached to the original node
    This is fixed by return dangling error, success events logs are like following:
Events:
  Type     Reason                  Age                    From                               Message
  ----     ------                  ----                   ----                               -------
  Normal   Scheduled               <unknown>              default-scheduler                  Successfully assigned default/nginx-azuredisk to k8s-agentpool-32686255-1
  Warning  FailedAttachVolume      2m46s (x7 over 3m18s)  attachdetach-controller            AttachVolume.Attach failed for volume "pvc-41a31580-f5b9-4f08-b0ea-0adcba15b6db" : disk(/subscriptions/xxx/resourceGroups/andy-mg1160alpha3/providers/Microsoft.Compute/disks/xxx-dynamic-pvc-41a31580-f5b9-4f08-b0ea-0adcba15b6db) already attached to node(k8s-agentpool-32686255-0), could not be attached to node(k8s-agentpool-32686255-1)
  Normal   SuccessfulAttachVolume  2m4s                   attachdetach-controller            AttachVolume.Attach succeeded for volume "pvc-41a31580-f5b9-4f08-b0ea-0adcba15b6db"
  Warning  FailedMount             75s                    kubelet, k8s-agentpool-32686255-1  Unable to attach or mount volumes: unmounted volumes=[disk01], unattached volumes=[disk01 default-token-xw626]: timed out waiting for the condition
  Normal   Pulling                 59s                    kubelet, k8s-agentpool-32686255-1  Pulling image "nginx"
  Normal   Pulled                  59s                    kubelet, k8s-agentpool-32686255-1  Successfully pulled image "nginx"
  Normal   Created                 59s                    kubelet, k8s-agentpool-32686255-1  Created container nginx-azuredisk
  Normal   Started                 59s                    kubelet, k8s-agentpool-32686255-1  Started container nginx-azuredisk

Which issue(s) this PR fixes:

Fixes #81079

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

fix: detach azure disk issue using dangling error

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


/kind bug
/assign @gnufied
/priority important-soon
/sig cloud-provider
/area provider/azure

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/provider/azure Issues or PRs related to azure provider labels Aug 11, 2019
@k8s-ci-robot k8s-ci-robot added area/cloudprovider sig/storage Categorizes an issue or PR as relevant to SIG Storage. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Aug 11, 2019
@andyzhangx
Copy link
Member Author

hi @gnufied I followed your guidance with #55491, this PR seems not working (detach volume never happen) even it could enter into volerr.NewDanglingError(attachErr ..., is there sth. else I have missed?

@gnufied
Copy link
Member

gnufied commented Aug 12, 2019

@andyzhangx just to double check, the volume will only be detached if another pod and comes around and tries to use it. If no pod is attempting to use it, the volume will remain attached to the old node. Is that what happened?

If volume is still not detaching even if another pod tries to use it - that will be kinda weird. Can we get controller-manager logs when this happened?

@andyzhangx
Copy link
Member Author

I used following way to "repro" this issue:

  1. attach the disk to node#A manually
  2. create a pod with PVC mount which uses that disk, and that pod would run on node#B

So the disk would be detached from node#A, right? @gnufied

@gnufied
Copy link
Member

gnufied commented Aug 13, 2019

@andyzhangx I think so yeah. Can you post controller-manager logs with v4 verbosity?

@andyzhangx
Copy link
Member Author

andyzhangx commented Aug 13, 2019

@gnufied here is the useful logging, detach volume is skipped:

I0813 05:33:22.709564       1 reconciler.go:200] Cannot detach volume because it is still mounted for volume "pvc-41a31580-f5b9-4f08-b0ea-0adcba15b6db" (UniqueName: "kubernetes.io/azure-disk//subscriptions/b9d2281e-dcd5-4dfd-9a97-0d50377cdf76/resourceGroups/andy-mg1160alpha3/providers/Microsoft.Compute/disks/andy-1160alpha3-dynamic-pvc-41a31580-f5b9-4f08-b0ea-0adcba15b6db") on node "k8s-agentpool-32686255-0"

It's due to

if attachedVolume.MountedByNode && !timeout {
klog.V(5).Infof(attachedVolume.GenerateMsgDetailed("Cannot detach volume because it is still mounted", ""))
continue

At this time, the volume is in use by node#B.

And if I hack the above code by comment out the continue statement, it would go into

I0813 06:25:46.662343       1 reconciler.go:208] RemoveVolumeFromReportAsAttached failed while removing volume "kubernetes.io/azure-disk//subscriptions/b9d2281e-dcd5-4dfd-9a97-0d50377cdf76/resourceGroups/andy-mg1160alpha3/providers/Microsoft.Compute/disks/andy-1160alpha3-dynamic-pvc-41a31580-f5b9-4f08-b0ea-0adcba15b6db" from node "k8s-agentpool-32686255-0" with: volume "kubernetes.io/azure-disk//subscriptions/b9d2281e-dcd5-4dfd-9a97-0d50377cdf76/resourceGroups/andy-mg1160alpha3/providers/Microsoft.Compute/disks/andy-1160alpha3-dynamic-pvc-41a31580-f5b9-4f08-b0ea-0adcba15b6db" does not exist in volumesToReportAsAttached list or node "k8s-agentpool-32686255-0" does not exist in nodesToUpdateStatusFor list

And you could find all controller manager logs here: https://gist.githubusercontent.com/andyzhangx/6d6460ad8802f78016b0b4cbce292399/raw/3ea7142feadaf8f7bb9da7c7e75016ce7559771d/controller-manager.log

@gnufied
Copy link
Member

gnufied commented Aug 13, 2019

okay.. so:

I0813 05:33:22.709564       1 reconciler.go:200] Cannot detach volume because it is still mounted for volume "pvc-41a31580-f5b9-4f08-b0ea-0adcba15b6db" (UniqueName: "kubernetes.io/azure-disk//subscriptions/b9d2281e-dcd5-4dfd-9a97-0d50377cdf76/resourceGroups/andy-mg1160alpha3/providers/Microsoft.Compute/disks/andy-1160alpha3-dynamic-pvc-41a31580-f5b9-4f08-b0ea-0adcba15b6db") on node "k8s-agentpool-32686255-0"

This condition will automatically go away after maxWaitForUnmountDuration has expired, which is fine because since we are going to detach a volume which is in "uncertain" state.

I0813 06:25:46.662343       1 reconciler.go:208] RemoveVolumeFromReportAsAttached failed while removing volume "kubernetes.io/azure-disk//subscriptions/b9d2281e-dcd5-4dfd-9a97-0d50377cdf76/resourceGroups/andy-mg1160alpha3/providers/Microsoft.Compute/disks/andy-1160alpha3-dynamic-pvc-41a31580-f5b9-4f08-b0ea-0adcba15b6db" from node "k8s-agentpool-32686255-0" with: volume "kubernetes.io/azure-disk//subscriptions/b9d2281e-dcd5-4dfd-9a97-0d50377cdf76/resourceGroups/andy-mg1160alpha3/providers/Microsoft.Compute/disks/andy-1160alpha3-dynamic-pvc-41a31580-f5b9-4f08-b0ea-0adcba15b6db" does not exist in volumesToReportAsAttached list or node "k8s-agentpool-32686255-0" does not exist in nodesToUpdateStatusFor list

I think this is a bug that you have discovered. When I made #78595 change - to add dangling volumes as uncertain, we stopped reporting the volume to node's status and hence it is failing. But this looks like a general problem with detaching volumes which are in uncertain state, I will open a PR to fix this. This needs better coverage to prevent regression too.

@andyzhangx andyzhangx force-pushed the fix-detach-azuredisk-issue branch from 002c072 to 10a7eae Compare August 14, 2019 01:47
@gnufied
Copy link
Member

gnufied commented Aug 15, 2019

@andyzhangx actually part of my comment was not true, following error log:

I0813 06:25:46.662343       1 reconciler.go:208] RemoveVolumeFromReportAsAttached failed while removing volume oes not exist in nodesToUpdateStatusFor list

Does not stop volume from being detached - https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/attachdetach/reconciler/reconciler.go#L208 the volume should still be detached..

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 16, 2019
@andyzhangx andyzhangx force-pushed the fix-detach-azuredisk-issue branch from 224133b to 1fe12be Compare August 18, 2019 03:43
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 18, 2019
@andyzhangx andyzhangx changed the title [WIP]fix: detach azure disk issue using dangling error fix: detach azure disk issue using dangling error Aug 18, 2019
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 18, 2019
@andyzhangx
Copy link
Member Author

andyzhangx commented Aug 18, 2019

@andyzhangx actually part of my comment was not true, following error log:

I0813 06:25:46.662343       1 reconciler.go:208] RemoveVolumeFromReportAsAttached failed while removing volume oes not exist in nodesToUpdateStatusFor list

Does not stop volume from being detached - https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/attachdetach/reconciler/reconciler.go#L208 the volume should still be detached..

@gnufied thanks a lot for your help! Finally I figured out that this is another issue in detaching disk, actually detach disk operation happens, which it could not find that disk in the list, this PR fixed these two issues.

So this PR is ready for code review.

cc @MSSedusch @dkistner @vlerenc
/assign @khenidak @feiskyer

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 19, 2019
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx, feiskyer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@andyzhangx
Copy link
Member Author

/hold
hold for one day more for anyone else to review, comments, thanks

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 19, 2019
@andyzhangx
Copy link
Member Author

/hold cancel
ready to merge, thanks

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 20, 2019
@k8s-ci-robot k8s-ci-robot merged commit 1719ce7 into kubernetes:master Aug 20, 2019
@k8s-ci-robot k8s-ci-robot added this to the v1.16 milestone Aug 20, 2019
@MSSedusch
Copy link

@andyzhangx Can you cherry pick that fix to 1.15?

@andyzhangx
Copy link
Member Author

@andyzhangx Can you cherry pick that fix to 1.15?

it's been cherry picked to 1.13, 1.14, 1.15

k8s-ci-robot added a commit that referenced this pull request Aug 21, 2019
…1266-upstream-release-1.13

Automated cherry pick of #81266: fix: detach azure disk issue using dangling error
k8s-ci-robot added a commit that referenced this pull request Aug 21, 2019
…1266-upstream-release-1.15

Automated cherry pick of #81266: fix: detach azure disk issue using dangling error
k8s-ci-robot added a commit that referenced this pull request Aug 21, 2019
…1266-upstream-release-1.14

Automated cherry pick of #81266: fix: detach azure disk issue using dangling error
@andyzhangx
Copy link
Member Author

below are the fixed versions:

k8s version fixed version
v1.12 no fix
v1.13 1.13.11
v1.14 1.14.7
v1.15 1.15.4
v1.15 1.16.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cloudprovider area/provider/azure Issues or PRs related to azure provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Disks are not detached on Azure
7 participants