UPSTREAM: 40417: Always detach volumes in operator executor #13251
Conversation
Instead of marking a volume as detached immediately in Kubelet's reconciler, delegate the marking asynchronously to the operation executor. This is necessary to prevent race conditions with other operations mutating the same volume state. An example of one such problem:

1. A pod is created and its volume is added to the desired state of the world.
2. The reconciler process starts.
3. The reconciler starts MountVolume, which is kicked off asynchronously via operation_executor.go.
4. MountVolume mounts the volume but hasn't yet marked it as mounted.
5. The pod is deleted and the volume is removed from the desired state of the world.
6. The reconciler reaches the detach-volume section, detects that the volume is no longer in the desired state of the world, and removes it from volumes in use.
7. MountVolume tries to mark the mount and throws an error because the volume is no longer in the actual-state-of-world list.

After this, kubelet isn't aware of the mount, so it doesn't try to unmount again. The controller-manager tries to detach the volume, which fails because the volume is still mounted to the OS. The EBS volume gets stuck indefinitely in a busy state trying to detach.
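A minimal sketch of the idea, not the actual kubelet code: the type and method names below (OperationExecutor, ActualStateOfWorld, DetachVolume, MarkVolumeAsDetached) are simplified stand-ins with assumed signatures. The point it illustrates is that the actual state of the world is mutated only after the asynchronous operation completes, and only one operation per volume can be in flight, so a pending MountVolume and a later detach cannot interleave on the same volume state.

```go
// Illustrative sketch only; the real kubelet volume-manager types and
// signatures differ.
package main

import (
	"fmt"
	"sync"
)

// ActualStateOfWorld tracks which volumes kubelet believes are attached/mounted.
type ActualStateOfWorld struct {
	mu       sync.Mutex
	attached map[string]bool
}

func (asw *ActualStateOfWorld) MarkVolumeAsDetached(vol string) {
	asw.mu.Lock()
	defer asw.mu.Unlock()
	delete(asw.attached, vol)
}

// OperationExecutor allows at most one in-flight operation per volume, so a
// pending mount and a later detach cannot race on the same state.
type OperationExecutor struct {
	pending sync.Map // volume name -> struct{}
}

// DetachVolume runs asynchronously and marks the volume detached only after
// the detach itself completes, instead of the reconciler marking it detached
// immediately while a mount may still be in progress.
func (oe *OperationExecutor) DetachVolume(vol string, asw *ActualStateOfWorld) error {
	if _, busy := oe.pending.LoadOrStore(vol, struct{}{}); busy {
		return fmt.Errorf("operation already pending for volume %q", vol)
	}
	go func() {
		defer oe.pending.Delete(vol)
		// ... perform the real unmount/detach against the OS here ...
		asw.MarkVolumeAsDetached(vol) // state is mutated only on completion
	}()
	return nil
}

func main() {
	asw := &ActualStateOfWorld{attached: map[string]bool{"vol-1": true}}
	oe := &OperationExecutor{}
	if err := oe.DetachVolume("vol-1", asw); err != nil {
		fmt.Println("detach rejected:", err)
	}
}
```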
[test]
LGTM for the code, however... shouldn't there be a link to the BZ?
@childsb or @eparis please sign off on this from a risk perspective. @gnufied I see where this was targeted for 1.5 upstream picks but didn't see the actual 1.5 pick. Since the PR was from an external contributor, we may want to follow up on that.
@pweil- okay, I will follow up on the original PR for the cherry pick. I think for now let's hold off on merging this. I will drop the milestone target.
I am going to push for this to get merged. The upstream PR for cherry picking to 1.5 was opened: kubernetes/kubernetes#41555
[test] but like Jan said, where's the BZ?
Evaluated for origin test up to e34c005
continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/239/) (Base Commit: 4bdf7ce)
[merge] test flaked on #9309
Evaluated for origin merge up to e34c005
continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_origin/101/) (Base Commit: 0eb20f4) (Image: devenv-rhel7_6072)
The OpenShift bug: https://bugzilla.redhat.com/show_bug.cgi?id=1429318. The upstream bug: kubernetes/kubernetes#32881. Having said that, the bug reported in Bugzilla doesn't have reproducible steps. This is a bit of a problem because we don't know why BZ 1429318 happened. We do know we have an existing race condition in the volume detach/unmount code path, so we are fixing that here. But we don't know for sure whether that will fix BZ 1429318. I will keep the BZ open for now, until we observe whether it is really fixed.
I need a BZ, in MODIFIED, which explains what we changed and why, so QA can try to pound on the changes you made at the last minute.
@eparis updated the issue description with the appropriate BZ.
This fixes an important race condition that occurs when a pod gets deleted while a volume mount is in progress.
BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1432949