
Kubernetes test "multiple PV pointing to the same storage on the same node" fails #1913

Closed
jsafrane opened this issue Aug 9, 2022 · 8 comments
Labels: kind/bug

jsafrane (Contributor) commented Aug 9, 2022

Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug

What happened:

Running Kubernetes e2e tests, this test fails about 50% of the time:

External Storage [Driver: csi.vsphere.vmware.com] [Testpattern: Dynamic PV (default fs)] provisioning should mount multiple PV pointing to the same storage on the same node

What you expected to happen:

The test passes.

How to reproduce it (as minimally and precisely as possible):

Run Kubernetes 1.24 CSI tests with vSphere CSI driver.

Anything else we need to know?:

The test is quite complicated; here are the individual steps:

  1. Create PVC1, let the driver dynamically provision PV1 for it, and run Pod1 with it.
  2. Inspect PV1 and manually create PV2 that points to the same volume, i.e. there are two PVs with the same volumeHandle (see the sketch after this list).
  3. Create PVC2 for PV2 and run Pod2 with it, on the same node as Pod1.
  4. Both Pod1 and Pod2 are now running on the same node, each with its own PVC+PV, but backed by the same volumeHandle. So far so good.
  5. Delete Pod2, PVC2, and PV2. PV2 has reclaimPolicy: Retain, so nothing should be deleted in the storage backend. Again, so far so good.
  6. Delete Pod1.
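
A sketch of what the manually created PV2 from step 2 roughly looks like (the PV name, capacity, and fsType are illustrative; only the driver and the volumeHandle copied from PV1 matter for the test):

    # Hypothetical PV2: a second PV reusing PV1's volumeHandle.
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: pv2-same-volume                     # illustrative name
    spec:
      capacity:
        storage: 2Gi
      accessModes:
        - ReadWriteOnce
      persistentVolumeReclaimPolicy: Retain     # step 5 relies on this
      storageClassName: ""                      # bound manually to PVC2
      csi:
        driver: csi.vsphere.vmware.com
        volumeHandle: ca71413b-6f03-48c9-aaa3-533545cc2d26   # same handle as PV1
        fsType: ext4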

At this point, the CSI driver is not able to detach PV1 from the node because of this error:

    detachError:
      message: 'rpc error: code = Internal desc = volumeID "ca71413b-6f03-48c9-aaa3-533545cc2d26"
        not found in QueryVolume'

I was able to see that in step 5 (after PV2 is deleted), the syncer deletes the volume from CNS:

    PVDeleted: PV: ... PV2
    PVDeleted: vSphere CSI Driver is deleting volume ... PV2
    DeleteVolume: volumeID: "ca71413b-6f03-48c9-aaa3-533545cc2d26",
    DeleteVolume: Volume deleted successfully. volumeID: "ca71413b-6f03-48c9-aaa3-533545cc2d26"
    internalDeleteVolume: returns fault "" for volume "ca71413b-6f03-48c9-aaa3-533545cc2d26"

But PV1 still exists at this time and the volume is still attached to the node. The attacher is then not able to find and detach the volume.
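
For reference, the detachError above surfaces in the VolumeAttachment object for PV1. A rough sketch of what that object looks like (the object name, node name, and PV name are made up for illustration; the message is the one quoted above):

    # Hypothetical VolumeAttachment left behind for PV1 after step 6.
    apiVersion: storage.k8s.io/v1
    kind: VolumeAttachment
    metadata:
      name: csi-0123456789abcdef                # illustrative name
    spec:
      attacher: csi.vsphere.vmware.com
      nodeName: worker-node-1                   # illustrative node name
      source:
        persistentVolumeName: pv1               # the dynamically provisioned PV1
    status:
      attached: true
      detachError:
        message: 'rpc error: code = Internal desc = volumeID "ca71413b-6f03-48c9-aaa3-533545cc2d26"
          not found in QueryVolume'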

The test was added in 1.24 in this PR to test for a regression.

Environment:

  • csi-vsphere version: 2.4.0, 2.6.0
  • vsphere-cloud-controller-manager version: n/a?
  • Kubernetes version: 1.24
  • vSphere version: 7.0.3
  • Install tools: OpenShift
k8s-ci-robot added the kind/bug label on Aug 9, 2022
divyenpatel (Member) commented:

This is expected behavior.
When you delete a PV with reclaimPolicy: Retain, we de-register the volume from vCenter. The back-end VMDK and FCD are still present on vCenter, but the volume is de-registered from CNS.

For the remaining PV, once CSI full sync determines that the volume needs to be registered as a container volume again, it re-registers it, and the volume then shows up in the QueryVolume call. Until that happens, detach will fail.

The vSphere CSI driver does not support creating multiple PVs with the same volume handle.

We recommend that customers use a ReadWriteMany (RWX) volume if there is a use case for sharing the same volume across many pods.
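
A minimal sketch of what such a shared (ReadWriteMany) claim could look like with the vSphere CSI driver; this assumes a file-volume-capable StorageClass (the name "vsphere-file-sc" and the availability of vSAN File Services are assumptions, not taken from this issue):

    # Hypothetical PVC requesting a ReadWriteMany file volume.
    # Assumes a StorageClass backed by csi.vsphere.vmware.com with
    # file services available; all names are illustrative.
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: shared-data
    spec:
      accessModes:
        - ReadWriteMany
      resources:
        requests:
          storage: 5Gi
      storageClassName: vsphere-file-sc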

divyenpatel (Member) commented Aug 9, 2022

DeleteVolume: Volume deleted successfully. volumeID: "ca71413b-6f03-48c9-aaa3-533545cc2d26"

This delete is just a de-registration of the volume as a container volume. The FCD and VMDK are not deleted from the back end.

jsafrane (Contributor, Author) commented:

/close
I am making the failing test optional in kubernetes/kubernetes#113046

k8s-ci-robot (Contributor) commented:

@jsafrane: Closing this issue.

In response to this:

/close
I am making the failing test optional in kubernetes/kubernetes#113046

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

jingxu97 commented Feb 18, 2023

So should it be considered a regression from in-tree to CSI, given that the in-tree test passes? Is there a public document about this issue?

@divyenpatel
@jsafrane
@xing-yang

xing-yang (Contributor) commented:

@jingxu97 The CSI driver supports CNS volumes. For the CSI driver, the volume handle in a PV points to an FCD UUID. When the PV is deleted with the Retain policy, the volume is deregistered from CNS. Since the volume is no longer a CNS volume, detach will fail until full sync happens, as explained here.

The in-tree volume plugin does not support CNS volumes. For the in-tree plugin, the volume handle in a PV points to a VMDK path. When the PV is deleted with the Retain policy, there is no step to deregister it from CNS.

It is not a regression; it works as expected. It is just that the in-tree plugin and the CSI driver have very different architectures.
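
To illustrate the architectural difference, here is a rough comparison of the two PV volume sources (the names, path, and capacity are made up for this example; the UUID is the one from this issue):

    # In-tree vSphere volume: the PV references a VMDK path directly,
    # so deleting the PV involves no CNS (de)registration.
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: pv-intree-example                   # illustrative name
    spec:
      capacity:
        storage: 2Gi
      accessModes:
        - ReadWriteOnce
      persistentVolumeReclaimPolicy: Retain
      vsphereVolume:
        volumePath: "[datastore1] kubevols/example.vmdk"   # made-up path
        fsType: ext4
    ---
    # CSI vSphere volume: the PV references an FCD UUID via volumeHandle,
    # which must correspond to a volume registered in CNS for attach,
    # detach, and QueryVolume to work.
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: pv-csi-example                      # illustrative name
    spec:
      capacity:
        storage: 2Gi
      accessModes:
        - ReadWriteOnce
      persistentVolumeReclaimPolicy: Retain
      csi:
        driver: csi.vsphere.vmware.com
        volumeHandle: ca71413b-6f03-48c9-aaa3-533545cc2d26  # FCD UUID
        fsType: ext4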

jingxu97 commented:

I think it would be good to have a document mentioning that vSphere does not support the use case "multiple PV pointing to the same storage on the same node".

jingxu97 commented:

I created an issue for it: #2248
