
CSI Volume staging path contains PV name #105899

Closed
jsafrane opened this issue Oct 26, 2021 · 9 comments · Fixed by #107065
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/backlog Higher priority than priority/awaiting-more-evidence. sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments

@jsafrane
Member

jsafrane commented Oct 26, 2021

What happened?

When two CSI PVs point to the same volume in the storage backend, kubelet can't start pods that use these two PVs on the same node.

Both PVs have the same VolumeHandle and kubelet is smart enough to recognize that they're the same volume (because they have the same unique volume ID). It calls NodeStage only once; however, the staging path contains the name of one of the PVs, say PV1: /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-f5fa91c3-bf91-42a9-a8a3-da516e0371f0/globalmount/

// Builds <plugin dir>/pv/<PV name>/globalmount, i.e. the staging path is keyed on the PV name:
return filepath.Join(plugin.host.GetPluginDir(plugin.GetPluginName()), persistentVolumeInGlobalPath, pvName, globalMountInGlobalPath), nil

One of the CSI volume SetUpAt calls then fails because it uses PV2 to compute the staging path, and that path does not exist.

MountVolume.SetUp failed for volume "pv2" : rpc error: code = InvalidArgument desc = volume "1fdb001d-3634-11ec-aa06-0242ac110003" was staged at [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-f5fa91c3-bf91-42a9-a8a3-da516e0371f0/globalmount], not "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pv2/globalmount"
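
For illustration, a minimal standalone sketch (not kubelet code; the helper name is made up) of why the two PVs end up with different staging paths for the same volume handle:

    package main

    import (
        "fmt"
        "path/filepath"
    )

    // stagingPath mimics how kubelet derives the global mount ("staging") path:
    // <CSI plugin dir>/pv/<PV name>/globalmount.
    func stagingPath(pluginDir, pvName string) string {
        return filepath.Join(pluginDir, "pv", pvName, "globalmount")
    }

    func main() {
        pluginDir := "/var/lib/kubelet/plugins/kubernetes.io/csi"
        // Both PVs reference volumeHandle 1fdb001d-3634-11ec-aa06-0242ac110003,
        // yet they resolve to different staging paths, so the driver that staged
        // the volume at the first path rejects NodePublish for the second PV.
        fmt.Println(stagingPath(pluginDir, "pvc-f5fa91c3-bf91-42a9-a8a3-da516e0371f0"))
        fmt.Println(stagingPath(pluginDir, "pv2"))
    }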

What did you expect to happen?

Both pods should just start.

How can we reproduce it (as minimally and precisely as possible)?

Using a CSI HostPath volume, PV1 (dynamically provisioned for simplicity):

  apiVersion: v1
  kind: PersistentVolume
  metadata:
    annotations:
      pv.kubernetes.io/provisioned-by: hostpath.csi.k8s.io
    name: pvc-f5fa91c3-bf91-42a9-a8a3-da516e0371f0
  spec:
    accessModes:
    - ReadWriteOnce
    capacity:
      storage: 500Mi
    claimRef:
      apiVersion: v1
      kind: PersistentVolumeClaim
      name: myclaim
      namespace: default
      resourceVersion: "524"
      uid: f5fa91c3-bf91-42a9-a8a3-da516e0371f0
    csi:
      driver: hostpath.csi.k8s.io
      volumeAttributes:
        storage.kubernetes.io/csiProvisionerIdentity: 1635235714073-8081-hostpath.csi.k8s.io
      volumeHandle: 1fdb001d-3634-11ec-aa06-0242ac110003
    nodeAffinity:
      required:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.hostpath.csi/node
            operator: In
            values:
            - 127.0.0.1
    persistentVolumeReclaimPolicy: Delete
    storageClassName: csi-hostpath-sc
    volumeMode: Filesystem
  status:
    phase: Bound

PV2 (basically edited PV1):

  apiVersion: v1
  kind: PersistentVolume
  metadata:
    name: pv2
  spec:
    accessModes:
    - ReadWriteOnce
    capacity:
      storage: 500Mi
    claimRef:
      apiVersion: v1
      kind: PersistentVolumeClaim
      name: myclaim2
      namespace: default
      resourceVersion: "664"
      uid: 89932338-c301-425b-9cef-dec861f4bdaf
    csi:
      driver: hostpath.csi.k8s.io
      volumeAttributes:
        storage.kubernetes.io/csiProvisionerIdentity: 1635235714073-8081-hostpath.csi.k8s.io
      volumeHandle: 1fdb001d-3634-11ec-aa06-0242ac110003
    persistentVolumeReclaimPolicy: Retain
    storageClassName: csi-hostpath-sc
    volumeMode: Filesystem
  status:
    phase: Bound

PVC1, PVC2, Pod1 and Pod2 are trivial, nothing special there.

Kubernetes version

Today's master, dba9975
@jsafrane jsafrane added the kind/bug Categorizes issue or PR as related to a bug. label Oct 26, 2021
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Oct 26, 2021
@k8s-ci-robot
Contributor

@jsafrane: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Oct 26, 2021
@jsafrane
Member Author

/sig storage
/priority backlog

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. priority/backlog Higher priority than priority/awaiting-more-evidence. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 26, 2021
@jsafrane
Member Author

I suggest that if two volumes have the same unique volume ID (computed from the CSI VolumeHandle), they must also have the same global mount path. Kubelet already has code to call NodeStage once and bind-mount the result to different pods, regardless of how many different PVs they use.

There are several issues with changing the global mount path:

  1. UnmountDevice parses the PV name from the global mount path to get the PV and gather enough info to call NodeUnstage:

    func getDriverAndVolNameFromDeviceMountPath(k8s kubernetes.Interface, deviceMountPath string) (string, string, error) {
        // deviceMountPath structure: /var/lib/kubelet/plugins/kubernetes.io/csi/pv/{pvname}/globalmount
        dir := filepath.Dir(deviceMountPath)
        if file := filepath.Base(deviceMountPath); file != globalMountInGlobalPath {
            return "", "", errors.New(log("getDriverAndVolNameFromDeviceMountPath failed, path did not end in %s", globalMountInGlobalPath))
        }
        // dir is now /var/lib/kubelet/plugins/kubernetes.io/csi/pv/{pvname}
        pvName := filepath.Base(dir)
        // Get PV and check for errors
        pv, err := k8s.CoreV1().PersistentVolumes().Get(context.TODO(), pvName, meta.GetOptions{})
        if err != nil {

    However, this parsing is used only when the volume plugin cannot find or parse the json file it saved during MountDevice:

    func (c *csiAttacher) UnmountDevice(deviceMountPath string) error {
        klog.V(4).Info(log("attacher.UnmountDevice(%s)", deviceMountPath))
        // Setup
        var driverName, volID string
        dataDir := filepath.Dir(deviceMountPath)
        data, err := loadVolumeData(dataDir, volDataFileName)
        if err == nil {
            driverName = data[volDataKey.driverName]
            volID = data[volDataKey.volHandle]
        } else {
            klog.Error(log("UnmountDevice failed to load volume data file [%s]: %v", dataDir, err))
            // The volume might have been mounted by old CSI volume plugin. Fall back to the old behavior: read PV from API server
            driverName, volID, err = getDriverAndVolNameFromDeviceMountPath(c.k8s, deviceMountPath)
            if err != nil {

    Can we safely assume that the json is always there?

  2. Changing the global path in a running cluster is tricky.

    • We can require that nodes are drained when a kubelet with the new global mount path is installed. This way, no CSI volume is mounted at a global path when the new kubelet starts. Minor cluster upgrades (1.23 -> 1.24) already require draining nodes, as documented for kubeadm.
    • Or we could make kubelet use the old global path when it already exists. This looks fragile to me.
  3. The staging dir needs to be unique enough not to conflict with other CSI volumes. It should be based on the driver name + volume handle, but that could be quite long. Can we compute a hash and hope it won't conflict?
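
To make point 3 concrete, here is a rough standalone sketch of what a hash-based staging path could look like; the helper name and directory layout are only assumptions for discussion, not an agreed design:

    package main

    import (
        "crypto/sha256"
        "fmt"
        "path/filepath"
    )

    // hashedStagingPath keys the global mount directory on the driver name plus
    // sha256(volume handle) instead of on the PV name, so two PVs that share a
    // volume handle map to one staging path and the directory name stays short
    // even for very long volume handles.
    func hashedStagingPath(pluginDir, driverName, volumeHandle string) string {
        sum := sha256.Sum256([]byte(volumeHandle))
        return filepath.Join(pluginDir, driverName, fmt.Sprintf("%x", sum), "globalmount")
    }

    func main() {
        fmt.Println(hashedStagingPath(
            "/var/lib/kubelet/plugins/kubernetes.io/csi",
            "hostpath.csi.k8s.io",
            "1fdb001d-3634-11ec-aa06-0242ac110003"))
    }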

@jingxu97
Contributor

I think the in-tree driver can handle this case correctly because the global mount path uses the unique volume ID instead of the PV name. With CSI migration, does this break backward compatibility?

@jsafrane
Member Author

jsafrane commented Nov 2, 2021

We require users to drain nodes before enabling migration or when updating to a Kubernetes version that has migration enabled by default. All volumes should therefore be unmounted.

@mattcary
Contributor

mattcary commented Dec 9, 2021

@jingxu97 Probably yes, this breaks backward compatibility, but since only migration for PD is being done, maybe it's rarer? We saw a customer case for this situation with NFS shares using the CSI Filestore driver.

For PD, I think it would only come up when trying to mount a PD volume ReadOnlyMany (ROX) across namespaces; I'm not sure if that's common.

@saikat-royc
Member

saikat-royc commented Dec 10, 2021

Capturing some thoughts inline here to understand your suggestions, @jsafrane:

  1. If we use a staging path like /var/lib/kubelet/plugins/kubernetes.io/csi/<driver name>/{some-transformed<volumeId>}, the kubelet can locate both the driver name and the volumeId needed during the UnmountDevice call.
    E.g. for the PD CSI driver: /var/lib/kubelet/plugins/kubernetes.io/csi/pd.csi.storage.gke.io/0a39950a8a2ec8f88fcb1be558d66b8ba7648dd4981af6a166dce39e97de6432/globalmount, where the directory name is sha256(projects/mattcary-saikatroyc-test/zones/us-central1-c/disks/pvc-4284ef52-f97d-43cd-a5f1-e0c4d43856b9). This means the volume ID needs to be persisted (we already do that in vol_data.json today) so that it can be retrieved for the UnmountDevice call.

2.1. If we decide to make the change at a minor version boundary, then a node drain mandate would unstage any volumes before the kubelet is upgraded. This ensures any staging path for a CSI volume uses the new format.

2.2. If we do not want to enforce a node drain, kubelet can first look up the json data (the reasoning is that if this file does not exist, we will not be able to retrieve the volume handle from the new path anyway), and otherwise fall back to the old staging path format, /var/lib/kubelet/plugins/kubernetes.io/csi/pv/{pvname}/globalmount (see the sketch after this list). TBD: need to check the mount reference logic.

  3. csidriver + sha(volumeId) is expected to be unique in the cluster, if I understand correctly (based on the assumption that any storage resource provisioned by a given driver is uniquely identifiable).
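
To illustrate option 2.2, a minimal standalone sketch; the helper name, the vol_data.json probe, and the new path format are assumptions for discussion only:

    package main

    import (
        "fmt"
        "os"
        "path/filepath"
    )

    // resolveStagingPath prefers the new-format path when kubelet's persisted
    // volume data (vol_data.json) exists next to its globalmount directory, and
    // otherwise falls back to the legacy PV-name-based path, so volumes staged
    // by an older kubelet can still be unstaged.
    func resolveStagingPath(newPath, legacyPath string) string {
        volDataFile := filepath.Join(filepath.Dir(newPath), "vol_data.json")
        if _, err := os.Stat(volDataFile); err == nil {
            return newPath
        }
        return legacyPath
    }

    func main() {
        newPath := "/var/lib/kubelet/plugins/kubernetes.io/csi/hostpath.csi.k8s.io/<sha256-of-volume-handle>/globalmount"
        legacyPath := "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pv2/globalmount"
        fmt.Println(resolveStagingPath(newPath, legacyPath))
    }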

@saikat-royc
Member

/assign

@jsafrane
Member Author

@saikat-royc yes, your summary is correct. We persist a json file; please make sure it contains everything that NodeUnstage may need.
It seems to me that people prefer option 2.2.

csidriver + sha(volumeId) is expected to be unique in the cluster,

We already use sha256(node name, driver name, volume ID) to uniquely identify VolumeAttachments. So far nobody has complained about any conflicts.
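
For reference, the VolumeAttachment naming mentioned above hashes those three inputs roughly like the following standalone sketch (the exact helper and formatting in kubelet may differ):

    package main

    import (
        "crypto/sha256"
        "fmt"
    )

    // attachmentName illustrates the existing idea: hash volume handle, driver
    // name, and node name together so the VolumeAttachment object name is short,
    // deterministic, and practically collision-free.
    func attachmentName(volHandle, driverName, nodeName string) string {
        sum := sha256.Sum256([]byte(volHandle + driverName + nodeName))
        return fmt.Sprintf("csi-%x", sum)
    }

    func main() {
        fmt.Println(attachmentName(
            "1fdb001d-3634-11ec-aa06-0242ac110003",
            "hostpath.csi.k8s.io",
            "127.0.0.1"))
    }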
