
CSI Volume staging path contains PV name #105899

Closed
jsafrane opened this issue Oct 26, 2021 · 9 comments · Fixed by #107065
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/backlog Higher priority than priority/awaiting-more-evidence. sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments

@jsafrane
Member

jsafrane commented Oct 26, 2021

What happened?

When two CSI PVs point to the same volume in the storage backend, kubelet can't start pods that use these two PVs on the same node.

Both PVs have the same VolumeHandle and kubelet is smart enough to recognize that they're the same volume (because they have the same unique volume ID). It calls NodeStage only once; however, the staging path contains the name of one of the PVs, say PV1: /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-f5fa91c3-bf91-42a9-a8a3-da516e0371f0/globalmount/

// Builds <plugin dir>/pv/<PV name>/globalmount, i.e. the staging path is keyed on the PV name:
return filepath.Join(plugin.host.GetPluginDir(plugin.GetPluginName()), persistentVolumeInGlobalPath, pvName, globalMountInGlobalPath), nil

One of the CSI volume SetUpAt calls then fails because it uses PV2 to compute the staging path, and that path does not exist.

MountVolume.SetUp failed for volume "pv2" : rpc error: code = InvalidArgument desc = volume "1fdb001d-3634-11ec-aa06-0242ac110003" was staged at [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-f5fa91c3-bf91-42a9-a8a3-da516e0371f0/globalmount], not "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pv2/globalmount"
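
For illustration, a minimal standalone sketch (not kubelet code; the helper name is made up) of why the two PVs end up with different staging paths for the same volume handle:

    package main

    import (
        "fmt"
        "path/filepath"
    )

    // stagingPath mimics how kubelet derives the global mount ("staging") path:
    // <CSI plugin dir>/pv/<PV name>/globalmount.
    func stagingPath(pluginDir, pvName string) string {
        return filepath.Join(pluginDir, "pv", pvName, "globalmount")
    }

    func main() {
        pluginDir := "/var/lib/kubelet/plugins/kubernetes.io/csi"
        // Both PVs reference volumeHandle 1fdb001d-3634-11ec-aa06-0242ac110003,
        // yet they resolve to different staging paths, so the driver that staged
        // the volume at the first path rejects NodePublish for the second PV.
        fmt.Println(stagingPath(pluginDir, "pvc-f5fa91c3-bf91-42a9-a8a3-da516e0371f0"))
        fmt.Println(stagingPath(pluginDir, "pv2"))
    }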

What did you expect to happen?

Both pods should just start.

How can we reproduce it (as minimally and precisely as possible)?

Using a CSI HostPath volume, PV1 (dynamically provisioned for simplicity):

  apiVersion: v1
  kind: PersistentVolume
  metadata:
    annotations:
      pv.kubernetes.io/provisioned-by: hostpath.csi.k8s.io
    name: pvc-f5fa91c3-bf91-42a9-a8a3-da516e0371f0
  spec:
    accessModes:
    - ReadWriteOnce
    capacity:
      storage: 500Mi
    claimRef:
      apiVersion: v1
      kind: PersistentVolumeClaim
      name: myclaim
      namespace: default
      resourceVersion: "524"
      uid: f5fa91c3-bf91-42a9-a8a3-da516e0371f0
    csi:
      driver: hostpath.csi.k8s.io
      volumeAttributes:
        storage.kubernetes.io/csiProvisionerIdentity: 1635235714073-8081-hostpath.csi.k8s.io
      volumeHandle: 1fdb001d-3634-11ec-aa06-0242ac110003
    nodeAffinity:
      required:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.hostpath.csi/node
            operator: In
            values:
            - 127.0.0.1
    persistentVolumeReclaimPolicy: Delete
    storageClassName: csi-hostpath-sc
    volumeMode: Filesystem
  status:
    phase: Bound

PV2 (basically edited PV1):

  apiVersion: v1
  kind: PersistentVolume
  metadata:
    name: pv2
  spec:
    accessModes:
    - ReadWriteOnce
    capacity:
      storage: 500Mi
    claimRef:
      apiVersion: v1
      kind: PersistentVolumeClaim
      name: myclaim2
      namespace: default
      resourceVersion: "664"
      uid: 89932338-c301-425b-9cef-dec861f4bdaf
    csi:
      driver: hostpath.csi.k8s.io
      volumeAttributes:
        storage.kubernetes.io/csiProvisionerIdentity: 1635235714073-8081-hostpath.csi.k8s.io
      volumeHandle: 1fdb001d-3634-11ec-aa06-0242ac110003
    persistentVolumeReclaimPolicy: Retain
    storageClassName: csi-hostpath-sc
    volumeMode: Filesystem
  status:
    phase: Bound

PVC1, PVC2, Pod1 and Pod2 are trivial, nothing special there.

Kubernetes version

Today's master, dba9975
@jsafrane jsafrane added the kind/bug Categorizes issue or PR as related to a bug. label Oct 26, 2021
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Oct 26, 2021
@k8s-ci-robot
Contributor

@jsafrane: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Oct 26, 2021
@jsafrane
Member Author

/sig storage
/priority backlog

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. priority/backlog Higher priority than priority/awaiting-more-evidence. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 26, 2021
@jsafrane
Member Author

I suggest that if two volumes have the same unique volume ID (computed from the CSI VolumeHandle), they must also have the same global mount path. Kubelet already has code to call NodeStage once and bind-mount the result to different pods, regardless of how many different PVs they use.

There are several issues with changing the global mount path:

  1. UnmountDevice parses the PV name from the global mount path to get the PV and gather enough info to call NodeUnstage:

    func getDriverAndVolNameFromDeviceMountPath(k8s kubernetes.Interface, deviceMountPath string) (string, string, error) {
        // deviceMountPath structure: /var/lib/kubelet/plugins/kubernetes.io/csi/pv/{pvname}/globalmount
        dir := filepath.Dir(deviceMountPath)
        if file := filepath.Base(deviceMountPath); file != globalMountInGlobalPath {
            return "", "", errors.New(log("getDriverAndVolNameFromDeviceMountPath failed, path did not end in %s", globalMountInGlobalPath))
        }
        // dir is now /var/lib/kubelet/plugins/kubernetes.io/csi/pv/{pvname}
        pvName := filepath.Base(dir)
        // Get PV and check for errors
        pv, err := k8s.CoreV1().PersistentVolumes().Get(context.TODO(), pvName, meta.GetOptions{})
        if err != nil {

    However, this parsing is used only when the volume plugin cannot find or parse the json file it saved during MountDevice:

    func (c *csiAttacher) UnmountDevice(deviceMountPath string) error {
        klog.V(4).Info(log("attacher.UnmountDevice(%s)", deviceMountPath))
        // Setup
        var driverName, volID string
        dataDir := filepath.Dir(deviceMountPath)
        data, err := loadVolumeData(dataDir, volDataFileName)
        if err == nil {
            driverName = data[volDataKey.driverName]
            volID = data[volDataKey.volHandle]
        } else {
            klog.Error(log("UnmountDevice failed to load volume data file [%s]: %v", dataDir, err))
            // The volume might have been mounted by old CSI volume plugin. Fall back to the old behavior: read PV from API server
            driverName, volID, err = getDriverAndVolNameFromDeviceMountPath(c.k8s, deviceMountPath)
            if err != nil {

    Can we safely assume that the json is always there?

  2. Changing the global path in a running cluster is tricky.

    • We can require that nodes are drained when a kubelet with the new global mount path is installed. This way, no CSI volume is mounted at a global path when the new kubelet starts. Minor cluster upgrades (1.23 -> 1.24) already require draining nodes, as documented for kubeadm.
    • Or we could make kubelet use the old global path when it already exists. This looks fragile to me.
  3. The staging dir needs to be unique enough not to conflict with other CSI volumes. It should be based on the driver name + volume handle, but that could be quite long. Can we compute a hash and hope it won't conflict?
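
To make point 3 concrete, here is a rough standalone sketch of what a hash-based staging path could look like; the helper name and directory layout are only assumptions for discussion, not an agreed design:

    package main

    import (
        "crypto/sha256"
        "fmt"
        "path/filepath"
    )

    // hashedStagingPath keys the global mount directory on the driver name plus
    // sha256(volume handle) instead of on the PV name, so two PVs that share a
    // volume handle map to one staging path and the directory name stays short
    // even for very long volume handles.
    func hashedStagingPath(pluginDir, driverName, volumeHandle string) string {
        sum := sha256.Sum256([]byte(volumeHandle))
        return filepath.Join(pluginDir, driverName, fmt.Sprintf("%x", sum), "globalmount")
    }

    func main() {
        fmt.Println(hashedStagingPath(
            "/var/lib/kubelet/plugins/kubernetes.io/csi",
            "hostpath.csi.k8s.io",
            "1fdb001d-3634-11ec-aa06-0242ac110003"))
    }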

@jingxu97
Contributor

I think the in-tree driver can handle this case correctly because the global mount path uses the unique volume ID instead of the PV name. With CSI migration, does this break backward compatibility?

@jsafrane
Member Author

jsafrane commented Nov 2, 2021

We require users to drain nodes before enabling migration or when updating to a Kubernetes version that has migration enabled by default. All volumes should therefore be unmounted.

@mattcary
Contributor

mattcary commented Dec 9, 2021

@jingxu97 Probably yes, this breaks backward compatibility, but since only migration for PD is being done, maybe it's rarer? We saw a customer case for this situation with NFS shares using the CSI Filestore driver.

For PD, I think it would only come up when trying to mount a PD volume ReadOnlyMany (ROX) across namespaces; I'm not sure if that's common.

@saikat-royc
Member

saikat-royc commented Dec 10, 2021

Capturing some thoughts inline here to understand your suggestions, @jsafrane:

  1. If we use a staging path like /var/lib/kubelet/plugins/kubernetes.io/csi/<driver name>/{some-transformed<volumeId>}, the kubelet can locate both the driver name and the volumeId needed during the UnmountDevice call.
    E.g. for the PD CSI driver: /var/lib/kubelet/plugins/kubernetes.io/csi/pd.csi.storage.gke.io/0a39950a8a2ec8f88fcb1be558d66b8ba7648dd4981af6a166dce39e97de6432/globalmount, where the directory name is sha256(projects/mattcary-saikatroyc-test/zones/us-central1-c/disks/pvc-4284ef52-f97d-43cd-a5f1-e0c4d43856b9). This means the volume ID needs to be persisted (we already do that in vol_data.json today) so that it can be retrieved for the UnmountDevice call.

2.1. If we decide to make the change at a minor version boundary, then a node drain mandate would unstage any volumes before the kubelet is upgraded. This ensures any staging path for a CSI volume uses the new format.

2.2. If we do not want to enforce a node drain, kubelet can first look up the json data (the reasoning is that if this file does not exist, we will not be able to retrieve the volume handle from the new path anyway), and otherwise fall back to the old staging path format, /var/lib/kubelet/plugins/kubernetes.io/csi/pv/{pvname}/globalmount (see the sketch after this list). TBD: need to check the mount reference logic.

  3. csidriver + sha(volumeId) is expected to be unique in the cluster, if I understand correctly (based on the assumption that any storage resource provisioned by a given driver is uniquely identifiable).
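
To illustrate option 2.2, a minimal standalone sketch; the helper name, the vol_data.json probe, and the new path format are assumptions for discussion only:

    package main

    import (
        "fmt"
        "os"
        "path/filepath"
    )

    // resolveStagingPath prefers the new-format path when kubelet's persisted
    // volume data (vol_data.json) exists next to its globalmount directory, and
    // otherwise falls back to the legacy PV-name-based path, so volumes staged
    // by an older kubelet can still be unstaged.
    func resolveStagingPath(newPath, legacyPath string) string {
        volDataFile := filepath.Join(filepath.Dir(newPath), "vol_data.json")
        if _, err := os.Stat(volDataFile); err == nil {
            return newPath
        }
        return legacyPath
    }

    func main() {
        newPath := "/var/lib/kubelet/plugins/kubernetes.io/csi/hostpath.csi.k8s.io/<sha256-of-volume-handle>/globalmount"
        legacyPath := "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pv2/globalmount"
        fmt.Println(resolveStagingPath(newPath, legacyPath))
    }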

@saikat-royc
Member

/assign

@jsafrane
Member Author

@saikat-royc yes, your summary is correct. We persist a json file; please make sure it contains everything that NodeUnstage may need.
It seems to me that people prefer option 2.2.

csidriver + sha(volumeId) is expected to be unique in the cluster,

We already use sha256(node name, driver name, volume ID) to uniquely identify VolumeAttachments. So far nobody has complained about any conflicts.
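
For reference, the VolumeAttachment naming mentioned above hashes those three inputs roughly like the following standalone sketch (the exact helper and formatting in kubelet may differ):

    package main

    import (
        "crypto/sha256"
        "fmt"
    )

    // attachmentName illustrates the existing idea: hash volume handle, driver
    // name, and node name together so the VolumeAttachment object name is short,
    // deterministic, and practically collision-free.
    func attachmentName(volHandle, driverName, nodeName string) string {
        sum := sha256.Sum256([]byte(volHandle + driverName + nodeName))
        return fmt.Sprintf("csi-%x", sum)
    }

    func main() {
        fmt.Println(attachmentName(
            "1fdb001d-3634-11ec-aa06-0242ac110003",
            "hostpath.csi.k8s.io",
            "127.0.0.1"))
    }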
