Failed mount after kubernetes worker node upgrade from v1.23.15 to v1.24.9 #14

Closed
mceronja opened this issue Jan 2, 2023 · 2 comments

mceronja commented Jan 2, 2023

Hi!

After upgrading half of my worker nodes to the new Kubernetes version (v1.24.9), I noticed that some of the pods got stuck with a FailedMount warning.

Warning  FailedMount  15s (x6 over 31s)  kubelet
MountVolume.MountDevice failed for volume "pvc-5bc91a74" : rpc error: code = Internal desc = stat /var/lib/kubelet/plugins/kubernetes.io/csi/beegfs.csi.netapp.com/874cf8f302b0da66de76a4edb4ca3f7e0c5f7a6f25ad368e8ce8fda969225eb5/globalmount: no such file or directory

To get them up and running again, I forced them onto nodes with the old Kubernetes version (v1.23.15), and that works.

Versions:

  • BeeGFS: v7.3.2
  • CSI Driver: v1.3.0

For the CSI driver deployment, I am using the k8s one from the repo.
Config:

config:
  beegfsClientConf:
    connClientPortUDP: "8028"
    connDisableAuthentication: "true"
    logType: "helperd"

The only modification I had to make was in csi-beegfs-node.yaml, where I set plugins-mount-dir to /var/lib/kubelet/plugins/kubernetes.io/csi/pv instead of /var/lib/kubelet/plugins/kubernetes.io/csi.

Directory structure of /var/lib/kubelet/plugins/kubernetes.io/csi on the Kubernetes 1.23.15 worker node:

tree -L 4
.
└── pv
    ├── pvc-01ba9661
    │   ├── globalmount
    │   │   ├── beegfs-client.conf
    │   │   └── mount
    │   └── vol_data.json
    ├── pvc-03357f3e
    │   ├── globalmount
    │   │   ├── beegfs-client.conf
    │   │   └── mount
    │   └── vol_data.json
...

Directory structure of /var/lib/kubelet/plugins/kubernetes.io/csi on the Kubernetes 1.24.9 worker node:

tree -L 4
.
├── beegfs.csi.netapp.com
└── pv
    ├── pvc-090f23e1
    │   ├── globalmount
    │   │   ├── beegfs-client.conf
    │   │   └── mount
    │   └── vol_data.json
    ├── pvc-14ba4b44
    │   ├── globalmount
    │   │   ├── beegfs-client.conf
    │   │   └── mount
    │   └── vol_data.json
...

So for some reason the node with the newer Kubernetes version has an empty beegfs.csi.netapp.com directory.
Why are the pods on the "new" nodes trying to mount this other location? Is v1.3.0 of the driver incompatible with Kubernetes 1.24.9? Should I upgrade the driver to v1.4.0?

Please say if you need any more info.

Thanks in advance!

mceronja commented Jan 3, 2023

Hi,

I upgraded beegfs-csi-driver to v1.4.0 and removed /pv from plugins-mount-dir, and it works now. I guess the version upgrade did the trick.

tree -L 4
.
├── beegfs.csi.netapp.com
│   └── 443016c18d78ef863d31e5904c346489c800bf6f3014713ce8a694a0fdce7bd6
│       ├── globalmount
│       │   ├── beegfs-client.conf
│       │   └── mount
│       └── vol_data.json
└── pv
    ├── pvc-0023367f
    │   ├── globalmount
    │   │   ├── beegfs-client.conf
    │   │   └── mount
    │   └── vol_data.json
...

I will close the issue.

Thanks!

mceronja closed this as completed Jan 3, 2023

ejweber (Collaborator) commented Jan 9, 2023

Thanks for opening this issue @mceronja. Apologies for the delayed response. It appears you have resolved the issue, but I thought I'd provide a bit of color anyway.

Kubernetes changed the staging paths for persistent volumes in kubernetes/kubernetes#107065. Volumes are no longer staged under .../csi/pv but under the more general .../csi, and our old deployment manifests only give the driver purview over .../csi/pv, so those old manifests are not compatible with Kubernetes >=1.24. That being said, our updated manifests (v1.3.0 and later) use the more general .../csi and should be backwards compatible, as .../csi/pv is a subdirectory of .../csi.
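
To make the path change concrete, here is a rough Python sketch of the two staging layouts that appear in this thread, plus a check for whether a given plugins-mount-dir covers a staging path. The function names are hypothetical and not part of the driver or of Kubernetes. It also assumes that the 64-character directory name seen on the 1.24 node is a SHA-256 digest of the volume handle, which matches my understanding of the linked change but is not confirmed in this thread.

```python
import hashlib
from pathlib import PurePosixPath

# Base directory discussed in this thread.
CSI_DIR = PurePosixPath("/var/lib/kubelet/plugins/kubernetes.io/csi")


def staging_path_pre_124(pv_name: str) -> PurePosixPath:
    """Layout seen on the v1.23.15 node: .../csi/pv/<pv name>/globalmount."""
    return CSI_DIR / "pv" / pv_name / "globalmount"


def staging_path_124(driver_name: str, volume_handle: str) -> PurePosixPath:
    """Layout seen on the v1.24.9 node: .../csi/<driver>/<hex digest>/globalmount.

    Assumption: the hex directory is the SHA-256 of the volume handle.
    """
    digest = hashlib.sha256(volume_handle.encode()).hexdigest()
    return CSI_DIR / driver_name / digest / "globalmount"


def driver_can_see(plugins_mount_dir: str, staging_path: PurePosixPath) -> bool:
    """True if staging_path lies somewhere below plugins_mount_dir."""
    return PurePosixPath(plugins_mount_dir) in staging_path.parents


if __name__ == "__main__":
    old = staging_path_pre_124("pvc-01ba9661")
    # "hypothetical-volume-handle" is a placeholder, not a real BeeGFS volume handle.
    new = staging_path_124("beegfs.csi.netapp.com", "hypothetical-volume-handle")
    for mount_dir in (str(CSI_DIR / "pv"), str(CSI_DIR)):
        print(mount_dir, driver_can_see(mount_dir, old), driver_can_see(mount_dir, new))
```

With plugins-mount-dir set to .../csi/pv, only the old layout is covered. With .../csi, both layouts are covered, which is why the more general setting should be backwards compatible.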

It's hard to understand the exact cause of your issue without diving in pretty deep, but it seems likely to be upgrade flow related. It would make a lot more sense to me if you were coming from v1.2.2 (which had incompatible base manifests), but that doesn't appear to be the case. To anyone else arriving here for some similar reason:

  1. Be sure you are using the base manifests and overlays as described in https://github.com/NetApp/beegfs-csi-driver/tree/master/deploy/k8s#basics so you will always pick up the latest base manifests when upgrading the driver.
  2. Do not attempt to use driver version <1.3.0 with Kubernetes version >=1.24.0, as there is a hard incompatibility related to the discussion here.
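
As a rough illustration of rule 2, the version check could be sketched in Python as below. The helper names are hypothetical, and the sketch only encodes the driver <1.3.0 with Kubernetes >=1.24.0 rule stated above. A version pair that passes can still fail if the manifests were modified, for example by pointing plugins-mount-dir at .../csi/pv as in the original report.

```python
import re


def parse_version(text: str) -> tuple:
    """Turn 'v1.24.9' or '1.3.0' into a tuple such as (1, 24, 9)."""
    match = re.match(r"v?(\d+)\.(\d+)(?:\.(\d+))?", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version: {text!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def is_known_incompatible(driver_version: str, kubernetes_version: str) -> bool:
    """True for the combination ruled out above: driver <1.3.0 on Kubernetes >=1.24.0."""
    return (
        parse_version(driver_version) < (1, 3, 0)
        and parse_version(kubernetes_version) >= (1, 24, 0)
    )


if __name__ == "__main__":
    for driver, k8s in (("v1.2.2", "v1.24.9"), ("v1.3.0", "v1.24.9"), ("v1.2.2", "v1.23.15")):
        print(driver, k8s, is_known_incompatible(driver, k8s))
```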
