FreeBSD 13+ : NFS access of snapshot returns stale file handle; server zfs commands hang #13974

eborisch · 2022-09-30T14:01:39Z

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | FreeBSD |
| Distribution Version | 13.1-RELEASE-p2 |
| Kernel Version | 13.1-RELEASE-p2 |
| Architecture | amd64 |
| OpenZFS Version | 2.1.4 |

Attempting to access a snapshot over NFS fails with a stale file handle error. Deleting the snapshot then fails as well and blocks further use of the zfs tools. The filesystem itself is still alive and well, fulfilling requests from NFS and locally, but any attempt to issue a zfs command hangs.

On systems with snapshots being created/deleted, like many with automated frequent/hourly/... snapshots, and remote NFS users, this means a remote user can wedge the server's ZFS management interfaces (for any purpose, not just on the particular dataset) just by listing the contents of a snapshot that is later scheduled for deletion.

I initially (June) ran into this with automated snapshot expiration and (attempted) deletion, where I directly observed the issue because zfs sends stopped working. I didn't connect the dots between the failed NFS access, the later snapshot deletion, and the subsequent wedging of the server's zfs commands until Michel's [bug report](https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=266236).

Reproducing

1. Mount NFS-exported zfs filesystem on client.
2. Try to enter a snapshot (.zfs/snapshot/foo) directory on client -> Stale file handle error.
3. Additional steps I checked at this point to see if it illuminated anything; not required to reproduce:
   a) Unmount on client; stop nfsd on server
   b) mount -v on server shows the requested snapshot as mounted
   c) Try explicit unmount of the snapshot path -> umount hangs (but the snapshot path no longer shows up in mount -v)
4. Try deleting snapshot -> zfs hangs

procstat -k $hung_unmount_pid: 

  PID    TID COMM                TDNAME              KSTACK                       
 5260 101043 umount              -                   mi_switch _sleep rms_wlock zfsvfs_teardown zfs_umount dounmount kern_unmount amd64_syscall fast_syscall_common 


procstat -k $hung_zfs_destroy:

  PID    TID COMM                TDNAME              KSTACK                       
 5826 101058 zfs                 -                   mi_switch _sleep vfs_busy zfs_vfs_ref getzfsvfs_impl getzfsvfs zfsctl_snapshot_unmount zfs_ioc_destroy_snaps zfsdev_ioctl_common zfsdev_ioctl devfs_ioctl vn_ioctl devfs_ioctl_f kern_ioctl sys_ioctl amd64_syscall fast_syscall_common

At this point no zfs or zpool commands succeed. (Or at least, none that I tried; all hang.)

Restart required to unwedge.
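For anyone triaging a similar hang, the two stacks above can be told apart mechanically. The following is only a rough sketch: the function names, the signature table, and the labels are made up for illustration, and it assumes the COMM and TDNAME columns contain no whitespace (true for the rows shown here, not guaranteed in general).

```python
# Hypothetical triage helper; names and labels are illustrative assumptions,
# not part of procstat or OpenZFS.
HANG_SIGNATURES = {
    "unmount blocked on teardown lock": ("rms_wlock", "zfsvfs_teardown"),
    "destroy blocked on busy snapshot mount": ("vfs_busy", "zfsctl_snapshot_unmount"),
}

# The two data rows quoted in this report, with the header row.
SAMPLE = """\
  PID    TID COMM                TDNAME              KSTACK
 5260 101043 umount              -                   mi_switch _sleep rms_wlock zfsvfs_teardown zfs_umount dounmount kern_unmount amd64_syscall fast_syscall_common
 5826 101058 zfs                 -                   mi_switch _sleep vfs_busy zfs_vfs_ref getzfsvfs_impl getzfsvfs zfsctl_snapshot_unmount zfs_ioc_destroy_snaps zfsdev_ioctl_common zfsdev_ioctl devfs_ioctl vn_ioctl devfs_ioctl_f kern_ioctl sys_ioctl amd64_syscall fast_syscall_common
"""


def parse_procstat_k(text):
    """Return a list of (pid, tid, comm, frames) from `procstat -k` output."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the header row (its first field is "PID").
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        # fields[3] is TDNAME; everything after it is the kernel stack.
        rows.append((int(fields[0]), int(fields[1]), fields[2], fields[4:]))
    return rows


def classify(frames):
    """Return the labels whose frames all appear in the given kernel stack."""
    return [label for label, needed in HANG_SIGNATURES.items()
            if all(frame in frames for frame in needed)]


if __name__ == "__main__":
    for pid, tid, comm, frames in parse_procstat_k(SAMPLE):
        print(pid, comm, classify(frames))
```

Feeding it the output of `procstat -k` for each stuck PID gives a quick check of whether a hang matches the ones reported here.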

Edit: Prior to the 13.1 upgrade, this system had been running 12.1 (and earlier) with these actions (user NFS snapshot access, which is very useful for letting users recover files; snapshot rotations; etc.) all working beautifully for years.

Additional context

I initially experienced this on a custom kernel, but have reproduced it with GENERIC; users on IRC have reproduced it on CURRENT. Multiple other users have also reported it on the FreeBSD bug report.

FreeBSD bug report: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=266236
Initial report: https://lists.freebsd.org/archives/freebsd-fs/2022-June/001126.html and follow-up: https://lists.freebsd.org/archives/freebsd-fs/2022-June/001127.html

A suggestion was made on the FreeBSD bugzilla to have an OpenZFS bug report, so here I am.


- Add a zfs_exit() call in an error path, otherwise a lock is leaked.
- Remove the fid_gen > 1 check. That appears to be Linux-specific: zfsctl_snapdir_fid() sets fid_gen to 0 or 1 depending on whether the snapshot directory is mounted. On FreeBSD it fails, making snapshot dirs inaccessible via NFS.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Andriy Gapon <avg@FreeBSD.org>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Fixes: 43dbf88 ("FreeBSD: vfsops: use setgen for error case")
Closes openzfs#14001
Closes openzfs#13974
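To illustrate why a missing exit call on an error path can wedge later operations, here is a toy model. It is not the zfs_fhtovp() code: the `TeardownLock` class and both lookup functions are invented for this sketch, and the link to the umount stuck in rms_wlock is my reading of the stack, since the commit message only says that a lock is leaked.

```python
class TeardownLock:
    """Toy reader-counting lock; a stand-in, not the kernel's actual lock."""

    def __init__(self):
        self.readers = 0

    def enter(self):
        # Analogous to taking the read side when an operation starts.
        self.readers += 1

    def exit(self):
        # Analogous to dropping the read side when the operation ends.
        self.readers -= 1

    def can_teardown(self):
        # A writer (e.g. unmount) may proceed only once no reader remains.
        return self.readers == 0


def lookup_buggy(lock, handle_ok):
    """Error path returns without exit(), so the reader count never drops."""
    lock.enter()
    if not handle_ok:
        return "ESTALE"  # leak: exit() is skipped on this path
    lock.exit()
    return "OK"


def lookup_fixed(lock, handle_ok):
    """Same logic, but exit() runs on every path."""
    lock.enter()
    try:
        if not handle_ok:
            return "ESTALE"
        return "OK"
    finally:
        lock.exit()


if __name__ == "__main__":
    buggy, fixed = TeardownLock(), TeardownLock()
    print(lookup_buggy(buggy, False), buggy.can_teardown())
    print(lookup_fixed(fixed, False), fixed.can_teardown())
```

In the buggy variant a single failed lookup leaves a reader outstanding indefinitely, so anything that needs exclusive access afterwards would wait forever; in this model that is the analogue of the hung umount.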


eborisch added the Type: Defect Incorrect behavior (e.g. crash, hang) label Sep 30, 2022

eborisch mentioned this issue Oct 7, 2022 in "FreeBSD: Fix a pair of bugs in zfs_fhtovp() #14001" (merged)

behlendorf closed this as completed in ed566bf Oct 11, 2022
