rare zfsctl_expire_snapshot deadlock #1527

behlendorf · 2013-06-18T16:23:50Z

Rare deadlock observed between a zfs destroy and an expiring automounted snapshot. The fix looks straightforward enough: move the taskq_cancel_id() call, which may block, outside the mutex. We don't require this operation to be locked.

[1096642.886453] INFO: task z_unmount/0:2153 blocked for more than 120 seconds.
[1096642.887045] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1096642.887708] z_unmount/0     D ffffffff81806240     0  2153      2 0x00000000
[1096642.887714]  ffff8803298dfce0 0000000000000046 0000000000000008 0000000100000000
[1096642.887721]  ffff8803298dffd8 ffff8803298dffd8 ffff8803298dffd8 00000000000137c0
[1096642.887727]  ffffffff81c0d020 ffff880325cdae00 ffff8803298dfd40 ffff8803d6fb84b8
[1096642.887732] Call Trace:
[1096642.887744]  [<ffffffff8165d76f>] schedule+0x3f/0x60
[1096642.887750]  [<ffffffff8165e577>] __mutex_lock_slowpath+0xd7/0x150
[1096642.887754]  [<ffffffff8165e18a>] mutex_lock+0x2a/0x50
[1096642.887802]  [<ffffffffa0237276>] zfsctl_unmount_snapshot+0x46/0xc0 [zfs]
[1096642.887806]  [<ffffffff8165d11c>] ? __schedule+0x3cc/0x6f0
[1096642.887842]  [<ffffffffa023731d>] zfsctl_expire_snapshot+0x2d/0x80 [zfs]
[1096642.887855]  [<ffffffffa0130776>] taskq_thread+0x236/0x4b0 [spl]
[1096642.887861]  [<ffffffff81060670>] ? try_to_wake_up+0x200/0x200
[1096642.887872]  [<ffffffffa0130540>] ? task_done+0x160/0x160 [spl]
[1096642.887878]  [<ffffffff8108b48c>] kthread+0x8c/0xa0
[1096642.887885]  [<ffffffff81669e34>] kernel_thread_helper+0x4/0x10
[1096642.887889]  [<ffffffff8108b400>] ? flush_kthread_worker+0xa0/0xa0
[1096642.887893]  [<ffffffff81669e30>] ? gs_change+0x13/0x13
[1096642.887931] INFO: task zfs:10690 blocked for more than 120 seconds.
[1096642.888567] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1096642.889228] zfs             D ffffffff81806240     0 10690  10656 0x00000000
[1096642.889232]  ffff8800a8555c58 0000000000000082 ffff8800a8555c68 00000000f715b194
[1096642.889238]  ffff8800a8555fd8 ffff8800a8555fd8 ffff8800a8555fd8 00000000000137c0
[1096642.889243]  ffff88032bdcc500 ffff880324df4500 ffff8800a8555c68 ffff88062a5f9c00
[1096642.889249] Call Trace:
[1096642.889252]  [<ffffffff8165d76f>] schedule+0x3f/0x60
[1096642.889263]  [<ffffffffa0130247>] taskq_wait_id+0xa7/0x160 [spl]
[1096642.889273]  [<ffffffffa01300a9>] ? taskq_find+0x169/0x260 [spl]
[1096642.889277]  [<ffffffff8108bf30>] ? add_wait_queue+0x60/0x60
[1096642.889287]  [<ffffffffa0130af5>] taskq_cancel_id+0x105/0x1e0 [spl]
[1096642.889295]  [<ffffffff81163e3b>] ? kfree+0x3b/0x140
[1096642.889331]  [<ffffffffa02367d5>] __zfsctl_unmount_snapshot.isra.2+0xe5/0x110 [zfs]
[1096642.889368]  [<ffffffffa02372ab>] zfsctl_unmount_snapshot+0x7b/0xc0 [zfs]
[1096642.889402]  [<ffffffffa02043ac>] ? rrw_enter+0x15c/0x190 [zfs]
[1096642.889438]  [<ffffffffa0240261>] zfs_unmount_snap+0x101/0x130 [zfs]
[1096642.889474]  [<ffffffffa024031f>] zfs_ioc_destroy_snaps_nvl+0x8f/0x130 [zfs]
[1096642.889509]  [<ffffffffa0211573>] ? spa_open+0x13/0x20 [zfs]
[1096642.889546]  [<ffffffffa02444ec>] zfsdev_ioctl+0xdc/0x1b0 [zfs]
[1096642.889554]  [<ffffffff8118beda>] do_vfs_ioctl+0x8a/0x340
[1096642.889559]  [<ffffffff81144513>] ? do_munmap+0x1f3/0x2f0
[1096642.889563]  [<ffffffff8118c221>] sys_ioctl+0x91/0xa0
[1096642.889568]  [<ffffffff81667cc2>] system_call_fastpath+0x16/0x1b


It is possible for an automounted snapshot which is expiring to deadlock with a manual unmount of the snapshot. This can occur because taskq_cancel_id() will block if the task is currently executing until it completes. But it will never complete because zfsctl_unmount_snapshot() is holding the zsb->z_ctldir_lock which zfsctl_expire_snapshot() must acquire.

---------------------- z_unmount/0:2153 ---------------------
mutex_lock                <blocking on zsb->z_ctldir_lock>
zfsctl_unmount_snapshot
zfsctl_expire_snapshot
taskq_thread

------------------------- zfs:10690 -------------------------
taskq_wait_id             <waiting for z_unmount to exit>
taskq_cancel_id
__zfsctl_unmount_snapshot
zfsctl_unmount_snapshot   <takes zsb->z_ctldir_lock>
zfs_unmount_snap
zfs_ioc_destroy_snaps_nvl
zfsdev_ioctl
do_vfs_ioctl

We resolve the deadlock by dropping the zsb->z_ctldir_lock before calling __zfsctl_unmount_snapshot(). The lock is only there to prevent concurrent modification to the zsb->z_ctldir_snaps AVL tree. Moreover, we're careful to remove the zfs_snapentry_t from the AVL tree before dropping the lock which ensures no other tasks can find it. On failure it's added back to the tree.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#1527

It is possible for an automounted snapshot which is expiring to deadlock with a manual unmount of the snapshot. This can occur because taskq_cancel_id() will block if the task is currently executing until it completes. But it will never complete because zfsctl_unmount_snapshot() is holding the zsb->z_ctldir_lock which zfsctl_expire_snapshot() must acquire.

---------------------- z_unmount/0:2153 ---------------------
mutex_lock                <blocking on zsb->z_ctldir_lock>
zfsctl_unmount_snapshot
zfsctl_expire_snapshot
taskq_thread

------------------------- zfs:10690 -------------------------
taskq_wait_id             <waiting for z_unmount to exit>
taskq_cancel_id
__zfsctl_unmount_snapshot
zfsctl_unmount_snapshot   <takes zsb->z_ctldir_lock>
zfs_unmount_snap
zfs_ioc_destroy_snaps_nvl
zfsdev_ioctl
do_vfs_ioctl

We resolve the deadlock by dropping the zsb->z_ctldir_lock before calling __zfsctl_unmount_snapshot(). The lock is only there to prevent concurrent modification to the zsb->z_ctldir_snaps AVL tree. Moreover, we're careful to remove the zfs_snapentry_t from the AVL tree before dropping the lock which ensures no other tasks can find it. On failure it's added back to the tree.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Closes openzfs#1527
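For readers who want to see the locking pattern in isolation, here is a minimal userspace sketch using pthreads. It is not the ZFS code: ctldir_lock, ctldir_snaps, snapentry_t, expire_task(), cancel_expire_task(), unmount_snapshot() and demo_unmount_without_deadlock() are all hypothetical stand-ins, a one-slot pointer replaces the AVL tree, and pthread_join() stands in for the blocking behaviour of taskq_cancel_id(). The "add back on failure" step from the commit message is left out for brevity.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for zfs_snapentry_t. */
typedef struct snapentry {
    pthread_t expire_thread;  /* plays the role of the delayed expire task */
    bool      expired;        /* set by the expire task once it has run */
} snapentry_t;

/* Stand-ins for zsb->z_ctldir_lock and the z_ctldir_snaps AVL tree. */
static pthread_mutex_t ctldir_lock = PTHREAD_MUTEX_INITIALIZER;
static snapentry_t *ctldir_snaps = NULL;  /* one-slot "tree" */

/* Expire task: it must take the lock, as zfsctl_expire_snapshot() does. */
static void *expire_task(void *arg)
{
    snapentry_t *sep = arg;

    pthread_mutex_lock(&ctldir_lock);
    if (ctldir_snaps == sep)
        ctldir_snaps = NULL;
    sep->expired = true;
    pthread_mutex_unlock(&ctldir_lock);
    return NULL;
}

/* Blocking cancel: waits until the task has finished running. */
static void cancel_expire_task(snapentry_t *sep)
{
    pthread_join(sep->expire_thread, NULL);
}

/* Manual unmount: unlink the entry under the lock, then drop the lock
 * BEFORE the blocking cancel. Holding the lock across the cancel would
 * reproduce the deadlock, because the expire task could never acquire it. */
static void unmount_snapshot(snapentry_t *sep)
{
    pthread_mutex_lock(&ctldir_lock);
    if (ctldir_snaps == sep)
        ctldir_snaps = NULL;  /* no other thread can find it now */
    pthread_mutex_unlock(&ctldir_lock);

    cancel_expire_task(sep);  /* safe: the lock is not held here */
}

/* Returns 0 if the unmount completed and the entry is gone, -1 otherwise. */
int demo_unmount_without_deadlock(void)
{
    snapentry_t se = { .expired = false };

    pthread_mutex_lock(&ctldir_lock);
    ctldir_snaps = &se;
    pthread_mutex_unlock(&ctldir_lock);

    if (pthread_create(&se.expire_thread, NULL, expire_task, &se) != 0)
        return -1;

    unmount_snapshot(&se);

    /* pthread_join() has returned, so the expire task is finished. */
    return (se.expired && ctldir_snaps == NULL) ? 0 : -1;
}
```

Calling demo_unmount_without_deadlock() from a main() built with -pthread should return 0 regardless of which thread wins the race for the lock. Moving the cancel_expire_task() call above the pthread_mutex_unlock() in unmount_snapshot() recreates the hang whenever the expire task has not yet taken the lock.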

behlendorf mentioned this issue Jul 11, 2013

Fix zfsctl_expire_snapshot() deadlock #1586

Closed

behlendorf closed this as completed in 7635167 Jul 12, 2013
