Adding ZVol as log device of separate pool causes deadlock. #6065
Comments
PVE developer hat on: please report (potential) bugs in PVE to the Proxmox bugtracker at https://bugzilla.proxmox.com first. We forward bug reports and fixes to our upstreams as needed. You are running an outdated version of PVE 4 with known issues. While I am not sure whether your proposed setup is even supposed to work (zvols as vdevs without a layer of abstraction like KVM in between have caused problems in the past), an up-to-date PVE 4.4 installation (with ZFS 0.6.5.9 and kernel 4.4.49-1-pve) does not show the issue you described.
I have seen a similar problem with recent performance analysis runs on 0.7.0 RC1 and then 0.7.0 RC3. We run a virtualized environment (under Xen) with our guest VMs running over zvols managed from our driver domain. One workload function is a fileserver, which runs a separate zpool on top of the zvols provided. This fileserver zpool is created within the driver domain directly on top of a couple of zvols: one for the usable disk partition and another for a SLOG device (a performance enhancement). This works fine under 0.6.4.2 but causes ZFS to lock up under 0.7.0. I've tried various builds in the 0.6.5.x series but can't remember if there was an issue with those versions. I'll provide a bit more detail on the 0.7.0 issue when I finish my current performance runs, to understand whether it is constrained to the use of a zvol as a SLOG in an overlaid ZFS zpool, or also occurs with a simple zpool comprising a single zvol and no log or cache device.
@behlendorf have there been any other reports of similar issues, and do you know if it is constrained to the 0.7.0 stream or also appeared in 0.6.5.x? It is quite a big deal for us, as our VM disk provisioning currently operates solely in our domain 0 hosting the zpool, before any VMs are created. Of course it would work fine if the zvol was passed through to the VM and the zpool then created and maintained solely within that context, but that breaks our whole provisioning flow. When I get a chance I'll see if I can pull a stack when this happens to show where the deadlock occurs. I thought I'd read some time back that there was consideration to prevent zpools being created directly over zvols - this is obviously not that, since we see a deadlock rather than an error return, but is there anything in the works around that we should be aware of (hopefully not, as that would break backwards compatibility)?
@koplover layering one zpool on another like this is a difficult thing to automatically detect, so it hasn't been reliably handled in any of the releases. As you observed, opening the devices can, but won't always, result in a deadlock. There was a patch proposed in #5286 from @ryao which added a module option for this. I don't mind dusting off the original patch so we can get it included. I've opened PR #6093 with a refreshed version of the original patch. It would be great if you could verify whether it solves the issue you're observing. With the patch applied you'll need to set the new module option.
@behlendorf I've run into this issue each time I've deployed a 0.7.0 RC ZFS system - only a sample of three, but a deadlock 100% of the time so far. It sounds from the above that this is a race condition which has been around for some time, dependent on the timing of creating the underlying zvols and the zpool on top of them.

As a quick test I've therefore gone to our disk driver domain that hosts our primary zpool (holding the zvols representing the volumes of our guest VMs), and created a couple of zvols directly: one to back the overlying zpool, and another for its log device. A couple of minutes later I run:

zpool create ztank /dev/zvol/diskconvm/test-data log /dev/zvol/diskconvm/test-slog

This hangs, presumably deadlocked, and it holds out any subsequent zpool / zfs admin commands. Dumping the virtual CPUs on this hosting domain does not reveal anything. The logs show only the following, which would seem to be an effect rather than a cause, but may help pinpoint the lock causing this deadlock.

Happy to try the patch, just want to make sure we are on the same page. The issue I'm seeing seems to be completely repeatable, more than a race condition. I can certainly try the patch, thanks for digging it out, but I wonder if there is a different issue here?
Let me summarize the posted stacks so we're all on the same page. The deadlock encountered here is due to a lock inversion. When layering pools the following is possible:

zpool                                        systemd-udevd
---------------------                        ---------------------
do_vfs_ioctl                                 do_sys_open
zfsdev_ioctl                                 do_filp_open
zfs_ioc_pool_create                          path_openat
spa_create <- take spa_namespace_lock        do_last
vdev_create                                  vfs_open
vdev_open                                    do_dentry_open
vdev_root_open                               blkdev_open
vdev_open_children                           blkdev_get
vdev_open                                    __blkdev_get
vdev_disk_open                               zvol_open <- take zvol_state_lock
blkdev_get_by_path                           dmu_objset_own
blkdev_get                                   dsl_pool_hold
__blkdev_get                                 spa_open
zvol_open <- wait zvol_state_lock            spa_open_common <- wait spa_namespace_lock

What needs to happen to properly address this is to break up the zvol_state_lock such that it only protects insertion/removal from the zvol_state_list. A new lock could be added per zvol_state_t to protect its contents.

@tuxoko @ryao @bprotopopov have all been working on the zvol code fairly recently. Do any of you have the time to tackle this? In short, we need to never call dmu_objset_own under the global zvol_state_lock.

@koplover you're right, the proposed patch in #6093 won't fully address this issue. I'll close that PR.

I'll see what I can do :)
But I must ask - why is this a good idea :) to use a zvol as a log device?
Is this a test setup of some sort?
One issue with this approach is that there is a dependency being created between two pools: if the SSD pool is not available, one cannot import the spinning disk pool without the -m option, which might result in data loss. Another issue that comes to mind is what happens if the SSD pool runs out of space. I think these issues need to be considered carefully before going with this type of setup in production. A somewhat similar situation arises if one deploys L2ARC that is shared by several pools - to my knowledge, no one does this, precisely to avoid cross-pool dependencies on device availability.
@bprotopopov This is a production configuration Bear in mind in our setup we have a heavily virtualised environment (Xen) with many different functions realised as separate virtualised guests - around 20-30. All of these guests are supported by zvols provided from the base zpool (vdevs full hard disks in mirror configuration, and log / cache device from SSD). One of these virtualised workloads (VMs) benefits greatly from running a ZFS filesystem, a file server function, where snapshotting provides important features for user restore, backup etc. Essentially, it is 'luck' that we choose ZFS in both places in the architecture (due to the fine features ZFS provides. So, the next question is how to provide this in the most performant and architecturally sound manner. We have tried various configurations, including passing through the zvol in different ways to the overlying VM, and having just a plain vdev comprising the overlying (VM) zpool within the zvol log. However, this too has given rise to performance issues. We now have the 'data' zvol as nosynch, and the 'log' zvol as synch which performs well and does not risk the data. In truth if the underlying provider zvol corrupts in any way we are in a bad way and need to rebuild as all the guest VMs are hit. The restore of the overlying file server zpool is just one instance of restoring this data. Given the above, are you still concerned of the relavance of this scenario? In terms of SSD pool what are you considering here - why would it run out of space specifically as the overlying zvol is just a another zvol from the perspective of the supporting primary zpool - the 'log' zvol just has an optimum block size? |
@koplover
Going back to the issue #6065, there are two sets of concerns:
1) the deadlock that arises from a conceptually valid use of ZFS
2) the practicality of the conceptually valid use of ZFS
The first set of issues should be addressed for correctness; this is not being debated.
The second set of issues relates to the practicality of using a zvol on top of an SSD pool as a log device for other pools. There are several issues to consider. I have mentioned the dependencies created between the pools:
1) should the SSD pool become unavailable (even temporarily), all the pools that use zvols from the SSD pool will be affected, i.e. they will not be importable without potential loss of data.
2) should the SSD pool's capacity utilization approach 85%-90% or above (including fragmentation), all the pools that use zvols from the SSD pool will be affected, i.e. their performance will greatly suffer.
I assume here that there is more than one zvol provisioned in the SSD pool, because if there is only one such zvol, then the SSDs could simply be used (as a log vdev) by the pool that uses this zvol as a log device, 'with no loss of generality'. Having many zvols as log devices for many pools in many VMs gives rise to the capacity utilization concerns.
Another important concern is the system resource and performance penalty one pays for layering ZFS pools over other pools, e.g. logs over zvols. Conceptually, this approach results in two nested levels of intent logging - one in the SSD pool for the 'log' zvol itself, and one on top of that, in the spinning disk pool that uses the zvol as log device.
I could go into details here, but essentially, I/O processing goes through many software layers, re-scheduling many times to many different threads, performing extra I/Os, caching the same data many times (while copying that data between the caches), and paying a penalty in CPU, memory, storage capacity, and latency for the I/Os being processed.
Hope that helps.
P.S. Without knowing details of your stack, I cannot advise, but I wonder if you could simply use files from two filesystems from the same spinning disk pool with SSD-based log device(s), one filesystem with no sync and one with sync, passed to your VMs as virtual devices. That seems to be more in line with conventional virtualization approaches. Barring that, again, you could provision zvols from the same pool and use the sync/no-sync approach. I am sure you have tried this, although I don't know why the performance would be worse than layering logs on top of zvols.
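For illustration, a minimal sketch of that suggested alternative - all pool, dataset, device and file names here are hypothetical:

```sh
# spinning-disk pool with an SSD-based log device
zpool create tank mirror sda sdb log nvme0n1p1

# two filesystems from the same pool, one sync and one no-sync
zfs create -o sync=always   tank/vm-sync
zfs create -o sync=disabled tank/vm-nosync

# raw image files from these filesystems, passed to the VMs as virtual disks
truncate -s 100G /tank/vm-sync/disk0.img
truncate -s 100G /tank/vm-nosync/disk1.img
```

The same idea works with zvols provisioned from the same pool, setting sync per zvol instead of per filesystem.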
@bprotopopov Thanks for getting back to me with such a detailed answer.

In terms of the two overriding concerns at the start:

1. Yes, the SSD pool has many zvols - well over 100 - they are the basis of every VM we run; some have 4 or 5, and we have > 20 VMs. So you can see that if the SSD pool goes down we are in more trouble for the disk subsystem than concerns about this 'log' zvol alone - each root binary disk, configuration etc. lives on the zvols (we need to pass a block device through to the VMs; we could use loop devices, but zvols are made for this job).
2. We ensure that the SSD pool does not go over 90% because of the reported performance issues - we allow provisioning of only up to 90% of the SSD pool's assigned space.

Now, if we consider the function of the file service VM (a single VM; no other VM runs a zpool of its own as its filesystem - they are ext4 for Linux VMs or NTFS for Windows VMs) in isolation: if this were a standalone server we would want ZFS for the various functional integrations it supports. It is in this case just a 'file system' (yes, I know it's more than that) that we chose for this VM for its capabilities.

So, this then leaves us with how best to achieve this. We require that writes are synchronous. We want file writes to be reliable and consistent - no loss. My first attempt was therefore the obvious one: a sync zvol passed through, and a simple pool defined on it which was used to hold files. There is no 'no sync' functionality required for the solution - all file writes should be synchronous.

Performance testing found large copies in this configuration to be slow. This is where the alternative thoughts come from: we want to amalgamate writes on the underlying SSD zpool to make this most efficient. How can this be achieved?

The thinking was: how about passing through a second zvol and making that synchronous to act as a log device in that pool, so that all writes are guaranteed to the log and are most efficiently passed on to the SSD pool log. As the log writes are then guaranteed, we can make the core file zvol asynchronous, using that log, without file loss.

What I really don't understand is why having the zvol acting as the log of the overlying zpool is any different from any other zvol with a file system on top. From the perspective of the underlying pool, there are just block writes occurring from the overlying zvol. Perhaps I am missing something here?

For clarity, this is the only zvol we use as a log device out of the >100 we have on our server. This is to handle a problem we have seen in performance when we have an overlying pool consisting only of a sync zvol defining the single vdev of that pool (no log or cache).
@koplover, if you feel that this is your preferred solution, that's what you have to use.
Even though conceptually, double-virtualizing storage using the zvol-as-vdev approach seems wasteful, I recommend measurements. To assess the performance hit in question, compare your setup with passing a straight SSD to your in-VM pool as a log device. Maybe your workload is such that this is not an issue.
@bprotopopov Just for clarity, is your concern on performance purely about a zvol as a log device, or more generally about a zvol acting as a vdev for zpool data? Our benchmarks have found performance to be OK in this configuration.
I would like to avoid making general statements as far as performance is concerned. I was pointing out the inefficiencies of using zvols as vdevs.
One could still find this approach yielding "acceptable" performance in many settings. Yet to see what price is being paid in terms of resources and performance, one needs to run experiments and take measurements.
Hi, @behlendorf, I am reviewing the code to prototype the proposed change (per-zvol_state_t lock), but in commit 35d3e32, you seem to have added a code path in zvol_set_volsize() that updates the zvol's size even though the zvol has not been opened yet (there is no zvol_state_t struct associated with it). Can you please explain in what use case this is meaningful?
@bprotopopov it's possible that a zvol dataset can exist without a matching zvol_state_t.
Hm, @behlendorf :)
Hi, @behlendorf, if you could refresh my recollection: are zvol_open() and zvol_release() called from add_disk() and del_gendisk() only (which is why the current code checks for ownership of the zvol_state_lock first), or are there other call paths in the zvol code that call zvol_open() and zvol_release() with zvol_state_lock held?
Yes, for any of the special properties in
I have to confess, I sort of saw it coming :) but still wanted to see this in action. @behlendorf, it appears that with zvol_state_lock moved out of the way, we are now deadlocking on bdev->bd_mutex and spa_namespace_lock taken in the opposite order:

PID: 12591  TASK: ffff88007d041520  CPU: 0  COMMAND: "zpool"
where zfs_ioc_vdev_add() calls spa_open() as a first order of business, then opens the device, trying to take bdev->bd_mutex in __blkdev_get() for the zv->zv_disk associated with the log zvol;

PID: 12699  TASK: ffff880068435520  CPU: 2  COMMAND: "blkid"
where sys_open() first takes the bdev->bd_mutex and then tries to get the spa_namespace_lock.

This seems like a pretty fundamental issue - not just the global nature of the zvol_state_lock - that makes it difficult to use zvols as log devices.
I seem to observe the same deadlock when trying to add a zvol as a regular vdev. Is there any evidence that this has ever worked in zfsonlinux?
This has never worked perfectly reliably under Linux, but it used to be better. @bprotopopov so we used to have a little bit of code for this. The idea was that we could preemptively take the spa_namespace_lock. We also ran into various related deadlocks due to how broad the zvol_state_lock is. But with those issues behind us thanks to your work, the original workaround from 1ee159f might be sufficient (although admittedly not pretty). The major downside of course is the serialization through that lock.
@bprotopopov @behlendorf I can confirm that with 0.6.4.2 we never saw this issue; evidently our system preparation code was lucky enough to avoid any race condition in the code. We have an automated deploy / test stack that completely re-images on average 3-4 servers a night, every night, with the latest test builds, as well as deploying production systems as required. So over the last year that is > 1000 deploys, and we have never seen this issue. Any similar deployment against 0.7.0, and indeed manual creation of the overlying zpool, immediately exhibits the issue every time.
Yes, reverting 1ee159f should help.
But I wonder if it would be cleaner to implement a special vdev type based on zvols?
Yeah, seems to work much better :)

[root@centos-6 build]# zpool add test_pool /dev/zvol/log_pool/zvol0
So, generally, things seem to be working OK (did not do much I/O testing though), but:

[root@centos-6 ~]# zpool import
  pool: test_pool

the zvol-based test_pool shows up as UNAVAIL until the underlying zvol_pool is imported:

[root@centos-6 ~]# zpool import zvol_pool

which I think is expected.
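In other words, with layered pools the pool providing the zvols has to be imported first (and exported last). Using the pool names from the output above, the expected ordering is:

```sh
zpool import zvol_pool   # underlying pool; creates the zvol device nodes
zpool import test_pool   # layered pool; its zvol vdev is now available

zpool export test_pool   # reverse order on the way down
zpool export zvol_pool
```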
Great news. @bprotopopov that's the behavior I'd expect. Once you're happy with the patch it would be great to open a PR so we can get you some feedback and additional testing. You might want to consider enabling the following test cases in the ZTS which depend on this layered pool functionality. That will get us some additional testing and ensure we don't regress on this in the future:
@behlendorf, sure, I will look into it.
@behlendorf, I have looked into enabling the tests you mentioned above and found that there are several things to be addressed, unrelated to support for zvol-based vdevs, before they can be enabled. I'll have to deal with this in a separate commit.
@behlendorf, @koplover, I am still thinking about a new vdev type - zvol-based - as an alternative to the trylock() in zvol_first_open(). The custom vdev_open() would not have to go through blkdev_get() and would therefore avoid the deadlock shown above. In fact, we could allow one to specify such vdevs by their dataset name (pool/zvol), so one could use zvols even if the device nodes for them are not available (zvol_inhibit_dev=1). This approach would also allow one to bypass all the complexities of interacting with the block device layer and therefore make zvol-based vdev support more portable (OpenZFS). Plus, we could probably gain some efficiencies by shortening the trip through the block layer.

Still, one fundamentally unpleasant issue with the approach of building pools on top of zvols is that it might be possible to get into a circular dependency situation where pools would use zvols from each other as their vdevs. Needless to say, this would be unfortunate. It could be expensive to perform dependency checks of this kind at the time of vdev add.
Fix lock order inversion with zvol_open() as it did not account for use of zvols as vdevs. The latter use cases resulted in the lock order inversion deadlocks that involved spa_namespace_lock and bdev->bd_mutex. Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #6065 Issue #6134
@bprotopopov that's an interesting idea. There's no reason why we couldn't add a new vdev leaf type which could layer directly on a zvol, bypassing the block layer. You could do a similar thing and layer on files in a ZFS filesystem, bypassing the VFS. It would be an interesting thing to prototype to determine what practical advantages there are - would it really help performance, etc.

Back to this specific issue: after merging 5559ba0, layering directly on zvols is now working well locally for me. However, I have noticed a couple of suspicious buildbot failures where tests
@kpande good thought. I've definitely observed the suspicious failures on systems with 16K stacks so I wouldn't think so, but it's a possibility.
@behlendorf, if there is more info on the hangs, I can take a look.
@bprotopopov thanks, I'll definitely let you know and open an issue with the details if I'm able to get some backtraces.
@bprotopopov I was able to easily reproduce the issue under Amazon Linux in EC2 with the
System information
Describe the problem you're observing
I have one pool that is all SSD. I have another pool that is all spinning disks. I tried making a ZVol on the SSD pool and then using that zvol as a log device for the spinning disk pool. When running the zpool add command, I/O to the spinning disk pool freezes. The SSD pool continues to work. Trying to run any other zpool commands, they will not respond (for example, zpool iostat).
Describe how to reproduce the problem
see above
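A minimal command sequence along the lines described above - the pool and volume names here are hypothetical, not taken from the reporter's system:

```sh
# ssdpool is the all-SSD pool, tank is the spinning-disk pool
zfs create -V 16G ssdpool/slog

# this is the step that hangs and blocks all further zpool/zfs commands
zpool add tank log /dev/zvol/ssdpool/slog
```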
Include any warning/errors/backtraces from the system logs
No errors on screen, just a hang. However, if you want me to run some tests or pull up log files and attach them, I'm happy to if you tell me which ones you want.
EDIT:
As a workaround I made a ZFS filesystem on the SSD pool, then created a raw image file and used that file instead of a zvol. This worked without issue. But I don't see why the ZVol shouldn't work - it's supposed to be like any other raw disk.
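A sketch of that workaround, again with hypothetical names:

```sh
# a filesystem plus a raw image file on the SSD pool, instead of a zvol
zfs create ssdpool/images
truncate -s 16G /ssdpool/images/tank-slog.img

# adding the file as the log device works; the zvol variant deadlocks
zpool add tank log /ssdpool/images/tank-slog.img
```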