ZFS/zpool hang on zfs destroying a ZVOL with a snapshot #1948
After a little while more, considerably more noise appears: `[102123.730810] INFO: task spl_system_task:1447 blocked for more than 120 seconds.`
I have no idea whether this is relevant, but the Debian system at hand runs as dom0 under Xen 4.1.4 (Debian 4.1.4-4). I mention this because I see a couple of xen_context_switch-related lines in the kernel messages above.
There's a decent chance this was caused by a stack overrun. The stack usage for this call path was already reduced in master by commit a168788. |
I can give that a try; can I patch that into a clean 0.6.2, or does it depend on additional changes? |
@niekbergboer It should apply cleanly, and it's safe to cherry-pick.
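For reference, applying a single upstream commit on top of a release can be sketched with git. The commit hash a168788 comes from the comment above; the repository URL and the release tag name are assumptions, not confirmed in this thread:

```shell
# Sketch only: repository URL and tag name 'zfs-0.6.2' are assumed.
git clone https://github.com/zfsonlinux/zfs.git
cd zfs
git checkout -b stack-fix zfs-0.6.2   # start from the assumed 0.6.2 release tag
git cherry-pick a168788               # apply the stack-usage fix on top of it
```

After rebuilding and reloading the modules, the reporter could retry the destroy with the reduced-stack code path in place.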
@behlendorf I reproduced this today on my system. I do not think a stack overrun is the cause. I did a replication stream to a new pool, killed it mid-way, destroyed the new pool and then tried destroying a dataset that was obsolete before restarting the stream. Something during this triggered the following:
I seem to be able to reproduce this reliably when trying to migrate to a new root pool. FreeBSD has a possible fix: freebsd/freebsd-src@4995789cde5. I am testing it on my system now.
That FreeBSD patch does not fix this problem. |
Closing, this should be resolved in 0.6.5.x. |
Scenario:
A Debian Jessie system running kernel `Linux batalix 3.11-2-amd64 #1 SMP Debian 3.11.8-1 (2013-11-13) x86_64 GNU/Linux`, with ZoL 0.6.2.
I moved a ZVOL by snapshotting it and `zfs send`ing it elsewhere, after which I deleted the ZVOL from the source pool:
$ sudo zfs destroy -r -v vmvol/sb
will destroy vmvol/sb@move
... after which nothing happens anymore. All subsequent `zpool` and `zfs` commands hang, although I can still access files on already-mounted ZFS datasets. After a while, the following appears in dmesg:
[102003.731621] INFO: task spl_system_task:1447 blocked for more than 120 seconds.
[102003.731624] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[102003.731625] spl_system_task D ffff880310994360 0 1447 2 0x00000000
[102003.731629] ffff880310994040 0000000000000246 0000000000014300 ffff88030e8dffd8
[102003.731631] 0000000000014300 ffff88030e8dffd8 ffff88030e0d3468 ffff88030e0d3440
[102003.731633] ffff88030e0d3470 0000000000000000 0000000000000002 0000000000000000
[102003.731635] Call Trace:
[102003.731649] [] ? cv_wait_common+0xe5/0x1a0 [spl]
[102003.731654] [] ? wake_up_atomic_t+0x30/0x30
[102003.731667] [] ? traverse_prefetcher+0x8b/0x140 [zfs]
[102003.731677] [] ? traverse_visitbp+0x2d7/0x6d0 [zfs]
[102003.731684] [] ? arc_read+0x549/0x8d0 [zfs]
[102003.731693] [] ? traverse_visitbp+0x414/0x6d0 [zfs]
[102003.731702] [] ? traverse_visitbp+0x414/0x6d0 [zfs]
[102003.731711] [] ? traverse_visitbp+0x414/0x6d0 [zfs]
[102003.731720] [] ? traverse_dnode+0x78/0x130 [zfs]
[102003.731729] [] ? traverse_visitbp+0x504/0x6d0 [zfs]
[102003.731738] [] ? traverse_visitbp+0x414/0x6d0 [zfs]
[102003.731747] [] ? traverse_visitbp+0x414/0x6d0 [zfs]
[102003.731755] [] ? traverse_visitbp+0x414/0x6d0 [zfs]
[102003.731764] [] ? traverse_visitbp+0x414/0x6d0 [zfs]
[102003.731773] [] ? traverse_visitbp+0x414/0x6d0 [zfs]
[102003.731781] [] ? traverse_visitbp+0x414/0x6d0 [zfs]
[102003.731790] [] ? traverse_dnode+0x78/0x130 [zfs]
[102003.731798] [] ? traverse_visitbp+0x5bd/0x6d0 [zfs]
[102003.731802] [] ? xen_end_context_switch+0x9/0x20
[102003.731804] [] ? __switch_to+0x125/0x490
[102003.731813] [] ? traverse_prefetch_thread+0x86/0xc0 [zfs]
[102003.731822] [] ? dmu_recv_end+0x210/0x210 [zfs]
[102003.731826] [] ? taskq_thread+0x22c/0x4a0 [spl]
[102003.731829] [] ? wake_up_state+0x10/0x10
[102003.731833] [] ? taskq_cancel_id+0x1e0/0x1e0 [spl]
[102003.731835] [] ? kthread+0xaf/0xc0
[102003.731838] [] ? kthread_create_on_node+0x110/0x110
[102003.731841] [] ? ret_from_fork+0x7c/0xb0
[102003.731844] [] ? kthread_create_on_node+0x110/0x110
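The migration sequence from the scenario above can be sketched as a few commands. The pool, dataset, and snapshot names (`vmvol/sb`, `@move`) are taken from the report; the destination pool name `backup` is an illustrative assumption, since the report only says the stream was sent "elsewhere":

```shell
# Names 'vmvol/sb' and '@move' are from the report; 'backup' is assumed.
sudo zfs snapshot vmvol/sb@move                            # snapshot the ZVOL
sudo zfs send vmvol/sb@move | sudo zfs receive backup/sb   # replicate it to the other pool
sudo zfs destroy -r -v vmvol/sb                            # the recursive destroy that hangs
```

The hang strikes during the final recursive destroy, which must also remove the `@move` snapshot; the stack trace shows the prefetch thread blocked deep inside `traverse_visitbp` while walking the dataset.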