Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

txg_sync blocked for more than 120s #3613

Closed
sluitz opened this issue Jul 19, 2015 · 12 comments
Closed

txg_sync blocked for more than 120s #3613

sluitz opened this issue Jul 19, 2015 · 12 comments
Labels
Type: Performance Performance improvement or performance problem

Comments

@sluitz
Copy link

sluitz commented Jul 19, 2015

All ZFS read/writes blocked / hung. Did not clear up by itself for an hour (had to force hard reset of system). Possibly triggered by snapshot create or delete. Pool scrub was going on during problem.

Kernel: 2.6.32-504.30.3.el6.x86_64 (CentOS 6.x)
ZFS and SPL rpms built from git master (yesterday)
ZFS: zfs-dkms-0.6.4-164_g53b1d97.el6.noarch
SPL: spl-dkms-0.6.4-13_g37d7cd9.el6.noarch

System: Intel Atom C2750 with 32G of ECC RAM

This happened after an upgrade from git build zfs-dkms-0.6.3-155_g7b2d78a.el6.noarch.rpm / spl-dkms-0.6.3-50_g917fef2.el6.noarch.rpm which ran for many months (since November 2014) without any problems.

Steffen

Kernel log:

INFO: task txg_sync:2417 blocked for more than 120 seconds.
Tainted: P --------------- 2.6.32-504.30.3.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
txg_sync D 0000000000000001 0 2417 2 0x00000000
ffff8808123dd220 0000000000000046 0000000000000000 ffff8808123dd1e4
0000000000000001 ffff88082fc24300 00005ceb0411e34c ffff8800283d58c0
0000000000005750 0000000106110f28 ffff8808222ff068 ffff8808123ddfd8
Call Trace:
[] __mutex_lock_slowpath+0x96/0x210
[] mutex_lock+0x2b/0x50
[] cv_wait_common+0xb7/0x130 [spl]
[] ? autoremove_wake_function+0x0/0x40
[] ? buf_hash_find+0x9f/0x180 [zfs]
[] __cv_wait+0x15/0x20 [spl]
[] arc_read+0xb5/0xa70 [zfs]
[] ? read_tsc+0x9/0x20
[] ? getrawmonotonic+0x34/0xb0
[] ? arc_getbuf_func+0x0/0x80 [zfs]
[] dsl_scan_visitbp+0x509/0xb60 [zfs]
[] dsl_scan_visitbp+0x324/0xb60 [zfs]
[] dsl_scan_visitbp+0x324/0xb60 [zfs]
[] dsl_scan_visitbp+0x324/0xb60 [zfs]
[] dsl_scan_visitbp+0x324/0xb60 [zfs]
[] dsl_scan_visitbp+0x324/0xb60 [zfs]
[] dsl_scan_visitbp+0x324/0xb60 [zfs]
[] ? arc_read+0x3e1/0xa70 [zfs]
[] dsl_scan_visitbp+0x83e/0xb60 [zfs]
[] dsl_scan_visitds+0xe2/0x4c0 [zfs]
[] dsl_scan_sync+0x28f/0xbc0 [zfs]
[] spa_sync+0x3c7/0xb10 [zfs]
[] ? __wake_up_common+0x59/0x90
[] ? __wake_up+0x53/0x70
[] ? read_tsc+0x9/0x20
[] txg_sync_thread+0x389/0x620 [zfs]
[] ? account_entity_enqueue+0x7e/0x90
[] ? txg_sync_thread+0x0/0x620 [zfs]
[] ? txg_sync_thread+0x0/0x620 [zfs]
[] thread_generic_wrapper+0x68/0x80 [spl]
[] ? thread_generic_wrapper+0x0/0x80 [spl]
[] kthread+0x9e/0xc0
[] child_rip+0xa/0x20
[] ? kthread+0x0/0xc0
[] ? child_rip+0x0/0x20

INFO: task zfs:25498 blocked for more than 120 seconds.
Tainted: P --------------- 2.6.32-504.30.3.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
zfs D 0000000000000001 0 25498 25487 0x00000080
ffff8805c5c3fa28 0000000000000082 ffff8805c5c3f978 ffffffffa022dfce
ffff8805c5c3faf8 ffffffffa02532b7 ffff8805c5c3f998 ffffffff00000000
ffff88072165fb00 ffff88081478e800 ffff88081e9cc5f8 ffff8805c5c3ffd8
Call Trace:
[] ? dmu_buf_rele+0xe/0x10 [zfs]
[] ? dsl_dataset_snapshot_check+0x117/0x3a0 [zfs]
[] ? prepare_to_wait_exclusive+0x4e/0x80
[] cv_wait_common+0x11d/0x130 [spl]
[] ? autoremove_wake_function+0x0/0x40
[] __cv_wait+0x15/0x20 [spl]
[] txg_wait_synced+0x8b/0xd0 [zfs]
[] ? dsl_dataset_snapshot_check+0x0/0x3a0 [zfs]
[] dsl_sync_task+0x16a/0x250 [zfs]
[] ? dsl_dataset_snapshot_sync+0x0/0x1a0 [zfs]
[] ? dsl_dataset_snapshot_check+0x0/0x3a0 [zfs]
[] ? dsl_dataset_snapshot_sync+0x0/0x1a0 [zfs]
[] dsl_dataset_snapshot+0x139/0x2e0 [zfs]
[] ? nvlist_add_common+0x3eb/0x450 [znvpair]
[] ? __kmalloc_node+0x4d/0x60
[] ? spl_kmem_alloc_debug+0x9c/0x1e0 [spl]
[] ? nvlist_lookup_common+0x84/0xd0 [znvpair]
[] zfs_ioc_snapshot+0x249/0x290 [zfs]
[] zfsdev_ioctl+0x1cf/0x4d0 [zfs]
[] vfs_ioctl+0x22/0xa0
[] do_vfs_ioctl+0x84/0x580
[] sys_ioctl+0x81/0xa0
[] system_call_fastpath+0x16/0x1b

@kernelOfTruth
Copy link
Contributor

any chance to upgrade to 0.6.4.2 ?

there were significant improvements that could help in your case:
https://github.com/zfsonlinux/zfs/releases/tag/zfs-0.6.4.1
https://github.com/zfsonlinux/zfs/releases/tag/zfs-0.6.4.2

at least from that mirror 0.6.4.2 should be available: http://yum.tamu.edu/zfsonlinux/epel/6/x86_64/

@dclaisse
Copy link

Hi,

I'm seeing quite a similar issue since a few weeks here, but I'm running ZFS nightly on debian (updated yesterday evening to 0.6.4-21-53b1d9). It always seem to happen in the middle of a scrub, but this pool is a backup pool so there is no heavy I/O on it except during scrubs. I/O is completely stalled after these messages (tried to let the system recover by itself for 6+ hours with no effect).

Feel free to ask me any detail that may help troubleshooting the issue.

Kernel log is:

[34708.383167] INFO: task txg_sync:769 blocked for more than 120 seconds.
[34708.385273] Tainted: P O 3.16.0-4-amd64 #1
[34708.387380] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[34708.389512] txg_sync D ffff8800a1c4ee38 0 769 2 0x00000000
[34708.391684] ffff8800a1c4e9e0 0000000000000046 0000000000012f00 ffff8800a1d07fd8
[34708.393874] 0000000000012f00 ffff8800a1c4e9e0 ffffffffa032a870 ffff8800a1d071d8
[34708.396103] ffffffffa032a874 ffff8800a1c4e9e0 00000000ffffffff ffffffffa032a878
[34708.398372] Call Trace:
[34708.400672] [] ? schedule_preempt_disabled+0x25/0x70
[34708.403039] [] ? __mutex_lock_slowpath+0xd3/0x1c0
[34708.405429] [] ? mutex_lock+0x1b/0x2a
[34708.407858] [] ? cv_wait_common+0x98/0x100 [spl]
[34708.410311] [] ? prepare_to_wait_event+0xf0/0xf0
[34708.412848] [] ? arc_read+0xc5/0xa80 [zfs]
[34708.415420] [] ? zio_taskq_dispatch+0x8b/0xa0 [zfs]
[34708.417982] [] ? dsl_scan_scrub_cb+0x485/0x530 [zfs]
[34708.420541] [] ? arc_buf_remove_ref+0x130/0x130 [zfs]
[34708.423134] [] ? dsl_scan_visitbp+0x22e/0xc40 [zfs]
[34708.425702] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[34708.428259] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[34708.430778] [] ? dsl_scan_visitbp+0x681/0xc40 [zfs]
[34708.433271] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[34708.435745] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[34708.438168] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[34708.440560] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[34708.442906] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[34708.445219] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[34708.447504] [] ? arc_read+0x478/0xa80 [zfs]
[34708.449776] [] ? dsl_scan_visitbp+0x890/0xc40 [zfs]
[34708.452053] [] ? dbuf_dirty+0x43c/0x940 [zfs]
[34708.454309] [] ? dsl_scan_visitds+0xd5/0x4a0 [zfs]
[34708.456556] [] ? dsl_scan_sync+0x276/0xbc0 [zfs]
[34708.458737] [] ? spl_kmem_cache_free+0x16f/0x1d0 [spl]
[34708.460918] [] ? spa_sync+0x3c3/0xb30 [zfs]
[34708.463081] [] ? txg_sync_thread+0x373/0x620 [zfs]
[34708.465209] [] ? txg_quiesce_thread+0x360/0x360 [zfs]
[34708.467319] [] ? thread_generic_wrapper+0x6b/0x80 [spl]
[34708.469396] [] ? __thread_exit+0x20/0x20 [spl]
[34708.471447] [] ? kthread+0xbd/0xe0
[34708.473473] [] ? kthread_create_on_node+0x180/0x180
[34708.475512] [] ? ret_from_fork+0x58/0x90
[34708.477555] [] ? kthread_create_on_node+0x180/0x180

@sluitz
Copy link
Author

sluitz commented Jul 20, 2015

The version I'm running (git master from a few days ago) is newer than
0.6.4.2. From what I can tell, all the relevant improvements in 0.6.4.x
are also in master.

On 7/19/2015 5:36 PM, kernelOfTruth aka. kOT, Gentoo user wrote:

any chance to upgrade to 0.6.4.2 ?

there were significant improvements that could help in your case:
https://github.com/zfsonlinux/zfs/releases/tag/zfs-0.6.4.1
https://github.com/zfsonlinux/zfs/releases/tag/zfs-0.6.4.2

at least from that mirror 0.6.4.2 should be available:
http://yum.tamu.edu/zfsonlinux/epel/6/x86_64/


Reply to this email directly or view it on GitHub
#3613 (comment).

@eolson78
Copy link

I to have seen some of these issues on My CENTOS 6.5
Kernel 3.14.4
latest Master ZFS code

Jul 17 14:46:45 kernel: INFO: task txg_sync:4521 blocked for more than 120 seconds.
Jul 17 14:46:45 kernel: Tainted: PF O 3.14.4 #1
Jul 17 14:46:45 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 17 14:46:45 kernel: txg_sync D 0000000000000000 0 4521 2 0x00000080
Jul 17 14:46:45 kernel: ffff88021fae7b58 0000000000000046 0000000000000086 ffff88021fae6010
Jul 17 14:46:45 kernel: 0000000000012c40 0000000000012c40 ffff880234188150 ffff8800b9d5b250
Jul 17 14:46:45 kernel: ffff88021fae7b58 ffff88023f232c40 ffff880234188150 ffff88021fae7b98
Jul 17 14:46:45 kernel: Call Trace:
Jul 17 14:46:45 kernel: [] schedule+0x29/0x70
Jul 17 14:46:45 kernel: [] io_schedule+0x8c/0xd0
Jul 17 14:46:45 kernel: [] cv_wait_common+0xa5/0x130 [spl]
Jul 17 14:46:45 kernel: [] ? bit_waitqueue+0xe0/0xe0
Jul 17 14:46:45 kernel: [] __cv_wait_io+0x18/0x20 [spl]
Jul 17 14:46:45 kernel: [] zio_wait+0x11b/0x1e0 [zfs]
Jul 17 14:46:45 kernel: [] dsl_pool_sync+0xb3/0x430 [zfs]
Jul 17 14:46:45 kernel: [] spa_sync+0x346/0x880 [zfs]
Jul 17 14:46:45 kernel: [] ? __wake_up_common+0x59/0x90
Jul 17 14:46:45 kernel: [] txg_sync_thread+0x351/0x4a0 [zfs]
Jul 17 14:46:45 kernel: [] ? enqueue_task_fair+0x37d/0x490
Jul 17 14:46:45 kernel: [] ? txg_dispatch_callbacks+0x150/0x150 [zfs]
Jul 17 14:46:45 kernel: [] ? txg_dispatch_callbacks+0x150/0x150 [zfs]
Jul 17 14:46:45 kernel: [] ? __thread_create+0x160/0x160 [spl]
Jul 17 14:46:45 kernel: [] thread_generic_wrapper+0x78/0x90 [spl]
Jul 17 14:46:45 kernel: [] ? __thread_create+0x160/0x160 [spl]
Jul 17 14:46:45 kernel: [] kthread+0xce/0xf0
Jul 17 14:46:45 kernel: [] ? kthread_freezable_should_stop+0x70/0x70
Jul 17 14:46:45 kernel: [] ret_from_fork+0x7c/0xb0
Jul 17 14:46:45 kernel: [] ? kthread_freezable_should_stop+0x70/0x70

@kernelOfTruth
Copy link
Contributor

@sluitz yes, the changes first appear in master then end up partially in the point releases

okay, so those changes don't help

@dclaisse , @eolson78 could you please post some stats on your hardware and what kind of system configuration (anything special or out of the ordinary ?) you're running ?

@eolson78
Copy link

@kernelOfTruth 4 VCPU 8GB of RAM no other special configurations at all.

@behlendorf behlendorf added the Type: Performance Performance improvement or performance problem label Jul 20, 2015
@dclaisse
Copy link

@kernelOfTruth setup is quite standard here, 5x3TB Sata in a raidz pool, 8GB RAM, 4GB SSD ZIL.
No specific ZFS tuning, but I have a swap zvol on this pool. I've just launched another scrub after disabling swap to see if it matters.

@dclaisse
Copy link

@kernelOfTruth even with swap disabled, scrub stalls (tried 2 scrubs, had to reboot server using sysrq each time). @behlendorf, at some point system starts being unresponsive due to txg_sync threads stalled, so it's not only a performance issue. Last kernel stack trace is:

[50319.599843] INFO: task txg_sync:742 blocked for more than 120 seconds.
[50319.599992] Tainted: P O 3.16.0-4-amd64 #1
[50319.600095] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[50319.600239] txg_sync D ffff880231e859c8 0 742 2 0x00000000
[50319.600383] ffff880231e85570 0000000000000046 0000000000012f00 ffff88023002bfd8
[50319.600536] 0000000000012f00 ffff880231e85570 ffffffffa030da30 ffff88023002b388
[50319.600689] ffffffffa030da34 ffff880231e85570 00000000ffffffff ffffffffa030da38
[50319.600841] Call Trace:
[50319.600939] [] ? schedule_preempt_disabled+0x25/0x70
[50319.601066] [] ? __mutex_lock_slowpath+0xd3/0x1c0
[50319.601196] [] ? mutex_lock+0x1b/0x2a
[50319.601313] [] ? cv_wait_common+0x98/0x100 [spl]
[50319.601434] [] ? prepare_to_wait_event+0xf0/0xf0
[50319.601598] [] ? arc_read+0xc5/0xa80 [zfs]
[50319.601747] [] ? zio_taskq_dispatch+0x8b/0xa0 [zfs]
[50319.601913] [] ? dsl_scan_scrub_cb+0x485/0x530 [zfs]
[50319.602067] [] ? arc_buf_remove_ref+0x130/0x130 [zfs]
[50319.602239] [] ? dsl_scan_visitbp+0x22e/0xc40 [zfs]
[50319.602406] [] ? dsl_scan_visitbp+0x681/0xc40 [zfs]
[50319.602574] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[50319.602742] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[50319.602909] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[50319.603077] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[50319.603244] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[50319.603411] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[50319.603629] [] ? arc_read+0x478/0xa80 [zfs]
[50319.603802] [] ? dsl_scan_visitbp+0x890/0xc40 [zfs]
[50319.603959] [] ? dbuf_dirty+0x43c/0x940 [zfs]
[50319.604117] [] ? dsl_scan_visitds+0xd5/0x4a0 [zfs]
[50319.604285] [] ? dsl_scan_sync+0x276/0xbc0 [zfs]
[50319.604415] [] ? spl_kmem_cache_free+0x16f/0x1d0 [spl]
[50319.604591] [] ? spa_sync+0x3c3/0xb30 [zfs]
[50319.604752] [] ? txg_sync_thread+0x373/0x620 [zfs]
[50319.605028] [] ? txg_quiesce_thread+0x360/0x360 [zfs]
[50319.605178] [] ? thread_generic_wrapper+0x6b/0x80 [spl]
[50319.605318] [] ? __thread_exit+0x20/0x20 [spl]
[50319.605436] [] ? kthread+0xbd/0xe0
[50319.605534] [] ? kthread_create_on_node+0x180/0x180
[50319.605657] [] ? ret_from_fork+0x58/0x90
[50319.605765] [] ? kthread_create_on_node+0x180/0x180
[50439.697543] INFO: task txg_sync:742 blocked for more than 120 seconds.
[50439.697693] Tainted: P O 3.16.0-4-amd64 #1
[50439.697796] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[50439.697939] txg_sync D ffff880231e859c8 0 742 2 0x00000000
[50439.701593] ffff880231e85570 0000000000000046 0000000000012f00 ffff88023002bfd8
[50439.705290] 0000000000012f00 ffff880231e85570 ffffffffa030da30 ffff88023002b388
[50439.708980] ffffffffa030da34 ffff880231e85570 00000000ffffffff ffffffffa030da38
[50439.712586] Call Trace:
[50439.716069] [] ? schedule_preempt_disabled+0x25/0x70
[50439.719587] [] ? __mutex_lock_slowpath+0xd3/0x1c0
[50439.723060] [] ? mutex_lock+0x1b/0x2a
[50439.726476] [] ? cv_wait_common+0x98/0x100 [spl]
[50439.729857] [] ? prepare_to_wait_event+0xf0/0xf0
[50439.733259] [] ? arc_read+0xc5/0xa80 [zfs]
[50439.736645] [] ? zio_taskq_dispatch+0x8b/0xa0 [zfs]
[50439.740017] [] ? dsl_scan_scrub_cb+0x485/0x530 [zfs]
[50439.743350] [] ? arc_buf_remove_ref+0x130/0x130 [zfs]
[50439.746707] [] ? dsl_scan_visitbp+0x22e/0xc40 [zfs]
[50439.750071] [] ? dsl_scan_visitbp+0x681/0xc40 [zfs]
[50439.753392] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[50439.756648] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[50439.759863] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[50439.763031] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[50439.766142] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[50439.769171] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[50439.772135] [] ? arc_read+0x478/0xa80 [zfs]
[50439.775161] [] ? dsl_scan_visitbp+0x890/0xc40 [zfs]
[50439.778242] [] ? dbuf_dirty+0x43c/0x940 [zfs]
[50439.781351] [] ? dsl_scan_visitds+0xd5/0x4a0 [zfs]
[50439.784544] [] ? dsl_scan_sync+0x276/0xbc0 [zfs]
[50439.787753] [] ? spl_kmem_cache_free+0x16f/0x1d0 [spl]
[50439.791069] [] ? spa_sync+0x3c3/0xb30 [zfs]
[50439.794425] [] ? txg_sync_thread+0x373/0x620 [zfs]
[50439.797790] [] ? txg_quiesce_thread+0x360/0x360 [zfs]
[50439.801099] [] ? thread_generic_wrapper+0x6b/0x80 [spl]
[50439.804429] [] ? __thread_exit+0x20/0x20 [spl]
[50439.807741] [] ? kthread+0xbd/0xe0
[50439.811057] [] ? kthread_create_on_node+0x180/0x180
[50439.814380] [] ? ret_from_fork+0x58/0x90
[50439.817695] [] ? kthread_create_on_node+0x180/0x180
[50559.911391] INFO: task txg_sync:742 blocked for more than 120 seconds.
[50559.914791] Tainted: P O 3.16.0-4-amd64 #1
[50559.918165] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[50559.921635] txg_sync D ffff880231e859c8 0 742 2 0x00000000
[50559.925143] ffff880231e85570 0000000000000046 0000000000012f00 ffff88023002bfd8
[50559.928719] 0000000000012f00 ffff880231e85570 ffffffffa030da30 ffff88023002b388
[50559.932299] ffffffffa030da34 ffff880231e85570 00000000ffffffff ffffffffa030da38
[50559.935894] Call Trace:
[50559.939443] [] ? schedule_preempt_disabled+0x25/0x70
[50559.943004] [] ? __mutex_lock_slowpath+0xd3/0x1c0
[50559.946548] [] ? mutex_lock+0x1b/0x2a
[50559.950033] [] ? cv_wait_common+0x98/0x100 [spl]
[50559.953471] [] ? prepare_to_wait_event+0xf0/0xf0
[50559.956959] [] ? arc_read+0xc5/0xa80 [zfs]
[50559.960410] [] ? zio_taskq_dispatch+0x8b/0xa0 [zfs]
[50559.963846] [] ? dsl_scan_scrub_cb+0x485/0x530 [zfs]
[50559.967245] [] ? arc_buf_remove_ref+0x130/0x130 [zfs]
[50559.970643] [] ? dsl_scan_visitbp+0x22e/0xc40 [zfs]
[50559.974062] [] ? dsl_scan_visitbp+0x681/0xc40 [zfs]
[50559.977442] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[50559.980777] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[50559.984051] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[50559.987263] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[50559.990399] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[50559.993484] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[50559.996500] [] ? arc_read+0x478/0xa80 [zfs]
[50559.999567] [] ? dsl_scan_visitbp+0x890/0xc40 [zfs]
[50560.002659] [] ? dbuf_dirty+0x43c/0x940 [zfs]
[50560.005826] [] ? dsl_scan_visitds+0xd5/0x4a0 [zfs]
[50560.009069] [] ? dsl_scan_sync+0x276/0xbc0 [zfs]
[50560.012316] [] ? spl_kmem_cache_free+0x16f/0x1d0 [spl]
[50560.015670] [] ? spa_sync+0x3c3/0xb30 [zfs]
[50560.019042] [] ? txg_sync_thread+0x373/0x620 [zfs]
[50560.022433] [] ? txg_quiesce_thread+0x360/0x360 [zfs]
[50560.025794] [] ? thread_generic_wrapper+0x6b/0x80 [spl]
[50560.029167] [] ? __thread_exit+0x20/0x20 [spl]
[50560.032517] [] ? kthread+0xbd/0xe0
[50560.035855] [] ? kthread_create_on_node+0x180/0x180
[50560.039181] [] ? ret_from_fork+0x58/0x90
[50560.042509] [] ? kthread_create_on_node+0x180/0x180
[50680.137245] INFO: task txg_sync:742 blocked for more than 120 seconds.
[50680.140671] Tainted: P O 3.16.0-4-amd64 #1
[50680.144064] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[50680.147557] txg_sync D ffff880231e859c8 0 742 2 0x00000000
[50680.151100] ffff880231e85570 0000000000000046 0000000000012f00 ffff88023002bfd8
[50680.154705] 0000000000012f00 ffff880231e85570 ffffffffa030da30 ffff88023002b388
[50680.158302] ffffffffa030da34 ffff880231e85570 00000000ffffffff ffffffffa030da38
[50680.161911] Call Trace:
[50680.165471] [] ? schedule_preempt_disabled+0x25/0x70
[50680.169073] [] ? __mutex_lock_slowpath+0xd3/0x1c0
[50680.172608] [] ? mutex_lock+0x1b/0x2a
[50680.176092] [] ? cv_wait_common+0x98/0x100 [spl]
[50680.179546] [] ? prepare_to_wait_event+0xf0/0xf0
[50680.183040] [] ? arc_read+0xc5/0xa80 [zfs]
[50680.186505] [] ? zio_taskq_dispatch+0x8b/0xa0 [zfs]
[50680.189955] [] ? dsl_scan_scrub_cb+0x485/0x530 [zfs]
[50680.193360] [] ? arc_buf_remove_ref+0x130/0x130 [zfs]
[50680.196763] [] ? dsl_scan_visitbp+0x22e/0xc40 [zfs]
[50680.200188] [] ? dsl_scan_visitbp+0x681/0xc40 [zfs]
[50680.203570] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[50680.206909] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[50680.210192] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[50680.213421] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[50680.216563] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[50680.219665] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[50680.222691] [] ? arc_read+0x478/0xa80 [zfs]
[50680.225771] [] ? dsl_scan_visitbp+0x890/0xc40 [zfs]
[50680.228875] [] ? dbuf_dirty+0x43c/0x940 [zfs]
[50680.232057] [] ? dsl_scan_visitds+0xd5/0x4a0 [zfs]
[50680.235299] [] ? dsl_scan_sync+0x276/0xbc0 [zfs]
[50680.238546] [] ? spl_kmem_cache_free+0x16f/0x1d0 [spl]
[50680.241905] [] ? spa_sync+0x3c3/0xb30 [zfs]
[50680.245301] [] ? txg_sync_thread+0x373/0x620 [zfs]
[50680.248694] [] ? txg_quiesce_thread+0x360/0x360 [zfs]
[50680.252050] [] ? thread_generic_wrapper+0x6b/0x80 [spl]
[50680.255420] [] ? __thread_exit+0x20/0x20 [spl]
[50680.258770] [] ? kthread+0xbd/0xe0
[50680.262101] [] ? kthread_create_on_node+0x180/0x180
[50680.265449] [] ? ret_from_fork+0x58/0x90
[50680.268754] [] ? kthread_create_on_node+0x180/0x180
[50800.363074] INFO: task z_wr_int_2:334 blocked for more than 120 seconds.
[50800.366498] Tainted: P O 3.16.0-4-amd64 #1
[50800.369908] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[50800.373400] z_wr_int_2 D ffff8800360e6768 0 334 2 0x00000000
[50800.376916] ffff8800360e6310 0000000000000046 0000000000012f00 ffff880231e5ffd8
[50800.380503] 0000000000012f00 ffff8800360e6310 ffffffffa030da30 ffff880231e5fcb8
[50800.384092] ffffffffa030da34 ffff8800360e6310 00000000ffffffff ffffffffa030da38
[50800.387691] Call Trace:
[50800.391244] [] ? schedule_preempt_disabled+0x25/0x70
[50800.394813] [] ? __mutex_lock_slowpath+0xd3/0x1c0
[50800.398350] [] ? mutex_lock+0x1b/0x2a
[50800.401875] [] ? buf_hash_insert+0x77/0x190 [zfs]
[50800.405357] [] ? arc_write_done+0x91/0x3d0 [zfs]
[50800.408833] [] ? zio_done+0x2c9/0xe10 [zfs]
[50800.412241] [] ? mutex_lock+0xe/0x2a
[50800.415645] [] ? spa_config_exit+0x7c/0xc0 [zfs]
[50800.419034] [] ? zio_done+0x869/0xe10 [zfs]
[50800.422385] [] ? zio_execute+0xbf/0x170 [zfs]
[50800.425716] [] ? taskq_thread+0x200/0x450 [spl]
[50800.429038] [] ? wake_up_state+0x10/0x10
[50800.432354] [] ? taskq_cancel_id+0x110/0x110 [spl]
[50800.435683] [] ? kthread+0xbd/0xe0
[50800.438982] [] ? kthread_create_on_node+0x180/0x180
[50800.442280] [] ? ret_from_fork+0x58/0x90
[50800.445563] [] ? kthread_create_on_node+0x180/0x180
[50800.448870] INFO: task txg_sync:367 blocked for more than 120 seconds.
[50800.452236] Tainted: P O 3.16.0-4-amd64 #1
[50800.455653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[50800.459218] txg_sync D ffff880231e1e828 0 367 2 0x00000000
[50800.462860] ffff880231e1e3d0 0000000000000046 0000000000012f00 ffff880035ec7fd8
[50800.466644] 0000000000012f00 ffff880231e1e3d0 ffff88023ec137b0 ffff88010ea62850
[50800.470485] ffff88010ea62890 0000000000000001 ffff88023351d800 0000000000000000
[50800.474426] Call Trace:
[50800.478297] [] ? io_schedule+0x99/0x120
[50800.482259] [] ? cv_wait_common+0x90/0x100 [spl]
[50800.486224] [] ? prepare_to_wait_event+0xf0/0xf0
[50800.490220] [] ? zio_wait+0x10b/0x1e0 [zfs]
[50800.494237] [] ? dsl_pool_sync+0xaa/0x460 [zfs]
[50800.498258] [] ? spa_sync+0x366/0xb30 [zfs]
[50800.502249] [] ? txg_sync_thread+0x373/0x620 [zfs]
[50800.506265] [] ? txg_quiesce_thread+0x360/0x360 [zfs]
[50800.510250] [] ? thread_generic_wrapper+0x6b/0x80 [spl]
[50800.514235] [] ? __thread_exit+0x20/0x20 [spl]
[50800.518189] [] ? kthread+0xbd/0xe0
[50800.522121] [] ? kthread_create_on_node+0x180/0x180
[50800.526076] [] ? ret_from_fork+0x58/0x90
[50800.530017] [] ? kthread_create_on_node+0x180/0x180
[50800.533992] INFO: task txg_sync:742 blocked for more than 120 seconds.
[50800.537936] Tainted: P O 3.16.0-4-amd64 #1
[50800.541860] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[50800.545825] txg_sync D ffff880231e859c8 0 742 2 0x00000000
[50800.549780] ffff880231e85570 0000000000000046 0000000000012f00 ffff88023002bfd8
[50800.553751] 0000000000012f00 ffff880231e85570 ffffffffa030da30 ffff88023002b388
[50800.557723] ffffffffa030da34 ffff880231e85570 00000000ffffffff ffffffffa030da38
[50800.561693] Call Trace:
[50800.565591] [] ? schedule_preempt_disabled+0x25/0x70
[50800.569545] [] ? __mutex_lock_slowpath+0xd3/0x1c0
[50800.573505] [] ? mutex_lock+0x1b/0x2a
[50800.577444] [] ? cv_wait_common+0x98/0x100 [spl]
[50800.581387] [] ? prepare_to_wait_event+0xf0/0xf0
[50800.585366] [] ? arc_read+0xc5/0xa80 [zfs]
[50800.589351] [] ? zio_taskq_dispatch+0x8b/0xa0 [zfs]
[50800.593364] [] ? dsl_scan_scrub_cb+0x485/0x530 [zfs]
[50800.597359] [] ? arc_buf_remove_ref+0x130/0x130 [zfs]
[50800.601389] [] ? dsl_scan_visitbp+0x22e/0xc40 [zfs]
[50800.605431] [] ? dsl_scan_visitbp+0x681/0xc40 [zfs]
[50800.609429] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[50800.613386] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[50800.617289] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[50800.621137] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[50800.624930] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[50800.628669] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[50800.632334] [] ? arc_read+0x478/0xa80 [zfs]
[50800.635970] [] ? dsl_scan_visitbp+0x890/0xc40 [zfs]
[50800.639582] [] ? dbuf_dirty+0x43c/0x940 [zfs]
[50800.643149] [] ? dsl_scan_visitds+0xd5/0x4a0 [zfs]
[50800.646665] [] ? dsl_scan_sync+0x276/0xbc0 [zfs]
[50800.650091] [] ? spl_kmem_cache_free+0x16f/0x1d0 [spl]
[50800.653557] [] ? spa_sync+0x3c3/0xb30 [zfs]
[50800.656982] [] ? txg_sync_thread+0x373/0x620 [zfs]
[50800.660396] [] ? txg_quiesce_thread+0x360/0x360 [zfs]
[50800.663759] [] ? thread_generic_wrapper+0x6b/0x80 [spl]
[50800.667227] [] ? __thread_exit+0x20/0x20 [spl]
[50800.670610] [] ? kthread+0xbd/0xe0
[50800.673949] [] ? kthread_create_on_node+0x180/0x180
[50800.677295] [] ? ret_from_fork+0x58/0x90
[50800.680623] [] ? kthread_create_on_node+0x180/0x180
[50920.777077] INFO: task z_wr_int_2:334 blocked for more than 120 seconds.
[50920.780496] Tainted: P O 3.16.0-4-amd64 #1
[50920.783882] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[50920.787376] z_wr_int_2 D ffff8800360e6768 0 334 2 0x00000000
[50920.790900] ffff8800360e6310 0000000000000046 0000000000012f00 ffff880231e5ffd8
[50920.794499] 0000000000012f00 ffff8800360e6310 ffffffffa030da30 ffff880231e5fcb8
[50920.798089] ffffffffa030da34 ffff8800360e6310 00000000ffffffff ffffffffa030da38
[50920.801685] Call Trace:
[50920.805251] [] ? schedule_preempt_disabled+0x25/0x70
[50920.808827] [] ? __mutex_lock_slowpath+0xd3/0x1c0
[50920.812363] [] ? mutex_lock+0x1b/0x2a
[50920.815903] [] ? buf_hash_insert+0x77/0x190 [zfs]
[50920.819386] [] ? arc_write_done+0x91/0x3d0 [zfs]
[50920.822866] [] ? zio_done+0x2c9/0xe10 [zfs]
[50920.826275] [] ? mutex_lock+0xe/0x2a
[50920.829679] [] ? spa_config_exit+0x7c/0xc0 [zfs]
[50920.833077] [] ? zio_done+0x869/0xe10 [zfs]
[50920.836431] [] ? zio_execute+0xbf/0x170 [zfs]
[50920.839770] [] ? taskq_thread+0x200/0x450 [spl]
[50920.843100] [] ? wake_up_state+0x10/0x10
[50920.846416] [] ? taskq_cancel_id+0x110/0x110 [spl]
[50920.849744] [] ? kthread+0xbd/0xe0
[50920.853045] [] ? kthread_create_on_node+0x180/0x180
[50920.856327] [] ? ret_from_fork+0x58/0x90
[50920.859612] [] ? kthread_create_on_node+0x180/0x180
[50920.862927] INFO: task txg_sync:367 blocked for more than 120 seconds.
[50920.866303] Tainted: P O 3.16.0-4-amd64 #1
[50920.869733] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[50920.873308] txg_sync D ffff880231e1e828 0 367 2 0x00000000
[50920.876994] ffff880231e1e3d0 0000000000000046 0000000000012f00 ffff880035ec7fd8
[50920.880752] 0000000000012f00 ffff880231e1e3d0 ffff88023ec137b0 ffff88010ea62850
[50920.884612] ffff88010ea62890 0000000000000001 ffff88023351d800 0000000000000000
[50920.888537] Call Trace:
[50920.892415] [] ? io_schedule+0x99/0x120
[50920.896375] [] ? cv_wait_common+0x90/0x100 [spl]
[50920.900341] [] ? prepare_to_wait_event+0xf0/0xf0
[50920.904357] [] ? zio_wait+0x10b/0x1e0 [zfs]
[50920.908425] [] ? dsl_pool_sync+0xaa/0x460 [zfs]
[50920.912471] [] ? spa_sync+0x366/0xb30 [zfs]
[50920.916492] [] ? txg_sync_thread+0x373/0x620 [zfs]
[50920.920501] [] ? txg_quiesce_thread+0x360/0x360 [zfs]
[50920.924492] [] ? thread_generic_wrapper+0x6b/0x80 [spl]
[50920.928479] [] ? __thread_exit+0x20/0x20 [spl]
[50920.932433] [] ? kthread+0xbd/0xe0
[50920.936367] [] ? kthread_create_on_node+0x180/0x180
[50920.940326] [] ? ret_from_fork+0x58/0x90
[50920.944264] [] ? kthread_create_on_node+0x180/0x180
[50920.948235] INFO: task txg_sync:742 blocked for more than 120 seconds.
[50920.952179] Tainted: P O 3.16.0-4-amd64 #1
[50920.956094] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[50920.960052] txg_sync D ffff880231e859c8 0 742 2 0x00000000
[50920.964007] ffff880231e85570 0000000000000046 0000000000012f00 ffff88023002bfd8
[50920.967972] 0000000000012f00 ffff880231e85570 ffffffffa030da30 ffff88023002b388
[50920.971941] ffffffffa030da34 ffff880231e85570 00000000ffffffff ffffffffa030da38
[50920.975918] Call Trace:
[50920.979815] [] ? schedule_preempt_disabled+0x25/0x70
[50920.983760] [] ? __mutex_lock_slowpath+0xd3/0x1c0
[50920.987712] [] ? mutex_lock+0x1b/0x2a
[50920.991645] [] ? cv_wait_common+0x98/0x100 [spl]
[50920.995595] [] ? prepare_to_wait_event+0xf0/0xf0
[50920.999564] [] ? arc_read+0xc5/0xa80 [zfs]
[50921.003550] [] ? zio_taskq_dispatch+0x8b/0xa0 [zfs]
[50921.007567] [] ? dsl_scan_scrub_cb+0x485/0x530 [zfs]
[50921.011572] [] ? arc_buf_remove_ref+0x130/0x130 [zfs]
[50921.015609] [] ? dsl_scan_visitbp+0x22e/0xc40 [zfs]
[50921.019648] [] ? dsl_scan_visitbp+0x681/0xc40 [zfs]
[50921.023650] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[50921.027611] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[50921.031515] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[50921.035360] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[50921.039162] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[50921.042901] [] ? dsl_scan_visitbp+0x375/0xc40 [zfs]
[50921.046567] [] ? arc_read+0x478/0xa80 [zfs]
[50921.050215] [] ? dsl_scan_visitbp+0x890/0xc40 [zfs]
[50921.053832] [] ? dbuf_dirty+0x43c/0x940 [zfs]
[50921.057405] [] ? dsl_scan_visitds+0xd5/0x4a0 [zfs]
[50921.060917] [] ? dsl_scan_sync+0x276/0xbc0 [zfs]
[50921.064357] [] ? spl_kmem_cache_free+0x16f/0x1d0 [spl]
[50921.067820] [] ? spa_sync+0x3c3/0xb30 [zfs]
[50921.071253] [] ? txg_sync_thread+0x373/0x620 [zfs]
[50921.074654] [] ? txg_quiesce_thread+0x360/0x360 [zfs]
[50921.077997] [] ? thread_generic_wrapper+0x6b/0x80 [spl]
[50921.081355] [] ? __thread_exit+0x20/0x20 [spl]
[50921.084668] [] ? kthread+0xbd/0xe0
[50921.087971] [] ? kthread_create_on_node+0x180/0x180
[50921.091296] [] ? ret_from_fork+0x58/0x90
[50921.094622] [] ? kthread_create_on_node+0x180/0x180

@sluitz
Copy link
Author

sluitz commented Jul 23, 2015

This morning txg_sync locked up again about 80% into the the weekly scrub. All ZFS I/O was completely stalled with no signs of recovery, so I ended up hard resetting the machine.

Jul 23 08:05:29 server kernel: INFO: task txg_sync:2446 blocked for more than 120 seconds.
Jul 23 08:05:29 server kernel: Tainted: P --------------- 2.6.32-504.30.3.el6.x86_64 #1
Jul 23 08:05:29 server kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 23 08:05:29 server kernel: txg_sync D 0000000000000005 0 2446 2 0x00000000
Jul 23 08:05:29 server kernel: ffff880801b00f50 0000000000000046 ffff880801b00f00 ffffffff8100969d
Jul 23 08:05:29 server kernel: 0000000000000000 0000000000000000 0000000000000000 ffff8805c6587150
Jul 23 08:05:29 server kernel: 0000000000000000 0000000000000000 ffff88081e8d9068 ffff880801b01fd8
Jul 23 08:05:29 server kernel: Call Trace:
Jul 23 08:05:29 server kernel: [] ? __switch_to+0x7d/0x320
Jul 23 08:05:29 server kernel: [] ? thread_return+0x4e/0x7d0
Jul 23 08:05:29 server kernel: [] __mutex_lock_slowpath+0x96/0x210
Jul 23 08:05:29 server kernel: [] mutex_lock+0x2b/0x50
Jul 23 08:05:29 server kernel: [] cv_wait_common+0xb7/0x130 [spl]
Jul 23 08:05:29 server kernel: [] ? autoremove_wake_function+0x0/0x40
Jul 23 08:05:29 server kernel: [] ? buf_hash_find+0x9f/0x180 [zfs]
Jul 23 08:05:29 server kernel: [] __cv_wait+0x15/0x20 [spl]
Jul 23 08:05:29 server kernel: [] arc_read+0xb5/0xa70 [zfs]
Jul 23 08:05:29 server kernel: [] ? arc_getbuf_func+0x0/0x80 [zfs]
Jul 23 08:05:29 server kernel: [] dsl_scan_visitbp+0x20b/0xb60 [zfs]
Jul 23 08:05:29 server kernel: [] dsl_scan_visitbp+0x324/0xb60 [zfs]
Jul 23 08:05:29 server kernel: [] dsl_scan_visitbp+0x324/0xb60 [zfs]
Jul 23 08:05:29 server kernel: [] dsl_scan_visitbp+0x6bb/0xb60 [zfs]
Jul 23 08:05:29 server kernel: [] dsl_scan_visitbp+0x324/0xb60 [zfs]
Jul 23 08:05:29 server kernel: [] dsl_scan_visitbp+0x324/0xb60 [zfs]
Jul 23 08:05:29 server kernel: [] dsl_scan_visitbp+0x324/0xb60 [zfs]
Jul 23 08:05:29 server kernel: [] dsl_scan_visitbp+0x324/0xb60 [zfs]
Jul 23 08:05:29 server kernel: [] dsl_scan_visitbp+0x324/0xb60 [zfs]
Jul 23 08:05:29 server kernel: [] dsl_scan_visitbp+0x324/0xb60 [zfs]
Jul 23 08:05:29 server kernel: [] ? arc_read+0x3e1/0xa70 [zfs]
Jul 23 08:05:29 server kernel: [] dsl_scan_visitbp+0x83e/0xb60 [zfs]
Jul 23 08:05:29 server kernel: [] dsl_scan_visitds+0xe2/0x4c0 [zfs]
Jul 23 08:05:29 server kernel: [] dsl_scan_sync+0x28f/0xbc0 [zfs]
Jul 23 08:05:29 server kernel: [] spa_sync+0x3c7/0xb10 [zfs]
Jul 23 08:05:29 server kernel: [] ? __wake_up_common+0x59/0x90
Jul 23 08:05:29 server kernel: [] ? __wake_up+0x53/0x70
Jul 23 08:05:29 server kernel: [] ? read_tsc+0x9/0x20
Jul 23 08:05:29 server kernel: [] txg_sync_thread+0x389/0x620 [zfs]
Jul 23 08:05:29 server kernel: [] ? account_entity_enqueue+0x7e/0x90
Jul 23 08:05:29 server kernel: [] ? txg_sync_thread+0x0/0x620 [zfs]
Jul 23 08:05:29 server kernel: [] ? txg_sync_thread+0x0/0x620 [zfs]
Jul 23 08:05:29 server kernel: [] thread_generic_wrapper+0x68/0x80 [spl]
Jul 23 08:05:29 server kernel: [] ? thread_generic_wrapper+0x0/0x80 [spl]
Jul 23 08:05:29 server kernel: [] kthread+0x9e/0xc0
Jul 23 08:05:29 server kernel: [] child_rip+0xa/0x20
Jul 23 08:05:29 server kernel: [] ? kthread+0x0/0xc0
Jul 23 08:05:29 server kernel: [] ? child_rip+0x0/0x20

@kernelOfTruth
Copy link
Contributor

Looks like a recent regression introduced by some new commits to me - it appears I encountered something similar:

#3628 [[ABD2 3441, sha256 opt 2351 stack][arc_adapt] ZFS stuck after running scrub for some time]

or it's a problem (not necessarily bad) that usually gets triggered not very easily

investigating ...

@kernelOfTruth
Copy link
Contributor

scrub equals heavy i/o load so referencing #3148 (issue, random call trace on heavy load #3148) here

Posting the link to reference list of related issues, pulls & commits here, too:

#3148 (comment)

@gmelikov
Copy link
Member

Close as stale.

If it's actual - feel free to reopen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Performance Performance improvement or performance problem
Projects
None yet
Development

No branches or pull requests

6 participants