txg_sync blocked for more than 120s #3613
Any chance you could upgrade to 0.6.4.2? There were significant improvements that could help in your case. 0.6.4.2 should be available from at least this mirror: http://yum.tamu.edu/zfsonlinux/epel/6/x86_64/
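For anyone wanting to try that, here is a minimal sketch of pulling 0.6.4.2 from the mirror above on CentOS 6. The repo file name, repo id, and gpgcheck=0 are illustrative assumptions, not an official repo definition:

  # /etc/yum.repos.d/zfs-mirror.repo (hypothetical repo file pointing at the mirror above)
  [zfs-mirror]
  name=ZFS on Linux 0.6.4.2 (TAMU mirror)
  baseurl=http://yum.tamu.edu/zfsonlinux/epel/6/x86_64/
  enabled=1
  gpgcheck=0

  # then refresh metadata and upgrade the packages
  yum clean metadata
  yum update spl spl-dkms zfs zfs-dkms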
Hi, I've been seeing quite a similar issue for a few weeks here, but I'm running ZFS nightly on Debian (updated yesterday evening to 0.6.4-21-53b1d9). It always seems to happen in the middle of a scrub; this pool is a backup pool, so there is no heavy I/O on it except during scrubs. I/O is completely stalled after these messages (I tried to let the system recover by itself for 6+ hours with no effect). Feel free to ask for any detail that may help troubleshoot the issue. Kernel log is: [34708.383167] INFO: task txg_sync:769 blocked for more than 120 seconds.
The version I'm running (git master from a few days ago) is newer than the suggested 0.6.4.2.
I too have seen some of these issues on my CentOS 6.5: Jul 17 14:46:45 kernel: INFO: task txg_sync:4521 blocked for more than 120 seconds.
@kernelOfTruth 4 vCPUs, 8 GB of RAM, no other special configuration at all.
@kernelOfTruth The setup is quite standard here: 5x3TB SATA drives in a raidz pool, 8 GB RAM, 4 GB SSD ZIL.
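For context, a pool with that layout would have been created along these lines. The pool and device names here are purely illustrative, not taken from the report:

  # 5x3TB SATA disks in a raidz vdev, plus a small SSD partition as a separate log device (ZIL)
  zpool create backup raidz sdb sdc sdd sde sdf log sdg1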
@kernelOfTruth Even with swap disabled, the scrub stalls (I tried 2 scrubs and had to reboot the server using sysrq each time). @behlendorf, at some point the system starts being unresponsive due to stalled txg_sync threads, so it's not only a performance issue. The last kernel stack trace is: [50319.599843] INFO: task txg_sync:742 blocked for more than 120 seconds.
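If the hang is scrub-related, one thing worth trying before the pool wedges completely is cancelling the scrub; once txg_sync is already stuck even that command will block, at which point sysrq is the only way out. A sketch, with an illustrative pool name and assuming sysrq needs enabling:

  zpool status backup            # check scrub progress
  zpool scrub -s backup          # stop the in-progress scrub
  # last resort once all ZFS I/O is stalled:
  echo 1 > /proc/sys/kernel/sysrq
  echo b > /proc/sysrq-trigger   # immediate reboot, same as the sysrq reset mentioned above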
This morning txg_sync locked up again about 80% into the weekly scrub. All ZFS I/O was completely stalled with no signs of recovery, so I ended up hard resetting the machine. Jul 23 08:05:29 server kernel: INFO: task txg_sync:2446 blocked for more than 120 seconds.
This looks to me like a recent regression introduced by some new commits - it appears I encountered something similar: #3628 ([ABD2 3441, sha256 opt 2351 stack][arc_adapt] ZFS stuck after running scrub for some time). Or it's a pre-existing problem (not necessarily a bad one) that just doesn't get triggered very easily. Investigating...
Closing as stale. If it's still an issue, feel free to reopen.
All ZFS reads/writes blocked/hung. It did not clear up by itself for an hour (I had to force a hard reset of the system). Possibly triggered by a snapshot create or delete; a pool scrub was running when the problem occurred.
Kernel: 2.6.32-504.30.3.el6.x86_64 (CentOS 6.x)
ZFS and SPL rpms built from git master (yesterday)
ZFS: zfs-dkms-0.6.4-164_g53b1d97.el6.noarch
SPL: spl-dkms-0.6.4-13_g37d7cd9.el6.noarch
System: Intel Atom C2750 with 32G of ECC RAM
This happened after an upgrade from git build zfs-dkms-0.6.3-155_g7b2d78a.el6.noarch.rpm / spl-dkms-0.6.3-50_g917fef2.el6.noarch.rpm which ran for many months (since November 2014) without any problems.
Steffen
Kernel log:
INFO: task txg_sync:2417 blocked for more than 120 seconds.
Tainted: P --------------- 2.6.32-504.30.3.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
txg_sync D 0000000000000001 0 2417 2 0x00000000
ffff8808123dd220 0000000000000046 0000000000000000 ffff8808123dd1e4
0000000000000001 ffff88082fc24300 00005ceb0411e34c ffff8800283d58c0
0000000000005750 0000000106110f28 ffff8808222ff068 ffff8808123ddfd8
Call Trace:
[] __mutex_lock_slowpath+0x96/0x210
[] mutex_lock+0x2b/0x50
[] cv_wait_common+0xb7/0x130 [spl]
[] ? autoremove_wake_function+0x0/0x40
[] ? buf_hash_find+0x9f/0x180 [zfs]
[] __cv_wait+0x15/0x20 [spl]
[] arc_read+0xb5/0xa70 [zfs]
[] ? read_tsc+0x9/0x20
[] ? getrawmonotonic+0x34/0xb0
[] ? arc_getbuf_func+0x0/0x80 [zfs]
[] dsl_scan_visitbp+0x509/0xb60 [zfs]
[] dsl_scan_visitbp+0x324/0xb60 [zfs]
[] dsl_scan_visitbp+0x324/0xb60 [zfs]
[] dsl_scan_visitbp+0x324/0xb60 [zfs]
[] dsl_scan_visitbp+0x324/0xb60 [zfs]
[] dsl_scan_visitbp+0x324/0xb60 [zfs]
[] dsl_scan_visitbp+0x324/0xb60 [zfs]
[] ? arc_read+0x3e1/0xa70 [zfs]
[] dsl_scan_visitbp+0x83e/0xb60 [zfs]
[] dsl_scan_visitds+0xe2/0x4c0 [zfs]
[] dsl_scan_sync+0x28f/0xbc0 [zfs]
[] spa_sync+0x3c7/0xb10 [zfs]
[] ? __wake_up_common+0x59/0x90
[] ? __wake_up+0x53/0x70
[] ? read_tsc+0x9/0x20
[] txg_sync_thread+0x389/0x620 [zfs]
[] ? account_entity_enqueue+0x7e/0x90
[] ? txg_sync_thread+0x0/0x620 [zfs]
[] ? txg_sync_thread+0x0/0x620 [zfs]
[] thread_generic_wrapper+0x68/0x80 [spl]
[] ? thread_generic_wrapper+0x0/0x80 [spl]
[] kthread+0x9e/0xc0
[] child_rip+0xa/0x20
[] ? kthread+0x0/0xc0
[] ? child_rip+0x0/0x20
INFO: task zfs:25498 blocked for more than 120 seconds.
Tainted: P --------------- 2.6.32-504.30.3.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
zfs D 0000000000000001 0 25498 25487 0x00000080
ffff8805c5c3fa28 0000000000000082 ffff8805c5c3f978 ffffffffa022dfce
ffff8805c5c3faf8 ffffffffa02532b7 ffff8805c5c3f998 ffffffff00000000
ffff88072165fb00 ffff88081478e800 ffff88081e9cc5f8 ffff8805c5c3ffd8
Call Trace:
[] ? dmu_buf_rele+0xe/0x10 [zfs]
[] ? dsl_dataset_snapshot_check+0x117/0x3a0 [zfs]
[] ? prepare_to_wait_exclusive+0x4e/0x80
[] cv_wait_common+0x11d/0x130 [spl]
[] ? autoremove_wake_function+0x0/0x40
[] __cv_wait+0x15/0x20 [spl]
[] txg_wait_synced+0x8b/0xd0 [zfs]
[] ? dsl_dataset_snapshot_check+0x0/0x3a0 [zfs]
[] dsl_sync_task+0x16a/0x250 [zfs]
[] ? dsl_dataset_snapshot_sync+0x0/0x1a0 [zfs]
[] ? dsl_dataset_snapshot_check+0x0/0x3a0 [zfs]
[] ? dsl_dataset_snapshot_sync+0x0/0x1a0 [zfs]
[] dsl_dataset_snapshot+0x139/0x2e0 [zfs]
[] ? nvlist_add_common+0x3eb/0x450 [znvpair]
[] ? __kmalloc_node+0x4d/0x60
[] ? spl_kmem_alloc_debug+0x9c/0x1e0 [spl]
[] ? nvlist_lookup_common+0x84/0xd0 [znvpair]
[] zfs_ioc_snapshot+0x249/0x290 [zfs]
[] zfsdev_ioctl+0x1cf/0x4d0 [zfs]
[] vfs_ioctl+0x22/0xa0
[] do_vfs_ioctl+0x84/0x580
[] sys_ioctl+0x81/0xa0
[] system_call_fastpath+0x16/0x1b
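The hung-task watchdog that produced the traces above only fires every 120 seconds and only prints a few tasks. When reproducing this, a fuller picture of every blocked task can be dumped on demand before resetting; these are standard kernel facilities, not ZFS-specific:

  # dump stack traces of all uninterruptible (D-state) tasks to the kernel log
  echo w > /proc/sysrq-trigger
  dmesg | tail -n 200

  # silence the 120s watchdog while debugging, as the log message itself suggests
  echo 0 > /proc/sys/kernel/hung_task_timeout_secs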