100% system load with 0.6.5.6 - zfs_znode_hold_enter suspected #4521

Closed
odoucet opened this issue Apr 13, 2016 · 2 comments
Labels
Type: Performance (performance improvement or performance problem)

Comments

odoucet commented Apr 13, 2016

Hello,

We encountered 100% CPU usage, all of it spent in system (kernel) time.
The system runs kernel 3.10.101 with SPL/ZFS 0.6.5.6.
It serves a mix of iSCSI targets (with LIO) and NFS exports. The iSCSI targets kept running just fine, while the load climbed to ~500 (exactly the number of nfsd processes we configured).

The zpool consists of 23 mirrors, plus one SSD log mirror and two SSD cache devices (2x800G).
Clients reading/writing the NFS exports were very slow but still responsive.
The iSCSI clients were all OK.

Adding 1G to zfs_arc_meta_limit did the trick instantly; the system was responsive again within a second.

Details:
zfs_arc_meta_limit was set to 0 (the default), so I grabbed the current value from arcstat, added 1 GB, and then ran:

echo 81111600640 > /sys/module/zfs/parameters/zfs_arc_meta_limit
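
For reference, a sketch of how that value can be derived directly from the arcstats kstat (standard ZFS-on-Linux paths assumed; 1073741824 is simply 1 GiB in bytes):

# read the current ARC metadata usage, add 1 GiB, and set it as the new limit
new_limit=$(awk '/^arc_meta_used/ {print $3 + 1073741824}' /proc/spl/kstat/zfs/arcstats)
echo "$new_limit" > /sys/module/zfs/parameters/zfs_arc_meta_limit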

The stack traces of all nfsd processes were identical:

Name:   nfsd
State:  D (disk sleep)
Tgid:   11113
Pid:    11113
[<ffffffffa0386eca>] zfs_znode_hold_enter+0x12a/0x170 [zfs]
[<ffffffffa03892cd>] zfs_zget+0x12d/0x240 [zfs]
[<ffffffffa0379f42>] zfs_vget+0x132/0x3a0 [zfs]
[<ffffffffa0399302>] zpl_fh_to_dentry+0x72/0xb0 [zfs]
[<ffffffff8121eebf>] exportfs_decode_fh+0x6f/0x2c0
[<ffffffffa08a3e74>] nfsd_set_fh_dentry+0x214/0x400 [nfsd]
[<ffffffffa08a421e>] fh_verify+0x1be/0x230 [nfsd]
[<ffffffffa08ae86c>] nfsd3_proc_getattr+0x6c/0xf0 [nfsd]
[<ffffffffa08a0295>] nfsd_dispatch+0xe5/0x230 [nfsd]
[<ffffffffa0839054>] svc_process_common+0x344/0x640 [sunrpc]
[<ffffffffa083969d>] svc_process+0x10d/0x160 [sunrpc]
[<ffffffffa08a0a2f>] nfsd+0xbf/0x130 [nfsd]
[<ffffffff8108354e>] kthread+0xce/0xe0
[<ffffffff815fd308>] ret_from_fork+0x58/0x90
[<ffffffffffffffff>] 0xffffffffffffffff
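
The traces above look like /proc/&lt;pid&gt;/status and /proc/&lt;pid&gt;/stack output; assuming that is how they were gathered, a loop along these lines collects them for every nfsd thread (requires root and a kernel that exposes /proc/&lt;pid&gt;/stack):

# dump name/state/ids plus the kernel stack for each nfsd thread
for pid in $(pgrep -x nfsd); do
    grep -E '^(Name|State|Tgid|Pid):' /proc/$pid/status
    cat /proc/$pid/stack
done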

The arc_reclaim thread, if useful:

Name:   arc_reclaim
State:  R (running)
Tgid:   5813
Pid:    5813
[<ffffffff8109393a>] __cond_resched+0x2a/0x40
[<ffffffffa02e065e>] arc_evict_state_impl+0x22e/0x2d0 [zfs]
[<ffffffffa02e0827>] arc_evict_state+0x127/0x1f0 [zfs]
[<ffffffffa02e2763>] arc_adjust_impl.clone.0+0x33/0x40 [zfs]
[<ffffffffa02e2935>] arc_adjust_meta_balanced+0x1c5/0x1e0 [zfs]
[<ffffffffa02e2b55>] arc_adjust+0x205/0x2b0 [zfs]
[<ffffffffa02e2fc8>] arc_reclaim_thread+0xb8/0x230 [zfs]
[<ffffffffa02823a8>] thread_generic_wrapper+0x78/0x90 [spl]
[<ffffffff8108354e>] kthread+0xce/0xe0
[<ffffffff815fd308>] ret_from_fork+0x58/0x90
[<ffffffffffffffff>] 0xffffffffffffffff
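
A thread looping in arc_adjust_meta_balanced like this is consistent with the ARC being pinned at its metadata limit; a quick way to check for that condition (kstat path per ZFS on Linux) is to compare arc_meta_used against arc_meta_limit:

# if arc_meta_used sits at or above arc_meta_limit, metadata eviction is struggling
grep -E '^arc_meta_(used|limit)' /proc/spl/kstat/zfs/arcstats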

l2arc_feed:

Name:   l2arc_feed
State:  S (sleeping)
Tgid:   5815
Pid:    5815
[<ffffffffa0286efa>] __cv_timedwait_common+0xba/0x160 [spl]
[<ffffffffa0286fb3>] __cv_timedwait_sig+0x13/0x20 [spl]
[<ffffffffa02de78f>] l2arc_feed_thread+0x6f/0x2f0 [zfs]
[<ffffffffa02823a8>] thread_generic_wrapper+0x78/0x90 [spl]
[<ffffffff8108354e>] kthread+0xce/0xe0
[<ffffffff815fd308>] ret_from_fork+0x58/0x90
[<ffffffffffffffff>] 0xffffffffffffffff

All the information gathered on the system is here:
https://gist.github.com/odoucet/2cfb7a28cd58b47f9000b3140e1a776c

ARC usage was really weird (see the attached ARC usage graph).

I wonder whether this is linked to the changes from #4106 ...

odoucet commented Apr 14, 2016

I forgot to mention one important thing: the zpool was resilvering when this happened (one drive had failed).
Resilvering started at 9 am.

behlendorf added the Type: Performance label on Nov 8, 2016
behlendorf (Contributor) commented

Closing. Commit 25458cb, which limits the amount of dnode metadata in the ARC, was designed to address problems like this with metadata-heavy workloads.
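
For anyone hitting this on a release that includes that commit (0.7.0 and later), the dnode share of ARC metadata can be inspected and capped directly; the kstat and parameter names below come from that change and are worth double-checking against the running version:

# how much of the ARC is currently held by dnodes, relative to the metadata limit
grep -E '^(dnode_size|arc_meta_used|arc_meta_limit)' /proc/spl/kstat/zfs/arcstats
# dnode cap, expressed as a percentage of the ARC metadata limit
cat /sys/module/zfs/parameters/zfs_arc_dnode_limit_percent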
