
poor lstat and rename performance - dirent cache congestion? #3829

Closed
woffs opened this issue Sep 24, 2015 · 7 comments
Labels: Type: Performance (Performance improvement or performance problem)

Comments


woffs commented Sep 24, 2015

Symptoms:

  • lstat and rename performance degrades after reading lots of large
    directories (find, rsync, backup scenario)
  • disk utilisation is not elevated, but actually lower than normal
  • perf does not show any suspicious deadlocks or spins
  • no hanging kernel threads (txg_sync); everything looks fine in that
    corner

My system:

  • linux 3.16.7-ckt11-1+deb8u4 (Debian Jessie)
  • zfs 0.6.5.1-2
  • two pools, 113 ZFS filesystems, lz4, no dedup, no L2ARC
  • NUMA system (two nodes), 96 GB RAM

After

  • setting vm.drop_caches=2 (which apparently clears the ARC), or
  • setting zfs_arc_meta_limit and zfs_arc_max to larger values

performance is restored for a short time (until some cache fills up
again). Interestingly, the ARC does not need to be near arc_meta_limit
or c_max for performance to degrade.

Setting primarycache=metadata brings no mitigation, and setting zfs_arc_meta_strategy=0 does not help either.
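
For reference, the mitigations above correspond roughly to the following commands (run as root; the ARC values are illustrative and tank/fs is a placeholder dataset name):

  # Drop dentries and inodes, which in turn evicts ARC metadata
  echo 2 > /proc/sys/vm/drop_caches

  # Raise the ARC limits at runtime (example values, in bytes)
  echo 50000000000 > /sys/module/zfs/parameters/zfs_arc_max
  echo 25000000000 > /sys/module/zfs/parameters/zfs_arc_meta_limit

  # Cache only metadata for a given filesystem
  zfs set primarycache=metadata tank/fs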

Downgrading to linux 3.2.68-1+deb7u3 + zfs 0.6.4-16-544f71-wheezy brings back very good performance.

This problem (or a similar one) must have been introduced a few commits after 544f71 and was apparently not fully resolved in 0.6.5.1.

Perhaps related to


woffs commented Sep 24, 2015

A stack trace of a Perl process that is mainly renaming lots of directory entries:

[<ffffffff8127cabf>] __blk_run_queue+0x2f/0x40
[<ffffffff81281073>] blk_queue_bio+0x323/0x360
[<ffffffff810968f0>] default_wake_function+0x0/0x10
[<ffffffffa10a8646>] __vdev_disk_physio+0x446/0x460 [zfs]
[<ffffffffa10a8af5>] vdev_disk_io_start+0x75/0x1b0 [zfs]
[<ffffffffa10e44d9>] zio_vdev_io_start+0x99/0x2e0 [zfs]
[<ffffffffa10e79cf>] zio_nowait+0xaf/0x180 [zfs]
[<ffffffffa10af31d>] vdev_raidz_io_start+0x14d/0x2c0 [zfs]
[<ffffffffa10acfb0>] vdev_raidz_child_done+0x0/0x20 [zfs]
[<ffffffffa10e44d9>] zio_vdev_io_start+0x99/0x2e0 [zfs]
[<ffffffffa10e79cf>] zio_nowait+0xaf/0x180 [zfs]
[<ffffffffa10abb90>] vdev_mirror_io_start+0xa0/0x1a0 [zfs]
[<ffffffffa10ab200>] vdev_mirror_child_done+0x0/0x20 [zfs]
[<ffffffffa10e461d>] zio_vdev_io_start+0x1dd/0x2e0 [zfs]
[<ffffffffa10e79cf>] zio_nowait+0xaf/0x180 [zfs]
[<ffffffffa10407de>] arc_read+0x5de/0xa80 [zfs]
[<ffffffffa1047eae>] dbuf_read+0x2ae/0x920 [zfs]
[<ffffffffa10510b0>] dmu_buf_hold+0x50/0x80 [zfs]
[<ffffffffa10afe9a>] zap_get_leaf_byblk+0x4a/0x290 [zfs]
[<ffffffffa10af9aa>] zap_idx_to_blk+0xda/0x150 [zfs]
[<ffffffffa10b0145>] zap_deref_leaf+0x65/0x70 [zfs]
[<ffffffffa10b0c61>] fzap_lookup+0x51/0x160 [zfs]
[<ffffffffa054e97f>] spl_kmem_alloc+0xbf/0x170 [spl]
[<ffffffffa10b56c4>] zap_lookup_norm+0x104/0x1d0 [zfs]
[<ffffffffa10b57bf>] zap_lookup+0x2f/0x40 [zfs]
[<ffffffffa10be052>] zfs_dirent_lock+0x512/0x5c0 [zfs]
[<ffffffffa10b8a99>] zfs_zaccess_aces_check+0x199/0x360 [zfs]
[<ffffffffa10be186>] zfs_dirlook+0x86/0x2d0 [zfs]
[<ffffffffa10d2714>] zfs_lookup+0x2c4/0x310 [zfs]
[<ffffffffa10edf26>] zpl_lookup+0x86/0x100 [zfs]
[<ffffffff811b0f79>] lookup_real+0x19/0x50
[<ffffffff811b180f>] __lookup_hash+0x2f/0x40
[<ffffffff811b5b00>] SYSC_renameat2+0x1f0/0x530
[<ffffffff811b4fc1>] do_unlinkat+0xd1/0x2c0
[<ffffffff811acecc>] SYSC_newlstat+0x2c/0x40
[<ffffffff8151164d>] system_call_fast_compare_end+0x10/0x15
[<ffffffffffffffff>] 0xffffffffffffffff
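
Traces like this one can typically be captured from procfs; a minimal sketch, with 12345 standing in for the process ID:

  # Dump the current kernel stack of a process (run as root)
  cat /proc/12345/stack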

behlendorf added the Type: Performance label Sep 24, 2015
behlendorf added this to the 0.7.0 milestone Sep 24, 2015
@behlendorf
Contributor

@woffs thanks for filing this. I wasn't aware things had regressed; we'll want to git bisect this to find the offending patch.
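
A bisection between the versions named in the report might look like the sketch below (the zfs-0.6.5.1 tag name is assumed from the ZFS on Linux release scheme):

  git bisect start
  git bisect bad zfs-0.6.5.1    # first known-bad release
  git bisect good 544f71        # last known-good commit from the report
  # after each step: build and load the module, run the rsync/find
  # workload, then mark the result
  git bisect good               # or: git bisect bad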

behlendorf modified the milestones: 0.6.5.3, 0.7.0 Sep 24, 2015
behlendorf added a commit to behlendorf/zfs that referenced this issue Sep 24, 2015
Commit b39c22b set the READ_SYNC and WRITE_SYNC flags for a bio
based on the ZIO_PRIORITY_* flag passed in.  This had the unnoticed
side-effect of making the vdev_disk_io_start() synchronous for
certain I/Os.

This in turn allowed vdev_disk_io_start() to re-dispatch zios,
which could result in RCU stalls when a disk was removed from the
system.  Additionally, this could negatively impact performance and
may explain the performance regressions reported in both
openzfs#3829 and openzfs#3780.

This patch resolves the issue by making the blocking behavior
dependent on a 'wait' flag being passed rather than overloading
the passed bio flags.

Finally, the WRITE_SYNC and READ_SYNC behavior is restricted to
non-rotational devices where there is no benefit to queuing to
aggregate the I/O.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#3780
Issue openzfs#3829
Issue openzfs#3652
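
Since the READ_SYNC/WRITE_SYNC hint is restricted to non-rotational devices by this patch, one can check how the block layer classifies a given vdev via sysfs; a quick sketch, with sda as a placeholder device:

  # 1 = non-rotational (SSD), 0 = rotational
  cat /sys/block/sda/queue/rotational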

woffs commented Sep 25, 2015

Note: I could lift performance in my nightly rsync-find-backup scenario to almost half of the usual level by lowering the ARC to ⅓ of RAM and spawning more parallel rsync threads (6 instead of 4). Glad I did not have to drop_caches all night. ☺

I don't know whether the small improvement in my setup comes from lowering the ARC or from the added parallelism.
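
For reference, capping the ARC at one third of this machine's 96 GB RAM could be done roughly as follows (34359738368 bytes = 32 GiB; an illustrative value, not a recommendation):

  # Limit the ARC to 32 GiB on a loaded module
  echo 34359738368 > /sys/module/zfs/parameters/zfs_arc_max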

@behlendorf
Contributor

@woffs I believe patch #3833 will address this regression, and it'll be part of the next point release. If you could verify the fix, that would be appreciated.


woffs commented Sep 25, 2015

Thanks a lot. The patched module is running. In 12 hours, after the backup cycle, we will know more about performance and stability.
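
One way to confirm which module build is actually loaded is the module's version node in sysfs (assuming the standard ZFS on Linux location):

  # Print the version string of the currently loaded zfs module
  cat /sys/module/zfs/version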

behlendorf modified the milestones: 0.6.5.3, 0.6.5.2 Sep 25, 2015
behlendorf added a commit that referenced this issue Sep 25, 2015
Commit b39c22b set the READ_SYNC and WRITE_SYNC flags for a bio
based on the ZIO_PRIORITY_* flag passed in.  This had the unnoticed
side-effect of making the vdev_disk_io_start() synchronous for
certain I/Os.

This in turn allowed vdev_disk_io_start() to re-dispatch zios,
which could result in RCU stalls when a disk was removed from the
system.  Additionally, this could negatively impact performance and
explains the performance regressions reported in both #3829 and #3780.

This patch resolves the issue by making the blocking behavior
dependent on a 'wait' flag being passed rather than overloading
the passed bio flags.

Finally, the WRITE_SYNC and READ_SYNC behavior is restricted to
non-rotational devices where there is no benefit to queuing to
aggregate the I/O.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3652
Issue #3780
Issue #3785
Issue #3817
Issue #3821
Issue #3829
Issue #3832
Issue #3870
@behlendorf
Contributor

This is expected to be resolved by 5592404, which will be cherry-picked into the 0.6.5.2 release. If that's not the case, we can reopen this issue.


woffs commented Sep 26, 2015

Hit. Performance is great. Everything is fast.

behlendorf added a commit that referenced this issue Sep 30, 2015