Reduce dbuf_find() lock contention #13405

behlendorf · 2022-05-02T22:28:36Z

Motivation and Context

Excessive lock contention was observed when exclusively using ZFS
volumes on a large memory Linux system with many cores. The majority
of the CPU time was spent in osq_lock(), optimistically spinning to
acquire the contended dbuf hash mutex.

Description

Holding a dbuf is a common operation which can become highly contended
in dbuf_find() when acquiring the dbuf hash mutex. This is particularly
true on Linux when reading/writing volumes since by default up to 32
threads from the zvol_taskq may need to take a hold of the same dbuf.
Note this issue isn't Linux specific and should be observable on other
platforms as long as there are enough processes contending for
access.

This is further aggravated by the fact that only the block id will
be unique when calculating the dbuf hash for a single volume. The
objset id, object id, and level will be the same for data blocks.

static uint64_t
dbuf_hash(void *os, uint64_t obj, uint8_t lvl, uint64_t blkid)
{
        return (cityhash4((uintptr_t)os, obj, (uint64_t)lvl, blkid));
}

This has been observed to result in a somewhat less than uniform hash
distribution and a longer than expected max hash chain depth (~20)
on a large memory system (256 GB) when heavily using volumes.
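
A rough way to explore this kind of skew offline is to hash a run of consecutive block ids while holding the other three inputs constant, then record the longest chain. The sketch below is illustrative only: `mix64()` is a stand-in mixer (the splitmix64 finalizer), not the `cityhash4()` that ZFS uses, and `demo_hash()`, `max_chain_depth()`, and the table sizes are hypothetical names and values, so its numbers will not reproduce the depths reported here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in 64-bit mixer (splitmix64 finalizer); NOT ZFS's cityhash4(). */
static uint64_t
mix64(uint64_t x)
{
    x ^= x >> 30;
    x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27;
    x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return (x);
}

/* Hypothetical analogue of dbuf_hash(): fold the four inputs together. */
static uint64_t
demo_hash(uint64_t os, uint64_t obj, uint8_t lvl, uint64_t blkid)
{
    uint64_t h = mix64(os);

    h = mix64(h ^ obj);
    h = mix64(h ^ (uint64_t)lvl);
    return (mix64(h ^ blkid));
}

/*
 * Hash nblocks consecutive level-0 block ids of one object into a table
 * of nbuckets entries and return the longest chain seen. nbuckets must
 * be a power of two; returns 0 if it is not or if allocation fails.
 */
static uint32_t
max_chain_depth(uint64_t os, uint64_t obj, uint64_t nblocks,
    uint64_t nbuckets)
{
    uint32_t *depth, max = 0;

    if (nbuckets == 0 || (nbuckets & (nbuckets - 1)) != 0)
        return (0);
    depth = calloc(nbuckets, sizeof (*depth));
    if (depth == NULL)
        return (0);

    /* Only blkid changes from one iteration to the next. */
    for (uint64_t blkid = 0; blkid < nblocks; blkid++) {
        uint64_t idx = demo_hash(os, obj, 0, blkid) & (nbuckets - 1);

        if (++depth[idx] > max)
            max = depth[idx];
    }
    free(depth);
    assert(max <= nblocks);
    return (max);
}
```

Comparing the returned depth against the ideal `nblocks / nbuckets` gives a quick feel for how far a given hash is from uniform for this access pattern.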

This commit improves the situation by switching the hash mutex to
an rwlock to allow concurrent lookups.

How Has This Been Tested?

Tested locally with ZFS volumes and a write-heavy workload. Without
this change the node was observed to be effectively CPU bound
spinning on the hash mutexes. After this change the system was
largely idle while handling the same workload.

Note the maximum hash chain depth remains unchanged; the
performance wins are solely due to reduced contention. Dynamically
scaling the hash lock array size based on total system memory may
yield further minor improvements.
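
As a purely hypothetical illustration of what such scaling might look like, the sketch below picks a power-of-two lock count from physical memory. The function name, the floor and cap, and the one-lock-per-32-MiB ratio are all assumptions made for this example (the ratio was chosen so that a 256 GiB machine lands on 8192 locks); nothing here is taken from the ZFS source.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_LOCKS_MIN      2048ULL        /* assumed floor */
#define DEMO_LOCKS_MAX      65536ULL       /* assumed cap */
#define DEMO_BYTES_PER_LOCK (32ULL << 20)  /* assumed: one lock per 32 MiB */

/*
 * Return a power-of-two lock count that grows with physical memory,
 * clamped to [DEMO_LOCKS_MIN, DEMO_LOCKS_MAX].
 */
static uint64_t
demo_lock_count(uint64_t physmem_bytes)
{
    uint64_t want = physmem_bytes / DEMO_BYTES_PER_LOCK;
    uint64_t n = DEMO_LOCKS_MIN;

    /* The floor must itself be a power of two for the doubling to work. */
    assert((DEMO_LOCKS_MIN & (DEMO_LOCKS_MIN - 1)) == 0);

    /* Double until the target is covered or the cap is reached. */
    while (n < want && n < DEMO_LOCKS_MAX)
        n <<= 1;
    return (n);
}
```

Keeping the count a power of two lets the lock index be derived from the hash with a simple mask.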

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

amotin

I am a bit curious what workload requires 32 threads to access the same data block in parallel. I can more easily imagine it for some indirect block, especially if ibs is not reduced as we do in TrueNAS. I have a vague feeling I saw that, but a while ago. Though if it is really the same block that is needed, rather than a hash conflict, then the contention may just move from one lock to another.

But I have no objections. On FreeBSD kmutex_t and krwlock_t are both mapped into the same sx lock primitive, so this only changes the code paths, not the data structure.

behlendorf · 2022-05-03T17:27:17Z

The problematic workload here was caused by a large number of relatively small sequential writes to a single zvol with a 1M block size. The dbufstats kstats do a pretty good job illustrating the issue.

In particular, we saw a large number of hash_collisions and hash_insert_race events on the L0 blocks. Switching to the rwlock helped significantly, with profiling showing reduced contention in the hottest path zvol_write() -> dmu_write_uio_dnode() -> dmu_buf_hold_array_by_dnode() -> dbuf_hold() -> dbuf_hold_impl() -> dbuf_find().

But it didn't resolve the issue entirely; we found we also needed to increase DBUF_MUTEXES back to 8K on this system. What do you think about restoring the previous default? We did just decrease it in #12289. Or perhaps better yet, dynamically size it based on total system memory.

15 1 0x01 43 11696 204910972219 3039719476819100
name                            type data
cache_count                     4    3019
cache_size_bytes                4    2812739584
cache_size_bytes_max            4    4390600704
cache_target_bytes              4    3108257218
cache_lowater_bytes             4    2797431497
cache_hiwater_bytes             4    3419082939
cache_total_evicts              4    826986671
cache_level_0                   4    3014
cache_level_1                   4    5
cache_level_2                   4    0
...
cache_level_0_bytes             4    2812084224
cache_level_1_bytes             4    655360
cache_level_2_bytes             4    0
...
hash_hits                       4    174520406116
hash_misses                     4    2517277827
hash_collisions                 4    650742054
hash_elements                   4    11080
hash_elements_max               4    95640
hash_chains                     4    17
hash_chain_max                  4    17
hash_insert_race                4    13572999202
metadata_cache_count            4    966
metadata_cache_size_bytes       4    5040640
metadata_cache_size_bytes_max   4    5224960
metadata_cache_overflow         4    0
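
For anyone wanting to pull ratios out of a dump like this, here is a small sketch that looks up a counter by name in "name type data" text and computes the hash hit ratio. The helper names are hypothetical, and the parser assumes the three-column layout shown above (on Linux this text is typically read from /proc/spl/kstat/zfs/dbufstats). With the values above, hits account for roughly 98.6% of lookups.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Find the counter called name in kstat-style text made of
 * "name type data" lines. Lines that do not match that shape (such as
 * the column header or "...") are skipped. Returns 0 and stores the
 * value in *valp on success, or -1 if the name is not present.
 */
static int
kstat_lookup(const char *text, const char *name, uint64_t *valp)
{
    const char *line = text;

    assert(text != NULL && name != NULL && valp != NULL);

    while (line != NULL && *line != '\0') {
        char key[64];
        unsigned int type;
        uint64_t val;

        if (sscanf(line, "%63s %u %" SCNu64, key, &type, &val) == 3 &&
            strcmp(key, name) == 0) {
            *valp = val;
            return (0);
        }
        /* Advance to the character after the next newline, if any. */
        line = strchr(line, '\n');
        if (line != NULL)
            line++;
    }
    return (-1);
}

/* Fraction of hash lookups that were hits; 0.0 when there were none. */
static double
hash_hit_ratio(uint64_t hits, uint64_t misses)
{
    uint64_t total = hits + misses;

    return (total == 0 ? 0.0 : (double)hits / (double)total);
}
```

The same lookup can be reused for hash_collisions or hash_insert_race to compare them against hash_misses.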

amotin · 2022-05-03T20:18:48Z

As I understand, your hash_insert_race means your application writes are much smaller than 1MB, executed in parallel and may be even somehow synchronized, making the race more probable. That is probably a matter for application optimization, not the hash function. The large hash_collisions count, combined with such small hash_chains and hash_chain_max values, I have difficulty explaining. Could it be counting some previous incarnations of the dbuf, like DB_EVICTING? Do you have a primarycache setting or something else causing extremely fast evictions?

I have no problem with increasing DBUF_MUTEXES if it really helps; the optimization was very subtle. I just feel the probability of such a collision is pretty low, thinking about possible weirdness of the hash function distribution, but I may be wrong if the effect of even a low probability is amplified by extremely bad consequences.

behlendorf · 2022-05-03T21:52:36Z

As I understand, your hash_insert_race means your application writes are much smaller than 1MB, executed in parallel and may be even somehow synchronized, making the race more probable.

That's exactly right. The application I/O workload and our large block size just happen to have pretty clearly exposed this contention. We'll look into tuning the application as well, but I wanted to make sure we also improved the situation in ZFS. In our case, my expectation is the dbufs will be evicted quite quickly since they're 1) large (1MiB), and 2) written once in small chunks then never accessed again.

I'll go ahead and add a commit which increases DBUF_MUTEXES to this PR since in our testing it does really help.

Holding a dbuf is a common operation which can become highly contended in dbuf_find() when acquiring the dbuf hash mutex. This is particularly true on Linux when reading/writing volumes since by default up to 32 threads from the zvol_taskq may be taking a hold of the same dbuf. This should also be observable on FreeBSD as long as there are enough processes accessing the volume concurrently. This is further aggravated by the fact that only the block id will be unique when calculating the dbuf hash for a single volume. The objset id, object id, and level will be the same for data blocks. This has been observed to result in a somewhat less than uniform hash distribution and a longer than expected max hash chain depth (~20) on a large memory system (256 GB) using volumes. This commit improves the situation by switching the hash mutex to an rwlock to allow concurrent lookups, and increasing DBUF_RWLOCKS from 2048 to 8192 to further reduce the odds of a hash collision. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Holding a dbuf is a common operation which can become highly contended in dbuf_find() when acquiring the dbuf hash mutex. This is particularly true on Linux when reading/writing volumes since by default up to 32 threads from the zvol_taskq may be taking a hold of the same dbuf. This should also be observable on FreeBSD as long as there are enough processes accessing the volume concurrently. This is further aggravated by the fact that only the block id will be unique when calculating the dbuf hash for a single volume. The objset id, object id, and level will be the same for data blocks. This has been observed to result in a somewhat less than uniform hash distribution and a longer than expected max hash chain depth (~20) on a large memory system (256 GB) using volumes. This commit improves the situation by switching the hash mutex to an rwlock to allow concurrent lookups, and increasing DBUF_RWLOCKS from 2048 to 8192 to further reduce the odds of a hash collision. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes openzfs#13405

behlendorf added the Status: Code Review Needed Ready for review and testing label May 2, 2022

behlendorf requested a review from amotin May 2, 2022 22:28

tonyhutter approved these changes May 2, 2022

amotin approved these changes May 3, 2022

behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels May 3, 2022

behlendorf force-pushed the dbuf_hash_rwlock branch from d9f82f4 to 2746066 Compare May 3, 2022 21:55

behlendorf merged commit 34dbc61 into openzfs:master May 4, 2022

behlendorf mentioned this pull request May 4, 2022

Reduce dbuf_find() lock contention - 2.1. backport #13418

Merged

13 tasks

scineram mentioned this pull request Oct 11, 2022

ZFS big write performance hit upgrading from 2.1.4 to 2.1.5 or 2.1.6 #14009

Open
