Reduce Linux block device interference with ZVOL operations #5902
Conversation
@sempervictus, thanks for your PR! By analyzing the history of the files in this pull request, we identified @behlendorf, @tuxoko and @bprotopopov to be potential reviewers.
Looking forward to having those changes included! This has been tested on SSDs only so far, correct? We need some numbers from hard-drive/rotational-media tests to see how it affects throughput on those :) Thanks
Would applying this patch affect already created ZVOLs, or do we need to set these tunables? I have ZVOLs on a RAIDZ1 with rotational drives.
The patch makes no on-disk changes, only affecting the parameters used to initialize the zvol in memory. It works on existing zvols, and can be undone by simply installing another version.
KoT: agree, we need numbers for everything. I'll try to spin this up in our DC on some actual metal. I wouldn't expect a serious hit, if any, unless your rotating media is 5400 rpm sludge; having fewer functions in the execution path for a write to commit should help.
@sempervictus Did you want me to try testing this with or without the zvol taskq reinstatement? I just got fresh numbers for my fio write test for current master and for current master with the zvol taskq reinstatement rebased to current master. I'm getting ready to run the fio test of this with the zvol taskq reinstatement.
I ran some tests on a zvol using the following fio script:
Pool is 10 8-disk raidz2 groups. Zvol is 3200GiB. ARC was capped at 32GiB for this test. First run was with today's master (8614ddf):
Second test was with the zvol taskq reinstatement rebased on today's master:
Third test was the same but also with the 3 patches in this PR:
Here are a few of my own observations: The zvol taskq, as expected, improves the submission latency a lot but only makes a minor improvement in the total latency. The patches in this PR increase the total latency a bit and lower the bandwidth a bit; however, the iops are a bit higher.
fio's libaio is the wrong ioengine for testing the taskq changes. Think of it as doing in libaio the work that you've implemented in the taskq. It is not clear to me which ioengine would be best, because it is not clear to me how sync behaves with iodepth > 1, but ideally the ioengine will try to issue multiple I/Os with synchronous request semantics, with done callbacks occurring as they occur.
@richardelling: thanks, that's a rational explanation of what I've been seeing with these tests. Block devices exported over iSCSI, for instance, do not have the higher-level libaio pipeline scheduling IOs, and thus behave differently. Consumers atop that pipeline have all sorts of behavior, so it seems we want to test as many IO patterns as we can using different consumers. I've created a bash wrapper with a benchmark function in it for anyone testing this to collect results with and without the Linux optimizations - http://pastebin.com/7cRnzqEz. Set the volpath and change out the run_bench() function as needed for your use case. The script toggles the same sysfs controls that this PR modifies, showing the thinned-out pipeline (no-opt) and then the Linux block device defaults we currently use (with-opt). Here's what I'm seeing with the included tiotest runs:
@richardelling, @dweeezil, or anyone else with thoughts on this: I've set the readahead to match the volblocksize in my tests, and it produced:
Further tests at 4K and 16K show that random write throughput and linear read throughput benefit inversely from changes to the readahead: higher values make linear reads faster but hurt random writes, and vice versa. The volblocksize seems to offer the best balance between the two. ZVOLs seem to have some strange performance constraints given that they're virtual copy-on-write "devices" which theoretically shouldn't have contending reads and writes, due both to the ARC and to the fact that a write should never occur in a place contended by a read (up to the point of saturating hardware IO capacity in terms of operations dispatched or volume of data transferred/traversed). With compression disabled, they often deliver 1/4-1/3 the throughput of the underlying SSD (far smaller fractions on spanned pools). Does anyone have input on where we should set the readahead defaults given this variance, or on how to resolve whatever contention we're hitting in either the Linux or ZFS pipeline?
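To make the trade-off concrete, here is a minimal sketch of what sizing the Linux read-ahead window to the volblocksize could look like. zvol_update_readahead() is a hypothetical helper, not code from this PR, and it assumes the pre-4.11 kernel layout where backing_dev_info is embedded in the request queue, as in the diffs elsewhere in this thread.

	/*
	 * Hypothetical helper (not part of this PR): size the Linux
	 * read-ahead window to one volblocksize worth of pages instead
	 * of a fixed count, clamped to at least a single page.
	 */
	static void
	zvol_update_readahead(zvol_state_t *zv)
	{
		unsigned long ra_pages = zv->zv_volblocksize >> PAGE_SHIFT;

		if (ra_pages < 1)
			ra_pages = 1;

		zv->zv_queue->backing_dev_info.ra_pages = ra_pages;
	}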
module/zfs/zvol.c
Outdated
@@ -1399,7 +1399,7 @@ zvol_alloc(dev_t dev, const char *name)
 	goto out_kmem;

 	blk_queue_make_request(zv->zv_queue, zvol_request);
-	blk_queue_set_write_cache(zv->zv_queue, B_TRUE, B_TRUE);
+	blk_queue_set_write_cache(zv->zv_queue, B_FALSE, B_TRUE);
This is not correct. This will tell Linux not to send any FLUSH requests.
Also, to clarify: the libaio engine in fio does use zvol_taskq. The libaio engine uses the Linux AIO syscalls, whose asynchrony depends on the asynchrony of direct_IO, which in turn depends on the asynchrony of submit_bio.
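For context, a rough sketch of what a compat wrapper like blk_queue_set_write_cache() maps to in the kernel (the HAVE_BLK_QUEUE_WRITE_CACHE guard name is an assumption, not necessarily what this tree uses). Passing B_FALSE for the first argument is what tells the block layer the device has no volatile cache, so it stops issuing flushes.

	/*
	 * Illustrative compat shim (guard name is an assumption): "wc"
	 * tells Linux the device has a volatile write cache (so FLUSH
	 * requests are needed), "fua" tells it the device honors
	 * forced unit access.
	 */
	static inline void
	zvol_set_write_cache(struct request_queue *q, boolean_t wc, boolean_t fua)
	{
	#if defined(HAVE_BLK_QUEUE_WRITE_CACHE)
		/* kernels >= 4.7 */
		blk_queue_write_cache(q, wc, fua);
	#else
		/* older kernels: advertise FLUSH/FUA via the queue flush flags */
		blk_queue_flush(q, (wc ? REQ_FLUSH : 0) | (fua ? REQ_FUA : 0));
	#endif
	}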
This looks good to me.
@ryao: including the flush concern? Also, any chance I can harass you again to ask for whatever optimizations you were planning in the bio layer to get PRed?
@tuxoko: could you please elaborate on the flush issue? Based on the comments in the kernel's block/blk-flush.c, my understanding is that both of those should actually be set to B_FALSE as we've disabled the write-back caching altogether:
Do ZVOLs have a way to honor forced unit access without the Linux write-back cache over the block device? Or should I actually disable that as well? If these are async, couldn't the FUA tag result in a wait on the return until the next TXG commits (or does it assume synchronous)?
@sempervictus I missed @tuxoko's remark. I had read the term writecache to mean that Linux was implementing one, but in reality it tells Linux that the device implements one: http://lxr.free-electrons.com/source/block/blk-flush.c#L105
The documentation that you quoted refers to hardware block devices that truly don't have write caches. In that case, IO completion of a write signals that it reached stable storage, so doing a flush or FUA is pointless. A zvol is a device with a write cache: completion does not signal that data has reached stable storage. If it helps, a write IO with FUA is the equivalent of an
I am really glad that @tuxoko pointed that out, because saying it looked good after reading just the patches was a major goof on my part. Under no circumstances should we turn that flag off. As for zvols implementing flushes and FUA, the code will honor them if passed by Linux: https://github.com/zfsonlinux/zfs/blob/master/module/zfs/zvol.c#L834 The FUA tag means to send the IO to the ZIL.
Anyway, I withdraw my okay on this. The patch to disable the write cache flag needs to be dropped. We should probably add a comment explaining why it must always be set, so no one else looking at the code makes the mistake of interpreting it as Linux doing a write cache. If you have a workload where disabling flushes and FUA is okay, then you can set
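A condensed sketch of the behavior being pointed to above (the linked zvol.c is the authoritative code; bio_is_flush()/bio_is_fua() are compat-style helpers used here for illustration, and the function below is a stand-in, not the real write path):

	/*
	 * Condensed illustration of the sync decision in the zvol write
	 * path: a preflush/FUA bio, or a dataset with sync=always, ends
	 * with a zil_commit() so that I/O completion implies the data is
	 * on stable storage.
	 */
	static void
	zvol_write_sync_policy(zvol_state_t *zv, struct bio *bio)
	{
		boolean_t need_sync;

		need_sync = bio_is_flush(bio) || bio_is_fua(bio) ||
		    zv->zv_objset->os_sync == ZFS_SYNC_ALWAYS;

		/* ... the write itself is logged as a ZIL record here ... */

		if (need_sync)
			zil_commit(zv->zv_zilog, ZVOL_OBJ);
	}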
@sempervictus, when you get a chance could you rebase this entire stack on master and force-update the PR?
Force-pushed from 8969c88 to e09e735.
I have not seen the async bio piece, but the current changes minus the write cache patch are fine with me to merge. |
Aside from the build issue on modern kernels, this all LGTM.
module/zfs/zvol.c
Outdated
 	 */
-	zv->zv_queue->backing_dev_info.ra_pages = 0;
+	zv->zv_queue->backing_dev_info.ra_pages = 1;
This introduces a build failure on recent kernels, which needs to be addressed. It should also be squashed with the previous patch, which entirely disabled read-ahead.
fs/zfs/zfs/zvol.c: In function ‘zvol_alloc’:
fs/zfs/zfs/zvol.c:1483:32: error: request for member ‘ra_pages’ in something not a structure or union
zv->zv_queue->backing_dev_info.ra_pages = 1;
^
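One possible shape of the fix, as a sketch: kernels from 4.11 onward turned queue->backing_dev_info into a pointer, which is what triggers the error above, so the assignment needs both spellings behind a configure check. The HAVE_BLK_QUEUE_BDI_DYNAMIC guard name below is made up for illustration.

	/*
	 * Illustrative compat wrapper (guard name is an assumption):
	 * handle both the embedded (< 4.11) and pointer (>= 4.11) forms
	 * of backing_dev_info when setting the queue's read-ahead window.
	 */
	static inline void
	zvol_queue_set_read_ahead(struct request_queue *q, unsigned long ra_pages)
	{
	#ifdef HAVE_BLK_QUEUE_BDI_DYNAMIC
		q->backing_dev_info->ra_pages = ra_pages;
	#else
		q->backing_dev_info.ra_pages = ra_pages;
	#endif
	}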
Force-pushed from e09e735 to 35973b5.
@sempervictus I hope you don't mind but I took the liberty of rebasing this PR on master and addressing the remaining issues.
The current ZVOL implementation does not explicitly set merge options on ZVOL device queues, which results in the default merge behavior. Explicitly set QUEUE_FLAG_NOMERGES on ZVOL queues allowing the ZIO pipeline to do its work. Initial benchmarks (tiotest with no O_DIRECT) show random write performance going up almost 3X on 8K ZVOLs, even after significant rewrites of the logical space allocation.
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: RageLtMan <rageltman@sempervictus>
Issue #5902
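A minimal sketch of the queue-flag change this commit describes; queue_flag_set_unlocked() is the helper available on the kernels in play here (newer kernels spell it blk_queue_flag_set()):

	#include <linux/blkdev.h>

	/*
	 * Mark a zvol's request queue as no-merge so the block layer
	 * hands bios straight through and leaves any coalescing to the
	 * ZIO pipeline.
	 */
	static inline void
	zvol_disable_merges(struct request_queue *q)
	{
		queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, q);
	}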
Linux has read-ahead logic designed to accelerate sequential workloads. ZFS has its own read-ahead logic called zprefetch that operates on both ZVOLs and datasets. Having two prefetchers active at the same time can cause overprefetching, which unnecessarily reduces IOPS performance on CoW filesystems like ZFS. Testing shows that entirely disabling the Linux prefetch results in a significant performance penalty for reads, while commensurate benefits are seen in random writes. It appears that read-ahead benefits are inversely proportional to random write benefits, and so a single page of Linux-layer read-ahead appears to offer the middle ground for both workloads.
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Issue #5902
Change the default ZVOL behavior so requests are handled asynchronously. This behavior is functionally the same as in the zfs-0.6.4 release.
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #5902
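A rough sketch of the asynchronous dispatch this commit describes; zvol_taskq, zvol_handle_bio(), and zvol_handle_bio_task() are illustrative stand-ins for the real helpers, and the real code packages more state with each task than just the bio:

	/* 0 = dispatch bios to a taskq (async); nonzero = handle them inline. */
	static unsigned int zvol_request_sync = 0;
	module_param(zvol_request_sync, uint, 0644);

	static void
	zvol_dispatch_bio(zvol_state_t *zv, struct bio *bio)
	{
		/* Asynchronous path: hand the bio to the zvol taskq. */
		if (!zvol_request_sync &&
		    taskq_dispatch(zvol_taskq, zvol_handle_bio_task, bio,
		    TQ_SLEEP) != 0)
			return;

		/* Synchronous fallback: do the DMU work in the caller's context. */
		zvol_handle_bio(zv, bio);
	}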
These tweaks do appear to improve performance for the tested workloads. They've been merged to master to facilitate a wider range of testing.
Reduce the amount of interference introduced into ZVOL block device operations by Linux's own optimizations for dealing with conventional storage media (which are not backed by the ARC and do not have their own IO pipeline and scheduler).
Description
While testing #5824, changing several sysfs tunables produced very significant jumps in performance in anecdotal testing. The performance increases were significant enough to merit review for inclusion as the default configuration for these tunables. This set of commits thins out three Linux block layer optimizations: request merging on the ZVOL queue (QUEUE_FLAG_NOMERGES), Linux-layer read-ahead, and the write-back cache/flush handling on the block device.
Motivation and Context
ZVOL performance is currently poor and unpredictable enough to make ZVOLs difficult to use in contended production environments that require a guaranteed minimum performance baseline. These changes are intended to simplify the execution flow, reduce memory allocations spent on futile optimization attempts, and hand more of the related logic back to the ZIO pipeline.
How Has This Been Tested?
The PR (along with #5824) has been built in DKMS format under a 4.9.14 grsec kernel (no RAP) and pushed through several ztest cycles. Anecdotal performance tests have been performed using tiotest.