Reduce Linux block device interference with ZVOL operations #5902
Conversation
@sempervictus, thanks for your PR! By analyzing the history of the files in this pull request, we identified @behlendorf, @tuxoko and @bprotopopov to be potential reviewers.
Looking forward to having those changes included! This has been tested on SSDs only so far, correct? We need some numbers from hard-drive/rotational-media tests to see how it affects throughput on those :) Thanks
Would applying this patch affect already created ZVOLs, or do we need to set these tunables? I have ZVOLs on a RAIDZ1 with rotational drives.
The patch makes no on-disk changes, only affecting the parameters used to initialize the zvol in memory. It works on existing zvols, and can be undone by simply installing another version.
KoT: agree, we need numbers for everything. I'll try to spin this up in our DC on some actual metal. I wouldn't expect a serious hit, if any, unless your rotating media is 5400 rpm sludge; having fewer functions in the execution path for a write to commit should help.
@sempervictus Did you want me to try testing this with or without the zvol taskq reinstatement? I just got fresh numbers for my fio write test for current master and for current master with the zvol taskq reinstatement rebased to current master. I'm getting ready to run the fio test of this with the zvol taskq reinstatement.
I ran some tests on a zvol using the following fio script:
Pool is 10 8-disk raidz2 groups. Zvol is 3200GiB. ARC was capped at 32GiB for this test. First run was with today's master (8614ddf):
Second test was with the zvol taskq reinstatement rebased on today's master:
Third test was the same but also with the 3 patches in this PR:
Here are a few of my own observations: The zvol taskq, as expected, improves the submission latency a lot but only makes a minor improvement in the total latency. The patches in this PR increase the total latency a bit and lower the bandwidth a bit; however, the iops are a bit higher.
fio's libaio is the wrong ioengine for testing the taskq changes. Think of it as doing in libaio the work that you've implemented in the taskq. It is not clear to me which ioengine would be best, because it is not clear to me how sync behaves with iodepth > 1, but ideally the ioengine will try to issue multiple I/Os with synchronous request semantics, with done callbacks occurring as they occur.
@richardelling: thanks, that's a rational explanation of what I've been seeing with these tests. Block devices exported over iSCSI, for instance, do not have the higher-level libaio pipeline scheduling IOs, and thus behave differently. Consumers atop that pipeline have all sorts of behavior, so it seems we want to test as many IO patterns as we can using different consumers. I've created a bash wrapper with a benchmark function in it for anyone testing this to collect results with and without the Linux optimizations - http://pastebin.com/7cRnzqEz. Set the volpath and change out the run_bench() function as needed for your use case. The script toggles the same sysfs controls that this PR modifies, showing the thinned-out pipeline (no-opt) and then the Linux block device defaults we currently use (with-opt). Here's what I'm seeing with the included tiotest runs:
@richardelling, @dweeezil, or anyone else with thoughts on this: I've set the readahead to match the volblocksize in my tests, and it produced:
Further tests at 4K and 16K show that random write throughput and linear read throughput benefit inversely from changes to the readahead: higher values make linear reads faster but hurt random writes, and vice versa. The volblocksize seems to offer the best balance between the two. ZVOLs seem to have some strange performance constraints given that they're virtual copy-on-write "devices" which theoretically shouldn't have contending reads and writes, due both to the ARC and to the fact that a write should never occur in a place contended by a read (up to the point of saturating hardware IO capacity in terms of operations dispatched or volume of data transferred/traversed). With compression disabled, they often deliver 1/4-1/3 the throughput of the underlying SSD (far smaller fractions on spanned pools). Does anyone have input on where we should set the readahead defaults given this variance, or on how to resolve whatever contention we're hitting in either the Linux or ZFS pipeline?
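To make the trade-off concrete, here is a minimal sketch of what sizing the Linux read-ahead window to the volblocksize could look like. zvol_update_readahead() is a hypothetical helper, not code from this PR, and it assumes the pre-4.11 kernel layout where backing_dev_info is embedded in the request queue, as in the diffs elsewhere in this thread.

	/*
	 * Hypothetical helper (not part of this PR): size the Linux
	 * read-ahead window to one volblocksize worth of pages instead
	 * of a fixed count, clamped to at least a single page.
	 */
	static void
	zvol_update_readahead(zvol_state_t *zv)
	{
		unsigned long ra_pages = zv->zv_volblocksize >> PAGE_SHIFT;

		if (ra_pages < 1)
			ra_pages = 1;

		zv->zv_queue->backing_dev_info.ra_pages = ra_pages;
	}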
module/zfs/zvol.c
Outdated
@@ -1399,7 +1399,7 @@ zvol_alloc(dev_t dev, const char *name)
 	goto out_kmem;

 	blk_queue_make_request(zv->zv_queue, zvol_request);
-	blk_queue_set_write_cache(zv->zv_queue, B_TRUE, B_TRUE);
+	blk_queue_set_write_cache(zv->zv_queue, B_FALSE, B_TRUE);
This is not correct. This will tell Linux not to send any FLUSH requests.
Also, to clarify: the libaio engine in fio does use zvol_taskq. The libaio engine uses the Linux AIO syscalls, whose asynchrony depends on the asynchrony of direct_IO, which in turn depends on the asynchrony of submit_bio.
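For context, a rough sketch of what a compat wrapper like blk_queue_set_write_cache() maps to in the kernel (the HAVE_BLK_QUEUE_WRITE_CACHE guard name is an assumption, not necessarily what this tree uses). Passing B_FALSE for the first argument is what tells the block layer the device has no volatile cache, so it stops issuing flushes.

	/*
	 * Illustrative compat shim (guard name is an assumption): "wc"
	 * tells Linux the device has a volatile write cache (so FLUSH
	 * requests are needed), "fua" tells it the device honors
	 * forced unit access.
	 */
	static inline void
	zvol_set_write_cache(struct request_queue *q, boolean_t wc, boolean_t fua)
	{
	#if defined(HAVE_BLK_QUEUE_WRITE_CACHE)
		/* kernels >= 4.7 */
		blk_queue_write_cache(q, wc, fua);
	#else
		/* older kernels: advertise FLUSH/FUA via the queue flush flags */
		blk_queue_flush(q, (wc ? REQ_FLUSH : 0) | (fua ? REQ_FUA : 0));
	#endif
	}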
This looks good to me.
@ryao: including the flush concern? Also, any chance I can harass you again to ask for whatever optimizations you were planning in the bio layer to get PRed?
@tuxoko: could you please elaborate on the flush issue? Based on the comments in the kernel's block/blk-flush.c, my understanding is that both of those should actually be set to B_FALSE as we've disabled the write-back caching altogether:
Do ZVOLs have a way to honor forced unit access without the Linux write-back cache over the block device? Or should I actually disable that as well? If these are async, couldn't the FUA tag result in a wait on the return until the next TXG commits (or does it assume synchronous)?
@sempervictus I missed @tuxoko's remark. I had read the term writecache to mean that Linux was implementing one, but in reality it tells Linux that the device implements one: http://lxr.free-electrons.com/source/block/blk-flush.c#L105
The documentation that you quoted refers to hardware block devices that truly don't have write caches. In that case, IO completion of a write signals that it reached stable storage, so doing a flush or FUA is pointless. A zvol is a device with a write cache: completion does not signal that data has reached stable storage. If it helps, a write IO with FUA is the equivalent of an
I am really glad that @tuxoko pointed that out, because saying it looked good after reading just the patches was a major goof on my part. Under no circumstances should we turn that flag off. As for zvols implementing flushes and FUA, the code will honor them if passed by Linux: https://github.com/zfsonlinux/zfs/blob/master/module/zfs/zvol.c#L834 The FUA tag means to send the IO to the ZIL.
Anyway, I withdraw my okay on this. The patch to disable the write cache flag needs to be dropped. We should probably add a comment explaining why it must always be set, so no one else looking at the code makes the mistake of interpreting it as Linux doing a write cache. If you have a workload where disabling flushes and FUA is okay, then you can set
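A condensed sketch of the behavior being pointed to above (the linked zvol.c is the authoritative code; bio_is_flush()/bio_is_fua() are compat-style helpers used here for illustration, and the function below is a stand-in, not the real write path):

	/*
	 * Condensed illustration of the sync decision in the zvol write
	 * path: a preflush/FUA bio, or a dataset with sync=always, ends
	 * with a zil_commit() so that I/O completion implies the data is
	 * on stable storage.
	 */
	static void
	zvol_write_sync_policy(zvol_state_t *zv, struct bio *bio)
	{
		boolean_t need_sync;

		need_sync = bio_is_flush(bio) || bio_is_fua(bio) ||
		    zv->zv_objset->os_sync == ZFS_SYNC_ALWAYS;

		/* ... the write itself is logged as a ZIL record here ... */

		if (need_sync)
			zil_commit(zv->zv_zilog, ZVOL_OBJ);
	}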
@sempervictus, when you get a chance could you rebase this entire stack on master and force-update the PR?
Force-pushed from 8969c88 to e09e735.
I have not seen the async bio piece, but the current changes minus the write cache patch are fine with me to merge. |
Aside from the build issue on modern kernels, this all LGTM.
module/zfs/zvol.c
Outdated
 	 */
-	zv->zv_queue->backing_dev_info.ra_pages = 0;
+	zv->zv_queue->backing_dev_info.ra_pages = 1;
This introduces a build failure on recent kernels, which needs to be addressed. It should also be squashed with the previous patch, which entirely disabled read-ahead.
fs/zfs/zfs/zvol.c: In function ‘zvol_alloc’:
fs/zfs/zfs/zvol.c:1483:32: error: request for member ‘ra_pages’ in something not a structure or union
zv->zv_queue->backing_dev_info.ra_pages = 1;
^
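One possible shape of the fix, as a sketch: kernels from 4.11 onward turned queue->backing_dev_info into a pointer, which is what triggers the error above, so the assignment needs both spellings behind a configure check. The HAVE_BLK_QUEUE_BDI_DYNAMIC guard name below is made up for illustration.

	/*
	 * Illustrative compat wrapper (guard name is an assumption):
	 * handle both the embedded (< 4.11) and pointer (>= 4.11) forms
	 * of backing_dev_info when setting the queue's read-ahead window.
	 */
	static inline void
	zvol_queue_set_read_ahead(struct request_queue *q, unsigned long ra_pages)
	{
	#ifdef HAVE_BLK_QUEUE_BDI_DYNAMIC
		q->backing_dev_info->ra_pages = ra_pages;
	#else
		q->backing_dev_info.ra_pages = ra_pages;
	#endif
	}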
Force-pushed from e09e735 to 35973b5.
@sempervictus I hope you don't mind but I took the liberty of rebasing this PR on master and addressing the remaining issues.
The current ZVOL implementation does not explicitly set merge options on ZVOL device queues, which results in the default merge behavior. Explicitly set QUEUE_FLAG_NOMERGES on ZVOL queues allowing the ZIO pipeline to do its work. Initial benchmarks (tiotest with no O_DIRECT) show random write performance going up almost 3X on 8K ZVOLs, even after significant rewrites of the logical space allocation.
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: RageLtMan <rageltman@sempervictus>
Issue #5902
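A minimal sketch of the queue-flag change this commit describes; queue_flag_set_unlocked() is the helper available on the kernels in play here (newer kernels spell it blk_queue_flag_set()):

	#include <linux/blkdev.h>

	/*
	 * Mark a zvol's request queue as no-merge so the block layer
	 * hands bios straight through and leaves any coalescing to the
	 * ZIO pipeline.
	 */
	static inline void
	zvol_disable_merges(struct request_queue *q)
	{
		queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, q);
	}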
Linux has read-ahead logic designed to accelerate sequential workloads. ZFS has its own read-ahead logic called zprefetch that operates on both ZVOLs and datasets. Having two prefetchers active at the same time can cause overprefetching, which unnecessarily reduces IOPS performance on CoW filesystems like ZFS. Testing shows that entirely disabling the Linux prefetch results in a significant performance penalty for reads, while commensurate benefits are seen in random writes. It appears that read-ahead benefits are inversely proportional to random write benefits, and so a single page of Linux-layer read-ahead appears to offer the middle ground for both workloads.
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Issue #5902
Change the default ZVOL behavior so requests are handled asynchronously. This behavior is functionally the same as in the zfs-0.6.4 release.
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #5902
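A rough sketch of the asynchronous dispatch this commit describes; zvol_taskq, zvol_handle_bio(), and zvol_handle_bio_task() are illustrative stand-ins for the real helpers, and the real code packages more state with each task than just the bio:

	/* 0 = dispatch bios to a taskq (async); nonzero = handle them inline. */
	static unsigned int zvol_request_sync = 0;
	module_param(zvol_request_sync, uint, 0644);

	static void
	zvol_dispatch_bio(zvol_state_t *zv, struct bio *bio)
	{
		/* Asynchronous path: hand the bio to the zvol taskq. */
		if (!zvol_request_sync &&
		    taskq_dispatch(zvol_taskq, zvol_handle_bio_task, bio,
		    TQ_SLEEP) != 0)
			return;

		/* Synchronous fallback: do the DMU work in the caller's context. */
		zvol_handle_bio(zv, bio);
	}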
These tweaks do appear to improve performance for the tested workloads. They've been merged to master to facilitate a wider range of testing.
Reduce the amount of interference introduced into ZVOL block device operations by Linux's own optimizations for dealing with conventional storage media (which are not backed by the ARC and do not have their own IO pipeline and scheduler).
Description
While testing #5824, changing several sysfs tunables produced very significant jumps in performance in anecdotal testing. The performance increases were significant enough to merit review for inclusion as the default configuration for these tunables. This set of commits thins out three Linux block layer optimizations: request merging on the ZVOL queue (QUEUE_FLAG_NOMERGES), Linux-layer read-ahead, and the write-back cache/flush handling on the block device.
Motivation and Context
ZVOL performance is currently poor and unpredictable enough to make ZVOLs difficult to use in contended production environments that require a guaranteed minimum performance baseline. These changes are intended to simplify the execution flow, reduce memory allocations spent on futile optimization attempts, and hand more of the related logic back to the ZIO pipeline.
How Has This Been Tested?
The PR (along with #5824) has been built in DKMS format under a 4.9.14 grsec kernel (no RAP) and pushed through several ztest cycles. Anecdotal performance tests have been performed using tiotest.