
Reduce Linux block device interference with ZVOL operations #5902

Conversation

sempervictus
Contributor

Reduce the amount of interference introduced into ZVOL block device operations by Linux's own optimizations for conventional storage media (which, unlike ZVOLs, are not backed by the ARC and have no IO pipeline and scheduler of their own).

Description

While testing #5824, changes to several tunables in sysfs produced very significant jumps in performance under anecdotal testing. The performance increases were significant enough to merit review for inclusion as default configurations for these tunables. This set of commits thins out three Linux block layer optimizations:

  1. Read-ahead on ZVOLs by the Linux block layer, added by @ryao in a8f9ad7 three years ago.
  2. Write-back caching in the block device layer - the ARC already provides this, making it redundant, and it appears to hurt linear write (throughput) performance considerably when in use.
  3. Write merging, which appears to significantly hurt random write performance when enabled. (A sketch of the corresponding queue setup follows this list.)
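
A hedged sketch of what these three changes amount to in zvol_alloc() (zvol.c), based on the hunks discussed later in this thread. The write-cache line was dropped during review, and the merge-flag helper shown here is an assumption that may differ by kernel version:

	/* Sketch of the originally proposed queue setup (not the exact diff). */
	blk_queue_make_request(zv->zv_queue, zvol_request);

	/* 1. Disable Linux block-layer read-ahead; zprefetch already prefetches. */
	zv->zv_queue->backing_dev_info.ra_pages = 0;

	/* 2. Stop advertising a write-back cache (later dropped, see review below). */
	blk_queue_set_write_cache(zv->zv_queue, B_FALSE, B_TRUE);

	/* 3. Disable write merging so the ZIO pipeline sees requests as issued. */
	queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, zv->zv_queue);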

Motivation and Context

ZVOL performance is currently so unpredictable, and its issues so significant, that ZVOLs are difficult to use in contended production environments that require a guaranteed minimum performance baseline. These changes are intended to simplify the execution flow, reduce memory allocations made in futile attempts at optimization, and hand more of the related logic back to the ZIO pipeline.

How Has This Been Tested?

The PR (along with #5824) has been built in DKMS format under a 4.9.14 grsec kernel (no RAP) and pushed through several ztest cycles. Anecdotal performance tests have been performed using tiotest.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • Change has been approved by a ZFS on Linux member.

@mention-bot

@sempervictus, thanks for your PR! By analyzing the history of the files in this pull request, we identified @behlendorf, @tuxoko and @bprotopopov to be potential reviewers.

@kernelOfTruth
Contributor

kernelOfTruth commented Mar 18, 2017

Looking forward to having those changes included!

This has so far been tested on SSDs only, correct?

We need some numbers from hard drive/rotational media tests to see how it affects throughput there :)

Thanks

@dracwyrm

Would applying this patch affect already-created ZVOLs, or do we need to set these tunables ourselves? I have ZVOLs on a RAIDZ1 with rotational drives.

@sempervictus
Contributor Author

sempervictus commented Mar 18, 2017 via email

@dweeezil
Contributor

@sempervictus Did you want me to try testing this with or without the zvol taskq reinstatement? I just got fresh numbers for my fio write test for current master and for current master with the zvol taskq reinstatement rebased to current master. I'm getting ready to run the fio test of this with the zvol taskq reinstatement.

@dweeezil
Contributor

I ran some tests on a zvol using the following fio script:

[test]
        blocksize=8k
        scramble_buffers=1
        disk_util=0
        invalidate=0
        size=10g
        numjobs=32
        create_serialize=1
        direct=1
        filename=/dev/zvol/tank/v1
        offset=0
        offset_increment=10g
        group_reporting=1
        ioengine=libaio
        iodepth=10
        rw=write
        thread=1
        time_based=1
        runtime=3600
        fsync=0
        fallocate=none

Pool is 10 8-disk raidz2 groups. Zvol is 3200GiB. ARC was capped at 32GiB for this test.

First run was with today's master (8614ddf):

test: (groupid=0, jobs=32): err= 0: pid=1451: Sat Mar 18 10:29:43 2017
  write: io=833148MB, bw=236984KB/s, iops=29622, runt=3600006msec
    slat (usec): min=17, max=193015, avg=1076.92, stdev=2677.77
    clat (usec): min=1, max=279206, avg=9723.00, stdev=19552.61
     lat (usec): min=38, max=285654, avg=10800.33, stdev=21602.41
    clat percentiles (usec):
     |  1.00th=[ 1576],  5.00th=[ 2192], 10.00th=[ 2416], 20.00th=[ 2640],
     | 30.00th=[ 2800], 40.00th=[ 2896], 50.00th=[ 2992], 60.00th=[ 3120],
     | 70.00th=[ 3248], 80.00th=[ 3440], 90.00th=[40704], 95.00th=[64256],
     | 99.00th=[81408], 99.50th=[88576], 99.90th=[108032], 99.95th=[118272],
     | 99.99th=[148480]
    bw (KB  /s): min=  650, max=30640, per=3.13%, avg=7410.34, stdev=4973.96
    lat (usec) : 2=0.01%, 4=0.01%, 10=0.01%, 50=0.01%, 100=0.01%
    lat (usec) : 250=0.01%, 500=0.06%, 750=0.08%, 1000=0.12%
    lat (msec) : 2=2.79%, 4=82.63%, 10=2.01%, 20=1.24%, 50=1.57%
    lat (msec) : 100=9.32%, 250=0.19%, 500=0.01%
  cpu          : usr=0.33%, sys=26.36%, ctx=36242810, majf=0, minf=379804
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=106642920/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=833148MB, aggrb=236983KB/s, minb=236983KB/s, maxb=236983KB/s, mint=3600006msec, maxt=3600006msec

Second test was with the zvol taskq reinstatement rebased on today's master:

test: (groupid=0, jobs=32): err= 0: pid=22310: Sat Mar 18 15:30:26 2017
  write: io=882357MB, bw=250981KB/s, iops=31372, runt=3600007msec
    slat (usec): min=2, max=48514, avg=14.92, stdev=57.20
    clat (usec): min=47, max=250002, avg=10181.66, stdev=19875.64
     lat (usec): min=205, max=250012, avg=10196.88, stdev=19876.37
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    4], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    4], 50.00th=[    4], 60.00th=[    4],
     | 70.00th=[    4], 80.00th=[    4], 90.00th=[   41], 95.00th=[   65],
     | 99.00th=[   87], 99.50th=[   97], 99.90th=[  125], 99.95th=[  137],
     | 99.99th=[  161]
    bw (KB  /s): min=  505, max=24544, per=3.12%, avg=7840.57, stdev=4515.67
    lat (usec) : 50=0.01%, 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
    lat (usec) : 1000=0.01%
    lat (msec) : 2=0.19%, 4=85.57%, 10=2.49%, 20=0.78%, 50=1.73%
    lat (msec) : 100=8.82%, 250=0.42%, 500=0.01%
  cpu          : usr=0.70%, sys=2.22%, ctx=88499308, majf=0, minf=318108
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=112941728/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=882357MB, aggrb=250981KB/s, minb=250981KB/s, maxb=250981KB/s, mint=3600007msec, maxt=3600007msec

Third test was the same but also with the 3 patches in this PR:

test: (groupid=0, jobs=32): err= 0: pid=33656: Sat Mar 18 17:35:35 2017
  write: io=872636MB, bw=248212KB/s, iops=31026, runt=3600062msec
    slat (usec): min=2, max=77352, avg=14.96, stdev=64.36
    clat (usec): min=21, max=244250, avg=10295.35, stdev=20056.24
     lat (usec): min=244, max=244282, avg=10310.61, stdev=20056.99
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    4], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    4], 50.00th=[    4], 60.00th=[    4],
     | 70.00th=[    4], 80.00th=[    4], 90.00th=[   44], 95.00th=[   65],
     | 99.00th=[   87], 99.50th=[   97], 99.90th=[  124], 99.95th=[  135],
     | 99.99th=[  159]
    bw (KB  /s): min=  677, max=24640, per=3.12%, avg=7755.45, stdev=4388.54
    lat (usec) : 50=0.01%, 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
    lat (usec) : 1000=0.01%
    lat (msec) : 2=0.23%, 4=85.32%, 10=2.49%, 20=0.82%, 50=1.72%
    lat (msec) : 100=9.00%, 250=0.42%
  cpu          : usr=0.70%, sys=2.20%, ctx=87163050, majf=0, minf=125389
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=111697380/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=872636MB, aggrb=248212KB/s, minb=248212KB/s, maxb=248212KB/s, mint=3600062msec, maxt=3600062msec

Here are a few of my own observations: the zvol taskq, as expected, improves submission latency a lot but only makes a minor improvement in total latency. The patches in this PR increase total latency a bit and lower bandwidth a bit; however, IOPS are a bit higher.

@richardelling
Contributor

fio's libaio is the wrong ioengine for testing the taskq changes. Think of it as doing in libaio the work that you've implemented in the taskq. It is not clear to me which ioengine would be best, because it is not clear to me how sync behaves with iodepth > 1, but ideally the ioengine would try to issue multiple I/Os with synchronous request semantics, with completion callbacks firing as they occur.

@behlendorf behlendorf added the Type: Performance Performance improvement or performance problem label Mar 20, 2017
@behlendorf behlendorf added this to the 0.7.0 milestone Mar 20, 2017
@sempervictus
Contributor Author

sempervictus commented Mar 21, 2017

@richardelling: thanks, that's a rational explanation of what I've been seeing with these tests. Block devices exported over iSCSI, for instance, do not have the higher-level libaio pipeline scheduling IOs, and thus behave differently. Consumers atop that pipeline have all sorts of behavior, so it seems like we want to test as many IO patterns as we can using different consumers.

I've created a bash wrapper with a benchmark function in it so anyone testing this can collect results with and without the Linux optimizations: http://pastebin.com/7cRnzqEz. Set the volpath and change out the run_bench() function as needed for your use case. The script toggles the same sysfs controls that this PR modifies, showing the thinned-out pipeline (no-opt) and then the Linux block device defaults we currently use (with-opt).

Here's what I'm seeing with the included tiotest runs:

Created dpool/images/ssd-testvol with 4k blocksize using zd160
No-opt run @ 4k #1: Tiotest results for 4 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write        1024 MBs |    4.6 s | 222.228 MB/s |  14.0 %  | 471.7 % |
| Random Write  469 MBs |    0.7 s | 638.489 MB/s |  44.8 %  | 539.2 % |
| Read         1024 MBs |    3.7 s | 276.820 MB/s |  24.8 %  | 373.1 % |
| Random Read   469 MBs |    1.4 s | 323.537 MB/s |  34.8 %  | 379.9 % |
`----------------------------------------------------------------------'
No-opt run @ 4k #2: Tiotest results for 4 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write        1024 MBs |    2.8 s | 368.665 MB/s |  21.1 %  | 464.1 % |
| Random Write  469 MBs |    0.7 s | 653.438 MB/s |  55.7 %  | 573.3 % |
| Read         1024 MBs |    3.7 s | 279.787 MB/s |  25.4 %  | 371.1 % |
| Random Read   469 MBs |    1.5 s | 319.632 MB/s |  40.6 %  | 374.9 % |
`----------------------------------------------------------------------'
With-opt run @ 4k #1: Tiotest results for 4 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write        1024 MBs |    3.6 s | 282.254 MB/s |  11.3 %  | 397.6 % |
| Random Write  469 MBs |    3.1 s | 151.419 MB/s |  13.0 %  | 133.6 % |
| Read         1024 MBs |    0.6 s | 1649.261 MB/s | 113.8 %  | 681.2 % |
| Random Read   469 MBs |    1.4 s | 323.977 MB/s |  37.0 %  | 395.6 % |
`----------------------------------------------------------------------'
With-opt run @ 4k #2: Tiotest results for 4 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write        1024 MBs |    9.1 s | 112.024 MB/s |   6.2 %  | 222.0 % |
| Random Write  469 MBs |    3.4 s | 136.990 MB/s |  11.5 %  | 190.4 % |
| Read         1024 MBs |    0.6 s | 1619.738 MB/s | 108.4 %  | 674.3 % |
| Random Read   469 MBs |    1.5 s | 322.817 MB/s |  40.6 %  | 391.7 % |
`----------------------------------------------------------------------'

@sempervictus
Contributor Author

@richardelling, @dweeezil, or anyone else with thoughts on this:
While read-ahead enabled at the Linux defaults results in a significant penalty to write throughput, it looks like having it completely disabled reduces IOPS slightly but significantly penalizes linear read throughput (what read-ahead is good at). Is this happening because the resulting read requests going into the ZIO pipeline are not seen as aggressive enough to warrant read-ahead?

I've set the readahead to match the volblocksize in my tests, and it produced:

No-opt run @ 8k #1: Tiotest results for 4 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write        1024 MBs |    3.9 s | 261.135 MB/s |  13.8 %  | 441.7 % |
| Random Write  469 MBs |    0.7 s | 701.880 MB/s |  44.5 %  | 500.2 % |
| Read         1024 MBs |    1.2 s | 848.680 MB/s |  63.3 %  | 703.7 % |
| Random Read   469 MBs |    1.5 s | 319.374 MB/s |  29.5 %  | 389.5 % |
`----------------------------------------------------------------------'
No-opt run @ 8k #2: Tiotest results for 4 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write        1024 MBs |    2.8 s | 370.460 MB/s |  23.6 %  | 383.1 % |
| Random Write  469 MBs |    0.6 s | 736.652 MB/s |  62.3 %  | 518.8 % |
| Read         1024 MBs |    1.2 s | 848.200 MB/s |  63.8 %  | 705.3 % |
| Random Read   469 MBs |    1.5 s | 320.417 MB/s |  35.5 %  | 384.9 % |
`----------------------------------------------------------------------'
With-opt run @ 8k #1: Tiotest results for 4 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write        1024 MBs |    2.9 s | 354.408 MB/s |  18.9 %  | 368.3 % |
| Random Write  469 MBs |    4.0 s | 116.562 MB/s |   8.2 %  |  98.4 % |
| Read         1024 MBs |    0.5 s | 1932.652 MB/s | 119.6 %  | 775.0 % |
| Random Read   469 MBs |    1.5 s | 319.205 MB/s |  36.6 %  | 384.2 % |
`----------------------------------------------------------------------'
With-opt run @ 8k #2: Tiotest results for 4 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write        1024 MBs |    7.0 s | 146.399 MB/s |   5.8 %  | 168.2 % |
| Random Write  469 MBs |    3.4 s | 136.206 MB/s |  11.9 %  | 168.1 % |
| Read         1024 MBs |    0.5 s | 1921.186 MB/s | 122.5 %  | 793.7 % |
| Random Read   469 MBs |    1.5 s | 317.996 MB/s |  35.2 %  | 385.6 % |
`----------------------------------------------------------------------'

Further tests at 4K and 16K show that random write throughput and linear read throughput benefit inversely from changes to the read-ahead: higher values make linear reads faster but hurt random writes, and vice versa. The volblocksize seems to offer the best balance between the two.

ZVOLs seem to have some strange performance constraints given that they're virtual copy-on-write "devices" which theoretically shouldn't have contending reads and writes, thanks both to the ARC and to the fact that a write should never land in a location contended by a read (up to the point of saturating hardware IO capacity, whether in operations dispatched or in volume of data transferred/traversed). With compression disabled, they often deliver 1/4 to 1/3 the throughput of the underlying SSD (and a far smaller fraction on spanned pools).

Does anyone have input on where we should set the read-ahead defaults given this variance, or on how to resolve whatever contention we're hitting in either the Linux or the ZFS pipeline?
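
For concreteness, a hedged sketch of what a "read-ahead equal to one volblocksize" default could look like in zvol_alloc(). This is purely illustrative and not something this PR implements; MAX is the SPL sysmacros helper:

	/* Hypothetical default: read ahead one volblocksize worth of pages,
	 * with a floor of a single page. */
	zv->zv_queue->backing_dev_info.ra_pages =
	    MAX(1, zv->zv_volblocksize >> PAGE_SHIFT);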

@@ -1399,7 +1399,7 @@ zvol_alloc(dev_t dev, const char *name)
 		goto out_kmem;
 
 	blk_queue_make_request(zv->zv_queue, zvol_request);
-	blk_queue_set_write_cache(zv->zv_queue, B_TRUE, B_TRUE);
+	blk_queue_set_write_cache(zv->zv_queue, B_FALSE, B_TRUE);
Contributor


This is not correct. This will tell Linux not to send any FLUSH requests.

@tuxoko
Contributor

tuxoko commented Mar 29, 2017

Also, to clarify things: the libaio engine in fio does use zvol_taskq. The libaio engine uses the Linux AIO syscalls, whose asynchrony depends on the asynchrony of direct_IO, which in turn depends on the asynchrony of submit_bio.
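
For readers unfamiliar with that chain, here is a hedged sketch of asynchronous request handling on a zvol. The names zvol_taskq, zv_request_t, zvol_worker, and zvol_dispatch_async are illustrative, not the exact code from #5824:

	/* Hand the bio off to a taskq so zvol_request(), and therefore
	 * submit_bio(), returns without blocking; this is what lets the
	 * Linux AIO path behave asynchronously on a zvol. */
	typedef struct zv_request {
		zvol_state_t	*zv;
		struct bio	*bio;
	} zv_request_t;

	static void zvol_worker(void *arg);	/* does the DMU I/O, ends the bio */

	static void
	zvol_dispatch_async(zvol_state_t *zv, struct bio *bio)
	{
		zv_request_t *zvr = kmem_alloc(sizeof (zv_request_t), KM_SLEEP);

		zvr->zv = zv;
		zvr->bio = bio;
		taskq_dispatch(zvol_taskq, zvol_worker, zvr, TQ_SLEEP);
	}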

@ryao
Contributor

ryao commented Apr 8, 2017

This looks good to me.

@sempervictus
Contributor Author

sempervictus commented Apr 8, 2017 via email

@sempervictus
Contributor Author

@tuxoko: could you please elaborate on the flush issue? Based on the comments in the kernel's block/blk-flush.c, my understanding is that both of those flags should actually be set to B_FALSE, since we've disabled the write-back caching altogether:

    If the device doesn't have writeback cache, FLUSH and FUA don't make any
    difference. The requests are either completed immediately if there's no
    data or executed as normal requests otherwise.

Do ZVOLs have a way to honor forced unit access without the Linux write-back cache over the block device, or should I actually disable that as well? If these are async, couldn't the FUA tag result in the completion waiting until the next TXG commits (or does it assume synchronous handling)?

@ryao
Contributor

ryao commented Apr 12, 2017

@sempervictus I missed @tuxoko's remark. I had read the term write cache to mean that Linux was implementing one, but in reality the flag tells Linux that the device implements one:

http://lxr.free-electrons.com/source/block/blk-flush.c#L105

The documentation that you quoted is referring to hardware block devices that truly don't have write caches. In that case, IO completion of a write signals that it reached stable storage. Doing a flush or FUA is therefore pointless.

A zvol is a device with a write cache. Completion does not signal that data has reached stable storage. If it helps, a write IO with FUA is the equivalent of an O_SYNC write while a flush is the equivalent of fsync(). Turning that off breaks assumptions required to ensure data integrity by whatever is above us. The equivalent of no write cache in ZFS would be sync=always, which we could force, but it is not performant.

I am really glad that @tuxoko pointed that out because saying it looked good after reading just the patches was a major goof on my part. Under no circumstance should we turn that flag off.

As for zvols implementing flushes and FUA, the code will honor them if passed by Linux:

https://github.com/zfsonlinux/zfs/blob/master/module/zfs/zvol.c#L834

The FUA tag means to send the IO to ZIL.

Anyway, I withdraw my okay on this. The patch to disable the write cache flag needs to be dropped. We should probably add a comment explaining why the flag must always be set, so that no one else looking at the code makes the mistake of interpreting it as Linux doing a write cache. If you have a workload where disabling flushes and FUA is okay, then you can set sync=disabled on the zvol; it should do the same thing as your patch.
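
To make the flush/FUA semantics concrete, here is a hedged, simplified sketch of how a zvol write path can honor them via the ZIL. It mirrors the behavior in the zvol.c link above but is not the actual code, and it assumes a 4.8+ kernel where bi_opf, REQ_PREFLUSH, and REQ_FUA exist:

	static void
	zvol_write_sketch(zvol_state_t *zv, struct bio *bio)
	{
		/* A preflush (often an empty bio): commit everything written
		 * so far to stable storage via the intent log. */
		if (bio->bi_opf & REQ_PREFLUSH)
			zil_commit(zv->zv_zilog, ZVOL_OBJ);

		/* ... issue the write itself through the DMU here ... */

		/* FUA is the block-layer analogue of O_SYNC: this particular
		 * write must be stable before completion is signalled. */
		if (bio->bi_opf & REQ_FUA)
			zil_commit(zv->zv_zilog, ZVOL_OBJ);
	}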

@sempervictus
Contributor Author

@ryao & @tuxoko: with the cache commit removed, and the async BIO piece added to the original ZVOL PR, are we ok to merge the remainder?

@behlendorf
Contributor

@sempervictus when you get a chance, could you rebase this entire stack on master and force-update the PR?

@ryao ryao added the Component: ZVOL ZFS Volumes label Apr 26, 2017
@sempervictus sempervictus force-pushed the feature-minimize_redundant_volume_function branch from 8969c88 to e09e735 on April 26, 2017 21:51
@ryao
Contributor

ryao commented Apr 27, 2017

I have not seen the async bio piece, but the current changes minus the write cache patch are fine with me to merge.

Contributor

@behlendorf behlendorf left a comment


Aside from the build issue for modern kernels this all LGTM.

 	 */
-	zv->zv_queue->backing_dev_info.ra_pages = 0;
+	zv->zv_queue->backing_dev_info.ra_pages = 1;
Contributor


This introduces a build failure on recent kernels which needs to be addressed. It should also be squashed with the previous patch which entirely disabled read-ahead.

fs/zfs/zfs/zvol.c: In function ‘zvol_alloc’:
fs/zfs/zfs/zvol.c:1483:32: error: request for member ‘ra_pages’ in something not a structure or union
  zv->zv_queue->backing_dev_info.ra_pages = 1;
                                ^
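
For context, backing_dev_info changed from an embedded struct into a pointer member of struct request_queue in Linux 4.11, so the read-ahead assignment needs a compat guard along these lines (the HAVE_BLK_QUEUE_BDI_DYNAMIC configure-check name is an assumption; the in-tree fix may spell it differently):

	#ifdef HAVE_BLK_QUEUE_BDI_DYNAMIC
		zv->zv_queue->backing_dev_info->ra_pages = 1;	/* 4.11+ */
	#else
		zv->zv_queue->backing_dev_info.ra_pages = 1;	/* older kernels */
	#endif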

The current ZVOL implementation does not explicitly set merge
options on ZVOL device queues, which results in the default merge
behavior.

Explicitly set QUEUE_FLAG_NOMERGES on ZVOL queues allowing the
ZIO pipeline to do its work.

Initial benchmarks (tiotest with no O_DIRECT) show random write
performance going up almost 3X on 8K ZVOLs, even after significant
rewrites of the logical space allocation.
@behlendorf behlendorf force-pushed the feature-minimize_redundant_volume_function branch from e09e735 to 35973b5 on May 3, 2017 00:41
ryao and others added 2 commits May 2, 2017 17:42
Linux has read-ahead logic designed to accelerate sequential workloads.
ZFS has its own read-ahead logic called zprefetch that operates on both
ZVOLs and datasets. Having two prefetchers active at the same time can
cause overprefetching, which unnecessarily reduces IOPS performance on
CoW filesystems like ZFS.

Testing shows that entirely disabling the Linux prefetch results in
a significant performance penalty for reads while commensurate benefits
are seen in random writes. It appears that read-ahead benefits are
inversely proportional to random write benefits, and so a single page
of Linux-layer read-ahead appears to offer the middle ground for both
workloads.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Change the default ZVOL behavior so requests are handled asynchronously.
This behavior is functionally the same as in the zfs-0.6.4 release.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
@behlendorf
Contributor

@sempervictus I hope you don't mind but I took the liberty of rebasing this PR on master and addressing the remaining issues.

  • The read-ahead patches were squashed and updated to build against 4.11 and newer kernels.
  • The disable write merging patch was refreshed.
  • Added a patch to enable async request handling on zvols by default.

behlendorf pushed a commit that referenced this pull request May 4, 2017
The current ZVOL implementation does not explicitly set merge
options on ZVOL device queues, which results in the default merge
behavior.

Explicitly set QUEUE_FLAG_NOMERGES on ZVOL queues allowing the
ZIO pipeline to do its work.

Initial benchmarks (tiotest with no O_DIRECT) show random write
performance going up almost 3X on 8K ZVOLs, even after significant
rewrites of the logical space allocation.

Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: RageLtMan <rageltman@sempervictus>
Issue #5902
behlendorf pushed a commit that referenced this pull request May 4, 2017
Linux has read-ahead logic designed to accelerate sequential workloads.
ZFS has its own read-ahead logic called zprefetch that operates on both
ZVOLs and datasets. Having two prefetchers active at the same time can
cause overprefetching, which unnecessarily reduces IOPS performance on
CoW filesystems like ZFS.

Testing shows that entirely disabling the Linux prefetch results in
a significant performance penalty for reads while commensurate benefits
are seen in random writes. It appears that read-ahead benefits are
inversely proportional to random write benefits, and so a single page
of Linux-layer read-ahead appears to offer the middle ground for both
workloads.

Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Issue #5902
behlendorf added a commit that referenced this pull request May 4, 2017
Change the default ZVOL behavior so requests are handled asynchronously.
This behavior is functionally the same as in the zfs-0.6.4 release.

Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #5902
@behlendorf
Contributor

These tweaks do appear to improve performance for the tested workloads. They've been merged to master to facilitate a wider range of testing.

@behlendorf behlendorf closed this May 4, 2017