Subpar performance of RAIDZ, reads are slower than writes. #9375

Open
Maciej-Poleski opened this issue Sep 28, 2019 · 22 comments
Labels
Bot: Not Stale, Type: Performance

Comments

@Maciej-Poleski

System information

Type Version/Name
Distribution Name Gentoo Linux
Distribution Version amd64 (stable) 17.1/no-multilib
Linux Kernel 4.19.72-gentoo
Architecture x86_64
ZFS Version 0.8.2 (the same behavior in 0.8.1)
SPL Version N/A

Describe the problem you're observing

Performance is unsatisfactory: I cannot saturate the HDDs' full bandwidth, and reads reach less than 50% of write performance.

I have 8 ST4000VN008 drives. 6 of them perform according to specification when reading from the beginning of the disk (~180MiB/s); 2 of them under-perform at ~160MiB/s (should I RMA them?). Write performance is lower, at ~160MiB/s (the slow disks are slower in all tests).
The benchmark gives the same results when run on all disks in parallel.
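For reference, such a per-disk baseline can be taken with something like the following (the device name is one of the disks above; iflag=direct bypasses the page cache):

dd if=/dev/disk/by-id/ata-ST4000VN008-ZDR166_ZGY5C3W7 of=/dev/null bs=10M count=1000 iflag=direct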

These disks are configured as RAIDZ-2 with the following command:

zpool create -n -m /mnt/storage -o ashift=12 -o autoexpand=on -o autotrim=on \
-O acltype=posixacl -O atime=off -O compression=lz4 -O dedup=off -O dnodesize=auto \
-O encryption=aes-256-gcm -O keyformat=raw -O keylocation=file:///root/storage.key \
-O logbias=latency -O xattr=sa -O casesensitivity=sensitive storage raidz2 \
ata-ST4000VN008-ZDR166_ZGY5C3W7 ata-ST4000VN008-ZDR166_ZGY5E06J \
ata-ST4000VN008-ZDR166_ZDH7EMPY ata-ST4000VN008-ZDR166_ZDH7F08Z \
ata-ST4000VN008-ZDR166_ZDH7ESM1 ata-ST4000VN008-ZDR166_ZDH7FA7S \
ata-ST4000VN008-ZDR166_ZDH7F9P5 ata-ST4000VN008-ZDR166_ZDH7F9BN \
log nvme-INTEL_SSDPED1D280GA_PHMB7443018J280CGN

(but with compression disabled for benchmarks).

When the filesystem created above is benchmarked using

dd if=/dev/zero of=zero bs=10M

then

zpool iostat -vl 10

displays the following data:

                                                 capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub   trim
pool                                           alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait   wait
---------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
storage                                        24.0G  29.1T      1  10.3K  4.80K   951M   15ms    5ms   15ms    4ms    3us    2us      -  614us      -      -
  raidz2                                       24.0G  29.1T      1  10.3K  4.80K   951M   15ms    5ms   15ms    4ms    3us    2us      -  614us      -      -
    ata-ST4000VN008-2DR166_ZGY5C3W7                -      -      0  1.32K    818   119M   37ms    4ms   37ms    4ms    3us    4us      -  507us      -      -
    ata-ST4000VN008-2DR166_ZGY5E06J                -      -      0  1.31K    409   119M   50ms    4ms   50ms    4ms    3us    1us      -  548us      -      -
    ata-ST4000VN008-2DR166_ZDH7EMPY                -      -      0  1.27K    409   119M  196us    5ms  196us    4ms    3us    2us      -  665us      -      -
    ata-ST4000VN008-2DR166_ZDH7F08Z                -      -      0  1.34K  1.20K   119M    4ms    4ms    4ms    4ms    3us    1us      -  487us      -      -
    ata-ST4000VN008-2DR166_ZDH7ESM1                -      -      0  1.31K    409   119M   50ms    4ms   50ms    4ms    3us    1us      -  545us      -      -
    ata-ST4000VN008-2DR166_ZDH7FA7S                -      -      0  1.27K    818   119M  196us    5ms  196us    4ms    3us    1us      -  635us      -      -
    ata-ST4000VN008-2DR166_ZDH7F9P5                -      -      0  1.24K    818   119M  393us    5ms  393us    5ms    3us    1us      -  730us      -      -
    ata-ST4000VN008-2DR166_ZDH7F9BN                -      -      0  1.22K      0   119M      -    6ms      -    5ms      -    1us      -  818us      -      -
logs                                               -      -      -      -      -      -      -      -      -      -      -      -      -      -      -      -
  nvme-INTEL_SSDPED1D280GA_PHMB7443018J280CGN      0   260G      0      0      0      0      -      -      -      -      -      -      -      -      -      -
---------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----

Note: Sequential write peaks at 120MiB/s per HDD.
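(Sanity check: 951MiB/s across 8 disks is ~119MiB/s per disk; since 2 of the 8 disks' worth of bandwidth goes to parity, the user-data rate is roughly 6/8 × 951MiB/s ≈ 713MiB/s.)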

When reading

dd if=zero of=/dev/null bs=10M

then

zpool iostat -vl 10

displays the following data:

                                                 capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub   trim
pool                                           alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait   wait
---------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
storage                                        33.9G  29.1T  7.28K      0   362M      0    1ms      -    1ms      -    9ms      -  368us      -      -      -
  raidz2                                       33.9G  29.1T  7.28K      0   362M      0    1ms      -    1ms      -    9ms      -  368us      -      -      -
    ata-ST4000VN008-2DR166_ZGY5C3W7                -      -    943      0  44.0M      0    1ms      -    1ms      -    6ms      -  310us      -      -      -
    ata-ST4000VN008-2DR166_ZGY5E06J                -      -    964      0  44.8M      0    1ms      -    1ms      -    4ms      -  316us      -      -      -
    ata-ST4000VN008-2DR166_ZDH7EMPY                -      -    946      0  45.0M      0    1ms      -    1ms      -    6ms      -  332us      -      -      -
    ata-ST4000VN008-2DR166_ZDH7F08Z                -      -    911      0  46.2M      0    2ms      -    1ms      -   17ms      -  423us      -      -      -
    ata-ST4000VN008-2DR166_ZDH7ESM1                -      -    928      0  46.6M      0    2ms      -    1ms      -    9ms      -  453us      -      -      -
    ata-ST4000VN008-2DR166_ZDH7FA7S                -      -    887      0  44.3M      0    2ms      -    1ms      -    8ms      -  433us      -      -      -
    ata-ST4000VN008-2DR166_ZDH7F9P5                -      -    893      0  45.3M      0    1ms      -    1ms      -    9ms      -  346us      -      -      -
    ata-ST4000VN008-2DR166_ZDH7F9BN                -      -    984      0  45.8M      0    1ms      -    1ms      -   13ms      -  339us      -      -      -
logs                                               -      -      -      -      -      -      -      -      -      -      -      -      -      -      -      -
  nvme-INTEL_SSDPED1D280GA_PHMB7443018J280CGN      0   260G      0      0      0      0      -      -      -      -      -      -      -      -      -      -
---------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----

Note: Sequential read peaks at ~45MiB/s per HDD.
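(Sanity check: 362MiB/s aggregate is ~45MiB/s per disk. RAIDZ does not read parity for healthy data, so essentially all of this is user data, roughly a third of the per-disk write rate.)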

Further experimentation revealed the following information:

  • Changing VDEV type from RAIDZ-2 to RAIDZ-1 doesn't impact performance.
  • Reducing the vdev width from 8 disks to 6 doesn't impact performance.
  • zfs_vdev_raidz_impl is avx512bw (determined to be the fastest). Changing it to scalar doesn't impact performance.
  • Disabling encryption and checksums doesn't impact performance (but encryption does visibly impact CPU utilization).
  • Changing dnodesize to legacy doesn't impact performance.
  • Changing recordsize to 1M slightly increases performance (~130MiB/s write, 50-55MiB/s read).
  • Disabling hyper-threading decreases performance (<100MiB/s write).
  • When the disks are configured as a RAID0 equivalent (a stripe of single-disk vdevs) they perform better (~140MiB/s write, ~80MiB/s read).
  • When the disks are split into two sets (the fastest 4 and the slowest 4) and configured as two RAID0-equivalent pools, they perform slightly better than one RAID0 pool with 8 disks. There is no performance difference between the "fast" and "slow" pool, though.
  • When just one disk is used to create a pool, it performs reasonably close to underlying raw disk performance.

My build is roughly:

  • Xeon 4208 (8x 2.1GHz, AES-NI, AVX2, AVX512f, AVX512dq, AVX512cd, AVX512bw, AVX512vl)
  • X11SPL-F (Chipset C621, 8x SATA 6Gb/s onboard)
  • ST4000VN008 8x (4TB, 5900 RPM, 180MiB/s)

Describe how to reproduce the problem

Copy-paste the above commands. Note that they reference HDDs by specific serial numbers, so you will need to adjust the device names.

Include any warning/errors/backtraces from the system logs

None/Not aware of any.

@gmelikov
Member

You should look at IOPS too; please show iostat -x 1 for your disks during the tests. If %util is near 100%, you are already getting all the IOPS your disks can give. ZFS is a CoW filesystem, so on (nearly) every uncached read it must also read metadata, and reads will usually be random even when you read logically sequential data. So the larger the recordsize, the better your sequential read/write will be.
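For example, recordsize can be checked and raised per dataset (pool name taken from the report; it only affects newly written files):

zfs get recordsize storage
zfs set recordsize=1M storage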

And one more thing: sometimes one thread can't give you full pool performance. You may want to tune parameters for it, for example the prefetch read distance, https://github.com/zfsonlinux/zfs/wiki/ZFS-on-Linux-Module-Parameters#zfetch_max_distance . It depends on your load.
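A multi-job sequential read along these lines can show whether a single reader is the bottleneck (a sketch, assuming fio is available; directory and sizes are examples):

fio --name=seqread --directory=/mnt/storage --rw=read --bs=1M \
    --size=8G --numjobs=4 --ioengine=psync --group_reporting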

So it looks like this is not a bug.

@Maciej-Poleski
Author

Numbers vary greatly at 1-second intervals (the workload seems to be bursty). With the interval set to 10 seconds they are:

For write:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00   59.59    2.08    0.00   38.33

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
nvme0n1          0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
nvme1n1          1.20    0.00     92.40      0.00    21.90     0.00  94.81   0.00    0.83    0.00   0.00    77.00     0.00   0.75   0.09
sda              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdd              0.80 1326.40      0.80 120472.80     0.00     3.30   0.00   0.25   56.75    4.15   5.51     1.00    90.83   0.47  62.77
sde              0.80 1398.30      0.80 120432.00     0.00     2.10   0.00   0.15   67.25    3.55   4.96     1.00    86.13   0.42  58.29
sdf              0.70 1312.40      0.40 120686.40     0.00     2.20   0.00   0.17   76.71    4.32   5.64     0.57    91.96   0.49  63.87
sdg              0.90 1353.80      1.20 120470.80     0.00     2.70   0.00   0.20   68.11    4.11   5.57     1.33    88.99   0.47  63.77
sdh              0.60 1268.90      0.00 120447.20     0.00     2.60   0.00   0.20  104.33    5.01   6.36     0.00    94.92   0.55  70.17
sdi              0.90 1307.10      1.20 120485.20     0.00     2.60   0.00   0.20   53.78    4.31   5.64     1.33    92.18   0.49  64.03
sdb              0.90 1393.80      1.20 120481.20     0.00     2.90   0.00   0.21   58.33    3.64   5.08     1.33    86.44   0.42  59.01
sdc              0.70 1360.80      0.40 120433.20     0.00     2.10   0.00   0.15   55.71    3.89   5.31     0.57    88.50   0.45  61.39
dm-0            23.10    0.00     92.40      0.00     0.00     0.00   0.00   0.00    0.93    0.00   0.02     4.00     0.00   0.04   0.09
dm-1            23.10    0.00     92.40      0.00     0.00     0.00   0.00   0.00    0.93    0.00   0.02     4.00     0.00   0.04   0.09

for read:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00   27.94    3.30    0.00   68.76

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
nvme0n1          0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
nvme1n1          0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sda              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdd            835.40    0.00  42555.60      0.00     0.00     0.00   0.00   0.00    1.33    0.00   1.11    50.94     0.00   0.58  48.41
sde            896.60    0.00  42510.80      0.00     0.00     0.00   0.00   0.00    1.22    0.00   1.10    47.41     0.00   0.52  46.53
sdf            816.60    0.00  43789.60      0.00     0.00     0.00   0.00   0.00    1.62    0.00   1.31    53.62     0.00   0.69  55.98
sdg            877.50    0.00  41629.60      0.00     0.00     0.00   0.00   0.00    1.28    0.00   1.12    47.44     0.00   0.54  47.49
sdh            873.50    0.00  41980.80      0.00     0.10     0.00   0.01   0.00    1.32    0.00   1.15    48.06     0.00   0.56  48.86
sdi            850.10    0.00  43307.60      0.00     0.10     0.00   0.01   0.00    1.42    0.00   1.18    50.94     0.00   0.59  49.90
sdb            896.30    0.00  41473.20      0.00     0.10     0.00   0.01   0.00    1.09    0.00   0.98    46.27     0.00   0.48  43.36
sdc            898.10    0.00  41845.60      0.00     0.20     0.00   0.02   0.00    1.25    0.00   1.12    46.59     0.00   0.52  46.63
dm-0             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
dm-1             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00

Tuning zfetch_max_distance (default value is 8MiB; see the runtime command after this list) gives:

  • 24MiB gives 55-60 MiB/s per HDD
  • 80MiB gives 70-75 peaking to 90 MiB/s per HDD
  • 800MiB gives 130-140 MiB/s per HDD
  • 2400MiB gives 150-155 MiB/s with %util 90-95%
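The parameter can be changed at runtime; the value is in bytes, e.g. for the 80MiB case:

echo $((80 * 1024 * 1024)) > /sys/module/zfs/parameters/zfetch_max_distance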

@HiFiPhile

Hi,
I'm experiencing the same behavior in my raidz1 pool. Large-file read speed is only about 140MB/s, which is equal to the performance of one disk. The system is under no load, with more than 10GB RAM available.

System: Proxmox 6.0
Kernel: 5.0
CPU: Xeon E3-1285
RAM: 32GB 1600MHz ECC
Disk: 3* WD RED 3TB 5400RPM
HBA: SAS 9305-24i
ZFS: 0.8.1

pool: Workspace
 state: ONLINE
  scan: scrub repaired 0B in 0 days 09:59:32 with 0 errors on Sun Sep  8 04:14:16 2019
config:

        NAME                        STATE     READ WRITE CKSUM
        Workspace                   ONLINE       0     0     0
          raidz1-0                  ONLINE       0     0     0
            wwn-0x50014ee65a70ec05  ONLINE       0     0     0
            wwn-0x50014ee2b6deadfc  ONLINE       0     0     0
            wwn-0x50014ee264864e38  ONLINE       0     0     0

errors: No known data errors

During a large file read the disks are not fully loaded, as you can see:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.13    0.00    3.65   43.32    0.00   51.89

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sde             89.00    0.00  45568.00      0.00     0.00     0.00   0.00   0.00   34.58    0.00   2.89   512.00     0.00   5.44  48.40
sdf             91.00    0.00  46592.00      0.00     0.00     0.00   0.00   0.00   29.37    0.00   2.48   512.00     0.00   5.27  48.00
sdg             96.00    0.00  49152.00      0.00     0.00     0.00   0.00   0.00   40.56    0.00   3.70   512.00     0.00   5.33  51.20

And CPU usage is very low:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                     
 2493 mengsk    20   0 4259668  28548   9996 S   1.1   0.1   0:19.07 /usr/sbin/smbd --foreground --no-process-group                                              
 6352 root      20   0 4903636  47884   4968 S   1.0   0.1 708:21.85 /usr/bin/kvm -id 101 -name vsrvl -chardev socket,id=qmp,path=/var/run/qemu-server/101.qmp,+ 
 1936 root      39  19       0      0      0 S   0.1   0.0  25:09.83 [kipmi0]                                                                                    
 3072 root       0 -20       0      0      0 S   0.1   0.0   0:32.87 [z_rd_int]                                                                                  
 3074 root       0 -20       0      0      0 S   0.1   0.0   0:32.83 [z_rd_int]                                                                                  
 3075 root       0 -20       0      0      0 S   0.1   0.0   0:33.00 [z_rd_int]                                                                                  
 3076 root       0 -20       0      0      0 S   0.1   0.0   0:32.86 [z_rd_int]                                                                                  
 5825 www-data  20   0  357536 112404   9472 S   0.1   0.3   0:03.35 pveproxy worker                                                                             
 6467 root      20   0       0      0      0 S   0.1   0.0  23:59.06 [vhost-6352]                                                                                
   10 root      20   0       0      0      0 I   0.0   0.0   1:14.74 [rcu_sched]                                                                                 
  557 root       1 -19       0      0      0 S   0.0   0.0   0:54.80 [z_wr_iss]                                                                                  
  558 root       1 -19       0      0      0 S   0.0   0.0   0:54.79 [z_wr_iss]                                                                                  
  563 root       0 -20       0      0      0 S   0.0   0.0   0:28.06 [z_wr_int]                                                                                  
 3071 root       0 -20       0      0      0 S   0.0   0.0   0:32.80 [z_rd_int]                                                                                  
 3073 root       0 -20       0      0      0 S   0.0   0.0   0:32.89 [z_rd_int]                                                                                  
 3077 root       0 -20       0      0      0 S   0.0   0.0   0:32.76 [z_rd_int]                                                                                  
 3078 root       0 -20       0      0      0 S   0.0   0.0   0:32.83 [z_rd_int]                                                                                  
 6325 root      20   0  325804  68868   6456 S   0.0   0.2   0:14.49 pve-ha-crm                                                                                  
 6464 root      20   0       0      0      0 S   0.0   0.0  35:39.83 [vhost-6352]                                                                                
 6466 root      20   0       0      0      0 S   0.0   0.0  27:37.46 [vhost-6352]                                                                                
13163 root      20   0   11720   3492   2500 R   0.0   0.0   0:00.13 top                                                                                         
    1 root      20   0  170592   8028   4724 S   0.0   0.0   0:15.98 /sbin/init                                                                                  
    2 root      20   0       0      0      0 S   0.0   0.0   1:36.34 [kthreadd]                                                                                  
    3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 [rcu_gp]                                                                                    
    4 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 [rcu_par_gp]                                                                                
    6 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 [kworker/0:0H-kblockd]                                                                      
    8 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 [mm_percpu_wq]                                                                              
    9 root      20   0       0      0      0 S   0.0   0.0   0:04.41 [ksoftirqd/0]   

I'll tune zfetch_max_distance to see if it helps.

@Maciej-Poleski
Author

Returning after some more tuning and benchmarking.

In general I found that increasing:

  • zfetch_array_rd_sz
  • zfetch_max_distance
  • zfs_pd_bytes_max

to values of ~1G raises single-HDD throughput to 120-140MiB/s while decreasing IOPS. Going beyond that range seems to be difficult.
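To make such settings persistent across reboots, they can go into a modprobe config (a sketch; values in bytes, 1GiB shown):

# /etc/modprobe.d/zfs.conf
options zfs zfetch_max_distance=1073741824 zfetch_array_rd_sz=1073741824 zfs_pd_bytes_max=1073741824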
Before taking a look at zio_taskq_batch_pct (which is less convenient to adjust), I ran some tests with a sync workload:

dd if=/dev/zero of=zero bs=10M count=5000 oflag=sync

The result is roughly 90MiB/s, which is also the throughput observed on the Optane 900P SSD (the SLOG device). iostat -mx 5 reveals:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.01    0.00    6.65    0.01    0.00   93.33

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
nvme0n1          0.00  666.80      0.00     83.35     0.00     0.00   0.00   0.00    0.00    0.07   1.00     0.00   128.00   1.50  99.98
nvme1n1          0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sda              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdd              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sde              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdf              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdg              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdh              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdi              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdb              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdc              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
dm-0             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
dm-1             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
dm-2             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00

Note the low IOPS combined with 100% utilization. zpool iostat -vyr 5 shows:


storage                                          sync_read    sync_write    async_read    async_write      scrub         trim
req_size                                         ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
---------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512                                                0      0      0      0      0      0      0      0      0      0      0      0
1K                                                 0      0      0      0      0      0      0      0      0      0      0      0
2K                                                 0      0      0      0      0      0      0      0      0      0      0      0
4K                                                 0      0      6      0      0      0     46      0      0      0      0      0
8K                                                 0      0      0      0      0      0      3     20      0      0      0      0
16K                                                0      0      0      0      0      0    497     23      0      0      0      0
32K                                                0      0      0      0      0      0      0    116      0      0      0      0
64K                                                0      0      0      0      0      0      0    218      0      0      0      0
128K                                               0      0    695      0      0      0      0    149      0      0      0      0
256K                                               0      0      0      0      0      0      0     66      0      0      0      0
512K                                               0      0      0      0      0      0      0    113      0      0      0      0
1M                                                 0      0      0      0      0      0      0      0      0      0      0      0
2M                                                 0      0      0      0      0      0      0      0      0      0      0      0
4M                                                 0      0      0      0      0      0      0      0      0      0      0      0
8M                                                 0      0      0      0      0      0      0      0      0      0      0      0
16M                                                0      0      0      0      0      0      0      0      0      0      0      0
---------------------------------------------------------------------------------------------------------------------------------


raidz2                                           sync_read    sync_write    async_read    async_write      scrub         trim
req_size                                         ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
---------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512                                                0      0      0      0      0      0      0      0      0      0      0      0
1K                                                 0      0      0      0      0      0      0      0      0      0      0      0
2K                                                 0      0      0      0      0      0      0      0      0      0      0      0
4K                                                 0      0      6      0      0      0     46      0      0      0      0      0
8K                                                 0      0      0      0      0      0      3     20      0      0      0      0
16K                                                0      0      0      0      0      0    497     23      0      0      0      0
32K                                                0      0      0      0      0      0      0    116      0      0      0      0
64K                                                0      0      0      0      0      0      0    218      0      0      0      0
128K                                               0      0      0      0      0      0      0    149      0      0      0      0
256K                                               0      0      0      0      0      0      0     66      0      0      0      0
512K                                               0      0      0      0      0      0      0    113      0      0      0      0
1M                                                 0      0      0      0      0      0      0      0      0      0      0      0
2M                                                 0      0      0      0      0      0      0      0      0      0      0      0
4M                                                 0      0      0      0      0      0      0      0      0      0      0      0
8M                                                 0      0      0      0      0      0      0      0      0      0      0      0
16M                                                0      0      0      0      0      0      0      0      0      0      0      0
---------------------------------------------------------------------------------------------------------------------------------


ata-ST4000VN008-2DR166_ZGY5C3W7                  sync_read    sync_write    async_read    async_write      scrub         trim
req_size                                         ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
---------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512                                                0      0      0      0      0      0      0      0      0      0      0      0
1K                                                 0      0      0      0      0      0      0      0      0      0      0      0
2K                                                 0      0      0      0      0      0      0      0      0      0      0      0
4K                                                 0      0      0      0      0      0      6      0      0      0      0      0
8K                                                 0      0      0      0      0      0      0      2      0      0      0      0
16K                                                0      0      0      0      0      0     64      2      0      0      0      0
32K                                                0      0      0      0      0      0      0     16      0      0      0      0
64K                                                0      0      0      0      0      0      0     24      0      0      0      0
128K                                               0      0      0      0      0      0      0     20      0      0      0      0
256K                                               0      0      0      0      0      0      0      9      0      0      0      0
512K                                               0      0      0      0      0      0      0     12      0      0      0      0
1M                                                 0      0      0      0      0      0      0      0      0      0      0      0
2M                                                 0      0      0      0      0      0      0      0      0      0      0      0
4M                                                 0      0      0      0      0      0      0      0      0      0      0      0
8M                                                 0      0      0      0      0      0      0      0      0      0      0      0
16M                                                0      0      0      0      0      0      0      0      0      0      0      0
---------------------------------------------------------------------------------------------------------------------------------


ata-ST4000VN008-2DR166_ZGY5E06J                  sync_read    sync_write    async_read    async_write      scrub         trim
req_size                                         ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
---------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512                                                0      0      0      0      0      0      0      0      0      0      0      0
1K                                                 0      0      0      0      0      0      0      0      0      0      0      0
2K                                                 0      0      0      0      0      0      0      0      0      0      0      0
4K                                                 0      0      0      0      0      0      5      0      0      0      0      0
8K                                                 0      0      0      0      0      0      0      2      0      0      0      0
16K                                                0      0      0      0      0      0     60      4      0      0      0      0
32K                                                0      0      0      0      0      0      0     13      0      0      0      0
64K                                                0      0      0      0      0      0      0     25      0      0      0      0
128K                                               0      0      0      0      0      0      0     18      0      0      0      0
256K                                               0      0      0      0      0      0      0      8      0      0      0      0
512K                                               0      0      0      0      0      0      0     13      0      0      0      0
1M                                                 0      0      0      0      0      0      0      0      0      0      0      0
2M                                                 0      0      0      0      0      0      0      0      0      0      0      0
4M                                                 0      0      0      0      0      0      0      0      0      0      0      0
8M                                                 0      0      0      0      0      0      0      0      0      0      0      0
16M                                                0      0      0      0      0      0      0      0      0      0      0      0
---------------------------------------------------------------------------------------------------------------------------------


ata-ST4000VN008-2DR166_ZDH7EMPY                  sync_read    sync_write    async_read    async_write      scrub         trim
req_size                                         ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
---------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512                                                0      0      0      0      0      0      0      0      0      0      0      0
1K                                                 0      0      0      0      0      0      0      0      0      0      0      0
2K                                                 0      0      0      0      0      0      0      0      0      0      0      0
4K                                                 0      0      0      0      0      0      5      0      0      0      0      0
8K                                                 0      0      0      0      0      0      0      2      0      0      0      0
16K                                                0      0      0      0      0      0     68      1      0      0      0      0
32K                                                0      0      0      0      0      0      0     17      0      0      0      0
64K                                                0      0      0      0      0      0      0     26      0      0      0      0
128K                                               0      0      0      0      0      0      0     20      0      0      0      0
256K                                               0      0      0      0      0      0      0      6      0      0      0      0
512K                                               0      0      0      0      0      0      0     14      0      0      0      0
1M                                                 0      0      0      0      0      0      0      0      0      0      0      0
2M                                                 0      0      0      0      0      0      0      0      0      0      0      0
4M                                                 0      0      0      0      0      0      0      0      0      0      0      0
8M                                                 0      0      0      0      0      0      0      0      0      0      0      0
16M                                                0      0      0      0      0      0      0      0      0      0      0      0
---------------------------------------------------------------------------------------------------------------------------------


ata-ST4000VN008-2DR166_ZDH7F08Z                  sync_read    sync_write    async_read    async_write      scrub         trim
req_size                                         ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
---------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512                                                0      0      0      0      0      0      0      0      0      0      0      0
1K                                                 0      0      0      0      0      0      0      0      0      0      0      0
2K                                                 0      0      0      0      0      0      0      0      0      0      0      0
4K                                                 0      0      0      0      0      0      5      0      0      0      0      0
8K                                                 0      0      0      0      0      0      0      2      0      0      0      0
16K                                                0      0      0      0      0      0     68      2      0      0      0      0
32K                                                0      0      0      0      0      0      0     13      0      0      0      0
64K                                                0      0      0      0      0      0      0     30      0      0      0      0
128K                                               0      0      0      0      0      0      0     19      0      0      0      0
256K                                               0      0      0      0      0      0      0      7      0      0      0      0
512K                                               0      0      0      0      0      0      0     13      0      0      0      0
1M                                                 0      0      0      0      0      0      0      0      0      0      0      0
2M                                                 0      0      0      0      0      0      0      0      0      0      0      0
4M                                                 0      0      0      0      0      0      0      0      0      0      0      0
8M                                                 0      0      0      0      0      0      0      0      0      0      0      0
16M                                                0      0      0      0      0      0      0      0      0      0      0      0
---------------------------------------------------------------------------------------------------------------------------------


ata-ST4000VN008-2DR166_ZDH7ESM1                  sync_read    sync_write    async_read    async_write      scrub         trim
req_size                                         ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
---------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512                                                0      0      0      0      0      0      0      0      0      0      0      0
1K                                                 0      0      0      0      0      0      0      0      0      0      0      0
2K                                                 0      0      0      0      0      0      0      0      0      0      0      0
4K                                                 0      0      0      0      0      0      5      0      0      0      0      0
8K                                                 0      0      0      0      0      0      0      3      0      0      0      0
16K                                                0      0      0      0      0      0     66      2      0      0      0      0
32K                                                0      0      0      0      0      0      0     15      0      0      0      0
64K                                                0      0      0      0      0      0      0     33      0      0      0      0
128K                                               0      0      0      0      0      0      0     13      0      0      0      0
256K                                               0      0      0      0      0      0      0      9      0      0      0      0
512K                                               0      0      0      0      0      0      0     13      0      0      0      0
1M                                                 0      0      0      0      0      0      0      0      0      0      0      0
2M                                                 0      0      0      0      0      0      0      0      0      0      0      0
4M                                                 0      0      0      0      0      0      0      0      0      0      0      0
8M                                                 0      0      0      0      0      0      0      0      0      0      0      0
16M                                                0      0      0      0      0      0      0      0      0      0      0      0
---------------------------------------------------------------------------------------------------------------------------------


ata-ST4000VN008-2DR166_ZDH7FA7S                  sync_read    sync_write    async_read    async_write      scrub         trim
req_size                                         ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
---------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512                                                0      0      0      0      0      0      0      0      0      0      0      0
1K                                                 0      0      0      0      0      0      0      0      0      0      0      0
2K                                                 0      0      0      0      0      0      0      0      0      0      0      0
4K                                                 0      0      0      0      0      0      8      0      0      0      0      0
8K                                                 0      0      0      0      0      0      0      1      0      0      0      0
16K                                                0      0      0      0      0      0     68      1      0      0      0      0
32K                                                0      0      0      0      0      0      0     11      0      0      0      0
64K                                                0      0      0      0      0      0      0     25      0      0      0      0
128K                                               0      0      0      0      0      0      0     19      0      0      0      0
256K                                               0      0      0      0      0      0      0      9      0      0      0      0
512K                                               0      0      0      0      0      0      0     14      0      0      0      0
1M                                                 0      0      0      0      0      0      0      0      0      0      0      0
2M                                                 0      0      0      0      0      0      0      0      0      0      0      0
4M                                                 0      0      0      0      0      0      0      0      0      0      0      0
8M                                                 0      0      0      0      0      0      0      0      0      0      0      0
16M                                                0      0      0      0      0      0      0      0      0      0      0      0
---------------------------------------------------------------------------------------------------------------------------------


ata-ST4000VN008-2DR166_ZDH7F9P5                  sync_read    sync_write    async_read    async_write      scrub         trim
req_size                                         ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
---------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512                                                0      0      0      0      0      0      0      0      0      0      0      0
1K                                                 0      0      0      0      0      0      0      0      0      0      0      0
2K                                                 0      0      0      0      0      0      0      0      0      0      0      0
4K                                                 0      0      0      0      0      0      4      0      0      0      0      0
8K                                                 0      0      0      0      0      0      0      2      0      0      0      0
16K                                                0      0      0      0      0      0     47      3      0      0      0      0
32K                                                0      0      0      0      0      0      0     14      0      0      0      0
64K                                                0      0      0      0      0      0      0     26      0      0      0      0
128K                                               0      0      0      0      0      0      0     20      0      0      0      0
256K                                               0      0      0      0      0      0      0      7      0      0      0      0
512K                                               0      0      0      0      0      0      0     14      0      0      0      0
1M                                                 0      0      0      0      0      0      0      0      0      0      0      0
2M                                                 0      0      0      0      0      0      0      0      0      0      0      0
4M                                                 0      0      0      0      0      0      0      0      0      0      0      0
8M                                                 0      0      0      0      0      0      0      0      0      0      0      0
16M                                                0      0      0      0      0      0      0      0      0      0      0      0
---------------------------------------------------------------------------------------------------------------------------------


ata-ST4000VN008-2DR166_ZDH7F9BN                  sync_read    sync_write    async_read    async_write      scrub         trim
req_size                                         ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
---------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512                                                0      0      0      0      0      0      0      0      0      0      0      0
1K                                                 0      0      0      0      0      0      0      0      0      0      0      0
2K                                                 0      0      0      0      0      0      0      0      0      0      0      0
4K                                                 0      0      0      0      0      0      3      0      0      0      0      0
8K                                                 0      0      0      0      0      0      0      2      0      0      0      0
16K                                                0      0      0      0      0      0     52      3      0      0      0      0
32K                                                0      0      0      0      0      0      0     13      0      0      0      0
64K                                                0      0      0      0      0      0      0     25      0      0      0      0
128K                                               0      0      0      0      0      0      0     16      0      0      0      0
256K                                               0      0      0      0      0      0      0      6      0      0      0      0
512K                                               0      0      0      0      0      0      0     15      0      0      0      0
1M                                                 0      0      0      0      0      0      0      0      0      0      0      0
2M                                                 0      0      0      0      0      0      0      0      0      0      0      0
4M                                                 0      0      0      0      0      0      0      0      0      0      0      0
8M                                                 0      0      0      0      0      0      0      0      0      0      0      0
16M                                                0      0      0      0      0      0      0      0      0      0      0      0
---------------------------------------------------------------------------------------------------------------------------------


nvme-INTEL_SSDPED1D280GA_PHMB7443018J280CGN      sync_read    sync_write    async_read    async_write      scrub         trim
req_size                                         ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
---------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512                                                0      0      0      0      0      0      0      0      0      0      0      0
1K                                                 0      0      0      0      0      0      0      0      0      0      0      0
2K                                                 0      0      0      0      0      0      0      0      0      0      0      0
4K                                                 0      0      0      0      0      0      0      0      0      0      0      0
8K                                                 0      0      0      0      0      0      0      0      0      0      0      0
16K                                                0      0      0      0      0      0      0      0      0      0      0      0
32K                                                0      0      0      0      0      0      0      0      0      0      0      0
64K                                                0      0      0      0      0      0      0      0      0      0      0      0
128K                                               0      0    696      0      0      0      0      0      0      0      0      0
256K                                               0      0      0      0      0      0      0      0      0      0      0      0
512K                                               0      0      0      0      0      0      0      0      0      0      0      0
1M                                                 0      0      0      0      0      0      0      0      0      0      0      0
2M                                                 0      0      0      0      0      0      0      0      0      0      0      0
4M                                                 0      0      0      0      0      0      0      0      0      0      0      0
8M                                                 0      0      0      0      0      0      0      0      0      0      0      0
16M                                                0      0      0      0      0      0      0      0      0      0      0      0
---------------------------------------------------------------------------------------------------------------------------------

So we write to the SLOG in 128K blocks. Let's benchmark that directly with

dd if=/dev/zero of=/dev/disk/by-id/nvme-INTEL_SSDPED1D280GA_PHMB7443018J280CGN bs=128K oflag=sync

I get a throughput of 1.1GB/s and the following output from iostat -mx 5:



avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.03    0.00    2.62    3.36    0.00   93.99

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
nvme0n1          0.00 8605.40      0.00   1075.67     0.00 266769.60   0.00  96.88    0.00    0.08   1.00     0.00   128.00   0.12 100.00
nvme1n1          0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sda              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdd              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sde              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdf              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdg              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdh              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdi              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdb              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdc              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
dm-0             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
dm-1             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
dm-2             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00

As we can see, this SSD can handle far more IOPS at 128K, resulting in ~1075MB/s of write throughput.

What causes ZFS to saturate the SSD (100% util) at just ~700 IOPS, when dd can issue more than 10x as many at the same queue length and utilization?
I guess the answer is in the wrqm column, but stalling the SSD at 666.8 writes per second still seems wrong. I'm wondering if something is broken in my system.

After discovering these nice histograms I ran another iteration of the write benchmark:

dd if=/dev/zero of=zero bs=10M count=5000

To keep it short:


storage                                          sync_read    sync_write    async_read    async_write      scrub         trim
req_size                                         ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
---------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512                                                0      0      0      0      0      0      0      0      0      0      0      0
1K                                                 0      0      0      0      0      0      0      0      0      0      0      0
2K                                                 0      0      0      0      0      0      0      0      0      0      0      0
4K                                                 0      0     12      0      0      0    156      0      0      0      0      0
8K                                                 0      0      0      0      0      0     41      6      0      0      0      0
16K                                                0      0      0      0      0      0  6.65K    155      0      0      0      0
32K                                                0      0      0      0      0      0      0    668      0      0      0      0
64K                                                0      0      0      0      0      0      0    813      0      0      0      0
128K                                               0      0      0      0      0      0      0    637      0      0      0      0
256K                                               0      0      0      0      0      0      0    522      0      0      0      0
512K                                               0      0      0      0      0      0      0    492      0      0      0      0
1M                                                 0      0      0      0      0      0      0      0      0      0      0      0
2M                                                 0      0      0      0      0      0      0      0      0      0      0      0
4M                                                 0      0      0      0      0      0      0      0      0      0      0      0
8M                                                 0      0      0      0      0      0      0      0      0      0      0      0
16M                                                0      0      0      0      0      0      0      0      0      0      0      0
---------------------------------------------------------------------------------------------------------------------------------

I interpret it as follows (please correct me if I'm wrong): the ind column is the "pre-aggregation" value and agg is the "post-aggregation" one (thus only entries in the agg column issue I/Os). It strikes me that the above dd command with a block size of 10M results in I/Os of size 16K, while with the default 128K recordsize they should be somewhat bigger... I think this may be related to the inability to reach higher async-write throughput during the tests.

With that, I sort of have a solution to the "reads are slower than writes" problem. I still don't have a solution to the "subpar performance of RAIDZ" problem; I think it may be related to I/O size. The SLOG issue is also worrying: I actually get much better bandwidth (>300MiB/s) when the SLOG device is not present (see the commands sketched below for detaching it).
I used to suspect a scheduling issue or some form of lock contention (indicated by a noticeable performance difference when hyper-threading was switched off), but I think the phenomenon observed with the SLOG goes beyond that.
Reports of raidz read performance being lower than write performance are all over the internet. While some of those benchmarks have mistakes such as a 512B block size, there is data suggesting the cause may lie in software configuration (e.g. https://www.reddit.com/r/zfs/comments/8pm7i0/8x_seagate_12tb_in_raidz2_poor_readwrite/).
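
For the A/B comparison, a log device can be detached from (and re-added to) a live pool without data loss; <log-device> below is a placeholder for the exact name shown by zpool status:

zpool remove storage <log-device>        # detach the SLOG
zpool add storage log <log-device>       # re-attach it afterwards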

@jonathanspw

I'm facing the exact same issue. Random writes with 1M BS can sustain around 1GB/s on 8x 8TB Ultrastar 7200 HDDs. Sequential writes get me up to around 1200MB/s.
Random reads with 1M BS hover around 80-100MB/s. Sequential reads get me up to around 200MB/s, still far shy of the random/sequential write speeds.

@behlendorf behlendorf added the Type: Performance Performance improvement or performance problem label Jan 22, 2020
@craigyk

craigyk commented Jan 31, 2020

I have noticed this as well, 0.8.2 on 4.15.14

@scineram

scineram commented Jan 31, 2020

Random reads with 1M BS hover around 80-100MB/s. Sequential reads get me up to around 200MB/s, still far shy of the random/sequential write speeds.

What is wrong with that? That is around the IOPS a spinning disk can do.

@jonathanspw

Random reads with 1M BS hover around 80-100MB/s. Sequential reads get me up to around 200MB/s, still far shy of the random/sequential write speeds.

What is wrong with that? That is around the IOPS a spinning disk can do.

Not exactly. The IOPS on each disk are well under 100, far shy of the ~200-250 each of them can sustain, and does sustain when tested individually. Even iostat reports each drive as only 30-40% utilized. The limitation seems to be somewhere in ZFS, not in the hardware.

In all other benchmarks against ZFS (and in random reads against the disks directly), the disks max out at a sustained ~250 IOPS.

@scineram

scineram commented Feb 2, 2020

I was thinking about the 200 MB/s sequential. Wouldn't that be the max a single disk, and thus raidz, can do?

@drescherjm

drescherjm commented Feb 2, 2020

Doesn't raidz1 have a theoretical max sequential throughput of (n-1) × the STR of a single disk?

I have seen greater than 1GB/s sequential on my raidz3 arrays.

@yazun

yazun commented Mar 3, 2020

We had very similar results, and we had to drop ZFS from consideration after seeing this issue.

HP Apollo machines with P408i-p controllers. 256GB RAM, 40 cores.
CentOS 7, ZFS versions 0.7 to 0.8/master from December 2019.
Raidz, 4x6 4TB disks.

With XFS + the hardware controller we get 2.5-3GB/s sequential speeds (fio, dd), single-threaded.

With ZFS we got at most 300-400MB/s; after heavy tuning there were peaks around 700MB/s with 8+ threads. Write speeds were around 1.5GB/s, acceptable in our scenario (a parallel DB).

We would be happy to sacrifice some speed for the features, but this was a deal-breaker for us.

@wariole

wariole commented Mar 10, 2020

I have faced exactly the same behaviour with my 2 x 6 RaidZ2 pool (ZFS version 0.8.3-pve1 on Kernel 5.3.18-2-pve).
I described the issue here: https://forums.servethehome.com/index.php?threads/disappointing-zfs-read-performance-on-2-x-6-raidz2-and-quest-for-bottleneck-s.27716

Thank you very much @Maciej-Poleski: setting zfetch_max_distance to the maximum value of 2147483648 also got me the read speed I expected from my system.
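
For anyone who wants to try the same: on Linux the tunable lives under /sys/module/zfs/parameters and can be changed at runtime; the modprobe.d line is one (assumed) way to make it persistent across reboots:

echo 2147483648 > /sys/module/zfs/parameters/zfetch_max_distance
echo "options zfs zfetch_max_distance=2147483648" >> /etc/modprobe.d/zfs.conf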

@Hr46ph

Hr46ph commented May 31, 2020

I would like to chime in. I've been running benchmarks on my system after running into performance issues too.

System is HPe ML10 Intel(R) Xeon(R) CPU E3-1225 v5 @ 3.30GHz, 32 GB DDR4 ECC.
6 x 4TB spinning disks, Hitachi Deskstar and Ultrastar 7200rpm.

Before running the benchmarks, I installed a completely fresh Arch Linux system with kernel 5.6.15 and ZFS 0.8.4. I changed nothing in the config: a basic, minimal install with just the packages I needed to run the benchmarks and monitor performance.

I ran a full memtest86+ cycle to make sure the RAM is OK (it's ECC, but still).

To establish a baseline, I benchmarked each disk individually on 2048-aligned ext4 partitions, using fio in a loop with the following parameters (a reconstructed invocation is sketched after the list); the test file is deleted and caches are dropped between runs:

  • Test Filesize: 64 GB (double the RAM)
  • Modes: read, randread, write, randwrite
  • Blocksizes: 4K, 8K, 64K, 128K, 1M
  • queuedepth: 8
  • Jobs: 8
  • end_fsync: 1
  • ioengine: libaio
  • direct: 1
  • group reporting: 1
  • ramp_time: 120 seconds
  • runtime: 500
  • Time_based: 1 (makes fio run for 500 seconds whether or not it's done).
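
Reconstructed, the loop looks roughly like this (mount point and file name are placeholders):

for mode in read randread write randwrite; do
  for bs in 4K 8K 64K 128K 1M; do
    fio --name=baseline --filename=/mnt/test/fiofile --size=64G \
        --rw=$mode --bs=$bs --iodepth=8 --numjobs=8 --end_fsync=1 \
        --ioengine=libaio --direct=1 --group_reporting \
        --ramp_time=120 --runtime=500 --time_based
    rm -f /mnt/test/fiofile            # delete the test file between runs
    echo 3 > /proc/sys/vm/drop_caches  # drop caches between runs
  done
done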

I can post the results if you like, but believe me when I say these numbers are consistent across the board and completely within every reasonable expectation. They also match online test results.

I created a zfs pool as follows (a reconstructed create command is sketched after the list):

  • ashift=12
  • relatime=on
  • canmount=off
  • compression=lz4
  • xattr=sa
  • dnodesize=auto
  • acltype=posixacl
  • normalization=formD
  • raidz2
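
Reconstructed (pool name and device paths are placeholders; raidz2 over the six disks described above):

zpool create -o ashift=12 \
  -O relatime=on -O canmount=off -O compression=lz4 -O xattr=sa \
  -O dnodesize=auto -O acltype=posixacl -O normalization=formD \
  tank raidz2 \
  /dev/disk/by-id/disk1 /dev/disk/by-id/disk2 /dev/disk/by-id/disk3 \
  /dev/disk/by-id/disk4 /dev/disk/by-id/disk5 /dev/disk/by-id/disk6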

I created a zfs dataset for each of the following recordsizes: 4K, 8K, 64K, 128K, 1M. I then ran the fio loop with each blocksize on each dataset. This amounts to 20 tests on each dataset, a total of 100 tests across the board, 15 minutes per run, 5 hours per dataset and 25 hours to complete.
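
The dataset creation amounts to something like this (pool and dataset names are placeholders):

for rs in 4K 8K 64K 128K 1M; do
  zfs create -o recordsize=$rs tank/rs$rs
done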

The random read numbers on the pool:

Mode: RANDREAD     RS4K   RS8K  RS64K  RS128K   RS1M
4K Rand IOPS        591    432    240     222    232
8K Rand IOPS        345    416    278     246    187
64K Rand IOPS       140    331    272     243    230
128K Rand IOPS      241    155    235     245    165
1M Rand IOPS         88     80    109     116    168
Averages            281    283    227     214    196
4K Rand MiB/s       2.3    1.7    0.9     0.9    0.9
8K Rand MiB/s       2.7    3.3    2.2     1.9    1.5
64K Rand MiB/s      8.8   20.7   17.1    15.2   14.4
128K Rand MiB/s    30.2   19.5   29.4    30.7   20.7
1M Rand MiB/s      89.0   80.4  110.0   117.0  169.0
Averages           26.6   25.1   31.9    33.1   41.3

Compared to the single-disk speeds, only 4K was faster (about twice as fast) on my pool. From 8K and up it's pretty much single-disk speed, give or take here and there.

Random writes are a different story. Look at this:

Mode: RANDWRITE    RS4K   RS8K  RS64K  RS128K   RS1M
4K Rand IOPS       8905   5068   4829    5873   2200
8K Rand IOPS       6555  15500   2744    2819   2064
64K Rand IOPS       921   1813   3705     579    407
128K Rand IOPS      389    851   1794    2674    297
1M Rand IOPS         44     54    257     336    505
Averages           3363   4657   2666    2456   1095
4K Rand MiB/s      34.8   19.8   18.9    22.9    8.6
8K Rand MiB/s      51.2  121.0   21.4    22.0   16.1
64K Rand MiB/s     57.5  113.0  232.0    26.2   25.5
128K Rand MiB/s    48.7  106.0  224.0   334.0   37.2
1M Rand MiB/s      44.3   54.6  258.0   336.0  505.0
Averages           47.3   82.9  150.9   148.2  118.5

I don't know what to make of this. The IOPS are through the roof (unreal; each disk is capable of maybe 250-300 max?). The 4K and 8K MiB/s figures are also unrealistically high, but the rest seems decent and consistent with triple to quadruple single-disk speeds.

Again, I don't know what to make of this, but I would really like to find out whether I can get those random read speeds "up to speed", so to speak. I'm now running the above tests on a striped pool of 3 NVMe SSDs (which so far look abnormally slow, while their single-disk speeds are reasonable). After that is done I can experiment with tuning performance parameters (if I knew which ones to try).

@stale

stale bot commented Jun 2, 2021

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale No recent activity for issue label Jun 2, 2021
@stale stale bot closed this as completed Aug 31, 2021
@yazun

yazun commented Aug 31, 2021

Maybe the bot should reconsider, as no explanation or solution was given?

@behlendorf behlendorf reopened this Aug 31, 2021
@stale stale bot removed the Status: Stale No recent activity for issue label Aug 31, 2021
@behlendorf behlendorf added the Bot: Not Stale Override for the stale bot label Aug 31, 2021
@julmb
Contributor

julmb commented May 29, 2022

I am having a similar issue with my raidz2 pool of 6x ST16000NM001G HDDs. The individual drives can sustain >250 MB/s sequential reads or writes. My dataset uses 1M recordsize and lz4 compression.

During sequential writes, each drive is near its maximum sequential write performance and I get 987 MB/s on the dataset:

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
sda              0.00      0.00     0.00   0.00    0.00     0.00  250.00 256000.00     0.00   0.00   30.17  1024.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    7.54 100.00
sdc              0.00      0.00     0.00   0.00    0.00     0.00  252.00 247576.00     0.00   0.00   33.95   982.44    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    8.56 100.40
sdd              0.00      0.00     0.00   0.00    0.00     0.00  281.00 254728.00     0.00   0.00   29.11   906.51    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    8.18 100.00
sde              0.00      0.00     0.00   0.00    0.00     0.00  431.00 273456.00    16.00   3.58   17.23   634.47    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    7.43 100.40
sdf              0.00      0.00     0.00   0.00    0.00     0.00  257.00 262144.00     0.00   0.00   33.25  1020.02    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    8.55  99.60
sdg              0.00      0.00     0.00   0.00    0.00     0.00  465.00 255000.00    31.00   6.25   14.11   548.39    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    6.56 100.40

During sequential reads, each drive is stuck at around 180 MB/s and I get 707 MB/s on the dataset:

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
sda            717.00 183552.00     0.00   0.00    0.54   256.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.39 100.00
sdc              0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
sdd              0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
sde            717.00 183552.00     0.00   0.00    0.56   256.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.41  99.60
sdf            718.00 183808.00     0.00   0.00    0.54   256.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.39  99.60
sdg            717.00 183552.00     0.00   0.00    0.56   256.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.40  99.60

When I switch the recordsize to the default of 128kB, the drives reach their maximum sequential speeds during reading, giving 922 MB/s on the dataset.

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
sda              0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
sdc           8001.00 256032.00     0.00   0.00    0.18    32.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.44 100.00
sdd           8001.00 256032.00     0.00   0.00    0.16    32.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.32 100.00
sde           8000.00 256000.00     0.00   0.00    0.18    32.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.47 100.00
sdf           8003.00 256096.00     0.00   0.00    0.17    32.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.36 100.00
sdg              0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00

However, with the smaller recordsize, writes are now a lot slower, stuck around 160 MB/s per drive, for 677 MB/s on the dataset.

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
sda              0.00      0.00     0.00   0.00    0.00     0.00 1264.00 160416.00     3.00   0.24    5.31   126.91    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    6.72  88.40
sdc              0.00      0.00     0.00   0.00    0.00     0.00 1366.00 161184.00     5.00   0.36    4.88   118.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    6.67  94.00
sdd              0.00      0.00     0.00   0.00    0.00     0.00 1282.00 158496.00     5.00   0.39    5.24   123.63    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    6.72  89.60
sde              0.00      0.00     0.00   0.00    0.00     0.00 1235.00 160192.00     6.00   0.48    5.39   129.71    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    6.65  88.00
sdf              0.00      0.00     0.00   0.00    0.00     0.00 1479.00 160704.00    13.00   0.87    4.22   108.66    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    6.25  90.80
sdg              0.00      0.00     0.00   0.00    0.00     0.00 1546.00 162976.00     7.00   0.45    4.03   105.42    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    6.23  89.60

Why is reading fast with 128k records but slow with 1M records and the other way round for writing? I already tested higher values for zfetch_array_rd_sz and zfetch_max_distance but saw no difference in performance.

It also looks like raidz2 is only using 4 out of the 6 disks when reading. Why is that?

@julmb
Contributor

julmb commented May 29, 2022

I just noticed something even stranger. For my benchmarks, I had set primarycache=metadata on the dataset. Setting primarycache back to its default value (all) drastically improved read performance, despite the cache being completely empty!

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
sda              0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
sdb            921.00 260352.00     0.00   0.00    1.69   282.68    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.56  99.20
sdc            900.00 259072.00     0.00   0.00    2.62   287.86    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    2.36  99.60
sdd            944.00 260096.00     0.00   0.00    1.64   275.53    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.54  99.20
sde              0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
sdf            465.00 260864.00     0.00   0.00    6.43   561.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    2.99 100.00

This test was done immediately after booting, with the ARC being completely empty. I have no idea why an empty cache would improve read performance. Does ZFS read some blocks multiple times during a sequential read operation?
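
(For reference, the property switch in question; the dataset name is a placeholder, and all is the default value of primarycache:)

zfs set primarycache=metadata tank/data   # benchmark configuration: cache metadata only
zfs set primarycache=all tank/data        # default: cache both data and metadata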

EDIT: It kind of looks like the prefetcher is limited/inactive when ARC is set to metadata-only? With primarycache set to default, suddenly parameters like zfetch_max_distance do have an effect, and reduce the reads/second from 900 to 260. Although it seems that this hardly affects the read speeds, as the disk was already able to hit 260 MB/s even with 900 reads/second.

EDIT2: This also affects my NVME pool, where read performance is almost doubled from 1.8 GB/s to 3.2 GB/s simply by enabling primarycache (with the ARC being empty at the time). This is particularly silly considering that I do not actually want to cache any data from a pool capable of reading at 3.2 GB/s. I just want whatever auxiliary behavior is causing the pool to be able to read at these speeds in the first place.

@amotin
Member

amotin commented May 31, 2022

To read fast from a wide pool, as you have noticed, you do need some prefetch, either speculative by ZFS or explicit by the application. But with primarycache=metadata you are denying ZFS the use of the ARC for data, so it can't do speculative prefetch without the risk of the data being evicted from the ARC before it is actually used, which would be a waste. I agree that it would be good if the restriction were less strict.

@julmb
Contributor

julmb commented May 31, 2022

Yes, I was under the impression that ARC and prefetching were independent features. I learned just today that prefetching uses the ARC to do its job.

It really is unfortunate that we cannot have prefetching without also setting up the pool to use the ARC in general, the latter of which would be fairly wasteful for fast NVMe storage. It would be nice if there were a switch like primarycache=prefetchonly that allowed only the prefetcher access to the ARC. There even already exist module parameters zfs_arc_min_prefetch_ms and zfs_arc_min_prescient_prefetch_ms that govern ARC contents originating from the prefetcher specifically.

@IsaacVaughn

Perhaps the prefetcher could be an entirely separate property, so primarycache=metadata and primarycache=none could both work with and without prefetch? Could whatever tracking allows adaptive prefetch and the prefetch efficiency stats be checked before evicting ARC data?

@piotrminkina

ZFS read performance is a really strange thing. I have 4x HDDs in RAID10 (striped mirrors) as shown below. I ran fio tests on a dataset with compression=off, primarycache=metadata and secondarycache=metadata set.

The write performance in my case is around 300 MiB/s (the total write across all disks is almost 600 MiB/s), and here it all adds up, as a single-disk write outside ZFS is ~150 MiB/s. Read performance, on the other hand, is decidedly unsatisfactory. A single drive in my array outside of ZFS can read at ~200 MiB/s, so I would expect reads in fio to be closer to 800 MiB/s. Unfortunately, during testing, the read from the array is ~300 MiB/s (the total read from all drives is also ~300 MiB/s). I have tried a number of settings in /sys/module/zfs/parameters, but without any increase in performance; sometimes performance even drops.

I read in previous posts that enabling the cache in ZFS speeds up read performance even while the cache is still empty. This is exactly what happened! During an ongoing read performance test I turned the cache on and, suddenly, magic: transfers in fio immediately jumped from ~300 MiB/s to ~500 MiB/s, which is definitely an improvement but still lower than write performance. You can see exactly this in the graphs from netdata, which I've included below. The sudden jump in read performance (green colour) on the right side of the graph is precisely when I turned on the cache. Is this behaviour expected?
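
For what it's worth, prefetch activity can be watched via the ARC kstats while the test runs (Linux kstat path; I'm assuming the prefetch_* counters present in current OpenZFS):

grep '^prefetch' /proc/spl/kstat/zfs/arcstats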

My performance results:
[netdata graphs: read throughput (green) jumping from ~300 MiB/s to ~500 MiB/s at the moment the cache is enabled]

My array topology:

$ zpool status pool1
  pool: pool1
 state: ONLINE
  scan: resilvered 586M in 00:00:29 with 0 errors on Mon Jan 16 23:15:43 2023
config:

	NAME                                      STATE     READ WRITE CKSUM
	pool1                                     ONLINE       0     0     0
	  mirror-0                                ONLINE       0     0     0
	    22490305-8130-11eb-908c-d05099db41a7  ONLINE       0     0     0
	    2257669b-8130-11eb-908c-d05099db41a7  ONLINE       0     0     0
	  mirror-1                                ONLINE       0     0     0
	    3151952f-bd08-4982-8d38-1d43ea11368b  ONLINE       0     0     0
	    dd813ecd-5f9d-4fd8-8222-436f6fc3073e  ONLINE       0     0     0
	cache
	  208950e6-8130-11eb-908c-d05099db41a7    ONLINE       0     0     0

@amotin
Member

amotin commented Jan 30, 2023

@piotrminkina Until recently, disabling the data cache also disabled speculative data prefetch, since prefetch requires the cache to operate. I fixed this 3 weeks ago with #14243 in the master branch. If you need to run with primarycache=metadata, it should give you huge performance improvements.
