Subpar performance of RAIDZ, reads are slower than writes. #9375
Comments
You should look at IOPS too, please show them. And one more thing - sometimes 1 thread can't give you full pool performance. You may want to tune params for it, for example the read prefetch: https://github.com/zfsonlinux/zfs/wiki/ZFS-on-Linux-Module-Parameters#zfetch_max_distance . It depends on your load. So it looks like this is not a bug.
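For reference, a minimal sketch of how that prefetch parameter can be inspected and raised on ZFS on Linux; the 64 MiB value is only an illustration, not a recommendation:

```sh
# Current prefetch-ahead distance per stream, in bytes
cat /sys/module/zfs/parameters/zfetch_max_distance

# Raise it at runtime for streaming-read workloads (takes effect immediately, lost on reboot)
echo $((64 * 1024 * 1024)) > /sys/module/zfs/parameters/zfetch_max_distance
```

To persist a value across reboots it can go into a modprobe options file, e.g. `/etc/modprobe.d/zfs.conf`:

```sh
options zfs zfetch_max_distance=67108864
```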
Numbers vary greatly with 1-second intervals (the workload seems to be bursty). When the interval is set to 10 seconds they are as follows. For write:
for read:
Tuning
Hi,
During a large file read, as you can see, the disks are not under full load:
And CPU usage is very low:
I'll tune the prefetch parameter suggested above.
Returning after some more tuning and benchmarks. In general I found that increasing:
To values of ~1G increases single-HDD throughput to 120-140 MiB/s while decreasing IOPS at the same time. Going beyond that range seems to be difficult.
With a result of roughly 90 MiB/s. This is also the throughput of the Optane 900P SSD drive.
Note the low IOPS combined with 100% utilization.
So we write to the SLOG in 128K blocks. Let's benchmark this with
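The exact command didn't survive the copy; a `dd` invocation of roughly this shape (hypothetical device path, scratch device only) exercises the same 128K write pattern against the raw SSD:

```sh
# WARNING: writes directly to the named device; only run this against a scratch/unused disk.
# bs=128k matches the block size ZFS was observed issuing to the SLOG above.
dd if=/dev/zero of=/dev/nvme0n1 bs=128k count=100000 oflag=direct status=progress
```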
I get a throughput of 1.1 GB/s and the following output from
As we can see, this SSD is able to handle many more 128K-sized IOPS, resulting in ~1075 MB/s write performance. What causes ZFS to consume all the bandwidth of the SSD drive with just 700 IOPS, when dd can issue more than 10x as many at the same queue length and utilization? After discovering these nice histograms I ran another iteration of the write benchmark:
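The histogram output itself was lost in the copy; for reference, these are the `zpool iostat` flags that produce such request-size and latency histograms (the pool name `tank` is an assumption):

```sh
# Per-vdev request-size histograms (size distribution of IOs actually issued to the disks)
zpool iostat -r tank 10

# Per-vdev latency histograms
zpool iostat -w tank 10

# Average latencies and queue depths per vdev
zpool iostat -lv tank 10
```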
To keep it short:
I interpret it as follows (please correct me if I'm wrong): column

With that I kind of have a solution to the problem "reads are slower than writes". I still don't have a solution to the problem "Subpar performance of RAIDZ". I think it can be related to IO size. The issue with the SLOG is also worrying. I actually get much better bandwidth (>300 MiB/s) when the SLOG device is not present.
I'm facing the exact same issue. Random writes with a 1M block size can sustain around 1 GB/s on 8x 8TB Ultrastar 7200 HDDs. Sequential write gets me up to around 1200 MB/s.
I have noticed this as well, with 0.8.2 on kernel 4.15.14.
What is wrong with that? That is around the IOPS a spinning disk can do.
Not exactly. The IOPS on each disk are well under 100, far shy of the ~200-250 they can each sustain, and do sustain individually. Even iostat reports each drive as only 30-40% utilized. The limitation seems to be in ZFS somewhere, not the hardware. In all other benchmarks against ZFS (and random read against the disks directly) the disks max out at a sustained ~250 IOPS.
I was thinking about the 200 MB/s sequential. Wouldn't that be the max a disk, and thus raidz, can do?
Doesn't raidz1 have a theoretical max sequential throughput of (n-1) × the STR of a single disk? I have seen greater than 1 GB/s sequential on my raidz3 arrays.
We had very similar results and had to drop ZFS from consideration after seeing a similar issue. HP Apollo machines with P408i-p controllers, 256 GB of RAM, 40 cores. With XFS plus the HW controller we get 2.5-3 GB/s sequential speeds (fio, dd), single threaded. With ZFS we got at most 300-400 MB/s; after heavy tuning there were peaks around 700 MB/s with 8+ threads. Write speeds were around 1.5 GB/s, which is acceptable in our scenario (parallel DB). We would be happy to sacrifice some speed for the features, but this was a deal-breaker for us.
I have faced exactly the same behaviour with my 2 x 6 RaidZ2 pool (ZFS version 0.8.3-pve1 on kernel 5.3.18-2-pve). Thank you very much @Maciej-Poleski: setting
I would like to chime in. I've been running benchmarks on my system after running into performance issues too. The system is an HPE ML10 with an Intel(R) Xeon(R) CPU E3-1225 v5 @ 3.30GHz and 32 GB DDR4 ECC. Before running the benchmarks, I installed a completely fresh system with Arch Linux, kernel 5.6.15 and zfs 0.8.4. I changed nothing in the config, simply installed with basic and minimal settings and just the packages I needed to run the benchmarks and monitor performance. I ran a full memtest86+ cycle to make sure the RAM is okay (it's ECC, but still). To establish a baseline, I benchmarked each disk individually with 2048-aligned ext4 partitions, using fio in a loop with the following parameters. The test file gets deleted and caches dropped between each run:
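The fio parameters themselves were stripped from the quote above; a loop of roughly this shape (file location, size and queue depth are assumptions; the 15-minute runtime and 20-test count come from the description below) matches a per-disk ext4 baseline with caches dropped between runs:

```sh
# Assumed mount point of the 2048-aligned ext4 partition under test
TESTDIR=/mnt/disk-under-test

for bs in 4k 8k 64k 128k 1m; do
  for rw in randread randwrite read write; do
    fio --name="${rw}-${bs}" --directory="$TESTDIR" --size=10G --bs="$bs" \
        --rw="$rw" --ioengine=libaio --direct=1 --iodepth=32 \
        --runtime=900 --time_based --group_reporting
    rm -f "$TESTDIR"/${rw}-${bs}*        # fio names its data files after the job
    sync; echo 3 > /proc/sys/vm/drop_caches
  done
done
```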
I can post the results if you like, but believe me when I say these numbers are consistent across the board, and completely within every reasonable expectation. They also match online test results. I created a zfs pool as follows:
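The actual `zpool create` command was also lost in the copy; purely as an illustration (placeholder device names, and the topology is a guess since it isn't stated above), the pool plus the per-recordsize datasets described next could be set up like this:

```sh
# Placeholder devices and layout: substitute the real /dev/disk/by-id paths and vdev layout
zpool create -o ashift=12 tank raidz1 \
    /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2 \
    /dev/disk/by-id/ata-DISK3 /dev/disk/by-id/ata-DISK4 /dev/disk/by-id/ata-DISK5

# One dataset per recordsize under test
for rs in 4K 8K 64K 128K 1M; do
  zfs create -o recordsize="$rs" "tank/bench-$rs"
done
```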
I created a zfs dataset for each of the following recordsizes: 4K, 8K, 64K, 128K, 1M. I then ran the fio loop with each blocksize on each dataset. This amounts to 20 tests per dataset and 100 tests across the board; at 15 minutes per run that is 5 hours per dataset and 25 hours in total. The random read numbers on the pool:
Compared to the single-disk speeds, only the 4K was faster (about twice as fast) on my pool. From 8K and up it's pretty much single-disk speed, give or take here and there. Random writes are a different story. Look at this:
I don't know what to make of this. The IOPS are through the roof (unreal; each disk is capable of maybe 250-300 max?). The 4K and 8K MiB/s are also unrealistically high, but the rest seems decent and consistent with triple to quadruple single-disk speeds. Again, I don't know what to make of this, but I would really like to find out whether I can get those random read speeds "up to speed", so to speak. I'm currently running the same tests on a striped NVMe pool of 3 SSDs (which so far look abnormally slow, while their single-disk speeds are reasonable). After that is done I can experiment with tuning performance parameters (if I know which ones).
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
Maybe the bot should reconsider, as no explanation or solution was given?
I am having a similar issue with my raidz2 pool of 6x ST16000NM001G HDDs. The individual drives can sustain >250 MB/s sequential reads or writes. My dataset uses 1M recordsize and lz4 compression. During sequential writes, each drive is near its maximum sequential write performance and I get 987 MB/s on the dataset:
During sequential reads, each drive is stuck at around 180 MB/s and I get 707 MB/s on the dataset:
When I switch the recordsize to the default of 128kB, the drives reach their maximum sequential speeds during reading, giving 922 MB/s on the dataset.
However, with the smaller recordsize, writes are now a lot slower, stuck around 160 MB/s per drive, for 677 MB/s on the dataset.
Why is reading fast with 128k records but slow with 1M records, and the other way round for writing? I already tested higher values for

It also looks like raidz2 is only using 4 out of the 6 disks when reading. Why is that?
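One way to check how reads are spread across the member disks while the benchmark runs (the pool name `tank` is an assumption) is to watch per-vdev statistics:

```sh
# Per-disk bandwidth and IOPS at 10-second intervals
zpool iostat -v tank 10

# Per-disk request-size histograms, useful for spotting disks that receive little or no read I/O
zpool iostat -r tank 10
```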
I just noticed something even stranger. For my benchmarks, I had set
This test was done immediately after booting, with the ARC being completely empty. I have no idea why an empty cache would improve read performance. Does ZFS read some blocks multiple times during a sequential read operation?

EDIT: It kind of looks like the prefetcher is limited/inactive when the ARC is set to metadata-only? With

EDIT2: This also affects my NVMe pool, where read performance is almost doubled from 1.8 GB/s to 3.2 GB/s simply by enabling
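A hedged sketch of how the two states can be compared (the dataset name is an assumption): toggle the property and watch the prefetcher's kstat counters while a sequential read runs:

```sh
# Allow data (not just metadata) back into the ARC, which also lets data prefetch work effectively
zfs set primarycache=all tank/data
zfs get primarycache tank/data

# Speculative prefetcher counters; hits should climb during a sequential read if prefetch is active
cat /proc/spl/kstat/zfs/zfetchstats

# ARC-level prefetch hit/miss counters
grep prefetch /proc/spl/kstat/zfs/arcstats
```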
To read fast from a wide pool, as you have noticed, you do need some prefetch, either speculative by ZFS or explicit by the application. But with primarycache=metadata you are denying ZFS the use of the ARC for data, so it can't do speculative prefetch without the risk of the data being evicted from the ARC before it is actually used, which would be a waste. I agree that it would be good if the restriction were less strict.
Yes, I was under the impression that ARC and prefetching were independent features. I learned just today that prefetching uses the ARC to do its job. It really is unfortunate that we cannot have prefetching without also setting up the pool to use the ARC in general, the latter of which would be fairly wasteful for fast NVMe storage. It would be nice if there was a switch like
Perhaps the prefetcher could be an entirely separate property, so primarycache=metadata and primarycache=none could both work with and without prefetch? Could whatever tracking allows adaptive prefetch and the prefetch efficiency stats be checked before evicting ARC data?
ZFS read performance is a really strange thing. I have 4x HDDs in RAID10 (striped mirrors) as below. I run

The write performance in my case is something like 300 MiB/s (the total write across all disks is almost 600 MiB/s), so here it all adds up, as a single-disk write outside ZFS is ~150 MiB/s. As far as read performance is concerned, it is decidedly unsatisfactory. A single drive in my array outside of ZFS can read at ~200 MiB/s, so I would expect reads in

I have read in previous posts that enabling the cache in ZFS speeds up read performance, even if the cache is still empty. This is exactly what happened! During an ongoing read performance test I turned the cache on and suddenly, magic! Immediately the transfers in

My array topology:
@piotrminkina Until recently, disabling the data cache also disabled speculative data prefetch, since prefetch does require the cache for its operation. I fixed that 3 weeks ago with #14243 in the master branch. If you need to run with primarycache=metadata, it should give you huge performance improvements.
System information
Describe the problem you're observing
Performance is not satisfactory. I cannot saturate the HDDs' full bandwidth. Reads are less than 50% of write performance.
I have 8x ST4000VN008. 6 of them perform according to specification when reading from the beginning of the disk (~180 MiB/s). 2 of them under-perform at ~160 MiB/s (should I RMA?). Write performance is lower, at ~160 MiB/s (the slow disks are slower in all tests).
The benchmark provides the same results when run on all disks at the same time in parallel.
These disks are configured as RAIDZ-2 with the following command:
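The original command isn't reproduced above (it referenced the drives by serial number); an illustrative equivalent with placeholder `/dev/disk/by-id` names would look like this. Only the RAIDZ-2 layout of the eight drives is stated in the report; the `ashift`, `compression` and dataset settings here are assumptions:

```sh
# Placeholder device IDs: the real command used the by-id paths of the 8x ST4000VN008 drives
zpool create -o ashift=12 -O compression=lz4 tank raidz2 \
    /dev/disk/by-id/ata-ST4000VN008-SERIAL1 /dev/disk/by-id/ata-ST4000VN008-SERIAL2 \
    /dev/disk/by-id/ata-ST4000VN008-SERIAL3 /dev/disk/by-id/ata-ST4000VN008-SERIAL4 \
    /dev/disk/by-id/ata-ST4000VN008-SERIAL5 /dev/disk/by-id/ata-ST4000VN008-SERIAL6 \
    /dev/disk/by-id/ata-ST4000VN008-SERIAL7 /dev/disk/by-id/ata-ST4000VN008-SERIAL8

# For the benchmarks, compression was disabled on the test dataset
zfs create -o compression=off tank/bench
```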
(but with compression disabled for benchmarks).
When the filesystem created above is benchmarked using
then
displays the following data:
Note: Sequential write peaks at 120MiB/s per HDD.
When reading
then
displays the following data:
Note: Sequential read peaks at ~45MiB/s per HDD.
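For completeness, a sequential benchmark of roughly this shape (assumed fio, mountpoint, sizes and job count; the original commands aren't shown above) reproduces the write and read patterns described:

```sh
# Sequential write against the test dataset (compression disabled as noted above)
fio --name=seqwrite --directory=/tank/bench --size=64G --bs=1M \
    --rw=write --ioengine=libaio --iodepth=4 --numjobs=1 --group_reporting

# Clear the written data from the ARC (e.g. export/import the pool or reboot), then sequential read
fio --name=seqread --directory=/tank/bench --size=64G --bs=1M \
    --rw=read --ioengine=libaio --iodepth=4 --numjobs=1 --group_reporting
```

(`--direct=1` is deliberately omitted here: O_DIRECT handling on ZFS datasets differs from other filesystems, so buffered I/O is the safer default for a sketch like this.)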
Further experimentation revealed the following information:
- `zfs_vdev_raidz_impl` is `avx512bw` (determined to be the `fastest`). Changing it to `scalar` doesn't impact performance.
- Setting `dnodesize` to `legacy` doesn't impact performance.
- Setting `recordsize` to `1M` slightly increases performance (~130 MiB/s write, 50-55 MiB/s read).

My build is roughly:
Describe how to reproduce the problem
Copy-paste the above commands. Note that they depend on availability of HDDs with specific S/Ns, so you will need to adjust.
Include any warning/errors/backtraces from the system logs
None/Not aware of any.