Very high cpu load writing to a zvol #7631
@akschu try decreasing the default number of zvol threads. You can do this by setting the zvol_threads module option.
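For reference, a minimal sketch of how that module option can be set (the value 8 is just an illustration, not a recommendation; paths may differ by distro):

```sh
# Persist across reboots; takes effect the next time the zfs module is loaded.
echo "options zfs zvol_threads=8" >> /etc/modprobe.d/zfs.conf

# Or pass it once when (re)loading the module manually:
modprobe zfs zvol_threads=8
```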
This machine has two 12-core CPUs, so 24 physical cores and 48 logical ones listed in Linux due to hyperthreading. I can do some benchmarking to help out, but I can already tell that 1 or 2 threads is going to absolutely kill performance. The single-zvol-thread test has been running for over half an hour and the zvol is only hitting 10% CPU. What I'm not understanding is why reading/writing to/from a fs/zvol on the same dataset is so different in CPU load and performance when only the direction is different. It seems like I should see nearly identical performance. If an image file on the file system is 5 times faster than a zvol, that leads me to believe there is some other issue, or perhaps I shouldn't use a zvol. Anyway, I'll benchmark a little more, but some information about why the direction matters so much would be helpful for me, and probably for others who see their machine brought to its knees by nothing more than a zvol write.
@akschu it would be helpful to run your benchmark with a few different zvol_threads values so we can see how the load and throughput scale.
I tried to cancel my single-zvol-thread dd test after an hour: I hit ctrl-c and then kill -9, and it's stuck. I'd say that on my system a single-thread zvol isn't even usable. In the meantime, here is what the system is busy doing:
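(For anyone wanting to capture the same kind of picture of where the CPU time goes, a generic approach, not necessarily the exact tooling used here, is to sample kernel stacks system-wide:)

```sh
# Sample all CPUs for 10 seconds, including kernel stacks, then summarize.
perf record -a -g -- sleep 10
perf report --stdio | head -50
```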
I'll have more information tomorrow. Thanks for the help and for working on ZFS.
I'd like to point out that dd-ing data to a block device is essentially just filling memory with data and then relying on the kernel to flush it out, which effectively subverts the various throttles built into ZFS. Try adding oflag=direct to the dd command.
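For example (the image path is the one mentioned in this thread, the zvol name is a placeholder):

```sh
# oflag=direct bypasses the page cache, so dd is throttled by the zvol itself
# instead of dirtying huge amounts of memory and handing the flush to the kernel.
dd if=/datastore/vm/dng-smokeping/dng-smokeping.raw of=/dev/zvol/datastore/dng-smokeping bs=1M oflag=direct
```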
@akschu did you export/import the pool between the tests? The reverse copy could well have been served purely from ARC (given it is big enough), without any reads from the physical disks, which could explain the speedup (and especially the lower load) on the reverse copy. Also: if the prerequisites for the nop_write feature (a strong checksum and active compression) are met on datastore/vm, the data would not actually be written to /datastore/vm/dng-smokeping/dng-smokeping.raw at all, since identical data already exists there (ZFS detects this and drops the rewrites, so nothing needs to be written to the physical disks). This would be another way to explain the speedup (and part of the lower load) you noticed on the copy back from the zvol to the image on the filesystem.
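A quick way to check whether those nop_write prerequisites hold on the dataset in question (a sketch; adjust the dataset name as needed):

```sh
# nop_write needs a cryptographically strong checksum (e.g. sha256) and
# compression enabled on the destination dataset.
zfs get checksum,compression datastore/vm
```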
@dweeezil that might be the issue, because when I use that flag, load averages are way more normal and I only get a single zvol thread regardless of how many I specify when loading the zfs module. The reason for dd was that I was just trying to reduce the test to the lowest common denominator and determine whether this was a qemu issue or a zfs issue. When I saw the exact same behavior in dd, I figured it was probably a zfs issue and reported it here. The command that I was running when I discovered this was:
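A representative invocation (domain name, disk target and zvol path are placeholders, not the exact command) would look something like:

```sh
# Copy the running domain's disk onto a zvol, then pivot the domain to the new storage.
virsh blockcopy dng-smokeping vda /dev/zvol/datastore/dng-smokeping --blockdev --wait --verbose --pivot
```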
This command simply copies a virtual machine's raw image to a block device and then pivots the storage, migrating from a raw image to a zvol. When I copy from the file to the zvol it pretty much grinds the system to a halt due to the crazy high load average. When I copy from the zvol to a file, it's very fast with little load. Not sure how to go about fixing this in the real world: the workload that qemu creates with a blockcopy writing to a zvol absolutely crushes the server. I think I might limit zvol_threads to 8 and see if that imposes a threading throttle that prevents the copy from crushing the machine.
@GregorKopka, no, I wasn't exporting/importing between tests. I can try that. Your theory makes sense as to why it's so much faster in one direction than the other. oflag=direct seems to confirm it as well: with that flag set I see only 40% better read performance than write, instead of 300% better without it, so caching is certainly playing a role. I think the big issue for me is that writing a lot of data to a zvol hurts the machine enough to cause outages. That's why I reported it as a bug: I don't think writing to a zvol should be able to completely take out the system, especially not a 24-core machine.
I am using cache=none and I did post to the mailing list without reply. Once I saw a simple dd crushing the server, I figured it would be reasonable to open a bug. Here is the benchmarking data behlendorf asked for; perhaps it will be of some use:
Seems like anything over 8 threads on my system just results in higher load averages.
@akschu After reading #7787, I ran your fio test from there to see what's happening. For reference, the command is:
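A close approximation of that test (pool/zvol names and block size are assumptions, based on the 8 jobs × iodepth 8 description below) is:

```sh
fio --name=zvoltest --filename=/dev/zvol/tank/testvol \
    --rw=write --bs=4k --ioengine=libaio --direct=1 \
    --numjobs=8 --iodepth=8 --runtime=60 --time_based --group_reporting
```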
Before going into my findings, I'd like to repeat a bit of history here for anyone else following along with these and related issues: ZoL originally had the "zvol" taskqs, but they were eventually removed in the restructuring of 37f9dac. Later, due to their apparent need in some workloads, they were reinstated in 692e55b (issue #5824), but a tunable was added to revert to the previous behavior (see below).

It looks like an fio test like yours is an example of a workload that's hurt by the zvol taskqs. It spawns 8 processes, each of which attempts to maintain an IO depth of 8. This yields very high load averages due to excessive taskq dispatch and the use of spinlocks. Here's a flame graph showing the excessive CPU use:

Setting zvol_request_sync=1 reverts to the old behavior, in which requests are handled synchronously in the submitting thread rather than dispatched to the taskqs.

Now for the performance numbers (the pool is simply 32 7200RPM drives on the same SAS expander, arranged as 16 2-drive mirrors). Here are the relevant fio stats for the first case:
It wrote 400.2MB/s at 100K IOPS. For the second case, in which the zvol taskqs were not used:
It wrote 661.2MB/s at 165K IOPS. Finally, since you had mentioned how much faster it was to use the file system, I created a regular file on a filesystem in the same pool and ran the same fio test against it:
It wrote 549.0MB/s at 137K IOPS, somewhat worse than the zvol-without-taskq case, and its CPU utilization was also similar. If there's a problem here, I think it is whether the zvol taskqs should be used by default at all. Finally, I'll note that your use of libvirt's "blockcopy" command is likely not using direct IO, which means it will suffer from all the ills you'd see when performing non-direct bulk writes to any other block device, but with the extra penalty of the overhead from all the taskqs.
@dweeezil using --rw=write is a sequential workload and is therefore subject to merging at the block layer, so the number of IOPS reported by fio is not the number of IOPS seen by the zvol. directio doesn't really matter for zvols per se; it is there to satisfy the aio engine. One approach that we use elsewhere is to size the number of threads by the number of CPUs: if you only have a few CPUs, then adding a bunch of threads won't help.
Also, it is worth noting that flame graphs, by default, elide idle time, so you have to compare the actual rates between two different flame graphs.
I wasn't actually terribly interested in the absolute performance numbers, particularly given what this test is actually doing. I think the most interesting finding is that much of the excessive CPU time is spent on spinlock contention while dispatching to the taskq. I'm still not convinced that the zvol taskqs are beneficial for most (user-land) workloads.
Agreed. I've tried to add some wisdom to the wiki.
Agreed. We should consider promoting the module option to a dataset property so this can be controlled per volume. To reduce the lock contention we could additionally make the zvol taskqs per volume, instead of global, and decrease the default number of threads.
Making them per-dataset would potentially result in more total threads, but it would also entirely prevent lock contention between two volumes being actively used. Alternatively, we could do something per-pool like the zio taskqs, where tasks are spread over multiple taskqs. That would let you bound the total number of threads and significantly reduce the contention. We'd want to experiment with what works best.
I think for my workload (virtual machines, each in their own dataset) having the ability to set the number of taskqs per dataset would help a lot. That way I can spread the taskqs between VMs instead of having one VM stampede all of the taskqs and make all of the other VMs unusable. The currently proposed workaround of zvol_request_sync=1 certainly lowers the load average, but I'm not sure how much that will help if all of the I/O is consumed by one VM. I'll be testing some stuff tonight when it's less intrusive.
Setting zvol_request_sync=1 will ensure that each zvol can only accept one I/O request at a time; the others will queue at the block layer. In some ways, for multiple zvols, this is like round-robin scheduling, but not really, since the amount of effort and resources needed to handle an I/O can vary widely.
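On the ZoL versions discussed here, this can be toggled at runtime through the module's sysfs parameter (a sketch; it only persists until reboot unless also set in modprobe.d):

```sh
echo 1 > /sys/module/zfs/parameters/zvol_request_sync
cat /sys/module/zfs/parameters/zvol_request_sync   # verify the new value
```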
I'm seeing the same thing as the OP on my server. I noticed that zvols take up no space when first created and space is continuously allocated from the pool as the zvol is written to. My system load also goes through the roof (about 5x my core count), and the high CPU usage (~85%) in this situation is in kernel mode. I need to go back and double-check, but I don't believe there is much of a CPU hit at all when reading/writing to the volume while no allocation is being done.
I was running into similar issues on one server out of 4 similar ones doing similar tasks (just backup storage). When the zvol processes went crazy on my end they would make the system completely inaccessible, and the issue only seemed to occur when gzip compression was enabled. The gzip threads weren't what went haywire though; it was always zvol itself. I limited the number of threads to equal the number of HT CPU cores and so far the issue hasn't shown itself again. Solid red is what it did to the CPU (system time gone haywire). The gap/odd data is when it was totally unresponsive. I rebooted it and set the new thread limit and it's been fine. The time before/after the solid red chunks is when everything was running fine; when the solid red ends is when I rebooted the server and changed the thread limit to 12. This is with an E5-2620 v3 and 8x 6TB drives.
I wouldn't mind some guidance here also. I'm running a pair of EPYC 7502P machines with NVMe storage; whenever I do any form of heavy writing to a zvol (be it qemu-img or a storage migration) I see extremely high load averages and the machine gets quite unresponsive. I think my saving grace is having 64 threads on hand, otherwise the machine would be completely unresponsive. I've seen comments about limiting the number of zvol threads, as well as setting zvol_request_sync and/or disabling compression on the pool. Can someone steer me in the right direction, perhaps a safe zvol thread limit to start with?
@fixyourcodeplease222 it looks like I've got 'none' on all the zvols:
@akschu How did this end up going for you? Did you see any improvement?
@shaneshort start here then; once you've determined whether you care about the "load average" and know what it is showing for your experiment, look at: There are many untunable parts and pieces that we know will not scale well to large CPU counts. So if you notice the thundering herds and can isolate them to specific operations (reads, writes, zil, prefetch), it would help in developing more scalable algorithms for thread counts.
Hi @richardelling, thanks for the input. I'd actually read Brendan's post before; the load average comment was simply to mention that I was seeing similar behaviour. I actually had user complaints of poor performance when doing a simple single-threaded volume copy from one pool to another. Best I can tell, things hum along nicely (and the ARC starts to fill), then the copy stalls and I see the load average skyrocket, with other performance suffering as well. My best guess is that the zvol thread count is simply overwhelming the underlying I/O subsystem with writes (on both MLC and TLC based flash volumes), which adversely affects its ability to read/write from the pool. As this machine is in production and moving workloads off it is a bit of a pain, I'll see if I can set up a similar machine in my lab and reproduce it there. For now I plan to limit the zvol threads to 8 as well as setting zvol_request_sync to 1 to see if that helps in the interim. Thanks again for your reply, it's certainly appreciated!
This is actually quite easy for me to reproduce. Because of I/O contention, I decided to migrate all my VMs to raw image files, away from zvols. Example:
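A representative conversion (paths are placeholders, not the exact command) would be something like:

```sh
# Convert a zvol-backed disk to a raw image file on a ZFS filesystem.
qemu-img convert -p -O raw /dev/zvol/tank/vm-disk-0 /tank/images/vm-disk-0.raw
```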
The above zvol was 40GiB and the conversion basically shut down the entire zpool until it completed. This is on a pair of Samsung 1TB SSDs in RAID1.
@shaneshort the number of threads is not related to the number of concurrent I/Os submitted to devices. The latter is controlled by the ZIO scheduler and, by default, capped at 10 I/Os per device per I/O class, so it is unlikely to be related to the performance problem you see. Finally, if you set zvol_request_sync=1, then only one zvol thread will be used. Limiting the number of threads to 1 can reduce the thundering herd, since there won't be a lot of threads waiting to become active, but this can be very difficult to measure because the issues it causes are not directly visible to the OS. It will also mean your outstanding number of read I/Os is directly limited to <=1 per device per volume. Whether this helps your situation or not is difficult to predict. NB, the issue tracker is not an appropriate place for discussion; the mailing list is better.
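Those per-device, per-class limits are exposed as module parameters; assuming a 0.7/0.8-era module, they can be inspected like this:

```sh
# Show the ZIO scheduler's current min/max active I/O limits per class.
grep . /sys/module/zfs/parameters/zfs_vdev_*_active
```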
I'd just like to leave a comment on this: after many different iterations of testing, I've had to abandon zvols for storage for the moment, as any attempt to do any kind of sequential writing tanks the machine and causes I/O stalls in other VMs. Storing raw files inside a ZFS directory doesn't have this issue. I've been able to replicate this on multiple machines now, so my conclusion is that zvols on ZoL have some kind of scheduling defect making them unusable for my application. If someone would like to work with me on attempting to find a solution, let me know.
In my tests the situation could be improved by opening the zvol with O_DIRECT. Furthermore (at least in my setup) the load spiked to zvol_threads + n, with n between 1 and 3. (Linux 5.9, ZFS 0.8.6)
I just encountered this on current Debian Sid when trying to use zvols for my kvm machines. On running a benchmark in the guests, the whole host freezes up. At first I thought it was a RAM issue (the OOM killer was triggered), so I massively reduced the amount of hugepages and ARC and dropped caches to ensure I had enough free RAM. It didn't help; the host still freezes, especially on writes (tested with dd and CrystalDiskMark in a VM which lives on this storage). The disk being benchmarked is an NVMe drive (Intel 660p), if that could be a problem. I changed the module parameter zvol_threads from 32 (default) to 2 and it runs much better (though not perfectly). The distro is Debian Sid (up-to-date as of 2021-01-13). I was wondering, could this be because my system only has 2 cores, so running 32 threads just produces bad results?
I have similar issues. The ZFS storage (Debian Stretch + backports) is connected via iSCSI (LIO) to VMware ESXi; a lot of machines are running (~120) and it works fine until I try to restore a machine (with Veeam backup): the load goes up, the machine becomes unresponsive, the LUNs become inaccessible from the ESXi servers (3), and then all VMs crash.
Yeah, I've basically given up on zvols with ZoL; it seems they're broken and there's no real interest in getting them fixed. I might suggest using OmniOS or something Solaris/BSD-based.
I'm looking forward to bcachefs ;-)
Could the ones having problems try Proxmox PVE on the same hardware? It uses ZFS zvols for VMs and may not (?) suffer from this.
I first reported the issue using Proxmox.
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
Please don't close this issue; to the best of my knowledge it's very much still a problem, with no clear fix or workaround apparent.
System information:
Distro is customized slackware64-14.2
Kernel 4.14.49
zfs/spl 0.7.9-1
(2) E5-2690 v3 CPUs
HP P440ar raid controller (using ZFS for volume management/compression)
Also tried on (with same results):
Distro is customized slackware64-14.2
Kernel 4.9.101
zfs/spl 0.7.9-1
(1) E3-1230 CPU
LSI 2008 in IT mode with 4 SAS disks.
The issue is that I get poor write performance to a zvol, and the zvol kernel threads burn lots of CPU, causing very high load averages on the machine. At first I was seeing the issue in libvirt/qemu while doing a virtual machine block copy, but I reduced it down to this:
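(A sketch of that reduced test; the image and zvol paths are placeholders, not the exact command:)

```sh
# Bulk-copy a raw VM image onto a zvol through the page cache.
dd if=/datastore/vm/image.raw of=/dev/zvol/datastore/testvol bs=1M
```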
Speed isn't great, but the real issue is that the load average goes through the roof:
Now, if I go the opposite direction it's much faster and the load average isn't nearly as high:
There is only a single zvol, and the load average is normal:
What is also interesting is that both of these things are on the same dataset:
So I'm not sure what to look at. As it is right now, I can't really write to a zvol without killing the machine, so I'm using raw disk images on a mounted ZFS filesystem to avoid the double COW.
Thanks!