High CPU usage by "z_fr_iss" after deleting large files #3976
@blind-oracle I have a feeling it may have to do with dedup and/or its interaction with snapshots, or possibly the pool geometry. Are there any snapshots on the pool? How effective is the dedup with this set of (how many?) files? What's the storage stack underneath the dm vdevs? I did run a quick sanity test by removing 4 50GiB files from a 34TiB (non-raidz) pool with no dedup, and there was no excessive CPU use by the z_fr_iss threads.
Yep, I thought about dedup - maybe it's recklessly cleaning up references in the ARC... But it shouldn't be so CPU-hungry, I think. The DM vdevs are dm-crypt-encrypted 3TB HDDs, plus SSDs for the SLOG; I don't think the problem is with them. There are no snapshots, nothing at all. It's a newly created pool with ~10 HD movies copied to it for testing. They are all different files, so the dedup ratio is around 1.00x. Pool and FS parameters:
Here's the picture from top taken after removing all files:
Load average 33, 56 running threads, 0.0% idle. P.S.
The main problem persists, and the system didn't become much more responsive (maybe because these ZFS threads run at -20 nice priority).
@blind-oracle After re-running the test with dedup=sha256, I think I can see what's happening. Over time, both the upstream and ZoL have fiddled with the zio taskq thread counts and priorities. I'll note that on the system on which I ran the test there were 80 threads available, so it was still oversubscribed, but not nearly as badly. I also had no other load present, so it managed the freeing pretty quickly.

Based on this issue and others which may be related, it seems we may need to consider doing something about either or both the number of zio threads and their priority. I'm considering opening a more general issue which summarizes the various problems seen since the updates mentioned above.
Aha.. Can this number of threads be tuned (or be made tunable) through module parameters? BTW, I think I've found the root problem. My initial kernel was built with the voluntary preemption model. Now I've rebuilt it with no preemption, and after deleting ~90GB of files the CPU is hogged in this way for only about 10 seconds, not for a minute or more like before. So the preemption model has a major influence on ZoL performance in this particular situation.
@blind-oracle I just analyzed a flame graph of a 500GiB deletion on a filesystem with dedup, and what happens is that the 96 z_fr_iss threads spend most of their time contending with one another rather than doing useful work.

Your observation regarding preemption makes sense: since you've oversubscribed the available cores, with preemption there will be plenty of extra context switching and waiting as the 96 tasks contend for only 24 CPU threads. With no preemption, the 24 that are actually running at once will likely be able to finish their work uninterrupted. Presumably this didn't have as much impact in the past, when we erroneously ran the zio taskqs at a lower priority. My test ran faster on my standard testing kernel, which also uses PREEMPT_VOLUNTARY, likely because the 96 tasks were contending for 80 threads. As I alluded to originally, this all brings into question the current taskq parameters (quantity and/or priority). I am going to re-run my own tests with CONFIG_PREEMPT_NONE; I don't expect to see a huge difference because 96 isn't that much greater than 80.

As to your other question, there's currently no way to tune the number of zio threads without editing spa.c.
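For context, the thread counts being discussed come from the per-I/O-type taskq table in module/zfs/spa.c. A simplified sketch of roughly what that table looked like in the 0.6.5 era follows (entries abbreviated and possibly not exact); the ZTI_P(12, 8) entry for FREE/ISSUE is what produces the 96 z_fr_iss threads:

```c
/*
 * Simplified excerpt of the zio taskq table in module/zfs/spa.c
 * (0.6.5-era layout; exact entries may differ slightly).
 * ZTI_P(12, 8) means 8 taskqs with 12 threads each, i.e. 96 threads.
 */
const zio_taskq_info_t zio_taskqs[ZIO_TYPES][ZIO_TASKQ_TYPES] = {
	/* ISSUE	ISSUE_HIGH	INTR		INTR_HIGH */
	{ ZTI_ONE,	ZTI_NULL,	ZTI_ONE,	ZTI_NULL },	/* NULL */
	{ ZTI_N(8),	ZTI_NULL,	ZTI_P(12, 8),	ZTI_NULL },	/* READ */
	{ ZTI_BATCH,	ZTI_N(5),	ZTI_P(12, 8),	ZTI_N(5) },	/* WRITE */
	{ ZTI_P(12, 8),	ZTI_NULL,	ZTI_ONE,	ZTI_NULL },	/* FREE */
	{ ZTI_ONE,	ZTI_NULL,	ZTI_ONE,	ZTI_NULL },	/* CLAIM */
	{ ZTI_ONE,	ZTI_NULL,	ZTI_ONE,	ZTI_NULL },	/* IOCTL */
};
```

Changing the FREE row's ZTI_P(12, 8) to ZTI_P(6, 4), as described in the next comment, shrinks the z_fr_iss pool from 96 to 24 threads.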
Okay, the role of preemption looks clear to me now, thanks. I've modified the thread/taskq count in spa.c from (12, 8) to (6, 4) to match my hardware. Maybe we should add some logic that analyzes the CPU count on module load and adjusts the thread count accordingly?
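A minimal sketch of the kind of CPU-count-based scaling being suggested here (purely illustrative; zti_scale_to_cpus() is a hypothetical helper, not existing ZoL code):

```c
#include <linux/cpumask.h>	/* num_online_cpus() */

/*
 * Hypothetical helper: shrink a (threads-per-taskq, taskq-count) pair
 * until the total number of threads no longer exceeds the online CPUs.
 * With 24 CPUs, the default (12, 8) would become (6, 4).
 */
static void
zti_scale_to_cpus(unsigned int *threads, unsigned int *count)
{
	unsigned int ncpus = num_online_cpus();

	while (*threads * *count > ncpus && (*threads > 1 || *count > 1)) {
		if (*threads >= *count && *threads > 1)
			*threads /= 2;	/* e.g. 12 -> 6 */
		else if (*count > 1)
			*count /= 2;	/* e.g. 8 -> 4 */
	}
}
```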
@blind-oracle I'm not sure what the proper solution to this general problem is, which is why I'm thinking of opening a new, more general issue regarding the type of thrashing which can occur when the physical cores are oversubscribed. There are clearly a number of issues at play here, not the least of which are the particular spinlock implementation, the preemption model, etc. For my part, I'm planning on running a new round of high-concurrency tests with various combinations of kernel settings. I'm also keen on testing with a 4.2 kernel now that queued spinlocks have been made the default. They might also help your situation, but the author (of the queued spinlock implementation) states that the big wins come into play on >= 4 socket systems. It's also not clear to me whether ZoL is supposed to "officially" support fully preemptible (CONFIG_PREEMPT) kernels.

In light of this issue, I'll be adding a large dedup unlink workload to my set of tests. One other note regarding future work in this area: there's clearly a desire for ZoL to stay as close to the upstream as possible. It would seem this issue ought to affect all OpenZFS implementations, but the impact would depend on all the other kernel-related variables I alluded to above.
I'm curious whether the following issue (https://www.illumos.org/issues/6288, #3987) could be somewhat related; the symptoms are at least partially similar: high load, low performance.
It would be a start for mid-size boxen. I wonder whether performance and responsiveness would be better in all cases if the number of threads stayed below the CPU count (or the number of CPU threads, with SMT). If the performance hit weren't too large, a working state in all cases would be preferable to occasional service interruptions and outages.
@kernelOfTruth In this particular case, I don't think those issues are related. The CPU time is all being burned in spinlock contention.
@dweeezil I see, thanks! I did a quick search, and reports mention that snapshot handling and dedup are particularly tricky, as you mentioned above (though that doesn't apply here), and that memory requirements are rather high even when only small amounts of data need deduplication (e.g. several Windows VMs with already-compressed NTFS in them). I also remember having read in the past that using both compression and deduplication tends to make things especially slow. Related finds:
All of this while considering how an ARC works (https://en.wikipedia.org/wiki/Adaptive_replacement_cache) and adding these features - truly remarkable!
I have an experimental approach to tamping down on the CPU and IOPS used by large deduped removals at #3725, which attempts to throttle these operations and keep the rest of the system responsive. It does mean that these removes take a good deal longer, but that's often much preferable to the system going away.
@dweeezil
@tuxoko Interesting idea. At the very least, it looks like it might help the case described in this issue.
Another way to avoid this would be to put a customizable limit on the number of z_fr_iss threads in the module parameters and set the default limit to the number of available CPU threads; this may avoid the excessive load and allow each user to tune the values for their own needs.
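A rough sketch of what such a module parameter might look like (the name zfs_free_max_threads and its semantics are hypothetical, not an existing ZoL tunable):

```c
#include <linux/module.h>
#include <linux/cpumask.h>

/* 0 means "default to the number of online CPU threads". */
static unsigned int zfs_free_max_threads = 0;
module_param(zfs_free_max_threads, uint, 0644);
MODULE_PARM_DESC(zfs_free_max_threads,
	"Max number of z_fr_iss (free/issue) threads (0 = number of CPUs)");

/* Hypothetical cap consulted when the free/issue taskqs are created. */
static unsigned int
zfs_free_thread_limit(void)
{
	return (zfs_free_max_threads != 0 ?
	    zfs_free_max_threads : num_online_cpus());
}
```

It could then be set at load time, e.g. with options zfs zfs_free_max_threads=24 in /etc/modprobe.d/zfs.conf.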
Problem: after deleting several large files (~50GB in total), ZFS goes crazy on CPU usage.
For about a minute it hogs all cores (it's 2 x 6-core E5-2620 with HT, 24 CPUs total) at 100%, with about 48 running "z_fr_iss_X" threads (or more, hard to say).
Then, when they are done, txg_sync runs for some time using 30% of one core, but that's OK, I presume.
Without a PREEMPT kernel this literally blocks all userspace activity for that period of time.
I can switch to PREEMPT, as it's now supported by ZFS, but I'd rather not.
Kernel 4.1.12, ZoL 0.6.5.3, 256GB RAM.
ZFS created with:
Pool size is ~21TB, and it's empty except for these test files.
If any more info is required, just ask.