Stuck in IOWAIT state on multiple CPUs #3668
Comments
My send has frozen again after 9 hours. The system hasn't locked up, but there is a good bit of latency when accessing the ZFS mounts. One CPU is again at 100% iowait. This is with nothing else except the send running.
I know this doesn't particularly help narrow down the issue, but... I was only able to get the system to run for up to 9 hours before, but now I'm trying out DeHackEd's ABD + master branch (https://github.com/DeHackEd/zfs/commits/dehacked-bleedingedge2), and I've been copying for over 24 hours without an issue. My issue also sounds a bit like #3680 and #3676. In #3676 it was mentioned that you would see that happen if you couldn't write to the destination, but I have no issue writing to or accessing it.
@angstymeat any ideas which commits in that repo you think are helping you here vs. what you were running? I am hitting this exact same issue and would love a fix.
@angstymeat @eolson78 it looks like you're having a similar ARC collapse issue as described in #3680. The fix for this proposed by @dweeezil was merged to master this morning. If you could pull the latest master source and verify it resolves the issue, that would be helpful.
Thanks, I will give it a try when I can next reboot (might be a day or so). I'm in the process of verifying a 5.2TB send stream in preparation for recreating my pool with new hard drives. In the meantime, DeHackEd's bleeding-edge2 ABD branch has been working very well.
My system blew up with
With the latest master: I was running for about an hour and then I noticed that
I rebooted and I'm trying again.
Got the system to slow down to a crawl in about 20 minutes. I tried dropping caches and my send speed went back up to around 25MB/s for a while, but when the drop finished, `arc_reclaim` ran continuously for several more minutes while my send dropped back down to a few MB/s.
For the record, running my send under the ABD branch ran continuously at around 50MB/s despite the rest of the processes I had running. ... Around 10 minutes after the above, my memory filled up again. Speed is slow again. Running ... Long story short, so far performance under memory pressure seems to suffer a lot more than it does with ABD. I'm going to let everything continue running untouched for the next couple of hours and take a look at it again.
@angstymeat did you make sure to update the SPL code as well? That's where the fix was. Also, your previous data showed that the ARC size had collapsed. Is that still the case?
Yes, I always recompile SPL along with ZFS even when it hasn't changed. It's currently version

I forgot about the ARC collapse; I was looking more for iowait-state freezes. Looking at it, I'm not seeing a sudden collapse, although my ARC is currently just 388MB in size. I'm now trying the same job that I was running when I originally experienced the issue. The ARC is currently filling and I will let you know what happens.
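The ARC size figures quoted in this thread come from the ZFS kstats. A minimal sketch of reading them, assuming the standard ZFS-on-Linux layout (on a live system the numbers come from `/proc/spl/kstat/zfs/arcstats`; a hypothetical sample snippet stands in below so the pipeline can be shown end to end):

```shell
# Hypothetical arcstats excerpt; on a real system this comes from
# /proc/spl/kstat/zfs/arcstats. 406847488 bytes is roughly the 388MB
# figure mentioned above (the other values are made up for illustration).
sample='name type data
size 4 406847488
c 4 16717379584
c_max 4 16717379584'
# The "size" row is the current ARC size in bytes; convert it to MiB:
printf '%s\n' "$sample" | awk '$1 == "size" { printf "%d\n", $3 / 1048576 }'
# prints 388
```

On a live system the same one-liner, pointed at the real file, can be wrapped in `watch -n1` to observe a collapse as it happens.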
OK, when I posted this issue the ARC collapsed in about 20 minutes. I've been running this for about 45 minutes and my ARC is still full. I'm going to let it run overnight, since I was able to grind the system to a halt within 9 hours of starting the send in my previous attempts.
OK, it's been running for over 20 hours without an issue.
@angstymeat what's the status of this issue? Has the initial issue been resolved by the latest code? Are there other issues here?
It's been running with this commit just fine. No issues with the ARC collapsing. I don't want to muddy the issue, but it looks like there's a lot more

I would say that this issue can be closed now.
@angstymeat yes, I think that's to be expected. ABD should enable significant additional improvements when it gets merged after the tag. I'm glad we got this sorted out; closing this issue.
This is to replace issue #3654, which I messed up pretty badly with some confused ramblings because my zswap stats weren't being calculated correctly.
Summary:
I've seen this happen on two systems, but I am going to be reporting on the system that is easier for me to work on...
After the system gets low on memory after running for a while, my ARC drains and I eventually reach a state where the ZFS filesystems become inaccessible. This eventually slows the whole system down to the point where it is essentially frozen. While I can still log in, I see CPUs stuck in the IOWAIT state according to `htop`.

This system is a Dell R515 server with 32GB of memory, running Fedora 22 with the 4.0.8-300.fc22.x86_64 kernel, SPL 0.6.4-18_g8ac6ffe, and ZFS 0.6.4-184_g6bec435.
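The stuck-in-IOWAIT symptom can also be confirmed without `htop`. A minimal sketch, assuming the standard Linux `/proc/stat` layout (a sample line with made-up numbers stands in for the live file here):

```shell
# /proc/stat per-CPU lines look like: cpu<N> user nice system idle iowait irq ...
# The 6th field (after the cpu label) is cumulative iowait time in jiffies;
# a CPU pinned in iowait shows that counter climbing while the others stall.
sample='cpu0 4705 150 1120 16250 21845 0 12 0 0 0'
printf '%s\n' "$sample" | awk '{ printf "%s iowait jiffies: %s\n", $1, $6 }'
# prints "cpu0 iowait jiffies: 21845"
```

Sampling the real file twice a second apart and differencing the counters gives the same per-CPU iowait percentage `htop` displays.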
At first I thought this was #3637, but that was supposed to be fixed in the commit I am using.
The operation I'm running is a ZFS send of a 6.2TB filesystem to a compressed stream on an NFS mount. The command I'm running is `zfs send -R storage@backup | mbuffer | pigz > /ext/storage.zfs`. The `mbuffer` is there for me to watch the transfer rate and the totals.

I started this from a fresh reboot. For the first 20 minutes, ARC usage increased until it was full. Memory usage spiked to 99% of my 32GB (16GB being used by the ARC). Shortly thereafter, the ARC drained to almost nothing, staying at around 34MB.
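For scale, a back-of-envelope check of how long a send this size should take at the throughput figures reported elsewhere in the thread (decimal TB and MB/s assumed; the 25MB/s and 50MB/s rates are taken from the comments above, not independent measurements):

```shell
# 6.2 TB pushed through the pipeline at a sustained 50 MB/s vs 25 MB/s:
awk 'BEGIN {
    bytes = 6.2e12
    printf "at 50 MB/s: %.1f hours\n", bytes / 50e6 / 3600
    printf "at 25 MB/s: %.1f hours\n", bytes / 25e6 / 3600
}'
# prints:
#   at 50 MB/s: 34.4 hours
#   at 25 MB/s: 68.9 hours
```

Either way the send must run healthy for well over a day, which is why the reports above treat 9- and 20-hour marks as meaningful checkpoints.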
It was at this point that my swap partition started seeing regular usage, growing to 14MB, where it currently sits. It's not dramatic, but the system has 32GB of RAM and is doing nothing other than running the `zfs send` I mentioned earlier. You would think it would have enough memory to do this without needing to swap.

After running for about seven hours, I started running into my problem state again. While the system is not totally unresponsive right now, I have two CPUs stuck in IOWAIT accompanied by sporadic, slow access to the ZFS mounts. The system did appear to be frozen for 5 to 10 minutes, but then recovered to where it is now.
I'm not sure what is using my memory right now. According to `htop`, I have 2.5GB in use, with the rest listed as cache.

`perf top -ag` is showing:

I've uploaded some debugging information to https://cloud.passcal.nmt.edu/index.php/s/0f4Axgf72Of4oOa.
Just for some additional data, I was able to crash another machine last week that was at high memory utilization by running a scrub at the same time an rsync backup was running.