zfs blocking everything, out of memory, and daily lockups #860
Try using this kernel patch and rebuilding SPL:
Yes, if you can rebuild your kernel with the above patch, that would help us prove that this is related to a known issue. We're looking for a way to resolve this without needing to patch the kernel, but we're not quite there yet.
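For anyone following along, the suggested workflow looks roughly like this on a Debian/Ubuntu system. This is only a sketch: the patch filename and source paths below are placeholders, since the actual patch link isn't reproduced in this thread.

```sh
# Apply the patch to the kernel source tree (fix.patch is a placeholder name)
cd /usr/src/linux-3.5
patch -p1 < ~/fix.patch

# Build Debian kernel packages and install them, then reboot into the new kernel
make oldconfig
make -j"$(nproc)" deb-pkg
sudo dpkg -i ../linux-image-*.deb ../linux-headers-*.deb

# After rebooting, rebuild SPL against the patched kernel source
cd ~/spl
./configure --with-linux=/usr/src/linux-3.5
make -j"$(nproc)"
sudo make install
```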
Ok, I'll try this soon. I'm on Ubuntu now, and I haven't built a kernel.
Alternatively, you could try openzfs/spl#155, which is the latest patch attempting to address this issue without needing to patch your kernel. However, thus far it hasn't seen much testing on real systems, just the regression test suites.
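As a sketch of how one can test an open pull request like openzfs/spl#155, the PR head can be fetched directly into a local branch (the branch name pr-155 below is arbitrary):

```sh
# Fetch the pull request head into a local branch and build from it
git clone https://github.com/openzfs/spl.git
cd spl
git fetch origin pull/155/head:pr-155
git checkout pr-155

# Regenerate the build system, then build and install SPL from that branch
./autogen.sh
./configure
make -j"$(nproc)"
sudo make install
```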
Richard, will your kernel patch work with the 3.2 kernel series?
I've patched the kernel on this machine (now running 3.5) and will report back.
If you get a chance, could you try the following patch stacks as well? They should also resolve the issue, but without the need to patch your kernel.
With the kernel patch, the system seems to be stable, and the issue is resolved. I have not had a chance to try the zfs/spl patches yet, though I plan to in the future. In the meantime, it's nice to have the system stable.
This issue will be resolved when issue #883 is merged, hopefully in a few days.
The #883 changes have been merged into master and will appear in -rc11. Since you had good luck with the kernel patch, I fully expect this issue will be resolved in the next update, so I'm going to close it. We can easily reopen it if for some reason that's not the case.
I am having this problem on 12.04.1 64-bit with rc12 (0.6.0.86) from the stable PPA. I never seem to get the bitter end of the crash into my syslog, but I'm pretty sure these are OOM crashes. I usually REISUB when it gets unresponsive. I've added an ARC cap to my config.
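Concretely, the ARC cap means setting the zfs_arc_max module parameter; something like the following, where the 2 GiB value is only an example, not necessarily what was used here:

```sh
# Cap the ZFS ARC via the zfs_arc_max module parameter (value in bytes;
# pick something well under total RAM). Takes effect on module load.
echo "options zfs zfs_arc_max=2147483648" | sudo tee /etc/modprobe.d/zfs.conf

# On builds that allow it, the cap can also be changed at runtime:
echo 2147483648 | sudo tee /sys/module/zfs/parameters/zfs_arc_max
```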
I just put in 0.6.0.88 and it's chugging away. I don't expect to see the errors for a while because of the sysctl change.
Ooh, there goes my first automatic reboot. Do you think it's better to leave the system running, with hung tasks getting killed periodically and a potential OOM crash, because it might keep working on whatever hanging ZFS task it's doing? Or to keep the system rebooting so that it doesn't die? I think I'll disable the rebooting and let it run for the near future to see if the ARC cap keeps it from dying.
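The automatic-reboot setup in question is roughly the kernel's hung-task panic knobs; a sketch, with the timeout values as examples only:

```sh
# Treat long-hung tasks as fatal: panic when a task has been stuck in
# uninterruptible sleep for longer than the timeout
sudo sysctl kernel.hung_task_timeout_secs=120
sudo sysctl kernel.hung_task_panic=1

# Reboot automatically 10 seconds after a panic instead of sitting dead
sudo sysctl kernel.panic=10
```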
9647 sec later (2.6 hours), 8 OOM errors. It killed mountall, upstart-udev-br, upstart-socket-, smbd (x2), dbus-daemon, dhclient3, and rsyslogd, but interestingly enough left dnsmasq running to interfere with other computers' networking (I should have disabled authoritative mode). Anyway, I didn't get a final kernel error on the tty, but it's unresponsive aside from REISUB, which unsurprisingly found no tasks left to kill. Drive lights were finally dead too. Sometimes, when I get partway through REISUB, disk activity resumes like it's continuing the normal background ZFS stuff.
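For reference, REISUB is the magic SysRq sequence; it only works if SysRq is enabled, and the same steps can also be driven from a shell while one still responds. A sketch:

```sh
# Magic SysRq must be enabled for Alt+SysRq+R,E,I,S,U,B to work
sudo sysctl kernel.sysrq=1

# The same sequence from a shell:
# unraw, terminate tasks, kill tasks, sync, remount read-only, reboot
for key in r e i s u b; do
    echo "$key" | sudo tee /proc/sysrq-trigger
    sleep 2
done
```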
Tried something new. Booted to a FreeBSD 9 memdisk.img and it imported and exported fine. I figured it would wait a while, doing slow access, finishing whatever was keeping it from working, and then be usable. It didn't really take long, though. Then I booted back into Ubuntu 12.04.1 and have the same problems as before. ZFS on Ubuntu was great for a few months, I think it was. I had migrated the pool from FreeBSD 8 so I would have an OS in common with my network and an easier time with software and operations in general. I can't imagine moving back to FreeBSD now. Well, I can, but I don't like it. Hmm... just remembered, I'm pretty sure that for a while the pool was imported on Ubuntu and would block when trying to mount. After I got it more or less exported, it would hang on import. I guess I need to try importing and mounting on FreeBSD.
Imported, mounted, and exported on FreeBSD, then imported and mounted on Ubuntu. Everything looked wonderful. OOM x4, then, with my panic-on-hung-task setting, it panicked about scsi_eh_0 blocking for 3600 sec.
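The round trip described here is the standard import/export cycle; a sketch, with "tank" standing in for the actual pool name:

```sh
# On FreeBSD: import the pool (datasets mount by default), then cleanly export it
zpool import tank
zpool export tank

# Back on Ubuntu: import again and mount everything; -f would only be
# needed if the pool still looked active on the other system
zpool import tank
zfs mount -a
```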
I've seen a number of bugs that look like this, but I'm not sure if this is the same bug. I understand many things can cause this sort of blocking behavior, and I don't think I've seen this particular combination. I'm using the 0.6.0.65-0ubuntu1~precise1 version from the Ubuntu ppa-daily on a stock Ubuntu 12.04 server 64-bit install, though I've had problems like this (though perhaps not identical) going back to 11.10 and older versions of ZFS on Linux. I'm not sure it's relevant to this bug, but I should mention that the issues with lockups, crashes, and blocking seemed to begin about 6 months ago and may have been related to a bad stick of RAM, which was later found and removed. Scrubs have been successful since, and the backup array, which consists of SATA drives in an external enclosure, either eSATA or USB (seems to not make a difference), is now on a different system with ECC RAM. Problems continue.
Under a moderate load, the system locks up about once a day, probably related to the pattern of usage. Computers around our office back up to this system daily, typically overnight. Additionally, the system is resilvering a disk, and has been for weeks: it finished the resilver, but there were data errors due to the old bad RAM. I have removed the offending files and am letting it resilver again, since it seems to need to do that if it encounters data corruption it can't fix. If I don't have the pool mounted (and hence no backups to it are happening), I don't seem to have problems, at least not every day. However, if I try to use the pool, I get messages like this and, recently, out-of-memory messages:
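As for monitoring the resilver and cleaning up after the damaged files, these are the standard commands; "tank" is again a placeholder for the real pool name:

```sh
# Watch resilver progress and list files with unrecoverable errors
zpool status -v tank

# After removing the damaged files, clear the error counters and
# kick off another pass
zpool clear tank
zpool scrub tank
```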