
zpool locks up with hundreds of zfs snapshot processes stuck during resilvering #4226

Closed
AceSlash opened this issue Jan 15, 2016 · 3 comments
Labels: Status: Inactive (not being actively updated)

@AceSlash

Hello,

I ran into a strange issue on a server: during a resilver, zfs snapshot commands started getting stuck and never finished. It took us some time to notice; by the end there were 73 blocked zfs snapshot processes.

It did not affect the availability of the datasets until the very end, when every process accessing any dataset on the zpool became blocked (apache httpd, postgresql, etc.). I tried to stop them, but they could not be killed; I also tried to kill the zfs snapshot processes, with no luck either.

I found what I think is the origin of the issue in the kernel log, several days before the lockup: http://apaste.info/ue5

Results of useful commands (see the collection sketch below):
zpool status: http://apaste.info/1yL
zpool get all: http://apaste.info/PGG
list of ZFS Debian packages installed (the server runs Wheezy): http://apaste.info/OFN
stack of a blocked zfs snapshot process: http://apaste.info/JoO

I had to hard-reboot the server to get it back (it was impossible to shut it down otherwise).

Not sure what more information I can give.
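
For reference, a minimal sketch of how diagnostics like the above can be collected on a Linux system; the pool name `tank` and the pgrep pattern are placeholders, not taken from the original report:

```sh
# Minimal sketch (not from the original report) of collecting the same diagnostics.
# "tank" is a placeholder pool name; adjust the pgrep pattern to the stuck command.
zpool status -v tank                    # pool / resilver state
zpool get all tank                      # pool properties
dmesg | tail -n 200                     # recent kernel messages (hung-task warnings)
pid=$(pgrep -fo 'zfs snapshot')         # oldest stuck "zfs snapshot" process, if any
[ -n "$pid" ] && cat /proc/$pid/stack   # kernel stack of the blocked process (needs root)
```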

@kernelOfTruth
Contributor

Referencing these, since they look suspiciously similar:

#4106 ZFS 0.6.5.3 servers hang trying to get mutexes
(main thread)

#4166 live-lock in arc_reclaim, blocking any pool IO
(additional stack trace)

#3979 (comment): in a nutshell, the fixes (mentioned in #4106) that you could apply to your system to work around this

@behlendorf
Contributor

I can't say for certain since the back traces are incomplete, but I believe cherry-picking the following patch will resolve the issue. It should be in the next point release, 0.6.5.5.

openzfs/spl@e843553 Don't hold mutex until release cv in cv_wait
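
A rough sketch of how such a cherry-pick could be applied to a source build of spl; the tag name and install steps are assumptions about a typical 0.6.5.x source installation, not instructions given in the thread:

```sh
# Hypothetical example: backport the referenced SPL commit onto a 0.6.5.x source tree.
# The tag name below is an assumption; match it to the version actually installed.
git clone https://github.com/openzfs/spl.git
cd spl
git checkout spl-0.6.5.4          # assumed installed release
git cherry-pick e843553           # "Don't hold mutex until release cv in cv_wait"
./autogen.sh && ./configure && make
sudo make install                 # or build distribution packages instead
```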

@AceSlash
Author

zfs snapshot is stuck again on the same machine. The snapshot runs every hour, and the one that is stuck is from 16:00. I have not applied any patch yet (I may do so now; I don't want my snapshots to stop working every ~5 days).

dmesg output: http://apaste.info/k1W
stack of the blocked zfs snapshot process: http://apaste.info/f3C
stack of the blocked txg_sync: http://apaste.info/dGE
stack of the blocked rsync: http://apaste.info/Gi6

Please tell me if I can help with any more information, or if this is a duplicate of one of the other issues referenced here.

update: added stack of txg_sync and rsync
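
When several kernel threads are blocked like this, one common technique (an assumption on my part, not something suggested in the thread) for capturing all of their stacks at once is the sysrq "w" trigger, which dumps every task in uninterruptible (D) state to the kernel log:

```sh
# Dump the stacks of all blocked (D-state) tasks into the kernel log.
echo w > /proc/sysrq-trigger   # requires root and kernel.sysrq enabled
dmesg | tail -n 500            # the blocked-task stacks appear here
```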
