
zpool locks up with hundreds of zfs snapshot processes stuck during resilvering #4226

Closed
AceSlash opened this issue Jan 15, 2016 · 3 comments
Labels: Status: Inactive (not being actively updated)

@AceSlash

Hello,

I ran into a strange issue on a server: during a resilver, zfs snapshot commands started getting stuck and never finished. It took us some time to notice; by the end there were 73 blocked zfs snapshot processes.

It did not affect the availability of the datasets until the very end, when every process accessing any dataset on the zpool became blocked (apache httpd, postgresql, etc.). I tried to stop them, but they could not be killed; I also tried to kill the zfs snapshot processes, with no luck either.

I found what I think is the origin of the issue in the kernel log, several days before the lockup: http://apaste.info/ue5

Results of useful commands (see the collection sketch below):
zpool status: http://apaste.info/1yL
zpool get all: http://apaste.info/PGG
list of ZFS Debian packages installed (the server runs Wheezy): http://apaste.info/OFN
stack of a blocked zfs snapshot process: http://apaste.info/JoO

I had to hard-reboot the server to get it back (it was impossible to shut it down otherwise).

Not sure what more information I can give.
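
For reference, a minimal sketch of how diagnostics like the above can be collected on a Linux system; the pool name `tank` and the pgrep pattern are placeholders, not taken from the original report:

```sh
# Minimal sketch (not from the original report) of collecting the same diagnostics.
# "tank" is a placeholder pool name; adjust the pgrep pattern to the stuck command.
zpool status -v tank                    # pool / resilver state
zpool get all tank                      # pool properties
dmesg | tail -n 200                     # recent kernel messages (hung-task warnings)
pid=$(pgrep -fo 'zfs snapshot')         # oldest stuck "zfs snapshot" process, if any
[ -n "$pid" ] && cat /proc/$pid/stack   # kernel stack of the blocked process (needs root)
```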

@kernelOfTruth
Contributor

Referencing these, since they look suspiciously similar:

#4106 ZFS 0.6.5.3 servers hang trying to get mutexes
(main thread)

#4166 live-lock in arc_reclaim, blocking any pool IO
(additional stack trace)

#3979 (comment): in a nutshell, the fixes (mentioned in #4106) that you could apply to your system to work around this

@behlendorf
Contributor

I can't say for certain since the back traces are incomplete, but I believe cherry-picking the following patch will resolve the issue. It should be in the next point release, 0.6.5.5.

openzfs/spl@e843553 Don't hold mutex until release cv in cv_wait
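
A rough sketch of how such a cherry-pick could be applied to a source build of spl; the tag name and install steps are assumptions about a typical 0.6.5.x source installation, not instructions given in the thread:

```sh
# Hypothetical example: backport the referenced SPL commit onto a 0.6.5.x source tree.
# The tag name below is an assumption; match it to the version actually installed.
git clone https://github.com/openzfs/spl.git
cd spl
git checkout spl-0.6.5.4          # assumed installed release
git cherry-pick e843553           # "Don't hold mutex until release cv in cv_wait"
./autogen.sh && ./configure && make
sudo make install                 # or build distribution packages instead
```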

@AceSlash
Author

zfs snapshot is stuck again on the same machine. The snapshot runs every hour, and the one that is stuck is from 16:00. I have not applied any patch yet (I may do so now; I don't want my snapshots to stop working every ~5 days).

dmesg output: http://apaste.info/k1W
stack of the blocked zfs snapshot process: http://apaste.info/f3C
stack of the blocked txg_sync: http://apaste.info/dGE
stack of the blocked rsync: http://apaste.info/Gi6

Please tell me if I can help with any more information, or if this is a duplicate of one of the other issues referenced here.

update: added stack of txg_sync and rsync
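
When several kernel threads are blocked like this, one common technique (an assumption on my part, not something suggested in the thread) for capturing all of their stacks at once is the sysrq "w" trigger, which dumps every task in uninterruptible (D) state to the kernel log:

```sh
# Dump the stacks of all blocked (D-state) tasks into the kernel log.
echo w > /proc/sysrq-trigger   # requires root and kernel.sysrq enabled
dmesg | tail -n 500            # the blocked-task stacks appear here
```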
