zfs destroy hangs other zfs commands #3104
Comments
I was able to reproduce this same behavior on FreeBSD, so this may be an upstream problem. I am seeing that while the freeze is occurring, no new TXGs are created, so that explains why things like zfs snapshot and zfs recv freeze. Here is the last TXG that was active for the destroy:
otime reads 10 min ((## / 1000000000) / 60), but that time was not spent in the q, w, or s states.
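For anyone wanting to reproduce this measurement: on Linux the per-TXG timings live in the txgs kstat (a sketch, assuming ZoL's /proc/spl/kstat path and the pool name tank from the reproduction below; verify the otime column position against the header line on your version):

```sh
# Dump the per-TXG history for pool "tank"; otime/qtime/wtime/stime are
# nanoseconds spent in the open/quiesce/wait/sync states.
cat /proc/spl/kstat/zfs/tank/txgs

# Convert otime on the last line to minutes, mirroring the
# (otime / 1000000000) / 60 arithmetic above. otime is the 9th column
# in the 0.6.x/0.7.x layout; check the header line on your version.
tail -n 1 /proc/spl/kstat/zfs/tank/txgs | awk '{ printf "%.1f min\n", $9 / 1e9 / 60 }'
```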
@stevenburgess I meant to comment on this earlier but it slipped my mind. I haven't confirmed this against the source, but I strongly suspect what's happening here is that the recursive destroy needs to be handled in a single TXG. Since destroying 4000 snapshots takes a fair bit of work, this TXG takes a long time to process, which prevents the TXGs from rolling forward every few seconds like they're supposed to. When you destroy the snapshots with multiple commands, you spread the work over multiple TXGs, minimizing the impact. This is a problem all the implementations suffer from.
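One way to see the stall described here (a sketch, assuming the 0.6/0.7 kstat layout where the state column shows O for the currently open TXG; on a healthy pool the number ticks up every few seconds, and during a large destroy it stops):

```sh
# Print the currently open TXG once a second; the number freezing in
# place is the TXGs failing to roll forward during the destroy.
while true; do
  awk '$3 == "O" { print $1 }' /proc/spl/kstat/zfs/tank/txgs
  sleep 1
done
```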
That certainly stacks up with my observations. I find this behavior unexpected, since I would probably not think twice about calling zfs destroy -r. What do you think is best in terms of this ticket? I don't want to load the ZoL GitHub with upstream issues. Since it is an OpenZFS issue, should I try to get it posted there? Or is this an acceptable place?
@stevenburgess I think it's definitely appropriate to open an issue in the upstream tracker. We can leave it open here as well; it might make it easier for people to find. I'm a little less optimistic about how best to address this. There are good reasons to want to handle this all in a single TXG.
@behlendorf Has an upstream ticket been opened since? This issue is still here on ZoL 0.7.5. I have been bitten by it several times now when I forget and naively run a recursive zfs destroy.
@stevenburgess @AceSlash we expect that ZFS Channel Programs (OpenZFS 7431), #6558, should improve performance for this case. This feature was recently merged to master, and if you're in a position to test the current master source it would be appreciated.
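For anyone wanting to try it, a minimal sketch of what the channel-program approach looks like, based on the snapshot-destroy example in the zfs-program man page (the pool and dataset names are placeholders matching the reproduction below):

```sh
# destroy_snaps.zcp -- destroy every snapshot of the dataset in argv[1]
cat > destroy_snaps.zcp <<'EOF'
args = ...
argv = args["argv"]
for snap in zfs.list.snapshots(argv[1]) do
    zfs.sync.destroy(snap)
end
EOF

# Run it against the pool; adding -n does a read-only dry run.
zfs program tank destroy_snaps.zcp tank/constant
```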
@behlendorf I'm sorry but I can't test the latest master branch. I really don't know what to do. Today I just did it again: I forgot this bug and killed all zfs operations on an important server, so no backup (zfs send) could run in the meantime. Is this merged in 0.7.7 (I'm still on 0.7.6 on this system)? I mean, that's a super serious issue from my point of view, since it breaks servers. I'm just sad that no solution was implemented and we still have to deal with this.
I was able to reproduce this while destroying a ~400 snapshot pool and running concurrent zfs commands.
I also tried the sample "zfs list" channel program from #7281 (comment) while doing the destroy and saw the delay there as well.
Lastly, I tried running the zfs recursive snapshot destroy channel program from https://www.delphix.com/blog/delphix-engineering/zfs-channel-programs while running the same destroy. Note that I did notice that …
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
This issue still exists.
We are having a problem where any zfs destroy that takes time to return blocks all new zfs commands from starting until the zfs destroy has returned.
I am testing on Ubuntu 14.04, ZoL 0.6.3, pool version 5000.
My testing shows that a zfs destroy call passed a large number of snapshots takes a long time to return. Destroying an 8GB FS with one snapshot returns almost instantly and causes no downtime; destroying an 8GB FS with 4000 snapshots can take 10 min. If I break the 4000-snapshot destroy into batches of 100 (zfs destroy -d fs@1%100), I get 10-second periods of inactivity instead.
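For concreteness, the batched form uses zfs destroy's snapshot-range syntax, where % separates the endpoints of an inclusive range of snapshots (a sketch, assuming numeric snapshot names 1 through 4000 as in the gist below):

```sh
# Destroy snapshots 100 at a time instead of all 4000 in one command,
# so each command's work is spread over its own TXGs.
zfs destroy -d tank/constant@1%100
zfs destroy -d tank/constant@101%200
# ... and so on through @3901%4000
```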
I have a few scripts that I used to demonstrate this problem.
# Create a file system with a ton of snapshots (or use one you have)
https://gist.github.com/stevenburgess/793f6c8e8c593b6b30f6
# Watch new calls to zfs succeed with a little scroll bar
while true; do zfs get name tank > /dev/null 2>&1; printf '|'; sleep 0.1; done
# Destroy the FS in question
zfs destroy -r tank/constant
# Watch the scroll bar stop a few seconds in, and not move until the command returns
I use zfs get to show the slowdown, but all zfs commands halt. This means that if one command destroys an FS with 4000 snapshots, then for a 10 min period no one can:
- zfs get any stats
- zfs snapshot anything
- zfs clone a new FS
- zfs recv new data
- zfs send to start a send stream
This is downtime for a bunch of processes. Our only workaround currently is to ensure that zfs destroy commands destroy a single snapshot at a time, which causes only 0.2 to 0.4 seconds of inactivity per destroy.
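For illustration, a sketch of that single-snapshot workaround (dataset name matches the reproduction above):

```sh
# Destroy snapshots one per command so each lands in its own short TXG;
# other zfs commands block for only ~0.2-0.4s at a time instead of minutes.
zfs list -H -t snapshot -o name -r tank/constant |
while read -r snap; do
    zfs destroy -d "$snap"
done
zfs destroy tank/constant   # finally remove the now-snapshotless FS
```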
I was able to reproduce it on a few machines around here, on 0.6.3 and git master.