Support forced export of ZFS pools #11082
base: master
Conversation
Does this address the various zpool/zfs commands blocking when the pool is in distress? I.e., will the command succeed under such conditions, or get blocked?
Codecov Report
Patch coverage:
Additional details and impacted files:
@@ Coverage Diff @@
## master #11082 +/- ##
==========================================
+ Coverage 75.17% 79.66% +4.49%
==========================================
Files 402 398 -4
Lines 128071 126235 -1836
==========================================
+ Hits 96283 100571 +4288
+ Misses 31788 25664 -6124
Ran into #9067 on the last full ZTS run I did locally, but it looks better now.
It depends on the specific issue. Most of the time, if a pool is suspended, zpool/zfs commands simply fail and get kicked out. There may be other cases that hold the namespace lock and get stuck on a suspension, but those can be fixed in a follow-up.
@wca @allanjude sorry about the delay in getting back to this. Thanks for addressing my initial feedback. I should be able to dig further into the PR this week and get you some additional comments and testing. As part of that work I took the liberty of getting this PR rebased on the latest master, resolving the relevant conflicts, and squashing a few commits. I appreciate how you originally kept the commits separate for the benefit of the reviewers, but I'm not sure keeping the individual review fixes separate is really helpful at this point. I didn't make any changes to the real functionality. You can find the changes in my forced-export-squashed branch. The squashed branch looks like this: 48f1e24 zfs: support force exporting pools. For next steps, what I'd suggest is:
A few more comments while I'm working my way through this.
Thanks for updating this, I should be able to give it a more careful look later this week. In the meanwhile I went ahead and merged most of the smaller unrelated cleanups you opened PRs for to master, so you can rebase this on master at any time and drop those commits.
I also noticed in the CI logs that we hit at least one ASSERT during testing. Can you take a look at the following stack in these console logs:
Just took an initial look at this. How does it interact with force unmounting of filesystems? Does it mean that Linux can now support zfs unmount -f, forcing the unmount even if there are fd's open in the filesystem?
include/sys/txg.h
typedef enum {
        /* Reject the call with EINTR upon receiving a signal. */
        TXG_WAIT_F_SIGNAL = (1U << 0),
        /* Reject the call with EAGAIN upon suspension. */
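The quoted hunk is truncated here; per the commit message there is also a NOSUSPEND flag. A minimal caller sketch, where the flag name TXG_WAIT_F_NOSUSPEND and the exact unwind path are assumptions rather than the patch itself:

/*
 * Sketch: a non-forced export backs off instead of blocking forever
 * on a suspended pool, so it doesn't pin the namespace lock and
 * preclude a later forced export.
 */
int error = txg_wait_synced_tx(dp, 0, NULL, TXG_WAIT_F_NOSUSPEND);
if (error == EAGAIN) {
        /* Pool suspended; unwind and let the user retry forcibly. */
        return (error);
}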
What's the difference between "suspension", "an exiting pool", and "forced export"?
man/man5/zfs-module-parameters.5
in which all new I/Os fail, except for those required to unmount it.
Intended for users trying to forcibly export a pool even when I/Os are in
progress, without the need to find and stop them. This option does not
affect processes that are merely sitting on the filesystem, only those
What does "sitting on" mean? Is this referring to having open file handles or the CWD (current working directory)? Does this mean that regardless of what zfs_forced_export_unmount
is set to, we can unmount filesystems that have open fd's? Does that require the -f
flag to zfs unmount
/zfs destroy
?
No. For more detail see my reply to your top-level question.
man/man5/zfs-module-parameters.5
.ad
.RS 12n
During forced unmount, leave the filesystem in a disabled mode of operation,
in which all new I/Os fail, except for those required to unmount it.
Is the alternative (the default) that a forced unmount will fail if there are I/Os in progress? By "I/Os in progress" do we mean that there's a ZPL syscall (e.g. read/write) in progress with a zio_t outstanding? Why would we not want the default to be zfs_forced_export_unmount=1? What's the downside?
There's no downside, IMO, but Linux has a long-standing practice of disallowing this on other filesystems, so I created the tunable to conform to that.
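For illustration, a rough sketch of where such a tunable would bite on the Linux side; this is an assumed shape only, using the tunable name from the discussion above, and zfsvfs_disable() is a hypothetical stand-in for whatever the patch actually does when the umount_begin hook runs:

static void
zpl_umount_begin(struct super_block *sb)
{
        zfsvfs_t *zfsvfs = sb->s_fs_info;

        /* Sketch, not the actual patch: only enter the disabled mode
         * of operation when the administrator has opted in, matching
         * Linux's conservative default for other filesystems. */
        if (zfs_forced_export_unmount)
                zfsvfs_disable(zfsvfs);        /* hypothetical helper */
}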
Force-pushed from 6a0280e to e5395d1.
On Linux, there's a limit to the extent to which filesystems can be forcibly exported, which is why I tried to explain the limit regarding file descriptors held open by processes. On FreeBSD, however, the VFS will disassociate file descriptors from the filesystem, so it can work in that scenario. I believe this also applies to illumos, but I haven't tested there.
@wca since this change modifies the libzfs library ABI you're going to need to generate a new lib/libzfs/libzfs.abi file and include it in this PR. We added this infrastructure to make sure we never accidentally change the ABI, and it's why the style CI bots failed.
The good news is it's straightforward to do. On Linux (Debian and Ubuntu) you just need to install the abigail-tools package, and then invoke the storeabi Makefile target in the lib/libzfs/ directory. That will generate a new libzfs.abi file you can add to the commit. For the benefit of any reviewers, can you please also mention in the commit message or PR description which interfaces have changed.
cd lib/libzfs/
make storeabi
Looking over the test results, I see that everything pretty much passed with the exception of these two persistent ARC tests. According to the logs it seems that after an export/import cycle there weren't any cache hits when normally there should have been. That would make sense to me if the pool has been force exported, but neither of these pools use the force flag. It seems like we should be waiting for some inflight l2arc writes in the normal export case and we're not.
FAIL l2arc/persist_l2arc_004_pos (expected PASS)
FAIL l2arc/persist_l2arc_005_pos (expected PASS)
Thanks for the info, I never handled this previously. The libzfs.abi delta seems pretty big, is that normal?
I've looked into these failures for a while, and I'm still not sure why they occur. I agree that the results appear to show no cache hits after an export/import cycle, when there should be some. The only thing directly related that's modified in this PR is the addition of ... But these tests still fail even if I change the three 'exiting' functions to return B_FALSE, and remove that call in ...
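For reference, the experiment described amounts to pinning the predicates, roughly as below; this is a debugging stub only, and the real function presumably consults the objset's forced-export state:

boolean_t
dmu_objset_exiting(objset_t *os)
{
        /* Neutered for bisection: with this (and the other two
         * 'exiting' predicates) always returning B_FALSE, the l2arc
         * tests still fail, pointing away from these checks. */
        return (B_FALSE);
}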
I'm still pretty new to it myself, but from what I've seen so far the answer is "yes". That's not really a problem, though, since it's a generated file and we only update it when we're knowingly making ABI changes.
Oddly I wasn't able to reproduce these locally in a VM, so it may be some kind of race. I'll give it another try later. But if you're able to easily reproduce it I'd suggest taking a look at the ...
@wca would you mind rebasing this again on the latest master. There are minor conflicts to resolve and you'll need to generate a new libzfs.abi file.
@wca it looks like there's still at least one path in the force export code which needs to be updated. This was caught by the CI.
I was playing with 801a440 today. On a fresh pool with no issues, ...
If I follow this with an import and another force export, I get a panic: ...
I notice that it seems less likely to happen if I wait a while between import and export. While trying to reduce this to a reliable test, I got a different crash: ...
If I have time later today I'll try to dig into both the slowness and the crashes.
include/sys/fs/zfs.h
@@ -1469,6 +1469,8 @@ typedef enum zfs_ioc {
        ZFS_IOC_USERNS_DETACH = ZFS_IOC_UNJAIL,        /* 0x86 (Linux) */
        ZFS_IOC_SET_BOOTENV,                           /* 0x87 */
        ZFS_IOC_GET_BOOTENV,                           /* 0x88 */
+       ZFS_IOC_HARD_FORCE_UNMOUNT_BEGIN,              /* 0x89 (Linux) */
+       ZFS_IOC_HARD_FORCE_UNMOUNT_END,                /* 0x90 (Linux) */
0x8a :)
More playing. Here's a heavy write+fsync load, on a relatively slow device (a USB-SATA SSD, pushes about 12MB/s): ...
Then: ...
This panics instantly: ...
I'm not totally sure of how we get here. I've seen reexecute storms on ...
It is log writes.
I added this patch:
It gets some of the way there, but there are a lot of ...
Small update. Currently, the issue is with ...
@oshogbo when you get a chance can you please rebase so we can get an updated CI run. |
@behlendorf done |
Force-pushed from 2b6f248 to aef5e28.
This is primarily of use when a pool has lost its disk, while the user
doesn't care about any pending (or otherwise) transactions.

Implement various control methods to make this feasible:
- txg_wait can now take a NOSUSPEND flag, in which case the caller will
  be alerted if their txg can't be committed. This is primarily of
  interest for callers that would normally pass TXG_WAIT, but don't want
  to wait if the pool becomes suspended, which allows unwinding in some
  cases, specifically when one is attempting a non-forced export.
  Without this, the non-forced export would preclude a forced export by
  virtue of holding the namespace lock indefinitely.
- txg_wait also returns failure for TXG_WAIT users if a pool is actually
  being force exported. Adjust most callers to tolerate this.
- spa_config_enter_flags now takes a NOSUSPEND flag to the same effect.
- DMU objset initiator which may be set on an objset being forcibly
  exported / unmounted.
- SPA export initiator may be set on a pool being forcibly exported.
- DMU send/recv now use an interruption mechanism which relies on the
  SPA export initiator being able to enumerate datasets and closing any
  send/recv streams, causing their EINTR paths to be invoked.
- ZIO now has a cancel entry point, which tells all suspended zios to
  fail, and which suppresses the failures for non-CANFAIL users.
- metaslab, etc. cleanup, which consists of simply throwing away any
  changes that were not able to be synced out.
- Linux specific: introduce a new tunable,
  zfs_forced_export_unmount_enabled, which allows the filesystem to
  remain in a modified 'unmounted' state upon exiting zpl_umount_begin,
  to achieve parity with FreeBSD and illumos, which have VFS-level
  support for yanking filesystems out from under users. However, this
  only helps when the user is actively performing I/O, while not
  sitting on the filesystem. In particular, this allows test #3 below
  to pass on Linux.
- Add basic logic to zpool to indicate a force-exporting pool, instead
  of crashing due to lack of config, etc.

Add tests which cover the basic use cases:
- Force export while a send is in progress
- Force export while a recv is in progress
- Force export while POSIX I/O is in progress

This change modifies the libzfs ABI:
- New ZPOOL_STATUS_FORCE_EXPORTING zpool_status_t enum value.
- New field libzfs_force_export for libzfs_handle.

Co-Authored-by: Will Andrews <will@firepipe.net>
Co-Authored-by: Allan Jude <allan@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Catalogics, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes openzfs#3461
Signed-off-by: Will Andrews <will@firepipe.net>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
@@ -1441,12 +1460,16 @@ zfsvfs_teardown(zfsvfs_t *zfsvfs, boolean_t unmounting)
                }
        }
        if (!zfs_is_readonly(zfsvfs) && os_dirty) {
-               txg_wait_synced(dmu_objset_pool(zfsvfs->z_os), 0);
+               (void) txg_wait_synced_tx(dmu_objset_pool(zfsvfs->z_os), 0,
+                   NULL, wait_flags);
Cosmetics, but here and in other places txg_wait_synced_tx() without tx makes no sense. It should be txg_wait_synced_flags().
-txg_wait_synced_impl(dsl_pool_t *dp, uint64_t txg, boolean_t wait_sig)
+int
+txg_wait_synced_tx(dsl_pool_t *dp, uint64_t txg, dmu_tx_t *tx,
+    txg_wait_flag_t flags)
This API looks confusing to me, receiving both tx and txg. Submitting a tx, I would expect it to use tx_txg, but instead it uses the tx only to get tx_objset from it. Looking through the patch I found two places where a tx is submitted, but in neither case does it wait for the actual tx to be committed, only for a pretty abstract txg, and the tx specification only allows checking dmu_objset_exiting(), which makes me think: why should txg_wait_synced() ever care about the objset unmount progress? I would understand exiting due to forced pool export, but that would not require the objset; the already available pool argument would be enough for that.
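A sketch of the simpler shape being suggested, keying only off the pool; the name follows the txg_wait_synced_flags() suggestion above, and the internals are elided and assumed:

int
txg_wait_synced_flags(dsl_pool_t *dp, uint64_t txg, txg_wait_flag_t flags)
{
        /* Forced export is a pool-level condition, so the already
         * available dp suffices; no tx or objset is needed. */
        if (spa_exiting_any(dp->dp_spa))
                return (SET_ERROR(EAGAIN));
        /* ... normal wait-for-sync path, honoring flags ... */
        return (0);
}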
@@ -6763,7 +6763,6 @@ arc_write_done(zio_t *zio)
                arc_access(hdr, 0, B_FALSE);
                mutex_exit(hash_lock);
        } else {
-               arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
                VERIFY3S(remove_reference(hdr, hdr), >, 0);
ARC_FLAG_IO_IN_PROGRESS has its own reference, which is dropped here. Would you explain why you are leaving the flag but dropping the reference?
-       if (!(zfs_flags & ZFS_DEBUG_DBUF_VERIFY))
+       if (!(zfs_flags & ZFS_DEBUG_DBUF_VERIFY) ||
+           dmu_objset_exiting(db->db_objset))
Either here or in the next chunk dmu_objset_exiting() makes no sense. And here it looks bad, like we are saying "the dbuf state is totally insane and we do not care", which should not be so.
        VERIFY0(zap_increment(os, DMU_USERUSED_OBJECT,
            uqn->uqn_id, uqn->uqn_delta, tx));
        mutex_exit(&os->os_userused_lock);
        if (!dmu_objset_exiting(os)) {
This and the following group checks look excessive, considering the check in VERIFY() below. They may only be needed if there are still some problematic cases inside, but then it is only an ugly and probably unreliable workaround.
        if ((flags & SCL_FLAG_TRYENTER) != 0)
                error = SET_ERROR(EAGAIN);
        if (error == 0 && ((flags & SCL_FLAG_NOSUSPEND) != 0)) {
This would be else if (...).
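That is, presumably:

if ((flags & SCL_FLAG_TRYENTER) != 0)
        error = SET_ERROR(EAGAIN);
else if ((flags & SCL_FLAG_NOSUSPEND) != 0) {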
@@ -511,28 +496,54 @@ spa_config_enter_impl(spa_t *spa, int locks, const void *tag, krw_t rw,
                mutex_enter(&scl->scl_lock);
                if (rw == RW_READER) {
                        while (scl->scl_writer ||
-                           (!mmp_flag && scl->scl_write_wanted)) {
+                           ((flags & SCL_FLAG_MMP) && scl->scl_write_wanted)) {
+                               error = spa_config_eval_flags(spa, flags);
This code is already congested sometimes, and here you are adding another global spa_suspend_lock acquisition. I am not happy.
In addition to that, I am not sure why the locking primitive should care about pool suspension. Why couldn't it be acquired normally, and dropped when the respective protected operation fails? It feels like some workaround to me.
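A sketch of the alternative being suggested: take the config lock normally and let the protected operation itself fail on suspension. Here do_protected_op() is a hypothetical stand-in for whatever work the lock guards:

spa_config_enter(spa, SCL_ALL, FTAG, RW_READER);
error = do_protected_op(spa);        /* hypothetical; fails if the pool is suspended */
spa_config_exit(spa, SCL_ALL, FTAG);
if (error != 0)
        return (error);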
@@ -3505,7 +3531,7 @@ zil_commit_impl(zilog_t *zilog, uint64_t foid)
        zil_commit_writer(zilog, zcw);
        zil_commit_waiter(zilog, zcw);

-       if (zcw->zcw_zio_error != 0) {
+       if (zcw->zcw_zio_error != 0 && !dmu_objset_exiting(zilog->zl_os)) {
This does not feel right to me. The error may be valid, and as I see it, txg_wait_synced() should exit normally in case of forced export.
            (u_longlong_t)txg);
-       if (txg < spa_freeze_txg(zilog->zl_spa))
-               VERIFY(!zilog_is_dirty(zilog));
+       if (!dmu_objset_exiting(zilog->zl_os)) {
Again, why should objset unmount (possibly on a healthy pool) affect its ZIL operation? Wouldn't patching the VERIFY() below be sufficient? Other parts should not care.
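I.e., patching just the VERIFY() instead of gating the whole block, roughly (a sketch of the suggestion, assuming the VERIFY in question is the one quoted above):

if (txg < spa_freeze_txg(zilog->zl_spa))
        VERIFY(!zilog_is_dirty(zilog) ||
            dmu_objset_exiting(zilog->zl_os));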
@@ -2308,10 +2308,13 @@ zio_wait(zio_t *zio)
        __zio_execute(zio);

        mutex_enter(&zio->io_lock);
-       while (zio->io_executor != NULL) {
+       while (zio->io_executor != NULL && !spa_exiting_any(zio->io_spa)) {
How can you exit here and call zio_destroy() below before ZIO processing is officially complete? Is there anything to prevent use-after-free when the I/O finally unblocks and tries to complete?
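A sketch of the race being described, assuming the blocked I/O can still complete after the waiter bails out:

/*
 * Waiter (zio_wait):                      In-flight I/O, later:
 *   while (io_executor != NULL &&
 *       !spa_exiting_any(spa))
 *     cv_wait(...);                         device finally responds,
 *   mutex_exit(&zio->io_lock);              zio_done(zio) runs and
 *   error = zio->io_error;                  touches zio fields...
 *   zio_destroy(zio);   <-- ...which zio_destroy() may already have
 *                           freed: a potential use-after-free.
 */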
Any chance completing this PR can be prioritized?
@oshogbo any chance you can respond to what GitHub calls "unresolved conversations"? I mean the code review remarks from other people.
We are continuing to investigate issues with this feature in production to improve the pull request.
But guys, you have lots of unanswered code reviews here. Can you address those as well?
Any progress?
@nerozero it looks to me like a combination of "it wasn't designed to handle this" and "you're just a minority so there are no resources to fix it". Just go with the dm-error workaround, where you can safely kick the physical device out and put it back in; I've totally lost hope of this ever happening natively.
While we are still actively working on this issue, priority is being given to related work to improve the safety of ZFS in the face of device failures, and to make some situations more recoverable, to avoid the need to do a forced export. I am sorry if the pace of progress is not to your liking.
Perhaps some of the frustration comes from different behaviors on different systems, and to some people this is a bigger issue than for others? Based on some responses and suggested workarounds, I have a feeling that's the case. For example, I saw responses where somebody was able to recover from the issue by performing some operations with dm and then invoking ...
In my case on FreeBSD 14.1-p2, when I reconnect the device it appears under the same name it had before (I also attached it using a GPT label, to prevent issues if the device name changed), and the device appears to be fully accessible: I can interact with it, for example list partitions, or call ...
My problem is that invoking ... doesn't work. It tells me that operations are blocked because the pool is in a waiting state or something like that. The only way I know so far to restore access to the drive is to reboot the system, which is frustrating, as it feels like a bug. I'm wondering if this is an implementation issue in FreeBSD, or whether that's how it works for everyone? Could someone confirm whether this also happens on other systems; perhaps I need to open a bug with FreeBSD.
So to my understanding, you are focusing on increasing the robustness of ZFS in the case of sudden disconnects/reconnects of storage devices? Does this mean ZFS then auto-resumes its operation when the storage device becomes available again? Is that the aim?
Thanks for the update! I haven't been following a lot of the current development; are there any pull requests/issues in particular that are tracking this related work on recoverability?
Motivation and Context
This change enables users to forcibly export a pool in a manner which safely discards all pending dirty data, transaction groups, associated zios, metaslab changes, etc. It allows a regular zpool export to fail out so that it releases the namespace lock, enabling a forced export. It is able to do this regardless of whether the current pool activity involves send, receive, or POSIX I/O. This allows users to continue using their system without rebooting, in the event the disk cannot be recovered in an online manner, or the user prefers to resume use of their pool at a later time. Since ZFS can simply resume from the most recent consistent transaction group, the latter is easily achieved.
Closes #3461
Description
This is primarily of use when a pool has lost its disk, while the user
doesn't care about any pending (or otherwise) transactions.
Implement various control methods to make this feasible:
- txg_wait can now take a NOSUSPEND flag, in which case the caller will be alerted if their txg can't be committed. This is primarily of interest for callers that would normally pass TXG_WAIT, but don't want to wait if the pool becomes suspended, which allows unwinding in some cases, specifically when one is attempting a non-forced export. Without this, the non-forced export would preclude a forced export by virtue of holding the namespace lock indefinitely.
- txg_wait also returns failure for TXG_WAIT users if a pool is actually being force exported. Adjust most callers to tolerate this.
- spa_config_enter_flags now takes a NOSUSPEND flag to the same effect.
- DMU objset initiator which may be set on an objset being forcibly exported / unmounted.
- SPA export initiator may be set on a pool being forcibly exported.
- DMU send/recv now use an interruption mechanism which relies on the SPA export killer being able to enumerate datasets and closing any send/recv streams, causing their EINTR paths to be invoked.
- ZIO now has a cancel entry point, which tells all suspended zios to fail, and which suppresses the failures for non-CANFAIL users.
- metaslab, etc. cleanup, which consists of simply throwing away any changes that were not able to be synced out.
- Linux specific: introduce a new tunable, zfs_forced_export_unmount_enabled, which allows the filesystem to remain in a modified 'unmounted' state upon exiting zpl_umount_begin, to achieve parity with FreeBSD and illumos, which have VFS-level support for yanking filesystems out from under users. However, this only helps when the user is actively performing I/O, while not sitting on the filesystem. In particular, this allows test #3 below to pass on Linux.
- Add basic logic to zpool to indicate a force-exporting pool, instead of crashing due to lack of config, etc.
Add tests which cover the basic use cases:
- Force export while a send is in progress
- Force export while a recv is in progress
- Force export while POSIX I/O is in progress
How Has This Been Tested?
Types of changes
Checklist:
All commit messages are properly formatted and contain Signed-off-by.
Pending items on this checklist are pending review. I am seeking feedback on whether to break out any of the commits, rather than squash all of them into the primary one.