
Support forced export of ZFS pools #11082

Open · wants to merge 1 commit into master from forced-export

Conversation

wca
Contributor

@wca wca commented Oct 17, 2020

Motivation and Context

This change enables users to forcibly export a pool in a manner which safely discards all pending dirty data, transaction groups, associated zios, metaslab changes, etc. It allows a regular zpool export to fail out so that it releases the namespace lock to enable a forced export. It is able to do this regardless of whether the current pool activity involves send, receive, or POSIX I/O.

This allows users to continue using their system without rebooting, in the event the disk cannot be recovered in an online manner, or the user prefers to resume use of their pool at a later time. Since ZFS can simply resume from the most recent consistent transaction group, the latter is easily achieved.

Closes #3461

Description

This is primarily of use when a pool has lost its disk, while the user
doesn't care about any pending (or otherwise) transactions.

Implement various control methods to make this feasible:

  • txg_wait can now take a NOSUSPEND flag, in which case the caller will be
    alerted if their txg can't be committed. This is primarily of interest
    for callers that would normally pass TXG_WAIT, but don't want to wait if
    the pool becomes suspended, which allows unwinding in some cases,
    specifically when one is attempting a non-forced export. Without this,
    the non-forced export would preclude a forced export by virtue of holding
    the namespace lock indefinitely. (A sketch of the caller-side pattern
    follows this list.)
  • txg_wait also returns failure for TXG_WAIT users if a pool is actually
    being force exported. Adjust most callers to tolerate this.
  • spa_config_enter_flags now takes a NOSUSPEND flag to the same effect.
  • DMU objset "killer" flag which may be set on an objset being forcibly
    exported / unmounted.
  • SPA "killer" flag which may be set on a pool being forcibly exported.
  • DMU send/recv now use an interruption mechanism which relies on the SPA
    killer being able to enumerate datasets and closing any send/recv streams,
    causing their EINTR paths to be invoked.
  • ZIO now has a cancel entry point, which tells all suspended zios to fail,
    and which suppresses the failures for non-CANFAIL users.
  • metaslab, etc. cleanup, which consists of simply throwing away any changes
    that were not able to be synced out.
  • Linux specific: introduce a new tunable, zfs_forced_export_unmount_enabled,
    which allows the filesystem to remain in a modified 'unmounted' state upon
    exiting zpl_umount_begin, to achieve parity with FreeBSD and illumos,
    which have VFS-level support for yanking filesystems out from under users.
    However, this only helps when the user is actively performing I/O, while
    not sitting on the filesystem. In particular, this allows test #3 below
    to pass on Linux.
  • Add basic logic to zpool to indicate a force-exporting pool, instead of
    crashing due to lack of config, etc.

Add tests which cover the basic use cases:

  • Force export while a send is in progress
  • Force export while a recv is in progress
  • Force export while POSIX I/O is in progress

How Has This Been Tested?

  • New ZFS Test Suite tests covering the three main pool activity scenarios.
  • Testing in a production environment that focuses around use of send/recv to a network-based disk, which can fail and not return "for a while". User doesn't mind losing the last few transaction groups, and is able to pick up later.
  • Existing ZFS Test Suite tests, to check for any unexpected breakage of non-forced-export scenarios.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

Items remaining on this checklist are pending review. I am seeking feedback on whether to break out any of the commits, rather than squashing all of them into the primary one.

@gdevenyi
Contributor

Does this address the various zpool/zfs commands blocking when the pool is in distress? i.e., will these commands succeed under such conditions, or get blocked?

@codecov

codecov bot commented Oct 17, 2020

Codecov Report

Patch coverage: 62.00% and project coverage change: +4.49% 🎉

Comparison is base (161ed82) 75.17% compared to head (f168cb6) 79.66%.

❗ Current head f168cb6 differs from pull request most recent head 0c5b2fa. Consider uploading reports for the commit 0c5b2fa to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #11082      +/-   ##
==========================================
+ Coverage   75.17%   79.66%   +4.49%     
==========================================
  Files         402      398       -4     
  Lines      128071   126235    -1836     
==========================================
+ Hits        96283   100571    +4288     
+ Misses      31788    25664    -6124     
Flag     Coverage Δ
kernel   80.24% <61.60%> (+1.48%) ⬆️
user     65.35% <40.72%> (+17.93%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files                  Coverage Δ
include/sys/dmu.h               100.00% <ø> (ø)
include/sys/dsl_dataset.h       100.00% <ø> (ø)
module/zfs/vdev_initialize.c     98.10% <ø> (+4.58%) ⬆️
module/zfs/vdev_label.c          92.22% <0.00%> (+1.35%) ⬆️
module/zfs/metaslab.c            94.52% <12.50%> (+1.91%) ⬆️
module/zfs/dsl_dataset.c         90.82% <18.75%> (-0.44%) ⬇️
module/zfs/vdev_rebuild.c        90.75% <31.57%> (+13.09%) ⬆️
module/zfs/dmu_send.c            84.42% <33.33%> (+5.04%) ⬆️
module/zfs/vdev_removal.c        96.50% <33.33%> (+3.79%) ⬆️
module/zfs/vdev_trim.c           93.50% <38.46%> (+17.50%) ⬆️
... and 33 more

... and 194 files with indirect coverage changes


@behlendorf behlendorf added the "Status: Work in Progress" (not yet ready for general review) label Oct 22, 2020
@wca
Contributor Author

wca commented Oct 26, 2020

Ran into #9067 on the last full ZTS run I did locally, but it looks better now.

@wca
Contributor Author

wca commented Oct 26, 2020

Does this address the various zpool/zfs commands blocking when the pool is in distress? i.e., will these commands succeed under such conditions, or get blocked?

It depends on the specific issue. Most of the time, if a pool is suspended, zpool/zfs commands simply fail and get kicked out. zpool export in particular will now either exit early if not hard-forced, or (with -F) forcibly drop all context and export the pool. Before, a non -F export would hang forever while holding the namespace lock, blocking everything that depends on it.

There may be other cases that hold the namespace lock and get stuck on a suspension, but they can be fixed in a follow-up.
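To make that flow concrete, here is a schematic sketch of the export decision, assuming spa_suspended() as the suspension check; this is an illustration of the behavior described above, not the PR's actual spa_export_common():

/* Illustration only, not the PR's spa_export_common(). */
static int
example_export_decision(spa_t *spa, boolean_t hardforce)
{
	if (spa_suspended(spa) && !hardforce) {
		/* Fail out early so the namespace lock is released. */
		return (SET_ERROR(EAGAIN));
	}

	/* Hard force: discard dirty data and pending zios, tear down. */
	return (0);
}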

Review threads: cmd/zpool/zpool_main.c, include/os/linux/spl/sys/kmem_cache.h, include/sys/spa.h (3), module/os/linux/zfs/zpl_super.c, man/man8/zpool-export.8, module/zfs/spa.c
@behlendorf
Contributor

@WCI @allanjude sorry about the delay in getting back to this. Thanks for addressing my initial feedback. I should be able to further dig in to the PR this week and get you some additional comments and testing.

As part of that work I took the liberty of getting this PR rebased on the latest master, resolving the relevant conflicts, and squashing a few commits. I appreciate how you originally kept the commits separate for the benefit of the reviewers, but I'm not sure keeping the individual review fixes separate is really helpful at this point. I didn't make any changes to the real functionality. You can find the changes in my forced-export-squashed branch. The squashed branch looks like this:

48f1e24 zfs: support force exporting pools
a107a70 zpool_export: make test suite more usable
d07a310 check_prop_source: improve output clarity
07f9ca4 zfs_receive: remove duplicate check_prop_source
9d64ea4 logapi: cat output file instead of printing
6e659d8 spa_export_common: refactor common exit points

For next steps what I'd suggest is:

  1. Open a new PR for each bug fix which isn't really core to this feature. There's no reason we can't get those changes integrated right away as long as they pass testing and are entirely independent. This would be a107a70, d07a310, 07f9ca4, 9d64ea4, and 6e659d8. If there are other unrelated hunks we should pull out into their own PRs, that'd be nice too.

  2. Please review what I've done and go ahead and force update this PR with 48f1e24 (or a modified version of it you're happy with). This way we can get a fresh run from the CI and see where we stand. It looks like the current version did encounter at least one panic with a full ZTS run. That new commit passes the new test cases locally for me on Linux and sanity.run, but so far that's all I've tried.

Contributor

@behlendorf behlendorf left a comment

A few more comments while I'm working my way through this.

Review threads: lib/libzfs/libzfs_mount.c, lib/libzfs/libzfs_dataset.c (3), module/os/linux/zfs/zpl_super.c, man/man8/zpool-export.8 (2), include/sys/dmu_objset.h, include/sys/spa_impl.h, module/os/linux/zfs/zfs_vfsops.c
@wca
Contributor Author

wca commented Jan 25, 2021

I just pushed an update that includes @behlendorf suggested break-out, and responds to most of the comments. The earlier commits are now in these PRs: #11514 #11515 #11516 #11517 #11518.

Contributor

@behlendorf behlendorf left a comment

Thanks for updating this, I should be able to give it a more careful look later this week. In the meanwhile I went ahead and merged most of the smaller unrelated cleanups you opened PRs for to master, so you can rebase this on master any time and drop those commits.

I also noticed in the CI logs we hit at least one ASSERT during testing. Can you take a look at the following stack in these console logs:

http://build.zfsonlinux.org/builders/CentOS%20Stream%208%20x86_64%20%28TEST%29/builds/158/steps/shell_4/logs/console

Review threads: include/sys/zfs_refcount.h, lib/libzfs/libzfs_dataset.c, lib/libzfs/libzfs_mount.c
@behlendorf
Contributor

OK, #11514 #11515 #11516 #11517 #11518 have all been merged and can be dropped from this PR with a rebase.

@behlendorf
Contributor

behlendorf commented Feb 2, 2021

@WCI @wca when you get a chance to rebase this that would be great.

Member

@ahrens ahrens left a comment

Just took an initial look at this. How does it interact with force unmounting of filesystems? Does it mean that Linux can now support zfs unmount -f, forcing the unmount even if there are fd's open in the filesystem?

Review thread: include/sys/dmu.h
typedef enum {
	/* Reject the call with EINTR upon receiving a signal. */
	TXG_WAIT_F_SIGNAL = (1U << 0),
	/* Reject the call with EAGAIN upon suspension. */
Member

What's the difference between "suspension", "an exiting pool", and "forced export"?

Review thread: include/sys/zfs_refcount.h
in which all new I/Os fail, except for those required to unmount it.
Intended for users trying to forcibly export a pool even when I/Os are in
progress, without the need to find and stop them. This option does not
affect processes that are merely sitting on the filesystem, only those
Member

What does "sitting on" mean? Is this referring to having open file handles or the CWD (current working directory)? Does this mean that regardless of what zfs_forced_export_unmount is set to, we can unmount filesystems that have open fd's? Does that require the -f flag to zfs unmount/zfs destroy?

Contributor Author

No. For more detail see my reply to your top-level question.

.ad
.RS 12n
During forced unmount, leave the filesystem in a disabled mode of operation,
in which all new I/Os fail, except for those required to unmount it.
Member

Is the alternative (the default) that a forced unmount will fail if there are i/o's in progress? by "i/o's in progress" do we mean that there's a ZPL syscall (e.g. read/write) in progress with a zio_t outstanding?

Why would we not want the default to be zfs_forced_export_unmount=1? What's the downside?

Contributor Author

There's no downside, IMO, but Linux has a long-standing practice of disallowing this on other filesystems, so I created the tunable to conform to that.
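For reference, such a tunable is typically exposed through the ZFS_MODULE_PARAM macro; a minimal sketch, assuming the tunable name from this PR and a default of 0 to match the Linux convention just described:

/* Sketch: default off, matching Linux's usual refusal to force unmount. */
static int zfs_forced_export_unmount_enabled = 0;

ZFS_MODULE_PARAM(zfs, zfs_, forced_export_unmount_enabled, INT, ZMOD_RW,
	"Allow force-unmounted filesystems to enter a disabled state");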

@wca wca force-pushed the forced-export branch 3 times, most recently from 6a0280e to e5395d1 Compare February 6, 2021 17:41
@wca
Contributor Author

wca commented Feb 6, 2021

Just took an initial look at this. How does it interact with force unmounting of filesystems? Does it mean that Linux can now support zfs unmount -f, forcing the unmount even if there are fd's open in the filesystem?

On Linux, there's a limit to how far a filesystem can be forcibly unmounted, which is why I tried to explain the limitation regarding file descriptors held open by processes. On FreeBSD, however, the VFS will disassociate file descriptors from the filesystem, so it can work in that scenario. I believe this also applies to illumos, but I haven't tested there.

Contributor

@behlendorf behlendorf left a comment

@wca since this change modifies the libzfs library ABI you're going to need to generate a new lib/libzfs/libzfs.abi file and include it in this PR. We added this infrastructure to make sure we never accidentally change the ABI and it's why the style CI bots failed.

The good news is it's straightforward to do. On Linux (Debian and Ubuntu) you just need to install the abigail-tools package, and then invoke the storeabi Makefile target in the lib/libzfs/ directory. That will generate a new libzfs.abi file you can add to the commit. For the benefit of any reviewers, can you please also mention in the commit message or PR description which interfaces have changed.

cd lib/libzfs/
make storeabi

Looking over the test results, I see that everything pretty much passed with the exception of these two persistent ARC tests. According to the logs it seems that after an export/import cycle there weren't any cache hits when normally there should have been. That would make sense to me if the pool has been force exported, but neither of these pools use the force flag. It seems like we should be waiting for some inflight l2arc writes in the normal export case and we're not.

    FAIL l2arc/persist_l2arc_004_pos (expected PASS)
    FAIL l2arc/persist_l2arc_005_pos (expected PASS)

@wca
Contributor Author

wca commented Feb 15, 2021

@wca since this change modifies the libzfs library ABI you're going to need to generate a new lib/libzfs/libzfs.abi file and include it in this PR. We added this infrastructure to make sure we never accidentally change the ABI and it's why the style CI bots failed.

The good news is it's straightforward to do. On Linux (Debian and Ubuntu) you just need to install the abigail-tools package, and then invoke the storeabi Makefile target in the lib/libzfs/ directory. That will generate a new libzfs.abi file you can add to the commit. For the benefit of any reviewers, can you please also mention in the commit message or PR description which interfaces have changed.

cd lib/libzfs/
make storeabi

Thanks for the info, I never handled this previously. The libzfs.abi delta seems pretty big, is that normal?

Looking over the test results, I see that everything pretty much passed with the exception of these two persistent ARC tests. According to the logs it seems that after an export/import cycle there weren't any cache hits when normally there should have been. That would make sense to me if the pool has been force exported, but neither of these pools use the force flag. It seems like we should be waiting for some inflight l2arc writes in the normal export case and we're not.

    FAIL l2arc/persist_l2arc_004_pos (expected PASS)
    FAIL l2arc/persist_l2arc_005_pos (expected PASS)

I've looked into these failures for a while, and I'm still not sure why they occur. I agree that the results appear to show no cache hits after an export/import cycle, when there should be some.

The only thing directly related that's modified in this PR is the addition of l2arc_spa_rebuild_stop, which I think is needed anyway (ie, we could probably factor it out of this PR).

But these tests still fail even if I change the three 'exiting' functions to return B_FALSE, and remove that call in spa_export_common. But all of the behavior modifications can only occur when hardforce=B_TRUE is passed to spa_export_common, or when a filesystem is force unmounted, neither of which applies to these tests...
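For readers following along, the 'exiting' predicates referenced here appear in hunks quoted later in this thread with roughly these shapes; this is a sketch assembled from that usage, not verbatim headers:

/* Sketch assembled from usage in this thread's diff hunks. */
boolean_t spa_exiting(spa_t *spa);           /* pool being force exported */
boolean_t spa_exiting_any(spa_t *spa);       /* any force export in progress */
boolean_t dmu_objset_exiting(objset_t *os);  /* objset force unmounted/exported */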

@behlendorf
Contributor

Thanks for the info, I never handled this previously. The libzfs.abi delta seems pretty big, is that normal?

I'm still pretty new to it myself. But from what I've seen so far the answer is "yes". But that's not really a problem since it's a generated file and we only update it when we're knowingly making ABI changes.

I've looked into these failures for a while, and I'm still not sure why they occur. I agree that the results appear to show no cache hits after an export/import cycle, when there should be some.

Oddly I wasn't able to reproduce these locally in a VM so it may be some kind of race. I'll give it another try later. But if you're able to easily reproduce it I'd suggest taking a look at the l2_rebuild_* arcstats to see how much data was rebuilt. Maybe the new l2arc_spa_rebuild_stop call in spa_export_common is responsible and we're not writing the log blocks. It looks to me like it is called regardless of whether the hardforce option is passed or not.

@behlendorf
Contributor

@wca would you mind rebasing this again on the latest master. There are minor conflicts to resolve and you'll need to generate a new libzfs.abi file; the recent "compatibility" feature PR (#11468) also changed the ABI a little. I'll see if I can reproduce the test failures; if we can get them resolved we can get this PR wrapped up and merged.

@behlendorf
Contributor

@wca it looks like there's still at least one path in the force export code which needs to be updated. This was caught by the CI

http://build.zfsonlinux.org/builders/CentOS%208%20x86_64%20%28TEST%29/builds/2902/steps/shell_4/logs/console

@robn
Member

robn commented May 3, 2023

I was playing with 801a440 today.

On a fresh pool with no issues, export -F is slow.

# zpool create -f -O atime=off sandisk /dev/sdb
# time zpool export -F sandisk

real	0m20.220s
user	0m0.006s
sys	0m0.020s

(export without -F is sub-second).

If I follow this with an import and another force export, I get a panic:

# zpool import sandisk
# time zpool export -F sandisk

Message from syslogd@fitlet at May  3 11:29:38 ...
 kernel:[  620.886063] VERIFY3(zap_lookup_int_key(mos, spacemap_zap, txg, &sm_obj) == ENOENT) failed (5 == 2)

Message from syslogd@fitlet at May  3 11:29:38 ...
 kernel:[  620.886086] PANIC at spa_log_spacemap.c:983:spa_generate_syncing_log_sm()
[  620.886063] VERIFY3(zap_lookup_int_key(mos, spacemap_zap, txg, &sm_obj) == ENOENT) failed (5 == 2)
[  620.886086] PANIC at spa_log_spacemap.c:983:spa_generate_syncing_log_sm()
[  620.886094] Showing stack for process 2149
[  620.886103] CPU: 2 PID: 2149 Comm: txg_sync Tainted: P           OE     5.10.0-22-amd64 #1 Debian 5.10.178-3
[  620.886106] Hardware name: Compulab fitlet2/fitlet2, BIOS FLT2.0.46.01.00 09/17/2018
[  620.886109] Call Trace:
[  620.886132]  dump_stack+0x6b/0x83
[  620.886169]  spl_panic+0xd4/0xfc [spl]
[  620.886632]  ? zap_lookup_norm+0x59/0xd0 [zfs]
[  620.887064]  ? zap_lookup+0x12/0x20 [zfs]
[  620.887477]  ? zap_lookup_int_key+0x5d/0x80 [zfs]
[  620.887894]  spa_generate_syncing_log_sm+0x225/0x2c0 [zfs]
[  620.888313]  spa_flush_metaslabs+0xad/0x370 [zfs]
[  620.888321]  ? _cond_resched+0x16/0x50
[  620.888749]  spa_sync_iterate_to_convergence+0x165/0x310 [zfs]
[  620.889173]  spa_sync+0x319/0x910 [zfs]
[  620.889585]  txg_sync_thread+0x277/0x3d0 [zfs]
[  620.889998]  ? txg_completion_notify+0xf0/0xf0 [zfs]
[  620.890032]  thread_generic_wrapper+0x78/0xb0 [spl]
[  620.890061]  ? spl_assert.constprop.0+0x20/0x20 [spl]
[  620.890069]  kthread+0x11b/0x140
[  620.890075]  ? __kthread_bind_mask+0x60/0x60
[  620.890084]  ret_from_fork+0x22/0x30

I notice that it seems less likely to happen if I wait a while between import and export. While trying to reduce this to a reliable test, I got a different crash:

[  405.339332] VERIFY(db->db_level != 0 || db->db_state == DB_CACHED || db->db_state == DB_FILL || db->db_state == DB_NOFILL) failed
[  405.339342] PANIC at dbuf.c:2264:dbuf_dirty()
[  405.339345] Showing stack for process 2692
[  405.339349] CPU: 0 PID: 2692 Comm: txg_sync Tainted: P           OE     5.10.0-22-amd64 #1 Debian 5.10.178-3
[  405.339351] Hardware name: Compulab fitlet2/fitlet2, BIOS FLT2.0.46.01.00 09/17/2018
[  405.339352] Call Trace:
[  405.339367]  dump_stack+0x6b/0x83
[  405.339385]  spl_panic+0xd4/0xfc [spl]
[  405.339393]  ? __wake_up_common_lock+0x8a/0xc0
[  405.339592]  ? dmu_tx_dirty_buf+0x29/0x390 [zfs]
[  405.339758]  spl_assert+0x17/0x20 [zfs]
[  405.339915]  dbuf_dirty+0xc66/0x12f0 [zfs]
[  405.340090]  ? zio_wait+0x2d5/0x530 [zfs]
[  405.340241]  dmu_write_impl+0x46/0x140 [zfs]
[  405.340400]  dmu_write+0x95/0xf0 [zfs]
[  405.340570]  space_map_write_intro_debug+0xab/0xe0 [zfs]
[  405.340744]  space_map_write_impl+0x49/0x2d0 [zfs]
[  405.340892]  ? dbuf_find_dirty_lte+0x14/0x40 [zfs]
[  405.341057]  space_map_write+0xb6/0x1e0 [zfs]
[  405.341222]  metaslab_flush+0x1a3/0x5e0 [zfs]
[  405.341399]  spa_flush_metaslabs+0x143/0x370 [zfs]
[  405.341564]  spa_sync_iterate_to_convergence+0x165/0x310 [zfs]
[  405.341737]  spa_sync+0x319/0x910 [zfs]
[  405.341904]  txg_sync_thread+0x277/0x3d0 [zfs]
[  405.342070]  ? txg_completion_notify+0xf0/0xf0 [zfs]
[  405.342087]  thread_generic_wrapper+0x78/0xb0 [spl]
[  405.342099]  ? spl_assert.constprop.0+0x20/0x20 [spl]
[  405.342103]  kthread+0x11b/0x140
[  405.342106]  ? __kthread_bind_mask+0x60/0x60
[  405.342110]  ret_from_fork+0x22/0x30

If I have time later today I'll try to dig into both the slowness and the crashes.

@@ -1469,6 +1469,8 @@ typedef enum zfs_ioc {
 	ZFS_IOC_USERNS_DETACH = ZFS_IOC_UNJAIL,	/* 0x86 (Linux) */
 	ZFS_IOC_SET_BOOTENV,			/* 0x87 */
 	ZFS_IOC_GET_BOOTENV,			/* 0x88 */
+	ZFS_IOC_HARD_FORCE_UNMOUNT_BEGIN,	/* 0x89 (Linux) */
+	ZFS_IOC_HARD_FORCE_UNMOUNT_END,		/* 0x90 (Linux) */
Member

0x8a :)
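That is, with ZFS_IOC_GET_BOOTENV at 0x88, the new entries would presumably read:

ZFS_IOC_HARD_FORCE_UNMOUNT_BEGIN,	/* 0x89 (Linux) */
ZFS_IOC_HARD_FORCE_UNMOUNT_END,		/* 0x8a (Linux) */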

@robn
Member

robn commented May 3, 2023

More playing. Here's a heavy write+fsync load, on a relatively slow device (a USB-SATA SSD, pushes about 12MB/s).

# zpool create -f -O atime=off sandisk /dev/sdb
# dd if=/dev/random of=/sandisk/file bs=64K oflag=sync status=progress

Then:

# zpool export -F sandisk

This panics instantly:

[ 1664.912436] VERIFY(zio->io_error == 0 || (zio->io_flags & ZIO_FLAG_CANFAIL) || zio->io_spa->spa_export_initiator != NULL) failed
[ 1664.912449] PANIC at zio.c:4948:zio_done()
[ 1664.912452] Showing stack for process 662
[ 1664.912457] CPU: 3 PID: 662 Comm: z_cl_iss Tainted: P           OE     5.10.0-22-amd64 #1 Debian 5.10.178-3
[ 1664.912458] Hardware name: Compulab fitlet2/fitlet2, BIOS FLT2.0.46.01.00 09/17/2018
[ 1664.912460] Call Trace:
[ 1664.912473]  dump_stack+0x6b/0x83
[ 1664.912490]  spl_panic+0xd4/0xfc [spl]
[ 1664.912497]  ? __kmalloc_node+0x141/0x2b0
[ 1664.912507]  ? spl_kmem_alloc_impl+0xb0/0xd0 [spl]
[ 1664.912517]  ? spl_kmem_alloc_impl+0xb0/0xd0 [spl]
[ 1664.912689]  ? fletcher_2_native+0x1b/0x30 [zfs]
[ 1664.912836]  ? arc_hdr_verify+0xa8/0x250 [zfs]
[ 1664.912839]  ? _cond_resched+0x16/0x50
[ 1664.913002]  spl_assert+0x17/0x20 [zfs]
[ 1664.913172]  zio_done+0x10eb/0x1c20 [zfs]
[ 1664.913341]  zio_reexecute+0x46d/0x690 [zfs]
[ 1664.913507]  zio_reexecute+0x2a4/0x690 [zfs]
[ 1664.913667]  zio_reexecute+0x2a4/0x690 [zfs]
[ 1664.913833]  zio_reexecute+0x2a4/0x690 [zfs]
[ 1664.914012]  zio_reexecute+0x2a4/0x690 [zfs]
[ 1664.914171]  zio_reexecute+0x2a4/0x690 [zfs]
[ 1664.914331]  zio_reexecute+0x2a4/0x690 [zfs]
[ 1664.914494]  zio_reexecute+0x2a4/0x690 [zfs]
[ 1664.914653]  zio_reexecute+0x2a4/0x690 [zfs]
[ 1664.914812]  zio_reexecute+0x2a4/0x690 [zfs]
[ 1664.914972]  zio_reexecute+0x2a4/0x690 [zfs]
[ 1664.914988]  taskq_thread+0x201/0x440 [spl]
[ 1664.914993]  ? wake_up_q+0xa0/0xa0
[ 1664.915174]  ? zio_deadman_impl+0x310/0x310 [zfs]
[ 1664.915186]  ? taskq_lowest_id+0xc0/0xc0 [spl]
[ 1664.915189]  kthread+0x11b/0x140
[ 1664.915192]  ? __kthread_bind_mask+0x60/0x60
[ 1664.915196]  ret_from_fork+0x22/0x30

I'm not totally sure how we get here. I've seen reexecute storms on EIO, and we are forcing returning zios to fail like that. They might be log writes, with a long chain of children. I wanted to look more but I ran out of time this afternoon.

@robn
Member

robn commented May 3, 2023

It is log writes.

dd is waiting in the ZIL fallback path:

[ 1209.435946] task:dd              state:D stack:    0 pid:  825 ppid:   691 flags:0x00004000
[ 1209.435955] Call Trace:
[ 1209.435965]  __schedule+0x282/0x870
[ 1209.435973]  schedule+0x46/0xb0
[ 1209.435978]  io_schedule+0x42/0x70
[ 1209.436009]  cv_wait_common+0x103/0x290 [spl]
[ 1209.436027]  ? add_wait_queue_exclusive+0x70/0x70
[ 1209.436494]  txg_wait_synced_tx+0x1df/0x370 [zfs]
[ 1209.436998]  zil_commit_impl+0x92/0xa0 [zfs]
[ 1209.437451]  zil_commit+0x14b/0x230 [zfs]
[ 1209.437878]  zfs_write+0xa24/0xd80 [zfs]
[ 1209.437900]  ? chacha_block_generic+0x6f/0xb0
[ 1209.438354]  zpl_iter_write+0xe7/0x130 [zfs]
[ 1209.438382]  ? aa_file_perm+0x113/0x480
[ 1209.438393]  new_sync_write+0x11c/0x1b0
[ 1209.438408]  vfs_write+0x1ce/0x260
[ 1209.438416]  ksys_write+0x5f/0xe0
[ 1209.438432]  do_syscall_64+0x33/0x80
[ 1209.438440]  entry_SYSCALL_64_after_hwframe+0x61/0xc6

I added this patch:

commit 2ccf55efab95156271db3c77e7f8bdcd2ba0f1b4
Author: Rob Norris <robn@despairlabs.com>
Date:   Wed May 3 19:58:34 2023 +1000

    forced-export: handle forced-export during ZIL failure.

diff --git module/zfs/zil.c module/zfs/zil.c
index 2538ffbe4..0523a336b 100644
--- module/zfs/zil.c
+++ module/zfs/zil.c
@@ -2594,7 +2594,8 @@ zil_commit_writer_stall(zilog_t *zilog)
 	 */
 	ASSERT(MUTEX_HELD(&zilog->zl_issuer_lock));
 	txg_wait_synced(zilog->zl_dmu_pool, 0);
-	ASSERT3P(list_tail(&zilog->zl_lwb_list), ==, NULL);
+	ASSERT(list_is_empty(&zilog->zl_lwb_list) ||
+	    spa_exiting(zilog->zl_spa));
 }
 
 /*
@@ -3413,7 +3414,7 @@ zil_commit_impl(zilog_t *zilog, uint64_t foid)
 	zil_commit_writer(zilog, zcw);
 	zil_commit_waiter(zilog, zcw);
 
-	if (zcw->zcw_zio_error != 0) {
+	if (zcw->zcw_zio_error != 0 && !dmu_objset_exiting(zilog->zl_os)) {
 		/*
 		 * If there was an error writing out the ZIL blocks that
 		 * this thread is waiting on, then we fallback to

It gets some of the way there, but there are a lot of txg_wait_synced() blocks in zil_commit() and a lot of assertions about the state of the LWB lists. I won't chase that further tonight.

@oshogbo
Contributor

oshogbo commented May 10, 2023

Small update.
I have applied patches from @robn. Thank you for the patches and testing.
I have fixed the arc issue that we had seen before.
However, I am still fighting with bugs during force export while syncing.

Currently, the issue is with spa_log_sm_increment_current_mscount:

[  835.032540] VERIFY3(last_sls->sls_txg == spa_syncing_txg(spa)) failed (22 == 23)
[  835.032737] PANIC at spa_log_spacemap.c:570:spa_log_sm_increment_current_mscount()
[  835.032855] Showing stack for process 4062
[  835.032878] CPU: 1 PID: 4062 Comm: txg_sync Tainted: P           OE     5.15.0-53-generic #59-Ubuntu
[  835.032885] Hardware name: FreeBSD BHYVE/BHYVE, BIOS 13.0 11/10/2020
[  835.032891] Call Trace:
[  835.032898]  <TASK>
[  835.032917]  show_stack+0x52/0x5c
[  835.032972]  dump_stack_lvl+0x4a/0x63
[  835.033010]  dump_stack+0x10/0x16
[  835.033013]  spl_dumpstack+0x29/0x2f [spl]
[  835.033064]  spl_panic+0xd1/0xe9 [spl]
[  835.033071]  ? avl_find+0x69/0xe0 [zfs]
[  835.033169]  ? spa_log_sm_decrement_mscount+0x45/0xf0 [zfs]
[  835.033279]  spa_log_sm_increment_current_mscount+0x66/0x80 [zfs]
[  835.033393]  metaslab_unflushed_bump+0x1a6/0x390 [zfs]
[  835.033500]  metaslab_flush_update+0x97/0x100 [zfs]
[  835.033607]  metaslab_flush+0x2f5/0x760 [zfs]
[  835.033714]  spa_flush_metaslabs+0x3c2/0x760 [zfs]
[  835.033828]  spa_sync+0x8b5/0x1b00 [zfs]
[  835.033936]  ? spa_txg_history_init_io+0xe7/0x110 [zfs]
[  835.034043]  txg_sync_thread+0x2f1/0x5a0 [zfs]
[  835.034149]  ? txg_completion_notify+0x110/0x110 [zfs]
[  835.034255]  thread_generic_wrapper+0x6f/0xb0 [spl]
[  835.034262]  ? spl_taskq_fini+0x80/0x80 [spl]
[  835.034268]  kthread+0x12a/0x150
[  835.034307]  ? set_kthread_struct+0x50/0x50
[  835.034309]  ret_from_fork+0x22/0x30
[  835.034332]  </TASK>

@behlendorf
Contributor

@oshogbo when you get a chance can you please rebase so we can get an updated CI run.

@oshogbo
Contributor

oshogbo commented May 19, 2023

@behlendorf done

@oshogbo oshogbo force-pushed the forced-export branch 2 times, most recently from 2b6f248 to aef5e28 Compare June 7, 2023 16:54
This is primarily of use when a pool has lost its disk, while the user
doesn't care about any pending (or otherwise) transactions.

Implement various control methods to make this feasible:
- txg_wait can now take a NOSUSPEND flag, in which case the caller will
  be alerted if their txg can't be committed.  This is primarily of
  interest for callers that would normally pass TXG_WAIT, but don't want
  to wait if the pool becomes suspended, which allows unwinding in some
  cases, specifically when one is attempting a non-forced export.
  Without this, the non-forced export would preclude a forced export
  by virtue of holding the namespace lock indefinitely.
- txg_wait also returns failure for TXG_WAIT users if a pool is actually
  being force exported.  Adjust most callers to tolerate this.
- spa_config_enter_flags now takes a NOSUSPEND flag to the same effect.
- DMU objset initiator which may be set on an objset being forcibly
  exported / unmounted.
- SPA export initiator may be set on a pool being forcibly exported.
- DMU send/recv now use an interruption mechanism which relies on the
  SPA export initiator being able to enumerate datasets and closing any
  send/recv streams, causing their EINTR paths to be invoked.
- ZIO now has a cancel entry point, which tells all suspended zios to
  fail, and which suppresses the failures for non-CANFAIL users.
- metaslab, etc. cleanup, which consists of simply throwing away any
  changes that were not able to be synced out.
- Linux specific: introduce a new tunable,
  zfs_forced_export_unmount_enabled, which allows the filesystem to
  remain in a modified 'unmounted' state upon exiting zpl_umount_begin,
  to achieve parity with FreeBSD and illumos,
  which have VFS-level support for yanking filesystems out from under
  users.  However, this only helps when the user is actively performing
  I/O, while not sitting on the filesystem.  In particular, this allows
  test #3 below to pass on Linux.
- Add basic logic to zpool to indicate a force-exporting pool, instead
  of crashing due to lack of config, etc.

Add tests which cover the basic use cases:
- Force export while a send is in progress
- Force export while a recv is in progress
- Force export while POSIX I/O is in progress

This change modifies the libzfs ABI:
- New ZPOOL_STATUS_FORCE_EXPORTING zpool_status_t enum value.
- New field libzfs_force_export for libzfs_handle.

Co-Authored-by: Will Andrews <will@firepipe.net>
Co-Authored-by: Allan Jude <allan@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Catalogics, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes openzfs#3461
Signed-off-by: Will Andrews <will@firepipe.net>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
@@ -1441,12 +1460,16 @@ zfsvfs_teardown(zfsvfs_t *zfsvfs, boolean_t unmounting)
 		}
 	}
 	if (!zfs_is_readonly(zfsvfs) && os_dirty) {
-		txg_wait_synced(dmu_objset_pool(zfsvfs->z_os), 0);
+		(void) txg_wait_synced_tx(dmu_objset_pool(zfsvfs->z_os), 0,
+		    NULL, wait_flags);
Member

Cosmetics, but here and in other places txg_wait_synced_tx() without tx makes no sense. It should be txg_wait_synced_flags().

-txg_wait_synced_impl(dsl_pool_t *dp, uint64_t txg, boolean_t wait_sig)
+int
+txg_wait_synced_tx(dsl_pool_t *dp, uint64_t txg, dmu_tx_t *tx,
+    txg_wait_flag_t flags)
Member

This API looks confusing to me, receiving both tx and txg. Given a tx I would expect it to wait for tx_txg, but instead it uses tx only to get tx_objset from it. Looking through the patch I found two places where a tx is submitted, but in neither case does it wait for the actual tx to be committed, only for a fairly abstract txg, and the tx is only used to check dmu_objset_exiting(). This makes me wonder: why should txg_wait_synced() ever care about objset unmount progress? I would understand exiting on a forced pool export, but that would not require the objset; the already available pool argument would be enough for that.
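Dropping the tx parameter as suggested would presumably leave a signature like this (hypothetical sketch, not code from the PR):

/* Hypothetical: the pool argument alone is enough to detect forced export. */
int
txg_wait_synced_flags(dsl_pool_t *dp, uint64_t txg, txg_wait_flag_t flags);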

@@ -6763,7 +6763,6 @@ arc_write_done(zio_t *zio)
 			arc_access(hdr, 0, B_FALSE);
 			mutex_exit(hash_lock);
 		} else {
-			arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
 			VERIFY3S(remove_reference(hdr, hdr), >, 0);
Member

ARC_FLAG_IO_IN_PROGRESS has its own reference, which is dropped here. Would you explain why you are leaving the flag but dropping the reference?


-	if (!(zfs_flags & ZFS_DEBUG_DBUF_VERIFY))
+	if (!(zfs_flags & ZFS_DEBUG_DBUF_VERIFY) ||
+	    dmu_objset_exiting(db->db_objset))
Member

Either here or in the next chunk, dmu_objset_exiting() makes no sense. And here it looks bad, as if we are saying "dbuf state is totally insane and we do not care", which should not be so.

	VERIFY0(zap_increment(os, DMU_USERUSED_OBJECT,
	    uqn->uqn_id, uqn->uqn_delta, tx));
	mutex_exit(&os->os_userused_lock);
	if (!dmu_objset_exiting(os)) {
Member

This and the following group checks look excessive, considering the check in VERIFY() below. They may only be needed if there are still some problematic cases inside, but then this is only an ugly and probably unreliable workaround.


	if ((flags & SCL_FLAG_TRYENTER) != 0)
		error = SET_ERROR(EAGAIN);
	if (error == 0 && ((flags & SCL_FLAG_NOSUSPEND) != 0)) {
Member

This would be else if (...

@@ -511,28 +496,54 @@ spa_config_enter_impl(spa_t *spa, int locks, const void *tag, krw_t rw,
 	mutex_enter(&scl->scl_lock);
 	if (rw == RW_READER) {
 		while (scl->scl_writer ||
-		    (!mmp_flag && scl->scl_write_wanted)) {
+		    ((flags & SCL_FLAG_MMP) && scl->scl_write_wanted)) {
+			error = spa_config_eval_flags(spa, flags);
Member

This code is already congested sometimes, and here you are adding another global spa_suspend_lock acquisition. I am not happy.

In addition, I am not sure why a locking primitive should care about pool suspension at all. Why couldn't the lock be regularly acquired, and dropped when the respective protected operation fails? It feels like a workaround to me.

@@ -3505,7 +3531,7 @@ zil_commit_impl(zilog_t *zilog, uint64_t foid)
 	zil_commit_writer(zilog, zcw);
 	zil_commit_waiter(zilog, zcw);
 
-	if (zcw->zcw_zio_error != 0) {
+	if (zcw->zcw_zio_error != 0 && !dmu_objset_exiting(zilog->zl_os)) {
Member

@amotin amotin Jun 15, 2023

This does not feel right to me. The error may be valid, and as far as I can see txg_wait_synced() should exit normally in the case of a forced export.

	    (u_longlong_t)txg);
	if (txg < spa_freeze_txg(zilog->zl_spa))
		VERIFY(!zilog_is_dirty(zilog));
	if (!dmu_objset_exiting(zilog->zl_os)) {
Member

Again, why should objset unmount (possibly on a healthy pool) affect its ZIL operation? Wouldn't patching the VERIFY() below be sufficient? Other parts should not care.

@@ -2308,10 +2308,13 @@ zio_wait(zio_t *zio)
 	__zio_execute(zio);
 
 	mutex_enter(&zio->io_lock);
-	while (zio->io_executor != NULL) {
+	while (zio->io_executor != NULL && !spa_exiting_any(zio->io_spa)) {
Member

@amotin amotin Jun 15, 2023

How can you exit here and call zio_destroy() below before ZIO processing is officially complete? Is there anything to prevent a use-after-free when the I/O finally unblocks and tries to complete?
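In outline, the hazard being asked about would be the following interleaving (hypothetical, for illustration only):

/*
 * zio_wait() thread                    blocked I/O
 * -----------------                    -----------
 * spa_exiting_any() becomes true,
 * loop exits with io_executor != NULL
 * zio_destroy(zio)                     ...device unblocks...
 *                                      zio_done(zio)  <- use-after-free?
 */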

@mailinglists35

any chance completing this PR can be prioritized?

@mailinglists35

mailinglists35 commented Nov 30, 2023

@oshogbo any chance you can respond to what GitHub calls "unresolved conversations"? I mean the code review remarks from other people.

@allanjude
Contributor

We are continuing to investigate issues with this feature in production to improve the pull request.

@mailinglists35

We are continuing to investigate issues with this feature in production to improve the pull request.

But guys, you have lots of unanswered code reviews here. Can you address those as well?

@nerozero

Any progress?

@mailinglists35

mailinglists35 commented Jul 14, 2024

@nerozero it looks to me like a combination of "it wasn't designed to handle this" and "you're just a minority, so there are no resources to fix it".

Just go with the dm-error workaround, where you can safely kick the physical device out and put it back in. I've totally lost hope of this ever happening natively.

@allanjude
Contributor

While we are still actively working on this issue, priority is being given to related work to improve the safety of ZFS in the face of device failures, and to make some situations more recoverable, to avoid the need to do a forced export.

I am sorry if the pace of progress is not to your liking.

@takeda

takeda commented Jul 15, 2024

While we are still actively working on this issue, priority is being given to related work to improve the safety of ZFS in the face of device failures, and to make some situations more recoverable, to avoid the need to do a forced export.

Perhaps some of the frustration comes from the different behaviors on different systems; to some people this is a bigger issue than to others.

Based on some of the responses and suggested workarounds, I have a feeling that's the case. For example, I saw responses where somebody was able to recover from the issue by performing some operations with dm and then invoking zpool clear twice. I also saw a response where another person mentioned that this only happens if the device reconnects but gets a different name than the original one. Which is still annoying, but totally understandable behavior.

In my case on FreeBSD 14.1-p2, when I reconnect the device it appears under the same name as before (I also attached it using a GPT label, to prevent issues if the device name were to change), and the device appears to be fully accessible: I can interact with it, for example list partitions or call smartctl on it.

My problem is that invoking zpool clear <pool>, which the help linked from zpool status and also the man page suggest doing:

             wait      Blocks all I/O access until the device connectivity is
                       recovered and the errors are cleared with zpool clear.
                       This is the default behavior.

doesn't work. It tells me that operations are blocked because the pool is in a waiting state, or something like that.

The only way I know so far to restore access to the drive is to reboot the system, which is frustrating; it feels like a bug. I'm wondering whether this is an implementation issue in FreeBSD, or whether that's how it works for everyone. Could someone confirm whether this also happens on other systems? Perhaps I need to open a bug with FreeBSD.

@raimocom

While we are still actively working on this issue, priority is being given to related work to improve the safety of ZFS in the face of device failures, and to make some situations more recoverable, to avoid the need to do a forced export.

I am sorry if the pace of progress is not to your liking.

So to my understanding, you are focusing on increasing the robustness of ZFS in the case of sudden disconnects/reconnects of storage devices? Does this mean ZFS then auto-resumes its operation when the storage device becomes available again? Is that the aim?

@Haravikk

While we are still actively working on this issue, priority is being given to related work to improve the safety of ZFS in the face of device failures, and to make some situations more recoverable, to avoid the need to do a forced export.

Thanks for the update! I haven't been following a lot of the current development; are there any pull requests or issues in particular that are tracking this related work on recoverability?

Labels
Status: Code Review Needed (ready for review and testing) · Status: Revision Needed (changes are required for the PR to be accepted)

Successfully merging this pull request may close these issues.

zpool commands block when a disk goes missing / pool suspends