
Support forced export of ZFS pools #11082

Open · wants to merge 1 commit into master from forced-export

Conversation

wca
Contributor

@wca wca commented Oct 17, 2020

Motivation and Context

This change enables users to forcibly export a pool in a manner which safely discards all pending dirty data, transaction groups, associated zios, metaslab changes, etc. It allows a regular zpool export to fail out so that it releases the namespace lock to enable a forced export. It is able to do this regardless of whether the current pool activity involves send, receive, or POSIX I/O.

This allows users to continue using their system without rebooting, in the event the disk cannot be recovered in an online manner, or the user prefers to resume use of their pool at a later time. Since ZFS can simply resume from the most recent consistent transaction group, the latter is easily achieved.

Closes #3461

Description

This is primarily of use when a pool has lost its disk, while the user
doesn't care about any pending (or otherwise) transactions.

Implement various control methods to make this feasible:

  • txg_wait can now take a NOSUSPEND flag, in which case the caller will be
    alerted if their txg can't be committed. This is primarily of interest
    for callers that would normally pass TXG_WAIT, but don't want to wait if
    the pool becomes suspended, which allows unwinding in some cases,
    specifically when one is attempting a non-forced export. Without this,
    the non-forced export would preclude a forced export by virtue of holding
    the namespace lock indefinitely. (A sketch of the caller-side pattern
    follows this list.)
  • txg_wait also returns failure for TXG_WAIT users if a pool is actually
    being force exported. Adjust most callers to tolerate this.
  • spa_config_enter_flags now takes a NOSUSPEND flag to the same effect.
  • DMU objset "killer" flag which may be set on an objset being forcibly
    exported / unmounted.
  • SPA "killer" flag which may be set on a pool being forcibly exported.
  • DMU send/recv now use an interruption mechanism which relies on the SPA
    killer being able to enumerate datasets and closing any send/recv streams,
    causing their EINTR paths to be invoked.
  • ZIO now has a cancel entry point, which tells all suspended zios to fail,
    and which suppresses the failures for non-CANFAIL users.
  • metaslab, etc. cleanup, which consists of simply throwing away any changes
    that were not able to be synced out.
  • Linux specific: introduce a new tunable, zfs_forced_export_unmount_enabled,
    which allows the filesystem to remain in a modified 'unmounted' state upon
    exiting zpl_umount_begin, to achieve parity with FreeBSD and illumos,
    which have VFS-level support for yanking filesystems out from under users.
    However, this only helps when the user is actively performing I/O, while
    not sitting on the filesystem. In particular, this allows test #3 below
    to pass on Linux.
  • Add basic logic to zpool to indicate a force-exporting pool, instead of
    crashing due to lack of config, etc.

Add tests which cover the basic use cases:

  • Force export while a send is in progress
  • Force export while a recv is in progress
  • Force export while POSIX I/O is in progress

How Has This Been Tested?

  • New ZFS Test Suite tests covering the three main pool activity scenarios.
  • Testing in a production environment that focuses around use of send/recv to a network-based disk, which can fail and not return "for a while". User doesn't mind losing the last few transaction groups, and is able to pick up later.
  • Existing ZFS Test Suite tests, to check for any unexpected breakage of non-forced-export scenarios.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

Items remaining on this checklist are pending review. I am seeking feedback on whether to break out any of the commits, rather than squashing all of them into the primary one.

@gdevenyi
Contributor

Does this address the various zpool/zfs commands blocking when the pool is in distress? i.e., will these commands succeed under such conditions, or get blocked?

@codecov

codecov bot commented Oct 17, 2020

Codecov Report

Patch coverage: 62.00% and project coverage change: +4.49% 🎉

Comparison is base (161ed82) 75.17% compared to head (f168cb6) 79.66%.

❗ Current head f168cb6 differs from pull request most recent head 0c5b2fa. Consider uploading reports for the commit 0c5b2fa to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #11082      +/-   ##
==========================================
+ Coverage   75.17%   79.66%   +4.49%     
==========================================
  Files         402      398       -4     
  Lines      128071   126235    -1836     
==========================================
+ Hits        96283   100571    +4288     
+ Misses      31788    25664    -6124     
Flag     Coverage Δ
kernel   80.24% <61.60%> (+1.48%) ⬆️
user     65.35% <40.72%> (+17.93%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files                  Coverage Δ
include/sys/dmu.h               100.00% <ø> (ø)
include/sys/dsl_dataset.h       100.00% <ø> (ø)
module/zfs/vdev_initialize.c     98.10% <ø> (+4.58%) ⬆️
module/zfs/vdev_label.c          92.22% <0.00%> (+1.35%) ⬆️
module/zfs/metaslab.c            94.52% <12.50%> (+1.91%) ⬆️
module/zfs/dsl_dataset.c         90.82% <18.75%> (-0.44%) ⬇️
module/zfs/vdev_rebuild.c        90.75% <31.57%> (+13.09%) ⬆️
module/zfs/dmu_send.c            84.42% <33.33%> (+5.04%) ⬆️
module/zfs/vdev_removal.c        96.50% <33.33%> (+3.79%) ⬆️
module/zfs/vdev_trim.c           93.50% <38.46%> (+17.50%) ⬆️
... and 33 more

... and 194 files with indirect coverage changes


@behlendorf behlendorf added the "Status: Work in Progress" (not yet ready for general review) label Oct 22, 2020
@wca
Contributor Author

wca commented Oct 26, 2020

Ran into #9067 on the last full ZTS run I did locally, but it looks better now.

@wca
Contributor Author

wca commented Oct 26, 2020

Does this address the various zpool/zfs commands blocking when the pool is in distress? i.e., will these commands succeed under such conditions, or get blocked?

It depends on the specific issue. Most of the time, if a pool is suspended, zpool/zfs commands simply fail and get kicked out. zpool export in particular will now either exit early if not hard-forced, or (with -F) forcibly drop all context and export the pool. Before, a non -F export would hang forever while holding the namespace lock, blocking everything that depends on it.

There may be other cases that hold the namespace lock and get stuck on a suspension, but they can be fixed in a follow-up.
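To make that flow concrete, here is a schematic sketch of the export decision, assuming spa_suspended() as the suspension check; this is an illustration of the behavior described above, not the PR's actual spa_export_common():

/* Illustration only, not the PR's spa_export_common(). */
static int
example_export_decision(spa_t *spa, boolean_t hardforce)
{
	if (spa_suspended(spa) && !hardforce) {
		/* Fail out early so the namespace lock is released. */
		return (SET_ERROR(EAGAIN));
	}

	/* Hard force: discard dirty data and pending zios, tear down. */
	return (0);
}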

Review threads: cmd/zpool/zpool_main.c, include/os/linux/spl/sys/kmem_cache.h, include/sys/spa.h (3), module/os/linux/zfs/zpl_super.c, man/man8/zpool-export.8, module/zfs/spa.c
@behlendorf
Contributor

@WCI @allanjude sorry about the delay in getting back to this. Thanks for addressing my initial feedback. I should be able to further dig in to the PR this week and get you some additional comments and testing.

As part of that work I took the liberty of getting this PR rebased on the latest master, resolving the relevant conflicts, and squashing a few commits. I appreciate how you originally kept the commits separate for the benefit of the reviewers, but I'm not sure keeping the individual review fixes separate is really helpful at this point. I didn't make any changes to the real functionality. You can find the changes in my forced-export-squashed branch. The squashed branch looks like this:

48f1e24 zfs: support force exporting pools
a107a70 zpool_export: make test suite more usable
d07a310 check_prop_source: improve output clarity
07f9ca4 zfs_receive: remove duplicate check_prop_source
9d64ea4 logapi: cat output file instead of printing
6e659d8 spa_export_common: refactor common exit points

For next steps what I'd suggest is:

  1. Open a new PR for each bug fix which isn't really core to this feature. There's no reason we can't get those changes integrated right away as long as they pass testing and are entirely independent. This would be a107a70, d07a310, 07f9ca4, 9d64ea4, and 6e659d8. If there are other unrelated hunks we should pull out into their own PRs, that'd be nice too.

  2. Please review what I've done and go ahead and force update this PR with 48f1e24 (or a modified version of it you're happy with). This way we can get a fresh run from the CI and see where we stand. It looks like the current version did encounter at least one panic with a full ZTS run. That new commit passes the new test cases locally for me on Linux and sanity.run, but so far that's all I've tried.

Contributor

@behlendorf behlendorf left a comment

A few more comments while I'm working my way through this.

Review threads: lib/libzfs/libzfs_mount.c, lib/libzfs/libzfs_dataset.c (3), module/os/linux/zfs/zpl_super.c, man/man8/zpool-export.8 (2), include/sys/dmu_objset.h, include/sys/spa_impl.h, module/os/linux/zfs/zfs_vfsops.c
@wca
Contributor Author

wca commented Jan 25, 2021

I just pushed an update that includes @behlendorf suggested break-out, and responds to most of the comments. The earlier commits are now in these PRs: #11514 #11515 #11516 #11517 #11518.

Contributor

@behlendorf behlendorf left a comment

Thanks for updating this, I should be able to give it a more careful look later this week. In the meanwhile I went ahead and merged most of the smaller unrelated cleanups you opened PRs for to master, so you can rebase this on master any time and drop those commits.

I also noticed in the CI logs we hit at least one ASSERT during testing. Can you take a look at the following stack in these console logs:

http://build.zfsonlinux.org/builders/CentOS%20Stream%208%20x86_64%20%28TEST%29/builds/158/steps/shell_4/logs/console

Review threads: include/sys/zfs_refcount.h, lib/libzfs/libzfs_dataset.c, lib/libzfs/libzfs_mount.c
@behlendorf
Contributor

OK, #11514 #11515 #11516 #11517 #11518 have all been merged and can be dropped from this PR with a rebase.

@behlendorf
Contributor

behlendorf commented Feb 2, 2021

@WCI @wca when you get a chance to rebase this that would be great.

Member

@ahrens ahrens left a comment

Just took an initial look at this. How does it interact with force unmounting of filesystems? Does it mean that Linux can now support zfs unmount -f, forcing the unmount even if there are fd's open in the filesystem?

Review thread: include/sys/dmu.h
typedef enum {
	/* Reject the call with EINTR upon receiving a signal. */
	TXG_WAIT_F_SIGNAL = (1U << 0),
	/* Reject the call with EAGAIN upon suspension. */
Member

What's the difference between "suspension", "an exiting pool", and "forced export"?

Review thread: include/sys/zfs_refcount.h
in which all new I/Os fail, except for those required to unmount it.
Intended for users trying to forcibly export a pool even when I/Os are in
progress, without the need to find and stop them. This option does not
affect processes that are merely sitting on the filesystem, only those
Member

What does "sitting on" mean? Is this referring to having open file handles or the CWD (current working directory)? Does this mean that regardless of what zfs_forced_export_unmount is set to, we can unmount filesystems that have open fd's? Does that require the -f flag to zfs unmount/zfs destroy?

Contributor Author

No. For more detail see my reply to your top-level question.

.ad
.RS 12n
During forced unmount, leave the filesystem in a disabled mode of operation,
in which all new I/Os fail, except for those required to unmount it.
Member

Is the alternative (the default) that a forced unmount will fail if there are i/o's in progress? by "i/o's in progress" do we mean that there's a ZPL syscall (e.g. read/write) in progress with a zio_t outstanding?

Why would we not want the default to be zfs_forced_export_unmount=1? What's the downside?

Contributor Author

There's no downside, IMO, but Linux has a long-standing practice of disallowing this on other filesystems, so I created the tunable to conform to that.
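For reference, such a tunable is typically exposed through the ZFS_MODULE_PARAM macro; a minimal sketch, assuming the tunable name from this PR and a default of 0 to match the Linux convention just described:

/* Sketch: default off, matching Linux's usual refusal to force unmount. */
static int zfs_forced_export_unmount_enabled = 0;

ZFS_MODULE_PARAM(zfs, zfs_, forced_export_unmount_enabled, INT, ZMOD_RW,
	"Allow force-unmounted filesystems to enter a disabled state");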

@wca wca force-pushed the forced-export branch 3 times, most recently from 6a0280e to e5395d1 Compare February 6, 2021 17:41
@wca
Contributor Author

wca commented Feb 6, 2021

Just took an initial look at this. How does it interact with force unmounting of filesystems? Does it mean that Linux can now support zfs unmount -f, forcing the unmount even if there are fd's open in the filesystem?

On Linux, there's a limit to how far a filesystem can be forcibly unmounted, which is why I tried to explain the limitation regarding file descriptors held open by processes. On FreeBSD, however, the VFS will disassociate file descriptors from the filesystem, so it can work in that scenario. I believe this also applies to illumos, but I haven't tested there.

Contributor

@behlendorf behlendorf left a comment

@wca since this change modifies the libzfs library ABI you're going to need to generate a new lib/libzfs/libzfs.abi file and include it in this PR. We added this infrastructure to make sure we never accidentally change the ABI and it's why the style CI bots failed.

The good news is it's straightforward to do. On Linux (Debian and Ubuntu) you just need to install the abigail-tools package, and then invoke the storeabi Makefile target in the lib/libzfs/ directory. That will generate a new libzfs.abi file you can add to the commit. For the benefit of any reviewers, can you please also mention in the commit message or PR description which interfaces have changed.

cd lib/libzfs/
make storeabi

Looking over the test results, I see that everything pretty much passed with the exception of these two persistent ARC tests. According to the logs it seems that after an export/import cycle there weren't any cache hits when normally there should have been. That would make sense to me if the pool has been force exported, but neither of these pools use the force flag. It seems like we should be waiting for some inflight l2arc writes in the normal export case and we're not.

    FAIL l2arc/persist_l2arc_004_pos (expected PASS)
    FAIL l2arc/persist_l2arc_005_pos (expected PASS)

@wca
Contributor Author

wca commented Feb 15, 2021

@wca since this change modifies the libzfs library ABI you're going to need to generate a new lib/libzfs/libzfs.abi file and include it in this PR. We added this infrastructure to make sure we never accidentally change the ABI and it's why the style CI bots failed.

The good news is it's straightforward to do. On Linux (Debian and Ubuntu) you just need to install the abigail-tools package, and then invoke the storeabi Makefile target in the lib/libzfs/ directory. That will generate a new libzfs.abi file you can add to the commit. For the benefit of any reviewers, can you please also mention in the commit message or PR description which interfaces have changed.

cd lib/libzfs/
make storeabi

Thanks for the info, I never handled this previously. The libzfs.abi delta seems pretty big, is that normal?

Looking over the test results, I see that everything pretty much passed with the exception of these two persistent ARC tests. According to the logs it seems that after an export/import cycle there weren't any cache hits when normally there should have been. That would make sense to me if the pool has been force exported, but neither of these pools use the force flag. It seems like we should be waiting for some inflight l2arc writes in the normal export case and we're not.

    FAIL l2arc/persist_l2arc_004_pos (expected PASS)
    FAIL l2arc/persist_l2arc_005_pos (expected PASS)

I've looked into these failures for a while, and I'm still not sure why they occur. I agree that the results appear to show no cache hits after an export/import cycle, when there should be some.

The only thing directly related that's modified in this PR is the addition of l2arc_spa_rebuild_stop, which I think is needed anyway (ie, we could probably factor it out of this PR).

But these tests still fail even if I change the three 'exiting' functions to return B_FALSE, and remove that call in spa_export_common. But all of the behavior modifications can only occur when hardforce=B_TRUE is passed to spa_export_common, or when a filesystem is force unmounted, neither of which applies to these tests...
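For readers following along, the 'exiting' predicates referenced here appear in hunks quoted later in this thread with roughly these shapes; this is a sketch assembled from that usage, not verbatim headers:

/* Sketch assembled from usage in this thread's diff hunks. */
boolean_t spa_exiting(spa_t *spa);           /* pool being force exported */
boolean_t spa_exiting_any(spa_t *spa);       /* any force export in progress */
boolean_t dmu_objset_exiting(objset_t *os);  /* objset force unmounted/exported */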

@behlendorf
Contributor

Thanks for the info, I never handled this previously. The libzfs.abi delta seems pretty big, is that normal?

I'm still pretty new to it myself. But from what I've seen so far the answer is "yes". But that's not really a problem since it's a generated file and we only update it when we're knowingly making ABI changes.

I've looked into these failures for a while, and I'm still not sure why they occur. I agree that the results appear to show no cache hits after an export/import cycle, when there should be some.

Oddly I wasn't able to reproduce these locally in a VM so it may be some kind of race. I'll give it another try later. But if you're able to easily reproduce it I'd suggest taking a look at the l2_rebuild_* arcstats to see how much data was rebuilt. Maybe the new l2arc_spa_rebuild_stop call in spa_export_common is responsible and we're not writing the log blocks. It looks to me like it is called regardless of whether the hardforce option is passed or not.

@behlendorf
Contributor

@wca would you mind rebasing this again on the latest master. There are minor conflicts to resolve and you'll need to generate a new libzfs.abi file; the recent "compatibility" feature PR (#11468) also changed the ABI a little. I'll see if I can reproduce the test failures; if we can get them resolved we can get this PR wrapped up and merged.

@behlendorf
Contributor

@wca it looks like there's still at least one path in the force export code which needs to be updated. This was caught by the CI

http://build.zfsonlinux.org/builders/CentOS%208%20x86_64%20%28TEST%29/builds/2902/steps/shell_4/logs/console

@robn
Member

robn commented May 3, 2023

I was playing with 801a440 today.

On a fresh pool with no issues, export -F is slow.

# zpool create -f -O atime=off sandisk /dev/sdb
# time zpool export -F sandisk

real	0m20.220s
user	0m0.006s
sys	0m0.020s

(export without -F is sub-second).

If I follow this with an import and another force export, I get a panic:

# zpool import sandisk
# time zpool export -F sandisk

Message from syslogd@fitlet at May  3 11:29:38 ...
 kernel:[  620.886063] VERIFY3(zap_lookup_int_key(mos, spacemap_zap, txg, &sm_obj) == ENOENT) failed (5 == 2)

Message from syslogd@fitlet at May  3 11:29:38 ...
 kernel:[  620.886086] PANIC at spa_log_spacemap.c:983:spa_generate_syncing_log_sm()
[  620.886063] VERIFY3(zap_lookup_int_key(mos, spacemap_zap, txg, &sm_obj) == ENOENT) failed (5 == 2)
[  620.886086] PANIC at spa_log_spacemap.c:983:spa_generate_syncing_log_sm()
[  620.886094] Showing stack for process 2149
[  620.886103] CPU: 2 PID: 2149 Comm: txg_sync Tainted: P           OE     5.10.0-22-amd64 #1 Debian 5.10.178-3
[  620.886106] Hardware name: Compulab fitlet2/fitlet2, BIOS FLT2.0.46.01.00 09/17/2018
[  620.886109] Call Trace:
[  620.886132]  dump_stack+0x6b/0x83
[  620.886169]  spl_panic+0xd4/0xfc [spl]
[  620.886632]  ? zap_lookup_norm+0x59/0xd0 [zfs]
[  620.887064]  ? zap_lookup+0x12/0x20 [zfs]
[  620.887477]  ? zap_lookup_int_key+0x5d/0x80 [zfs]
[  620.887894]  spa_generate_syncing_log_sm+0x225/0x2c0 [zfs]
[  620.888313]  spa_flush_metaslabs+0xad/0x370 [zfs]
[  620.888321]  ? _cond_resched+0x16/0x50
[  620.888749]  spa_sync_iterate_to_convergence+0x165/0x310 [zfs]
[  620.889173]  spa_sync+0x319/0x910 [zfs]
[  620.889585]  txg_sync_thread+0x277/0x3d0 [zfs]
[  620.889998]  ? txg_completion_notify+0xf0/0xf0 [zfs]
[  620.890032]  thread_generic_wrapper+0x78/0xb0 [spl]
[  620.890061]  ? spl_assert.constprop.0+0x20/0x20 [spl]
[  620.890069]  kthread+0x11b/0x140
[  620.890075]  ? __kthread_bind_mask+0x60/0x60
[  620.890084]  ret_from_fork+0x22/0x30

I notice that it seems less likely to happen if I wait a while between import and export. While trying to reduce this to a reliable test, I got a different crash:

[  405.339332] VERIFY(db->db_level != 0 || db->db_state == DB_CACHED || db->db_state == DB_FILL || db->db_state == DB_NOFILL) failed
[  405.339342] PANIC at dbuf.c:2264:dbuf_dirty()
[  405.339345] Showing stack for process 2692
[  405.339349] CPU: 0 PID: 2692 Comm: txg_sync Tainted: P           OE     5.10.0-22-amd64 #1 Debian 5.10.178-3
[  405.339351] Hardware name: Compulab fitlet2/fitlet2, BIOS FLT2.0.46.01.00 09/17/2018
[  405.339352] Call Trace:
[  405.339367]  dump_stack+0x6b/0x83
[  405.339385]  spl_panic+0xd4/0xfc [spl]
[  405.339393]  ? __wake_up_common_lock+0x8a/0xc0
[  405.339592]  ? dmu_tx_dirty_buf+0x29/0x390 [zfs]
[  405.339758]  spl_assert+0x17/0x20 [zfs]
[  405.339915]  dbuf_dirty+0xc66/0x12f0 [zfs]
[  405.340090]  ? zio_wait+0x2d5/0x530 [zfs]
[  405.340241]  dmu_write_impl+0x46/0x140 [zfs]
[  405.340400]  dmu_write+0x95/0xf0 [zfs]
[  405.340570]  space_map_write_intro_debug+0xab/0xe0 [zfs]
[  405.340744]  space_map_write_impl+0x49/0x2d0 [zfs]
[  405.340892]  ? dbuf_find_dirty_lte+0x14/0x40 [zfs]
[  405.341057]  space_map_write+0xb6/0x1e0 [zfs]
[  405.341222]  metaslab_flush+0x1a3/0x5e0 [zfs]
[  405.341399]  spa_flush_metaslabs+0x143/0x370 [zfs]
[  405.341564]  spa_sync_iterate_to_convergence+0x165/0x310 [zfs]
[  405.341737]  spa_sync+0x319/0x910 [zfs]
[  405.341904]  txg_sync_thread+0x277/0x3d0 [zfs]
[  405.342070]  ? txg_completion_notify+0xf0/0xf0 [zfs]
[  405.342087]  thread_generic_wrapper+0x78/0xb0 [spl]
[  405.342099]  ? spl_assert.constprop.0+0x20/0x20 [spl]
[  405.342103]  kthread+0x11b/0x140
[  405.342106]  ? __kthread_bind_mask+0x60/0x60
[  405.342110]  ret_from_fork+0x22/0x30

If I have time later today I'll try to dig into both the slowness and the crashes.

@@ -1469,6 +1469,8 @@ typedef enum zfs_ioc {
 	ZFS_IOC_USERNS_DETACH = ZFS_IOC_UNJAIL,	/* 0x86 (Linux) */
 	ZFS_IOC_SET_BOOTENV,			/* 0x87 */
 	ZFS_IOC_GET_BOOTENV,			/* 0x88 */
+	ZFS_IOC_HARD_FORCE_UNMOUNT_BEGIN,	/* 0x89 (Linux) */
+	ZFS_IOC_HARD_FORCE_UNMOUNT_END,		/* 0x90 (Linux) */
Member

0x8a :)
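That is, with ZFS_IOC_GET_BOOTENV at 0x88, the new entries would presumably read:

ZFS_IOC_HARD_FORCE_UNMOUNT_BEGIN,	/* 0x89 (Linux) */
ZFS_IOC_HARD_FORCE_UNMOUNT_END,		/* 0x8a (Linux) */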

@robn
Member

robn commented May 3, 2023

More playing. Here's a heavy write+fsync load, on a relatively slow device (a USB-SATA SSD, pushes about 12MB/s).

# zpool create -f -O atime=off sandisk /dev/sdb
# dd if=/dev/random of=/sandisk/file bs=64K oflag=sync status=progress

Then:

# zpool export -F sandisk

This panics instantly:

[ 1664.912436] VERIFY(zio->io_error == 0 || (zio->io_flags & ZIO_FLAG_CANFAIL) || zio->io_spa->spa_export_initiator != NULL) failed
[ 1664.912449] PANIC at zio.c:4948:zio_done()
[ 1664.912452] Showing stack for process 662
[ 1664.912457] CPU: 3 PID: 662 Comm: z_cl_iss Tainted: P           OE     5.10.0-22-amd64 #1 Debian 5.10.178-3
[ 1664.912458] Hardware name: Compulab fitlet2/fitlet2, BIOS FLT2.0.46.01.00 09/17/2018
[ 1664.912460] Call Trace:
[ 1664.912473]  dump_stack+0x6b/0x83
[ 1664.912490]  spl_panic+0xd4/0xfc [spl]
[ 1664.912497]  ? __kmalloc_node+0x141/0x2b0
[ 1664.912507]  ? spl_kmem_alloc_impl+0xb0/0xd0 [spl]
[ 1664.912517]  ? spl_kmem_alloc_impl+0xb0/0xd0 [spl]
[ 1664.912689]  ? fletcher_2_native+0x1b/0x30 [zfs]
[ 1664.912836]  ? arc_hdr_verify+0xa8/0x250 [zfs]
[ 1664.912839]  ? _cond_resched+0x16/0x50
[ 1664.913002]  spl_assert+0x17/0x20 [zfs]
[ 1664.913172]  zio_done+0x10eb/0x1c20 [zfs]
[ 1664.913341]  zio_reexecute+0x46d/0x690 [zfs]
[ 1664.913507]  zio_reexecute+0x2a4/0x690 [zfs]
[ 1664.913667]  zio_reexecute+0x2a4/0x690 [zfs]
[ 1664.913833]  zio_reexecute+0x2a4/0x690 [zfs]
[ 1664.914012]  zio_reexecute+0x2a4/0x690 [zfs]
[ 1664.914171]  zio_reexecute+0x2a4/0x690 [zfs]
[ 1664.914331]  zio_reexecute+0x2a4/0x690 [zfs]
[ 1664.914494]  zio_reexecute+0x2a4/0x690 [zfs]
[ 1664.914653]  zio_reexecute+0x2a4/0x690 [zfs]
[ 1664.914812]  zio_reexecute+0x2a4/0x690 [zfs]
[ 1664.914972]  zio_reexecute+0x2a4/0x690 [zfs]
[ 1664.914988]  taskq_thread+0x201/0x440 [spl]
[ 1664.914993]  ? wake_up_q+0xa0/0xa0
[ 1664.915174]  ? zio_deadman_impl+0x310/0x310 [zfs]
[ 1664.915186]  ? taskq_lowest_id+0xc0/0xc0 [spl]
[ 1664.915189]  kthread+0x11b/0x140
[ 1664.915192]  ? __kthread_bind_mask+0x60/0x60
[ 1664.915196]  ret_from_fork+0x22/0x30

I'm not totally sure how we get here. I've seen reexecute storms on EIO, and we are forcing returning zios to fail like that. They might be log writes, with a long chain of children. I wanted to look more but I ran out of time this afternoon.

@robn
Member

robn commented May 3, 2023

It is log writes.

dd is waiting in the ZIL fallback path:

[ 1209.435946] task:dd              state:D stack:    0 pid:  825 ppid:   691 flags:0x00004000
[ 1209.435955] Call Trace:
[ 1209.435965]  __schedule+0x282/0x870
[ 1209.435973]  schedule+0x46/0xb0
[ 1209.435978]  io_schedule+0x42/0x70
[ 1209.436009]  cv_wait_common+0x103/0x290 [spl]
[ 1209.436027]  ? add_wait_queue_exclusive+0x70/0x70
[ 1209.436494]  txg_wait_synced_tx+0x1df/0x370 [zfs]
[ 1209.436998]  zil_commit_impl+0x92/0xa0 [zfs]
[ 1209.437451]  zil_commit+0x14b/0x230 [zfs]
[ 1209.437878]  zfs_write+0xa24/0xd80 [zfs]
[ 1209.437900]  ? chacha_block_generic+0x6f/0xb0
[ 1209.438354]  zpl_iter_write+0xe7/0x130 [zfs]
[ 1209.438382]  ? aa_file_perm+0x113/0x480
[ 1209.438393]  new_sync_write+0x11c/0x1b0
[ 1209.438408]  vfs_write+0x1ce/0x260
[ 1209.438416]  ksys_write+0x5f/0xe0
[ 1209.438432]  do_syscall_64+0x33/0x80
[ 1209.438440]  entry_SYSCALL_64_after_hwframe+0x61/0xc6

I added this patch:

commit 2ccf55efab95156271db3c77e7f8bdcd2ba0f1b4
Author: Rob Norris <robn@despairlabs.com>
Date:   Wed May 3 19:58:34 2023 +1000

    forced-export: handle forced-export during ZIL failure.

diff --git module/zfs/zil.c module/zfs/zil.c
index 2538ffbe4..0523a336b 100644
--- module/zfs/zil.c
+++ module/zfs/zil.c
@@ -2594,7 +2594,8 @@ zil_commit_writer_stall(zilog_t *zilog)
 	 */
 	ASSERT(MUTEX_HELD(&zilog->zl_issuer_lock));
 	txg_wait_synced(zilog->zl_dmu_pool, 0);
-	ASSERT3P(list_tail(&zilog->zl_lwb_list), ==, NULL);
+	ASSERT(list_is_empty(&zilog->zl_lwb_list) ||
+	    spa_exiting(zilog->zl_spa));
 }
 
 /*
@@ -3413,7 +3414,7 @@ zil_commit_impl(zilog_t *zilog, uint64_t foid)
 	zil_commit_writer(zilog, zcw);
 	zil_commit_waiter(zilog, zcw);
 
-	if (zcw->zcw_zio_error != 0) {
+	if (zcw->zcw_zio_error != 0 && !dmu_objset_exiting(zilog->zl_os)) {
 		/*
 		 * If there was an error writing out the ZIL blocks that
 		 * this thread is waiting on, then we fallback to

It gets some of the way there, but there are a lot of txg_wait_synced() blocks in zil_commit() and a lot of assertions about the state of the LWB lists. I won't chase that further tonight.

@oshogbo
Contributor

oshogbo commented May 10, 2023

Small update.
I have applied patches from @robn. Thank you for the patches and testing.
I have fixed the arc issue that we had seen before.
However, I am still fighting with bugs during force export while syncing.

Currently, the issue is with spa_log_sm_increment_current_mscount:

[  835.032540] VERIFY3(last_sls->sls_txg == spa_syncing_txg(spa)) failed (22 == 23)
[  835.032737] PANIC at spa_log_spacemap.c:570:spa_log_sm_increment_current_mscount()
[  835.032855] Showing stack for process 4062
[  835.032878] CPU: 1 PID: 4062 Comm: txg_sync Tainted: P           OE     5.15.0-53-generic #59-Ubuntu
[  835.032885] Hardware name: FreeBSD BHYVE/BHYVE, BIOS 13.0 11/10/2020
[  835.032891] Call Trace:
[  835.032898]  <TASK>
[  835.032917]  show_stack+0x52/0x5c
[  835.032972]  dump_stack_lvl+0x4a/0x63
[  835.033010]  dump_stack+0x10/0x16
[  835.033013]  spl_dumpstack+0x29/0x2f [spl]
[  835.033064]  spl_panic+0xd1/0xe9 [spl]
[  835.033071]  ? avl_find+0x69/0xe0 [zfs]
[  835.033169]  ? spa_log_sm_decrement_mscount+0x45/0xf0 [zfs]
[  835.033279]  spa_log_sm_increment_current_mscount+0x66/0x80 [zfs]
[  835.033393]  metaslab_unflushed_bump+0x1a6/0x390 [zfs]
[  835.033500]  metaslab_flush_update+0x97/0x100 [zfs]
[  835.033607]  metaslab_flush+0x2f5/0x760 [zfs]
[  835.033714]  spa_flush_metaslabs+0x3c2/0x760 [zfs]
[  835.033828]  spa_sync+0x8b5/0x1b00 [zfs]
[  835.033936]  ? spa_txg_history_init_io+0xe7/0x110 [zfs]
[  835.034043]  txg_sync_thread+0x2f1/0x5a0 [zfs]
[  835.034149]  ? txg_completion_notify+0x110/0x110 [zfs]
[  835.034255]  thread_generic_wrapper+0x6f/0xb0 [spl]
[  835.034262]  ? spl_taskq_fini+0x80/0x80 [spl]
[  835.034268]  kthread+0x12a/0x150
[  835.034307]  ? set_kthread_struct+0x50/0x50
[  835.034309]  ret_from_fork+0x22/0x30
[  835.034332]  </TASK>

@behlendorf
Contributor

@oshogbo when you get a chance can you please rebase so we can get an updated CI run.

@oshogbo
Contributor

oshogbo commented May 19, 2023

@behlendorf done

@oshogbo oshogbo force-pushed the forced-export branch 2 times, most recently from 2b6f248 to aef5e28 Compare June 7, 2023 16:54
This is primarily of use when a pool has lost its disk, while the user
doesn't care about any pending (or otherwise) transactions.

Implement various control methods to make this feasible:
- txg_wait can now take a NOSUSPEND flag, in which case the caller will
  be alerted if their txg can't be committed.  This is primarily of
  interest for callers that would normally pass TXG_WAIT, but don't want
  to wait if the pool becomes suspended, which allows unwinding in some
  cases, specifically when one is attempting a non-forced export.
  Without this, the non-forced export would preclude a forced export
  by virtue of holding the namespace lock indefinitely.
- txg_wait also returns failure for TXG_WAIT users if a pool is actually
  being force exported.  Adjust most callers to tolerate this.
- spa_config_enter_flags now takes a NOSUSPEND flag to the same effect.
- DMU objset initiator which may be set on an objset being forcibly
  exported / unmounted.
- SPA export initiator may be set on a pool being forcibly exported.
- DMU send/recv now use an interruption mechanism which relies on the
  SPA export initiator being able to enumerate datasets and closing any
  send/recv streams, causing their EINTR paths to be invoked.
- ZIO now has a cancel entry point, which tells all suspended zios to
  fail, and which suppresses the failures for non-CANFAIL users.
- metaslab, etc. cleanup, which consists of simply throwing away any
  changes that were not able to be synced out.
- Linux specific: introduce a new tunable,
  zfs_forced_export_unmount_enabled, which allows the filesystem to
  remain in a modified 'unmounted' state upon exiting zpl_umount_begin,
  to achieve parity with FreeBSD and illumos,
  which have VFS-level support for yanking filesystems out from under
  users.  However, this only helps when the user is actively performing
  I/O, while not sitting on the filesystem.  In particular, this allows
  test #3 below to pass on Linux.
- Add basic logic to zpool to indicate a force-exporting pool, instead
  of crashing due to lack of config, etc.

Add tests which cover the basic use cases:
- Force export while a send is in progress
- Force export while a recv is in progress
- Force export while POSIX I/O is in progress

This change modifies the libzfs ABI:
- New ZPOOL_STATUS_FORCE_EXPORTING zpool_status_t enum value.
- New field libzfs_force_export for libzfs_handle.

Co-Authored-by: Will Andrews <will@firepipe.net>
Co-Authored-by: Allan Jude <allan@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Catalogics, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes openzfs#3461
Signed-off-by: Will Andrews <will@firepipe.net>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
@@ -1441,12 +1460,16 @@ zfsvfs_teardown(zfsvfs_t *zfsvfs, boolean_t unmounting)
 		}
 	}
 	if (!zfs_is_readonly(zfsvfs) && os_dirty) {
-		txg_wait_synced(dmu_objset_pool(zfsvfs->z_os), 0);
+		(void) txg_wait_synced_tx(dmu_objset_pool(zfsvfs->z_os), 0,
+		    NULL, wait_flags);
Member

Cosmetics, but here and in other places txg_wait_synced_tx() without tx makes no sense. It should be txg_wait_synced_flags().

-txg_wait_synced_impl(dsl_pool_t *dp, uint64_t txg, boolean_t wait_sig)
+int
+txg_wait_synced_tx(dsl_pool_t *dp, uint64_t txg, dmu_tx_t *tx,
+    txg_wait_flag_t flags)
Member

This API looks confusing to me, receiving both tx and txg. Given a tx I would expect it to wait for tx_txg, but instead it uses tx only to get tx_objset from it. Looking through the patch I found two places where a tx is submitted, but in neither case does it wait for the actual tx to be committed, only for a fairly abstract txg, and the tx is only used to check dmu_objset_exiting(). This makes me wonder: why should txg_wait_synced() ever care about objset unmount progress? I would understand exiting on a forced pool export, but that would not require the objset; the already available pool argument would be enough for that.
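Dropping the tx parameter as suggested would presumably leave a signature like this (hypothetical sketch, not code from the PR):

/* Hypothetical: the pool argument alone is enough to detect forced export. */
int
txg_wait_synced_flags(dsl_pool_t *dp, uint64_t txg, txg_wait_flag_t flags);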

@@ -6763,7 +6763,6 @@ arc_write_done(zio_t *zio)
 			arc_access(hdr, 0, B_FALSE);
 			mutex_exit(hash_lock);
 		} else {
-			arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
 			VERIFY3S(remove_reference(hdr, hdr), >, 0);
Member

ARC_FLAG_IO_IN_PROGRESS has its own reference, which is dropped here. Would you explain why you are leaving the flag but dropping the reference?


-	if (!(zfs_flags & ZFS_DEBUG_DBUF_VERIFY))
+	if (!(zfs_flags & ZFS_DEBUG_DBUF_VERIFY) ||
+	    dmu_objset_exiting(db->db_objset))
Member

Either here or in the next chunk, dmu_objset_exiting() makes no sense. And here it looks bad, as if we are saying "dbuf state is totally insane and we do not care", which should not be so.

	VERIFY0(zap_increment(os, DMU_USERUSED_OBJECT,
	    uqn->uqn_id, uqn->uqn_delta, tx));
	mutex_exit(&os->os_userused_lock);
	if (!dmu_objset_exiting(os)) {
Member

This and the following group checks look excessive, considering the check in VERIFY() below. They may only be needed if there are still some problematic cases inside, but then this is only an ugly and probably unreliable workaround.


	if ((flags & SCL_FLAG_TRYENTER) != 0)
		error = SET_ERROR(EAGAIN);
	if (error == 0 && ((flags & SCL_FLAG_NOSUSPEND) != 0)) {
Member

This would be else if (...

@@ -511,28 +496,54 @@ spa_config_enter_impl(spa_t *spa, int locks, const void *tag, krw_t rw,
 	mutex_enter(&scl->scl_lock);
 	if (rw == RW_READER) {
 		while (scl->scl_writer ||
-		    (!mmp_flag && scl->scl_write_wanted)) {
+		    ((flags & SCL_FLAG_MMP) && scl->scl_write_wanted)) {
+			error = spa_config_eval_flags(spa, flags);
Member

This code is already congested sometimes, and here you are adding another global spa_suspend_lock acquisition. I am not happy.

In addition, I am not sure why a locking primitive should care about pool suspension at all. Why couldn't the lock be regularly acquired, and dropped when the respective protected operation fails? It feels like a workaround to me.

@@ -3505,7 +3531,7 @@ zil_commit_impl(zilog_t *zilog, uint64_t foid)
 	zil_commit_writer(zilog, zcw);
 	zil_commit_waiter(zilog, zcw);
 
-	if (zcw->zcw_zio_error != 0) {
+	if (zcw->zcw_zio_error != 0 && !dmu_objset_exiting(zilog->zl_os)) {
Member

@amotin amotin Jun 15, 2023

This does not feel right to me. The error may be valid, and as far as I can see txg_wait_synced() should exit normally in the case of a forced export.

	    (u_longlong_t)txg);
	if (txg < spa_freeze_txg(zilog->zl_spa))
		VERIFY(!zilog_is_dirty(zilog));
	if (!dmu_objset_exiting(zilog->zl_os)) {
Member

Again, why should objset unmount (possibly on a healthy pool) affect its ZIL operation? Wouldn't patching the VERIFY() below be sufficient? Other parts should not care.

@@ -2308,10 +2308,13 @@ zio_wait(zio_t *zio)
 	__zio_execute(zio);
 
 	mutex_enter(&zio->io_lock);
-	while (zio->io_executor != NULL) {
+	while (zio->io_executor != NULL && !spa_exiting_any(zio->io_spa)) {
Member

@amotin amotin Jun 15, 2023

How can you exit here and call zio_destroy() below before ZIO processing is officially complete? Is there anything to prevent a use-after-free when the I/O finally unblocks and tries to complete?
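In outline, the hazard being asked about would be the following interleaving (hypothetical, for illustration only):

/*
 * zio_wait() thread                    blocked I/O
 * -----------------                    -----------
 * spa_exiting_any() becomes true,
 * loop exits with io_executor != NULL
 * zio_destroy(zio)                     ...device unblocks...
 *                                      zio_done(zio)  <- use-after-free?
 */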

@mailinglists35

any chance completing this PR can be prioritized?

@mailinglists35

mailinglists35 commented Nov 30, 2023

@oshogbo any chance you can respond to what GitHub calls "unresolved conversations"? I mean the code review remarks from other people.

@allanjude
Contributor

We are continuing to investigate issues with this feature in production to improve the pull request.

@mailinglists35

We are continuing to investigate issues with this feature in production to improve the pull request.

But guys, you have lots of unanswered code reviews here. Can you address those as well?

@nerozero

Any progress?

@mailinglists35

mailinglists35 commented Jul 14, 2024

@nerozero it looks to me like a combination of "it wasn't designed to handle this" and "you're just a minority, so there are no resources to fix it".

Just go with the dm-error workaround, where you can safely kick the physical device out and put it back in. I've totally lost hope of this ever happening natively.

@allanjude
Contributor

While we are still actively working on this issue, priority is being given to related work to improve the safety of ZFS in the face of device failures, and to make some situations more recoverable, to avoid the need to do a forced export.

I am sorry if the pace of progress is not to your liking.

@takeda

takeda commented Jul 15, 2024

While we are still actively working on this issue, priority is being given to related work to improve the safety of ZFS in the face of device failures, and to make some situations more recoverable, to avoid the need to do a forced export.

Perhaps some of the frustration comes from the different behaviors on different systems; to some people this is a bigger issue than to others.

Based on some of the responses and suggested workarounds, I have a feeling that's the case. For example, I saw responses where somebody was able to recover from the issue by performing some operations with dm and then invoking zpool clear twice. I also saw a response where another person mentioned that this only happens if the device reconnects but gets a different name than the original one. Which is still annoying, but totally understandable behavior.

In my case on FreeBSD 14.1-p2, when I reconnect the device it appears under the same name as before (I also attached it using a GPT label, to prevent issues if the device name were to change), and the device appears to be fully accessible: I can interact with it, for example list partitions or call smartctl on it.

My problem is that invoking zpool clear <pool>, which the help linked from zpool status and also the man page suggest doing:

             wait      Blocks all I/O access until the device connectivity is
                       recovered and the errors are cleared with zpool clear.
                       This is the default behavior.

doesn't work. It tells me that operations are blocked because the pool is in a waiting state, or something like that.

The only way I know so far to restore access to the drive is to reboot the system, which is frustrating; it feels like a bug. I'm wondering whether this is an implementation issue in FreeBSD, or whether that's how it works for everyone. Could someone confirm whether this also happens on other systems? Perhaps I need to open a bug with FreeBSD.

@raimocom

While we are still actively working on this issue, priority is being given to related work to improve the safety of ZFS in the face of device failures, and to make some situations more recoverable, to avoid the need to do a forced export.

I am sorry if the pace of progress is not to your liking.

So to my understanding, you are focusing on increasing the robustness of ZFS in the case of sudden disconnects/reconnects of storage devices? Does this mean ZFS then auto-resumes its operation when the storage device becomes available again? Is that the aim?

@Haravikk

While we are still actively working on this issue, priority is being given to related work to improve the safety of ZFS in the face of device failures, and to make some situations more recoverable, to avoid the need to do a forced export.

Thanks for the update! I haven't been following a lot of the current development; are there any pull requests or issues in particular that are tracking this related work on recoverability?

Labels
Status: Code Review Needed (ready for review and testing) · Status: Revision Needed (changes are required for the PR to be accepted)

Successfully merging this pull request may close these issues.

zpool commands block when a disk goes missing / pool suspends