Receiving invalid backup stream hangs pool #6524
Ran into this while hunting a reproducer for #6366.
By chance, did you try writing additional data to the pool to force-trigger a few more txg syncs? It looks like the txg_sync thread is waiting for a signal in
Nope, that doesn't work. Eventually the writes hang waiting for a new open txg. Here is the tail of the debug log when the hang occurs:
It's actually in txg_quiesce():

	/*
	 * Quiesce the transaction group by waiting for everyone to txg_exit().
	 */
	for (c = 0; c < max_ncpus; c++) {
		tx_cpu_t *tc = &tx->tx_cpu[c];
		mutex_enter(&tc->tc_lock);
		while (tc->tc_count[g] != 0)
			cv_wait(&tc->tc_cv[g], &tc->tc_lock);
		mutex_exit(&tc->tc_lock);
	}
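As I read txg.c, tc_count[g] counts threads that entered the open txg through txg_hold_open() and have not yet dropped their hold with txg_rele_to_sync(), so the loop above can only finish once every hold is released. A rough sketch of that pairing, assuming the usual ZFS headers (sys/dsl_pool.h, sys/txg.h); example_txg_hold() is a made-up name, not code from the tree:

	/* Hedged sketch of the hold/release pairing the quiesce loop waits on. */
	static uint64_t
	example_txg_hold(dsl_pool_t *dp)
	{
		txg_handle_t th;
		uint64_t txg;

		/* Enter the currently open txg; this bumps tc_count for it. */
		txg = txg_hold_open(dp, &th);

		/* ... dirty data that must land in this txg ... */

		/*
		 * Release the hold: decrements tc_count and signals tc_cv so
		 * txg_quiesce() can make progress.  An error path that returns
		 * without this call leaves the quiesce loop waiting forever.
		 */
		txg_rele_to_sync(&th);

		return (txg);
	}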
Interesting, are there any clues as to which process didn't call
I didn't see any obvious clues in the stack traces for all processes: https://gist.github.com/nedbass/9a954e356b99910d6e06f01005e043b4
Under a debug build I hit the following ASSERT when receiving the incremental stream:
Changed it to a
It looks like that
I confirmed that the hang doesn't happen on the commit prior to the encryption patch.
It turns out that the test script here is a reproducer for #6366, and the hang happens during the cleanup of the failed receive operation.
Does this fix the issue as expected?

diff --git a/module/zfs/dmu_send.c b/module/zfs/dmu_send.c
index aca5019..1f4410d 100644
--- a/module/zfs/dmu_send.c
+++ b/module/zfs/dmu_send.c
@@ -2492,7 +2492,7 @@ receive_object(struct receive_writer_arg *rwa, struct drr_
 		    drro->drr_bonustype, drro->drr_bonuslen, tx);
 	}
 	if (err != 0) {
-		dmu_tx_abort(tx);
+		dmu_tx_commit(tx);
 		return (SET_ERROR(EINVAL));
 	}
Yes.
Sorry. It looks like that was a typo on my part. I believe I thought that was the error handling for dmu_tx_assign().
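For context, the usual pattern as I understand it: dmu_tx_abort() is only valid for a transaction that was never assigned; once dmu_tx_assign() succeeds, the tx holds the open txg and must be ended with dmu_tx_commit() even when a later step fails, otherwise the hold described above is leaked and quiesce hangs. A hedged sketch of that pattern (example_modify_object() and do_the_work() are made-up names, not code from dmu_send.c):

	static int
	example_modify_object(objset_t *os, uint64_t object)
	{
		dmu_tx_t *tx = dmu_tx_create(os);
		int err;

		dmu_tx_hold_bonus(tx, object);

		err = dmu_tx_assign(tx, TXG_WAIT);
		if (err != 0) {
			dmu_tx_abort(tx);	/* not yet assigned: abort is correct */
			return (err);
		}

		err = do_the_work(tx);		/* hypothetical helper */
		if (err != 0) {
			dmu_tx_commit(tx);	/* assigned: must commit, not abort */
			return (SET_ERROR(EINVAL));
		}

		dmu_tx_commit(tx);
		return (0);
	}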
Speaking of dmu_tx_assign(), any chance #6317 is related? That hang also seems to be kicked along by forcing extra txg syncs, and one of the stack traces involves dmu_tx_assign().
@spacelama Probably not. This bug was introduced along with the encryption patch.
@tcaputi I also saw some recent discussion in the OpenZFS Encryption PR regarding a crash when trying to zfs send/receive from an unencrypted to an encrypted dataset. Have we seen this issue in ZoL?
Yes. That's one thing I'm fixing as well. That one should be relatively trivial. It only happens when doing a
This patch fixes several issues discovered after the encryption patch was merged:

* Fixed a bug where encrypted datasets could attempt to receive embedded data records.
* Fixed a bug where dirty records created by the recv code weren't properly setting the dr_raw flag.
* Fixed a typo where a dmu_tx_commit() was changed to dmu_tx_abort().
* Fixed a few error handling bugs unrelated to the encryption patch in dmu_recv_stream().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes openzfs#6512
Closes openzfs#6524
Closes openzfs#6545
System information
Describe the problem you're observing
ZFS hangs when receiving an incremental backup stream.
Describe how to reproduce the problem
I can reproduce the problem with this script. It only happens if I write enough data to the dataset and if I change the dnodesize property from auto to legacy between the two snapshots.

Include any warning/errors/backtraces from the system logs
More compactly: