Detect IO errors during device removal #8161
Conversation
While device removal cannot verify the checksums of individual blocks during device removal, it can reasonably detect hard IO errors from the leaf vdevs. Failure to perform this error checking can result in device removal completing successfully, but moving no data which will permanently corrupt the pool.

Situation 1: faulted/degraded vdevs

In the configuration shown below, the removal of mirror-0 will permanently corrupt the pool. Device removal will preferentially copy data from 'vdev1 -> vdev3' and from 'vdev2 -> vdev4', which in this case will result in nothing being copied since one vdev in each of those groups is unavailable. However, device removal will complete successfully since all IO errors are ignored.

    tank                  DEGRADED     0     0     0
      mirror-0            DEGRADED     0     0     0
        /var/tmp/vdev1    FAULTED      0     0     0  external fault
        /var/tmp/vdev2    ONLINE       0     0     0
      mirror-1            DEGRADED     0     0     0
        /var/tmp/vdev3    ONLINE       0     0     0
        /var/tmp/vdev4    FAULTED      0     0     0  external fault

This issue is resolved by updating the source child selection logic to exclude unreadable leaf vdevs. Additionally, unwritable destination child vdevs which can never succeed are skipped to prevent generating a large number of write IO errors.

Situation 2: individual hard IO errors

During removal, if an unexpected hard IO error is encountered when either reading or writing a child vdev, the entire removal operation is cancelled. While it may be possible to reconstruct the data after removal, that cannot be guaranteed. The only strictly safe thing to do is to cancel the removal.

As a future improvement we may want to instead suspend the removal process and allow the damaged region to be retried. But that work is left for another time; hard IO errors during the removal process are expected to be exceptionally rare.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
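A minimal shell reproduction of situation 1 might look like the sketch below. The file-vdev paths match the status output above, but the sizes, the data written, and the use of zpool offline -f to induce the 'external fault' state are illustrative; the regression tests added by this change use the test suite's own helpers instead.

    # Sketch: reproduce situation 1 with file vdevs (sizes are illustrative).
    truncate -s 256M /var/tmp/vdev1 /var/tmp/vdev2 /var/tmp/vdev3 /var/tmp/vdev4
    zpool create tank \
        mirror /var/tmp/vdev1 /var/tmp/vdev2 \
        mirror /var/tmp/vdev3 /var/tmp/vdev4

    # Write some data so the removal has something to copy.
    dd if=/dev/urandom of=/tank/file bs=1M count=64

    # Fault one side of each mirror, leaving both top-level vdevs DEGRADED.
    zpool offline -f tank /var/tmp/vdev1
    zpool offline -f tank /var/tmp/vdev4

    # Without this error checking the removal completes while copying nothing;
    # with it, unreadable sources and unwritable destinations are accounted for.
    zpool remove tank mirror-0
    zpool status -v tank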
WORDS_FILE2="/usr/share/dict/words"
FILE_CONTENTS="Leeloo Dallas mul-ti-pass."

if [[ -f $WORDS_FILE1 ]]; then
The $WORDS_FILE* cases look to be optional. Can they be removed?
This was an Illumos thing, I can drop it and do something generic.
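If it helps, "something generic" could be as simple as falling back to synthesized content when the dictionary file is absent. The sketch below reuses the variable names from the excerpt above, and $TESTDIR is assumed to follow the usual test suite convention.

    # Sketch: keep the dictionary optional by generating fallback content.
    if [[ -f $WORDS_FILE1 ]]; then
        cp $WORDS_FILE1 $TESTDIR/words
    else
        # Generated stand-in when /usr/share/dict files are not installed.
        seq -f "word%g" 1 10000 > $TESTDIR/words
    fi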
Codecov Report
@@            Coverage Diff             @@
##           master    #8161      +/-   ##
==========================================
+ Coverage   78.45%   78.62%   +0.17%
==========================================
  Files         378      378
  Lines      114765   114793      +28
==========================================
+ Hits        90035    90260     +225
+ Misses      24730    24533     -197
Continue to review full report at Codecov.
I just have a question and a few nits for now. But this looks good to me as a first pass.
Thanks for writing those regression tests btw. This is great!
log_onexit cleanup

#
# Fault the first side of mirror-0 and the second side of mirror-1.
Similar to my comment on that other test file you can explain this in a high-level comment close to the top of the file.
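For instance, a header comment along these lines (the wording is only a suggestion) would let readers understand the fault pattern without walking through the individual steps:

    #
    # DESCRIPTION:
    # Device removal must not silently skip data it cannot copy. Fault the
    # first side of mirror-0 and the second side of mirror-1, then verify
    # that removing mirror-0 copies the data from the remaining readable
    # side and that the pool's contents stay intact.
    #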
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This is generally a good incremental improvement, and it certainly fixes some pretty glaring issues. Before 0.8.0 is released we might want to look at the UI piece of this again and make sure it makes sense.
* Added comment to vdevs_in_pool() helper function.
* Moved lock into conditional in spa_vdev_copy_segment_read_done(). We don't need to take it unconditionally.
* Updated comments.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Update: after several days of running
* Detect IO errors during device removal

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#6900
Closes openzfs#8161
Motivation and Context
The zpool remove command should never be able to damage a pool during device removal for situations where it's detectable and avoidable. This particular issue was uncovered by long runs of ztest which would occasionally produce pools with a handful of non-reconstructable blocks. @tcaputi identified the root cause as scenario 1 described below.
Description
While device removal cannot verify the checksums of individual
blocks during device removal, it can reasonably detect hard IO
errors from the leaf vdevs. Failure to perform this error
checking can result in device removal completing successfully,
but moving no data which will permanently corrupt the pool.
In the configuration shown above, the removal of mirror-0 will
permanently corrupt the pool. Device removal will preferentially
copy data from 'vdev1 -> vdev3' and from 'vdev2 -> vdev4', which
in this case will result in nothing being copied since one vdev
in each of those groups is unavailable. However, device removal
will complete successfully since all IO errors are ignored.
This issue is resolved by updating the source child selection
logic to exclude unreadable leaf vdevs. Additionally, unwritable
destination child vdevs which can never succeed are skipped to
prevent generating a large number of write IO errors.
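One way to sanity check that behavior by hand is to scrub after removing a top-level vdev from a degraded pool laid out as in the status output above; this is only a sketch, and assumes the removal proceeds using the readable sides.

    # Sketch: after the removal completes, a scrub should find no data errors.
    zpool remove tank mirror-0
    # Wait until 'zpool status' no longer reports an in-progress removal.
    zpool scrub tank
    zpool status -v tank    # expect: errors: No known data errors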
During removal, if an unexpected hard IO error is encountered when
either reading or writing a child vdev, the entire removal
operation is cancelled. While it may be possible to reconstruct
the data after removal, that cannot be guaranteed. The only
strictly safe thing to do is to cancel the removal.
As a future improvement we may want to instead suspend the removal
process and allow the damaged region to be retried. But that work
is left for another time; hard IO errors during the removal process
are expected to be exceptionally rare.
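Situation 2 can be exercised with error injection; the sketch below is illustrative (the device, error type, and frequency are arbitrary), and the regression test added here may drive this differently.

    # Sketch: inject hard read errors on a leaf vdev, then start a removal.
    zinject -d /var/tmp/vdev2 -e io -T read -f 25 tank

    # With this change an unexpected hard IO error cancels the removal
    # instead of letting it complete with silently missing data.
    zpool remove tank mirror-0
    zpool status -v tank

    # Clear the injected errors when done.
    zinject -c all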
How Has This Been Tested?
A test case for each scenario described above was added. Prior to
this change the pool would be corrupted, afterwards it is not.
A 14 hour run of ztest resulted in no failures due to a pool which
could not be reconstructed. Normally, I'd see one of these failures
after about 10 hours. I'll do a longer run of ztest for additional
verification.
Types of changes
Checklist:
All commit messages are properly formatted and contain Signed-off-by.