Detect IO errors during device removal #8161

Merged — behlendorf merged 3 commits into openzfs:master on Dec 4, 2018

Conversation

behlendorf (Contributor)

Motivation and Context

The zpool remove command should never damage a pool during device
removal in situations where the damage is detectable and avoidable.
This particular issue was uncovered by long runs of ztest, which would
occasionally produce pools with a handful of non-reconstructable blocks.
@tcaputi identified the root cause as scenario 1 described below.

Description

While device removal cannot verify the checksums of individual
blocks, it can reasonably detect hard IO errors from the leaf
vdevs. Failure to perform this error checking can result in device
removal completing successfully but moving no data, which
permanently corrupts the pool.

  • Situation 1: faulted/degraded vdevs

In the configuration shown below, the removal of mirror-0 will
permanently corrupt the pool. Device removal will preferentially
copy data from 'vdev1 -> vdev3' and from 'vdev2 -> vdev4'. In this
case nothing will be copied, since one vdev in each of those groups
is unavailable. However, device removal will still complete
successfully because all IO errors are ignored.

  tank                DEGRADED     0     0     0
    mirror-0          DEGRADED     0     0     0
      /var/tmp/vdev1  FAULTED      0     0     0  external fault
      /var/tmp/vdev2  ONLINE       0     0     0
    mirror-1          DEGRADED     0     0     0
      /var/tmp/vdev3  ONLINE       0     0     0
      /var/tmp/vdev4  FAULTED      0     0     0  external fault

This issue is resolved by updating the source child selection
logic to exclude unreadable leaf vdevs. Additionally, unwritable
destination child vdevs which can never succeed are skipped to
prevent generating a large number of write IO errors.
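
For illustration, this scenario can be approximated from the shell with
file-backed vdevs. The commands below are only a sketch under assumed
names (the pool name, vdev paths, sizes, and data write are arbitrary,
and this is not the regression test added by this change); 'zpool
offline -f' is used to induce the externally-faulted state shown in the
status output above.

  # Sketch only: reproduce scenario 1 with file vdevs.
  truncate -s 256M /var/tmp/vdev1 /var/tmp/vdev2 /var/tmp/vdev3 /var/tmp/vdev4

  zpool create tank \
      mirror /var/tmp/vdev1 /var/tmp/vdev2 \
      mirror /var/tmp/vdev3 /var/tmp/vdev4

  # Write some data, then fault the first side of mirror-0 and the
  # second side of mirror-1, matching the DEGRADED layout above.
  dd if=/dev/urandom of=/tank/file bs=1M count=64
  zpool offline -f tank /var/tmp/vdev1
  zpool offline -f tank /var/tmp/vdev4

  # Attempt to evacuate mirror-0.  With this change the unreadable
  # source child and unwritable destination child are excluded, so the
  # removal no longer "succeeds" while silently copying nothing.
  zpool remove tank mirror-0
  zpool status -v tank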

  • Situation 2: individual hard IO errors

During removal, if an unexpected hard IO error is encountered when
either reading or writing a child vdev, the entire removal operation
is cancelled. While it may be possible to reconstruct the data after
removal, that cannot be guaranteed. The only strictly safe thing to
do is to cancel the removal.
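
As a rough illustration, zinject can be used to provoke this case by
injecting hard IO errors while an evacuation is in progress. This is
only a sketch with arbitrary device paths and an arbitrary injection
frequency, not the test case added by this change.

  # Sketch only: inject hard read errors on the device being removed.
  zpool create tank /var/tmp/vdev1 /var/tmp/vdev2
  dd if=/dev/urandom of=/tank/file bs=1M count=64

  # Fail a fraction of reads from the source device with EIO.
  zinject -d /var/tmp/vdev1 -e io -T read -f 25 tank

  # With this change, a hard IO error hit during the copy cancels the
  # removal instead of letting it complete with unverifiable data.
  zpool remove tank /var/tmp/vdev1
  zpool status -v tank

  # Clear the injection handlers when finished.
  zinject -c all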

As a future improvement, we may want to instead suspend the removal
process and allow the damaged region to be retried. But that work is
left for another time; hard IO errors during the removal process are
expected to be exceptionally rare.

How Has This Been Tested?

A test case for each scenario described above was added. Prior to
this change the pool would be corrupted, afterwards it is not.

A 14 hour run of ztest resulted in no failures due to a pool which
could not be reconstructed. Normally, I'd see one of these failures
after roughly 10 hours. I'll do a longer run of ztest for additional
verification.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

behlendorf added the Status: Code Review Needed label on Nov 29, 2018
WORDS_FILE2="/usr/share/dict/words"
FILE_CONTENTS="Leeloo Dallas mul-ti-pass."

if [[ -f $WORDS_FILE1 ]]; then

Contributor:

The $WORDS_FILE* cases look to be optional. Can they be removed?

behlendorf (Contributor, Author):

This was an Illumos thing, I can drop it and do something generic.
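
Something generic here might look like the following sketch, which
generates its own test data rather than depending on the optional
dictionary file (the variable name and sizes are made up, not the code
that eventually landed):

  # Sketch only: build the test file contents locally instead of
  # reading the optional /usr/share/dict/words.
  TESTDATA_FILE=$(mktemp)
  dd if=/dev/urandom of=$TESTDATA_FILE bs=128k count=8 2>/dev/null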

codecov bot commented on Nov 30, 2018

Codecov Report

Merging #8161 into master will increase coverage by 0.17%.
The diff coverage is 87.5%.


@@            Coverage Diff             @@
##           master    #8161      +/-   ##
==========================================
+ Coverage   78.45%   78.62%   +0.17%     
==========================================
  Files         378      378              
  Lines      114765   114793      +28     
==========================================
+ Hits        90035    90260     +225     
+ Misses      24730    24533     -197
Flag      Coverage Δ
#kernel   78.95% <87.09%> (+0.31%) ⬆️
#user     67.67% <50%> (+0.27%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c40a112...ba25605.

sdimitro (Contributor) left a comment

I just have a question and a few nits for now. But this looks good to me as a first pass.
Thanks for writing those regression tests btw. This is great!

tests/zfs-tests/include/libtest.shlib (review thread resolved)
log_onexit cleanup

#
# Fault the first side of mirror-0 and the second side of mirror-1.

Contributor:

Similar to my comment on that other test file, you can explain this in a high-level comment close to the top of the file.
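
For example, a file header along these lines (a sketch based on the PR
description, not the comment that was actually added) would capture the
intent up front:

  #
  # DESCRIPTION:
  #     Verify that device removal detects and works around unreadable
  #     source children and unwritable destination children.
  #
  # STRATEGY:
  #     1. Create a pool with two mirrored top-level vdevs.
  #     2. Fault the first side of mirror-0 and the second side of
  #        mirror-1, leaving both mirrors degraded.
  #     3. Remove mirror-0 and verify the pool data remains intact.
  #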

module/zfs/vdev_removal.c (review thread resolved)

tcaputi (Contributor) left a comment

This is generally a good incremental improvement, and it certainly fixes some pretty glaring issues. Before 0.8.0 is released we might want to look at the UI piece of this again and make sure it makes sense.

man/man5/zfs-module-parameters.5 (review thread outdated, resolved)
module/zfs/vdev_removal.c (review thread resolved)
* Added comment to vdevs_in_pool() helper function.
* Moved lock into the conditional in spa_vdev_copy_segment_read_done().
  We don't need to take it unconditionally.
* Updated comments.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
behlendorf added the Status: Accepted label and removed the Status: Code Review Needed label on Dec 3, 2018
behlendorf (Contributor, Author):

Update: after several days of running ztest I can confirm this has resolved the observed issue.

behlendorf merged commit 7c9a429 into openzfs:master on Dec 4, 2018
GregorKopka pushed a commit to GregorKopka/zfs that referenced this pull request Jan 7, 2019
* Detect IO errors during device removal

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#6900
Closes openzfs#8161
behlendorf deleted the removal-errors branch on April 19, 2021