Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ztest: zdb -Yoption for use by ztest(8) #8113

Closed
wants to merge 1 commit into from

Conversation

behlendorf
Copy link
Contributor

Motivation and Context

Even with all of the optimizations made to speed up split block reconstruction
it's still possible this can lead to ztest failures because it gives up on
reconstruction to soon. Ideally, we want to completely eliminate these
failures.

For the last three failures I inspected, two were recoverable if zdb had
been allowed to check all the possible combinations. Since ztest
should never be able to damage a pool beyond repair, and maximum
number of splits has been reduced significantly, it's reasonable to allow
zdb to attempt them all. The -Y flag was added for this purpose.

The one failure which was not recoverable may have been caused by
the issue PR #8105 was designed to fix. Additional testing is under
way to determine if there are still failures with both of these changes
applied.

Description

  • Add zdb -Y for split block reconstruction

Allows ztest to request that zdb attempt all possible combinations.

  • Prefer non-zero split versions for reconstruction

Additional optimization to check zeroed splits last since they are
unlikely to be correct. See comments for details as to why this is
the case.

How Has This Been Tested?

Locally run against 3 ztest failures which otherwise resulted in a zdb failure.
With this change 2/3 pools were verified intact.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

@behlendorf behlendorf added the Status: Work in Progress Not yet ready for general review label Nov 8, 2018
@behlendorf behlendorf requested a review from tcaputi November 8, 2018 21:20
Copy link
Contributor

@tcaputi tcaputi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fine code-wise. I'm just not sure if this is really a good idea. To add to ztest. I know we've done some work to make the number of possibilities more reasonable, but I think that at some point this will just turn our current ECKSUM bug into a "ztest killed because zdb took too long" bug.

@@ -5903,6 +5906,10 @@ main(int argc, char **argv)
case 'X':
dump_opt[c]++;
break;
case 'Y':
zfs_reconstruct_indirect_combinations_max = INT_MAX;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

technically I think the rest of the code uses UINT64_MAX here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It gets assigned to a uint64_t which latter gets set to UINT64_MAX but the global is an int. We could change it to an unsigned long and set it to ULONG_MAX. Or update it to take a specific value, which is basically what we do now with -o.

@behlendorf behlendorf added Status: Design Review Needed Architecture or design is under discussion and removed Status: Work in Progress Not yet ready for general review labels Nov 8, 2018
@behlendorf
Copy link
Contributor Author

turn our current ECKSUM bug into a "ztest killed because zdb took too long" bug.

Yes, that's definitely a concern. If we had a hard bound on the worst case number of splits (18?) we could limit it to that to avoid the indefinite hang.

@codecov
Copy link

codecov bot commented Nov 9, 2018

Codecov Report

Merging #8113 into master will increase coverage by 0.04%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #8113      +/-   ##
==========================================
+ Coverage   78.45%    78.5%   +0.04%     
==========================================
  Files         378      378              
  Lines      114765   114769       +4     
==========================================
+ Hits        90035    90094      +59     
+ Misses      24730    24675      -55
Flag Coverage Δ
#kernel 78.69% <ø> (+0.05%) ⬆️
#user 67.64% <100%> (+0.24%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c40a112...d056ddd. Read the comment docs.

@behlendorf
Copy link
Contributor Author

These changes should be reevaluated after #8161 is finalized and merged. They may no longer be needed, and might simply be nice to have optimizations.

The new -Y flag allows `zdb` to try all possible combinations
when performing split block reconstruction.

Depending on the extent of the damage this may not be able to
complete in a reasonable amount of time.  However, it is primarily
intended to be used by ztest(8) which by design should never be
able to damage a pool beyond repair.  The worst case observed
has been blocks with 18 splits which can be recovered in a few
minutes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
@behlendorf
Copy link
Contributor Author

Closing. This work may be included a more comprehensive change to ztest.

@behlendorf behlendorf closed this Dec 5, 2018
@behlendorf behlendorf deleted the non-zero-split branch April 19, 2021 20:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Design Review Needed Architecture or design is under discussion
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants