ztest: zdb -Y option for use by ztest(8) #8113
This looks fine code-wise. I'm just not sure this is really a good idea to add to ztest. I know we've done some work to make the number of possibilities more reasonable, but I think that at some point this will just turn our current ECKSUM bug into a "ztest killed because zdb took too long" bug.
@@ -5903,6 +5906,10 @@ main(int argc, char **argv)
case 'X':
dump_opt[c]++;
break;
case 'Y':
zfs_reconstruct_indirect_combinations_max = INT_MAX;
technically I think the rest of the code uses UINT64_MAX here
It gets assigned to a `uint64_t` which later gets set to `UINT64_MAX`, but the global is an `int`. We could change it to an `unsigned long` and set it to `ULONG_MAX`. Or update it to take a specific value, which is basically what we do now with `-o`.
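The type mismatch raised in this comment can be illustrated with a minimal sketch. The function name `effective_limit` is hypothetical, and treating a non-positive tunable as "no limit" is an assumption made for the example, not a statement about the actual ZFS code. It only shows that widening an `int` set to `INT_MAX` yields a value far below `UINT64_MAX`.

```c
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical helper: widen an int tunable into the uint64_t limit
 * used by a search loop. Mapping a non-positive value to "no limit"
 * (UINT64_MAX) is an assumption made for this sketch.
 */
static uint64_t
effective_limit(int tunable)
{
	if (tunable <= 0)
		return (UINT64_MAX);
	/* An int can never represent more than INT_MAX attempts. */
	return ((uint64_t)tunable);
}
```

Under these assumptions, passing `INT_MAX` gives a large but finite bound, while only the "no limit" case reaches `UINT64_MAX`.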
Yes, that's definitely a concern. If we had a hard bound on the worst case number of splits (18?) we could limit it to that to avoid the indefinite hang.
Codecov Report
@@ Coverage Diff @@
## master #8113 +/- ##
==========================================
+ Coverage 78.45% 78.5% +0.04%
==========================================
Files 378 378
Lines 114765 114769 +4
==========================================
+ Hits 90035 90094 +59
+ Misses 24730 24675 -55
These changes should be reevaluated after #8161 is finalized and merged. They may no longer be needed, and might simply be nice-to-have optimizations.
The new -Y flag allows `zdb` to try all possible combinations when performing split block reconstruction. Depending on the extent of the damage, this may not be able to complete in a reasonable amount of time. However, it is primarily intended to be used by ztest(8), which by design should never be able to damage a pool beyond repair. The worst case observed has been blocks with 18 splits, which can be recovered in a few minutes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
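To give a rough sense of why trying every combination can become expensive, here is a minimal sketch that counts candidate combinations. It assumes the total is the product of the number of candidate copies per split; the function name `count_combinations` and the `copies` array are hypothetical and not taken from the ZFS source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: count the combinations for a block whose i-th
 * split has copies[i] candidate copies. The result saturates at
 * UINT64_MAX instead of overflowing.
 */
static uint64_t
count_combinations(const uint64_t *copies, size_t nsplits)
{
	uint64_t total = 1;

	for (size_t i = 0; i < nsplits; i++) {
		/* Every split needs at least one candidate copy. */
		assert(copies[i] > 0);
		if (total > UINT64_MAX / copies[i])
			return (UINT64_MAX);	/* product would overflow */
		total *= copies[i];
	}
	return (total);
}
```

Under this assumption, 18 splits with 2 candidates each give 2^18 = 262144 combinations, and 18 splits with 3 candidates each give 3^18 = 387420489.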
a25ecd3 to d056ddd
Closing. This work may be included in a more comprehensive change to
Motivation and Context
Even with all of the optimizations made to speed up split block reconstruction,
it's still possible for it to cause `ztest` failures because it gives up on
reconstruction too soon. Ideally, we want to completely eliminate these
failures.
For the last three failures I inspected, two were recoverable if `zdb` had
been allowed to check all the possible combinations. Since `ztest` should
never be able to damage a pool beyond repair, and the maximum number of
splits has been reduced significantly, it's reasonable to allow `zdb` to
attempt them all. The `-Y` flag was added for this purpose.

The one failure which was not recoverable may have been caused by
the issue PR #8105 was designed to fix. Additional testing is under
way to determine if there are still failures with both of these changes
applied.
Description
`zdb -Y` for split block reconstruction. Allows `ztest` to request that `zdb`
attempt all possible combinations.

Additional optimization to check zeroed splits last since they are
unlikely to be correct. See comments for details as to why this is
the case.
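The "check zeroed splits last" ordering can be sketched as a stable reorder of candidate copies. This is only an illustration under stated assumptions: `candidate_t`, `candidate_is_zeroed`, and `order_zeroed_last` are hypothetical names, and the real change may order candidates differently.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical candidate copy of one split segment. */
typedef struct candidate {
	const uint8_t *data;
	size_t len;
} candidate_t;

/* Return true when every byte of the candidate is zero. */
static bool
candidate_is_zeroed(const candidate_t *c)
{
	for (size_t i = 0; i < c->len; i++) {
		if (c->data[i] != 0)
			return (false);
	}
	return (true);
}

/*
 * Stable reorder: non-zero candidates move to the front and zeroed
 * candidates to the back, each group keeping its relative order.
 * Returns the number of non-zero candidates.
 */
static size_t
order_zeroed_last(candidate_t *cands, size_t n)
{
	size_t front = 0;

	for (size_t i = 0; i < n; i++) {
		if (candidate_is_zeroed(&cands[i]))
			continue;
		candidate_t keep = cands[i];
		/* Slide the zeroed run [front, i) up by one slot. */
		for (size_t j = i; j > front; j--)
			cands[j] = cands[j - 1];
		cands[front++] = keep;
	}
	return (front);
}
```

A caller that walks the array in order would then try the non-zero candidates first and reach the zeroed ones only if nothing earlier verified.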
How Has This Been Tested?
Locally run against 3 `ztest` failures which otherwise resulted in a `zdb`
failure. With this change 2/3 pools were verified intact.