Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More ztest fixes #8061

Closed
wants to merge 7 commits into from
Closed

More ztest fixes #8061

wants to merge 7 commits into from

Conversation

tcaputi
Copy link
Contributor

@tcaputi tcaputi commented Oct 25, 2018

This PR is a follow-up to #8010 with more fixes for ztest.

Signed-off-by: Tom Caputi tcaputi@datto.com

How Has This Been Tested?

All problems and fixes in this patch were discovered and verified with ztest (as much as ztest can verify anything).

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

@tcaputi tcaputi added the Status: Work in Progress Not yet ready for general review label Oct 25, 2018
Currently, zthr_resume() is not safe to call while zthr_cancel()
is running. This is problematic because zthr_cancel() is called from
spa_export_common() without holding any locks to prevent this race.
This patch simply changes zthr_resume() so that it can handle races
against other calls to zthr_cancel() and zthr_resume().

Signed-off-by: Tom Caputi <tcaputi@datto.com>
@sdimitro
Copy link
Contributor

Hey Tom!

I think I know what's the issue that you are trying to fix here. @shartse and I have worked in some fixes in the zthr infrastructure (some of them are in her fast clone deletion project stuck behind send/receive and the rest are in an internal review where I fix this issue #7744 ).

I think Sara's code specifically fix the above issue.

I think it would be best to take some time and maybe extract the zthr fixes from Sara's project and combine them with mine, so there are no code differences in the future. Does this sound good to you? Also, out of curiosity, does this issue come up very often?

@tcaputi
Copy link
Contributor Author

tcaputi commented Oct 25, 2018

@sdimitro That sounds fine. I'm all for anything that gets the problem fixed. I would just want to see what you guys came up with to fix this issue so I can verify we're talking about the same problem.

Also, out of curiosity, does this issue come up very often?

With all the ztest fixes that went in yesterday, I was actually able to run ztest locally for 12 hours with this being the only failure, which only occurred once. I bet you could get it to happen fairly frequently if you configured ztest to just run ztest_spa_create_destroy() in several threads, but this is the only time I have seen this particular failure. We are still working our way through the less common ztest failures and @behlendorf has setup zloop to run for 4 days while he is away to see what results accumulate over an extended run.

Tom Caputi added 3 commits October 26, 2018 15:57
zfs_rename_006_pos has been flaky in the past because it was
missing a call to block_device_wait to ensure the zvols it creates
are present before running dd. Whenever this this happened,
zfs_rename_009_neg would also fail because the first test would
leak a zvol clone that it did not know how to clean up. This patch
fixes the root cause and reenables the test. It also fixes some
minor grammar errors.

Signed-off-by: Tom Caputi <tcaputi@datto.com>
This patch simply corrects an issue where vdev_dtl_reassess()
could attempt to dirty the vdev config even when the spa was
not elligable for writing.

Signed-off-by: Tom Caputi <tcaputi@datto.com>
vdev_clear() can call vdev_set_deferred_resilver() with a non-leaf
vdev to setup a deferred resilver. However, this function is
currently written to only handle leaf vdevs. This patch makes this
function recursive so that it can find appropriate vdevs to
resilver and set vdev_resilver_deferred on them.

Signed-off-by: Tom Caputi <tcaputi@datto.com>

TEST_ZTEST_TIMEOUT=7200
@behlendorf
Copy link
Contributor

Let's get independent PRs open for each of these.

@tcaputi tcaputi force-pushed the more_ztest_fixes branch 4 times, most recently from c693cc6 to d9df89d Compare October 31, 2018 20:30
Tom Caputi added 3 commits October 31, 2018 17:10
This patch ensures that logs are replayed on all datasets prior
to starting ztest workers. This ensures that the call to
vdev_offline() a log device in ztest_fault_inject() will not fail
due to the log device being required for replay.

This patch also fixes a small issue found during testing where
spa_keystore_load_wkey() does not check that the dataset specified
is an encryption root. This check was present in libzfs, however.

Signed-off-by: Tom Caputi <tcaputi@datto.com>

TEST_ZTEST_TIMEOUT=3600
This patch fixes a race condition where the end of
vdev_remove_replace_with_indirect(), which holds
svr_lock, would race against spa_vdev_removal_destroy(),
which destroys the same lock and is called asynchronously
via dsl_sync_task_nowait().

Signed-off-by: Tom Caputi <tcaputi@datto.com>
This patch simply ensures that vdev_indirect_splits_damage()
cannot hit a divide by zero exception if a split has no
children with valid data. The normal reconstruction code
path in vdev_indirect_reconstruct_io_done() already has this
check.

Signed-off-by: Tom Caputi <tcaputi@datto.com>

TEST_ZTEST_TIMEOUT=3600
@codecov
Copy link

codecov bot commented Nov 1, 2018

Codecov Report

Merging #8061 into master will decrease coverage by 0.07%.
The diff coverage is 86%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #8061      +/-   ##
==========================================
- Coverage   78.43%   78.35%   -0.08%     
==========================================
  Files         377      377              
  Lines      114518   114521       +3     
==========================================
- Hits        89817    89737      -80     
- Misses      24701    24784      +83
Flag Coverage Δ
#kernel 78.78% <72.72%> (-0.12%) ⬇️
#user 67.08% <86%> (-0.45%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 58aeb87...a065662. Read the comment docs.

@tcaputi
Copy link
Contributor Author

tcaputi commented Nov 2, 2018

Closing in favor of the following cherry-picked PRs:

#8082
#8083
#8084
#8085
#8086
#8087
#8088

@tcaputi tcaputi closed this Nov 2, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Work in Progress Not yet ready for general review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants