Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add thread safety to zthr_{cancel|resume}() #8087

Closed
wants to merge 1 commit into from

Conversation

tcaputi
Copy link
Contributor

@tcaputi tcaputi commented Nov 2, 2018

Currently, zthr_resume() is not safe to call while zthr_cancel()
is running. This is problematic because zthr_cancel() is called from
spa_export_common() without holding any locks to prevent this race.
This patch simply changes zthr_resume() so that it can handle races
against other calls to zthr_cancel() and zthr_resume().

Signed-off-by: Tom Caputi tcaputi@datto.com

How Has This Been Tested?

Observed and verified with ztest.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

Currently, zthr_resume() is not safe to call while zthr_cancel()
is running. This is problematic because zthr_cancel() is called from
spa_export_common() without holding any locks to prevent this race.
This patch simply changes zthr_resume() so that it can handle races
against other calls to zthr_cancel() and zthr_resume().

Signed-off-by: Tom Caputi <tcaputi@datto.com>
@tcaputi tcaputi mentioned this pull request Nov 2, 2018
12 tasks
@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Nov 2, 2018
@codecov
Copy link

codecov bot commented Nov 3, 2018

Codecov Report

Merging #8087 into master will decrease coverage by <.01%.
The diff coverage is 83.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #8087      +/-   ##
==========================================
- Coverage   78.46%   78.45%   -0.01%     
==========================================
  Files         377      377              
  Lines      114518   114519       +1     
==========================================
- Hits        89851    89846       -5     
- Misses      24667    24673       +6
Flag Coverage Δ
#kernel 78.75% <83.33%> (-0.02%) ⬇️
#user 67.57% <83.33%> (+0.14%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6644e5b...8b6298c. Read the comment docs.

@behlendorf
Copy link
Contributor

Related to #8070.

Copy link
Contributor

@brad-lewis brad-lewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there may still be potential for a hang with a scenario involving a running zthr and simultaneous calls to zthr_cancel() and zthr_resume(). Could this happen?

  1. The zthr is running and a call is made to zthr_cancel() which acquires the zthr_lock, broadcasts to waken the zthr, and then cv_waits.
  2. A call was made to the zthr_resume() as well and it acquires the zthr_lock before the zthr thread
    and then ends up in the loop waiting for the in progress cancel to complete and cv_waits.
  3. The zthr thread then acquires the lock and exits setting zthr_thread = NULL and broadcasting.
  4. Both the resume and cancel threads are woken but the resume thread wins the race and gets
    the lock. There is still an ongoing cancel so it stays in the loop and cv_waits() again.
  5. The cancel thread then acquires the zthr_lock and completes the cancel and gives up the lock.
  6. The resume thread is left hanging with no one to wake it.

One way to eliminate that hang, assuming it's possible, is to add a cv_signal() for the to the end of zthr_cancel().

@tcaputi
Copy link
Contributor Author

tcaputi commented Dec 8, 2018

@brad-lewis Thats a good point. I spoke with @sdimitro about this a while ago and we were saying that the zthr code could use a bit of work to remove races like this in general. @sdimitro has that happened yet or have other plans been made?

@sdimitro
Copy link
Contributor

sdimitro commented Dec 10, 2018

@tcaputi I've opened this PR before: #8070 and I was thinking of working on that but per our last conversation my understanding was that you wanted to re-design this piece of code (we had a google doc somewhere) so I just paused my work on the above PR. Do you want me to pick it up again?

@behlendorf
Copy link
Contributor

Closing is favor of the approach in #8229.

@behlendorf behlendorf closed this Jan 2, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Code Review Needed Ready for review and testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants