Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Eliminate ZTHR races by serializing ZTHR operations. #8229

Merged
merged 1 commit into from
Jan 13, 2019

Conversation

sdimitro
Copy link
Contributor

Signed-off-by: Serapheim Dimitropoulos serapheim@delphix.com

Description

Adds a new lock for serializing operations on zthrs.
The commit also includes some code cleanup and
refactoring.

How Has This Been Tested?

  • Ran the ZFS test suite.
  • Have been running zloop for around 1 day with no problems so far.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Dec 27, 2018
@codecov
Copy link

codecov bot commented Dec 28, 2018

Codecov Report

Merging #8229 into master will decrease coverage by 1.03%.
The diff coverage is 96%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #8229      +/-   ##
==========================================
- Coverage   68.24%   67.21%   -1.04%     
==========================================
  Files         335      316      -19     
  Lines      109770    98012   -11758     
==========================================
- Hits        74917    65880    -9037     
+ Misses      34853    32132    -2721
Flag Coverage Δ
#kernel ?
#user 67.21% <96%> (+6.89%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6955b40...7571c5a. Read the comment docs.

Copy link
Contributor

@behlendorf behlendorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

As possible future work, since zthe cancel/resume operations are now fully serialized and synchronous we may find that we need to add an interface to cancel/resume multiple zthrs. That's not needed now, but it could be if vdev initialize is converted to zthrs and we decide to keep all the threads.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Jan 3, 2019
Copy link
Contributor

@tcaputi tcaputi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a lot better and seems much less racy. Just a few minor notes.

}

/*
* This function is intended to be used by the zthr itself
* to check if another thread has signal it to stop running.
* (specifically the zthr_func callback provided) to check
* if another thread has signal it to stop running before
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

signaled

* a reason (e.g. we are exporting the pool). That's ok,
* since the checkfunc is the first to run, so next time
* it is spawned it will pick up the work that we wanted
* to wake it up for. In this case, this request is a no-op.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused about this case. If the thread is cancelled, this function does nothing. Is this trying to say "waking up a cancelled thread is a no-op"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is right. I just wanted to highlight the fact that the consumer should no worry about this being a no-op. Next time the thread gets spawned, it will look for work before going back to sleep.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I think this point could use a bit of clarification. In particular, I don't really see how the checkfunc is related to any of this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On my previous comment when I say:
Next time the thread gets spawned, it will look for work before going back to sleep.
Running the checkfunc is the look for work part.

static void
zthr_procedure(void *arg)
{
	...
	while (!t->zthr_cancel) {
		if (t->zthr_checkfunc(t->zthr_arg, t)) {
			mutex_exit(&t->zthr_state_lock);
			t->zthr_func(t->zthr_arg, t);
			mutex_enter(&t->zthr_state_lock);
		} else {
			/* go to sleep */
			...
		}
	...
	}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could be reworded to say:

[2] The thread is currently being cancelled. Waking up a cancelled thread is a no-op. Any work that is still left to be done should be handled the next time the thread is resumed.


/* broadcast in case the zthr is sleeping */
cv_broadcast(&t->zthr_cv);
t->zthr_cancel = B_TRUE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I would reverse the cv_broadcast() and t->zthr_cancel = B_TRUE; lines

* as this is called from the zthr_func callback and it wants
* to check if we have an active zthr_cancel(). If that's the
* case then that zthr_cancel() will be holding the request
* lock.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is confusing. As a caller, I'm not sure if I should be holding request_lock or not.

Copy link
Contributor Author

@sdimitro sdimitro Jan 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Callers of any function from the ZTHR API highlighted in zthr.h, must never interact directly with any of the locks from the zthr_t structure - that's why I moved the structure in zthr.c from zthr.h.

The convention is that potential consumers of this would only read the function-level comment. If you want to change the zthr code, then the comment that you highlight within the function is for you.

As for the comment itself, I'm basically trying to say the following:

The majority of the functions here grab `zthr_request_lock` first and then
`zthr_state_lock`.  This function only grabs the `zthr_state_lock`. That is
because this function should only be called from the zthr_func to check if
someone has issued a zthr_cancel() on the thread. If there is a zthr_cancel()
happening concurrently, attempting to grab the request lock here would
result in a deadlock.

By grabbing only the `zthr_state_lock` this function is allowed to run
concurrently with a zthr_cancel() request.

Does the above make sense? Let me know, and I'll update the review.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes more sense. I think some more clarification somewhere about the use of the request lock vs the state lock would be good. Perhaps up in the paragraph titled == Implementation of ZTHR requests.

In general I'm just confused why we have 2 locks when the request lock is only ever taken before taking the state lock. Up until when you pointed out that the structure is defined within this file, I thought the request lock was meant to be held by callers in some situations. If its not, is the request lock really necessary?

Copy link
Contributor Author

@sdimitro sdimitro Jan 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm copy pasting the paragraph here:

 * ZTHR wakeup, cancel, and resume are requests on a zthr to change
 * its internal state. Requests on a zthr are serialized using the
 * zthr_request_lock, while changes in its internal state are
 * protected by the zthr_state_lock. In general a request will first
 * acquire the zthr_request_lock to ensure that other requests can't
 * be served at the same time, and then will acquire the zthr_state_lock
 * to apply its changes. In cases like zthr_cancel() where we need
 * to coordinate the thread issuing the request and the zthr, zthr_cv
 * is used as the mechanism of communication.

The request lock is necessary because it is held by requests like
zthr_cancel() and zthr_resume() in order to get serialized. If we
only had the state lock, then the same race conditions that we had
before will come back. That is because zthr_cancel() has to drop the
state lock while doing a cv_wait(). Holding the request lock while
cv_waiting in zthr_cancel() we ensure that no other zthr request
takes place at the same time.

Is this making more sense?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok got it. What if we change everything in that paragraph after (and including) In general a request .... to:

A request will first acquire the zthr_request_lock and then immediately acquire the zthr_state_lock. We do this so that incoming requests are serialized using the request lock, while still allowing us to use the state lock for thread communication via zthr_cv.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's definitely more precise and clear. I will replace the existing part with your suggestion.

@behlendorf behlendorf added Status: Revision Needed Changes are required for the PR to be accepted and removed Status: Accepted Ready to integrate (reviewed, tested) labels Jan 3, 2019
@sdimitro sdimitro force-pushed the zthr_pr branch 2 times, most recently from 8f8d9d3 to 10535f9 Compare January 10, 2019 18:49
@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Revision Needed Changes are required for the PR to be accepted labels Jan 10, 2019
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants