Eliminate ZTHR races by serializing ZTHR operations. #8229

sdimitro · 2018-12-27T20:15:22Z

Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>

Description

Adds a new lock for serializing operations on zthrs.
The commit also includes some code cleanup and
refactoring.

How Has This Been Tested?

Ran the ZFS test suite.
Have been running zloop for around 1 day with no problems so far.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the ZFS on Linux code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
All new and existing tests passed.
All commit messages are properly formatted and contain Signed-off-by.

codecov · 2018-12-28T04:40:23Z

Codecov Report

Merging #8229 into master will decrease coverage by 1.03%.
The diff coverage is 96%.

@@            Coverage Diff             @@
##           master    #8229      +/-   ##
==========================================
- Coverage   68.24%   67.21%   -1.04%     
==========================================
  Files         335      316      -19     
  Lines      109770    98012   -11758     
==========================================
- Hits        74917    65880    -9037     
+ Misses      34853    32132    -2721

Flag	Coverage Δ
#kernel	`?`
#user	`67.21% <96%> (+6.89%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6955b40...7571c5a.

behlendorf

Looks good.

As possible future work: since zthr cancel/resume operations are now fully serialized and synchronous, we may find that we need to add an interface to cancel/resume multiple zthrs. That's not needed now, but it could be if vdev initialize is converted to zthrs and we decide to keep all the threads.

tcaputi

This is a lot better and seems much less racy. Just a few minor notes.

tcaputi · 2019-01-03T21:38:39Z

module/zfs/zthr.c

 }

 /*
 * This function is intended to be used by the zthr itself
- * to check if another thread has signal it to stop running.
+ * (specifically the zthr_func callback provided) to check
+ * if another thread has signal it to stop running before


tcaputi · 2019-01-03T21:45:48Z

module/zfs/zthr.c

+	 *     a reason (e.g. we are exporting the pool). That's ok,
+	 *     since the checkfunc is the first to run, so next time
+	 *     it is spawned it will pick up the work that we wanted
+	 *     to wake it up for. In this case, this request is a no-op.


I'm confused about this case. If the thread is cancelled, this function does nothing. Is this trying to say "waking up a cancelled thread is a no-op"?

That is right. I just wanted to highlight the fact that the consumer should not worry about this being a no-op. Next time the thread gets spawned, it will look for work before going back to sleep.

Ok. I think this point could use a bit of clarification. In particular, I don't really see how the checkfunc is related to any of this.

In my previous comment, when I say "Next time the thread gets spawned, it will look for work before going back to sleep", running the checkfunc is the "look for work" part.

static void
zthr_procedure(void *arg)
{
	...
	while (!t->zthr_cancel) {
		if (t->zthr_checkfunc(t->zthr_arg, t)) {
			mutex_exit(&t->zthr_state_lock);
			t->zthr_func(t->zthr_arg, t);
			mutex_enter(&t->zthr_state_lock);
		} else {
			/* go to sleep */
			...
		}
		...
}

I think this could be reworded to say:

[2] The thread is currently being cancelled. Waking up a cancelled thread is a no-op. Any work that is still left to be done should be handled the next time the thread is resumed.
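The behavior discussed in this thread can be sketched in user space. The following is a minimal model using POSIX threads, not the ZFS implementation: `model_zthr_t` and every `model_*` function are hypothetical names, a single mutex stands in for `zthr_state_lock`, and the "func" runs with the lock held for brevity (the real `zthr_procedure()` drops the lock around `zthr_func`). It shows that a wakeup sent while no thread exists does nothing, and that the check at the top of the loop picks up the queued work as soon as the thread is resumed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical user-space model of a zthr; not the ZFS zthr_t. */
typedef struct model_zthr {
    pthread_mutex_t lock;   /* plays the role of zthr_state_lock */
    pthread_cond_t  cv;     /* plays the role of zthr_cv */
    pthread_t       tid;
    bool            running; /* a worker thread currently exists */
    bool            cancel;  /* the worker has been asked to exit */
    int             pending; /* work queued by the consumer */
    int             done;    /* work completed by the worker */
} model_zthr_t;

static void *
model_procedure(void *arg)
{
    model_zthr_t *t = arg;

    pthread_mutex_lock(&t->lock);
    while (!t->cancel) {
        if (t->pending > 0) {           /* the "checkfunc" */
            t->done += t->pending;      /* the "func" */
            t->pending = 0;
            pthread_cond_broadcast(&t->cv);
        } else {
            pthread_cond_wait(&t->cv, &t->lock); /* go to sleep */
        }
    }
    pthread_mutex_unlock(&t->lock);
    return (NULL);
}

/* Wake the worker if it exists; a no-op when it has been cancelled. */
static void
model_wakeup(model_zthr_t *t)
{
    pthread_mutex_lock(&t->lock);
    if (t->running)
        pthread_cond_broadcast(&t->cv);
    pthread_mutex_unlock(&t->lock);
}

static void
model_resume(model_zthr_t *t)
{
    pthread_mutex_lock(&t->lock);
    t->cancel = false;
    t->running = true;
    int rc = pthread_create(&t->tid, NULL, model_procedure, t);
    assert(rc == 0);
    (void) rc;
    pthread_mutex_unlock(&t->lock);
}

static void
model_cancel(model_zthr_t *t)
{
    pthread_mutex_lock(&t->lock);
    t->cancel = true;
    pthread_cond_broadcast(&t->cv);
    pthread_mutex_unlock(&t->lock);
    pthread_join(t->tid, NULL);
    pthread_mutex_lock(&t->lock);
    t->running = false;
    pthread_mutex_unlock(&t->lock);
}

/* Reports the completed-work count before and after the resume. */
static void
model_demo(int *before, int *after)
{
    model_zthr_t t = { .pending = 0 };

    pthread_mutex_init(&t.lock, NULL);
    pthread_cond_init(&t.cv, NULL);

    /* Queue work and wake a zthr that is not running: nothing happens. */
    t.pending = 1;
    model_wakeup(&t);
    *before = t.done;

    /* After the resume, the check runs before the first sleep. */
    model_resume(&t);
    pthread_mutex_lock(&t.lock);
    while (t.done == 0)
        pthread_cond_wait(&t.cv, &t.lock);
    *after = t.done;
    pthread_mutex_unlock(&t.lock);

    model_cancel(&t);
    pthread_cond_destroy(&t.cv);
    pthread_mutex_destroy(&t.lock);
}
```

Calling `model_demo(&before, &after)` leaves `before` at 0 (the wakeup was dropped) and `after` at 1 (the resumed thread found the work on its first check).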

tcaputi · 2019-01-03T21:47:17Z

module/zfs/zthr.c


-	/* broadcast in case the zthr is sleeping */
-	cv_broadcast(&t->zthr_cv);
+		t->zthr_cancel = B_TRUE;


nit: I would reverse the cv_broadcast() and t->zthr_cancel = B_TRUE; lines

tcaputi · 2019-01-03T21:48:36Z

module/zfs/zthr.c

+	 * as this is called from the zthr_func callback and it wants
+	 * to check if we have an active zthr_cancel(). If that's the
+	 * case then that zthr_cancel() will be holding the request
+	 * lock.


This comment is confusing. As a caller, I'm not sure if I should be holding request_lock or not.

Callers of any function in the ZTHR API exposed in zthr.h must never interact directly with any of the locks in the zthr_t structure. That's why I moved the structure from zthr.h into zthr.c.

The convention is that potential consumers of this would only read the function-level comment. If you want to change the zthr code, then the comment that you highlight within the function is for you.

As for the comment itself, I'm basically trying to say the following:

The majority of the functions here grab `zthr_request_lock` first and then `zthr_state_lock`. This function only grabs the `zthr_state_lock`. That is because this function should only be called from the zthr_func to check if someone has issued a zthr_cancel() on the thread. If there is a zthr_cancel() happening concurrently, attempting to grab the request lock here would result in a deadlock. By grabbing only the `zthr_state_lock` this function is allowed to run concurrently with a zthr_cancel() request.

Does the above make sense? Let me know, and I'll update the review.
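The locking argument above can be illustrated with a small user-space model. This is a sketch under stated assumptions, not the ZFS code: it uses POSIX threads, and `model_zthr_t`, `mz`, and the `model_*` functions are hypothetical stand-ins. The worker polls `model_iscancelled()`, which takes only the state lock. Once it sees the cancel, it probes the request lock with `pthread_mutex_trylock()` to record that the lock is busy; a blocking lock at that point would never return, because the canceller keeps the request lock until the worker has exited.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <stdbool.h>

/* Hypothetical user-space model of a zthr; not the ZFS zthr_t. */
typedef struct model_zthr {
    pthread_mutex_t request_lock; /* serializes requests */
    pthread_mutex_t state_lock;   /* protects running and cancel */
    pthread_cond_t  cv;
    bool            running;
    bool            cancel;
    int             trylock_rc;   /* what the worker saw on request_lock */
} model_zthr_t;

static model_zthr_t mz = {
    .request_lock = PTHREAD_MUTEX_INITIALIZER,
    .state_lock = PTHREAD_MUTEX_INITIALIZER,
    .cv = PTHREAD_COND_INITIALIZER,
};

/* Takes only the state lock, so it can run while a cancel is in flight. */
static bool
model_iscancelled(model_zthr_t *t)
{
    pthread_mutex_lock(&t->state_lock);
    bool cancelled = t->cancel;
    pthread_mutex_unlock(&t->state_lock);
    return (cancelled);
}

static void *
model_procedure(void *arg)
{
    model_zthr_t *t = arg;

    /* Stand-in for a long-running func that polls for cancellation. */
    while (!model_iscancelled(t))
        sched_yield();

    /*
     * A cancel is in flight and its issuer holds request_lock until we
     * exit. Blocking on request_lock here would deadlock, so we only
     * probe it and remember the result.
     */
    t->trylock_rc = pthread_mutex_trylock(&t->request_lock);
    if (t->trylock_rc == 0)
        pthread_mutex_unlock(&t->request_lock);

    pthread_mutex_lock(&t->state_lock);
    t->running = false;
    pthread_cond_broadcast(&t->cv);
    pthread_mutex_unlock(&t->state_lock);
    return (NULL);
}

static void
model_cancel(model_zthr_t *t)
{
    pthread_mutex_lock(&t->request_lock);
    pthread_mutex_lock(&t->state_lock);
    t->cancel = true;
    pthread_cond_broadcast(&t->cv);
    /* The wait drops state_lock only; request_lock stays held. */
    while (t->running)
        pthread_cond_wait(&t->cv, &t->state_lock);
    t->cancel = false;
    pthread_mutex_unlock(&t->state_lock);
    pthread_mutex_unlock(&t->request_lock);
}

/* Returns what the worker observed when it probed request_lock. */
static int
model_demo(void)
{
    pthread_t tid;

    mz.running = true;
    int rc = pthread_create(&tid, NULL, model_procedure, &mz);
    assert(rc == 0);
    (void) rc;
    model_cancel(&mz);
    pthread_join(tid, NULL);
    return (mz.trylock_rc);
}
```

In this model `model_demo()` returns `EBUSY`: the worker can read the cancel flag through the state lock, while the request lock stays unavailable for the whole duration of the cancel.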

That makes more sense. I think some more clarification somewhere about the use of the request lock vs the state lock would be good. Perhaps up in the paragraph titled == Implementation of ZTHR requests.

In general I'm just confused why we have 2 locks when the request lock is only ever taken before taking the state lock. Up until you pointed out that the structure is defined within this file, I thought the request lock was meant to be held by callers in some situations. If it's not, is the request lock really necessary?

I'm copy pasting the paragraph here:

 * ZTHR wakeup, cancel, and resume are requests on a zthr to change
 * its internal state. Requests on a zthr are serialized using the
 * zthr_request_lock, while changes in its internal state are
 * protected by the zthr_state_lock. In general a request will first
 * acquire the zthr_request_lock to ensure that other requests can't
 * be served at the same time, and then will acquire the zthr_state_lock
 * to apply its changes. In cases like zthr_cancel() where we need
 * to coordinate the thread issuing the request and the zthr, zthr_cv
 * is used as the mechanism of communication.

The request lock is necessary because requests like zthr_cancel() and
zthr_resume() hold it in order to be serialized. If we only had the
state lock, the same race conditions that we had before would come
back, because zthr_cancel() has to drop the state lock while doing a
cv_wait(). By holding the request lock while cv_waiting in
zthr_cancel(), we ensure that no other zthr request takes place at the
same time.

Is this making more sense?

Ok, got it. What if we change everything in that paragraph starting from "In general a request ..." to:

A request will first acquire the zthr_request_lock and then immediately acquire the zthr_state_lock. We do this so that incoming requests are serialized using the request lock, while still allowing us to use the state lock for thread communication via zthr_cv.

That's definitely more precise and clear. I will replace the existing part with your suggestion.
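The two-lock scheme agreed on above can be sketched as follows. This is a user-space model built on POSIX threads, not the ZFS implementation; `model_zthr_t`, `mz`, the `model_*` functions, and the iteration count are all hypothetical. Every request takes the request lock first and the state lock second, and `model_cancel()` keeps the request lock while its condition wait temporarily releases the state lock. Two requester threads then issue resume and cancel concurrently.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical user-space model of a zthr; not the ZFS zthr_t. */
typedef struct model_zthr {
    pthread_mutex_t request_lock; /* serializes cancel/resume requests */
    pthread_mutex_t state_lock;   /* protects the fields below */
    pthread_cond_t  cv;
    bool            running;
    bool            cancel;
    int             spawned;      /* workers created */
    int             exited;       /* workers that have finished */
} model_zthr_t;

static model_zthr_t mz = {
    .request_lock = PTHREAD_MUTEX_INITIALIZER,
    .state_lock = PTHREAD_MUTEX_INITIALIZER,
    .cv = PTHREAD_COND_INITIALIZER,
};

static void *
model_procedure(void *arg)
{
    model_zthr_t *t = arg;

    pthread_mutex_lock(&t->state_lock);
    while (!t->cancel)
        pthread_cond_wait(&t->cv, &t->state_lock);
    t->running = false;
    t->exited++;
    pthread_cond_broadcast(&t->cv);
    pthread_mutex_unlock(&t->state_lock);
    return (NULL);
}

static void
model_resume(model_zthr_t *t)
{
    pthread_t tid;

    pthread_mutex_lock(&t->request_lock);
    pthread_mutex_lock(&t->state_lock);
    if (!t->running) {
        int rc = pthread_create(&tid, NULL, model_procedure, t);
        assert(rc == 0);
        (void) rc;
        pthread_detach(tid);
        t->running = true;
        t->spawned++;
    }
    pthread_mutex_unlock(&t->state_lock);
    pthread_mutex_unlock(&t->request_lock);
}

static void
model_cancel(model_zthr_t *t)
{
    pthread_mutex_lock(&t->request_lock);
    pthread_mutex_lock(&t->state_lock);
    if (t->running) {
        t->cancel = true;
        pthread_cond_broadcast(&t->cv);
        /* The wait drops state_lock, but request_lock stays held. */
        while (t->running)
            pthread_cond_wait(&t->cv, &t->state_lock);
        t->cancel = false;
    }
    pthread_mutex_unlock(&t->state_lock);
    pthread_mutex_unlock(&t->request_lock);
}

static void *
model_requester(void *arg)
{
    for (int i = 0; i < 200; i++) {
        model_resume(arg);
        model_cancel(arg);
    }
    return (NULL);
}

/* Returns true if every spawned worker was cancelled exactly once. */
static bool
model_demo(void)
{
    pthread_t a, b;
    int rc;

    rc = pthread_create(&a, NULL, model_requester, &mz);
    assert(rc == 0);
    rc = pthread_create(&b, NULL, model_requester, &mz);
    assert(rc == 0);
    (void) rc;
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    model_cancel(&mz); /* make sure no worker is left behind */
    return (!mz.running && mz.spawned == mz.exited);
}
```

Because a resume cannot slip in while a cancel is waiting for the worker to exit, the model never has two workers alive at once, and the spawned and exited counts match when the requesters finish.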

Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>

behlendorf added the Status: Code Review Needed Ready for review and testing label Dec 27, 2018

behlendorf requested review from tcaputi and behlendorf December 28, 2018 18:16

sdimitro requested a review from ahrens December 28, 2018 21:14

sdimitro mentioned this pull request Jan 2, 2019

ZTHR refactoring - eliminates multiple races #8070

Closed

12 tasks

ahrens approved these changes Jan 2, 2019

behlendorf approved these changes Jan 2, 2019

behlendorf mentioned this pull request Jan 2, 2019

Add thread safety to zthr_{cancel|resume}() #8087

Closed

12 tasks

behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Jan 3, 2019

tcaputi suggested changes Jan 3, 2019

tcaputi approved these changes Jan 3, 2019

behlendorf added Status: Revision Needed Changes are required for the PR to be accepted and removed Status: Accepted Ready to integrate (reviewed, tested) labels Jan 3, 2019

sdimitro force-pushed the zthr_pr branch 2 times, most recently from 8f8d9d3 to 10535f9 Compare January 10, 2019 18:49

behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Revision Needed Changes are required for the PR to be accepted labels Jan 10, 2019

Serialize ZTHR operations to eliminate races.

7571c5a

Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>

sdimitro force-pushed the zthr_pr branch from 10535f9 to 7571c5a Compare January 12, 2019 01:03

behlendorf merged commit 61c3391 into openzfs:master Jan 13, 2019

sdimitro mentioned this pull request Apr 29, 2019

ztest: zthr cancel/resume race in spa_export_common() #7744

Closed
