[SPARK-15783][CORE] Fix Flakiness in BlacklistIntegrationSuite #13565
…ests so ignore for now" This reverts commit 36d3dfa.
@@ -24,6 +24,7 @@ import org.apache.spark._
class BlacklistIntegrationSuite extends SchedulerIntegrationSuite[MultiExecutorMockBackend]{

  val badHost = "host-0"
  val duration = Duration(10, SECONDS)
Pretty sure that such a long duration isn't really necessary, but I don't think it hurts to make it longer just in case.
@skonto since you seemed to be able to trigger the problems very reliably, do you mind giving this a spin and seeing if it works for you? :)
Test build #60188 has finished for PR 13565 at commit
Jenkins, retest this please
Test build #60199 has finished for PR 13565 at commit
LGTM but let's see what Stavros says.
Test build #3071 has finished for PR 13565 at commit
Test build #3072 has finished for PR 13565 at commit
Test build #60548 has finished for PR 13565 at commit
tests seem relatively stable now, and this passes regularly for me, so I'm going to merge it and keep an eye on builds.
merged to master
What changes were proposed in this pull request?
Three changes here. The first two were causing failures with BlacklistIntegrationSuite.
assertEmptyDataStructures would occasionally fail, because it appeared there was still an active job. This is because in DAGScheduler, the jobWaiter is notified of the job completion before the data structures are cleaned up. Most of the time the test code that is waiting on the jobWaiter won't become active until after the data structures are cleared, but occasionally the race goes the other way, and the assertions fail.
DAGSchedulerSuite was not stopping all the inner parts it was setting up, so each test was leaking a number of threads. So we stop those parts too.
assertMapOutputAvailable is not terribly useful in this framework. Most of the places I was trying to use it suffer from some race.
How was this patch tested?
I ran all the tests in BlacklistIntegrationSuite 5k times and everything in DAGSchedulerSuite 1k times on my laptop. I also ran a full Jenkins build with BlacklistIntegrationSuite 500 times and DAGSchedulerSuite 50 times, see #13548. (I tried more times but Jenkins timed out.)
To check for more leaked threads, I added some code to dump the list of all threads at the end of each test in DAGSchedulerSuite, which is how I discovered the mapOutputTracker and eventLoop were leaking threads. (I removed that code from the final PR; it was just part of the testing.)
And I'll run Jenkins on this a couple of times to do one more check.
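For readers who want to see the assertEmptyDataStructures race in isolation, here is a minimal sketch under stated assumptions. It does not use any Spark classes and is not the fix from this PR: CleanupRaceSketch, activeJobs, runJob and waitUntil are all hypothetical names. A CountDownLatch stands in for the jobWaiter, an AtomicInteger stands in for the scheduler's bookkeeping, and the 50 ms sleep only widens the window so the ordering is easy to observe.

```scala
import java.util.concurrent.CountDownLatch
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for the scheduler; none of this is Spark code.
object CleanupRaceSketch {
  val activeJobs = new AtomicInteger(0)

  /** Starts a fake job and returns a latch that opens when it "completes". */
  def runJob(): CountDownLatch = {
    val jobDone = new CountDownLatch(1)
    activeJobs.incrementAndGet()
    val worker = new Thread(new Runnable {
      def run(): Unit = {
        jobDone.countDown()          // the waiter is notified first ...
        Thread.sleep(50)             // (artificial delay for the demo)
        activeJobs.decrementAndGet() // ... and cleanup happens afterwards
      }
    })
    worker.setDaemon(true)
    worker.start()
    jobDone
  }

  /** Polls `condition` until it holds or `timeoutMs` elapses. */
  def waitUntil(timeoutMs: Long, pollMs: Long = 10L)(condition: => Boolean): Boolean = {
    val deadline = System.nanoTime() + timeoutMs * 1000000L
    var ok = condition
    while (!ok && System.nanoTime() < deadline) {
      Thread.sleep(pollMs)
      ok = condition
    }
    ok
  }
}
```

A caller that does runJob().await() and then immediately asserts activeJobs.get == 0 can fail, because the latch opens before the counter is decremented. Polling with waitUntil(5000)(activeJobs.get == 0) tolerates that ordering. ScalaTest's eventually offers a similar retry loop for suites that already depend on it.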