Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-15783][CORE] Fix Flakiness in BlacklistIntegrationSuite #13565

Closed
wants to merge 10 commits into from

Conversation

squito
Copy link
Contributor

@squito squito commented Jun 8, 2016

What changes were proposed in this pull request?

Three changes here -- first two were causing failures w/ BlacklistIntegrationSuite

  1. The testing framework didn't include the reviveOffers thread, so the test which involved delay scheduling might never submit offers late enough for the delay scheduling to kick in. So added in the periodic revive offers, just like the real scheduler.
  2. assertEmptyDataStructures would occasionally fail, because it appeared there was still an active job. This is because in DAGScheduler, the jobWaiter is notified of the job completion before the data structures are cleaned up. Most of the time the test code that is waiting on the jobWaiter won't become active until after the data structures are cleared, but occasionally the race goes the other way, and the assertions fail.
  3. DAGSchedulerSuite was not stopping all the inner parts it was setting up, so each test was leaking a number of threads. So we stop those parts too.
  4. Turns out that assertMapOutputAvailable is not terribly useful in this framework -- most of the places I was trying to use it suffer from some race.
  5. When there is an exception in the backend, try to improve the error msg a little bit. Before the exception was printed to the console, but the test would fail w/ a timeout, and the logs wouldn't show anything.

How was this patch tested?

I ran all the tests in BlacklistIntegrationSuite 5k times and everything in DAGSchedulerSuite 1k times on my laptop. Also I ran a full jenkins build with BlacklistIntegrationSuite 500 times and DAGSchedulerSuite 50 times, see #13548. (I tried more times but jenkins timed out.)

To check for more leaked threads, I added some code to dump the list of all threads at the end of each test in DAGSchedulerSuite, which is how I discovered the mapOutputTracker and eventLoop were leaking threads. (I removed that code from the final pr, just part of the testing.)

And I'll run Jenkins on this a couple of times to do one more check.

@@ -24,6 +24,7 @@ import org.apache.spark._
class BlacklistIntegrationSuite extends SchedulerIntegrationSuite[MultiExecutorMockBackend]{

val badHost = "host-0"
val duration = Duration(10, SECONDS)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty sure that such a long duration isn't really necessary, but I don't think it hurts to make it longer just in case.

@squito
Copy link
Contributor Author

squito commented Jun 8, 2016

@skonto since you seemed to be able to trigger the problems very reliably, do you mind giving this a spin and seeing if it works for you? :)

@SparkQA
Copy link

SparkQA commented Jun 8, 2016

Test build #60188 has finished for PR 13565 at commit 174d070.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito
Copy link
Contributor Author

squito commented Jun 8, 2016

Jenkins, retest this please

@SparkQA
Copy link

SparkQA commented Jun 8, 2016

Test build #60199 has finished for PR 13565 at commit 174d070.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Copy link
Contributor

vanzin commented Jun 8, 2016

LGTM but let's see what Stavros says.

@SparkQA
Copy link

SparkQA commented Jun 9, 2016

Test build #3071 has finished for PR 13565 at commit 174d070.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 9, 2016

Test build #3072 has finished for PR 13565 at commit 174d070.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 15, 2016

Test build #60548 has finished for PR 13565 at commit e35a7d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito
Copy link
Contributor Author

squito commented Jun 22, 2016

tests seem relatively stable now, and this passes regularly for me, so I'm going to merge it and keep an eye on builds.

@asfgit asfgit closed this in cf1995a Jun 22, 2016
@squito
Copy link
Contributor Author

squito commented Jun 22, 2016

merged to master

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants