
[SPARK-15937] [yarn] Improving the logic to wait for an initialised Spark Context #13658

Closed
wants to merge 2 commits into from

Conversation

subrotosanyal

What changes were proposed in this pull request?

As per this fix we take into account if the job has already finished while waiting instead of just basing the logic if the Spark Context reference is available or not. Similar approach is being is used in org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(securityMgr: SecurityManager) to check if the job is finished or not before declaring it failed.

…() will take into account whether the job has already finished before declaring the job failed.

@AmplabJenkins

Can one of the admins verify this patch?

@subrotosanyal
Author

hi @vanzin,
Could you please take a look at this pull request?

@jerryshao
Contributor

@subrotosanyal is there a problem with the previous code? Could you please elaborate?

@subrotosanyal
Author

subrotosanyal commented Jun 14, 2016

hi @jerryshao

16/06/13 10:50:35 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
16/06/13 10:50:35 DEBUG yarn.ApplicationMaster: Done running users class
16/06/13 10:50:42 ERROR yarn.ApplicationMaster: SparkContext did not initialize after waiting for 500000 ms. Please check earlier log output for errors. Failing the application.

From the above log we can see that the job actually succeeded, which the DEBUG line further confirms. In the method ApplicationMaster#waitForSparkContextInitialized there is a wait of 10 seconds, and within those 10 seconds the job finishes. But for some reason the wait is never notified (not sure why the notification doesn't arrive), and once the 10 seconds are over the loop exits because the finished flag is no longer false, i.e. the job has finished. At that point the code checks whether the reference to the SparkContext is null; it is, hence the ERROR log. Once this method returns, ApplicationMaster#runDriver makes a similar check and marks the job as failed.

The change in this pull request checks whether the job has already finished before marking it as failed in such a scenario.
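As a note for readers of this thread, here is a minimal, self-contained Scala sketch of the pattern the patch aims for. It is illustrative only; the names contextRef, finished, markFinished and WaitForContextSketch are stand-ins, not the actual ApplicationMaster fields or code. The idea: wait on a monitor with a deadline, and treat a missing context reference as an initialization failure only when the job has not already finished.

```scala
// Illustrative sketch, not Spark source code.
object WaitForContextSketch {
  private val lock = new Object
  private var contextRef: Option[String] = None // stands in for the SparkContext reference
  private var finished = false                  // stands in for the AM's "job finished" flag

  def waitForContext(totalWaitMs: Long): Option[String] = lock.synchronized {
    val deadline = System.currentTimeMillis() + totalWaitMs
    while (contextRef.isEmpty && !finished && System.currentTimeMillis() < deadline) {
      // Wake up at most every 10 seconds, mirroring the 10-second wait described above.
      lock.wait(math.max(1L, math.min(10000L, deadline - System.currentTimeMillis())))
    }
    if (contextRef.isEmpty && !finished) {
      // Only now is this really an initialization failure; a finished job with a
      // missing reference is not reported as an error (the point of the patch).
      System.err.println(s"Context did not initialize after $totalWaitMs ms")
    }
    contextRef
  }

  // Called by the thread that creates the context; wakes the waiter immediately.
  def contextInitialized(ctx: String): Unit = lock.synchronized {
    contextRef = Some(ctx)
    lock.notifyAll()
  }

  // Called when the job finishes; also wakes the waiter so it can exit early.
  def markFinished(): Unit = lock.synchronized {
    finished = true
    lock.notifyAll()
  }
}
```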

The inline review comment below refers to these changed lines:

    }
    sparkContext
    None
Contributor


Will this always return None?

Author


hi @jerryshao
I am not so good at Scala coding; in fact this is the first or second time I am touching Scala code :)
I expected the previous line Some(sparkContext) was going to do the trick, but it looks like I need to fix that.
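(Note for readers: the review comment comes down to Scala's block semantics. The value of a block is its last expression, so a Some(...) on the line just before a trailing None is evaluated and then discarded. A simplified illustration, not the actual diff:)

```scala
// The value of a Scala block is its LAST expression, so this always returns None:
def alwaysNone(sc: String): Option[String] = {
  Some(sc) // evaluated, then thrown away
  None     // this is what the method returns
}

// One idiomatic way to return the context only when it is non-null:
def maybeContext(sc: String): Option[String] = Option(sc)
```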

@jerryshao
Contributor

Thanks for the explanation, but I'm not sure why the notify didn't work; maybe we should find out the problem there. Also, is your application a Spark application submitted through YARN, or just a normal application without a SparkContext?

@subrotosanyal
Author

subrotosanyal commented Jun 14, 2016

Our client (a JVM process) spawns a SparkContext and uses it to submit Spark jobs to the cluster, i.e. we are using Spark in yarn-cluster mode.

…() will take into account whether the job has already finished before declaring the job failed.
@subrotosanyal
Author

subrotosanyal commented Jun 14, 2016

hi @jerryshao
Isn't this fix going to deal with the problem?
Do you have any pointers on how to check why the notification might have been missed?
You can also find the complete AM logs in the ticket (if that helps).

@jerryshao
Contributor

I'm just wondering about the real cause of this issue, why it is not notified; normally it should work. Your fix might be one option, but I'd like to find out the root cause.

@vanzin
Contributor

vanzin commented Jun 20, 2016

I agree with Saisai that the real question is why the context is not registering with the AM.

Is your code perhaps setting "spark.master" to "local" or something that is not "yarn-cluster" before you create the SparkContext? I've never tried and I'm not sure that would even work, but that would follow a different code path that would not trigger the notify.
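(A rough, hypothetical sketch of the handshake being described; onContextStarted and amCallback are illustrative names, not Spark's API. The point is that only the yarn-cluster path calls back into the AM, so a context created with a different master never wakes the waiting thread.)

```scala
// Hypothetical illustration, not Spark source code.
def onContextStarted(master: String, amCallback: () => Unit): Unit = {
  if (master == "yarn-cluster") {
    // yarn-cluster mode: the scheduler notifies the ApplicationMaster, which
    // wakes the thread blocked in its "wait for SparkContext" loop.
    amCallback()
  } else {
    // Any other master follows a different code path; the AM is never notified
    // and its wait loop runs until the timeout expires.
  }
}
```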

@srowen
Member

srowen commented Jul 4, 2016

@subrotosanyal I'd close this out if we haven't been able to rule out the points above.

@vanzin
Contributor

vanzin commented Jul 12, 2016

@subrotosanyal ping

@subrotosanyal
Author

subrotosanyal commented Jul 13, 2016

hi @vanzin

I am also surprised that the notify was somehow not triggered.

> Is your code perhaps setting "spark.master" to "local" or something that is not "yarn-cluster" before you create the SparkContext?

I would say we don't set it to local. Furthermore, the issue was happening only once in a while, even though the client code stayed the same.
For the time being I have applied the patch and built a custom Spark distribution to get rid of this random failure, but in the long run I would prefer not to use a custom distribution.

@vanzin
Contributor

vanzin commented Jul 13, 2016

I understand you don't want to use a custom distro, but your patch is masking what could be a real issue, and that's the worrying part. We should really understand why the issue is happening in the first place.

@vanzin
Contributor

vanzin commented Aug 4, 2016

@subrotosanyal we should close this PR until we understand the actual cause of the failure. Please provide more information in the JIRA (like full application logs and maybe sample code to trigger the problem).

srowen added a commit to srowen/spark that referenced this pull request Aug 27, 2016
asfgit closed this in 1a48c00 Aug 29, 2016