
[SPARK-15937] [yarn] Improving the logic to wait for an initialised Spark Context #13658

Closed
wants to merge 2 commits into from

Conversation

subrotosanyal

What changes were proposed in this pull request?

As per this fix we take into account if the job has already finished while waiting instead of just basing the logic if the Spark Context reference is available or not. Similar approach is being is used in org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(securityMgr: SecurityManager) to check if the job is finished or not before declaring it failed.

…() will take into account whether the job has already finished before declaring the job failed.

@AmplabJenkins

Can one of the admins verify this patch?

@subrotosanyal
Author

hi @vanzin,
Could you please take a look at this pull request?

@jerryshao
Contributor

@subrotosanyal is there a problem with the previous code? Could you please elaborate?

@subrotosanyal
Author

subrotosanyal commented Jun 14, 2016

hi @jerryshao

16/06/13 10:50:35 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
16/06/13 10:50:35 DEBUG yarn.ApplicationMaster: Done running users class
16/06/13 10:50:42 ERROR yarn.ApplicationMaster: SparkContext did not initialize after waiting for 500000 ms. Please check earlier log output for errors. Failing the application.

From the above log we can see that the job actually succeeded, which the DEBUG line further confirms. In the method ApplicationMaster#waitForSparkContextInitialized there is a wait of 10 seconds, and within those 10 seconds the job finishes. But for some reason the wait is never notified (not sure why the notification doesn't arrive), and once the 10 seconds are over the loop exits because the finished flag is no longer false, i.e. the job has finished. At that point the code checks whether the reference to the SparkContext is null; it is, hence the ERROR log. Once this method returns, ApplicationMaster#runDriver makes a similar check and marks the job as failed.

The change in this pull request checks whether the job has already finished before marking it as failed in such a scenario.
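As a note for readers of this thread, here is a minimal, self-contained Scala sketch of the pattern the patch aims for. It is illustrative only; the names contextRef, finished, markFinished and WaitForContextSketch are stand-ins, not the actual ApplicationMaster fields or code. The idea: wait on a monitor with a deadline, and treat a missing context reference as an initialization failure only when the job has not already finished.

```scala
// Illustrative sketch, not Spark source code.
object WaitForContextSketch {
  private val lock = new Object
  private var contextRef: Option[String] = None // stands in for the SparkContext reference
  private var finished = false                  // stands in for the AM's "job finished" flag

  def waitForContext(totalWaitMs: Long): Option[String] = lock.synchronized {
    val deadline = System.currentTimeMillis() + totalWaitMs
    while (contextRef.isEmpty && !finished && System.currentTimeMillis() < deadline) {
      // Wake up at most every 10 seconds, mirroring the 10-second wait described above.
      lock.wait(math.max(1L, math.min(10000L, deadline - System.currentTimeMillis())))
    }
    if (contextRef.isEmpty && !finished) {
      // Only now is this really an initialization failure; a finished job with a
      // missing reference is not reported as an error (the point of the patch).
      System.err.println(s"Context did not initialize after $totalWaitMs ms")
    }
    contextRef
  }

  // Called by the thread that creates the context; wakes the waiter immediately.
  def contextInitialized(ctx: String): Unit = lock.synchronized {
    contextRef = Some(ctx)
    lock.notifyAll()
  }

  // Called when the job finishes; also wakes the waiter so it can exit early.
  def markFinished(): Unit = lock.synchronized {
    finished = true
    lock.notifyAll()
  }
}
```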

The inline review comment below refers to these changed lines:

    }
    sparkContext
    None
Contributor


Will this always return None?

Author


hi @jerryshao
I am not so good at Scala coding; in fact this is the first or second time I am touching Scala code :)
I expected the previous line Some(sparkContext) was going to do the trick, but it looks like I need to fix that.
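(Note for readers: the review comment comes down to Scala's block semantics. The value of a block is its last expression, so a Some(...) on the line just before a trailing None is evaluated and then discarded. A simplified illustration, not the actual diff:)

```scala
// The value of a Scala block is its LAST expression, so this always returns None:
def alwaysNone(sc: String): Option[String] = {
  Some(sc) // evaluated, then thrown away
  None     // this is what the method returns
}

// One idiomatic way to return the context only when it is non-null:
def maybeContext(sc: String): Option[String] = Option(sc)
```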

@jerryshao
Contributor

Thanks for the explanation, but I'm not sure why the notify didn't work; maybe we should find out the problem there. Also, is your application a Spark application submitted through YARN, or just a normal application without a SparkContext?

@subrotosanyal
Author

subrotosanyal commented Jun 14, 2016

Our client (a JVM process) spawns a SparkContext and uses it to submit Spark jobs to the cluster, i.e. we are using Spark in yarn-cluster mode.

…() will take into account whether the job has already finished before declaring the job failed.
@subrotosanyal
Author

subrotosanyal commented Jun 14, 2016

hi @jerryshao
Isn't this fix going to deal with the problem?
Do you have any pointers on how to check why the notification might have been missed?
You can also find the complete AM logs in the ticket (if that helps).

@jerryshao
Contributor

I'm just wondering about the real cause of this issue, why it is not notified; normally it should work. Your fix might be one option, but I'd like to find out the root cause.

@vanzin
Contributor

vanzin commented Jun 20, 2016

I agree with Saisai that the real question is why the context is not registering with the AM.

Is your code perhaps setting "spark.master" to "local" or something that is not "yarn-cluster" before you create the SparkContext? I've never tried and I'm not sure that would even work, but that would follow a different code path that would not trigger the notify.
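(A rough, hypothetical sketch of the handshake being described; onContextStarted and amCallback are illustrative names, not Spark's API. The point is that only the yarn-cluster path calls back into the AM, so a context created with a different master never wakes the waiting thread.)

```scala
// Hypothetical illustration, not Spark source code.
def onContextStarted(master: String, amCallback: () => Unit): Unit = {
  if (master == "yarn-cluster") {
    // yarn-cluster mode: the scheduler notifies the ApplicationMaster, which
    // wakes the thread blocked in its "wait for SparkContext" loop.
    amCallback()
  } else {
    // Any other master follows a different code path; the AM is never notified
    // and its wait loop runs until the timeout expires.
  }
}
```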

@srowen
Member

srowen commented Jul 4, 2016

@subrotosanyal I'd close this out if we haven't been able to rule out the points above.

@vanzin
Contributor

vanzin commented Jul 12, 2016

@subrotosanyal ping

@subrotosanyal
Author

subrotosanyal commented Jul 13, 2016

hi @vanzin

I am also surprised that the notify was somehow not triggered.

> Is your code perhaps setting "spark.master" to "local" or something that is not "yarn-cluster" before you create the SparkContext?

I would say we don't set it to local. Furthermore, the issue was happening only once in a while, even though the client code stayed the same.
For the time being I have applied the patch and built a custom Spark distribution to get rid of this random failure, but in the long run I would prefer not to use a custom distribution.

@vanzin
Contributor

vanzin commented Jul 13, 2016

I understand you don't want to use a custom distro, but your patch is masking what could be a real issue, and that's the worrying part. We should really understand why the issue is happening in the first place.

@vanzin
Contributor

vanzin commented Aug 4, 2016

@subrotosanyal we should close this PR until we understand the actual cause of the failure. Please provide more information in the JIRA (like full application logs and maybe sample code to trigger the problem).

srowen added a commit to srowen/spark that referenced this pull request Aug 27, 2016
asfgit closed this in 1a48c00 Aug 29, 2016