[SPARK-15937] [yarn] Improving the logic to wait for an initialised Spark Context #13658
Conversation
…() will take into account whether the job has already finished before declaring the job to be failed.
Can one of the admins verify this patch?
Hi @vanzin,
@subrotosanyal is there any problem with the previous code? Could you please elaborate?
Hi @jerryshao
From the above log we can see that the job actually succeeded, which is clear from the DEBUG log. In the method ApplicationMaster#waitForSparkContextInitialized there is a wait of 10 seconds, and within these 10 seconds the job finishes. But for some reason the wait is not notified (not sure why the notification doesn't arrive), and once the 10 seconds are over the loop exits because the finished flag is no longer false, i.e. the job is finished. At this point the code checks whether the reference to the SparkContext is null, which it is, and hence the ERROR log. Once execution of this method is over, ApplicationMaster#runDriver makes a similar check and marks the job as failed. The change in this pull request tries to determine whether the job has finished before marking it as failed in that scenario.
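For readers unfamiliar with the code path being discussed, here is a minimal Scala sketch of the wait/notify pattern described above. The names (sparkContextRef, finished, the 10-second window) only approximate the real ApplicationMaster fields and are not the actual Spark source; the sketch just illustrates how a missed notify can leave the context reference null even though the job has finished.

```scala
import java.util.concurrent.atomic.AtomicReference

import org.apache.spark.SparkContext

object WaitForContextSketch {
  // Approximations of the fields discussed above; not the real Spark internals.
  private val sparkContextRef = new AtomicReference[SparkContext](null)
  @volatile private var finished = false

  // Wait up to ~10 seconds for the user code to register its SparkContext
  // (and call notify on the monitor), or for the application to finish.
  def waitForSparkContextInitialized(): SparkContext = {
    sparkContextRef.synchronized {
      var waited = 0
      while (sparkContextRef.get() == null && !finished && waited < 10) {
        sparkContextRef.wait(1000L) // if the notify is missed, we only leave on timeout
        waited += 1
      }
      val sc = sparkContextRef.get()
      if (sc == null) {
        // The branch hit in the report: the job finished inside the wait window,
        // the notify never arrived, and the null reference produces the ERROR log.
        System.err.println("ERROR: SparkContext did not initialize after waiting")
      }
      sc
    }
  }

  // The user-code side of the handshake: register the context and wake the waiter.
  def registerSparkContext(sc: SparkContext): Unit = sparkContextRef.synchronized {
    sparkContextRef.set(sc)
    sparkContextRef.notifyAll()
  }
}
```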
}
sparkContext
None
Will this always return None?
Hi @jerryshao
I am not so good at Scala coding; in fact, this is the first or second time I am touching Scala code :)
I expected the preceding line, Some(sparkContext), to do the trick, but it looks like I need to fix that.
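For context on the review comment: in Scala the value of a block is its last expression, so a trailing None discards whatever precedes it. A hypothetical illustration follows; the method names are made up for this sketch and are not part of the patch.

```scala
import org.apache.spark.SparkContext

object ReturnValueSketch {
  // The value of a block is its LAST expression, so the Some(...) on the line
  // before None is evaluated and then discarded: this method always returns None.
  def alwaysNone(sc: SparkContext): Option[SparkContext] = {
    Some(sc) // value thrown away
    None     // actual return value
  }

  // If the intent is to return Option[SparkContext], each branch has to end in
  // the value to return, for example:
  def maybeContext(sc: SparkContext): Option[SparkContext] =
    if (sc != null) Some(sc) else None
}
```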
Thanks for the explanation, but I'm not sure why notify did not work; maybe we should find out the problem there. Also, is your application a Spark application submitted through YARN, or just a normal application without a SparkContext?
Our client (a JVM) spawns a SparkContext and uses it to submit Spark jobs to the cluster, i.e. we are using Spark yarn-cluster mode.
Hi @jerryshao
I'm just wondering about the real cause of this issue: why is it not notified? Normally it should work. Your fix might be one option, but I'd like to find out the root cause.
I agree with Saisai that the real question is why the context is not registering with the AM. Is your code perhaps setting "spark.master" to "local" or something that is not "yarn-cluster" before you create the SparkContext? I've never tried it and I'm not sure that would even work, but that would follow a different code path that would not trigger the notify.
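As an illustration of the scenario described in that comment, a hypothetical client that overrides spark.master before constructing the SparkContext might look like the sketch below. The object and application names are placeholders, and this is not the reporter's actual code.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MisconfiguredClient {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("example")   // placeholder application name
      .setMaster("local[*]")   // overrides the master chosen by spark-submit
    val sc = new SparkContext(conf)
    // ... job code would run here, without ever registering with the YARN AM ...
    sc.stop()
  }
}
```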
@subrotosanyal I'd close this out if we haven't been able to rule out the points above.
@subrotosanyal ping
Hi @vanzin, even I am surprised to see that notify was somehow not triggered.
I would say we don't set it to local. Furthermore, the issue was happening once in a while even though the client code remained the same.
I understand you don't want to use a custom distro, but your patch is masking what could be a real issue, and that's the worrying part. We should really understand why the issue is happening in the first place.
@subrotosanyal we should close this PR until we understand the actual cause of the failure. Please provide more information in the JIRA (like full application logs and maybe sample code to trigger the problem).
Closes apache#10995 Closes apache#13658 Closes apache#14505 Closes apache#14536 Closes apache#12753 Closes apache#14449 Closes apache#12694 Closes apache#12695 Closes apache#14810
What changes were proposed in this pull request?
With this fix we take into account whether the job has already finished while waiting, instead of basing the logic only on whether the SparkContext reference is available. A similar approach is used in org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(securityMgr: SecurityManager) to check whether the job has finished before declaring it failed.
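A rough sketch of the intended decision logic follows, using simplified names; this is not the actual patch (the real change lives inside ApplicationMaster), just an illustration of the check being proposed.

```scala
import org.apache.spark.SparkContext

object OutcomeSketch {
  // Decide what to do once the wait for the SparkContext is over, taking the
  // finished flag into account instead of looking only at the context reference.
  def decideOutcome(sparkContext: Option[SparkContext], finished: Boolean): String =
    sparkContext match {
      case Some(_)          => "context registered in time: run the driver as usual"
      case None if finished => "job already finished during the wait: do not mark it FAILED"
      case None             => "no context and not finished: report the failure as before"
    }
}
```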