-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-7775] YARN AM negative sleep exception #6305
Conversation
I know it is painful considering that most of the YARN stuff does not have tests, but it would be useful to add a test that fails without this (to be sure that this fixes the issue you saw, since it is based on a theory, even though it is a pretty good one). |
@@ -346,6 +346,9 @@ private[spark] class ApplicationMaster( | |||
val currentAllocationInterval = | |||
math.min(heartbeatInterval, nextAllocationInterval) | |||
nextAllocationInterval *= 2 | |||
// avoid overflow | |||
nextAllocationInterval = math.min( | |||
nextAllocationInterval, ApplicationMaster.MAX_RM_HEARTBEAT_INTERVAL_MS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we cap at heartbeatInterval
to avoid introducing another variable? Or replace nextAllocationInterval *= 2
with nextAllocationInterval = currentAllocationInterval * 2
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or this kinda weird-looking code:
`math.max(nextAllocationInterval, nextAllocationInterval * 2)`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I like the latter.
@harishreedharan I agree that we should have tests. In fact I would have liked to see the original patch that added this feature include the tests. Unfortunately I did not follow the original discussion in close enough detail to test this new feature in any substantial way. I will test this fix out on a real cluster to make sure it does fix the issue. |
OK, sounds good. |
LGTM |
OK, I just verified that a heavy workload that previously failed now succeeds because of this patch. |
Test build #33193 has finished for PR 6305 at commit
|
Test build #33194 has finished for PR 6305 at commit
|
@@ -345,7 +345,7 @@ private[spark] class ApplicationMaster( | |||
if (numPendingAllocate > 0) { | |||
val currentAllocationInterval = | |||
math.min(heartbeatInterval, nextAllocationInterval) | |||
nextAllocationInterval *= 2 | |||
nextAllocationInterval = currentAllocationInterval * 2 // avoid overflow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This effectively caps the interval to the heartbeatInterval right? that seems OK, just checking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
lgtm |
lgtm too |
``` SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] Exception in thread "Reporter" java.lang.IllegalArgumentException: timeout value is negative at java.lang.Thread.sleep(Native Method) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$1.run(ApplicationMaster.scala:356) ``` This kills the reporter thread. This is caused by apache#6082 (merged into master branch only). Author: Andrew Or <andrew@databricks.com> Closes apache#6305 from andrewor14/yarn-negative-sleep and squashes the following commits: b970770 [Andrew Or] Use existing cap 56d6e5e [Andrew Or] Avoid negative sleep
``` SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] Exception in thread "Reporter" java.lang.IllegalArgumentException: timeout value is negative at java.lang.Thread.sleep(Native Method) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$1.run(ApplicationMaster.scala:356) ``` This kills the reporter thread. This is caused by apache#6082 (merged into master branch only). Author: Andrew Or <andrew@databricks.com> Closes apache#6305 from andrewor14/yarn-negative-sleep and squashes the following commits: b970770 [Andrew Or] Use existing cap 56d6e5e [Andrew Or] Avoid negative sleep
``` SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] Exception in thread "Reporter" java.lang.IllegalArgumentException: timeout value is negative at java.lang.Thread.sleep(Native Method) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$1.run(ApplicationMaster.scala:356) ``` This kills the reporter thread. This is caused by apache#6082 (merged into master branch only). Author: Andrew Or <andrew@databricks.com> Closes apache#6305 from andrewor14/yarn-negative-sleep and squashes the following commits: b970770 [Andrew Or] Use existing cap 56d6e5e [Andrew Or] Avoid negative sleep
This kills the reporter thread. This is caused by #6082 (merged into master branch only).