[SPARK-7775] YARN AM negative sleep exception #6305

Closed
andrewor14 wants to merge 2 commits from andrewor14/yarn-negative-sleep

Conversation

andrewor14
Contributor

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Exception in thread "Reporter" java.lang.IllegalArgumentException: timeout value is negative
  at java.lang.Thread.sleep(Native Method)
  at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$1.run(ApplicationMaster.scala:356)

This kills the reporter thread. This is caused by #6082 (merged into master branch only).
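
A minimal sketch of the failure mode, assuming the allocation interval is a `Long` that is doubled on every loop iteration with no upper bound. The object name, helper names, and the 200 ms starting value are illustrative and are not taken from the patch. It counts how many doublings it takes for the value to wrap negative, and checks that `Thread.sleep` rejects a negative argument.

```scala
object NegativeSleepSketch {
  /** Doubles `start` until it wraps negative; returns the number of doublings performed. */
  def doublingsUntilNegative(start: Long): Int = {
    require(start > 0, "start must be positive")
    var interval = start
    var doublings = 0
    while (interval > 0) {
      interval *= 2 // silently wraps around once the result exceeds Long.MaxValue
      doublings += 1
    }
    doublings
  }

  /** True if Thread.sleep refuses the given timeout with IllegalArgumentException. */
  def sleepRejects(millis: Long): Boolean =
    try {
      Thread.sleep(millis)
      false
    } catch {
      case _: IllegalArgumentException => true
    }
}
```

Under these assumptions, a 200 ms starting value wraps negative on the 56th doubling, after which `math.min(heartbeatInterval, nextAllocationInterval)` would select the negative value and pass it to `Thread.sleep`.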

@harishreedharan
Contributor

I know it is painful considering that most of the YARN stuff does not have tests, but it would be useful to add a test that fails without this (to be sure that this fixes the issue you saw, since it is based on a theory, even though it is a pretty good one).

@@ -346,6 +346,9 @@ private[spark] class ApplicationMaster(
             val currentAllocationInterval =
               math.min(heartbeatInterval, nextAllocationInterval)
             nextAllocationInterval *= 2
+            // avoid overflow
+            nextAllocationInterval = math.min(
+              nextAllocationInterval, ApplicationMaster.MAX_RM_HEARTBEAT_INTERVAL_MS)
Contributor

Can we cap at heartbeatInterval to avoid introducing another variable? Or replace nextAllocationInterval *= 2 with nextAllocationInterval = currentAllocationInterval * 2?

Contributor

Or this kinda weird-looking code:

`math.max(nextAllocationInterval, nextAllocationInterval * 2)`

Contributor Author

Yeah, I like the latter.
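
For comparison, here is a rough sketch of the three update rules floated in this thread, written as pure functions on the next interval. `HeartbeatMs` and `MaxMs` are hypothetical values chosen for illustration; they are not Spark's defaults, and `MaxMs` only stands in for `MAX_RM_HEARTBEAT_INTERVAL_MS`, whose real value is not shown here.

```scala
object BackoffOptions {
  // Hypothetical values for illustration only.
  val HeartbeatMs = 3000L
  val MaxMs = 60000L

  /** Double, then clamp to a separate constant (the first commit's approach). */
  def capWithConstant(next: Long): Long = math.min(next * 2, MaxMs)

  /** Double the already-capped current interval, so no extra constant is needed. */
  def doubleCurrent(next: Long): Long = math.min(HeartbeatMs, next) * 2

  /** Keep the old value whenever doubling would wrap negative. */
  def keepOnOverflow(next: Long): Long = math.max(next, next * 2)

  /** Applies `step` to `start` a total of `n` times. */
  def iterate(step: Long => Long, start: Long, n: Int): Long =
    (1 to n).foldLeft(start)((v, _) => step(v))
}
```

All three keep the value positive. The first two settle at a small bound (`MaxMs` and `2 * HeartbeatMs` respectively), while `keepOnOverflow` keeps growing until one more doubling would wrap and then stays put, which is harmless because the sleep itself is still `math.min(heartbeatInterval, nextAllocationInterval)`.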

@andrewor14
Contributor Author

@harishreedharan I agree that we should have tests. In fact I would have liked to see the original patch that added this feature include the tests. Unfortunately I did not follow the original discussion in close enough detail to test this new feature in any substantial way. I will test this fix out on a real cluster to make sure it does fix the issue.

@harishreedharan
Contributor

OK, sounds good.

@sryza
Contributor

sryza commented May 21, 2015

LGTM

@andrewor14
Contributor Author

OK, I just verified that a heavy workload that previously failed now succeeds because of this patch.

@SparkQA

SparkQA commented May 21, 2015

Test build #33193 has finished for PR 6305 at commit 56d6e5e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class GroupedData protected[sql](df: DataFrame, groupingExprs: Seq[Expression])
    • public class TaskMemoryManager

@SparkQA

SparkQA commented May 21, 2015

Test build #33194 has finished for PR 6305 at commit b970770.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class GroupedData protected[sql](df: DataFrame, groupingExprs: Seq[Expression])
    • public class TaskMemoryManager

@@ -345,7 +345,7 @@ private[spark] class ApplicationMaster(
           if (numPendingAllocate > 0) {
             val currentAllocationInterval =
               math.min(heartbeatInterval, nextAllocationInterval)
-            nextAllocationInterval *= 2
+            nextAllocationInterval = currentAllocationInterval * 2 // avoid overflow
Member

This effectively caps the interval at heartbeatInterval, right? That seems OK, just checking.

Contributor Author

yes
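
To illustrate the cap being confirmed here, a small sketch of the sleep schedule that the revised update rule would produce. The object name, method name, and parameter values are made up for the example; only the two-line update inside the loop mirrors the diff above.

```scala
object AllocationSchedule {
  /** Returns the first `count` sleep intervals produced by the revised update rule. */
  def sleeps(heartbeatMs: Long, initialMs: Long, count: Int): List[Long] = {
    var next = initialMs
    List.fill(count) {
      val current = math.min(heartbeatMs, next)
      next = current * 2 // never exceeds 2 * heartbeatMs, so it cannot wrap for realistic values
      current
    }
  }
}
```

With a hypothetical 3000 ms heartbeat and a 200 ms initial interval, `AllocationSchedule.sleeps(3000L, 200L, 6)` gives 200, 400, 800, 1600, 3000, 3000: the sleep doubles until it reaches the heartbeat interval and then stays there.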

@tgravescs
Contributor

lgtm

@vanzin
Contributor

vanzin commented May 21, 2015

lgtm too

@asfgit asfgit closed this in 15680ae May 21, 2015
@andrewor14 andrewor14 deleted the yarn-negative-sleep branch May 21, 2015 20:02
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
```
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Exception in thread "Reporter" java.lang.IllegalArgumentException: timeout value is negative
  at java.lang.Thread.sleep(Native Method)
  at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$1.run(ApplicationMaster.scala:356)
```
This kills the reporter thread. This is caused by apache#6082 (merged into master branch only).

Author: Andrew Or <andrew@databricks.com>

Closes apache#6305 from andrewor14/yarn-negative-sleep and squashes the following commits:

b970770 [Andrew Or] Use existing cap
56d6e5e [Andrew Or] Avoid negative sleep
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015