[SPARK-7533] [YARN] Decrease spacing between AM-RM heartbeats. #6082
Conversation
Added faster RM heartbeats on pending container allocations, with multiplicative back-off. Also updated the related documentation.
Can one of the admins verify this patch?
@@ -74,6 +74,14 @@ Most of the configs are the same for Spark on YARN as for other deployment modes
<td>5000</td>
<td>
The interval in ms in which the Spark application master heartbeats into the YARN ResourceManager.
To avoid the application master to be expired by late reporting, if a higher value is provided, the interval will be set to the half of the expiry interval in YARN's configuration <code>(yarn.am.liveness-monitor.expiry-interval-ms / 2)</code>.
Do you mind breaking these long lines into multiple lines?
I'd also rephrase this: "The value is capped at half the value of YARN's configuration for the expiry interval (yarn.am.liveness-monitor.expiry-interval-ms)."
Okay.
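(A minimal, self-contained sketch of the capping rule under discussion. The two configuration names are the real YARN and Spark ones, but the hard-coded values and variable names are illustrative assumptions, not the patch itself.)

```scala
object HeartbeatCapSketch extends App {
  // Values that would normally come from configuration; hard-coded for illustration.
  val expiryIntervalMs = 120000L       // yarn.am.liveness-monitor.expiry-interval-ms (example value)
  val configuredHeartbeatMs = 500000L  // spark.yarn.scheduler.heartbeat.interval-ms, deliberately too high

  // The effective interval is capped at half the expiry interval, so a slow
  // heartbeat can never cause the RM to expire the AM.
  val effectiveHeartbeatMs = math.min(configuredHeartbeatMs, expiryIntervalMs / 2)
  println(s"Effective AM-RM heartbeat interval: $effectiveHeartbeatMs ms")  // prints 60000
}
```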
Looks sane, mostly style nits. If you don't mind, we should deprecate
Okay, I will follow your guidance.
currentAllocationInterval =
  math.min(heartbeatInterval,currentAllocationInterval * 2)
logDebug(s"Number of pending allocations is ${numPendingAllocate}. " +
  "Sleeping for " + currentAllocationInterval)
This should be indented two spaces after the start of the preceding line.
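(For context, a hedged, self-contained sketch of how this doubling might sit inside the reporter loop, with continuation lines indented two spaces. The function shape, parameter names, and callbacks are simplified assumptions, not the merged code.)

```scala
object ReporterLoopSketch {
  // Sketch of the reporter thread's sleep logic: heartbeat fast (starting from the
  // initial allocation interval and doubling up to the regular heartbeat interval)
  // while container requests are pending, otherwise heartbeat at the normal rate.
  def reporterLoop(
      heartbeatInterval: Long,           // e.g. 5000 ms
      initialAllocationInterval: Long,   // e.g. 200 ms
      pendingAllocations: () => Int,     // stand-in for querying the YARN allocator
      shouldStop: () => Boolean): Unit = {
    var currentAllocationInterval = initialAllocationInterval
    while (!shouldStop()) {
      val sleepInterval =
        if (pendingAllocations() > 0) {
          val interval = currentAllocationInterval
          // Multiplicative back-off, capped at the regular heartbeat interval.
          currentAllocationInterval =
            math.min(heartbeatInterval, currentAllocationInterval * 2)
          interval
        } else {
          // Nothing pending: reset the back-off and use the normal rate.
          currentAllocationInterval = initialAllocationInterval
          heartbeatInterval
        }
      Thread.sleep(sleepInterval)
    }
  }
}
```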
Would anyone verify this patch?
jenkins, test this please
So do we really want the backoff here? It complicates the logic and makes it unpredictable as to when exactly it heartbeats. It also becomes useless very quickly: if I just get in a queue that doesn't have space immediately, then I stop asking as quickly, but I still need containers. It seems to me that if we just have one interval for when we want containers and one for when we don't, it should be good. If after that we still run into problems, we can revisit it. Also, having YARN tell us when to back off would be best: https://issues.apache.org/jira/browse/YARN-3630
@tgravescs that was going to be my feedback as well. However, what about situations where someone requests more containers than YARN can allocate? I think this is a pretty reasonable thing to do, because it's fairly cumbersome for users to check the size of the YARN pool they're running in and then triangulate with their --executor-cores and --executor-memory to pick the exact number of executors they're allowed. We would end up always heartbeating at the fast rate.
I've opened that issue for YARN, but it's not good practice to rely on it. Multiplicative back-off is a very old practice; it is not hard to predict, and it decreases congestion nicely. There are more effective models in network rate limiting, but this one is simple and effective. We just can't heartbeat every 200 ms: when our first heartbeat fails to get containers, there is only slightly more chance that the next one will succeed. Also, consider a contested server with thousands of Spark jobs. But, yes, we want to provide a faster start-up for jobs on clusters with a lot of free resources, so we start at 200 ms.
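(To make that concrete, a tiny self-contained sketch, assuming a 200 ms initial interval and a 5000 ms regular heartbeat interval as the cap, of the interval sequence the back-off produces while allocations stay pending.)

```scala
object BackoffSequenceSketch extends App {
  val initialMs = 200L     // assumed initial allocation interval
  val heartbeatMs = 5000L  // assumed regular heartbeat interval, used as the cap

  // Intervals used while container requests remain pending:
  // 200, 400, 800, 1600, 3200, 5000, 5000, 5000
  val intervals = Iterator.iterate(initialMs)(i => math.min(heartbeatMs, i * 2))
  println(intervals.take(8).mkString(" ms, ") + " ms")
}
```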
Normally I would expect us to grab most of it up front and then run for a while without needing more, then perhaps iterate if dynamic allocation is on and we go to a different stage. It seems like a person should know pretty quickly after initial testing what their queue limits are, but that's a valid point. It just seems this kind of limits the usefulness of it. If the cluster is idle it does let me get stuff quicker, but if the cluster is idle then just decrease your heartbeat anyway. I think having just the two configs works the same in that case: if it's idle with lots of free resources, I get them and then go to the slower heartbeat. Also, if the cluster is idle with lots of resources, you should get them the first time, before the sleep even happens, right? We call allocate before launching the reporter thread, which will do an allocate that should get a response from the first one before doing the sleep. Have you run this under various conditions to see what actually happens?
I think this is part of what we're trying to automate. I don't think a user (or cluster administrator) should need to think about this at all. From their perspective, they just want to get containers as fast as possible without spamming the RM.
I agree it should be automated or have a reasonable default (5 is definitely high), but I think there are so many different configs and possible setups that it is very hard to do based on the current YARN RM. How fast you get containers is a factor of too many things: NM heartbeats, size of the cluster, load on the cluster, etc. If there is a difference, great, let's go with this; if not, is it really necessary, or is there a better option? Then the question comes down to RM load. If we do end up heartbeating every 1 second, does that hurt the RM? This again is going to be very cluster dependent. I would guess on most small and medium clusters it's fine. Folks with larger clusters can configure it up slightly.
@tgravescs Of course it differs from having a constant heartbeat interval. In any case, do you want an adaptive solution, or the one they've implemented in MR? I can easily set up a scenario for you where a 1-second heartbeat interval will stress the RM and every user will suffer. I really don't see your point here. Multiplicative back-off is a very simple and adaptive solution. What you are trying to do is not adaptive to the cluster, and thus not adaptive to the user, because in practice the user sits on the cluster.
No, it is not all about free resources; the RM has its own logic that needs time to complete. The 200 ms could be tuned, but back-off is essential here.
Can we test this, please?
ok to test
Merged build triggered.
Merged build started.
Merged build finished. Test FAILed.
Test FAILed.
Oops, what happened there? I changed only one character.
Jenkins, retest this please.
Merged build triggered.
Merged build started.
Test build #33080 has started for PR 6082 at commit
Test build #33080 has finished for PR 6082 at commit
Merged build finished. Test PASSed.
Test PASSed.
@srowen Based on the recent conversations, are you ok with this going in?
Yeah, I defer to your judgment, @tgravescs. I suppose it wasn't obvious to me that this is a win in whatever the normal case is, and your test indicated it wasn't in your case. So I'm a little uncomfortable with the logic that it should go in because it helps in theory, and that a test must be bad if it disagrees. @ehnalis it's not true that this can't do any harm, as you even say. Yes, you can tune away the harm in that type of case, but then you've put in another lever to know about to get it tuned. I can appreciate the argument that adaptiveness is likely to be better in more cases than it's worse, even at defaults. I know start-up time is an issue. I don't object to merging to
+1 and merging this.
This caused the following (benign) exception on the AM:
Is it benign? Won't that cause the reporter thread to stop?
No, I thought it was benign at first because this stops a "reporter thread", but apparently this thread does more than just reporting? I have filed SPARK-7775.
I actually don't see how
According to my logs:
It appears that
Hmmm...
With enough iterations this will overflow and become negative, and then
Ah... overflow. That's probably it. It seems that we need a cap on the interval.
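(A small self-contained demonstration of the arithmetic being diagnosed here, assuming a 200 ms starting value: doubling a Long with no cap eventually flips it negative, and Thread.sleep rejects a negative timeout.)

```scala
object OverflowDemo extends App {
  var intervalMs = 200L
  var doublings = 0
  // Keep doubling with no cap; a 64-bit Long eventually wraps around to a negative value.
  while (intervalMs > 0) {
    intervalMs *= 2
    doublings += 1
  }
  println(s"Interval became non-positive after $doublings doublings: $intervalMs")
  // Thread.sleep(intervalMs) would now throw
  // java.lang.IllegalArgumentException: timeout value is negative
}
```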
Fix @ #6305.
```
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Exception in thread "Reporter" java.lang.IllegalArgumentException: timeout value is negative
    at java.lang.Thread.sleep(Native Method)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$1.run(ApplicationMaster.scala:356)
```
This kills the reporter thread. This is caused by #6082 (merged into master branch only).

Author: Andrew Or <andrew@databricks.com>

Closes #6305 from andrewor14/yarn-negative-sleep and squashes the following commits:

b970770 [Andrew Or] Use existing cap
56d6e5e [Andrew Or] Avoid negative sleep
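(A hedged sketch of the shape of that fix, per the commit titles "Use existing cap" and "Avoid negative sleep": apply the existing cap on every round and double the capped value, so the interval stays bounded and the sleep argument can never go negative. Names and structure are assumptions, not the exact patch.)

```scala
object CappedBackoffSketch {
  // Returns (sleepIntervalMs, nextAllocationIntervalMs). Because the cap is applied
  // before doubling, the stored interval can never overflow to a negative value.
  def nextSleep(
      heartbeatInterval: Long,
      initialAllocationInterval: Long,
      nextAllocationInterval: Long,
      pending: Boolean): (Long, Long) = {
    if (pending) {
      val sleep = math.min(heartbeatInterval, nextAllocationInterval)
      (sleep, sleep * 2)
    } else {
      // Nothing pending: sleep for the regular interval and reset the back-off.
      (heartbeatInterval, initialAllocationInterval)
    }
  }
}
```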
Added faster RM-heartbeats on pending container allocations with multiplicative back-off. Also updated related documentations.

Author: ehnalis <zoltan.zvara@gmail.com>

Closes apache#6082 from ehnalis/yarn and squashes the following commits:

a1d2101 [ehnalis] MIss-spell fixed.
90f8ba4 [ehnalis] Changed default HB values.
6120295 [ehnalis] Removed the bug, when allocation heartbeat would not start from initial value.
08bac63 [ehnalis] Refined style, grammar, removed duplicated code.
073d283 [ehnalis] [SPARK-7533] [YARN] Decrease spacing between AM-RM heartbeats.
d4408c9 [ehnalis] [SPARK-7533] [YARN] Decrease spacing between AM-RM heartbeats.