-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-3797] Run external shuffle service in Yarn NM #3082
Conversation
This includes the necessary changes in making the Yarn shuffle service a module of Spark that core doesn't depend on. This is included in the network-yarn module, which depends on the network-shuffle module.
Test build #22835 has started for PR 3082 at commit
|
Test build #22835 has finished for PR 3082 at commit
|
Test FAILed. |
Test build #22837 has started for PR 3082 at commit
|
if (executorIdleTimeout <= 0) { | ||
throw new SparkException(s"spark.dynamicAllocation.executorIdleTimeout must be > 0!") | ||
} | ||
// Verify that external shuffle service is enabled |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: comment is redundant given the code.
LGTM aside some nits; I still have to catch up with the rest of the network code changes to make more informed comments, though. |
@@ -93,6 +93,7 @@ | |||
<module>tools</module> | |||
<module>network/common</module> | |||
<module>network/shuffle</module> | |||
<module>network/yarn</module> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, this should be probably gated on some profile (e.g. "-Pyarn"). I don't think this would compile with Hadoop 1.0.4, for example...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah... good point
Test build #22843 has started for PR 3082 at commit
|
Test build #22837 has finished for PR 3082 at commit
|
Test FAILed. |
Test build #22848 has started for PR 3082 at commit
|
Currently the user will have to add all three network jars to the NM class path. A future PR will use a maven plugin to merge these into one.
Test build #22854 has started for PR 3082 at commit
|
Test build #22843 has finished for PR 3082 at commit
|
Here's a quick status update on this PR. As of the latest commit, the Yarn shuffle service now handles authentication. However, it is not working yet because of two known issues (see commit message andrewor14@d1124e4 for more detail). These are being actively fixed by @aarondav at the moment and this PR will be blocked until that one is merged (#3108). I have tested the tentative changes there and verified that the service functions as expected with authentication with those changes. |
Test build #22922 has finished for PR 3082 at commit
|
Test PASSed. |
On service init, we want to fail fast if the server fails to start for any reason (e.g. port bind exception). However, once the NM has been running for a while, it should be super robust so we need to add try catches around the application initialization / stopping code.
@@ -89,6 +90,20 @@ class ExecutorRunnable( | |||
|
|||
ctx.setApplicationACLs(YarnSparkHadoopUtil.getApplicationAclsForYarn(securityMgr)) | |||
|
|||
// If external shuffle service is enabled, register with the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be good to say "register and transfer keys" rather than just register
Test build #22950 has started for PR 3082 at commit
|
private static final Charset UTF8_CHARSET = Charset.forName("UTF-8"); | ||
|
||
// Spark user used for authenticating SASL connections | ||
// Note that this should match the value in org.apache.spark.SecurityManager |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"this must match"
This seems quite straightforward. LGTM. |
Alright, as of the latest commit the two blocking fixes previously mentioned are merged. I think I have addressed all the comments. This is ready to go on my side. |
Test build #22954 has started for PR 3082 at commit
|
Test build #22950 has finished for PR 3082 at commit
|
Test PASSed. |
Ok I'm merging this into master and 1.2. Thanks everyone. |
This creates a new module `network/yarn` that depends on `network/shuffle` recently created in #3001. This PR introduces a custom Yarn auxiliary service that runs the external shuffle service. As of the changes here this shuffle service is required for using dynamic allocation with Spark. This is still WIP mainly because it doesn't handle security yet. I have tested this on a stable Yarn cluster. Author: Andrew Or <andrew@databricks.com> Closes #3082 from andrewor14/yarn-shuffle-service and squashes the following commits: ef3ddae [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service 0ee67a2 [Andrew Or] Minor wording suggestions 1c66046 [Andrew Or] Remove unused provided dependencies 0eb6233 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service 6489db5 [Andrew Or] Try catch at the right places 7b71d8f [Andrew Or] Add detailed java docs + reword a few comments d1124e4 [Andrew Or] Add security to shuffle service (INCOMPLETE) 5f8a96f [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service 9b6e058 [Andrew Or] Address various feedback f48b20c [Andrew Or] Fix tests again f39daa6 [Andrew Or] Do not make network-yarn an assembly module 761f58a [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service 15a5b37 [Andrew Or] Fix build for Hadoop 1.x baff916 [Andrew Or] Fix tests 5bf9b7e [Andrew Or] Address a few minor comments 5b419b8 [Andrew Or] Add missing license header 804e7ff [Andrew Or] Include the Yarn shuffle service jar in the distribution cd076a4 [Andrew Or] Require external shuffle service for dynamic allocation ea764e0 [Andrew Or] Connect to Yarn shuffle service only if it's enabled 1bf5109 [Andrew Or] Use the shuffle service port specified through hadoop config b4b1f0c [Andrew Or] 4 tabs -> 2 tabs 43dcb96 [Andrew Or] First cut integration of shuffle service with Yarn aux service b54a0c4 [Andrew Or] Initial skeleton for Yarn shuffle service (cherry picked from commit 61a5cce) Signed-off-by: Andrew Or <andrew@databricks.com>
Test build #22954 has finished for PR 3082 at commit
|
Test PASSed. |
This is another addendum to apache#3082, which added the Yarn shuffle service to run inside the NM. This PR makes the feature much more usable by packaging enough dependencies into the jar to run the service inside an NM. After these changes, the user can run `./make-distribution.sh` and find a `spark-network-yarn*.jar` in their `lib` directory. The equivalent change is done in SBT by making the `network-yarn` module an assembly project. Author: Andrew Or <andrew@databricks.com> Closes apache#3147 from andrewor14/yarn-shuffle-build and squashes the following commits: bda58d0 [Andrew Or] Fix line too long 81e9705 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-build fb7f398 [Andrew Or] Rename jar to spark-{VERSION}-yarn-shuffle.jar 65db822 [Andrew Or] Actually mark slf4j as provided abcefd1 [Andrew Or] Do the same for SBT c653028 [Andrew Or] Package network-yarn and its dependencies
This is another addendum to #3082, which added the Yarn shuffle service to run inside the NM. This PR makes the feature much more usable by packaging enough dependencies into the jar to run the service inside an NM. After these changes, the user can run `./make-distribution.sh` and find a `spark-network-yarn*.jar` in their `lib` directory. The equivalent change is done in SBT by making the `network-yarn` module an assembly project. Author: Andrew Or <andrew@databricks.com> Closes #3147 from andrewor14/yarn-shuffle-build and squashes the following commits: bda58d0 [Andrew Or] Fix line too long 81e9705 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-build fb7f398 [Andrew Or] Rename jar to spark-{VERSION}-yarn-shuffle.jar 65db822 [Andrew Or] Actually mark slf4j as provided abcefd1 [Andrew Or] Do the same for SBT c653028 [Andrew Or] Package network-yarn and its dependencies (cherry picked from commit aa43a8d) Signed-off-by: Andrew Or <andrew@databricks.com>
This creates a new module
network/yarn
that depends onnetwork/shuffle
recently created in #3001. This PR introduces a custom Yarn auxiliary service that runs the external shuffle service. As of the changes here this shuffle service is required for using dynamic allocation with Spark.I have tested this on a stable Yarn cluster.