Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-3797] Run external shuffle service in Yarn NM #3082

Closed
wants to merge 23 commits into from

Conversation

andrewor14
Copy link
Contributor

This creates a new module network/yarn that depends on network/shuffle recently created in #3001. This PR introduces a custom Yarn auxiliary service that runs the external shuffle service. As of the changes here this shuffle service is required for using dynamic allocation with Spark.

I have tested this on a stable Yarn cluster.

@SparkQA
Copy link

SparkQA commented Nov 4, 2014

Test build #22835 has started for PR 3082 at commit 804e7ff.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 4, 2014

Test build #22835 has finished for PR 3082 at commit 804e7ff.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class ExecutorLostFailure(execId: String) extends TaskFailedReason
    • public class YarnShuffleService extends AuxiliaryService
    • class DecimalType(DataType):
    • // in some cases, such as when a class is enclosed in an object (in which case
    • case class UnscaledValue(child: Expression) extends UnaryExpression
    • case class MakeDecimal(child: Expression, precision: Int, scale: Int) extends UnaryExpression
    • case class MutableLiteral(var value: Any, dataType: DataType, nullable: Boolean = true)
    • abstract class GenericStrategy[PhysicalPlan <: TreeNode[PhysicalPlan]] extends Logging
    • case class PrecisionInfo(precision: Int, scale: Int)
    • case class DecimalType(precisionInfo: Option[PrecisionInfo]) extends FractionalType
    • abstract class UserDefinedType[UserType] extends DataType with Serializable
    • final class Decimal extends Ordered[Decimal] with Serializable
    • trait DecimalIsConflicted extends Numeric[Decimal]
    • public abstract class UserDefinedType<UserType> extends DataType implements Serializable
    • trait RunnableCommand extends logical.Command
    • case class ExecutedCommand(cmd: RunnableCommand) extends SparkPlan
    • protected case class Keyword(str: String)
    • sys.error(s"Failed to load class for data source: $provider")
    • case class EqualTo(attribute: String, value: Any) extends Filter
    • case class GreaterThan(attribute: String, value: Any) extends Filter
    • case class GreaterThanOrEqual(attribute: String, value: Any) extends Filter
    • case class LessThan(attribute: String, value: Any) extends Filter
    • case class LessThanOrEqual(attribute: String, value: Any) extends Filter
    • trait RelationProvider
    • abstract class BaseRelation
    • abstract class TableScan extends BaseRelation
    • abstract class PrunedScan extends BaseRelation
    • abstract class PrunedFilteredScan extends BaseRelation

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22835/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Nov 4, 2014

Test build #22837 has started for PR 3082 at commit 5b419b8.

  • This patch merges cleanly.

if (executorIdleTimeout <= 0) {
throw new SparkException(s"spark.dynamicAllocation.executorIdleTimeout must be > 0!")
}
// Verify that external shuffle service is enabled
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: comment is redundant given the code.

@vanzin
Copy link
Contributor

vanzin commented Nov 4, 2014

LGTM aside some nits; I still have to catch up with the rest of the network code changes to make more informed comments, though.

@@ -93,6 +93,7 @@
<module>tools</module>
<module>network/common</module>
<module>network/shuffle</module>
<module>network/yarn</module>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, this should be probably gated on some profile (e.g. "-Pyarn"). I don't think this would compile with Hadoop 1.0.4, for example...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah... good point

@SparkQA
Copy link

SparkQA commented Nov 4, 2014

Test build #22843 has started for PR 3082 at commit 5bf9b7e.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 4, 2014

Test build #22837 has finished for PR 3082 at commit 5b419b8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class YarnShuffleService extends AuxiliaryService

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22837/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Nov 4, 2014

Test build #22848 has started for PR 3082 at commit 761f58a.

  • This patch merges cleanly.

Currently the user will have to add all three network jars to
the NM class path. A future PR will use a maven plugin to merge
these into one.
@SparkQA
Copy link

SparkQA commented Nov 4, 2014

Test build #22854 has started for PR 3082 at commit f39daa6.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 4, 2014

Test build #22843 has finished for PR 3082 at commit 5bf9b7e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Copy link
Contributor Author

Here's a quick status update on this PR. As of the latest commit, the Yarn shuffle service now handles authentication. However, it is not working yet because of two known issues (see commit message andrewor14@d1124e4 for more detail). These are being actively fixed by @aarondav at the moment and this PR will be blocked until that one is merged (#3108). I have tested the tentative changes there and verified that the service functions as expected with authentication with those changes.

@SparkQA
Copy link

SparkQA commented Nov 5, 2014

Test build #22922 has finished for PR 3082 at commit d1124e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class ShuffleSecretManager implements SecretKeyHolder
    • public class YarnShuffleService extends AuxiliaryService
    • public class HadoopConfigProvider extends ConfigProvider

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22922/
Test PASSed.

Andrew Or added 2 commits November 5, 2014 13:41
On service init, we want to fail fast if the server fails to start
for any reason (e.g. port bind exception). However, once the NM
has been running for a while, it should be super robust so we need
to add try catches around the application initialization / stopping
code.
@@ -89,6 +90,20 @@ class ExecutorRunnable(

ctx.setApplicationACLs(YarnSparkHadoopUtil.getApplicationAclsForYarn(securityMgr))

// If external shuffle service is enabled, register with the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be good to say "register and transfer keys" rather than just register

@SparkQA
Copy link

SparkQA commented Nov 5, 2014

Test build #22950 has started for PR 3082 at commit 0eb6233.

  • This patch merges cleanly.

private static final Charset UTF8_CHARSET = Charset.forName("UTF-8");

// Spark user used for authenticating SASL connections
// Note that this should match the value in org.apache.spark.SecurityManager
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"this must match"

@pwendell
Copy link
Contributor

pwendell commented Nov 5, 2014

This seems quite straightforward. LGTM.

@andrewor14 andrewor14 changed the title [WIP][SPARK-3797] Run external shuffle service in Yarn NM [SPARK-3797] Run external shuffle service in Yarn NM Nov 5, 2014
@andrewor14
Copy link
Contributor Author

Alright, as of the latest commit the two blocking fixes previously mentioned are merged. I think I have addressed all the comments. This is ready to go on my side.

@SparkQA
Copy link

SparkQA commented Nov 5, 2014

Test build #22954 has started for PR 3082 at commit ef3ddae.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 5, 2014

Test build #22950 has finished for PR 3082 at commit 0eb6233.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public final class JavaGradientBoostedTrees
    • case class Params(
    • public class ShuffleSecretManager implements SecretKeyHolder
    • public class YarnShuffleService extends AuxiliaryService
    • public class HadoopConfigProvider extends ConfigProvider

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22950/
Test PASSed.

@andrewor14
Copy link
Contributor Author

Ok I'm merging this into master and 1.2. Thanks everyone.

asfgit pushed a commit that referenced this pull request Nov 5, 2014
This creates a new module `network/yarn` that depends on `network/shuffle` recently created in #3001. This PR introduces a custom Yarn auxiliary service that runs the external shuffle service. As of the changes here this shuffle service is required for using dynamic allocation with Spark.

This is still WIP mainly because it doesn't handle security yet. I have tested this on a stable Yarn cluster.

Author: Andrew Or <andrew@databricks.com>

Closes #3082 from andrewor14/yarn-shuffle-service and squashes the following commits:

ef3ddae [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
0ee67a2 [Andrew Or] Minor wording suggestions
1c66046 [Andrew Or] Remove unused provided dependencies
0eb6233 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
6489db5 [Andrew Or] Try catch at the right places
7b71d8f [Andrew Or] Add detailed java docs + reword a few comments
d1124e4 [Andrew Or] Add security to shuffle service (INCOMPLETE)
5f8a96f [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
9b6e058 [Andrew Or] Address various feedback
f48b20c [Andrew Or] Fix tests again
f39daa6 [Andrew Or] Do not make network-yarn an assembly module
761f58a [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
15a5b37 [Andrew Or] Fix build for Hadoop 1.x
baff916 [Andrew Or] Fix tests
5bf9b7e [Andrew Or] Address a few minor comments
5b419b8 [Andrew Or] Add missing license header
804e7ff [Andrew Or] Include the Yarn shuffle service jar in the distribution
cd076a4 [Andrew Or] Require external shuffle service for dynamic allocation
ea764e0 [Andrew Or] Connect to Yarn shuffle service only if it's enabled
1bf5109 [Andrew Or] Use the shuffle service port specified through hadoop config
b4b1f0c [Andrew Or] 4 tabs -> 2 tabs
43dcb96 [Andrew Or] First cut integration of shuffle service with Yarn aux service
b54a0c4 [Andrew Or] Initial skeleton for Yarn shuffle service

(cherry picked from commit 61a5cce)
Signed-off-by: Andrew Or <andrew@databricks.com>
@asfgit asfgit closed this in 61a5cce Nov 5, 2014
@andrewor14 andrewor14 deleted the yarn-shuffle-service branch November 5, 2014 23:47
@SparkQA
Copy link

SparkQA commented Nov 6, 2014

Test build #22954 has finished for PR 3082 at commit ef3ddae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public final class JavaGradientBoostedTrees
    • case class Params(
    • public final class LimitedInputStream extends FilterInputStream
    • public class ShuffleSecretManager implements SecretKeyHolder
    • public class YarnShuffleService extends AuxiliaryService
    • public class HadoopConfigProvider extends ConfigProvider
    • class Matrices(object):
    • class ChiSqTestResult(JavaModelWrapper):
    • case class AllocatedBlocks(streamIdToAllocatedBlocks: Map[Int, Seq[ReceivedBlockInfo]])
    • class ReceiverTracker(ssc: StreamingContext, skipReceiverLaunch: Boolean = false) extends Logging

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22954/
Test PASSed.

ankurdave pushed a commit to ankurdave/spark that referenced this pull request Nov 12, 2014
This is another addendum to apache#3082, which added the Yarn shuffle service to run inside the NM. This PR makes the feature much more usable by packaging enough dependencies into the jar to run the service inside an NM. After these changes, the user can run `./make-distribution.sh` and find a `spark-network-yarn*.jar` in their `lib` directory. The equivalent change is done in SBT by making the `network-yarn` module an assembly project.

Author: Andrew Or <andrew@databricks.com>

Closes apache#3147 from andrewor14/yarn-shuffle-build and squashes the following commits:

bda58d0 [Andrew Or] Fix line too long
81e9705 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-build
fb7f398 [Andrew Or] Rename jar to spark-{VERSION}-yarn-shuffle.jar
65db822 [Andrew Or] Actually mark slf4j as provided
abcefd1 [Andrew Or] Do the same for SBT
c653028 [Andrew Or] Package network-yarn and its dependencies
asfgit pushed a commit that referenced this pull request Nov 12, 2014
This is another addendum to #3082, which added the Yarn shuffle service to run inside the NM. This PR makes the feature much more usable by packaging enough dependencies into the jar to run the service inside an NM. After these changes, the user can run `./make-distribution.sh` and find a `spark-network-yarn*.jar` in their `lib` directory. The equivalent change is done in SBT by making the `network-yarn` module an assembly project.

Author: Andrew Or <andrew@databricks.com>

Closes #3147 from andrewor14/yarn-shuffle-build and squashes the following commits:

bda58d0 [Andrew Or] Fix line too long
81e9705 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-build
fb7f398 [Andrew Or] Rename jar to spark-{VERSION}-yarn-shuffle.jar
65db822 [Andrew Or] Actually mark slf4j as provided
abcefd1 [Andrew Or] Do the same for SBT
c653028 [Andrew Or] Package network-yarn and its dependencies

(cherry picked from commit aa43a8d)
Signed-off-by: Andrew Or <andrew@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants