
Suggest workarounds for partitionBy in Spark 1.0.0 due to SPARK-1931 #908

Closed

Conversation

ankurdave (Contributor)

The Graph.partitionBy operator allows users to choose the graph partitioning strategy, but due to SPARK-1931, this method is broken in Spark 1.0.0. This PR updates the GraphX docs for Spark 1.0.0 to encourage users to build the latest version of Spark from branch-1.0, which contains a fix. Alternatively, it suggests a workaround involving partitioning the edges before constructing the graph.
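
For context, a minimal sketch of the call that SPARK-1931 breaks, assuming the GraphX 1.0 API and a hypothetical edge-list file:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx._

// "followers.txt" is a hypothetical edge-list file used for illustration.
val sc = new SparkContext("local", "partitionBy-demo")
val graph = GraphLoader.edgeListFile(sc, "followers.txt")
// Broken in Spark 1.0.0 due to SPARK-1931:
val repartitioned = graph.partitionBy(PartitionStrategy.EdgePartition2D)
```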

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@ankurdave (Contributor, Author)

@pwendell Since this is a post-release doc change, I wasn't sure which branch to submit against -- let me know if I should rebase to another one.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15272/

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15273/

@npanj

npanj commented May 29, 2014

It's really nice of you to update the document with the workaround. Ideally we should have tried to convince Patrick et al. of the importance of this fix and why it should be in 1.0. IMHO, partitionBy is one of the most important operations in GraphX for many algorithms.

@ankurdave (Contributor, Author)

@npanj It's unfortunate that there'll be a big known problem in the 1.0.0 release. Still, I think it's tolerable for the following reasons:

  1. The only GraphX library that requires calling partitionBy (namely TriangleCount) expects the user to do it before passing the graph in. This enables use of the workaround, which is reasonably simple, though inefficient (a sketch follows this list).
  2. Due to the early stage of GraphX, we've been encouraging people to use master for the latest features anyway.
  3. Spark 1.0.1 will include the fix and should follow just a few weeks after 1.0.0.
  4. For people who need a prebuilt distribution, we (the GraphX team) can make a patched one.

The only major problem would be if the third-party Spark distributions don't upgrade to 1.0.1 in a timely manner.
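
A minimal sketch of point 1, assuming the GraphX 1.0 API, a hypothetical edge-list file, and a build containing the fix (on 1.0.0, the edge RDD would be pre-partitioned via the workaround in the doc snippet below instead):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx._

// TriangleCount expects the caller to canonically orient and partition the
// graph before passing it in; "edges.txt" is a hypothetical edge list.
val sc = new SparkContext("local", "triangle-count")
val graph = GraphLoader.edgeListFile(sc, "edges.txt", canonicalOrientation = true)
  .partitionBy(PartitionStrategy.RandomVertexCut)
val triCounts = graph.triangleCount().vertices
```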

@pwendell (Contributor)

Hey @npanj, Matei and I talked with Ankur a bunch; this was a tough call, but together we decided to release without it. The fix is already in the release branch, so it will be available right away for those who want it. We'll likely have a 1.0.1 pretty quickly with a handful of fixes, and this can be included.

## Workaround for `Graph.partitionBy` in Spark 1.0.0
<a name="partitionBy_workaround"></a>

The [`Graph.partitionBy`][Graph.partitionBy] operator allows users to choose the graph partitioning strategy, but due to [SPARK-1931](https://issues.apache.org/jira/browse/SPARK-1931), this method is broken in Spark 1.0.0. We encourage users to build the latest version of Spark from the master branch, which contains a fix. Alternatively, a workaround is to partition the edges before constructing the graph, as follows:
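
A code snippet followed at this point in the diff; a minimal sketch of that workaround, assuming the GraphX 1.0 APIs (`PartitionStrategy.getPartition`, `Graph.fromEdges`) and an illustrative in-memory edge list:

```scala
import org.apache.spark.{HashPartitioner, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.graphx._

// Apply the partition strategy to the raw edge RDD, then build the graph
// from the pre-partitioned edges instead of calling Graph.partitionBy.
def partitionEdges[ED](
    edges: RDD[Edge[ED]],
    strategy: PartitionStrategy): RDD[Edge[ED]] = {
  val numPartitions = edges.partitions.size
  edges
    // Key each edge by the partition the strategy assigns it.
    .map(e => (strategy.getPartition(e.srcId, e.dstId, numPartitions), e))
    // Shuffle each edge to its assigned partition.
    .partitionBy(new HashPartitioner(numPartitions))
    // Strip the keys, preserving the partitioning.
    .mapPartitions(_.map(_._2), preservesPartitioning = true)
}

// Illustrative usage: partition a tiny edge list, then construct the graph.
val sc = new SparkContext("local", "partitionBy-workaround")
val rawEdges: RDD[Edge[Int]] =
  sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
val graph = Graph.fromEdges(
  partitionEdges(rawEdges, PartitionStrategy.EdgePartition2D),
  defaultValue = 0)
```

This roughly mirrors what `Graph.partitionBy` does: shuffle edges according to the strategy's partition assignment, except it happens before the graph's internal structures are built, so no re-partitioning is needed afterwards.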
Contributor (inline review comment)

I'd actually suggest they build Spark from the 1.0 branch if this fix is the main thing they are interested in.

@pwendell (Contributor)

Looks good, made some minor comments. We just need to remember to revert this for 1.0.1 :)

@ankurdave (Contributor, Author)

@pwendell Thanks for the comments - done.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15279/

@mateiz (Contributor)

mateiz commented Jun 3, 2014

Hey @ankurdave, I'd suggest just updating the 1.0.0 docs on the website manually for now, and then we can remove it for 1.0.1. What do you think?

@ankurdave (Contributor, Author)

Sure, I'll do that now and then close this PR. People who rebuild the docs before 1.0.1 is out will need to include these changes though.

@ankurdave (Contributor, Author)

Done. Closing.

ankurdave closed this Jun 3, 2014