
Suggest workarounds for partitionBy in Spark 1.0.0 due to SPARK-1931 #908

Closed

Conversation

ankurdave (Contributor)

The Graph.partitionBy operator allows users to choose the graph partitioning strategy, but due to SPARK-1931, this method is broken in Spark 1.0.0. This PR updates the GraphX docs for Spark 1.0.0 to encourage users to build the latest version of Spark from branch-1.0, which contains a fix. Alternatively, it suggests a workaround involving partitioning the edges before constructing the graph.
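
For context, a minimal sketch of the call that SPARK-1931 breaks, assuming the GraphX 1.0 API and a hypothetical edge-list file:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx._

// "followers.txt" is a hypothetical edge-list file used for illustration.
val sc = new SparkContext("local", "partitionBy-demo")
val graph = GraphLoader.edgeListFile(sc, "followers.txt")
// Broken in Spark 1.0.0 due to SPARK-1931:
val repartitioned = graph.partitionBy(PartitionStrategy.EdgePartition2D)
```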

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@ankurdave (Contributor, Author)

@pwendell Since this is a post-release doc change, I wasn't sure which branch to submit against -- let me know if I should rebase to another one.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15272/

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15273/

@npanj

npanj commented May 29, 2014

It's really nice of you to update the document with the workaround. Ideally we should have tried to convince Patrick et al. of the importance of this fix and why it should be in 1.0. IMHO, partitionBy is one of the most important operations in GraphX for many algorithms.

@ankurdave (Contributor, Author)

@npanj It's unfortunate that there'll be a big known problem in the 1.0.0 release. Still, I think it's tolerable for the following reasons:

  1. The only GraphX library that requires calling partitionBy (namely TriangleCount) expects the user to do it before passing the graph in. This enables use of the workaround, which is reasonably simple, though inefficient (a sketch follows this list).
  2. Due to the early stage of GraphX, we've been encouraging people to use master for the latest features anyway.
  3. Spark 1.0.1 will include the fix and should follow just a few weeks after 1.0.0.
  4. For people who need a prebuilt distribution, we (the GraphX team) can make a patched one.

The only major problem would be if the third-party Spark distributions don't upgrade to 1.0.1 in a timely manner.
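
A minimal sketch of point 1, assuming the GraphX 1.0 API, a hypothetical edge-list file, and a build containing the fix (on 1.0.0, the edge RDD would be pre-partitioned via the workaround in the doc snippet below instead):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx._

// TriangleCount expects the caller to canonically orient and partition the
// graph before passing it in; "edges.txt" is a hypothetical edge list.
val sc = new SparkContext("local", "triangle-count")
val graph = GraphLoader.edgeListFile(sc, "edges.txt", canonicalOrientation = true)
  .partitionBy(PartitionStrategy.RandomVertexCut)
val triCounts = graph.triangleCount().vertices
```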

@pwendell (Contributor)

Hey @npanj, Matei and I talked with Ankur a bunch; this was a tough call, but together we decided to release without it. The fix is already in the release branch, so it will be available right away for those who want it. We'll likely have a 1.0.1 pretty quickly with a handful of fixes, and this can be included.

## Workaround for `Graph.partitionBy` in Spark 1.0.0
<a name="partitionBy_workaround"></a>

The [`Graph.partitionBy`][Graph.partitionBy] operator allows users to choose the graph partitioning strategy, but due to [SPARK-1931](https://issues.apache.org/jira/browse/SPARK-1931), this method is broken in Spark 1.0.0. We encourage users to build the latest version of Spark from the master branch, which contains a fix. Alternatively, a workaround is to partition the edges before constructing the graph, as follows:
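
A code snippet followed at this point in the diff; a minimal sketch of that workaround, assuming the GraphX 1.0 APIs (`PartitionStrategy.getPartition`, `Graph.fromEdges`) and an illustrative in-memory edge list:

```scala
import org.apache.spark.{HashPartitioner, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.graphx._

// Apply the partition strategy to the raw edge RDD, then build the graph
// from the pre-partitioned edges instead of calling Graph.partitionBy.
def partitionEdges[ED](
    edges: RDD[Edge[ED]],
    strategy: PartitionStrategy): RDD[Edge[ED]] = {
  val numPartitions = edges.partitions.size
  edges
    // Key each edge by the partition the strategy assigns it.
    .map(e => (strategy.getPartition(e.srcId, e.dstId, numPartitions), e))
    // Shuffle each edge to its assigned partition.
    .partitionBy(new HashPartitioner(numPartitions))
    // Strip the keys, preserving the partitioning.
    .mapPartitions(_.map(_._2), preservesPartitioning = true)
}

// Illustrative usage: partition a tiny edge list, then construct the graph.
val sc = new SparkContext("local", "partitionBy-workaround")
val rawEdges: RDD[Edge[Int]] =
  sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
val graph = Graph.fromEdges(
  partitionEdges(rawEdges, PartitionStrategy.EdgePartition2D),
  defaultValue = 0)
```

This roughly mirrors what `Graph.partitionBy` does: shuffle edges according to the strategy's partition assignment, except it happens before the graph's internal structures are built, so no re-partitioning is needed afterwards.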
Contributor (inline review comment)

I'd actually suggest they build Spark from the 1.0 branch if this fix is the main thing they are interested in.

@pwendell (Contributor)

Looks good, made some minor comments. We just need to remember to revert this for 1.0.1 :)

@ankurdave (Contributor, Author)

@pwendell Thanks for the comments - done.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15279/

@mateiz (Contributor)

mateiz commented Jun 3, 2014

Hey @ankurdave, I'd suggest just updating the 1.0.0 docs on the website manually for now, and then we can remove it for 1.0.1. What do you think?

@ankurdave (Contributor, Author)

Sure, I'll do that now and then close this PR. People who rebuild the docs before 1.0.1 is out will need to include these changes though.

@ankurdave (Contributor, Author)

Done. Closing.

ankurdave closed this Jun 3, 2014