[SPARK-5484][GraphX] Periodically do checkpoint in Pregel #15125
Conversation
Test build #65519 has finished for PR 15125 at commit
Test build #65521 has finished for PR 15125 at commit
Test build #65522 has finished for PR 15125 at commit
Test build #65523 has finished for PR 15125 at commit
Hi @dding3. Thanks for working on this! I was able to rebase and apply your patch to our build of Spark 2.1 to successfully compute the connected components of a graph with 5.2 billion vertices, 3.7 billion edges and 2.4 billion connected components. It took 832 iterations, but without your patch it stalled around iteration number 330. Can you please rebase your patch? I will take up a review of this PR when you've had a chance to rebase. Cheers.
Sure. I will work on the rebase and update the PR soon.
Test build #71556 has finished for PR 15125 at commit
LGTM. @srowen, can you recommend an mllib committer to review these changes? I'm not familiar with that team.
I don't know. @ankurdave? I am not sure this is being maintained in any meaningful sense now.
```scala
}
if (data.edges.getStorageLevel == StorageLevel.NONE) {
  data.edges.persist()
data.persist()
```
hmm, `GraphImpl.persist` actually persists its vertices and `replicatedVertexView.edges` (i.e., `edges`). But the problem is that we only check the storage level of vertices. Maybe we should keep the original.
I enhanced the persistence tests in `PeriodicGraphCheckpointSuite` to check that the storage level requested in the graph construction is the storage level seen after persistence. Both this version and the original version of this method failed that unit test.

The graph's vertex and edge RDDs are somewhat peculiar in that `.cache()` and `.persist()` do not do the same thing, unlike other RDDs. While `.cache()` honors the default storage level specified at graph construction time, `.persist()` always caches with the `MEMORY_ONLY` storage level.

At any rate, getting the `PeriodicGraphCheckpointer` to honor the default storage level specified at graph construction time requires changing these method calls from `persist()` to `cache()`.
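A minimal sketch of the suggested change (my reconstruction of the `persist` override's shape, not the exact patch): switch `persist()` to `cache()` so the construction-time storage level is honored.

```scala
// Hypothetical sketch, not the exact patch: cache() honors the graph's
// construction-time storage level, while persist() always uses MEMORY_ONLY.
override protected def persist(data: Graph[VD, ED]): Unit = {
  if (data.vertices.getStorageLevel == StorageLevel.NONE) {
    data.vertices.cache()
  }
  if (data.edges.getStorageLevel == StorageLevel.NONE) {
    data.edges.cache()
  }
}
```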
```scala
var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg).cache()
val messageCheckpointer = new PeriodicRDDCheckpointer[(VertexId, A)](
  checkpointInterval, graph.vertices.sparkContext)
messageCheckpointer.update(messages.asInstanceOf[RDD[(VertexId, A)]])
```
Do we need to update it before the loop? I think it should be enough to do the update in the loop.
I agree. What do you think, @dding3?
Actually, for the sake of simplicity and consistency, I'm going to suggest we keep the checkpointer update calls but remove all `.cache()` calls. The `update` calls persist the underlying data, making the calls to `.cache()` unnecessary.
I think we need to cache the graph/messages here so they don't need to be computed again in the loop. I agree with you, and I will keep the checkpointer update calls and remove all `.cache()` calls.
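A sketch of the agreed approach (assuming the surrounding Pregel variables; not the final patch): drop the explicit `.cache()` and let the checkpointer's `update` handle persistence.

```scala
// Sketch: messageCheckpointer.update() persists the RDD itself,
// so the explicit .cache() on messages becomes unnecessary.
var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg)
val messageCheckpointer = new PeriodicRDDCheckpointer[(VertexId, A)](
  checkpointInterval, graph.vertices.sparkContext)
messageCheckpointer.update(messages.asInstanceOf[RDD[(VertexId, A)]])
```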
```scala
var activeMessages = messages.count()

// Loop
var prevG: Graph[VD, ED] = null
var i = 0
while (activeMessages > 0 && i < maxIterations) {
  // Receive the messages and update the vertices.
  prevG = g
  g = g.joinVertices(messages)(vprog).cache()
```
The checkpointer will do the persist, so I think we don't need to call cache here.
I agree.
Actually, this may be more subtle than I thought. I'm going to think through this again.
Ok. I agree with your observation here, @viirya.
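For illustration, one loop step under this approach might look like the following (a sketch assuming a `graphCheckpointer` handling the graph, analogous to the message checkpointer; not the exact code):

```scala
// Sketch: the graph checkpointer's update() handles persistence,
// so no explicit .cache() after joinVertices.
prevG = g
g = g.joinVertices(messages)(vprog)
graphCheckpointer.update(g)
```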
OK
@mallman Thanks for pinging me on this. The changes look good to me overall. The main issue might be, as @srowen said, that I don't know if this is still actively maintained. I saw the most recent commits to GraphX coming from @ankurdave, @tdas, @dbtsai.
```
@@ -21,6 +21,7 @@ import org.apache.hadoop.fs.Path
```
Is there a reason this test suite isn't moved into the GraphX codebase?
```
@@ -23,7 +23,7 @@ import org.apache.spark.{SparkContext, SparkFunSuite}
import org.apache.spark.mllib.util.MLlibTestSparkContext
```
Is there a reason this file isn't moved into the core codebase?
```
@@ -76,7 +77,7 @@ import org.apache.spark.storage.StorageLevel
 *
 * TODO: Move this out of MLlib?
```
This comment should be removed.
Thank you guys for reviewing the code. I have updated it based on the comments.
Test build #72748 has finished for PR 15125 at commit
@dding3 These latest changes look great. I'll run our big connected components job today and report back.
@felixcheung ping
@srowen Can you merge this PR, please? It's been over a month since we've heard from any of the reviewers.
I feel like I really don't know anything about GraphX and can't evaluate this. It seems reasonable. I don't know if GraphX is really active at this stage?
Understood. Let me respond to your concerns.

First, this PR has undergone extensive review and revision from people who have experience with this codebase, including from someone (me) who's successfully scaled it to a Pregel computation with 800+ iterations. Without this patch, that computation is impossible. I'm not a committer, but as someone who's spent a large amount of time in the guts of GraphX I believe I'm qualified to sign off on this PR.

Second, the consumers of the GraphX codebase are dependent on support of it. I understand if the Spark team doesn't want to support it anymore, but in that case I think the responsible thing to do would be to split it out and put it into its own repo. That would allow the interested parties to maintain it independently of Spark. So long as GraphX is part of Spark, the only way to maintain it is through the Spark project. There seems to be some sentiment that GraphX is somehow deprecated or a dead-end project, yet I do not see a viable alternative. Even then, it would be very helpful to officially deprecate the project before ending support of it. Thanks.
Yeah, catch-22. I'd also like to split out graphx, but I sorta think that's what's already happened with GraphFrames. I don't feel strongly enough to campaign for it, but I think graphx should probably be deprecated soon. It's probably a fine change and I don't mind merging it if there are no objections.
Unfortunately, GraphFrames does not currently scale to the level we need. Because of fundamental differences between RDDs and DataFrames in the way they can be extended, I do not see that changing anytime soon. For those reasons, GraphX remains essential to large-scale consumers. But certainly I'm all for splitting it out into its own repo if it means the community can better maintain it. Right now we're keeping some patches on our own private branch, but since there's community interest in this patch (small as it may be) and a significant amount of work done on this PR, I hope we can merge this in.
I didn't forget about this ;) I was hoping others could review. I will do another pass today. I believe at this point we have quorum with the number of reviewers who have looked at this PR.
@dding3 a few comments - could you take a look and update?
```scala
 * Call this to unpersist the Dataset.
 */
def unpersistDataSet(): Unit = {
  while (persistedQueue.nonEmpty) {
```
I don't know if it's the goal but this isn't thread-safe?
I would agree with you that this is not thread-safe. Is that a concern?
With the limited, internal-only use, it should be OK.
```
@@ -15,7 +15,7 @@
 * limitations under the License.
 */

-package org.apache.spark.mllib.impl
+package org.apache.spark.util
```
It's `PeriodicRDDCheckpointer`; shouldn't this be in the `org.apache.spark.rdd.util` namespace?
```
@@ -122,27 +125,39 @@ object Pregel extends Logging {
  require(maxIterations > 0, s"Maximum number of iterations must be greater than 0," +
    s" but got ${maxIterations}")

  var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache()
  val checkpointInterval = graph.vertices.sparkContext.getConf
    .getInt("spark.graphx.pregel.checkpointInterval", 10)
```
```
@@ -87,10 +87,10 @@ private[mllib] class PeriodicGraphCheckpointer[VD, ED](

override protected def persist(data: Graph[VD, ED]): Unit = {
  if (data.vertices.getStorageLevel == StorageLevel.NONE) {
    data.vertices.persist()
```
I think we want to bring GraphFrames to feature/performance parity with GraphX. @mallman, I would love to understand the challenges you have run into. Better yet, it would be great to get some issues created to track them.
Regarding the scale/performance of GraphX vis-à-vis GraphFrames, I can speak from our experience with the connected components algorithm. As you know, there are two implementations of the connected components algorithm in the GraphFrames project. There's an implementation which "piggybacks" on GraphX, and there's an implementation that does not use GraphX. We don't really need any vertex or edge attributes when computing connected components. Any such attributes are strictly overhead. We found that using vertex and edge attributes of type

GraphX uses custom implementations of the RDD interface with fast point indices for lookups and joins. By contrast, the Dataset interface is closed to extension by clients, and that's by design. Considering the problem that way, I think that bringing a Dataset-based graph library to performance parity with an RDD-based graph library will be quite challenging. This is especially true in cases where the client doesn't even need vertex or edge attributes. I think that to even get to performance parity, Spark SQL needs to include support for some kind of columnar indices. But even if GraphFrames implements a better algorithm for connected components than the one in GraphX, would that algorithm perform better if it was ported to the GraphX codebase?

We'd love to use something like GraphFrames pervasively, as it does provide a much more convenient interface when we do use vertex or edge attributes. In fact, before we discovered GraphFrames we made quite a lot of headway into building our own graph library of the sort. However, we found that the overhead incurred by the DataFrame approach we took was untenable.

We use a sort of hybrid approach: we do everything except graph computations with dataframes, and convert to RDDs for graph computations. Cheers.
Test build #75968 has finished for PR 15125 at commit
Test build #76003 has finished for PR 15125 at commit
thanks @mallman I'll digest this and see what improvements we could make in gf.
@dding3 @mallman one last thing to consider: with this PR we have "spark.graphx.pregel.checkpointInterval" defaulting to 10. This changes the existing behavior (where there is no checkpointing). Do you think it would be lower risk to have this default to -1 (i.e. off)? I think this goes along with the earlier discussion of whether this should be documented in the release notes.
OK, agreed. If the user didn't set a checkpoint directory while we turn on checkpointing in Pregel by default, there may be an exception. I will change the default value of spark.graphx.pregel.checkpointInterval to -1.
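A sketch of that guard (hypothetical variable name, assuming the config read shown earlier in Pregel): checkpointing is enabled only for a positive interval, so the -1 default leaves the existing behavior unchanged.

```scala
// Hypothetical sketch: -1 (the new default) disables checkpointing,
// preserving the pre-patch behavior when no interval is configured.
val checkpointInterval = graph.vertices.sparkContext.getConf
  .getInt("spark.graphx.pregel.checkpointInterval", -1)
val checkpointingEnabled = checkpointInterval > 0
```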
Test build #76069 has finished for PR 15125 at commit
docs/graphx-programming-guide.md (outdated)

```
messages are periodically checkpoint and the checkpoint interval is set by
"spark.graphx.pregel.checkpointInterval", it can be disable by set as -1):
of its implementation (note: to avoid stackOverflowError due to long lineage chains, pregel support periodcally
checkpoint graph and messages by setting "spark.graphx.pregel.checkpointInterval"):
```
I think we can recommend a good value (say 10, which was the earlier default) to set, since it now defaults to off.
Also, good point about the checkpoint dir: it would be good to mention that it is required to be set as well, and link to any doc we have on that.
Test build #76088 has finished for PR 15125 at commit
docs/graphx-programming-guide.md (outdated)

```
@@ -709,7 +709,8 @@ messages remaining.

 The following is the type signature of the [Pregel operator][GraphOps.pregel] as well as a *sketch*
 of its implementation (note: to avoid stackOverflowError due to long lineage chains, pregel support periodcally
-checkpoint graph and messages by setting "spark.graphx.pregel.checkpointInterval"):
+checkpoint graph and messages by setting "spark.graphx.pregel.checkpointInterval" to a positive number,
+say 10. And set checkpoint directory as well using SparkContext.setCheckpointDir(directory: String)):
```
nit: this suggestion seems a bit soft :) how did the value 10 come up earlier?
I referenced the value used in LDA and other ML algorithms in Spark; by default their checkpointInterval is set to 10.
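For readers of the guide, opting in might look like the following (an illustrative sketch; the app name and checkpoint directory path are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative sketch: enable periodic Pregel checkpointing with the
// recommended interval of 10. The directory path here is hypothetical.
val conf = new SparkConf()
  .setAppName("pregel-checkpoint-example")
  .set("spark.graphx.pregel.checkpointInterval", "10")
val sc = new SparkContext(conf)
sc.setCheckpointDir("/tmp/spark-checkpoints") // required for checkpointing
```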
## What changes were proposed in this pull request?

Pregel-based iterative algorithms with more than ~50 iterations begin to slow down and eventually fail with a StackOverflowError due to Spark's lack of support for long lineage chains. This PR causes Pregel to checkpoint the graph periodically if the checkpoint directory is set.

This PR moves PeriodicGraphCheckpointer.scala from mllib to graphx, and moves PeriodicRDDCheckpointer.scala and PeriodicCheckpointer.scala from mllib to core.

## How was this patch tested?

Unit tests, manual tests.

Author: ding <ding@localhost.localdomain>
Author: dding3 <ding.ding@intel.com>
Author: Michael Allman <michael@videoamp.com>

Closes #15125 from dding3/cp2_pregel.

(cherry picked from commit 0a7f5f2)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
merged to master/2.2 |
What changes were proposed in this pull request?
Pregel-based iterative algorithms with more than ~50 iterations begin to slow down and eventually fail with a StackOverflowError due to Spark's lack of support for long lineage chains.
This PR causes Pregel to checkpoint the graph periodically if the checkpoint directory is set.
This PR moves PeriodicGraphCheckpointer.scala from mllib to graphx, and moves PeriodicRDDCheckpointer.scala and PeriodicCheckpointer.scala from mllib to core.
How was this patch tested?
unit tests, manual tests