SPARK-2387: remove stage barrier #1328
Conversation
… in MapOutputTracker
… preferred locations
…stage gets re-submitted
…e finishing slowly
This reverts commit 12b8093.
Can one of the admins verify this patch?
SPARK-2099 is adding a general executor->driver heartbeat. It might be worth piggybacking the communication between the MapOutputTrackerWorker and MapOutputTrackerMaster on this.
Thanks @sryza for the idea. I think it's OK to piggyback the communication in a heartbeat, but we should also allow the worker to explicitly ask the master for map statuses when a task demands more outputs. I'll look into it once you have that feature merged.
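For reference, a rough sketch of what the piggybacked messages might look like; all names and fields below are hypothetical illustrations, not SPARK-2099's actual API:

```scala
// Purely illustrative message shapes; SPARK-2099 defines the real
// heartbeat, and none of these fields is its actual API.
case class Heartbeat(
    executorId: String,
    staleShuffleIds: Seq[Int]) // piggybacked: shuffles the worker wants refreshed statuses for

case class HeartbeatResponse(
    updatedStatuses: Map[Int, Array[Byte]]) // piggybacked: serialized statuses per shuffle
```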
I think there are several issues that need to be addressed here to make this a more robust and sound solution:
Just my suggestion.
@@ -340,6 +459,7 @@ private[spark] class MapOutputTrackerMaster(conf: SparkConf)
 */
private[spark] class MapOutputTrackerWorker(conf: SparkConf) extends MapOutputTracker(conf) {
  protected val mapStatuses = new HashMap[Int, Array[MapStatus]]
+   with mutable.SynchronizedMap[Int, Array[MapStatus]]
I think ConcurrentHashMap is better in most cases.
Will this PR be merged soon? If not, I hope this line can be merged separately soon, because it fixes a critical concurrency issue with mapStatuses.
@zsxwing thanks for the comments. Maybe it's better to make it ConcurrentHashMap in the base class.
I don't think this PR can be merged soon... So maybe you can open another JIRA to fix this.
> Maybe it's better to make it ConcurrentHashMap in the base class.
Because MapOutputTrackerMaster uses TimeStampedHashMap, which is not a ConcurrentHashMap, MapOutputTracker still needs to declare mapStatuses as a plain Map. Nevertheless, I can add a comment on MapOutputTracker.mapStatuses to note that it should be a thread-safe map.
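For illustration, a minimal standalone sketch of the ConcurrentHashMap-backed approach; the MapStatus stand-in and the object wrapper are scaffolding for the example, not Spark code:

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._

object ConcurrentMapStatuses {
  // Stand-in for org.apache.spark.scheduler.MapStatus, only for this sketch.
  final case class MapStatus(location: String)

  // Wrapping a ConcurrentHashMap as a scala.collection.concurrent.Map gives
  // thread safety without the mutable.SynchronizedMap mixin.
  val mapStatuses: scala.collection.concurrent.Map[Int, Array[MapStatus]] =
    new ConcurrentHashMap[Int, Array[MapStatus]]().asScala

  def main(args: Array[String]): Unit = {
    mapStatuses.putIfAbsent(0, Array(MapStatus("host-a:7337")))
    println(mapStatuses.get(0).map(_.length)) // Some(1)
  }
}
```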
I'd like to close this issue for now pending more of a design discussion on the JIRA. These Proof of Concept patches are useful to have, but I'd rather not have them lingering for a long time in the PR queue. I will post a link on the JIRA to this diff so we have it as a reference:
…shMap

MapOutputTrackerWorker.mapStatuses is used concurrently, so it should be thread-safe. This bug has already been fixed in #1328. Nevertheless, considering #1328 won't be merged soon, I'm sending this trivial fix in the hope that the issue can be resolved soon.

Author: zsxwing <zsxwing@gmail.com>

Closes #1541 from zsxwing/SPARK-2634 and squashes the following commits:

d450053 [zsxwing] SPARK-2634: Change MapOutputTrackerWorker.mapStatuses to ConcurrentHashMap
This PR is a PoC implementation of SPARK-2387.
When a ShuffleMapTask finishes, the DAGScheduler checks resource usage. If there are free slots, it picks a stage from the waiting list whose parent stages have all started, and pre-starts this waiting stage. All the in-progress parent stages then register their map outputs progressively with the MapOutputTrackerMaster. A flag is added to MapOutputTracker to indicate whether the map statuses for a shuffle are partial, so that partial registration can be distinguished from a failed shuffle map stage.
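As a rough illustration of this partial-registration bookkeeping, here is a minimal sketch; all names (PartialShuffleTracker, isPartial, the MapStatus stand-in) are illustrative, not the PR's actual classes:

```scala
import scala.collection.mutable

// Illustrative stand-in for org.apache.spark.scheduler.MapStatus.
final case class MapStatus(execId: String, mapId: Int)

// Minimal sketch of the "partial map statuses" idea described above.
class PartialShuffleTracker {
  private final class ShuffleState(numMaps: Int) {
    val statuses = new Array[MapStatus](numMaps) // nulls are the "holes"
    var registered = 0
  }

  private val shuffles = mutable.Map[Int, ShuffleState]()

  // Called when a stage is pre-started, before any of its map tasks finish.
  def registerShuffle(shuffleId: Int, numMaps: Int): Unit = synchronized {
    shuffles(shuffleId) = new ShuffleState(numMaps)
  }

  // Called progressively, as each ShuffleMapTask completes.
  def registerMapOutput(shuffleId: Int, mapId: Int, status: MapStatus): Unit =
    synchronized {
      val state = shuffles(shuffleId)
      if (state.statuses(mapId) == null) state.registered += 1
      state.statuses(mapId) = status
    }

  // Partial registration (some maps still running) is distinguishable from a
  // failed or unknown shuffle, which has no entry at all.
  def isPartial(shuffleId: Int): Boolean = synchronized {
    val state = shuffles(shuffleId)
    state.registered < state.statuses.length
  }

  def getStatuses(shuffleId: Int): Array[MapStatus] = synchronized {
    shuffles(shuffleId).statuses.clone()
  }
}
```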
When a downstream task tries to fetch shuffle blocks, it gets an array of map outputs that has "holes" (unfinished map tasks) in it. We created PartialBlockFetcherIterator to handle this map output array. PartialBlockFetcherIterator keeps an array of conventional iterators (BasicBlockFetcherIterator or NettyBlockFetcherIterator): whenever new map outputs become available, it delegates them to a fresh conventional iterator, and it relies on these conventional iterators for its hasNext and next methods. When all the delegated map statuses run out, PartialBlockFetcherIterator contacts the local MapOutputTrackerWorker for updated map outputs. MapOutputTrackerWorker uses an "updater" thread to communicate with MapOutputTrackerMaster; when the map statuses get updated, it informs the downstream tasks to continue.
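The delegation pattern can be sketched as follows; the class and callback names are illustrative, with a single callback standing in for "ask the tracker and block until new outputs arrive":

```scala
// Illustrative sketch of the delegation described above. The real PR wraps
// BasicBlockFetcherIterator / NettyBlockFetcherIterator as the delegates.
class PartialFetcherIterator[T](fetchNewBatch: () => Option[Iterator[T]])
    extends Iterator[T] {

  private var current: Iterator[T] = Iterator.empty

  override def hasNext: Boolean = {
    while (!current.hasNext) {
      fetchNewBatch() match {
        case Some(it) => current = it // new map outputs delegated to a fresh iterator
        case None     => return false // every map output has been consumed
      }
    }
    true
  }

  override def next(): T = {
    if (!hasNext) throw new NoSuchElementException("end of partial fetch")
    current.next()
  }
}
```

In the PR itself, the blocking step lives in MapOutputTrackerWorker, whose updater thread refreshes the map statuses and wakes the waiting task.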
This PoC feature is mainly intended for, and tested against, standalone clusters. I used a 7-node cluster for the performance test; each node runs an executor with 32 CPUs and 90GB memory. I used graphx.SynthBenchmark for the test, with the following test case:
graphx.SynthBenchmark -partStrategy=EdgePartition2D -numEPart=112 -nverts=10000000 -niter=3
The feature improves the whole job by roughly 10% (it reduces the creation time from 128s to 116s and the run time from 126s to 115s).