SPARK-2387: remove stage barrier #1328
Conversation
… in MapOutputTracker
… preferred locations
…stage gets re-submitted
…e finishing slowly
This reverts commit 12b8093.
Can one of the admins verify this patch?
SPARK-2099 is adding a general executor->driver heartbeat. It might be worth piggybacking the communication between the MapOutputTrackerWorker and MapOutputTrackerMaster on this.
Thanks @sryza for the idea. I think it's OK to piggyback the communication in a heartbeat, but we should also allow the worker to explicitly ask the master for map statuses when a task demands more outputs. I'll look into it once you have that feature merged.
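For reference, a rough sketch of what the piggybacked messages might look like; all names and fields below are hypothetical illustrations, not SPARK-2099's actual API:

```scala
// Purely illustrative message shapes; SPARK-2099 defines the real
// heartbeat, and none of these fields is its actual API.
case class Heartbeat(
    executorId: String,
    staleShuffleIds: Seq[Int]) // piggybacked: shuffles the worker wants refreshed statuses for

case class HeartbeatResponse(
    updatedStatuses: Map[Int, Array[Byte]]) // piggybacked: serialized statuses per shuffle
```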
I think there are several issues that need to be addressed here to make this a more robust and sound solution:
Just my suggestion.
@@ -340,6 +459,7 @@ private[spark] class MapOutputTrackerMaster(conf: SparkConf)
 */
private[spark] class MapOutputTrackerWorker(conf: SparkConf) extends MapOutputTracker(conf) {
  protected val mapStatuses = new HashMap[Int, Array[MapStatus]]
+   with mutable.SynchronizedMap[Int, Array[MapStatus]]
I think ConcurrentHashMap is better in most cases.
Will this PR be merged soon? If not, I hope this line can be merged separately soon, because it fixes a critical concurrency issue with mapStatuses.
@zsxwing thanks for the comments. Maybe it's better to make it ConcurrentHashMap in the base class.
I don't think this PR can be merged soon... So maybe you can open another JIRA to fix this.
> Maybe it's better to make it ConcurrentHashMap in the base class.
Because MapOutputTrackerMaster uses TimeStampedHashMap, which is not a ConcurrentHashMap, MapOutputTracker still needs to declare mapStatuses as a plain Map. Nevertheless, I can add a comment on MapOutputTracker.mapStatuses to note that it should be a thread-safe map.
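For illustration, a minimal standalone sketch of the ConcurrentHashMap-backed approach; the MapStatus stand-in and the object wrapper are scaffolding for the example, not Spark code:

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._

object ConcurrentMapStatuses {
  // Stand-in for org.apache.spark.scheduler.MapStatus, only for this sketch.
  final case class MapStatus(location: String)

  // Wrapping a ConcurrentHashMap as a scala.collection.concurrent.Map gives
  // thread safety without the mutable.SynchronizedMap mixin.
  val mapStatuses: scala.collection.concurrent.Map[Int, Array[MapStatus]] =
    new ConcurrentHashMap[Int, Array[MapStatus]]().asScala

  def main(args: Array[String]): Unit = {
    mapStatuses.putIfAbsent(0, Array(MapStatus("host-a:7337")))
    println(mapStatuses.get(0).map(_.length)) // Some(1)
  }
}
```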
I'd like to close this issue for now pending more of a design discussion on the JIRA. These Proof of Concept patches are useful to have, but I'd rather not have them lingering for a long time in the PR queue. I will post a link on the JIRA to this diff so we have it as a reference:
…shMap

MapOutputTrackerWorker.mapStatuses is used concurrently, so it should be thread-safe. This bug has already been fixed in #1328. Nevertheless, considering #1328 won't be merged soon, I'm sending this trivial fix in the hope that the issue can be resolved soon.

Author: zsxwing <zsxwing@gmail.com>

Closes #1541 from zsxwing/SPARK-2634 and squashes the following commits:

d450053 [zsxwing] SPARK-2634: Change MapOutputTrackerWorker.mapStatuses to ConcurrentHashMap
This PR is a PoC implementation of SPARK-2387.
When a ShuffleMapTask finishes, the DAGScheduler checks resource usage. If there are free slots, it picks a stage from the waiting list whose parent stages have all started, and pre-starts this waiting stage. All the in-progress parent stages then register their map outputs progressively with the MapOutputTrackerMaster. A flag is added to MapOutputTracker to indicate whether the map statuses for a shuffle are partial, so that partial registration can be distinguished from a failed shuffle map stage.
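As a rough illustration of this partial-registration bookkeeping, here is a minimal sketch; all names (PartialShuffleTracker, isPartial, the MapStatus stand-in) are illustrative, not the PR's actual classes:

```scala
import scala.collection.mutable

// Illustrative stand-in for org.apache.spark.scheduler.MapStatus.
final case class MapStatus(execId: String, mapId: Int)

// Minimal sketch of the "partial map statuses" idea described above.
class PartialShuffleTracker {
  private final class ShuffleState(numMaps: Int) {
    val statuses = new Array[MapStatus](numMaps) // nulls are the "holes"
    var registered = 0
  }

  private val shuffles = mutable.Map[Int, ShuffleState]()

  // Called when a stage is pre-started, before any of its map tasks finish.
  def registerShuffle(shuffleId: Int, numMaps: Int): Unit = synchronized {
    shuffles(shuffleId) = new ShuffleState(numMaps)
  }

  // Called progressively, as each ShuffleMapTask completes.
  def registerMapOutput(shuffleId: Int, mapId: Int, status: MapStatus): Unit =
    synchronized {
      val state = shuffles(shuffleId)
      if (state.statuses(mapId) == null) state.registered += 1
      state.statuses(mapId) = status
    }

  // Partial registration (some maps still running) is distinguishable from a
  // failed or unknown shuffle, which has no entry at all.
  def isPartial(shuffleId: Int): Boolean = synchronized {
    val state = shuffles(shuffleId)
    state.registered < state.statuses.length
  }

  def getStatuses(shuffleId: Int): Array[MapStatus] = synchronized {
    shuffles(shuffleId).statuses.clone()
  }
}
```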
When a downstream task tries to fetch shuffle blocks, it gets an array of map outputs that has "holes" (unfinished map tasks) in it. We created PartialBlockFetcherIterator to handle this map output array. PartialBlockFetcherIterator keeps an array of conventional iterators (BasicBlockFetcherIterator or NettyBlockFetcherIterator): whenever new map outputs become available, it delegates them to a fresh conventional iterator, and it relies on these conventional iterators for its hasNext and next methods. When all the delegated map statuses run out, PartialBlockFetcherIterator contacts the local MapOutputTrackerWorker for updated map outputs. MapOutputTrackerWorker uses an "updater" thread to communicate with MapOutputTrackerMaster; when the map statuses get updated, it informs the downstream tasks to continue.
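The delegation pattern can be sketched as follows; the class and callback names are illustrative, with a single callback standing in for "ask the tracker and block until new outputs arrive":

```scala
// Illustrative sketch of the delegation described above. The real PR wraps
// BasicBlockFetcherIterator / NettyBlockFetcherIterator as the delegates.
class PartialFetcherIterator[T](fetchNewBatch: () => Option[Iterator[T]])
    extends Iterator[T] {

  private var current: Iterator[T] = Iterator.empty

  override def hasNext: Boolean = {
    while (!current.hasNext) {
      fetchNewBatch() match {
        case Some(it) => current = it // new map outputs delegated to a fresh iterator
        case None     => return false // every map output has been consumed
      }
    }
    true
  }

  override def next(): T = {
    if (!hasNext) throw new NoSuchElementException("end of partial fetch")
    current.next()
  }
}
```

In the PR itself, the blocking step lives in MapOutputTrackerWorker, whose updater thread refreshes the map statuses and wakes the waiting task.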
This PoC feature is mainly intended for, and tested against, standalone clusters. I used a 7-node cluster for the performance test; each node runs an executor with 32 CPUs and 90GB memory. I used graphx.SynthBenchmark for the test, with the following test case:
graphx.SynthBenchmark -partStrategy=EdgePartition2D -numEPart=112 -nverts=10000000 -niter=3
The feature improves the whole job by roughly 10% (it reduces the creation time from 128s to 116s and the run time from 126s to 115s).