[SPARK-18838][CORE] Use separate executor service for each event listener #16291
Conversation
Test build #70172 has finished for PR 16291 at commit
cc @zsxwing - Please note that the PR is incomplete and there are some test failures. I just wanted some initial feedback on the design before investing more time in it.
Force-pushed from b4af82f to ed79578.
Test build #70173 has finished for PR 16291 at commit
Force-pushed from ed79578 to defd536.
Test build #70180 has finished for PR 16291 at commit
This was initially introduced by @kayousterhout.
Hmm... I took a quick look and I'm not sure I understand exactly what's going on. It seems you're wrapping each listener with a ... If that's the case, that sounds wrong. Each listener should process events serially, otherwise you risk getting into funny situations like a task end event being processed before the task start event for the same task.

I think this would benefit from a proper explanation of the changes being proposed, instead of a bunch of code. How will the listener bus work with the changes? Where will we have thread pools, where will we have message queues? Will each listener get its own dedicated thread, or will there be a limit? What kind of controls will there be to avoid memory pressure? Would it be worth it to allow certain listeners to have dedicated threads while others share the same one, like the current approach? That kind of thing. Can you write something up and post it to the bug?
@vanzin - Thanks for taking a look and sorry about putting my unfinished sloppy code out there. I updated the bug and the PR with the overall design idea, which hopefully will answer your questions.
You are right; that's why the executor service is single-threaded, which guarantees ordered processing of the events per listener.
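To make the ordering argument concrete, here is a minimal sketch of the idea (illustrative names, not the PR's actual ListenerEventExecutor code): because each listener's executor has exactly one thread, events posted for that listener run strictly in submission order, while different listeners advance independently of each other.

```scala
import java.util.concurrent.{ExecutorService, Executors}

// Minimal sketch: one single-threaded executor per listener.
class PerListenerDispatcher[E](handle: E => Unit) {
  private val executor: ExecutorService = Executors.newSingleThreadExecutor()

  // post() returns immediately; the event runs on the listener's dedicated
  // thread, strictly after all previously posted events for this listener.
  def post(event: E): Unit = executor.execute(new Runnable {
    override def run(): Unit = handle(event)
  })

  def stop(): Unit = executor.shutdown()
}
```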
Force-pushed from defd536 to c790861.
Test build #70287 has finished for PR 16291 at commit
Force-pushed from c790861 to d687aaf.
Test build #70289 has finished for PR 16291 at commit
Force-pushed from d687aaf to 20fbf2f.
Test build #70290 has finished for PR 16291 at commit
Test build #70298 has started for PR 16291 at commit
Jenkins test this please.
Test build #70299 has started for PR 16291 at commit
Force-pushed from 9b4968d to d331f1e.
Test build #70311 has finished for PR 16291 at commit
Force-pushed from d331f1e to fbcdcca.
Test build #70322 has finished for PR 16291 at commit
Force-pushed from fbcdcca to 6763827.
Test build #70325 has finished for PR 16291 at commit
Force-pushed from 6763827 to 66e4f12.
Test build #70345 has started for PR 16291 at commit
Jenkins retest this please.
Test build #70360 has finished for PR 16291 at commit
I'm not sure whether that change would be a problem, but I wonder if a system where listeners can "opt in" to this behavior wouldn't be better, e.g. ...
@markhamstra, @vanzin thanks for your comments!
That is true. If a listener is slow in processing the events, then it might not see the task start event whereas other listeners might already have processed the corresponding task end event. I am not sure why that would cause any problem, though. Each listener should, in theory, work independently of the others, and we should only guarantee ordered processing of the events within a listener. This is the only way I can think of to scale the driver for large workloads.
I think that would be overkill. Having a separate thread per listener simplifies the design. Also, for user-added listeners, it will be hard to decide whether we want a shared or a separate thread per listener. The only downside of having a separate thread per listener is an increased memory footprint in the worst case. But I would argue that since the average event processing latency is decreased by the multi-threaded event processor, the backlog on the event queues will clear faster, which might actually decrease the driver memory footprint in the average case.
If we were starting from nothing, then yes, it would be valid and advisable to design the Listener infrastructure using only this weaker guarantee. The issue, though, is that we are not starting from nothing, but rather from a system that currently offers a much stronger guarantee on the synchronized behavior of Listeners.

If it is the case that no Listeners currently rely on the stronger guarantee and thus could work completely correctly under the weaker guarantee of this PR, then we could make this change without much additional concern. But reaching that level of confidence in current Listeners is a difficult prerequisite -- strictly speaking, it's an impossible task. We could carefully work through all the internal behavior of Spark's Listeners to convince ourselves that they can work correctly under the new, weaker guarantee. At a bare minimum, we need to do that much before we can consider merging this PR -- but that's probably not enough. The problem is that Listeners aren't just internal to Spark. Users have also developed their own custom Listeners that either implement ...
And all of the above is still really only arguing about whether we could adopt this PR in essentially its present form. There are still questions of whether we should do this or maybe instead we should do something a little different or more. I can see some merit in Marcelo's "opt in" suggestion. If there is utility in having groups of Listeners that can rely upon synchronized behavior, then we should probably retain one or more threads running synchronized Listeners.

For example, if Listener A relies upon synchronization with Listeners B and C while D needs to synchronize with E, but F, G and H are all independent, then there are a couple of things we could do. First, the independent Listeners (F, G and H) can each run in its own thread, providing the scalable performance that this PR is aiming for. After that, we could either have one synchronized Listener thread for all the other Listeners, or we could have one thread for A, B and C and one thread for D and E. Whether we support only one synchronized Listener group/thread or multiple, we'd still need some mechanism for Listeners to select into a synchronized group or to indicate that they can and should be run independently on their own thread.
I don't think it's overkill, and it wouldn't really be that much of a change from your existing design. It also serves as a way of controlling how much extra memory you'll need for these things, since right now each listener would incur its own thread + message queue. For example, you could just have a tagging interface (if a listener implements "X" then it gets its own dedicated dispatch thread), or, if you want to get fancy, you could also define a "group" so that listeners in the same group share a dispatch thread.
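For illustration, the tagging/grouping idea could look roughly like the following sketch. The trait and group names here are hypothetical, not an API from this PR: listeners that return the same group name would share one dispatch thread (keeping today's synchronized ordering relative to each other), while a listener returning a unique group name effectively gets its own thread.

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Hypothetical opt-in trait: the bus would key its dispatch executors by
// this group name instead of creating one executor per listener.
trait ListenerGroup {
  // Default group keeps the current behavior: one shared, synchronized thread.
  def listenerGroup: String = "default-event-listener"
}

// A listener that opts into its own dedicated dispatch thread.
class TaskMetricsListener extends SparkListener with ListenerGroup {
  override def listenerGroup: String = "task-metrics-independent"

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    // Potentially slow work that should not delay other listener groups.
  }
}
```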
I agree with @markhamstra and @vanzin - having the ability to tag listeners into groups (default = spark listener group) and preserving the current synchronized behavior within a group would ensure backward compatibility at fairly minimal additional complexity.
Test build #70649 has finished for PR 16291 at commit
Force-pushed from 162001a to f2c777d.
jenkins test this please.
Test build #70650 has finished for PR 16291 at commit
Force-pushed from f2c777d to 9e5819d.
Test build #70651 has finished for PR 16291 at commit
Force-pushed from 9e5819d to 6494cfe.
Test build #70652 has finished for PR 16291 at commit
Force-pushed from 6494cfe to e09d66e.
Test build #70656 has started for PR 16291 at commit
Jenkins retest this please.
Test build #70660 has finished for PR 16291 at commit
Thanks guys for your input. I have made changes to support grouping of listeners to preserve the existing synchronization behavior. Currently, the ...

@rxin, @vanzin, @markhamstra, @mridulm
I did a first pass to check the logic, and as I explained in the comments, my main concern is that I think this change should only apply to LiveListenerBus. This asynchronous behavior is actually not desired in ReplayListenerBus.
import org.apache.spark.internal.Logging
import org.apache.spark.scheduler.LiveListenerBus

private class ListenerEventExecutor[L <: AnyRef] (listenerName: String, queueCapacity: Int)
Not a fan of the name, since you don't execute events. I prefer ListenerEventDispatcher.
Also, listenerName is more of a group name now, right? You mention "listener" in many places of this class; those should be replaced with "listener group".
Finally, not sure why you need the type parameter?
nit: no space before (
private class ListenerEventExecutor[L <: AnyRef] (listenerName: String, queueCapacity: Int)
    extends Logging {
  private val threadFactory = new ThreadFactoryBuilder().setDaemon(true)
ThreadUtils.namedThreadFactory.
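Assuming the existing helper in org.apache.spark.util.ThreadUtils is what is meant here (it already builds a daemon ThreadFactory with a name prefix), the factory construction would reduce to something like this sketch; the package and object names below are only illustrative, and the package declaration is needed because ThreadUtils is private to Spark:

```scala
package org.apache.spark.scheduler

import java.util.concurrent.Executors

import org.apache.spark.util.ThreadUtils

private[spark] object DispatcherThreadFactoryExample {
  // namedThreadFactory produces daemon threads with the given name prefix,
  // so there is no need to construct a ThreadFactoryBuilder by hand.
  private val threadFactory = ThreadUtils.namedThreadFactory("listener-group-default")
  val executorService = Executors.newSingleThreadExecutor(threadFactory)
}
```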
 * guarantee that we do not process any event before starting the event executor.
 */
private val isStarted = new AtomicBoolean(false)
private val lock = new ReentrantLock()
Minor, but you could achieve the same thing with synchronized/wait/notifyAll, with less code.
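For reference, a sketch of that suggestion: the same start gate expressed with plain synchronized/wait/notifyAll instead of an AtomicBoolean plus ReentrantLock (the class and method names here are illustrative).

```scala
// Illustrative start gate: dispatch threads block in awaitStarted() until
// start() flips the flag and wakes them up.
class StartGate {
  private var started = false

  def start(): Unit = synchronized {
    started = true
    notifyAll() // wake any dispatch threads waiting below
  }

  def awaitStarted(): Unit = synchronized {
    while (!started) {
      wait() // monitor is released while waiting and re-acquired on wake-up
    }
  }
}
```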
def stop(): Unit = {
  executorService.shutdown()
  executorService.awaitTermination(10, TimeUnit.SECONDS)
This changes the current behavior slightly. Right now the bus will wait until the listener thread stops: listenerThread.join(). Here you might return before things are really shut down. It's unlikely to happen, but if it does happen, it can lead to weird issues. Might be better to try shutdownNow() if this call fails, which at least makes a best effort to interrupt existing tasks that might be slow.
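A sketch of the suggested stop() shape, assuming the 10-second graceful wait is kept (names are illustrative, not the merged code): fall back to shutdownNow() when graceful shutdown does not complete, so slow listener tasks are at least interrupted rather than left running after stop() returns.

```scala
import java.util.concurrent.{ExecutorService, TimeUnit}

object ExecutorShutdown {
  def stopExecutor(executorService: ExecutorService): Unit = {
    executorService.shutdown() // stop accepting new events, let queued ones drain
    if (!executorService.awaitTermination(10, TimeUnit.SECONDS)) {
      // Best effort: interrupt in-flight listener calls and discard queued events.
      executorService.shutdownNow()
    }
  }
}
```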
} catch {
  case e: RejectedExecutionException =>
    droppedEventsCounter.incrementAndGet()
    if (System.currentTimeMillis() - lastReportTimestamp >= 60 * 1000) {
When measuring elapsed time it's generally better to use System.nanoTime since it's monotonic.
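As a sketch of that suggestion (field names are illustrative), the rate-limited dropped-event reporting could track the last report with System.nanoTime, which is monotonic and unaffected by wall-clock adjustments:

```scala
import java.util.concurrent.TimeUnit
import java.util.concurrent.atomic.AtomicLong

object DroppedEventReporter {
  private val droppedEventsCounter = new AtomicLong(0)
  @volatile private var lastReportNanos = System.nanoTime()

  def onDroppedEvent(): Unit = {
    droppedEventsCounter.incrementAndGet()
    val now = System.nanoTime()
    // Compare durations with nanoTime: it is monotonic, unlike currentTimeMillis.
    if (now - lastReportNanos >= TimeUnit.MINUTES.toNanos(1)) {
      lastReportNanos = now
      val dropped = droppedEventsCounter.getAndSet(0)
      println(s"Dropped $dropped listener events in the last minute")
    }
  }
}
```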
@@ -35,7 +35,7 @@ import org.apache.spark.util.ListenerBus
  * and StreamingQueryManager. So this bus will dispatch events to registered listeners for only
  * those queries that were started in the associated SparkSession.
  */
-class StreamingQueryListenerBus(sparkListenerBus: LiveListenerBus)
+class StreamingQueryListenerBus(val sparkListenerBus: LiveListenerBus)
Do you need to make this a val?
@@ -69,7 +69,7 @@ class StreamingQueryListenerBus(sparkListenerBus: LiveListenerBus)
   activeQueryRunIds.synchronized { activeQueryRunIds += s.runId }
   sparkListenerBus.post(s)
   // post to local listeners to trigger callbacks
-  postToAll(s)
+  postToAllSync(s)
Do you need this change? Seems like it wasn't a synchronous call before. If you need it, then postToAllSync isn't really for testing only, and you need to change the comment in that method.
Same below.
@@ -661,7 +661,7 @@ class StreamingContext private[streaming] (
   var shutdownHookRefToRemove: AnyRef = null
   if (LiveListenerBus.withinListenerThread.value) {
     throw new SparkException(
-      s"Cannot stop StreamingContext within listener thread of ${LiveListenerBus.name}")
+      "Cannot stop SparkContext within listener event executor thread")
StreamingContext was actually correct.
@@ -26,7 +26,7 @@ import org.apache.spark.util.ListenerBus
  * registers itself with Spark listener bus, so that it can receive WrappedStreamingListenerEvents,
  * unwrap them as StreamingListenerEvent and dispatch them to StreamingListeners.
  */
-private[streaming] class StreamingListenerBus(sparkListenerBus: LiveListenerBus)
+private[streaming] class StreamingListenerBus(val sparkListenerBus: LiveListenerBus)
Similarly, is val necessary?
}

private[spark] object ListenerEventExecutor {
  val DefaultEventListenerGroup = "default-event-listener"
Spark's style is to use ALL_CAPS for constants.
@sitalkedia are you still working on this?
@tgravescs - I have not had time to work on this. Hopefully I will get to it sometime next week.
What's the current status of this PR?
What changes were proposed in this pull request?
Currently we are observing very high event processing delay in the driver's ListenerBus for large jobs with many tasks. Many critical components of the scheduler, like ExecutorAllocationManager and HeartbeatReceiver, depend on the ListenerBus events, and this delay might hurt job performance significantly or even fail the job. For example, a significant delay in receiving SparkListenerTaskStart might cause the ExecutorAllocationManager to mistakenly remove an executor which is not idle.

The problem is that the event processor in ListenerBus is a single thread which loops through all the listeners for each event and processes each event synchronously (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94). This single-threaded processor often becomes the bottleneck for large jobs. Also, if one of the listeners is very slow, all the listeners pay the price of the delay incurred by the slow listener. In addition, a slow listener can cause events to be dropped from the event queue, which might be fatal to the job.

To solve the above problems, we propose to get rid of the shared event queue and the single-threaded event processor. Instead, each listener will have its own dedicated single-threaded executor service. Whenever an event is posted, it will be submitted to the executor service of each listener. The single-threaded executor service guarantees in-order processing of the events per listener. The queue used for the executor service will be bounded to guarantee that memory does not grow indefinitely. The downside of this approach is that a separate event queue per listener will increase the driver memory footprint.
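As a rough sketch of that proposal (illustrative class names, not the exact code in this PR), the per-listener dispatcher could pair a single worker thread with a bounded queue, so ordering is preserved per listener, a slow listener only delays itself, and per-listener memory stays capped:

```scala
import java.util.concurrent.{LinkedBlockingQueue, RejectedExecutionException, ThreadPoolExecutor, TimeUnit}

// One instance per listener: exactly one worker thread plus a bounded queue.
class BoundedListenerExecutor(queueCapacity: Int) {
  private val executor = new ThreadPoolExecutor(
    1, 1, 0L, TimeUnit.MILLISECONDS, // core = max = 1 thread => in-order processing
    new LinkedBlockingQueue[Runnable](queueCapacity))

  /** Submit an event handler; returns false if the queue is full and the event is dropped. */
  def post(handle: () => Unit): Boolean = {
    try {
      executor.execute(new Runnable { override def run(): Unit = handle() })
      true
    } catch {
      // The default AbortPolicy throws when the bounded queue is full.
      case _: RejectedExecutionException => false
    }
  }

  def stop(): Unit = executor.shutdown()
}
```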
How was this patch tested?
Tested by running a job on the cluster; the average event processing latency dropped from 2 minutes to 3 milliseconds.