
[SPARK-18838][CORE] Use separate executor service for each event listener #16291

Closed

Conversation

@sitalkedia commented Dec 15, 2016

What changes were proposed in this pull request?

Currently we are observing very high event processing delays in the driver's ListenerBus for large jobs with many tasks. Many critical components of the scheduler, like ExecutorAllocationManager and HeartbeatReceiver, depend on ListenerBus events, and this delay can hurt job performance significantly or even fail the job. For example, a significant delay in receiving SparkListenerTaskStart can cause the ExecutorAllocationManager to mistakenly remove an executor that is not idle.
The problem is that the event processor in ListenerBus is a single thread that loops through all the listeners for each event and processes each event synchronously (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94). This single-threaded processor often becomes the bottleneck for large jobs. Also, if one of the listeners is very slow, all the listeners pay the price of the delay incurred by the slow listener. In addition, a slow listener can cause events to be dropped from the event queue, which might be fatal to the job.
To solve the above problems, we propose to get rid of the event queue and the single-threaded event processor. Instead, each listener will have its own dedicated single-threaded executor service. Whenever an event is posted, it will be submitted to the executor service of every listener. The single-threaded executor service guarantees in-order processing of events per listener. The queue used by each executor service will be bounded, to guarantee that memory does not grow indefinitely. The downside of this approach is that a separate event queue per listener will increase the driver's memory footprint.
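For illustration, a minimal sketch of the proposed design (class and field names here are illustrative, not the PR's actual identifiers): each listener gets a single-threaded executor backed by a bounded queue, so events for one listener stay ordered while a slow listener no longer delays the others.

```scala
import java.util.concurrent.{ArrayBlockingQueue, RejectedExecutionException,
  ThreadPoolExecutor, TimeUnit}
import java.util.concurrent.atomic.AtomicLong

// Hypothetical sketch: one single-threaded, bounded executor per listener.
// The single worker thread guarantees in-order delivery for this listener;
// the bounded queue caps driver memory if the listener falls behind.
class ListenerDispatcher(handle: AnyRef => Unit, queueCapacity: Int) {
  private val droppedEvents = new AtomicLong(0)
  private val executor = new ThreadPoolExecutor(
    1, 1, 0L, TimeUnit.MILLISECONDS, // core = max = 1: exactly one worker
    new ArrayBlockingQueue[Runnable](queueCapacity))

  def post(event: AnyRef): Unit = {
    try {
      executor.execute(new Runnable {
        override def run(): Unit = handle(event)
      })
    } catch {
      // Queue full: count the drop rather than block the posting thread.
      case _: RejectedExecutionException => droppedEvents.incrementAndGet()
    }
  }

  def stop(): Unit = {
    executor.shutdown()
    executor.awaitTermination(10, TimeUnit.SECONDS)
  }
}
```

Posting an event to the bus then becomes a loop that calls `post` on every listener's dispatcher; a slow listener only backs up its own queue.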

How was this patch tested?

Tested by running the job on the cluster; the average event processing latency dropped from 2 minutes to 3 milliseconds.

@SparkQA commented Dec 15, 2016

Test build #70172 has finished for PR 16291 at commit b4af82f.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class StreamingQueryListenerBus(val sparkListenerBus: LiveListenerBus)

@sitalkedia (Author)

cc @zsxwing - Please note that the PR is incomplete and there are some test failures. I just wanted some initial feedback on the design before investing more time in it.

@sitalkedia force-pushed the skedia/event_log_processing branch from b4af82f to ed79578 on December 15, 2016 03:07
@SparkQA commented Dec 15, 2016

Test build #70173 has finished for PR 16291 at commit ed79578.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class StreamingQueryListenerBus(val sparkListenerBus: LiveListenerBus)

@sitalkedia force-pushed the skedia/event_log_processing branch from ed79578 to defd536 on December 15, 2016 05:08
@sitalkedia changed the title from "[SPARK-18838] Use separate executor service for each event listener" to "[SPARK-18838][WIP] Use separate executor service for each event listener" on Dec 15, 2016
@SparkQA commented Dec 15, 2016

Test build #70180 has finished for PR 16291 at commit defd536.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class StreamingQueryListenerBus(val sparkListenerBus: LiveListenerBus)

@rxin (Contributor) commented Dec 15, 2016

This was initially introduced by @kayousterhout

@vanzin (Contributor) commented Dec 15, 2016

Hmm... I took a quick look and I'm not sure I understand exactly what's going on. It seems you're wrapping each listener with a ListenerEvenProcessor (note the typo), and each processor has its own thread pool for processing events.

If that's the case, that sounds wrong. Each listener should process events serially otherwise you risk getting into funny situations like a task end event being processed before the task start event for the same task.

I think this would benefit from a proper explanation of the changes being proposed, instead of a bunch of code. How will the listener bus work with the changes? Where will we have thread pools, where will we have message queues? Will each listener get its own dedicated thread, or will there be a limit? What kind of controls will there be to avoid memory pressure? Would it be worth it to allow certain listeners to have dedicated threads while others share the same one like the current approach? That kind of thing. Can you write something up and post it to the bug?

@sitalkedia (Author)

@vanzin - Thanks for taking a look and sorry about putting my unfinished sloppy code out there. I updated the bug and the PR with the overall design idea, which hopefully will answer your questions.

Each listener should process events serially otherwise you risk getting into funny situations like a task end event being processed before the task start event for the same task.

You are right; that's why the executor service is single-threaded, which guarantees ordered processing of the events per listener.

@sitalkedia force-pushed the skedia/event_log_processing branch from defd536 to c790861 on December 17, 2016 01:15
@sitalkedia changed the title from "[SPARK-18838][WIP] Use separate executor service for each event listener" to "[SPARK-18838][CORE] Use separate executor service for each event listener" on Dec 17, 2016
@SparkQA commented Dec 17, 2016

Test build #70287 has finished for PR 16291 at commit c790861.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class StreamingQueryListenerBus(val sparkListenerBus: LiveListenerBus)

@sitalkedia force-pushed the skedia/event_log_processing branch from c790861 to d687aaf on December 17, 2016 01:19
@SparkQA commented Dec 17, 2016

Test build #70289 has finished for PR 16291 at commit d687aaf.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class StreamingQueryListenerBus(val sparkListenerBus: LiveListenerBus)

@sitalkedia force-pushed the skedia/event_log_processing branch from d687aaf to 20fbf2f on December 17, 2016 01:26
@SparkQA commented Dec 17, 2016

Test build #70290 has finished for PR 16291 at commit 20fbf2f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class StreamingQueryListenerBus(val sparkListenerBus: LiveListenerBus)

@SparkQA commented Dec 17, 2016

Test build #70298 has started for PR 16291 at commit 9b4968d.

@sitalkedia (Author)

Jenkins test this please.

@SparkQA commented Dec 17, 2016

Test build #70299 has started for PR 16291 at commit 9b4968d.

@sitalkedia force-pushed the skedia/event_log_processing branch from 9b4968d to d331f1e on December 17, 2016 17:02
@SparkQA commented Dec 17, 2016

Test build #70311 has finished for PR 16291 at commit d331f1e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class StreamingQueryListenerBus(val sparkListenerBus: LiveListenerBus)

@sitalkedia force-pushed the skedia/event_log_processing branch from d331f1e to fbcdcca on December 18, 2016 16:47
@SparkQA commented Dec 18, 2016

Test build #70322 has finished for PR 16291 at commit fbcdcca.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class StreamingQueryListenerBus(val sparkListenerBus: LiveListenerBus)

@sitalkedia force-pushed the skedia/event_log_processing branch from fbcdcca to 6763827 on December 19, 2016 02:09
@SparkQA commented Dec 19, 2016

Test build #70325 has finished for PR 16291 at commit 6763827.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class StreamingQueryListenerBus(val sparkListenerBus: LiveListenerBus)

@sitalkedia force-pushed the skedia/event_log_processing branch from 6763827 to 66e4f12 on December 19, 2016 07:49
@SparkQA commented Dec 19, 2016

Test build #70345 has started for PR 16291 at commit 66e4f12.

@sitalkedia (Author)

Jenkins retest this please.

@SparkQA commented Dec 19, 2016

Test build #70360 has finished for PR 16291 at commit 66e4f12.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class StreamingQueryListenerBus(val sparkListenerBus: LiveListenerBus)

@vanzin (Contributor) commented Dec 19, 2016

I'm not sure whether that change would be a problem, but I wonder if a system where listeners can "opt in" to this behavior wouldn't be better. e.g., EventLoggingListener can slow down the bus; having it run in a separate thread would benefit all the others. Similarly, the dynamic allocation listener could have its own dedicated thread since it needs lower latency, and doesn't really depend on any others.

@sitalkedia (Author)

@markhamstra, @vanzin thanks for your comments!

so it is entirely possible, for example, for one ListenerEventExecutor to process a task end event for a particular task before another ListenerEventExecutor has worked sufficiently through its eventQueue to have even seen the corresponding task start event. That is quite a bit different from the prior ordering guarantee implied by the comment "onPostEvent is guaranteed to be called in the same thread for all listeners."

That is true. If a listener is slow in processing events, then it might not have seen the task start event, whereas other listeners might already have processed the corresponding task end event. I am not sure why that would cause any problem, though. In theory each listener could work independently of the others, and we should only guarantee ordered processing of the events within a listener. This is the only way I can think of to scale the driver for large workloads.

but I wonder if a system where listeners can "opt in" to this behavior wouldn't be better. e.g., EventLoggingListener can slow down the bus, having it run in a separate thread would benefit all the others.

I think that would be overkill. Having a separate thread per listener simplifies the design. Also, for user-added listeners, it would be hard to decide whether to give each a shared or a separate thread. The only downside of a separate thread per listener is an increased memory footprint in the worst case. But I would argue that since the multi-threaded event processor decreases the average event processing latency, the backlog on the event queues will clear faster, which might actually decrease the driver's memory footprint in the average case.

@markhamstra (Contributor)

In theory each listener could work independently of the others, and we should only guarantee ordered processing of the events within a listener.

If we were starting from nothing, then yes, it would be valid and advisable to design the Listener infrastructure using only this weaker guarantee. The issue, though, is that we are not starting from nothing, but rather from a system that currently offers a much stronger guarantee on the synchronized behavior of Listeners. If it is the case that no Listeners currently rely on the stronger guarantee and thus could work completely correctly under the weaker guarantee of this PR, then we could make this change without much additional concern. But reaching that level of confidence in current Listeners is a difficult prerequisite -- strictly speaking, it's an impossible task.

We could carefully work through all the internal behavior of Spark's Listeners to convince ourselves that they can work correctly under the new, weaker guarantee. At a bare minimum, we need to do that much before we can consider merging this PR -- but that's probably not enough. The problem is that Listeners aren't just internal to Spark. Users have also developed their own custom Listeners that either implement SparkListenerInterface or extend SparkListener or SparkFirehoseListener, and we can't just assume that those custom Listeners don't rely upon the current guarantee to either synchronize behavior with other custom Listeners or even with Spark internal Listeners. Since we can't know that user Listeners don't already rely upon the current, stronger guarantee, the question now becomes whether we even have the freedom to change that guarantee within the lifetime of Spark 2.x, or whether any such change would have to wait for Spark 3.x.

SparkListener is still annotated as @DeveloperApi, so if that were the only piece in play, then we could change its guarantee fairly freely. SparkListenerInterface is almost as good, since it includes the admonition in a comment to "[n]ote that this is an internal interface which might change in different Spark releases." The stickier issue is with SparkFirehoseListener, which carries no such annotations or comments, but is just a plain public class and API. So, after convincing ourselves that Spark's internal Listeners would be fine with this PR, we'd still have to convince the Spark PMC that changing the public SparkFirehoseListener (with prominent warnings in the release notes, of course) before Spark 3.x would be acceptable.

And all of the above is still really only arguing about whether we could adopt this PR in essentially its present form. There are still questions of whether we should do this or maybe instead we should do something a little different or more. I can see some merit in Marcelo's "opt in" suggestion. If there is utility in having groups of Listeners that can rely upon synchronized behavior, then we should probably retain one or more threads running synchronized Listeners. For example, if Listener A relies upon synchronization with Listeners B and C while D needs to synchronize with E, but F, G and H are all independent, then there are a couple of things we could do. First, the independent Listeners (F, G and H) can each run in its own thread, providing the scalable performance that this PR is aiming for. After that, we could either have one synchronized Listener thread for all the other Listeners, or we could have one thread for A, B and C and one thread for D and E. Whether we support only one synchronized Listener group/thread or multiple, we'd still need some mechanism for Listeners to select into a synchronized group or to indicate that they can and should be run independently on their own thread.


@vanzin (Contributor) commented Dec 22, 2016

I think that would be overkill. Having a separate thread per listener simplifies the design.

I don't think it's overkill, and it wouldn't really be that much of a change from your existing design. It also serves as a way of controlling how much extra memory you'll need for these things, since right now each listener would incur its own thread + message queue.

e.g. you could just have a tagging interface (if a listener implements "X" then it gets its own dedicated dispatch thread; or, if you want to get fancy, you could also define a "group" so that listeners in the same group share a dispatch thread).
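A rough sketch of that tagging idea (the trait and names below are hypothetical, not anything in Spark): listeners that opt in declare a dispatch group, while untagged listeners fall back to a shared default group and keep today's single-thread semantics.

```scala
// Hypothetical marker trait; names are illustrative only.
trait DedicatedDispatch {
  // Listeners returning the same group name share one dispatch thread,
  // preserving the existing synchronized ordering within that group.
  // Returning a unique name (the default) yields a private thread.
  def dispatchGroup: String = getClass.getName
}

object DispatchGroups {
  val DefaultGroup = "default-event-listener"

  // Untagged listeners share the default group, so existing listeners
  // keep the current single-threaded, mutually ordered behavior.
  def groupOf(listener: AnyRef): String = listener match {
    case d: DedicatedDispatch => d.dispatchGroup
    case _                    => DefaultGroup
  }
}
```

The bus would then keep one dispatcher (thread + queue) per distinct group name rather than one per listener, bounding the number of extra threads.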

@mridulm (Contributor) commented Dec 23, 2016

I agree with @markhamstra and @vanzin - having the ability to tag listeners into groups (default = spark listener group) and preserving the current synchronized behavior within a group would ensure backward compatibility at fairly minimal additional complexity.

@SparkQA commented Dec 28, 2016

Test build #70649 has finished for PR 16291 at commit 162001a.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sitalkedia force-pushed the skedia/event_log_processing branch from 162001a to f2c777d on December 28, 2016 01:42
@sitalkedia (Author)

jenkins test this please.

@SparkQA commented Dec 28, 2016

Test build #70650 has finished for PR 16291 at commit f2c777d.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sitalkedia force-pushed the skedia/event_log_processing branch from f2c777d to 9e5819d on December 28, 2016 02:00
@SparkQA commented Dec 28, 2016

Test build #70651 has finished for PR 16291 at commit 9e5819d.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sitalkedia force-pushed the skedia/event_log_processing branch from 9e5819d to 6494cfe on December 28, 2016 04:06
@SparkQA commented Dec 28, 2016

Test build #70652 has finished for PR 16291 at commit 6494cfe.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sitalkedia force-pushed the skedia/event_log_processing branch from 6494cfe to e09d66e on December 28, 2016 06:41
@SparkQA commented Dec 28, 2016

Test build #70656 has started for PR 16291 at commit e09d66e.

@sitalkedia (Author)

Jenkins retest this please.

@SparkQA commented Dec 28, 2016

Test build #70660 has finished for PR 16291 at commit e09d66e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sitalkedia (Author)

Thanks, guys, for your input. I have made changes to support grouping of listeners, to preserve the existing synchronization behavior. Currently, the ExecutorAllocationManager, HeartbeatReceiver and EventLogger use their own separate threads. All user-developed listeners use a separate thread, and all others belong to a default group. Let me know what you think of the approach.

@rxin, @vanzin, @markhamstra, @mridulm.

@vanzin (Contributor) left a comment

I did a first pass to check the logic, and as I explained in the comments, my main concern is that I think this change should only apply to LiveListenerBus. This asynchronous behavior is actually not desired in ReplayListenerBus.

import org.apache.spark.internal.Logging
import org.apache.spark.scheduler.LiveListenerBus

private class ListenerEventExecutor[L <: AnyRef] (listenerName: String, queueCapacity: Int)

Not a fan of the name, since you don't execute events. I prefer ListenerEventDispatcher.

Also, listenerName is more of a group name now, right? You mention "listener" in many places of this class; those should be replaced with "listener group".

Finally, I'm not sure why you need the type parameter.

nit: no space before (


private class ListenerEventExecutor[L <: AnyRef] (listenerName: String, queueCapacity: Int)
extends Logging {
private val threadFactory = new ThreadFactoryBuilder().setDaemon(true)

ThreadUtils.namedThreadFactory.

* guarantee that we do not process any event before starting the event executor.
*/
private val isStarted = new AtomicBoolean(false)
private val lock = new ReentrantLock()

Minor, but you could achieve the same thing with synchronized/wait/notifyAll, with less code.
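For reference, a start gate along those lines might look like this (a sketch with illustrative names, not the PR's code):

```scala
// Plain monitor methods replace the AtomicBoolean + ReentrantLock pair.
object StartGate {
  private var started = false

  def markStarted(): Unit = synchronized {
    started = true
    notifyAll() // wake any dispatch threads parked in awaitStarted()
  }

  def awaitStarted(): Unit = synchronized {
    while (!started) wait() // loop guards against spurious wakeups
  }
}
```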


def stop(): Unit = {
executorService.shutdown()
executorService.awaitTermination(10, TimeUnit.SECONDS)

This changes the current behavior slightly. Right now the bus will wait until the listener thread stops:

listenerThread.join()

Here you might return before things are really shut down. It's unlikely to happen, but if it does happen, it can lead to weird issues. Might be better to try shutdownNow() if this call fails, which at least makes a best effort to interrupt existing tasks that might be slow.

} catch {
case e: RejectedExecutionException =>
droppedEventsCounter.incrementAndGet()
if (System.currentTimeMillis() - lastReportTimestamp >= 60 * 1000) {

When measuring elapsed time it's generally better to use System.nanoTime since it's monotonic.
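The rate-limited drop logging above could use the monotonic clock like this (a sketch following the diff's variable names, and assuming `lastReportTimestamp` is switched to store nanoseconds):

```scala
case e: RejectedExecutionException =>
  droppedEventsCounter.incrementAndGet()
  val now = System.nanoTime() // monotonic: immune to wall-clock adjustments
  if (now - lastReportTimestamp >= TimeUnit.MINUTES.toNanos(1)) {
    // report the dropped-event count, then reset the window
    lastReportTimestamp = now
  }
```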

@@ -35,7 +35,7 @@ import org.apache.spark.util.ListenerBus
* and StreamingQueryManager. So this bus will dispatch events to registered listeners for only
* those queries that were started in the associated SparkSession.
*/
class StreamingQueryListenerBus(sparkListenerBus: LiveListenerBus)
class StreamingQueryListenerBus(val sparkListenerBus: LiveListenerBus)

Do you need to make this a val?

@@ -69,7 +69,7 @@ class StreamingQueryListenerBus(sparkListenerBus: LiveListenerBus)
activeQueryRunIds.synchronized { activeQueryRunIds += s.runId }
sparkListenerBus.post(s)
// post to local listeners to trigger callbacks
postToAll(s)
postToAllSync(s)
@vanzin Jan 5, 2017

Do you need this change? Seems like it wasn't a synchronous call before. If you need it, then postToAllSync isn't really for testing only, and you need to change the comment in that method.

Same below.

@@ -661,7 +661,7 @@ class StreamingContext private[streaming] (
var shutdownHookRefToRemove: AnyRef = null
if (LiveListenerBus.withinListenerThread.value) {
throw new SparkException(
s"Cannot stop StreamingContext within listener thread of ${LiveListenerBus.name}")
"Cannot stop SparkContext within listener event executor thread")

StreamingContext was actually correct.

@@ -26,7 +26,7 @@ import org.apache.spark.util.ListenerBus
* registers itself with Spark listener bus, so that it can receive WrappedStreamingListenerEvents,
* unwrap them as StreamingListenerEvent and dispatch them to StreamingListeners.
*/
private[streaming] class StreamingListenerBus(sparkListenerBus: LiveListenerBus)
private[streaming] class StreamingListenerBus(val sparkListenerBus: LiveListenerBus)

Similarly, is val necessary?

}

private[spark] object ListenerEventExecutor {
val DefaultEventListenerGroup = "default-event-listener"

Spark's style is to use ALL_CAPS for constants.
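i.e., something along the lines of (keeping the same value, renaming per that convention):

```scala
private[spark] object ListenerEventExecutor {
  val DEFAULT_EVENT_LISTENER_GROUP = "default-event-listener"
}
```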

@tgravescs (Contributor)

@sitalkedia are you still working on this?

@sitalkedia (Author)

@tgravescs - I have not had time to work on this. Hopefully I will get to it sometime next week.

@CodingCat (Contributor)

what's the current status of this PR?

@HyukjinKwon (Member)

what's the current status of this PR?
