[SPARK-4194] [core] Make SparkContext initialization exception-safe. #5335
Conversation
This fixes the thread leak. I also changed the unit test to keep track of allocated contexts and make sure they're closed after tests are run; this is needed since some tests use this pattern: `val sc = createContext(); doSomethingThatMayThrow(); sc.stop()`
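For illustration, a minimal sketch of why that pattern leaks and how tracking allocated contexts covers it; `ContextTracking`, `createContext`, and `stopLeakedContexts` are hypothetical names, not the actual suite code:

```scala
import scala.collection.mutable

import org.apache.spark.{SparkConf, SparkContext}

object ContextTracking {
  // Every context a test creates is remembered here...
  private val allocated = mutable.Set[SparkContext]()

  def createContext(name: String): SparkContext = {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName(name))
    allocated += sc
    sc
  }

  // ...so a suite-level afterEach/afterAll hook can stop whatever a failed
  // test left behind. If doSomethingThatMayThrow() throws, the test never
  // reaches sc.stop(), but this hook still closes the context.
  def stopLeakedContexts(): Unit = {
    allocated.foreach(_.stop())
    allocated.clear()
  }
}
```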
SparkContext has a very long constructor, where multiple things are initialized, multiple threads are spawned, and multiple opportunities for exceptions to be thrown exist. If one of these happens at an inopportune time, lots of garbage tends to stick around.

This patch re-organizes SparkContext so that its internal state is initialized in a big "try" block. The fields keeping state are now completely private to SparkContext, and are "vars", because Scala doesn't allow you to initialize a val later. The existing API is kept by turning vals into defs (which works because Scala guarantees the same binary interface for those).

On top of that, a few things in other areas were changed to avoid more things leaking:

- Executor was changed to explicitly wait for the heartbeat thread to stop. LocalBackend was changed to wait for the "StopExecutor" message to be received, since otherwise there could be a race between that message arriving and the actor system being shut down.
- ConnectionManager could possibly hang during shutdown, because an interrupt at the wrong moment could cause the selector thread to still call select and then wait forever. So the selector is also woken up so that this situation is avoided.
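As a rough illustration of the constructor pattern described above (private vars, def accessors, one big try block), here is a minimal hypothetical class; it is not the actual SparkContext code and all names are made up:

```scala
// Hypothetical example of the pattern; not the actual SparkContext code.
class Service(settings: Map[String, String]) {

  // State has to be vars: Scala offers no way to assign a val after its declaration.
  private var _worker: Thread = _
  private var _started: Boolean = false

  // Former public vals become defs; a val compiles to a getter method, so the
  // binary interface seen by callers stays the same.
  def worker: Thread = _worker
  def started: Boolean = _started

  try {
    _worker = new Thread(new Runnable {
      override def run(): Unit = Thread.sleep(Long.MaxValue)
    }, "service-worker")
    _worker.setDaemon(true)
    _worker.start()

    if (!settings.contains("required.setting")) {
      throw new IllegalArgumentException("missing required.setting")
    }
    _started = true
  } catch {
    case e: Throwable =>
      // Clean up whatever was already initialized so a failed constructor
      // does not leak threads or other resources.
      stop()
      throw e
  }

  def stop(): Unit = {
    if (_worker != null) {
      _worker.interrupt()
      _worker = null
    }
    _started = false
  }
}
```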
Note the PR contains the commits from #5311. I hope that once that is pushed, GitHub will figure things out. (If not, I'll rebase manually.) So if you want to skip those changes, just look at the last commit in the list. With both of these PRs, I was able to run the core/ unit tests and verify that:
I tried to run MiMA checks locally and they look OK, but let's see what Jenkins says.
Test build #29621 has finished for PR 5335 at commit
Conflicts:
    core/src/main/scala/org/apache/spark/SparkContext.scala
    core/src/test/scala/org/apache/spark/ExecutorAllocationManagerSuite.scala
Oops, borked merge, fixing...
Test build #29684 has finished for PR 5335 at commit
Jenkins, retest this please.
Test build #29685 has finished for PR 5335 at commit
Conflicts:
    core/src/main/scala/org/apache/spark/SparkContext.scala
Test build #29686 has finished for PR 5335 at commit
Test build #29692 has finished for PR 5335 at commit
Conflicts:
    core/src/main/scala/org/apache/spark/SparkContext.scala
    core/src/main/scala/org/apache/spark/executor/Executor.scala
    core/src/main/scala/org/apache/spark/scheduler/local/LocalBackend.scala
Test build #29764 has finished for PR 5335 at commit
Ping?
I tend to trust your hand on this. This is a big change and it's hard to match up the existing logic with the new logic in the diff, though it looks like that was the intent and effect from spot-checking some elements. That the tests have passed several times is a good sign. The reorganization is significantly positive since the fields, and their initialization, are clearly grouped now.

One minor style thing: why the

I think the additional changes look reasonable, like using a Java

I favor this, though weakly, on the grounds that I'm mostly relying on tests for correctness. The intent is sound.

@rxin @pwendell do you have any thoughts on this one?
That's actually used in lots of places in Spark. It's used when some variable/field name conflicts with a
OK right, and that's true of all of them here.
ping @JoshRosen - I think he's proposed this exact change to me in the past.
See the description of PR #3121 for my previous discussion of this. If we want to avoid introducing
That approach would be a lot more complicated. The first reason why it would be complicated is that you'd need an uber-constructor in SparkContext that takes all the initialized internal values. Unless there's some fancy Scala feature I'm not aware of, that in itself is scary as hell, and would mean the other constructors would be similarly ugly in that they'd have to call the companion object. It would also cause (even more) duplication of the declaration of these things, since they'd have to be declared in the companion object's method too. Finally, it would complicate

So while I would love to simplify the code in SparkContext, the alternative suggestion, as far as I can see, does nothing towards that. And that's why I chose private vars. It's not optimal, and I really wish Scala would allow me to initialize a val after its declaration, like Java does. But it's the easiest approach, and it doesn't expose any mutable SparkContext state that wasn't already exposed before.
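For contrast, a minimal sketch of the rough shape the companion-object alternative would take (hypothetical code, not from either PR): a factory method does the fallible work and hands everything to a constructor that only assigns vals, i.e. the "uber-constructor" objected to above:

```scala
// Hypothetical sketch of the companion-object alternative; not from either PR.
class Service private (
    val requiredSetting: String,
    val worker: Thread
    /* ...in SparkContext this list would grow to dozens of parameters... */)

object Service {
  def apply(settings: Map[String, String]): Service = {
    val requiredSetting = settings.getOrElse("required.setting",
      throw new IllegalArgumentException("missing required.setting"))
    val worker = new Thread(new Runnable { override def run(): Unit = () }, "service-worker")
    worker.start()
    // If anything throws between worker.start() and this call, cleanup still
    // has to be written by hand here; the factory alone does not remove that problem.
    new Service(requiredSetting, worker)
  }
}
```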
I also feel that the current approach makes more sense than Josh's alternative. Changes to SparkContext get a lot of scrutiny during code review, so clear documentation, IMO, is sufficient to ensure this is followed correctly (famous last words). I didn't have time to dive into this to make sure there are no correctness issues, but the broad approach looks good to me... I think it's worth fixing this up.
Seems like there's support for this change, though it needs a rebase, @vanzin. Any objections to proceeding after that?
Conflicts:
    core/src/main/scala/org/apache/spark/executor/Executor.scala
    core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala
Test build #30260 has finished for PR 5335 at commit
Test build #30293 has finished for PR 5335 at commit
SparkContext has a very long constructor, where multiple things are
initialized, multiple threads are spawned, and multiple opportunities
for exceptions to be thrown exist. If one of these happens at an
inopportune time, lots of garbage tends to stick around.

This patch re-organizes SparkContext so that its internal state is
initialized in a big "try" block. The fields keeping state are now
completely private to SparkContext, and are "vars", because Scala
doesn't allow you to initialize a val later. The existing API
is kept by turning vals into defs (which works because Scala guarantees
the same binary interface for those).

On top of that, a few things in other areas were changed to avoid more
things leaking:

- Executor was changed to explicitly wait for the heartbeat thread to
  stop. LocalBackend was changed to wait for the "StopExecutor"
  message to be received, since otherwise there could be a race
  between that message arriving and the actor system being shut down.

- ConnectionManager could possibly hang during shutdown, because an
  interrupt at the wrong moment could cause the selector thread to
  still call select and then wait forever. So also wake up the
  selector so that this situation is avoided.
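A minimal sketch (illustrative only, not the actual Executor or ConnectionManager code) of the two shutdown ideas above: explicitly join the background thread so stop() does not return while it is still running, and wake the selector so a thread blocked in select() notices the stop flag instead of waiting forever:

```scala
import java.nio.channels.Selector

// Illustrative selector loop; not the actual ConnectionManager code.
class SelectorLoop {
  @volatile private var stopped = false
  private val selector = Selector.open()

  private val thread = new Thread(new Runnable {
    override def run(): Unit = {
      while (!stopped) {
        selector.select()   // blocks until a channel is ready or wakeup() is called
        // ... handle selected keys ...
      }
    }
  }, "selector-loop")
  thread.start()

  def stop(): Unit = {
    stopped = true
    selector.wakeup()   // unblocks a thread stuck in select() during shutdown
    thread.join()       // explicitly wait for the loop thread to exit
    selector.close()
  }
}
```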