[SPARK-22789] Map-only continuous processing execution #19984
Conversation
Test build #84937 has finished for PR 19984 at commit
.internal()
.doc("The interval at which continuous execution readers will poll to check whether" +
  " the epoch has advanced on the driver.")
.intConf

timeConf?
.internal()
.doc("The size (measured in number of rows) of the queue used in continuous execution to" +
  " buffer the results of a ContinuousDataReader.")
.intConf

longConf?
Should it be? I can't imagine anything close to MAX_INT being a reasonable value here. Will it be hard to migrate to a long if we later discover it's needed?
* import scala.concurrent.duration._
* df.writeStream.trigger(Trigger.Continuous(10.seconds))
* }}}
* @since 2.2.0

2.3.0?
* {{{
* df.writeStream.trigger(Trigger.Continuous("10 seconds"))
* }}}
* @since 2.2.0

2.3.0?
*
* {{{
* import java.util.concurrent.TimeUnit
* df.writeStream.trigger(ProcessingTime.create(10, TimeUnit.SECONDS))

Trigger.Continuous(10, TimeUnit.SECONDS) instead of ProcessingTime.create(10, TimeUnit.SECONDS)?
Test build #84973 has finished for PR 19984 at commit
Test build #84975 has finished for PR 19984 at commit
Test build #84976 has finished for PR 19984 at commit
Test build #84981 has finished for PR 19984 at commit
Test build #85079 has finished for PR 19984 at commit
Test build #85089 has finished for PR 19984 at commit
Test build #85092 has finished for PR 19984 at commit
Half-finished the review; just posted the comments so far, as I'm leaving now.
/**
 * Atomically increment the current epoch and get the new value.
 */
private[sql] case class IncrementAndGetEpoch() extends EpochCoordinatorMessage

nit: case class -> case object
/**
 * Get the current epoch.
 */
private[sql] case class GetCurrentEpoch() extends EpochCoordinatorMessage

nit: case class -> case object
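To illustrate the nit: a parameterless message is better modeled as a `case object`, which is a singleton, than as a zero-field `case class`, which allocates a fresh (though structurally equal) instance on every `apply()`. A minimal sketch with hypothetical message names standing in for the PR's `EpochCoordinatorMessage` hierarchy:

```scala
// Hypothetical message hierarchy mirroring the pattern under review.
sealed trait CoordinatorMessage

// A zero-field case class allocates a new instance per apply() call,
// though all instances compare equal structurally.
case class GetEpochClass() extends CoordinatorMessage

// A case object is a singleton: one allocation, and reference equality
// (`eq`) holds between any two references to it.
case object GetEpochObject extends CoordinatorMessage

object CaseObjectDemo {
  def main(args: Array[String]): Unit = {
    val a = GetEpochClass()
    val b = GetEpochClass()
    println(a == b)                           // structural equality holds
    println(a eq b)                           // but these are distinct allocations
    println(GetEpochObject eq GetEpochObject) // the object is a single instance
  }
}
```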
@@ -303,7 +299,7 @@ abstract class StreamExecution(
  e,
  committedOffsets.toOffsetSeq(sources, offsetSeqMetadata).toString,
  availableOffsets.toOffsetSeq(sources, offsetSeqMetadata).toString)
logError(s"Query $prettyIdString terminated with error", e)
// logError(s"Query $prettyIdString terminated with error", e)

nit: restore logError
.map(f => pathToBatchId(f.getPath))

for (batchId <- batchIds if batchId > thresholdBatchId) {
  print(s"AAAAA purging\n")

nit: remove this
* @param sparkSessionForQuery Isolated [[SparkSession]] to run the continuous query with.
*/
private def runContinuous(sparkSessionForQuery: SparkSession): Unit = {
  import scala.collection.JavaConverters._

nit: move this to the beginning of this file
case t: Throwable =>
  failedFlag.set(true)
  failureReason = t

ditto
throw new SparkException("epoch poll failed", epochPollRunnable.failureReason)
}

queue.take() match {

What if dataReaderFailed is set while a thread is blocking here? Seems the task will block forever.
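One standard way to address the concern above, sketched here with plain JDK types rather than the PR's actual classes: replace the bare `queue.take()` with a `poll`-with-timeout loop that re-checks the failure flag on every timeout, so a recorded failure is surfaced instead of the consumer blocking forever. The names (`nextOrFail`, the `String` element type) are illustrative only.

```scala
import java.util.concurrent.{ArrayBlockingQueue, TimeUnit}
import java.util.concurrent.atomic.AtomicBoolean

object PollLoopDemo {
  // Hypothetical stand-ins for the PR's row queue and data-reader failure flag.
  val queue = new ArrayBlockingQueue[String](16)
  val dataReaderFailed = new AtomicBoolean(false)

  /** Block for the next element, but wake up periodically to check the flag.
    * Returns None when the producer has failed, instead of blocking forever. */
  def nextOrFail(): Option[String] = {
    while (true) {
      if (dataReaderFailed.get()) {
        return None // surface the failure to the caller
      }
      // Unlike take(), poll() returns null after the timeout, letting us
      // re-check the flag on each iteration.
      val row = queue.poll(100, TimeUnit.MILLISECONDS)
      if (row != null) return Some(row)
    }
    None // unreachable; satisfies the result type
  }

  def main(args: Array[String]): Unit = {
    queue.put("row-0")
    println(nextOrFail()) // the queued element is returned promptly
    dataReaderFailed.set(true)
    println(nextOrFail()) // the flag is observed within one poll interval
  }
}
```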
reader: DataReader[UnsafeRow],
queue: BlockingQueue[(UnsafeRow, PartitionOffset)],
context: TaskContext,
failedFlag: AtomicBoolean) extends Thread {

Set a proper thread name for this thread.
case t: Throwable =>
  failedFlag.set(true)
  failureReason = t
  throw t

We cannot throw t here, as it will kill the executor. See org.apache.spark.util.SparkUncaughtExceptionHandler.
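The fix the comment asks for can be sketched as follows, JDK-only and with a hypothetical runnable standing in for the PR's reader thread: record the failure and let the thread exit normally, so the task thread can observe the flag, rather than letting the Throwable escape to the JVM's uncaught-exception handler (which, in Spark executors, exits the process).

```scala
import java.util.concurrent.atomic.{AtomicBoolean, AtomicReference}

// Hypothetical stand-in for the PR's background reader thread.
class RecordingRunnable(work: () => Unit,
                        failedFlag: AtomicBoolean,
                        failureReason: AtomicReference[Throwable]) extends Runnable {
  override def run(): Unit = {
    try {
      work()
    } catch {
      case t: Throwable =>
        // Record the failure for the task thread to pick up; do NOT rethrow,
        // since an escaping Throwable would reach the uncaught-exception handler.
        failureReason.set(t)
        failedFlag.set(true)
    }
  }
}

object RecordingDemo {
  def main(args: Array[String]): Unit = {
    val failed = new AtomicBoolean(false)
    val reason = new AtomicReference[Throwable]()
    val t = new Thread(new RecordingRunnable(
      () => throw new RuntimeException("boom"), failed, reason))
    t.start()
    t.join() // join() guarantees the flag writes are visible to this thread
    println(failed.get())
    println(reason.get().getMessage)
  }
}
```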
case t: Throwable =>
  failedFlag.set(true)
  failureReason = t
  throw t

ditto
c00e104 to 825d437
Test build #85216 has finished for PR 19984 at commit
Test build #85218 has finished for PR 19984 at commit
The result says it fails Spark unit tests, but clicking through shows a count of 0.
retest this please
Test build #85230 has finished for PR 19984 at commit
Overall looks good. Left some minor comments.
logInfo(s"Writer for partition ${context.partitionId()} is committing.")
val msg = dataWriter.commit()
logInfo(s"Writer for partition ${context.partitionId()} committed.")
EpochCoordinatorRef.get(runId, SparkEnv.get).send(

nit: EpochCoordinatorRef.get is not cheap. You can store it outside the loop.
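The hoisting the comment suggests is the standard pattern below, shown with a hypothetical `lookup()` standing in for the expensive `EpochCoordinatorRef.get`; a counter makes the saving visible.

```scala
import java.util.concurrent.atomic.AtomicInteger

object HoistDemo {
  val lookups = new AtomicInteger(0)

  // Hypothetical expensive lookup standing in for EpochCoordinatorRef.get;
  // returns an "endpoint" that here just discards its message.
  def lookup(): String => Unit = {
    lookups.incrementAndGet()
    msg => ()
  }

  /** Resolve the endpoint inside the loop: one lookup per iteration. */
  def insideLoop(n: Int): Int = {
    lookups.set(0)
    for (_ <- 0 until n) lookup()("commit")
    lookups.get()
  }

  /** Resolve the endpoint once, before the loop: a single lookup total. */
  def hoisted(n: Int): Int = {
    lookups.set(0)
    val endpoint = lookup()
    for (_ <- 0 until n) endpoint("commit")
    lookups.get()
  }

  def main(args: Array[String]): Unit = {
    println(insideLoop(5)) // n lookups
    println(hoisted(5))    // 1 lookup
  }
}
```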
@@ -19,6 +19,7 @@
import java.util.concurrent.TimeUnit;

import org.apache.spark.sql.execution.streaming.continuous.ContinuousTrigger;

nit: move this below import org.apache.spark.annotation.InterfaceStability;
assert(queryExecutionThread eq Thread.currentThread,
  "logicalPlan must be initialized in StreamExecutionThread " +
  s"but the current thread was ${Thread.currentThread}")
var nextSourceId = 0L

nit: unused
EpochCoordinatorRef.create(
  writer.get(), reader, currentBatchId,
  id.toString, runId.toString, sparkSession, SparkEnv.get)
val epochUpdateThread = new Thread(new Runnable {

nit: new Thread(new Runnable { -> new Thread(s"epoch update thread for $prettyIdString", new Runnable {
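Giving a housekeeping thread a descriptive name, as suggested above, makes thread dumps and logs self-describing. A JDK-only sketch (note that the standard JDK constructor order is `Thread(Runnable, name)`; the demo thread name is illustrative):

```scala
object NamedThreadDemo {
  /** Run `body` on a freshly created, named thread and return the name
    * that thread observed for itself. */
  def runNamed(name: String)(body: => Unit): String = {
    var observed = ""
    val t = new Thread(new Runnable {
      override def run(): Unit = {
        observed = Thread.currentThread().getName
        body
      }
    }, name) // JDK constructor order: (Runnable, name)
    t.start()
    t.join() // join() makes the write to `observed` visible here
    observed
  }

  def main(args: Array[String]): Unit = {
    // A name like this shows up verbatim in jstack output and log lines.
    println(runNamed("epoch-update-thread-demo") { () })
  }
}
```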
if (reader.needsReconfiguration()) {
  stopSources()
  state.set(RECONFIGURING)

nit: move this above stopSources(). Otherwise, the stream thread may see the exception before setting state to RECONFIGURING.
I think that order is fine too, but I don't think stopSources can cause an exception in the query execution thread. queryExecutionThread.interrupt() is the line which passes control flow; until that runs, the query execution thread should be sitting there waiting for the long-running Spark task.
Stopping a source may cause an exception, e.g., it closes a socket while queryExecutionThread is reading from it.
if (thisEpochCommits.size == numWriterPartitions &&
    nextEpochOffsets.size == numReaderPartitions) {
  logDebug(s"Epoch $epoch has received commits from all partitions. Committing globally.")
  val query = session.streams.get(queryId).asInstanceOf[StreamingQueryWrapper]

Why not pass the query into EpochCoordinator's constructor? Getting a query from StreamingQueryManager may have a race condition, because the query can fail before we process CommitPartitionEpoch messages. If so, session.streams.get(queryId) will return null.
@jose-torres forgot to update this? Also, you don't need queryId any more.
assert(continuousSources.length == 1, "only one continuous source supported currently")
assert(offsetLog.get(epoch).isDefined, s"offset for epoch $epoch not reported before commit")
synchronized {
  commitLog.add(epoch)

Since this is called in the RPC thread, we should also check whether this query is still alive here.
AwaitEpoch(0),
Execute(waitForRateSourceTriggers(_, 2)),
IncrementEpoch(),
CheckAnswer(scala.Range(0, 10): _*),

Let's not do an exact-match check here.
Test build #85287 has finished for PR 19984 at commit
One more comment. Otherwise LGTM.
Test build #85332 has finished for PR 19984 at commit
Thanks! Merging to master!
@@ -75,6 +76,52 @@ case class StreamingExecutionRelation(
)
}

// We have to pack in the V1 data source as a shim, for the case when a source implements
// continuous processing (which is always V2) but only has V1 microbatch support. We don't
// know at read time whether the query is continuous or not, so we need to be able to

Can't we know it from the specified trigger?

The trigger isn't specified at the point where the dataframe is created.
What changes were proposed in this pull request?
Basic continuous execution, supporting map/flatMap/filter, with commits and advancement through RPC.
How was this patch tested?
New unit-ish tests, exercising execution end to end.