[SPARK-19965][SS] DataFrame batch reader may fail to infer partitions when reading FileStreamSink's output #17346
Conversation
Test build #74819 has finished for PR 17346 at commit
Test build #74820 has finished for PR 17346 at commit
@zsxwing would you take a look at this? Thanks!
Test build #75336 has finished for PR 17346 at commit
Test build #75565 has finished for PR 17346 at commit
Rebased to master to resolve conflicts
Test build #76081 has finished for PR 17346 at commit
Test build #76293 has started for PR 17346 at commit
Test build #76301 has finished for PR 17346 at commit
Jenkins retest this please
@zsxwing would you take a look at your convenience? Thanks!
Test build #76348 has finished for PR 17346 at commit
Sorry for the delay. Looks pretty good. Just some nits.
 * - ancestorIsMetadataDirectory(/a/b/c) => false
 */
def ancestorIsMetadataDirectory(path: Path): Boolean = {
  require(path.isAbsolute, s"$path is required to be absolute")
I'm wondering if we can call `makeQualified` instead.
switched to makeQualified
require(path.isAbsolute, s"$path is required to be absolute")
var currentPath = path
var finished = false
while (!finished) {
How about changing it to `currentPath != null`? Then you don't need `finished`.
fixed. good point!
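For illustration, the loop shape the reviewer suggests (stop when the parent runs out, rather than tracking a `finished` flag) can be sketched in Python. This is a simplification: the real helper operates on `org.apache.hadoop.fs.Path`, while plain POSIX strings are used here.

```python
import posixpath

# Name of the metadata dir written by FileStreamSink (from the PR).
METADATA_DIR = "_spark_metadata"

def ancestor_is_metadata_directory(path: str) -> bool:
    """Return True if `path` or any of its ancestors is named `_spark_metadata`.

    Mirrors the suggested loop: walk up until the root is reached,
    with no separate `finished` flag.
    """
    if not path.startswith("/"):
        raise ValueError(f"{path} is required to be absolute")
    current = posixpath.normpath(path)
    while current and current != "/":
        if posixpath.basename(current) == METADATA_DIR:
            return True
        current = posixpath.dirname(current)
    return False
```

With this shape the loop condition itself expresses termination, which is what the review comment was after.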
val inputData = MemoryStream[Int]
val ds = inputData.toDS()

val outputDir = Utils.createTempDir(namePrefix = "stream.output").getCanonicalPath
nit: use `withTempDir` to create the temp dir instead
done
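The `withTempDir` test helper the reviewer points to creates a temporary directory for the test body and cleans it up afterwards even on failure. A rough Python analogue of that pattern (the name `with_temp_dir` is illustrative, not Spark's API):

```python
import shutil
import tempfile
from contextlib import contextmanager

@contextmanager
def with_temp_dir(prefix: str = "stream.output"):
    # Create a temp dir, hand it to the caller, and always delete it,
    # even if the body raises -- the point of the reviewer's nit.
    path = tempfile.mkdtemp(prefix=prefix)
    try:
        yield path
    finally:
        shutil.rmtree(path, ignore_errors=True)
```

Compared to a bare `createTempDir`, the scoped helper guarantees cleanup and keeps the test body free of teardown code.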
val ds = inputData.toDS()

val outputDir = Utils.createTempDir(namePrefix = "stream.output").getCanonicalPath
val checkpointDir = Utils.createTempDir(namePrefix = "stream.checkpoint").getCanonicalPath
nit: same as above
// or "/.../_spark_metadata/0" (a file in the metadata dir). `rootPathsSpecified` might contain
// such streaming metadata dir or files, e.g. when after globbing "basePath/*" where "basePath"
// is the output of a streaming query.
override val rootPaths = rootPathsSpecified.filterNot(FileStreamSink.ancestorIsMetadataDirectory)
Just to confirm one thing: for files in `rootPaths` or their sub dirs, they will be dropped by `InMemoryFileIndex.shouldFilterOut`. Right?
Yea that's quite correct! They will be filtered by `InMemoryFileIndex.shouldFilterOut`.
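To summarize the two layers this exchange describes: root paths under `_spark_metadata` are dropped by the `filterNot` above, while hidden entries met during listing are dropped by `InMemoryFileIndex.shouldFilterOut` (which, roughly, skips names starting with `_` or `.`). A simplified Python sketch of the combined effect, with both checks reimplemented inline for illustration:

```python
import posixpath

METADATA_DIR = "_spark_metadata"

def ancestor_is_metadata_directory(path: str) -> bool:
    # Simplified: True if any path component is `_spark_metadata`.
    return METADATA_DIR in posixpath.normpath(path).split("/")

def should_filter_out(name: str) -> bool:
    # Rough approximation of InMemoryFileIndex.shouldFilterOut:
    # hidden files/dirs (leading "_" or ".") are skipped during listing.
    return name.startswith("_") or name.startswith(".")

# Globbing "basePath/*" over a streaming sink's output might return:
globbed = [
    "/out/part=0",
    "/out/part=1",
    "/out/_spark_metadata",  # dropped by the root-path ancestor check
]
root_paths = [p for p in globbed if not ancestor_is_metadata_directory(p)]
```

With the metadata dir gone from `root_paths`, partition inference only ever sees the `part=...` directories, which is exactly what the failing test needed.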
Test build #76408 has finished for PR 17346 at commit
Comments have been addressed -- @zsxwing it'd be great if you could take another look
LGTM. Thanks! Merging to master and 2.2. |
… when reading FileStreamSink's output

## The Problem

Right now DataFrame batch reader may fail to infer partitions when reading FileStreamSink's output:

```
[info] - partitioned writing and batch reading with 'basePath' *** FAILED *** (3 seconds, 928 milliseconds)
[info]   java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:
[info]   ***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637
[info]   ***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637/_spark_metadata
[info]
[info] If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them.
[info]   at scala.Predef$.assert(Predef.scala:170)
[info]   at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:133)
[info]   at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:98)
[info]   at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:156)
[info]   at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:54)
[info]   at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:55)
[info]   at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:133)
[info]   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)
[info]   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:160)
[info]   at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:536)
[info]   at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:520)
[info]   at org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply$mcV$sp(FileStreamSinkSuite.scala:292)
[info]   at org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268)
[info]   at org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268)
```

## What changes were proposed in this pull request?

This patch alters `InMemoryFileIndex` to filter out those `basePath`s whose ancestor is the streaming metadata dir (`_spark_metadata`). E.g., the following and other similar dirs or files will be filtered out:

- (introduced by globbing `basePath/*`)
  - `basePath/_spark_metadata`
- (introduced by globbing `basePath/*/*`)
  - `basePath/_spark_metadata/0`
  - `basePath/_spark_metadata/1`
  - ...

## How was this patch tested?

Added unit tests

Author: Liwei Lin <lwlin7@gmail.com>

Closes #17346 from lw-lin/filter-metadata.

(cherry picked from commit 6b9e49d)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
thank you @zsxwing