[SPARK-29572][SQL] add v1 read fallback API in DS v2 #26231
Conversation
@@ -97,17 +97,14 @@ trait DataSourceScanExec extends LeafExecNode {

/** Physical plan node for scanning data from a relation. */
case class RowDataSourceScanExec(
    fullOutput: Seq[Attribute],
    requiredColumnsIndex: Seq[Int],
    output: Seq[Attribute],
Since I need to use `RowDataSourceScanExec` in the new read fallback code path, I simplified it a little bit to make it easier to use.
Test build #112548 has finished for PR 26231 at commit
Retest this please.
Test build #112622 has finished for PR 26231 at commit
Force-pushed 034bc07 to bb195f3.
Test build #112771 has started for PR 26231 at commit
@@ -141,7 +138,8 @@ case class RowDataSourceScanExec(
  // Don't care about `rdd` and `tableIdentifier` when canonicalizing.
  override def doCanonicalize(): SparkPlan =
    copy(
      fullOutput.map(QueryPlan.normalizeExpressions(_, fullOutput)),
      // Only the required column names matter when checking equality.
      output.map(a => a.withExprId(ExprId(-1))),
Why does this not use `normalizeExpressions`? It seems odd to use a special-case fixed ID here.
override def readSchema(): StructType = requiredSchema
override def toV1Relation(): BaseRelation = {
  new BaseRelation with TableScan {
    override def sqlContext: SQLContext = SparkSession.active.sqlContext
Can SQLContext be passed in when converting?
}

class TableWithV1ReadFallback extends Table with SupportsRead {
  override def name(): String = "v1-read-fallback"
This should return the string version of the identifier it was loaded with.
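A hypothetical sketch of what this suggestion looks like: the table keeps the `Identifier` it was loaded with and returns its string form from `name()`. The constructor parameter and the schema below are illustrative, not the PR's actual test code.

```scala
import java.util

import org.apache.spark.sql.connector.catalog.{Identifier, SupportsRead, Table, TableCapability}
import org.apache.spark.sql.connector.read.ScanBuilder
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Hypothetical: the catalog passes in the identifier it used to load the table.
class TableWithV1ReadFallback(ident: Identifier) extends Table with SupportsRead {

  // Return the string version of the identifier the table was loaded with.
  override def name(): String = ident.toString

  override def schema(): StructType = new StructType().add("i", "int")

  override def capabilities(): util.Set[TableCapability] =
    util.EnumSet.of(TableCapability.BATCH_READ)

  override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder =
    ??? // would return the V1-fallback ScanBuilder discussed later in this thread
}
```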
Force-pushed bb195f3 to f5cb61f.
Test build #112924 has finished for PR 26231 at commit
Force-pushed f5cb61f to e8d718d.
@@ -27,7 +26,7 @@ import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Strategy
import org.apache.spark.sql.internal.SQLConf

class SparkPlanner(
    val sparkContext: SparkContext,
    val session: SparkSession,
Wow. Finally. :)
@@ -42,7 +43,7 @@ object DataSourceV2Strategy extends Strategy with PredicateHelper {
 */
private def pushFilters(
    scanBuilder: ScanBuilder,
    filters: Seq[Expression]): (Seq[Expression], Seq[Expression]) = {
    filters: Seq[Expression]): (Seq[Filter], Seq[Expression]) = {
cc @dbtsai since this is related to his ongoing nested column filter work.
Test build #112951 has finished for PR 26231 at commit
Force-pushed e8d718d to 87fc70a.
@@ -51,6 +54,7 @@ object V2ScanRelationPushDown extends Rule[LogicalPlan] {
      """.stripMargin)

    val scanRelation = DataSourceV2ScanRelation(relation.table, scan, output)
    scanRelation.setTagValue(PUSHED_FILTERS_TAG, pushedFilters)
It will be convenient if `Scan` can report pushed filters itself, but I'm not sure how to design the API to make it work. Here I just store the pushed filters in the `DataSourceV2ScanRelation`, so that I can use them later when creating the v1 physical scan node, which needs `pushedFilters` to do the equality check.
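A minimal sketch of the tag-based approach described here, assuming Spark's `TreeNodeTag` mechanism; the object and method names are illustrative, not the PR's exact code.

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.trees.TreeNodeTag
import org.apache.spark.sql.sources.Filter

object PushedFiltersTagSketch {
  // Tag key used to attach the pushed filters to the scan relation.
  val PUSHED_FILTERS_TAG = TreeNodeTag[Seq[Filter]]("pushedFilters")

  // Called by the push-down rule once it knows which filters were pushed.
  def record(scanRelation: LogicalPlan, pushed: Seq[Filter]): Unit =
    scanRelation.setTagValue(PUSHED_FILTERS_TAG, pushed)

  // Called later by the planner when it builds the v1 physical scan node.
  def read(scanRelation: LogicalPlan): Seq[Filter] =
    scanRelation.getTagValue(PUSHED_FILTERS_TAG).getOrElse(Nil)
}
```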
Can the `pushedFilters` be just a parameter of `DataSourceV2ScanRelation`?
Test build #113092 has finished for PR 26231 at commit
Test build #113094 has finished for PR 26231 at commit
Test build #115512 has finished for PR 26231 at commit
case s: TableScan => s.buildScan()
case _ =>
  throw new IllegalArgumentException(
    "`V1Scan.toV1Relation` must return a `TableScan` instance.")
If it must return `TableScan`, why not rename the API to `toV1TableScan`?
Sorry, I meant rename it to `toV1TableScan` and also change the return type to `TableScan`.
`TableScan` is just a mixin; what we expect is `BaseRelation with TableScan`, but that doesn't work in Java.
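A small sketch of the typing issue under discussion: Scala can express the intersection directly, while the Java interface has to fall back to a generic bound. The trait name is illustrative.

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, TableScan}

// What the API conceptually wants to return (expressible directly in Scala only):
trait V1ScanScalaSketch {
  def toV1TableScan(context: SQLContext): BaseRelation with TableScan
}

// The Java-friendly equivalent uses a generic bound instead, roughly:
//   <T extends BaseRelation & TableScan> T toV1TableScan(SQLContext context);
```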
Test build #115561 has finished for PR 26231 at commit
import DataSourceV2Implicits._

override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
  // projection and filters were already pushed down in the optimizer.
  // this uses PhysicalOperation to get the projection and ensure that if the batch scan does
  // not support columnar, a projection is added to convert the rows to UnsafeRow.
  case PhysicalOperation(project, filters, relation: DataSourceV2ScanRelation) =>
How about we match the V1Scan here? Thus we can simplify the code below.
Resolved review thread on sql/core/src/test/scala/org/apache/spark/sql/connector/V1ReadFallbackSuite.scala.
Force-pushed cd923ee to f786fa8.
Outdated review thread on ...core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala.
import DataSourceV2Implicits._

override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
  case PhysicalOperation(project, filters,
      relation @ DataSourceV2ScanRelation(table, v1Scan: V1Scan, output)) =>
`table` is not used.
Test build #116182 has finished for PR 26231 at commit
Force-pushed eccadc7 to 51bd0d7.
Test build #116309 has finished for PR 26231 at commit
I have some questions around V1Scan. I wonder if we can have the V1Scan output be a bit different, such that pushed filters and schema pruning are pushed down into the scan even for V1 relations, and maybe produce an RDD[Row]. Let me know what you think.
/**
 * Creates an `BaseRelation` that can scan data from DataSource v1 to RDD[Row]. The returned
 * relation must be a `TableScan` instance.
Why does it need to be a `TableScan`? Can't it be a HadoopFsRelation? Can't it be a PrunedFilteredScan?
public interface V1Scan extends Scan {

  /**
   * Creates an `BaseRelation` with `TableScan` that can scan data from DataSource v1 to RDD[Row].
nit: Create a BaseRelation
   *
   * @since 3.0.0
   */
  <T extends BaseRelation & TableScan> T toV1TableScan(SQLContext context);
It kind of seems weird to me that we're introducing new APIs that use deprecated APIs.
We haven't marked `SQLContext` as deprecated yet.
 * @since 3.0.0
 */
@Unstable
public interface V1Scan extends Scan {
Can we not push filters and schema pruning down to this scan? We support these in the V1 APIs. Then you can avoid the pushed-filters tag.
The idea is the same as the v1 write fallback API, which also relies on the v2 API to configure the write. It's better to leverage the v2 infra as much as we can; e.g. we may improve the v2 pushdown to push more operators that v1 doesn't support.
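To make that split concrete, here is a hedged sketch of what such a source might look like: pushdown is negotiated through the v2 `ScanBuilder` mixins, and the v1 fallback only supplies a plain `TableScan` that reads the already-pruned, already-filtered data. The class name and the read logic are illustrative, not taken from the PR.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownFilters, SupportsPushDownRequiredColumns, V1Scan}
import org.apache.spark.sql.sources.{BaseRelation, Filter, TableScan}
import org.apache.spark.sql.types.StructType

// Hypothetical scan builder: pushdown happens via the v2 mixins; the v1 part is
// only responsible for producing an RDD[Row] for the final scan.
class ExampleV1FallbackScanBuilder(fullSchema: StructType)
  extends ScanBuilder with SupportsPushDownRequiredColumns with SupportsPushDownFilters {

  private var requiredSchema: StructType = fullSchema
  private var pushed: Array[Filter] = Array.empty

  override def pruneColumns(requiredSchema: StructType): Unit =
    this.requiredSchema = requiredSchema

  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    pushed = filters  // pretend the source can evaluate every filter
    Array.empty       // so Spark needs no post-scan filters
  }

  override def pushedFilters(): Array[Filter] = pushed

  override def build(): Scan = new V1Scan {
    override def readSchema(): StructType = requiredSchema

    override def toV1TableScan[T <: BaseRelation with TableScan](context: SQLContext): T = {
      // The fallback relation only needs to be a TableScan: column pruning and
      // filter pushdown were already negotiated through the v2 API above.
      new BaseRelation with TableScan {
        override def sqlContext: SQLContext = context
        override def schema: StructType = requiredSchema
        override def buildScan(): RDD[Row] =
          context.sparkContext.emptyRDD[Row] // placeholder for the real v1 read path
      }.asInstanceOf[T]
    }
  }
}
```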
val unsafeRowRDD = DataSourceStrategy.toCatalystRDD(v1Relation, output, rdd)
val originalOutputNames = relation.table.schema().map(_.name)
val requiredColumnsIndex = output.map(_.name).map(originalOutputNames.indexOf)
val dsScan = RowDataSourceScanExec(
How about an alternate constructor?
Similar to the v1 write fallback API, which uses the v2 API to configure the write, I think it makes more sense to use the v2 API to do operator pushdown. That's why the relation returned by the V1 scan should be a `TableScan`.
Test build #116600 has finished for PR 26231 at commit
Retest this please.
Test build #116617 has finished for PR 26231 at commit
One of the filter sets provided is wrong. I think we can do something even better by some code reorganization. In `V2ScanRelationPushDown`, if we change the ordering of `pruneColumns` and `pushFilters`, I think you can create an API that simply wraps `PrunedScan` and `PrunedFilteredScan`. Imagine it as follows:

- First you prune columns. The V1ScanBuilder will check if the relation is `PrunedScan` or `PrunedFilteredScan`. If it is `PrunedScan`, it will eagerly build the RDD.
- If it is a `PrunedFilteredScan`, then it will wait. Once `SupportsPushDownFilters` kicks in, you call into `PrunedFilteredScan` with the pruned schema and filters. Then you get the RDD, and the `BaseRelation` gives you the unhandled filters.
- You wrap the returned RDD in a `TableScan`, and you're done. If the relation was neither, you just call `TableScan.buildScan()`.

What do you think?
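A rough sketch of the wrapper proposed above, in the suggested ordering (prune columns first, then push filters). The class name and the eager/lazy RDD handling are hypothetical; the PR did not end up adopting this design.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownFilters, SupportsPushDownRequiredColumns}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan, PrunedScan}
import org.apache.spark.sql.types.StructType

// Hypothetical builder that wraps an existing v1 BaseRelation.
class V1WrappingScanBuilder(relation: BaseRelation)
  extends ScanBuilder with SupportsPushDownRequiredColumns with SupportsPushDownFilters {

  private var prunedSchema: StructType = relation.schema
  private var builtRdd: Option[RDD[Row]] = None
  private var pushed: Array[Filter] = Array.empty

  override def pruneColumns(requiredSchema: StructType): Unit = {
    prunedSchema = requiredSchema
    relation match {
      case p: PrunedScan =>
        // Only column pruning applies, so the RDD can be built eagerly.
        builtRdd = Some(p.buildScan(requiredSchema.fieldNames))
      case _ => // A PrunedFilteredScan waits for the filters.
    }
  }

  override def pushFilters(filters: Array[Filter]): Array[Filter] = relation match {
    case pf: PrunedFilteredScan =>
      pushed = filters
      builtRdd = Some(pf.buildScan(prunedSchema.fieldNames, filters))
      relation.unhandledFilters(filters) // Spark must still evaluate these after the scan
    case _ =>
      filters // nothing was pushed; all filters stay as post-scan filters
  }

  override def pushedFilters(): Array[Filter] = pushed

  override def build(): Scan =
    ??? // wrap `builtRdd` (or a plain TableScan.buildScan()) into the v2 Scan
}
```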
val dsScan = RowDataSourceScanExec(
  output,
  requiredColumnsIndex,
  pushedFilters.toSet,
This is incorrect, right? There were other filters that weren't handled. As I understand it, this should be the entire set of filters.
Yes, it should be the entire set of filters, but it's not a big deal. `RowDataSourceScanExec.filters` is only used in `toString`, to let people know which filters were pushed but not accepted by the source. Anyway, we can retain the full filter set like the pushed filters; I'll fix it.
class V1TableScan(
    context: SQLContext,
    requiredSchema: StructType,
    filters: Array[Filter]) extends BaseRelation with TableScan {
What about `unhandledFilters`?
The implementation doesn't need to track the unhandled filters. The implementation tells Spark the unhandled filters (a.k.a. post-scan filters) at https://github.com/apache/spark/pull/26231/files#diff-e65f6ba43960e865ba29530572696f56R150, and then only needs to track the pushed filters to evaluate them when scanning.
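A small sketch of the contract described here, assuming the v2 `SupportsPushDownFilters` interface: the return value of `pushFilters` is the set Spark must still evaluate after the scan, while `pushedFilters` reports what the source evaluates itself. The trait and the GreaterThan-only rule are illustrative.

```scala
import org.apache.spark.sql.connector.read.SupportsPushDownFilters
import org.apache.spark.sql.sources.{Filter, GreaterThan}

trait ExamplePushDownContract extends SupportsPushDownFilters {
  private var pushed: Array[Filter] = Array.empty

  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    // Pretend the source only understands GreaterThan; everything else is unhandled.
    val (supported, unsupported) = filters.partition(_.isInstanceOf[GreaterThan])
    pushed = supported
    unsupported // post-scan filters: Spark evaluates these after the scan
  }

  // Only the pushed filters need to be tracked; the source evaluates them while scanning.
  override def pushedFilters(): Array[Filter] = pushed
}
```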
Force-pushed ff2d410 to 92b04c2.
Force-pushed 92b04c2 to a48e7bb.
Test build #116828 has finished for PR 26231 at commit
Test build #116834 has finished for PR 26231 at commit
Retest this please.
Test build #116858 has finished for PR 26231 at commit
LGTM
Thanks for the review, merging to master!
What changes were proposed in this pull request?
Add a `V1Scan` interface, so that data source v1 implementations can migrate to DS v2 much more easily.
Why are the changes needed?
It's a lot of work to migrate v1 sources to DS v2. The new API added here allows v1 sources to go through v2 code paths without implementing all the Batch, Stream, PartitionReaderFactory, ... stuff.
We already have a v1 write fallback API after #25348.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
New test suite.