[SPARK-28554][SQL] Adds a v1 fallback writer implementation for v2 data source codepaths #25348
Conversation
Test build #108613 has finished for PR 25348 at commit
Test build #108615 has finished for PR 25348 at commit
Do we have a plan to support read fallback?
val batchWrite = newWriteBuilder() match {
newWriteBuilder() match {
  case v1: V1WriteBuilder if isTruncate(deleteWhere) =>
    writeWithV1(v1.buildForV1Write(), SaveMode.Overwrite, writeOptions)
Overwrite is ambiguous and doesn't specify whether the table data should be truncated, replaced dynamically by partition, etc. It isn't possible for v1 sources to guarantee the right behavior -- deleting data that matches `deleteWhere` -- so v1 fallback should not be supported here.
Does it make sense to just do this, but only if `deleteWhere` is empty?
No. If you pass `SaveMode.Overwrite` into a v1 implementation, the behavior is undefined. So we shouldn't ever pass this.
I was talking to @brkyvz directly and we think that we can use `SupportsDelete` to make this possible. If a table supports deleting by filter, then we can run that first and then run the insert. But that should be done in a follow-up PR.
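A rough sketch of that follow-up idea, for illustration only: the trait shapes below are simplified stand-ins (not Spark's actual interfaces), and the helper name is hypothetical.

```scala
object V1OverwriteFallbackSketch {
  // Simplified stand-ins for the interfaces discussed above.
  trait Filter
  trait InsertableRelation { def insert(overwrite: Boolean): Unit }
  trait SupportsDelete { def deleteWhere(filters: Array[Filter]): Unit }

  // Hypothetical overwrite fallback: delete the rows matching the overwrite
  // condition through the v2 API, then run the insert as a plain append
  // through the v1 relation.
  def overwriteByExpressionWithV1(
      table: SupportsDelete,
      deleteWhere: Array[Filter],
      relation: InsertableRelation): Unit = {
    table.deleteWhere(deleteWhere)      // delete by filter first
    relation.insert(overwrite = false)  // then append the new data
  }
}
```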
    mode: SaveMode,
    options: CaseInsensitiveStringMap): RDD[InternalRow] = {
  relation.createRelation(
    sqlContext, mode, options.asScala.toMap, Dataset.ofRows(sqlContext.sparkSession, plan))
I think this should use the original options map that preserves the case that was passed in.
doWrite(batchWrite)
writer match {
  case v1: V1WriteBuilder =>
    val mode = if (ifNotExists) SaveMode.Ignore else SaveMode.ErrorIfExists
The table was just created above, so both of these modes are incorrect. This should be `SaveMode.Append` because the table already exists.
 *
 * @since 3.0.0
 */
def buildForV1Write(): CreatableRelationProvider = {
Why not add a path for `InsertableRelation`?
`insert` semantics are weird, and it doesn't support passing in options. `CreatableRelationProvider` is more flexible. I also did a quick spot check of:
https://spark-packages.org/?q=tags%3A%22Data%20Sources%22
All sources that I checked support `CreatableRelationProvider`, but some don't support `InsertableRelation`.
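For reference, these are the two v1 interfaces being compared, abridged from `org.apache.spark.sql.sources`: `CreatableRelationProvider` receives the options map and the save mode, while `InsertableRelation` only receives the data and an overwrite flag.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.BaseRelation

trait CreatableRelationProvider {
  // The source gets the full options map and the SaveMode.
  def createRelation(
      sqlContext: SQLContext,
      mode: SaveMode,
      parameters: Map[String, String],
      data: DataFrame): BaseRelation
}

trait InsertableRelation {
  // Only the data and an overwrite flag -- no options, no mode.
  def insert(data: DataFrame, overwrite: Boolean): Unit
}
```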
@@ -129,7 +137,8 @@ case class AtomicCreateTableAsSelectExec(
      }
      val stagedTable = catalog.stageCreate(
        ident, query.schema, partitioning.toArray, properties.asJava)
      writeToStagedTable(stagedTable, writeOptions, ident)
      val mode = if (ifNotExists) SaveMode.Ignore else SaveMode.ErrorIfExists
Same problem here. I think mode should always be Append because that's the only one with reliable behavior that matches v2.
        planLater(query),
        props,
        writeOptions,
        orCreate = orCreate) :: Nil
    }

    case AppendData(r: DataSourceV2Relation, query, _) =>
      AppendDataExec(r.table.asWritable, r.options, planLater(query)) :: Nil
      AppendDataExec(r.table.asWritable, r.options, query, planLater(query)) :: Nil
I think that falling back to v1 should use different plan nodes so that users can see that the v1 write API is used instead of v2. It would also help keep concerns separated in the exec nodes.
That's gonna require a separation at the catalog level.
We talked about this separately and I think this is okay for now. In the future, we can add a separate `Write` produced by the builder (like `Scan` on the read side) and use that for a separate plan.
also cc @RussellSpitzer (you may be interested)
    options: CaseInsensitiveStringMap): RDD[InternalRow] = {
  relation.createRelation(
    sqlContext, mode, options.asScala.toMap, Dataset.ofRows(sqlContext.sparkSession, plan))
  sparkContext.emptyRDD
I might be missing something here, but why does this return an RDD and why is it empty?
OK, so I see this is used in atomic table writes. That seems a bit wrong to me; shouldn't we just not support atomic table writes with the V1 fallback? It seems like we are violating the contract by returning an empty result regardless of what happens.
What is the contract? All data write commands have returned an empty result since the beginning of time; e.g. look at `SaveIntoDataSourceCommand`, `V2TableWriteExec`, `InsertIntoDataSourceCommand`.
val writtenRows = writer match {
  case v1: V1WriteBuilder =>
    writeWithV1(v1.buildForV1Write(), writeOptions)
  case v2 =>
    doWrite(v2.buildForBatch())
}
If this is always empty, why do we save it as `writtenRows` here? Is this just to hold a reference to the empty result set?
Yeah, it's pretty much dead code. I think if we decide to change what to return later, it's easier to change 1-2 places vs. 'n' different operators.
I'd like to add that support in a separate PR, but I do think it is valuable and we should have it.
@rdblue Addressed your comments. Can you please take a look when possible?
Test build #108737 has finished for PR 25348 at commit
Test build #108738 has finished for PR 25348 at commit
Test build #108746 has finished for PR 25348 at commit
override protected def doExecute(): RDD[InternalRow] = {
  writeBuilder match {
    case builder: SupportsTruncate if isTruncate(deleteWhere) =>
      writeWithV1(builder.truncate().asV1Writer.buildForV1Write(), writeOptions)
Then it's not a simple fallback now. People need to implement `SupportsTruncate` and `SupportsOverwrite`, and people need to update their `CreatableRelationProvider` implementation to apply the `deleteWhere` condition while the save mode is always append.

AFAIK a v1 source can write data with `SaveMode`, which has inconsistent behavior that we don't want to rely on. A v1 source can also write data with `InsertableRelation.insert`, which can append data or overwrite the entire table. I think those are the only two cases where we can safely fall back to a v1 source.
I've thought about it more, and since the only write mode we want to support is Append, I changed the behavior to use `InsertableRelation`. With regards to:

> Then it's not a simple fallback now. People need to implement SupportsTruncate and SupportsOverwrite.

I think this is fine. If a data source has already implemented `SaveMode.Overwrite`, it already has some logic to delete partitions. Implementing these as part of the WriteBuilder API shouldn't be too hard, and code reuse is possible.
An alternative is `SupportsDelete`, which could be used for both. We just can't support dynamic overwrite.
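To illustrate the kind of reuse described above, here is a sketch (with simplified stand-in traits and hypothetical names, not the final API) of a source whose existing SaveMode.Overwrite cleanup logic is surfaced through `SupportsTruncate`, while the write itself always goes through the v1 relation as an append.

```scala
object TruncateFallbackSketch {
  // Simplified stand-ins for the interfaces discussed in this thread.
  trait InsertableRelation { def insert(overwrite: Boolean): Unit }
  trait WriteBuilder
  trait SupportsTruncate extends WriteBuilder { def truncate(): WriteBuilder }
  trait V1WriteBuilder extends WriteBuilder { def buildForV1Write(): InsertableRelation }

  // Hypothetical builder: deleteAllData is the cleanup code the source already
  // had for its old SaveMode.Overwrite path; appendRelation is its existing
  // v1 append implementation.
  class ExampleV1FallbackBuilder(
      deleteAllData: () => Unit,
      appendRelation: InsertableRelation)
    extends V1WriteBuilder with SupportsTruncate {

    private var truncateFirst = false

    override def truncate(): WriteBuilder = { truncateFirst = true; this }

    override def buildForV1Write(): InsertableRelation = new InsertableRelation {
      override def insert(overwrite: Boolean): Unit = {
        if (truncateFirst) deleteAllData()       // reuse the existing delete logic
        appendRelation.insert(overwrite = false) // always append through v1
      }
    }
  }
}
```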
    plan: LogicalPlan) extends V1FallbackWriters {

  override protected def doExecute(): RDD[InternalRow] = {
    writeWithV1(writeBuilder.buildForV1Write())
Why pass the builder in rather than building and passing the `BatchWrite`? Is this trying to manage the life-cycle so that the write is only created if it will be executed?

If so, this may be a good reason to have a separate `Write` and `BatchWrite`, like we have for `Scan` and `BatchScan`.
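For context, this is roughly the separation the read side already has; a symmetric write side is sketched here with stand-in traits (an assumption about a possible future shape, not the API in this PR).

```scala
// Read side (DSv2): the builder produces a logical Scan, and the Scan is only
// turned into a physical Batch when the plan actually executes.
//   ScanBuilder.build(): Scan
//   Scan.toBatch(): Batch
//
// A hypothetical symmetric write side, sketched with stand-in traits:
trait BatchWrite                           // physical write, created at execution time
trait Write { def toBatch: BatchWrite }    // logical description of the write
trait WriteBuilder { def build(): Write }  // configured by the planner
```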
doWrite(batchWrite)
writeBuilder match {
  case v1: V1WriteBuilder => writeWithV1(v1.buildForV1Write())
  case v2 => doWrite(v2.buildForBatch())
Minor: should we rename `doWrite` to `writeWithV2`?
    writeOptions: CaseInsensitiveStringMap,
    query: SparkPlan) extends V2TableWriteExec with BatchWriteHelper {
    writeBuilder: WriteBuilder,
    query: SparkPlan) extends V2TableWriteExec {
Since thinking about the impact on `simpleString`, I realized that this is also going to delegate to the write builder for other methods as well, including `equals` and `hashCode`. Since we can't rely on the behavior of the write builder's `equals` and `hashCode` methods, I don't think the builder should be used as an argument to the plan case classes.

I think that also makes sense: case classes are algebraic types, so we shouldn't include objects in their definitions that don't behave like algebraic types.
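A small illustration of the equals/hashCode concern (hypothetical names, plain Scala):

```scala
// A builder with no value-based equality, like a typical WriteBuilder implementation.
class MyWriteBuilder

// A plan-like case class that embeds the builder directly.
case class AppendLikeExec(builder: MyWriteBuilder)

object CaseClassEqualityDemo extends App {
  val a = AppendLikeExec(new MyWriteBuilder)
  val b = AppendLikeExec(new MyWriteBuilder)
  // Case-class equality delegates to the builder's reference equality, so two
  // logically identical plan nodes compare as different.
  println(a == b) // prints: false
}
```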
@@ -259,27 +266,25 @@ case class AppendDataExec(
 */
case class OverwriteByExpressionExec(
    table: SupportsWrite,
    writeBuilder: WriteBuilder,
If the builder is no longer passed in, then I don't think this class needs to change at all.
Test build #109430 has finished for PR 25348 at commit
Test build #109431 has finished for PR 25348 at commit
  def buildForV1Write(): InsertableRelation

  // These methods cannot be implemented by a V1WriteBuilder.
  override final def buildForBatch(): BatchWrite = super.buildForBatch()
Not sure if this is required. Now WriteBuilder implementations need to be declared as `class ExampleBuilder extends WriteBuilder with V1WriteBuilder`.
Minor: would be nice to have a comment that the superclass is going to throw an exception.
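Roughly what that could look like, as a sketch with simplified stand-in traits (in the real `WriteBuilder`, the default implementation throws for capabilities a builder does not support; treat the exact message and members here as assumptions):

```scala
object V1WriteBuilderSketch {
  // Simplified stand-ins for the v2 write interfaces.
  trait BatchWrite
  trait InsertableRelation

  trait WriteBuilder {
    // Default implementation throws: a plain WriteBuilder does not support batch writes.
    def buildForBatch(): BatchWrite =
      throw new UnsupportedOperationException("batch write not supported")
  }

  trait V1WriteBuilder extends WriteBuilder {
    def buildForV1Write(): InsertableRelation

    // This method cannot be implemented by a V1WriteBuilder: the `final`
    // override pins it to the superclass behavior, which throws, so a v1
    // fallback source is only ever exercised through buildForV1Write().
    override final def buildForBatch(): BatchWrite = super.buildForBatch()
  }

  // Implementations then mix in both, as noted in the comment above:
  //   class ExampleBuilder extends WriteBuilder with V1WriteBuilder { ... }
}
```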
if (tables.containsKey(fullIdent)) {
  tables.get(fullIdent)
} else {
  // Table was created through the built-in catalog
I find this case a little odd. I think it makes sense to layer on in-memory tables because we need to return the same table instance. But why create an in-memory shadow table for tables that already exist?
Nevermind, I get it.
import org.apache.spark.sql.sources.v2.Table
import org.apache.spark.sql.types.StructType

/** A SessionCatalog that always loads an in memory Table, so we can test write code paths. */
This doesn't always load an in-memory table, since `newTable` is abstract. Can you update the docs to be a bit more clear about what this does?
I'm not sure that it makes sense for this to be separate, since the table cache is in memory but tables aren't necessarily in memory.
+1 when tests pass
Test build #109435 has finished for PR 25348 at commit
Test build #109449 has finished for PR 25348 at commit
LGTM, merging to master!
### What changes were proposed in this pull request?
Add a `V1Scan` interface, so that data source v1 implementations can migrate to DS v2 much more easily.

### Why are the changes needed?
It's a lot of work to migrate v1 sources to DS v2. The new API added here can allow v1 sources to go through v2 code paths without implementing all the Batch, Stream, PartitionReaderFactory, ... stuff. We already have a v1 write fallback API after apache#25348.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
New test suite.

Closes apache#26231 from cloud-fan/v1-read-fallback.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?

This PR adds a V1 fallback interface for writing to V2 Tables using V1 writer interfaces. The only supported SaveMode that will be called on the target table will be Append. The target table must use V2 interfaces such as `SupportsOverwrite` or `SupportsTruncate` to support Overwrite operations. It is up to the target DataSource implementation whether this operation can be atomic or not.

We do not support dynamicPartitionOverwrite, as we cannot call a `commit` method that actually cleans up the data in the partitions that were touched through this fallback.

How was this patch tested?

Will add tests and an example implementation after comments + feedback. This is a proposal at this point.
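As an end-to-end illustration of the write path this proposal adds, here is a hedged sketch with simplified stand-in traits mirroring the snippets quoted in the review above, not the exact merged code:

```scala
object V1FallbackUsageSketch {
  // Simplified stand-ins mirroring the review snippets above.
  trait InsertableRelation { def insert(overwrite: Boolean): Unit }
  trait BatchWrite
  trait WriteBuilder {
    def buildForBatch(): BatchWrite =
      throw new UnsupportedOperationException("batch write not supported")
  }
  trait V1WriteBuilder extends WriteBuilder {
    def buildForV1Write(): InsertableRelation
    override final def buildForBatch(): BatchWrite = super.buildForBatch()
  }

  // The exec node picks the v1 fallback when the builder advertises it;
  // otherwise it goes through the regular v2 batch path.
  def runWrite(writeBuilder: WriteBuilder): Unit = writeBuilder match {
    case v1: V1WriteBuilder =>
      v1.buildForV1Write().insert(overwrite = false) // fallback is append-only
    case v2 =>
      val batch = v2.buildForBatch() // hand off to the regular v2 write path (elided)
  }
}
```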