
[SPARK-27845][SQL] DataSourceV2: InsertTable #24832

Closed · wants to merge 5 commits

Conversation

@jzhuge (Member) commented Jun 10, 2019

What changes were proposed in this pull request?

Support multiple catalogs in the following InsertTable use cases:

  • INSERT INTO [TABLE] catalog.db.tbl
  • INSERT OVERWRITE TABLE catalog.db.tbl

Support matrix:

| Overwrite | Partitioned Table | Partition Clause | Partition Overwrite Mode | Action |
|-----------|-------------------|------------------------------|---------|----------------------------------------|
| false | * | * | * | AppendData |
| true | no | (empty) | * | OverwriteByExpression(true) |
| true | yes | p1,p2 or p1 or p2 or (empty) | STATIC | OverwriteByExpression(true) |
| true | yes | p1,p2 or p1 or p2 or (empty) | DYNAMIC | OverwritePartitionsDynamic |
| true | yes | p1=23,p2=3 | * | OverwriteByExpression(p1=23 and p2=3) |
| true | yes | p1=23,p2 or p1=23 | STATIC | OverwriteByExpression(p1=23) |
| true | yes | p1=23,p2 or p1=23 | DYNAMIC | OverwritePartitionsDynamic |

Notes:

  • Assume the partitioned table has 2 partitions: p1 and p2.
  • STATIC is the default Partition Overwrite Mode for data source tables.
  • DSv2 tables currently do not support IfPartitionNotExists.
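For concreteness, a minimal sketch of how rows in the matrix above map to SQL. The catalog, namespace, and source names (`testcat.db.tbl`, `src`) are invented for illustration:

```scala
// Hypothetical session sketch; table and source names are invented.
// No OVERWRITE always becomes AppendData.
spark.sql("INSERT INTO testcat.db.tbl SELECT * FROM src")

// Overwrite with a fully static PARTITION clause:
// OverwriteByExpression(p1 = 23 AND p2 = 3), regardless of overwrite mode.
spark.sql("INSERT OVERWRITE TABLE testcat.db.tbl PARTITION (p1 = 23, p2 = 3) SELECT * FROM src")

// The same overwrite without static values, under dynamic overwrite mode:
// OverwritePartitionsDynamic.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
spark.sql("INSERT OVERWRITE TABLE testcat.db.tbl PARTITION (p1, p2) SELECT * FROM src")
```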

How was this patch tested?

New tests.
All existing catalyst and sql/core tests.

@jzhuge changed the title from "Spark 27845 pr" to "[SPARK-27845][SQL][WIP] DataSourceV2: Insert into tables in multiple catalogs" on Jun 10, 2019
@SparkQA commented Jun 10, 2019

Test build #106359 has finished for PR 24832 at commit e6363a6.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class InsertTableStatement(

```diff
@@ -274,7 +284,7 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging
   override def visitInsertOverwriteTable(
       ctx: InsertOverwriteTableContext): InsertTableParams = withOrigin(ctx) {
     assert(ctx.OVERWRITE() != null)
-    val tableIdent = visitTableIdentifier(ctx.tableIdentifier)
+    val tableIdent = visitMultipartIdentifier(ctx.multipartIdentifier)
```
Contributor commented:
I think this needs to be updated to remove the ParseException thrown when IF NOT EXISTS is present and there are dynamic partitions. I think that is an analysis problem, not a parse problem.

Also, I don't see a reason why IF NOT EXISTS would not be supported with dynamic partitions. Wouldn't that fail if any partitions would be overwritten? It seems to make sense to me, but maybe there is a good reason why this is not allowed? @gatorsmile can you comment?

Contributor commented:
We discussed this in the DSv2 sync last night and decided to add a method to the write builder to pass this IF NOT EXISTS flag. This will be done in a follow-up to avoid over-complicating this commit.
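Purely as a sketch of that follow-up idea (the trait and method names here are invented; the actual API was deferred to the later PR):

```scala
// A capability-style hook on the write builder: sources that can validate
// partition existence override this; everything else rejects the flag up front.
trait WriteBuilderWithIfNotExists {
  // Hypothetical: invoked when INSERT ... IF NOT EXISTS targets this table.
  def ifPartitionNotExists(): Unit =
    throw new UnsupportedOperationException("IF NOT EXISTS is not supported")
}
```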

@jzhuge changed the title from "[SPARK-27845][SQL][WIP] DataSourceV2: Insert into tables in multiple catalogs" to "[SPARK-27845][SQL][WIP] DataSourceV2: InsertTable" on Jun 11, 2019
@jzhuge (Member, Author) commented Jun 13, 2019

Rebase and squash

@SparkQA commented Jun 13, 2019

Test build #106455 has finished for PR 24832 at commit 2c17ced.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class InsertTableStatement(

@SparkQA commented Jun 27, 2019

Test build #106950 has finished for PR 24832 at commit 2f8ada2.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class InsertTableStatement(

@SparkQA commented Jun 27, 2019

Test build #106951 has finished for PR 24832 at commit fe4b24e.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class InsertTableStatement(

@jzhuge (Member, Author) commented Jun 27, 2019

Will do DataFrameWriter.insertInto in a separate PR, so this PR is no longer WIP.

@jzhuge changed the title from "[SPARK-27845][SQL][WIP] DataSourceV2: InsertTable" to "[SPARK-27845][SQL] DataSourceV2: InsertTable" on Jun 27, 2019
@jzhuge (Member, Author) commented Jun 27, 2019

Will look into supporting IfPartitionNotExists flag with DSv2 tables in follow-up PR.

@SparkQA commented Jun 27, 2019

Test build #106952 has finished for PR 24832 at commit c9c9bc1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```diff
@@ -38,7 +38,7 @@ import org.apache.spark.sql.catalyst.expressions.aggregate.{First, Last}
 import org.apache.spark.sql.catalyst.parser.SqlBaseParser._
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
-import org.apache.spark.sql.catalyst.plans.logical.sql.{AlterTableAddColumnsStatement, AlterTableAlterColumnStatement, AlterTableDropColumnsStatement, AlterTableRenameColumnStatement, AlterTableSetLocationStatement, AlterTableSetPropertiesStatement, AlterTableUnsetPropertiesStatement, AlterViewSetPropertiesStatement, AlterViewUnsetPropertiesStatement, CreateTableAsSelectStatement, CreateTableStatement, DropTableStatement, DropViewStatement, QualifiedColType}
+import org.apache.spark.sql.catalyst.plans.logical.sql._
```
Contributor commented:
Nit: this is a bad practice because it can cause git conflicts and pollutes the namespace.

It's probably okay here because there are only logical plans in that package, but in other places this causes problems when it imports packages as well as classes.

@jzhuge force-pushed the SPARK-27845-pr branch 2 times, most recently from 7daa572 to 6c33ac3 on July 3, 2019 00:49
@SparkQA commented Jul 3, 2019

Test build #107141 has finished for PR 24832 at commit 6c33ac3.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class InsertTableStatement(

```scala
private class Overwrite(filters: Array[Filter]) extends TestBatchWrite {
  override def commit(messages: Array[WriterCommitMessage]): Unit = dataMap.synchronized {
    val deleteKeys = dataMap.keys.filter { partValues =>
      filters.exists {
```
Contributor commented:
Looks like this matches a key if any value matches a filter expression. exists Scaladoc says "Tests whether a predicate holds for at least one value", so this is implementing an OR of all the filters, but the desired behavior is an AND of all the filters.

Contributor commented:
this should be a forall
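To illustrate the difference being pointed out, here is a standalone sketch (not the test code itself) of `exists` versus `forall` over a set of filters:

```scala
// exists implements OR across the filters; forall implements the intended AND.
val filters: Seq[Int => Boolean] = Seq(_ > 0, _ % 2 == 0)
val keys = Seq(-2, 1, 4)

keys.filter(k => filters.exists(f => f(k)))  // Seq(-2, 1, 4): any single match selects the key
keys.filter(k => filters.forall(f => f(k)))  // Seq(4): every filter must match

// With exists, Overwrite would delete every partition matched by ANY filter,
// rather than only those matched by ALL filters.
```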

```scala
  override def abort(messages: Array[WriterCommitMessage]): Unit = {

private object TruncateAndAppend extends TestBatchWrite {
  override def commit(messages: Array[WriterCommitMessage]): Unit = dataMap.synchronized {
    dataMap = mutable.Map.empty
```
Contributor commented:
This should use dataMap.clear instead of re-assigning because this is synchronized on the original dataMap instance. After reassignment, another thread will be able to enter a synchronized block on the new instance.

Contributor commented:
@rdblue You forgot to address this?

Contributor commented:
Thanks, I'll fix it.
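A standalone sketch of the race being described (the variable name mirrors the test code; the two methods are hypothetical):

```scala
import scala.collection.mutable

var dataMap: mutable.Map[Int, String] = mutable.Map.empty

// Buggy: reassignment swaps in a new monitor object. A thread that later
// synchronizes on dataMap locks the NEW instance, so it no longer excludes
// a writer still holding the OLD instance's lock.
def truncateBuggy(): Unit = dataMap.synchronized {
  dataMap = mutable.Map.empty
}

// Fixed: clear() mutates the same instance, so every thread keeps
// synchronizing on the one monitor.
def truncateFixed(): Unit = dataMap.synchronized {
  dataMap.clear()
}
```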

```scala
val staticPartitionProjectList = {
  // check that the data column counts match
  val numColumns = table.output.size
  if (numColumns > staticPartitions.size + i.query.output.size) {
```
Contributor commented:
The ResolveOutputRelation rule already checks this and produces a more useful error message: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L2336-L2339

I think this rule should simply transform the query assuming that it will work, and let ResolveOutputRelation ensure that the types and number of columns align.

Contributor commented:
To do this, I think you just need to add all remaining query columns from the iterator after table.output is exhausted.

Contributor commented:
I agree, it'd be great if ResolveOutputRelation does all the checking and necessary casting. It seems to already have most of the logic built in. This method can then just look at the partition values and convert to AppendData or OverwriteByExpression.
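A possible shape for that approach, sketched with plain strings instead of Catalyst expressions (the helper name and details are invented):

```scala
// Splice static partition values in by table-column order; draw everything
// else from the query output; leave count/type checks to ResolveOutputRelation.
def projectWithStaticPartitions(
    tableColumns: Seq[String],
    staticPartitions: Map[String, String],
    queryColumns: Seq[String]): Seq[String] = {
  val fromQuery = queryColumns.iterator
  val projected = tableColumns.map { name =>
    staticPartitions.get(name) match {
      case Some(value) => s"lit($value) AS $name"        // constant for a static partition
      case None if fromQuery.hasNext => fromQuery.next() // next query column, in order
      case None => name                                  // missing column: let the analyzer report it
    }
  }
  // Leftover query columns are appended so ResolveOutputRelation can report
  // "too many data columns" instead of this rule failing early.
  projected ++ fromQuery
}
```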

```scala
// ifPartitionNotExists is append with validation, but validation is not supported
if (i.ifPartitionNotExists) {
  throw new AnalysisException(
    s"Cannot write, IF NOT EXISTS is not supported for table: ${table.table.name}")
```
Contributor commented:
This uses table to refer to a DataSourceV2Relation, which causes this awkward reference because the relation has an actual table: table.table.name. It would be better to call the relation rel or relation.

```scala
  throw new AnalysisException(s"Cannot write: not enough columns")
}

val staticNames = staticPartitions.keySet
```
Contributor commented:
This should validate that staticPartitions are all names for columns used in identity partitions. It is not valid to supply a static partition value for a non-partition column and it is not allowed to supply a static partition value for transform-derived columns.
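A hedged sketch of such a check (the shapes of `Transform`/`FieldReference` are assumed from the imports shown elsewhere in this PR):

```scala
// Only columns backed by identity transforms may receive static values;
// non-partition columns and transform-derived columns are rejected.
val identityPartitionNames: Set[String] = table.partitioning.collect {
  case IdentityTransform(FieldReference(Seq(name))) => name
}.toSet

val invalid = staticPartitions.keySet.filterNot(identityPartitionNames.contains)
if (invalid.nonEmpty) {
  throw new AnalysisException(
    s"Cannot use static values for non-identity partition columns: ${invalid.mkString(", ")}")
}
```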

```scala
  conf.partitionOverwriteMode == PartitionOverwriteMode.DYNAMIC

val query =
  if (staticPartitions.isEmpty) {
```
Contributor commented:
If this is true, then this should avoid building the staticPartitionProjectList. I'd recommend refactoring that into a method, or moving it into the else block.
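For instance, a sketch of the suggested restructuring (`buildStaticPartitionProjectList` is a hypothetical extraction of the block above into a method):

```scala
// Move the projection construction into the branch where it is needed,
// so nothing is built when there are no static partitions.
val query = if (staticPartitions.isEmpty) {
  i.query
} else {
  Project(buildStaticPartitionProjectList(table, i.query, staticPartitions), i.query)
}
```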

@SparkQA commented Jul 16, 2019

Test build #107707 has finished for PR 24832 at commit df61fca.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue force-pushed the SPARK-27845-pr branch 2 times, most recently from faa2e85 to a67aef7 on July 16, 2019 00:43
@rdblue (Contributor) commented Jul 16, 2019

@brkyvz, I've updated this PR since John is out on vacation. Could you have another look?

@SparkQA commented Jul 16, 2019

Test build #107708 has finished for PR 24832 at commit a67aef7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 17, 2019

Test build #107766 has finished for PR 24832 at commit e72e656.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 17, 2019

Test build #107799 has finished for PR 24832 at commit cf74d67.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 18, 2019

Test build #107800 has finished for PR 24832 at commit 0f8aa32.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 23, 2019

Test build #108065 has finished for PR 24832 at commit c2824d3.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 24, 2019

Test build #108118 has finished for PR 24832 at commit 97dc04c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@brkyvz (Contributor) left a comment:
This LGTM. I have some very minor comments around the parser changes.

```diff
-    : INSERT OVERWRITE TABLE tableIdentifier (partitionSpec (IF NOT EXISTS)?)?      #insertOverwriteTable
-    | INSERT INTO TABLE? tableIdentifier partitionSpec?                             #insertIntoTable
+    : INSERT OVERWRITE TABLE? multipartIdentifier (partitionSpec (IF NOT EXISTS)?)? #insertOverwriteTable
+    | INSERT INTO TABLE? multipartIdentifier partitionSpec? (IF NOT EXISTS)?        #insertIntoTable
```
Contributor commented:
Do we need to wrap with parentheses, as in (partitionSpec (IF NOT EXISTS)?)? above? Otherwise, what happens if there is no partitionSpec but IF NOT EXISTS is present?

  • What if the table does not exist? Then wouldn't that be CTAS?

Contributor commented:
It isn't supported either way, so why combine the two?

Contributor commented:
Got it.

```scala
val partitionKeys = Option(ctx.partitionSpec).map(visitPartitionSpec).getOrElse(Map.empty)

if (ctx.EXISTS != null) {
```
Contributor commented:
what's the point of adding this to the parser, if we're not going to support it?

Contributor commented:
For a better error message that is testable. Before, there were no tests for this case and the error message only listed expected symbols.

Contributor commented:
Also, since the PARTITION clause is optional for the above case, it shouldn't group the two together either. It is semantically incorrect because a write to a partitioned table is always a partitioned write.

```scala
      case _ =>
        throw new IllegalArgumentException(s"Unknown filter attribute: $attr")
    }
  case f @ _ =>
```
Contributor commented:
nit, no need for @ _


```diff
@@ -23,17 +23,19 @@ import scala.collection.mutable

 import org.apache.spark.sql.{AnalysisException, SaveMode}
 import org.apache.spark.sql.catalog.v2.{CatalogPlugin, Identifier, LookupCatalog, TableCatalog}
-import org.apache.spark.sql.catalog.v2.expressions.Transform
+import org.apache.spark.sql.catalog.v2.expressions.{FieldReference, IdentityTransform, Transform}
```
Contributor commented:
are any of the changes here needed?

Contributor commented:
Looks like there were unused imports. I'll commit a fix.

@cloud-fan (Contributor) commented Jul 25, 2019:
I think (partitionSpec (IF NOT EXISTS)?)? is better? INSERT INTO TABLE ... IF NOT EXISTS doesn't make sense. The IF NOT EXISTS is only for partitions.

Contributor commented:
Actually, IF NOT EXISTS doesn't make sense for partitions either. It's an append, not an overwrite, and it seems weird to me that we couldn't append to an existing partition.

Contributor commented:
Can we keep this unchanged and not add IF NOT EXISTS here?

Contributor commented:
This was changed to get a better error message. Instead of a parse exception that lists symbols, this is now a useful error message with a test.

```diff
 val partitionKeys = Option(ctx.partitionSpec).map(visitPartitionSpec).getOrElse(Map.empty)

 val dynamicPartitionKeys: Map[String, Option[String]] = partitionKeys.filter(_._2.isEmpty)
 if (ctx.EXISTS != null && dynamicPartitionKeys.nonEmpty) {
-  throw new ParseException(s"Dynamic partitions do not support IF NOT EXISTS. Specified " +
-    "partitions with value: " + dynamicPartitionKeys.keys.mkString("[", ",", "]"), ctx)
+  operationNotAllowed("IF NOT EXISTS with dynamic partitions: " +
```
Contributor commented:
why do we change the error message here?

Contributor commented:
This uses operationNotAllowed instead of throwing a custom ParseException, like other methods that do not allow specific combinations. I think it's a good idea to standardize on the existing helper. It also makes the error message consistent with the others, clearly stating what is not allowed.
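For reference, the existing helper has roughly this shape (paraphrased from ParserUtils; exact wording assumed):

```scala
import org.antlr.v4.runtime.ParserRuleContext
import org.apache.spark.sql.catalyst.parser.ParseException

// Every disallowed combination funnels through one helper, so all such
// errors share the "Operation not allowed: ..." prefix.
def operationNotAllowed(message: String, ctx: ParserRuleContext): Nothing = {
  throw new ParseException(s"Operation not allowed: $message", ctx)
}
```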

```scala
  assert(exc.getMessage.contains("p2"))
}

test("insert table: if not exists without overwrite fails") {
```
Contributor commented:
@cloud-fan, @brkyvz, this is the test that required adding the IF NOT EXISTS to INSERT INTO. I think it is better to have a good error message instead of relying on not being able to parse the statement.
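The test in question has roughly this shape (the table name, exception type, and exact message are assumptions for illustration):

```scala
test("insert table: if not exists without overwrite fails") {
  // The parser now accepts the clause, so the failure is a clear
  // "Operation not allowed" error instead of a list of expected symbols.
  val exc = intercept[ParseException] {
    sql("INSERT INTO TABLE testcat.db.tbl IF NOT EXISTS SELECT * FROM source")
  }
  assert(exc.getMessage.contains("IF NOT EXISTS"))
}
```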

@brkyvz (Contributor) commented Jul 25, 2019

ResolveOutputRelation now does the safe casting, correct?

@rdblue (Contributor) commented Jul 25, 2019

@brkyvz, that's correct. Tests also validate that the error messages for too many or too few columns are the ones from ResolveOutputRelation.

@SparkQA commented Jul 25, 2019

Test build #108178 has finished for PR 24832 at commit 7f193ca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@brkyvz (Contributor) commented Jul 25, 2019

Magic 💯

@brkyvz (Contributor) commented Jul 25, 2019

Merging to master!

@rdblue (Contributor) commented Jul 25, 2019

Thanks for fixing this @jzhuge! And thanks to @brkyvz and @cloud-fan for the reviews.

@asfgit closed this in 443904a on Jul 25, 2019
@jzhuge (Member, Author) commented Jul 25, 2019

Thanks @rdblue for covering for me while I was on vacation, in addition to being a reviewer!
Thanks @brkyvz and @cloud-fan for the reviews!
Thanks @brkyvz for committing the fix.

I will rebase PR #24980 "[SPARK-28178][SQL] DataSourceV2: DataFrameWriter.insertInto".
