
[SPARK-27845][SQL] DataSourceV2: InsertTable #24832

Closed · wants to merge 5 commits

Conversation

@jzhuge (Member) commented Jun 10, 2019

What changes were proposed in this pull request?

Support multiple catalogs in the following InsertTable use cases:

  • INSERT INTO [TABLE] catalog.db.tbl
  • INSERT OVERWRITE TABLE catalog.db.tbl

Support matrix:

| Overwrite | Partitioned Table | Partition Clause | Partition Overwrite Mode | Action |
|-----------|-------------------|------------------------------|---------|----------------------------------------|
| false | * | * | * | AppendData |
| true | no | (empty) | * | OverwriteByExpression(true) |
| true | yes | p1,p2 or p1 or p2 or (empty) | STATIC | OverwriteByExpression(true) |
| true | yes | p1,p2 or p1 or p2 or (empty) | DYNAMIC | OverwritePartitionsDynamic |
| true | yes | p1=23,p2=3 | * | OverwriteByExpression(p1=23 and p2=3) |
| true | yes | p1=23,p2 or p1=23 | STATIC | OverwriteByExpression(p1=23) |
| true | yes | p1=23,p2 or p1=23 | DYNAMIC | OverwritePartitionsDynamic |

Notes:

  • Assume the partitioned table has 2 partitions: p1 and p2.
  • STATIC is the default Partition Overwrite Mode for data source tables.
  • DSv2 tables currently do not support IfPartitionNotExists.
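For concreteness, a minimal sketch of how rows in the matrix above map to SQL. The catalog, namespace, and source names (`testcat.db.tbl`, `src`) are invented for illustration:

```scala
// Hypothetical session sketch; table and source names are invented.
// No OVERWRITE always becomes AppendData.
spark.sql("INSERT INTO testcat.db.tbl SELECT * FROM src")

// Overwrite with a fully static PARTITION clause:
// OverwriteByExpression(p1 = 23 AND p2 = 3), regardless of overwrite mode.
spark.sql("INSERT OVERWRITE TABLE testcat.db.tbl PARTITION (p1 = 23, p2 = 3) SELECT * FROM src")

// The same overwrite without static values, under dynamic overwrite mode:
// OverwritePartitionsDynamic.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
spark.sql("INSERT OVERWRITE TABLE testcat.db.tbl PARTITION (p1, p2) SELECT * FROM src")
```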

How was this patch tested?

New tests.
All existing catalyst and sql/core tests.

@jzhuge changed the title from "Spark 27845 pr" to "[SPARK-27845][SQL][WIP] DataSourceV2: Insert into tables in multiple catalogs" on Jun 10, 2019
@SparkQA commented Jun 10, 2019

Test build #106359 has finished for PR 24832 at commit e6363a6.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class InsertTableStatement(

```diff
@@ -274,7 +284,7 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging
   override def visitInsertOverwriteTable(
       ctx: InsertOverwriteTableContext): InsertTableParams = withOrigin(ctx) {
     assert(ctx.OVERWRITE() != null)
-    val tableIdent = visitTableIdentifier(ctx.tableIdentifier)
+    val tableIdent = visitMultipartIdentifier(ctx.multipartIdentifier)
```
Contributor commented:
I think this needs to be updated to remove the ParseException thrown when IF NOT EXISTS is present and there are dynamic partitions. I think that is an analysis problem, not a parse problem.

Also, I don't see a reason why IF NOT EXISTS would not be supported with dynamic partitions. Wouldn't that fail if any partitions would be overwritten? It seems to make sense to me, but maybe there is a good reason why this is not allowed? @gatorsmile can you comment?

Contributor commented:
We discussed this in the DSv2 sync last night and decided to add a method to the write builder to pass this IF NOT EXISTS flag. This will be done in a follow-up to avoid over-complicating this commit.
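Purely as a sketch of that follow-up idea (the trait and method names here are invented; the actual API was deferred to the later PR):

```scala
// A capability-style hook on the write builder: sources that can validate
// partition existence override this; everything else rejects the flag up front.
trait WriteBuilderWithIfNotExists {
  // Hypothetical: invoked when INSERT ... IF NOT EXISTS targets this table.
  def ifPartitionNotExists(): Unit =
    throw new UnsupportedOperationException("IF NOT EXISTS is not supported")
}
```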

@jzhuge changed the title from "[SPARK-27845][SQL][WIP] DataSourceV2: Insert into tables in multiple catalogs" to "[SPARK-27845][SQL][WIP] DataSourceV2: InsertTable" on Jun 11, 2019
@jzhuge (Member, Author) commented Jun 13, 2019

Rebase and squash

@SparkQA commented Jun 13, 2019

Test build #106455 has finished for PR 24832 at commit 2c17ced.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class InsertTableStatement(

@SparkQA commented Jun 27, 2019

Test build #106950 has finished for PR 24832 at commit 2f8ada2.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class InsertTableStatement(

@SparkQA commented Jun 27, 2019

Test build #106951 has finished for PR 24832 at commit fe4b24e.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class InsertTableStatement(

@jzhuge (Member, Author) commented Jun 27, 2019

Will do DataFrameWriter.insertInto in a separate PR, so this PR is no longer WIP.

@jzhuge changed the title from "[SPARK-27845][SQL][WIP] DataSourceV2: InsertTable" to "[SPARK-27845][SQL] DataSourceV2: InsertTable" on Jun 27, 2019
@jzhuge (Member, Author) commented Jun 27, 2019

Will look into supporting IfPartitionNotExists flag with DSv2 tables in follow-up PR.

@SparkQA commented Jun 27, 2019

Test build #106952 has finished for PR 24832 at commit c9c9bc1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```diff
@@ -38,7 +38,7 @@ import org.apache.spark.sql.catalyst.expressions.aggregate.{First, Last}
 import org.apache.spark.sql.catalyst.parser.SqlBaseParser._
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
-import org.apache.spark.sql.catalyst.plans.logical.sql.{AlterTableAddColumnsStatement, AlterTableAlterColumnStatement, AlterTableDropColumnsStatement, AlterTableRenameColumnStatement, AlterTableSetLocationStatement, AlterTableSetPropertiesStatement, AlterTableUnsetPropertiesStatement, AlterViewSetPropertiesStatement, AlterViewUnsetPropertiesStatement, CreateTableAsSelectStatement, CreateTableStatement, DropTableStatement, DropViewStatement, QualifiedColType}
+import org.apache.spark.sql.catalyst.plans.logical.sql._
```
Contributor commented:
Nit: this is a bad practice because it can cause git conflicts and pollutes the namespace.

It's probably okay here because there are only logical plans in that package, but in other places this causes problems when it imports packages as well as classes.

@jzhuge force-pushed the SPARK-27845-pr branch 2 times, most recently from 7daa572 to 6c33ac3 on July 3, 2019 00:49
@SparkQA commented Jul 3, 2019

Test build #107141 has finished for PR 24832 at commit 6c33ac3.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class InsertTableStatement(

```scala
private class Overwrite(filters: Array[Filter]) extends TestBatchWrite {
  override def commit(messages: Array[WriterCommitMessage]): Unit = dataMap.synchronized {
    val deleteKeys = dataMap.keys.filter { partValues =>
      filters.exists {
```
Contributor commented:
Looks like this matches a key if any value matches a filter expression. exists Scaladoc says "Tests whether a predicate holds for at least one value", so this is implementing an OR of all the filters, but the desired behavior is an AND of all the filters.

Contributor commented:
this should be a forall
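To illustrate the difference being pointed out, here is a standalone sketch (not the test code itself) of `exists` versus `forall` over a set of filters:

```scala
// exists implements OR across the filters; forall implements the intended AND.
val filters: Seq[Int => Boolean] = Seq(_ > 0, _ % 2 == 0)
val keys = Seq(-2, 1, 4)

keys.filter(k => filters.exists(f => f(k)))  // Seq(-2, 1, 4): any single match selects the key
keys.filter(k => filters.forall(f => f(k)))  // Seq(4): every filter must match

// With exists, Overwrite would delete every partition matched by ANY filter,
// rather than only those matched by ALL filters.
```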

```scala
  override def abort(messages: Array[WriterCommitMessage]): Unit = {

private object TruncateAndAppend extends TestBatchWrite {
  override def commit(messages: Array[WriterCommitMessage]): Unit = dataMap.synchronized {
    dataMap = mutable.Map.empty
```
Contributor commented:
This should use dataMap.clear instead of re-assigning because this is synchronized on the original dataMap instance. After reassignment, another thread will be able to enter a synchronized block on the new instance.

Contributor commented:
@rdblue You forgot to address this?

Contributor commented:
Thanks, I'll fix it.
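A standalone sketch of the race being described (the variable name mirrors the test code; the two methods are hypothetical):

```scala
import scala.collection.mutable

var dataMap: mutable.Map[Int, String] = mutable.Map.empty

// Buggy: reassignment swaps in a new monitor object. A thread that later
// synchronizes on dataMap locks the NEW instance, so it no longer excludes
// a writer still holding the OLD instance's lock.
def truncateBuggy(): Unit = dataMap.synchronized {
  dataMap = mutable.Map.empty
}

// Fixed: clear() mutates the same instance, so every thread keeps
// synchronizing on the one monitor.
def truncateFixed(): Unit = dataMap.synchronized {
  dataMap.clear()
}
```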

```scala
val staticPartitionProjectList = {
  // check that the data column counts match
  val numColumns = table.output.size
  if (numColumns > staticPartitions.size + i.query.output.size) {
```
Contributor commented:
The ResolveOutputRelation rule already checks this and produces a more useful error message: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L2336-L2339

I think this rule should simply transform the query assuming that it will work, and let ResolveOutputRelation ensure that the types and number of columns align.

Contributor commented:
To do this, I think you just need to add all remaining query columns from the iterator after table.output is exhausted.

Contributor commented:
I agree, it'd be great if ResolveOutputRelation does all the checking and necessary casting. It seems to already have most of the logic built in. This method can then just look at the partition values and convert to AppendData or OverwriteByExpression.
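A possible shape for that approach, sketched with plain strings instead of Catalyst expressions (the helper name and details are invented):

```scala
// Splice static partition values in by table-column order; draw everything
// else from the query output; leave count/type checks to ResolveOutputRelation.
def projectWithStaticPartitions(
    tableColumns: Seq[String],
    staticPartitions: Map[String, String],
    queryColumns: Seq[String]): Seq[String] = {
  val fromQuery = queryColumns.iterator
  val projected = tableColumns.map { name =>
    staticPartitions.get(name) match {
      case Some(value) => s"lit($value) AS $name"        // constant for a static partition
      case None if fromQuery.hasNext => fromQuery.next() // next query column, in order
      case None => name                                  // missing column: let the analyzer report it
    }
  }
  // Leftover query columns are appended so ResolveOutputRelation can report
  // "too many data columns" instead of this rule failing early.
  projected ++ fromQuery
}
```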

```scala
// ifPartitionNotExists is append with validation, but validation is not supported
if (i.ifPartitionNotExists) {
  throw new AnalysisException(
    s"Cannot write, IF NOT EXISTS is not supported for table: ${table.table.name}")
```
Contributor commented:
This uses table to refer to a DataSourceV2Relation, which causes this awkward reference because the relation has an actual table: table.table.name. It would be better to call the relation rel or relation.

```scala
  throw new AnalysisException(s"Cannot write: not enough columns")
}

val staticNames = staticPartitions.keySet
```
Contributor commented:
This should validate that staticPartitions are all names for columns used in identity partitions. It is not valid to supply a static partition value for a non-partition column and it is not allowed to supply a static partition value for transform-derived columns.
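A hedged sketch of such a check (the shapes of `Transform`/`FieldReference` are assumed from the imports shown elsewhere in this PR):

```scala
// Only columns backed by identity transforms may receive static values;
// non-partition columns and transform-derived columns are rejected.
val identityPartitionNames: Set[String] = table.partitioning.collect {
  case IdentityTransform(FieldReference(Seq(name))) => name
}.toSet

val invalid = staticPartitions.keySet.filterNot(identityPartitionNames.contains)
if (invalid.nonEmpty) {
  throw new AnalysisException(
    s"Cannot use static values for non-identity partition columns: ${invalid.mkString(", ")}")
}
```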

```scala
  conf.partitionOverwriteMode == PartitionOverwriteMode.DYNAMIC

val query =
  if (staticPartitions.isEmpty) {
```
Contributor commented:
If this is true, then this should avoid building the staticPartitionProjectList. I'd recommend refactoring that into a method, or moving it into the else block.
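For instance, a sketch of the suggested restructuring (`buildStaticPartitionProjectList` is a hypothetical extraction of the block above into a method):

```scala
// Move the projection construction into the branch where it is needed,
// so nothing is built when there are no static partitions.
val query = if (staticPartitions.isEmpty) {
  i.query
} else {
  Project(buildStaticPartitionProjectList(table, i.query, staticPartitions), i.query)
}
```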

@SparkQA commented Jul 16, 2019

Test build #107707 has finished for PR 24832 at commit df61fca.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue force-pushed the SPARK-27845-pr branch 2 times, most recently from faa2e85 to a67aef7 on July 16, 2019 00:43
@rdblue (Contributor) commented Jul 16, 2019

@brkyvz, I've updated this PR since John is out on vacation. Could you have another look?

@SparkQA commented Jul 16, 2019

Test build #107708 has finished for PR 24832 at commit a67aef7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 17, 2019

Test build #107766 has finished for PR 24832 at commit e72e656.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 17, 2019

Test build #107799 has finished for PR 24832 at commit cf74d67.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 18, 2019

Test build #107800 has finished for PR 24832 at commit 0f8aa32.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 23, 2019

Test build #108065 has finished for PR 24832 at commit c2824d3.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 24, 2019

Test build #108118 has finished for PR 24832 at commit 97dc04c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@brkyvz (Contributor) left a comment:
This LGTM. I have some very minor comments around the parser changes.

```diff
-    : INSERT OVERWRITE TABLE tableIdentifier (partitionSpec (IF NOT EXISTS)?)?      #insertOverwriteTable
-    | INSERT INTO TABLE? tableIdentifier partitionSpec?                             #insertIntoTable
+    : INSERT OVERWRITE TABLE? multipartIdentifier (partitionSpec (IF NOT EXISTS)?)? #insertOverwriteTable
+    | INSERT INTO TABLE? multipartIdentifier partitionSpec? (IF NOT EXISTS)?        #insertIntoTable
```
Contributor commented:
Do we need to wrap with parentheses, as in (partitionSpec (IF NOT EXISTS)?)? above? Otherwise, what happens if there is no partitionSpec but IF NOT EXISTS is present?

  • What if the table does not exist? Then wouldn't that be CTAS?

Contributor commented:
It isn't supported either way, so why combine the two?

Contributor commented:
Got it.

```scala
val partitionKeys = Option(ctx.partitionSpec).map(visitPartitionSpec).getOrElse(Map.empty)

if (ctx.EXISTS != null) {
```
Contributor commented:
what's the point of adding this to the parser, if we're not going to support it?

Contributor commented:
For a better error message that is testable. Before, there were no tests for this case and the error message only listed expected symbols.

Contributor commented:
Also, since the PARTITION clause is optional for the above case, it shouldn't group the two together either. It is semantically incorrect because a write to a partitioned table is always a partitioned write.

```scala
      case _ =>
        throw new IllegalArgumentException(s"Unknown filter attribute: $attr")
    }
  case f @ _ =>
```
Contributor commented:
nit, no need for @ _


```diff
@@ -23,17 +23,19 @@ import scala.collection.mutable

 import org.apache.spark.sql.{AnalysisException, SaveMode}
 import org.apache.spark.sql.catalog.v2.{CatalogPlugin, Identifier, LookupCatalog, TableCatalog}
-import org.apache.spark.sql.catalog.v2.expressions.Transform
+import org.apache.spark.sql.catalog.v2.expressions.{FieldReference, IdentityTransform, Transform}
```
Contributor commented:
are any of the changes here needed?

Contributor commented:
Looks like there were unused imports. I'll commit a fix.

@cloud-fan (Contributor) commented Jul 25, 2019:
I think (partitionSpec (IF NOT EXISTS)?)? is better? INSERT INTO TABLE ... IF NOT EXISTS doesn't make sense. The IF NOT EXISTS is only for partitions.

Contributor commented:
Actually, IF NOT EXISTS doesn't make sense for partitions either. It's an append, not an overwrite, and it seems weird to me that we couldn't append to an existing partition.

Contributor commented:
Can we keep this unchanged and not add IF NOT EXISTS here?

Contributor commented:
This was changed to get a better error message. Instead of a parse exception that lists symbols, this is now a useful error message with a test.

```diff
 val partitionKeys = Option(ctx.partitionSpec).map(visitPartitionSpec).getOrElse(Map.empty)

 val dynamicPartitionKeys: Map[String, Option[String]] = partitionKeys.filter(_._2.isEmpty)
 if (ctx.EXISTS != null && dynamicPartitionKeys.nonEmpty) {
-  throw new ParseException(s"Dynamic partitions do not support IF NOT EXISTS. Specified " +
-    "partitions with value: " + dynamicPartitionKeys.keys.mkString("[", ",", "]"), ctx)
+  operationNotAllowed("IF NOT EXISTS with dynamic partitions: " +
```
Contributor commented:
why do we change the error message here?

Contributor commented:
This uses operationNotAllowed instead of throwing a custom ParseException, like other methods that do not allow specific combinations. I think it's a good idea to standardize on the existing helper. It also makes the error message consistent with the others, clearly stating what is not allowed.
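For reference, the existing helper has roughly this shape (paraphrased from ParserUtils; exact wording assumed):

```scala
import org.antlr.v4.runtime.ParserRuleContext
import org.apache.spark.sql.catalyst.parser.ParseException

// Every disallowed combination funnels through one helper, so all such
// errors share the "Operation not allowed: ..." prefix.
def operationNotAllowed(message: String, ctx: ParserRuleContext): Nothing = {
  throw new ParseException(s"Operation not allowed: $message", ctx)
}
```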

```scala
  assert(exc.getMessage.contains("p2"))
}

test("insert table: if not exists without overwrite fails") {
```
Contributor commented:
@cloud-fan, @brkyvz, this is the test that required adding the IF NOT EXISTS to INSERT INTO. I think it is better to have a good error message instead of relying on not being able to parse the statement.
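The test in question has roughly this shape (the table name, exception type, and exact message are assumptions for illustration):

```scala
test("insert table: if not exists without overwrite fails") {
  // The parser now accepts the clause, so the failure is a clear
  // "Operation not allowed" error instead of a list of expected symbols.
  val exc = intercept[ParseException] {
    sql("INSERT INTO TABLE testcat.db.tbl IF NOT EXISTS SELECT * FROM source")
  }
  assert(exc.getMessage.contains("IF NOT EXISTS"))
}
```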

@brkyvz (Contributor) commented Jul 25, 2019

ResolveOutputRelation now does the safe casting, correct?

@rdblue (Contributor) commented Jul 25, 2019

@brkyvz, that's correct. Tests also validate that the error messages for too many or too few columns are the ones from ResolveOutputRelation.

@SparkQA commented Jul 25, 2019

Test build #108178 has finished for PR 24832 at commit 7f193ca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@brkyvz (Contributor) commented Jul 25, 2019

Magic 💯

@brkyvz (Contributor) commented Jul 25, 2019

Merging to master!

@rdblue (Contributor) commented Jul 25, 2019

Thanks for fixing this @jzhuge! And thanks to @brkyvz and @cloud-fan for the reviews.

@asfgit closed this in 443904a on Jul 25, 2019
@jzhuge (Member, Author) commented Jul 25, 2019

Thanks @rdblue for covering for me while I was on vacation, in addition to being a reviewer!
Thanks @brkyvz and @cloud-fan for the reviews!
Thanks @brkyvz for committing the fix.

I will rebase PR #24980 "[SPARK-28178][SQL] DataSourceV2: DataFrameWriter.insertInto".
