
[BUG] MergeIntoCommand not visible in QueryExecutionListener when using Python/Scala API #1521

Closed
sh0ck-wave opened this issue Dec 13, 2022 · 9 comments · Fixed by #3456
Labels
bug Something isn't working

Comments

@sh0ck-wave

Bug

MergeIntoCommand not visible in QueryExecutionListener when using Python/Scala API to execute a merge operation

Describe the problem

When using a SQL MERGE statement via spark.sql, a LogicalPlan of type org.apache.spark.sql.delta.commands.MergeIntoCommand is visible to any registered Spark QueryExecutionListener; this is useful for capturing statistics and data lineage.
When using the Python API to execute the merge operation, no such LogicalPlan is visible to registered QueryExecutionListeners, making it difficult to track merge-related statistics and data lineage.

Steps to reproduce

Execute the following scala spark application:

package com.foo.bar

/* SimpleApp.scala */
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.sql.util.QueryExecutionListener
import org.apache.spark.sql.execution.QueryExecution
import io.delta.tables.DeltaTable

class MyListener extends QueryExecutionListener {

  override def onSuccess(
      funcName: String,
      qe: QueryExecution,
      durationNs: Long
  ): Unit = {
    println(s"Received Event from ${qe.analyzed.getClass}")
  }

  override def onFailure(
      funcName: String,
      qe: QueryExecution,
      exception: Exception
  ): Unit = {
    println(s"Received Failure Event from ${qe.analyzed.getClass}")
  }
}

case class Player(id: Integer, name: String)

object SimpleApp {
  def main(args: Array[String]) = {
    println(s"starting")

    val spark_conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("SimpleApp")
      .set("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .set(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog"
      )
    val spark = SparkSession.builder.config(spark_conf).getOrCreate()
    spark.sparkContext.setLogLevel("OFF")

    val df = spark.createDataFrame[Player](Seq(Player(1, "Quark")))
    val df1 = spark.createDataFrame[Player](Seq(Player(1, "Boson")))
    (
      df.write
        .mode("overwrite")
        .option("overwriteSchema", "true")
        .option("path", "/path/to/table1")
        .format("delta")
        .saveAsTable("base")
    )

    (
      df1.write
        .mode("overwrite")
        .option("overwriteSchema", "true")
        .option("path", "/path/to/table2")
        .format("delta")
        .saveAsTable("update")
    )

    spark.listenerManager.register(new MyListener())

    println("Captured plans for SQL MERGE:")

    spark.sql(
      """
      |MERGE INTO base
      |USING update
      |ON base.Id = update.Id
      |WHEN MATCHED THEN
      |    UPDATE SET *
      |WHEN NOT MATCHED THEN
      |    INSERT *
      """.stripMargin
    )
    Thread.sleep(5000)
    spark.listenerManager.clear()

    val base_table = DeltaTable.forPath(spark, "/path/to/table1")
    val update_table = DeltaTable.forPath(spark, "/path/to/table2")
    val merge = (
        base_table.alias("a")
        .merge(update_table.toDF.alias("b"), "a.Id == b.Id")
        .whenMatched().updateAll()
        .whenNotMatched().insertAll()
    )
    
    spark.listenerManager.register(new MyListener())
    println("")
    println("Captured plans for Delta API:")
    merge.execute()


    Thread.sleep(5000)
    spark.stop()
  }
}

Observed results

Captured plans for SQL MERGE:
Received Event from class org.apache.spark.sql.catalyst.plans.logical.Project
Received Event from class org.apache.spark.sql.catalyst.plans.logical.Project
Received Event from class org.apache.spark.sql.catalyst.plans.logical.Filter
Received Event from class org.apache.spark.sql.catalyst.plans.logical.Aggregate
Received Event from class org.apache.spark.sql.catalyst.plans.logical.Project
Received Event from class org.apache.spark.sql.catalyst.plans.logical.Filter
Received Event from class org.apache.spark.sql.catalyst.plans.logical.SerializeFromObject
Received Event from class org.apache.spark.sql.catalyst.plans.logical.GlobalLimit
Received Event from class org.apache.spark.sql.delta.commands.MergeIntoCommand

Captured plans for Delta API:
Received Event from class org.apache.spark.sql.catalyst.plans.logical.Project
Received Event from class org.apache.spark.sql.catalyst.plans.logical.Project
Received Event from class org.apache.spark.sql.catalyst.plans.logical.Filter
Received Event from class org.apache.spark.sql.catalyst.plans.logical.Aggregate
Received Event from class org.apache.spark.sql.catalyst.plans.logical.Project
Received Event from class org.apache.spark.sql.catalyst.plans.logical.Filter
Received Event from class org.apache.spark.sql.catalyst.plans.logical.SerializeFromObject
Received Event from class org.apache.spark.sql.catalyst.plans.logical.GlobalLimit

As can be seen, in the case of the Delta API there is no org.apache.spark.sql.delta.commands.MergeIntoCommand captured by the QueryExecutionListener.
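The difference between the two runs can also be checked mechanically. A small Python sketch (using the plan-class names from the listener output above, stripped of their package prefixes) that diffs the two captured lists:

```python
# Plan node classes captured by the listener, taken from the output above.
sql_merge_plans = [
    "Project", "Project", "Filter", "Aggregate", "Project",
    "Filter", "SerializeFromObject", "GlobalLimit", "MergeIntoCommand",
]
delta_api_plans = [
    "Project", "Project", "Filter", "Aggregate", "Project",
    "Filter", "SerializeFromObject", "GlobalLimit",
]

# Plans seen for SQL MERGE but not for the Delta API path.
missing = set(sql_merge_plans) - set(delta_api_plans)
print(missing)  # {'MergeIntoCommand'}
```

The only entry missing from the Delta API run is the MergeIntoCommand itself.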

Expected results

org.apache.spark.sql.delta.commands.MergeIntoCommand should be captured by the QueryExecutionListener for the Delta API, just as it is for the SQL MERGE command.

Environment information

  • Delta Lake version: 2.1.0
  • Spark version: 3.3.1
  • Scala version: 2.12.15
@sh0ck-wave sh0ck-wave added the bug Something isn't working label Dec 13, 2022
@sh0ck-wave sh0ck-wave changed the title [BUG] MergeIntoCommand not visible in QueryExecutionListener when using Python API [BUG] MergeIntoCommand not visible in QueryExecutionListener when using Python/Scala API Dec 13, 2022
@zsxwing
Member

zsxwing commented Dec 15, 2022

The issue is that we don't call Dataset.ofRows for MERGE; instead we invoke the command directly:

mergeIntoCommand.run(sparkSession)

There were some Spark issues preventing us from doing this. Would you be willing to help us try this out and see if those Spark issues have been fixed?
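As context for why this matters, here is a minimal, Spark-free Python sketch (all names hypothetical, illustrative only) of why running a command directly bypasses registered listeners, while routing it through an ofRows-style path, which mirrors Spark's Dataset.ofRows → QueryExecution → QueryExecutionListener chain, notifies them:

```python
# Illustrative toy model of Spark's listener dispatch -- not Spark code.
class ToyListenerManager:
    def __init__(self):
        self._listeners = []

    def register(self, listener):
        self._listeners.append(listener)

    def notify_success(self, plan_name):
        for listener in self._listeners:
            listener(plan_name)


class ToyMergeCommand:
    """A command whose run() does the work but never touches the listener bus."""
    def run(self):
        pass  # performs the merge


def of_rows(manager, command):
    """The Dataset.ofRows-style path: execute AND notify registered listeners."""
    command.run()
    manager.notify_success("MergeIntoCommand")


manager = ToyListenerManager()
seen = []
manager.register(seen.append)

ToyMergeCommand().run()              # direct run: listeners observe nothing
print(seen)                          # []

of_rows(manager, ToyMergeCommand())  # ofRows-style path: listeners observe it
print(seen)                          # ['MergeIntoCommand']
```

In the real code path, invoking mergeIntoCommand.run(sparkSession) directly is analogous to the first call: the work happens, but no QueryExecution is handed to the listener bus.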

@sh0ck-wave
Author

There were some Spark issues preventing us from doing this. Would you be willing to help us try this out and see if those Spark issues have been fixed?

Yes. Can you provide some guidance on how to reproduce these issues?

@zsxwing
Member

zsxwing commented Jan 4, 2023

Yes. Can you provide some guidance on how to reproduce these issues?

You can call toDataset on the merge command like

and see if any tests fail. If Spark hasn't fixed the issue, there will be failing tests.

@sherlockbeard
Contributor

sherlockbeard commented Mar 31, 2023

@zsxwing Spark has fixed the issue. spark issue
Previously, I made these changes in a pull request and the tests passed.
Should I create a pull request for this change?

@zsxwing
Member

zsxwing commented Apr 13, 2023

Should I create a pull request for this change?

Yep. Feel free to open a pull request.

@sh0ck-wave
Author

sh0ck-wave commented Jun 26, 2023

@sherlockbeard thanks for working on this item.
@zsxwing I can confirm that the PR solves this bug. Testing the PR, I was able to get the following output from the same sample app as above:

Captured plans for SQL MERGE:
Received Event from class org.apache.spark.sql.catalyst.plans.logical.Project
Received Event from class org.apache.spark.sql.catalyst.plans.logical.Project
Received Event from class org.apache.spark.sql.catalyst.plans.logical.Filter
Received Event from class org.apache.spark.sql.catalyst.plans.logical.Aggregate
Received Event from class org.apache.spark.sql.catalyst.plans.logical.Project
Received Event from class org.apache.spark.sql.catalyst.plans.logical.Filter
Received Event from class org.apache.spark.sql.catalyst.plans.logical.SerializeFromObject
Received Event from class org.apache.spark.sql.catalyst.plans.logical.Filter
Received Event from class org.apache.spark.sql.catalyst.plans.logical.SerializeFromObject
Received Event from class org.apache.spark.sql.catalyst.plans.logical.GlobalLimit
Received Event from class org.apache.spark.sql.delta.commands.MergeIntoCommand

Captured plans for Delta API:
Received Event from class org.apache.spark.sql.catalyst.plans.logical.Project
Received Event from class org.apache.spark.sql.catalyst.plans.logical.Project
Received Event from class org.apache.spark.sql.catalyst.plans.logical.Filter
Received Event from class org.apache.spark.sql.catalyst.plans.logical.Aggregate
Received Event from class org.apache.spark.sql.catalyst.plans.logical.Project
Received Event from class org.apache.spark.sql.catalyst.plans.logical.Filter
Received Event from class org.apache.spark.sql.catalyst.plans.logical.SerializeFromObject
Received Event from class org.apache.spark.sql.catalyst.plans.logical.Filter
Received Event from class org.apache.spark.sql.catalyst.plans.logical.SerializeFromObject
Received Event from class org.apache.spark.sql.catalyst.plans.logical.GlobalLimit
Received Event from class org.apache.spark.sql.delta.commands.MergeIntoCommand

@fluxquantum

@sh0ck-wave @sherlockbeard Hi, I appreciate the effort you put into researching and resolving this. Is there a timeline for when this fix will be merged? Or is there some check that's blocking its release/approval?

@tdas tdas closed this as completed in #3456 Aug 1, 2024
tdas pushed a commit that referenced this issue Aug 1, 2024
## Description
Due to Spark's unfortunate behavior of resolving plan nodes it doesn't
know, the `DeltaMergeInto` plan created when using the MERGE Scala API
needs to be manually resolved to ensure Spark doesn't interfere with its
analysis.

This currently completely bypasses Spark's analysis, as we then manually
execute the MERGE command, which has negative effects: e.g. the execution
is not visible in QueryExecutionListener.

This change addresses the issue by executing the plan using the
DataFrame API after it is manually resolved, so that the command goes
through the regular code path.

Resolves #1521
## How was this patch tested?
Covered by existing tests.
@johanl-db
Collaborator

I picked up the change from @sherlockbeard, ran some more tests and merged it: #3456
cc @fluxquantum

@fluxquantum

Hi @johanl-db, really appreciate the update, and awesome quick turnaround. I know I posed this question in an earlier thread: is there a version of the library I can pull as a patch for now, or will I need to wait for a major release?

vkorukanti pushed a commit to vkorukanti/delta that referenced this issue Aug 20, 2024
(cherrypick of delta-io#3456)

vkorukanti added a commit that referenced this issue Aug 20, 2024
(cherrypick of #3456)


Co-authored-by: Johan Lasperas <johan.lasperas@databricks.com>