[BUG] MergeIntoCommand not visible in QueryExecutionListener when using Python/Scala API #1521
Comments
The issue is that we don't call Dataset.ofRows for Merge.
There were some Spark issues preventing us from doing this. Would you be willing to help us try it out and see whether these Spark issues have been fixed?
Yes. Can you provide some guidance on how to reproduce these issues?
You can call toDataset on the merge command.
@zsxwing Spark has fixed the issue; see the Spark issue.
Yep. Feel free to open a pull request.
@sherlockbeard thanks for working on this item.
@sh0ck-wave @sherlockbeard Hi, I appreciate the effort you put into researching and resolving this. Is there a timeline for when this fix can be merged? Or is there a check that's blocking its release/approval?
## Description
Because Spark unfortunately tries to resolve plan nodes it doesn't know, the `DeltaMergeInto` plan created when using the MERGE Scala API needs to be manually resolved to ensure Spark doesn't interfere with its analysis. This currently bypasses Spark's analysis completely, as we then manually execute the MERGE command, which has negative effects, e.g. the execution is not visible in QueryExecutionListener.
This change addresses the issue by executing the plan using the DataFrame API after it has been manually resolved, so that the command goes through the regular code path.
Resolves #1521
## How was this patch tested?
Covered by existing tests.
I picked up the change from @sherlockbeard, ran some more tests and merged it: #3456
Hi johanl, really appreciate the update. Awesome on the quick turnaround. I know I posed this question in an earlier thread: is there a version of the library I can pull as a patch for now, or will I need to wait for a major release?
(cherrypick of #3456)
Because Spark unfortunately tries to resolve plan nodes it doesn't know, the `DeltaMergeInto` plan created when using the MERGE Scala API needs to be manually resolved to ensure Spark doesn't interfere with its analysis. This currently bypasses Spark's analysis completely, as we then manually execute the MERGE command, which has negative effects, e.g. the execution is not visible in QueryExecutionListener.
This change addresses the issue by executing the plan using the DataFrame API after it has been manually resolved, so that the command goes through the regular code path.
Resolves #1521
Covered by existing tests.
Co-authored-by: Johan Lasperas <johan.lasperas@databricks.com>
Bug
MergeIntoCommand not visible in QueryExecutionListener when using Python/Scala API to execute a merge operation
Describe the problem
When using the SQL MERGE statement via `spark.sql`, a LogicalPlan of type `org.apache.spark.sql.delta.commands.MergeIntoCommand` is visible to any registered Spark QueryExecutionListener. This is useful for capturing statistics and data lineage.
When using the Python API to execute the merge operation, no such LogicalPlan is visible to registered Spark QueryExecutionListeners, making it difficult to track merge-related statistics and data lineage.
Steps to reproduce
Execute the following scala spark application:
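The application from the original report is not reproduced in this copy of the thread. The following is only a minimal sketch of what such a reproduction might look like, assuming Spark 3.x with the Delta Lake library on the classpath. The names `PlanLoggingListener` and `MergeListenerRepro`, the table path, and the sample data are hypothetical and not taken from the issue.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

// Hypothetical listener: prints the class of every logical plan it is handed.
class PlanLoggingListener extends QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    println(s"[$funcName] ${qe.logical.getClass.getName}")

  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
    println(s"[$funcName] failed: ${exception.getMessage}")
}

object MergeListenerRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("merge-listener-repro")
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()
    import spark.implicits._

    // Assumed scratch location for the target Delta table.
    val path = "/tmp/merge_listener_repro"
    Seq((1, "a"), (2, "b")).toDF("id", "value")
      .write.format("delta").mode("overwrite").save(path)
    Seq((2, "B"), (3, "c")).toDF("id", "value")
      .createOrReplaceTempView("updates")

    // Register the listener only after the setup writes, to keep the output short.
    spark.listenerManager.register(new PlanLoggingListener)

    // Path 1: MERGE issued as a SQL statement through spark.sql.
    spark.sql(
      s"""MERGE INTO delta.`$path` AS t
         |USING updates AS s
         |ON t.id = s.id
         |WHEN MATCHED THEN UPDATE SET *
         |WHEN NOT MATCHED THEN INSERT *""".stripMargin)

    // Path 2: the same MERGE issued through the DeltaTable API.
    DeltaTable.forPath(spark, path).as("t")
      .merge(spark.table("updates").as("s"), "t.id = s.id")
      .whenMatched().updateAll()
      .whenNotMatched().insertAll()
      .execute()

    spark.stop()
  }
}
```

Comparing the plan class names printed for the two paths is one way to check whether the API-based merge reaches the listener in a given Delta version.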
Observed results
As can be seen, in the case of the Delta API there is no `org.apache.spark.sql.delta.commands.MergeIntoCommand` captured by the QueryExecutionListener.
Expected results
`org.apache.spark.sql.delta.commands.MergeIntoCommand` should be captured by the QueryExecutionListener for the Delta API, similar to the SQL MERGE command.
Environment information