MERGE and Deletion Vectors
jaceklaskowski committed Jun 9, 2024
1 parent 56a76cb commit 3adc8b5
Showing 3 changed files with 25 additions and 6 deletions.
24 changes: 19 additions & 5 deletions docs/commands/merge/ClassicMergeExecutor.md
@@ -293,17 +293,23 @@ writeAllChanges(
   spark: SparkSession,
   deltaTxn: OptimisticTransaction,
   filesToRewrite: Seq[AddFile],
-  deduplicateCDFDeletes: DeduplicateCDFDeletes): Seq[FileAction]
+  deduplicateCDFDeletes: DeduplicateCDFDeletes,
+  writeUnmodifiedRows: Boolean): Seq[FileAction]
```

!!! note "Change Data Feed"
    `writeAllChanges` acts differently with or without [Change Data Feed](../../change-data-feed/index.md) enabled.

!!! note "Deletion Vectors"
    The `writeUnmodifiedRows` input flag is disabled (`false`) to indicate that [Deletion Vectors](../../deletion-vectors/index.md) should be used (with [shouldWritePersistentDeletionVectors](MergeIntoCommandBase.md#shouldWritePersistentDeletionVectors) enabled).

    Unmodified rows do not have to be written out, so `writeAllChanges` can perform stricter joins.
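As a rough illustration (a sketch, not the actual Delta Lake sources), a caller could derive the flag from [shouldWritePersistentDeletionVectors](MergeIntoCommandBase.md#shouldWritePersistentDeletionVectors):

```scala
// Sketch only: `mergeCommand` is a hypothetical reference to the MERGE
// command; the other names come from the signature above. With persistent
// Deletion Vectors, unmodified rows stay in their original files
// (soft-deleted via DVs), so they need not be rewritten.
val useDVs = mergeCommand.shouldWritePersistentDeletionVectors(spark, deltaTxn)
val actions: Seq[FileAction] = writeAllChanges(
  spark,
  deltaTxn,
  filesToRewrite,
  deduplicateCDFDeletes,
  writeUnmodifiedRows = !useDVs)
```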

`writeAllChanges` [records this merge operation](MergeIntoCommandBase.md#recordMergeOperation) with the following:

Property | Value
---------|------
-`extraOpType` | <ul><li>**writeAllUpdatesAndDeletes** for [shouldOptimizeMatchedOnlyMerge](MergeIntoCommandBase.md#shouldOptimizeMatchedOnlyMerge)</li><li>**writeAllChanges** otherwise</li></ul>
+`extraOpType` | <ul><li>**writeModifiedRowsOnly** for `writeUnmodifiedRows` disabled</li><li>**writeAllUpdatesAndDeletes** for [shouldOptimizeMatchedOnlyMerge](MergeIntoCommandBase.md#shouldOptimizeMatchedOnlyMerge)</li><li>**writeAllChanges** otherwise</li></ul>
`status` | **MERGE operation - Rewriting [filesToRewrite] files**
`sqlMetricName` | [rewriteTimeMs](MergeIntoCommandBase.md#rewriteTimeMs)
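The `extraOpType` selection could be sketched as follows (a simplified rendering of the table above, assuming the flag names used on this page):

```scala
// Sketch: the checks are evaluated in this order.
val extraOpType: String =
  if (!writeUnmodifiedRows) "writeModifiedRowsOnly"  // Deletion Vectors in use
  else if (shouldOptimizeMatchedOnlyMerge) "writeAllUpdatesAndDeletes"
  else "writeAllChanges"
```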

@@ -321,10 +327,18 @@ Property | Value

`writeAllChanges` creates a `DataFrame` for the [target plan](#buildTargetPlanWithFiles) with the given [AddFile](../../AddFile.md)s to rewrite (`filesToRewrite`) (and no `columnsToDrop`).

-`writeAllChanges` determines the join type based on [shouldOptimizeMatchedOnlyMerge](MergeIntoCommandBase.md#shouldOptimizeMatchedOnlyMerge):
-
-* `rightOuter` when enabled
-* `fullOuter` otherwise
+`writeAllChanges` determines the join type.
+With `writeUnmodifiedRows` enabled (`true`), the join type is as follows:
+
+1. `rightOuter` for [shouldOptimizeMatchedOnlyMerge](MergeIntoCommandBase.md#shouldOptimizeMatchedOnlyMerge) enabled
+1. `fullOuter` otherwise
+
+With `writeUnmodifiedRows` disabled (`false`), the join type is as follows (in that order):
+
+1. `inner` for `isMatchedOnly` enabled
+1. `leftOuter` for no `notMatchedBySourceClauses`
+1. `rightOuter` for no `notMatchedClauses`
+1. `fullOuter` otherwise
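A Scala-like sketch of the selection above (names as on this page, not the exact sources):

```scala
// Sketch: with unmodified rows not written out (Deletion Vectors in use),
// stricter join types can be chosen, in this order of checks.
val joinType: String =
  if (writeUnmodifiedRows) {
    if (shouldOptimizeMatchedOnlyMerge) "rightOuter" else "fullOuter"
  }
  else if (isMatchedOnly) "inner"
  else if (notMatchedBySourceClauses.isEmpty) "leftOuter"
  else if (notMatchedClauses.isEmpty) "rightOuter"
  else "fullOuter"
```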

??? note "`shouldOptimizeMatchedOnlyMerge` Used Twice"
[shouldOptimizeMatchedOnlyMerge](MergeIntoCommandBase.md#shouldOptimizeMatchedOnlyMerge) is used twice for the following:
4 changes: 4 additions & 0 deletions docs/configuration-properties/index.md
@@ -342,6 +342,10 @@ Default: `50`

Default: `true`

Used when:

* `MergeIntoCommandBase` is requested to [shouldWritePersistentDeletionVectors](../commands/merge/MergeIntoCommandBase.md#shouldWritePersistentDeletionVectors)
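A hypothetical sketch of such a check (the property name is truncated in this diff excerpt, so `dvPropertyKey` below is a stand-in; only `SparkSession.conf.get` with a default value is assumed):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch, not the Delta Lake sources: a boolean configuration
// property (default `true`, per the excerpt above) gates persistent
// Deletion Vectors in MERGE.
def shouldWritePersistentDeletionVectors(
    spark: SparkSession,
    dvPropertyKey: String): Boolean =
  spark.conf.get(dvPropertyKey, "true").toBoolean
```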

### <span id="MERGE_MATERIALIZE_SOURCE"> merge.materializeSource { #merge.materializeSource }

**spark.databricks.delta.merge.materializeSource**
3 changes: 2 additions & 1 deletion docs/features/index.md
@@ -10,6 +10,7 @@
* [Column Statistics](../column-statistics/index.md)
* [Commands](../commands/index.md)
* [Data Skipping](../data-skipping/index.md)
* [Deletion Vectors](../deletion-vectors/index.md)
* [Delta SQL](../sql/index.md)
* [Developer API](../DeltaTable.md)
* [Generated Columns](../generated-columns/index.md)
@@ -25,7 +26,7 @@ Delta Lake can run with other execution engines like [Trino](https://trino.io/do

Delta tables can be registered in a table catalog. Delta Lake creates a transaction log at the root directory of a table, and the catalog stores nothing but the table format and the location of the table. All table properties, schema, and partitioning information live in the transaction log to avoid a "split brain" situation ([Wikipedia](https://en.wikipedia.org/wiki/Split-brain_(computing))).
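For example (illustrative paths only), creating a Delta table puts everything but the location into the transaction log under the table root:

```scala
// The catalog records just the format and location; schema, partitioning
// and table properties land in the _delta_log transaction log.
spark.sql("CREATE TABLE demo (id BIGINT) USING delta LOCATION '/tmp/demo'")
// The first commit is recorded at:
//   /tmp/demo/_delta_log/00000000000000000000.json
```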

-Delta Lake {{ delta.version }} supports Apache Spark {{ spark.version }} (cf. [build.sbt]({{ delta.github }}/build.sbt#L38)).
+Delta Lake {{ delta.version }} supports Apache Spark {{ spark.version }} (cf. [build.sbt]({{ delta.github }}/build.sbt#L37)).

## Delta Tables

