MERGE and Deletion Vectors
jaceklaskowski committed Jun 9, 2024
1 parent 56a76cb commit 3adc8b5
Showing 3 changed files with 25 additions and 6 deletions.
24 changes: 19 additions & 5 deletions docs/commands/merge/ClassicMergeExecutor.md
@@ -293,17 +293,23 @@ writeAllChanges(
   spark: SparkSession,
   deltaTxn: OptimisticTransaction,
   filesToRewrite: Seq[AddFile],
-  deduplicateCDFDeletes: DeduplicateCDFDeletes): Seq[FileAction]
+  deduplicateCDFDeletes: DeduplicateCDFDeletes,
+  writeUnmodifiedRows: Boolean): Seq[FileAction]
```

!!! note "Change Data Feed"
    `writeAllChanges` acts differently with or without [Change Data Feed](../../change-data-feed/index.md) enabled.

!!! note "Deletion Vectors"
    The `writeUnmodifiedRows` input flag is disabled (`false`) to indicate that [Deletion Vectors](../../deletion-vectors/index.md) should be used (with [shouldWritePersistentDeletionVectors](MergeIntoCommandBase.md#shouldWritePersistentDeletionVectors) enabled).

    Unmodified rows do not have to be written out, so `writeAllChanges` can perform stricter joins.
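As a rough illustration (a sketch, not the actual Delta Lake sources), a caller could derive the flag from [shouldWritePersistentDeletionVectors](MergeIntoCommandBase.md#shouldWritePersistentDeletionVectors):

```scala
// Sketch only: `mergeCommand` is a hypothetical reference to the MERGE
// command; the other names come from the signature above. With persistent
// Deletion Vectors, unmodified rows stay in their original files
// (soft-deleted via DVs), so they need not be rewritten.
val useDVs = mergeCommand.shouldWritePersistentDeletionVectors(spark, deltaTxn)
val actions: Seq[FileAction] = writeAllChanges(
  spark,
  deltaTxn,
  filesToRewrite,
  deduplicateCDFDeletes,
  writeUnmodifiedRows = !useDVs)
```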

`writeAllChanges` [records this merge operation](MergeIntoCommandBase.md#recordMergeOperation) with the following:

Property | Value
---------|------
-`extraOpType` | <ul><li>**writeAllUpdatesAndDeletes** for [shouldOptimizeMatchedOnlyMerge](MergeIntoCommandBase.md#shouldOptimizeMatchedOnlyMerge)</li><li>**writeAllChanges** otherwise</li></ul>
+`extraOpType` | <ul><li>**writeModifiedRowsOnly** for `writeUnmodifiedRows` disabled</li><li>**writeAllUpdatesAndDeletes** for [shouldOptimizeMatchedOnlyMerge](MergeIntoCommandBase.md#shouldOptimizeMatchedOnlyMerge)</li><li>**writeAllChanges** otherwise</li></ul>
`status` | **MERGE operation - Rewriting [filesToRewrite] files**
`sqlMetricName` | [rewriteTimeMs](MergeIntoCommandBase.md#rewriteTimeMs)
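The `extraOpType` selection could be sketched as follows (a simplified rendering of the table above, assuming the flag names used on this page):

```scala
// Sketch: the checks are evaluated in this order.
val extraOpType: String =
  if (!writeUnmodifiedRows) "writeModifiedRowsOnly"  // Deletion Vectors in use
  else if (shouldOptimizeMatchedOnlyMerge) "writeAllUpdatesAndDeletes"
  else "writeAllChanges"
```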

@@ -321,10 +327,18 @@ Property | Value

`writeAllChanges` creates a `DataFrame` for the [target plan](#buildTargetPlanWithFiles) with the given [AddFile](../../AddFile.md)s to rewrite (`filesToRewrite`) (and no `columnsToDrop`).

-`writeAllChanges` determines the join type based on [shouldOptimizeMatchedOnlyMerge](MergeIntoCommandBase.md#shouldOptimizeMatchedOnlyMerge):
-
-* `rightOuter` when enabled
-* `fullOuter` otherwise
+`writeAllChanges` determines the join type.
+With `writeUnmodifiedRows` enabled (`true`), the join type is as follows:
+
+1. `rightOuter` for [shouldOptimizeMatchedOnlyMerge](MergeIntoCommandBase.md#shouldOptimizeMatchedOnlyMerge) enabled
+1. `fullOuter` otherwise
+
+With `writeUnmodifiedRows` disabled (`false`), the join type is as follows (in that order):
+
+1. `inner` for `isMatchedOnly` enabled
+1. `leftOuter` for no `notMatchedBySourceClauses`
+1. `rightOuter` for no `notMatchedClauses`
+1. `fullOuter` otherwise
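A Scala-like sketch of the selection above (names as on this page, not the exact sources):

```scala
// Sketch: with unmodified rows not written out (Deletion Vectors in use),
// stricter join types can be chosen, in this order of checks.
val joinType: String =
  if (writeUnmodifiedRows) {
    if (shouldOptimizeMatchedOnlyMerge) "rightOuter" else "fullOuter"
  }
  else if (isMatchedOnly) "inner"
  else if (notMatchedBySourceClauses.isEmpty) "leftOuter"
  else if (notMatchedClauses.isEmpty) "rightOuter"
  else "fullOuter"
```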

??? note "`shouldOptimizeMatchedOnlyMerge` Used Twice"
[shouldOptimizeMatchedOnlyMerge](MergeIntoCommandBase.md#shouldOptimizeMatchedOnlyMerge) is used twice for the following:
4 changes: 4 additions & 0 deletions docs/configuration-properties/index.md
@@ -342,6 +342,10 @@ Default: `50`

Default: `true`

Used when:

* `MergeIntoCommandBase` is requested to [shouldWritePersistentDeletionVectors](../commands/merge/MergeIntoCommandBase.md#shouldWritePersistentDeletionVectors)
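A hypothetical sketch of such a check (the property name is truncated in this diff excerpt, so `dvPropertyKey` below is a stand-in; only `SparkSession.conf.get` with a default value is assumed):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch, not the Delta Lake sources: a boolean configuration
// property (default `true`, per the excerpt above) gates persistent
// Deletion Vectors in MERGE.
def shouldWritePersistentDeletionVectors(
    spark: SparkSession,
    dvPropertyKey: String): Boolean =
  spark.conf.get(dvPropertyKey, "true").toBoolean
```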

### <span id="MERGE_MATERIALIZE_SOURCE"> merge.materializeSource { #merge.materializeSource }

**spark.databricks.delta.merge.materializeSource**
3 changes: 2 additions & 1 deletion docs/features/index.md
@@ -10,6 +10,7 @@
* [Column Statistics](../column-statistics/index.md)
* [Commands](../commands/index.md)
* [Data Skipping](../data-skipping/index.md)
* [Deletion Vectors](../deletion-vectors/index.md)
* [Delta SQL](../sql/index.md)
* [Developer API](../DeltaTable.md)
* [Generated Columns](../generated-columns/index.md)
@@ -25,7 +26,7 @@ Delta Lake can run with other execution engines like [Trino](https://trino.io/do

Delta tables can be registered in a table catalog. Delta Lake creates a transaction log at the root directory of a table, and the catalog stores nothing but the table format and the location of the table. All table properties, schema, and partitioning information live in the transaction log to avoid a "split brain" situation ([Wikipedia](https://en.wikipedia.org/wiki/Split-brain_(computing))).
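For example (illustrative paths only), creating a Delta table puts everything but the location into the transaction log under the table root:

```scala
// The catalog records just the format and location; schema, partitioning
// and table properties land in the _delta_log transaction log.
spark.sql("CREATE TABLE demo (id BIGINT) USING delta LOCATION '/tmp/demo'")
// The first commit is recorded at:
//   /tmp/demo/_delta_log/00000000000000000000.json
```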

-Delta Lake {{ delta.version }} supports Apache Spark {{ spark.version }} (cf. [build.sbt]({{ delta.github }}/build.sbt#L38)).
+Delta Lake {{ delta.version }} supports Apache Spark {{ spark.version }} (cf. [build.sbt]({{ delta.github }}/build.sbt#L37)).

## Delta Tables

