#1591 brought deletion vectors (DVs) to Delta Lake and changed the way DELETE works from "removing an old file and adding a new file" to "removing a file and adding it back with a DV attached". This breaks an assumption made by CDF generation, namely that all rows in a removed file are deletes and all rows in an added file are inserts. The CDC reader must therefore be made to handle DVs.
High-level implementation details
This FR proposes to make the CDC reader look at the DVs in each FileAction and compute a new, in-memory DV that marks the rows that changed. Writing DV1 and DV2 for the DVs involved, a remove/add pair for the same file falls into one of four cases:
1. Remove without DV, add without DV: not possible. The protocol does not allow this.
2. Remove without DV, add with DV1: rows masked by DV1 are deleted.
3. Remove with DV1, add without DV: rows masked by DV1 are added. This may happen when restoring a table.
4. Remove with DV1, add with DV2:
   1. Rows masked by DV2 but not DV1 are deleted.
   2. Rows masked by DV1 but not DV2 are re-added. This may happen when restoring a table.
Looking at the above cases, we can diff the two DVs and attach the result to a file scan to obtain exactly the changed rows. For cases 3 and 4.2, the resulting DV must be inverted so that it keeps the marked rows rather than dropping them.
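The DV diff described above can be illustrated with a minimal sketch. It models a DV as a plain Python set of row indexes and uses `None` for "no DV"; the function name `diff_dvs` and this representation are assumptions made for illustration and are not Delta Lake APIs.

```python
from typing import Optional, Set, Tuple


def diff_dvs(
    remove_dv: Optional[Set[int]],
    add_dv: Optional[Set[int]],
) -> Tuple[Set[int], Set[int]]:
    """Return (deleted, re_added) row indexes for a remove/add pair of one file.

    Hypothetical helper: a DV is modelled as a set of masked row indexes.
    """
    if remove_dv is None and add_dv is None:
        # Case 1: the text above states this combination is not allowed.
        raise ValueError("remove and add of the same file without any DV")

    old = remove_dv or set()  # rows already masked before the commit
    new = add_dv or set()     # rows masked after the commit

    deleted = new - old       # cases 2 and 4.1: newly masked rows
    re_added = old - new      # cases 3 and 4.2: rows that were unmasked
    return deleted, re_added


if __name__ == "__main__":
    print(diff_dvs(None, {1, 4}))    # case 2
    print(diff_dvs({1, 4}, None))    # case 3
    print(diff_dvs({1, 4}, {4, 7}))  # case 4
```

In this sketch both result sets list the rows a scan should emit, which is why the text calls for an inverted DV in cases 3 and 4.2: the scan has to keep the marked rows instead of dropping them.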
The implementation will be in two phases. The first one will do some preparations and the second one will change the CDC reader.
The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?
Yes. I can contribute this feature independently.
Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
No. I cannot contribute this feature at this time.
…- Part 1/2
This PR is part of #1701. A detailed overview of changes is described at #1701.
This is the first PR to add support for reading CDC from files that have a DV attached. It does the preparation work needed for fine-grained control over how masked rows are handled: keep them or drop them. These two options will later be used by CDCReader to pull masked rows out of files.
Closes #1680
GitOrigin-RevId: d0f49ee0a11e604f089d45df1611272a81d47813
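The keep-or-drop handling of masked rows described in Part 1 can be sketched as follows. This is an illustration only: `MaskedRows` and `scan_with_dv` are hypothetical names, the DV is modelled as a set of row indexes, and none of this reflects the actual classes introduced by the PR.

```python
from enum import Enum
from typing import Iterable, Iterator, Set, TypeVar

T = TypeVar("T")


class MaskedRows(Enum):
    """Hypothetical switch for how a scan treats rows marked in a DV."""
    DROP = "drop"  # normal read: skip rows marked in the DV
    KEEP = "keep"  # inverted read: return only rows marked in the DV


def scan_with_dv(rows: Iterable[T], dv: Set[int], mode: MaskedRows) -> Iterator[T]:
    """Yield rows of a file according to the DV and the chosen mode."""
    for index, row in enumerate(rows):
        masked = index in dv
        # Emit masked rows in KEEP mode and unmasked rows in DROP mode.
        if masked == (mode is MaskedRows.KEEP):
            yield row


if __name__ == "__main__":
    data = ["a", "b", "c", "d"]
    print(list(scan_with_dv(data, {1, 3}, MaskedRows.DROP)))
    print(list(scan_with_dv(data, {1, 3}, MaskedRows.KEEP)))
```

A CDC read would use the keep mode, since the rows of interest are precisely the ones marked by the diffed DV.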
This PR is part of #1701.
This is a follow-up to #1680 that adds support for reading CDC from files that have a DV attached. In this PR we modify the CDC reader to construct an in-line DV by diffing two existing DVs, and modify the corresponding FileIndex to use that in-line DV.
Closes #1704
GitOrigin-RevId: 9e3589eb576a773b9f05777521b01485ebeaf33e
First phase: #1680
Second phase: TBD.