Refactor remove file handling in InMemoryLogReplay (#4180)

#### Which Delta project/connector is this regarding?

- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

Split the RemoveFile hash map in InMemoryLogReplay into activeRemoveFiles
and cancelledRemoveFiles. The two are semantically different: a RemoveFile
in the latter cancelled an AddFile during replay, while one in the former
did not.
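
The distinction can be illustrated with a minimal, self-contained sketch. This is not the code in this commit: `ToyLogReplay`, the simplified `AddFile`/`RemoveFile` case classes, the path-only key, and the `cancelled`/`active`/`files` accessors are hypothetical stand-ins chosen for illustration (the real class keys on `UniqueFileActionTuple` and handles more action types).

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-ins for Delta's actions (not the real classes).
sealed trait FileAction { def path: String }
final case class AddFile(path: String, dataChange: Boolean = true) extends FileAction
final case class RemoveFile(path: String, delTimestamp: Long, dataChange: Boolean = true)
  extends FileAction

// Toy replay keyed by path only; the real code keys on UniqueFileActionTuple.
final class ToyLogReplay(minFileRetentionTimestamp: Long) {
  private val activeFiles = mutable.HashMap.empty[String, AddFile]
  // RemoveFiles that cancelled an AddFile seen earlier in the replay
  private val cancelledRemoveFiles = mutable.HashMap.empty[String, RemoveFile]
  // RemoveFiles that did not cancel any AddFile seen in the replay
  private val activeRemoveFiles = mutable.HashMap.empty[String, RemoveFile]

  def append(actions: Iterator[FileAction]): Unit = actions.foreach {
    case add: AddFile =>
      activeFiles(add.path) = add.copy(dataChange = false)
      // A re-added file must not also be reported as a tombstone.
      cancelledRemoveFiles.remove(add.path)
      activeRemoveFiles.remove(add.path)
    case remove: RemoveFile =>
      // The result of removing from activeFiles decides which map gets the RemoveFile.
      activeFiles.remove(remove.path) match {
        case Some(_) => cancelledRemoveFiles(remove.path) = remove
        case None => activeRemoveFiles(remove.path) = remove
      }
  }

  def cancelled: Set[String] = cancelledRemoveFiles.keySet.toSet
  def active: Set[String] = activeRemoveFiles.keySet.toSet
  def files: Set[String] = activeFiles.keySet.toSet

  // Both kinds of RemoveFile are still reported as tombstones, subject to retention.
  def tombstones: Set[RemoveFile] =
    (cancelledRemoveFiles.values ++ activeRemoveFiles.values)
      .filter(_.delTimestamp > minFileRetentionTimestamp)
      .map(_.copy(dataChange = false))
      .toSet
}
```

In this sketch, replaying `AddFile("a")` followed by `RemoveFile("a", 200L)` puts `a` in the cancelled map, whereas a `RemoveFile("b", 200L)` with no preceding add lands in the active map. The combined tombstone output is the same as it would be with a single map.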

## How was this patch tested?

Existing unit tests.

## Does this PR introduce _any_ user-facing changes?

lzlfred authored Feb 20, 2025
1 parent c612006 commit 8045052
Showing 1 changed file with 16 additions and 5 deletions.
@@ -47,7 +47,12 @@ class InMemoryLogReplay(
   private val transactions = new scala.collection.mutable.HashMap[String, SetTransaction]()
   private val domainMetadatas = collection.mutable.Map.empty[String, DomainMetadata]
   private val activeFiles = new scala.collection.mutable.HashMap[UniqueFileActionTuple, AddFile]()
-  private val tombstones = new scala.collection.mutable.HashMap[UniqueFileActionTuple, RemoveFile]()
+  // RemoveFiles that had cancelled AddFile during replay
+  private val cancelledRemoveFiles =
+    new scala.collection.mutable.HashMap[UniqueFileActionTuple, RemoveFile]()
+  // RemoveFiles that had NOT cancelled any AddFile during replay
+  private val activeRemoveFiles =
+    new scala.collection.mutable.HashMap[UniqueFileActionTuple, RemoveFile]()

   override def append(version: Long, actions: Iterator[Action]): Unit = {
     assert(currentVersion == -1 || version == currentVersion + 1,
@@ -69,19 +74,25 @@ class InMemoryLogReplay(
         val uniquePath = UniqueFileActionTuple(add.pathAsUri, add.getDeletionVectorUniqueId)
         activeFiles(uniquePath) = add.copy(dataChange = false)
         // Remove the tombstone to make sure we only output one `FileAction`.
-        tombstones.remove(uniquePath)
+        cancelledRemoveFiles.remove(uniquePath)
+        // Remove from activeRemoveFiles to handle commits that add a previously-removed file
+        activeRemoveFiles.remove(uniquePath)
       case remove: RemoveFile =>
         val uniquePath = UniqueFileActionTuple(remove.pathAsUri, remove.getDeletionVectorUniqueId)
-        activeFiles.remove(uniquePath)
-        tombstones(uniquePath) = remove.copy(dataChange = false)
+        activeFiles.remove(uniquePath) match {
+          case Some(_) => cancelledRemoveFiles(uniquePath) = remove
+          case None => activeRemoveFiles(uniquePath) = remove
+        }
       case _: CommitInfo => // do nothing
       case _: AddCDCFile => // do nothing
       case null => // Some crazy future feature. Ignore
     }
   }

   private def getTombstones: Iterable[FileAction] = {
-    tombstones.values.filter(_.delTimestamp > minFileRetentionTimestamp)
+    (cancelledRemoveFiles.values ++ activeRemoveFiles.values)
+      .filter(_.delTimestamp > minFileRetentionTimestamp)
+      .map(_.copy(dataChange = false))
   }

   private[delta] def getTransactions: Iterable[SetTransaction] = {