
Introduce DeltaConfig and tombstones retention policy #420

Merged: 2 commits into delta-io:main on Sep 9, 2021

Conversation

@mosyp (Contributor) commented Sep 1, 2021

There are a lot of delta configs that are currently ignored by delta-rs. This PR introduces a mechanism for working with these configs in a manner similar to the Spark codebase: https://github.com/delta-io/delta/blob/master/core/src/main/scala/org/apache/spark/sql/delta/DeltaConfig.scala#L325.
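
To make the mechanism concrete, here is a minimal sketch of what such a config handle could look like in Rust. This is an illustrative approximation of the Spark pattern, not the exact API of this PR; the key name and default mirror Spark's `deletedFileRetentionDuration`, and `parse_interval` is a hypothetical helper.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// A table config key paired with a default, loosely mirroring Spark's
/// DeltaConfig (sketch only; not the exact types introduced in this PR).
pub struct DeltaConfig {
    pub key: &'static str,
    pub default: &'static str,
}

impl DeltaConfig {
    /// Read the value from the table's metadata configuration, falling back
    /// to the default when the key is absent or unset.
    pub fn get_interval(&self, config: &HashMap<String, Option<String>>) -> Duration {
        let raw = config
            .get(self.key)
            .and_then(|v| v.clone())
            .unwrap_or_else(|| self.default.to_string());
        parse_interval(&raw)
    }
}

/// Hypothetical parser for Spark-style values like "interval 1 week";
/// a real implementation would cover all "interval <n> <unit>" forms.
fn parse_interval(raw: &str) -> Duration {
    match raw {
        "interval 1 week" => Duration::from_secs(7 * 24 * 3600),
        _ => Duration::from_secs(0), // error handling elided
    }
}

/// delta.deletedFileRetentionDuration with Spark's default of one week.
pub const TOMBSTONE_RETENTION: DeltaConfig = DeltaConfig {
    key: "deletedFileRetentionDuration",
    default: "interval 1 week",
};
```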

When working with kafka-delta-ingest in production, we noticed that remove entries are never cleared from the log, although they are by the Spark writer, because of the config mentioned above.

Note that the vacuum function has not been modified in this PR (although it should use the retention interval from the configs as its default instead of a const value). When comparing it with Spark's vacuum, we found that Spark deletes every file that is in the table directory but not referenced in the delta logs.
The time to live of both delta log entries and the actual files is controlled by `deletedFileRetentionDuration`, which means that with the current version of vacuum we might end up with orphan files in the store, e.g. when vacuum is called after the remove action has already been cleared from the log.

Also, there's a bunch of other configs that are turned on by default in Spark, such as `enableExpiredLogCleanup`.

@mosyp (Contributor, Author) commented Sep 1, 2021

Leaving this as a draft since the checkpoint tests have to be run against the writer-map-support branch, as arrow-rs has not yet released map support.

@mosyp (Contributor, Author) commented Sep 1, 2021

@fvaleye I'd especially love your thoughts on this since you implemented the vacuum command.

@fvaleye (Collaborator) commented Sep 2, 2021

> @fvaleye I'd especially love your thoughts on this since you implemented the vacuum command.

@mosyp I would be interested to see which files remained after applying the VACUUM command 😃

To provide more context on the execution:
When deleting files with the VACUUM command, we list all files under the DeltaTable URI (https://github.com/delta-io/delta-rs/blob/main/rust/src/delta.rs#L941), we exclude the files and folders used by Databricks (for optimization: bloom filters, cdc, ...) (https://github.com/delta-io/delta-rs/blob/main/rust/src/delta.rs#L916) as well as the `valid_files` still in use, and we only include the `expired_tombstones` files, with a hard-coded 1-week limit (https://github.com/delta-io/delta-rs/blob/main/rust/src/delta.rs#L894).
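
For illustration, a rough sketch of that selection logic (function and variable names here are placeholders, not the exact code in rust/src/delta.rs):

```rust
use std::collections::HashSet;

/// Sketch of the VACUUM selection described above: start from everything
/// listed under the table URI, skip Databricks-managed paths, drop files
/// still referenced by the log, and keep only expired tombstones.
fn files_to_delete(
    all_files: Vec<String>,
    valid_files: &HashSet<String>,
    expired_tombstones: &HashSet<String>,
) -> Vec<String> {
    all_files
        .into_iter()
        .filter(|f| !is_hidden_path(f))             // bloom filters, cdc, ...
        .filter(|f| !valid_files.contains(f))       // still in use by the log
        .filter(|f| expired_tombstones.contains(f)) // older than the retention cutoff
        .collect()
}

/// Placeholder for the "files and folders used by Databricks" exclusion.
fn is_hidden_path(path: &str) -> bool {
    path.starts_with('_') || path.starts_with('.')
}
```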

It was implemented against the 0.8 version of Delta; there are new configuration properties introduced in the 1.0 version that we should include.

@mosyp (Contributor, Author) commented Sep 2, 2021

@fvaleye Oh I see, you're absolutely right, sorry for the confusion.

Then what's left to do is to reuse the retention period from `delta_config`.

One more question though, wdyt of changing the vacuum API parameter `retention_hours: u64` to `retention_hours: Option<u64>`, so that if it's `None` we use the default retention period?

Also, regarding the `if retention_hours < 168` check: in Spark I found that they use `val retentionSafe = retentionMs.forall(_ >= configuredRetention)`, meaning that we might change ours to `if retention_hours < self.tombstone_retention` as well, i.e. compare against the value configured on the table instead of a hardcoded 1 week.
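
A sketch of what those two changes combined could look like (the `DeltaTable` stub, `tombstone_retention_hours` field, and `VacuumError` variant are assumptions for illustration, not the final API):

```rust
/// Sketch only: a stub of the proposed vacuum signature.
#[derive(Debug)]
pub enum VacuumError {
    /// The requested retention is below the table's configured retention.
    InvalidVacuumRetentionPeriod,
}

pub struct DeltaTable {
    /// Assumed to be loaded from delta_config,
    /// e.g. deletedFileRetentionDuration.
    tombstone_retention_hours: u64,
}

impl DeltaTable {
    pub fn vacuum(
        &mut self,
        retention_hours: Option<u64>, // None => use the table's configured default
        dry_run: bool,
    ) -> Result<Vec<String>, VacuumError> {
        let retention_hours = retention_hours.unwrap_or(self.tombstone_retention_hours);
        // Compare against the configured retention instead of a hardcoded 168,
        // mirroring Spark's `retentionMs.forall(_ >= configuredRetention)`.
        if retention_hours < self.tombstone_retention_hours {
            return Err(VacuumError::InvalidVacuumRetentionPeriod);
        }
        let files_to_delete: Vec<String> = Vec::new(); // selection logic elided
        if dry_run {
            return Ok(files_to_delete);
        }
        // ... delete the selected files from the store ...
        Ok(files_to_delete)
    }
}
```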

@fvaleye (Collaborator) commented Sep 3, 2021

> @fvaleye Oh I see, you're absolutely right, sorry for the confusion.

@mosyp no worries 👍
It's a good opportunity for me to see the difference between the 0.8 and 1.0 versions. We have to keep in mind not to remove useful files when applying the VACUUM command 😅.

> Then what's left to do is to reuse the retention period from `delta_config`.

Exactly!

> One more question though, wdyt of changing the vacuum API parameter `retention_hours: u64` to `retention_hours: Option<u64>`, so that if it's `None` we use the default retention period?

Good idea, this is the right thing to do to make better use of the new configuration parameter.

> Also, regarding the `if retention_hours < 168` check: in Spark I found that they use `val retentionSafe = retentionMs.forall(_ >= configuredRetention)`, meaning that we might change ours to `if retention_hours < self.tombstone_retention` as well, i.e. compare against the value configured on the table instead of a hardcoded 1 week.

Agreed!

@mosyp marked this pull request as ready for review September 6, 2021 16:31
@mosyp requested a review from fvaleye September 6, 2021 16:38
@mosyp (Contributor, Author) commented Sep 6, 2021

@xianwill @rtyler @houqp ready for review

@houqp (Member) left a comment

Left a question around tombstone expiration tracking. The rest looks good to me. The new config interface is really nice; this has been a major missing feature 👍

@mosyp (Contributor, Author) commented Sep 8, 2021

@houqp @xianwill ready for the final review

@xianwill previously approved these changes Sep 8, 2021
@xianwill self-requested a review September 8, 2021 16:59
@houqp previously approved these changes Sep 8, 2021
@houqp (Member) left a comment

LGTM!

@houqp dismissed stale reviews from xianwill and themself via 9444c22 September 8, 2021 18:58
@houqp requested review from xianwill and houqp September 8, 2021 18:59
@mosyp merged commit 5bdaf8d into delta-io:main Sep 9, 2021
@mosyp deleted the fix-tombstones branch September 9, 2021 06:42