Introduce DeltaConfig and tombstones retention policy #420
Conversation
Leaving it as a draft since the checkpoint tests still have to be done against …
@fvaleye I'd especially love your thoughts on this since you implemented the vacuum command.
@mosyp I would be interested to see which files remained after applying the … To provide more context on the execution: it was implemented using the …
@fvaleye Oh I see, you're absolutely right, sorry for the confusion. Then what's left to do is to reuse the retention period from delta_config. One more question though: what do you think of changing the API of vacuum for the parameter …? Also regarding the …
Force-pushed 0d840d3 to ed754f8
@mosyp no worries 👍
Good idea, this is the right thing to do to make better use of the new configuration parameter.
Agreed!
Left a question around tombstone expiration tracking. The rest looks good to me. The new config interface is really nice and fills in a major missing feature 👍
Force-pushed 544ee3a to d425ffb
Force-pushed d425ffb to 604a961
Force-pushed 604a961 to bdbe9bd
LGTM!
There are a lot of Delta configs that are currently ignored by delta-rs. This PR introduces a mechanism for working with these configs in a manner similar to the Spark codebase: https://github.com/delta-io/delta/blob/master/core/src/main/scala/org/apache/spark/sql/delta/DeltaConfig.scala#L325.
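To make the idea concrete, here is a minimal sketch of such a typed config entry in Rust. The shape mirrors Spark's `DeltaConfig` (a key, a default, and a typed accessor); the names `DeltaConfig`, `TOMBSTONE_RETENTION`, `get_interval_ms`, and `parse_interval_ms` are illustrative, not necessarily the crate's actual API.

```rust
use std::collections::HashMap;

/// A typed table config entry: a key plus a default used when the table
/// metadata does not override it (names here are illustrative).
struct DeltaConfig {
    key: &'static str,
    default: &'static str,
}

/// Hypothetical entry for delta.deletedFileRetentionDuration.
const TOMBSTONE_RETENTION: DeltaConfig = DeltaConfig {
    key: "delta.deletedFileRetentionDuration",
    default: "interval 1 week",
};

impl DeltaConfig {
    /// Look the key up in the table's configuration map, fall back to the
    /// built-in default, and parse the value as an interval in milliseconds.
    fn get_interval_ms(&self, configuration: &HashMap<String, String>) -> Option<u64> {
        let raw = configuration
            .get(self.key)
            .map(|s| s.as_str())
            .unwrap_or(self.default);
        parse_interval_ms(raw)
    }
}

/// Parse a Spark-style "interval <amount> <unit>" string into milliseconds.
fn parse_interval_ms(value: &str) -> Option<u64> {
    let mut parts = value.split_whitespace();
    if parts.next()? != "interval" {
        return None;
    }
    let amount: u64 = parts.next()?.parse().ok()?;
    let unit_ms: u64 = match parts.next()? {
        "millisecond" | "milliseconds" => 1,
        "second" | "seconds" => 1_000,
        "minute" | "minutes" => 60_000,
        "hour" | "hours" => 3_600_000,
        "day" | "days" => 86_400_000,
        "week" | "weeks" => 604_800_000,
        _ => return None,
    };
    Some(amount * unit_ms)
}

fn main() {
    // A table-level override wins over the built-in default.
    let mut conf = HashMap::new();
    conf.insert(
        "delta.deletedFileRetentionDuration".to_string(),
        "interval 2 days".to_string(),
    );
    assert_eq!(TOMBSTONE_RETENTION.get_interval_ms(&conf), Some(172_800_000));
    // With no override, the default of one week applies.
    assert_eq!(
        TOMBSTONE_RETENTION.get_interval_ms(&HashMap::new()),
        Some(604_800_000)
    );
}
```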
When running kafka-delta-ingest in production we noticed that `remove` log entries are never cleared, while they are in the Spark writer, due to the config mentioned above.
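For illustration, here is a minimal sketch (with a simplified `Remove` action and a hypothetical helper name; the real protocol action carries more fields) of how expired tombstones could be filtered out, e.g. when a checkpoint is written:

```rust
/// Simplified remove action: path plus deletion timestamp in ms since epoch.
struct Remove {
    path: String,
    deletion_timestamp: Option<i64>,
}

/// Keep only tombstones younger than the retention window; entries with
/// no timestamp are conservatively retained.
fn retain_live_tombstones(tombstones: Vec<Remove>, now_ms: i64, retention_ms: i64) -> Vec<Remove> {
    tombstones
        .into_iter()
        .filter(|r| r.deletion_timestamp.map_or(true, |ts| now_ms - ts < retention_ms))
        .collect()
}

fn main() {
    let tombstones = vec![
        Remove { path: "part-0001.parquet".into(), deletion_timestamp: Some(0) },
        Remove { path: "part-0002.parquet".into(), deletion_timestamp: Some(900_000) },
    ];
    // With a 10-minute window at t = 1_000_000 ms, only the second survives.
    let live = retain_live_tombstones(tombstones, 1_000_000, 600_000);
    assert_eq!(live.len(), 1);
    assert_eq!(live[0].path, "part-0002.parquet");
}
```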
Note that the `vacuum` function has not been modified in this PR (however, it should use the retention interval from the configs as its default instead of a hard-coded constant). When comparing it with Spark's `vacuum`, we found that Spark deletes every file that is in the table directory but not referenced in the delta logs. The time to live of both the delta log entries and the actual files is controlled by `deletedFileRetentionDuration`, which means that with the current version of `vacuum` we might end up with orphaned files in the store, e.g. if `vacuum` is called after the `remove` action has already been cleared from the log. Also, there's a bunch of other configs that are enabled by default in Spark, such as `enableExpiredLogCleanup`.
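As a sketch of that suggested change, vacuum's retention could be resolved in three steps: an explicit caller override, then the table config, then a hard-coded default as a last resort. `vacuum_retention_ms` and its parameters are illustrative, not the crate's actual signature.

```rust
/// Default tombstone retention: one week in milliseconds.
const DEFAULT_RETENTION_MS: u64 = 604_800_000;

/// Hypothetical: resolve the retention period vacuum should use. An explicit
/// caller-supplied value (in hours) wins; otherwise fall back to the table's
/// deletedFileRetentionDuration setting (parsed elsewhere into ms), and only
/// then to the hard-coded default.
fn vacuum_retention_ms(
    user_retention_hours: Option<u64>,
    table_retention_ms: Option<u64>,
) -> u64 {
    user_retention_hours
        .map(|h| h * 3_600_000)
        .or(table_retention_ms)
        .unwrap_or(DEFAULT_RETENTION_MS)
}

fn main() {
    // Caller override takes precedence over the table config.
    assert_eq!(vacuum_retention_ms(Some(24), Some(172_800_000)), 86_400_000);
    // Table config is used when no override is given.
    assert_eq!(vacuum_retention_ms(None, Some(172_800_000)), 172_800_000);
    // Hard-coded default only as a last resort.
    assert_eq!(vacuum_retention_ms(None, None), DEFAULT_RETENTION_MS);
}
```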