-
Notifications
You must be signed in to change notification settings - Fork 447
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optionally ignore tombstones for read-only use cases #425
Comments
Extending on @dispanser 's idea, I think we could even extend it to all writers that doesn't handle checkpoints. By moving checkpoints to a dedicated writer/lambda function, we could even apply this optimization to all data writers. |
An easy way would be just an additional option/flag. And if set, then we'd keep tomstones list empty, destipe it's writer or reader |
I agree with @mosyp , let the user decide on instantiating the delta table. However, it would be interesting if we're able to detect the misuse, i.e. panic / warn (?) when a writer attempts to write a checkpoint despite not having tombstones available. Not sure if that's easy to implement. |
@dispanser If we stick with an optional flag, then if it's set, then prevent of creating a checkpoint and telling user a meaningful error |
I'd make a quick foray into that topic if someone assigns this to me. |
yep, it should be easy to prevent side effect of this optimization by checking for the flag in both checkpoint and vacuum code path. |
Now that I'm able to load my large tombstone-filled delta table, I plan on revisiting this issue. Still considering myself new to rust, I'm wondering how this feature could be represented in the public API. I assume that |
How about adding a new config value to https://github.com/delta-io/delta-rs/blob/main/rust/src/delta_config.rs? Then we can pass in the full table config as an optional argument. We will have to either introduce breaking API change or add new functions. I am fine with us introducing breaking API change at the current stage of the project. Another pattern I have seen from other projects is builder struct. So the code will look something like this: let table = TableOpener::new(uri).with_config(config).with_version(1).open().await?; Then if we need to add a new knob in the future, e.g. storage backend, we could add a new method to the let table = TableOpener::new(uri).with_config(config).with_storage(storage).open().await?; I am cool with either way, but I think the builder pattern is likely going to be the long term solution we would adopt. |
Thanks, @houqp ! I'll be looking into the builder pattern. As an added bonus, we can keep the top-level entry points for backwards-compatibility, by having them use that config mechanism. |
https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/logical_plan/builder.rs is a good example of how the builder pattern works in real world. |
# Description This is still very much a work in progress, opening it up for visibility and discussion. Finally I do hope that we can make the switch to arrow based log handling. Aside from hopefully advantages in the memory footprint, I also believe it opens us up to many future optimizations as well. To make the transition we introduce two new structs - `Snapshot` - a half lazy version of the Snapshot, which only tries to get `Protocol` & `Metadata` actions ASAP. Of course these drive all our planning activities and without them there is not much we can do. - `EagerSnapshot` - An intermediary structure, which eagerly loads file actions and does log replay to serve as a compatibility laver for the current `DeltaTable` APIs. One conceptually larger change is related to how we view the availability of information. Up until now `DeltaTableState` could be initialized empty, containing no useful information for any code to work with. State (snapshots) now always needs to be created valid. The thing that may not yet be initialized is the `DeltaTable`, which now only carries the table configuration and the `LogStore`. the state / snapshot is now optional. Consequently all code that works against a snapshot no longer needs to handle that matadata / schema etc may not be available. This also has implications for the datafusion integration. We already are working against snapshots mostly, but should abolish most traits implemented for `DeltaTable` as this does not provide the information (and never has) that is al least required to execute a query. Some larger notable changes include: * remove `DeltaTableMetadata` and always use `Metadata` action. * arrow and parquet are now required, as such the features got removed. Personalyl I would also argue, that if you cannot read checkpoints, you cannot read delta tables :). - so hopefully users weren't using arrow-free versions. ### Major follow-ups: * (pre-0.17) review integration with `log_store` and `object_store`. Currently we make use mostly of `ObjectStore` inside the state handling. What we really use is `head` / `list_from` / `get` - my hope would be that we end up with a single abstraction... * test cleanup - we are currently dealing with test flakiness and have several approaches to scaffolding tests. SInce we have the `deltalake-test` crate now, this can be reconciled. * ... * do more processing on borrowed data ... * perform file-heavy operations on arrow data * update checkpoint writing to leverage new state handling and arrow ... * switch to exposing URL in public APIs ## Questions * should paths be percent-encoded when written to checkpoint? # Related Issue(s) supersedes: #454 supersedes: #1837 closes: #1776 closes: #425 (should also be addressed in the current implementation) closes: #288 (multi-part checkpoints are deprecated) related: #435 # Documentation <!--- Share links to useful documentation ---> --------- Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>
# Description This is still very much a work in progress, opening it up for visibility and discussion. Finally I do hope that we can make the switch to arrow based log handling. Aside from hopefully advantages in the memory footprint, I also believe it opens us up to many future optimizations as well. To make the transition we introduce two new structs - `Snapshot` - a half lazy version of the Snapshot, which only tries to get `Protocol` & `Metadata` actions ASAP. Of course these drive all our planning activities and without them there is not much we can do. - `EagerSnapshot` - An intermediary structure, which eagerly loads file actions and does log replay to serve as a compatibility laver for the current `DeltaTable` APIs. One conceptually larger change is related to how we view the availability of information. Up until now `DeltaTableState` could be initialized empty, containing no useful information for any code to work with. State (snapshots) now always needs to be created valid. The thing that may not yet be initialized is the `DeltaTable`, which now only carries the table configuration and the `LogStore`. the state / snapshot is now optional. Consequently all code that works against a snapshot no longer needs to handle that matadata / schema etc may not be available. This also has implications for the datafusion integration. We already are working against snapshots mostly, but should abolish most traits implemented for `DeltaTable` as this does not provide the information (and never has) that is al least required to execute a query. Some larger notable changes include: * remove `DeltaTableMetadata` and always use `Metadata` action. * arrow and parquet are now required, as such the features got removed. Personalyl I would also argue, that if you cannot read checkpoints, you cannot read delta tables :). - so hopefully users weren't using arrow-free versions. ### Major follow-ups: * (pre-0.17) review integration with `log_store` and `object_store`. Currently we make use mostly of `ObjectStore` inside the state handling. What we really use is `head` / `list_from` / `get` - my hope would be that we end up with a single abstraction... * test cleanup - we are currently dealing with test flakiness and have several approaches to scaffolding tests. SInce we have the `deltalake-test` crate now, this can be reconciled. * ... * do more processing on borrowed data ... * perform file-heavy operations on arrow data * update checkpoint writing to leverage new state handling and arrow ... * switch to exposing URL in public APIs ## Questions * should paths be percent-encoded when written to checkpoint? # Related Issue(s) supersedes: delta-io#454 supersedes: delta-io#1837 closes: delta-io#1776 closes: delta-io#425 (should also be addressed in the current implementation) closes: delta-io#288 (multi-part checkpoints are deprecated) related: delta-io#435 # Documentation <!--- Share links to useful documentation ---> --------- Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>
Description
In a discussion for PR #420, #420 (comment) an idea came up to have a config option to ignore all tombstone entries entirely. Applications that are only reading do not actually use tombstones.
Use Case
This would reduce memory footprint, in particular for read-only applications.
@houqp writes:
I'd like to explore that idea further. We actually have a delta table that is mostly tombstones, which would see a major decrease in memory consumption.
The text was updated successfully, but these errors were encountered: