Optionally ignore tombstones for read-only use cases #425

Closed
dispanser opened this issue Sep 7, 2021 · 10 comments · Fixed by #2037
Labels: enhancement (New feature or request)

Comments

@dispanser
Contributor

Description

In a discussion on PR #420 (#420 (comment)), the idea came up of adding a config option to ignore all tombstone entries entirely. Applications that only read a table do not actually use tombstones.

Use Case

This would reduce memory footprint, in particular for read-only applications.

@houqp writes:

Letting a read-only application specify whether tombstones should be kept at all sounds like a neat trick to reduce memory usage :) I think it's better handled as a separate, self-contained PR though, because it's an optimization specific to a subset of readers, while the problem we are dealing with here applies to both readers and writers.

I'd like to explore that idea further. We actually have a delta table that consists mostly of tombstones, which would see a major decrease in memory consumption.

@houqp
Member

houqp commented Sep 7, 2021

Extending @dispanser's idea, I think we could apply this to all writers that don't handle checkpoints. By moving checkpoint creation to a dedicated writer/lambda function, we could even apply this optimization to all data writers.

@mosyp
Contributor

mosyp commented Sep 8, 2021

An easy way would be just an additional option/flag. If set, we'd keep the tombstone list empty, regardless of whether it's a writer or a reader.
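
A minimal sketch of what such an opt-out could look like, assuming a hypothetical options struct; the names `DeltaTableLoadOptions` and `require_tombstones` are illustrative, not the crate's actual API:

```rust
/// Hypothetical load options -- a sketch only, not the actual delta-rs API.
#[derive(Debug, Clone)]
pub struct DeltaTableLoadOptions {
    /// When false, `remove` actions are dropped during log replay,
    /// so the in-memory tombstone list stays empty.
    pub require_tombstones: bool,
}

impl Default for DeltaTableLoadOptions {
    fn default() -> Self {
        // Keep tombstones by default; read-only users can opt out.
        Self { require_tombstones: true }
    }
}
```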

@dispanser
Contributor Author

I agree with @mosyp: let the user decide when instantiating the delta table.

However, it would be interesting if we were able to detect misuse, i.e. panic / warn (?) when a writer attempts to write a checkpoint despite not having tombstones available. Not sure if that's easy to implement.

@mosyp
Contributor

mosyp commented Sep 8, 2021

@dispanser If we stick with an optional flag, then when it's set we'd prevent checkpoint creation and give the user a meaningful error.

@dispanser
Contributor Author

I'd make a quick foray into that topic if someone assigns this to me.

@houqp
Member

houqp commented Sep 8, 2021

Yep, it should be easy to prevent side effects of this optimization by checking for the flag in both the checkpoint and vacuum code paths.
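
For illustration, a self-contained sketch of such a guard in the checkpoint path; the types, field, and error variant here are placeholders, not the crate's real checkpoint code:

```rust
/// Illustrative placeholder types -- not the crate's real structs.
#[derive(Debug)]
pub enum CheckpointError {
    TombstonesNotLoaded,
}

pub struct TableState {
    pub require_tombstones: bool,
    // ... file actions, tombstones, table metadata, etc.
}

/// Refuse to write a checkpoint if tombstones were dropped at load time,
/// since the checkpoint would otherwise silently omit all `remove` actions.
pub fn create_checkpoint(state: &TableState) -> Result<(), CheckpointError> {
    if !state.require_tombstones {
        return Err(CheckpointError::TombstonesNotLoaded);
    }
    // ... build the checkpoint from the state and persist it ...
    Ok(())
}
```

The same kind of check would apply to the vacuum code path.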

@dispanser
Contributor Author

Now that I'm able to load my large, tombstone-filled delta table, I plan on revisiting this issue. Still considering myself new to Rust, I'm wondering how this feature could be represented in the public API. I assume that pub async fn open_table(table_uri: &str) and related functions form the public interface. If this were Scala, I'd just tack on an additional method parameter ignoreTombstones: Boolean = false, which would be backwards compatible but expose the new feature to everyone who wants to use it. What's the best practice in Rust?

@houqp
Member

houqp commented Sep 23, 2021

How about adding a new config value to https://github.com/delta-io/delta-rs/blob/main/rust/src/delta_config.rs? Then we can pass in the full table config as an optional argument. We will have to either introduce a breaking API change or add new functions. I am fine with us introducing a breaking API change at the current stage of the project.

Another pattern I have seen in other projects is a builder struct. The code would look something like this:

let table = TableOpener::new(uri).with_config(config).with_version(1).open().await?;

Then if we need to add a new knob in the future, e.g. a storage backend, we could add a new method to the TableOpener struct without breaking existing users:

let table = TableOpener::new(uri).with_config(config).with_storage(storage).open().await?;

I am cool with either way, but I think the builder pattern is likely going to be the long-term solution we adopt.
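
For illustration, here is a minimal, self-contained sketch of how such a builder could be laid out. All names (`TableOpener`, `without_tombstones`, the placeholder `DeltaTable` / `OpenError` types) are hypothetical and only meant to show the pattern, not an existing delta-rs API:

```rust
/// Placeholder result types so the sketch stands on its own.
pub struct DeltaTable; // stand-in for the real table type
#[derive(Debug)]
pub struct OpenError(pub String);

pub struct TableOpener {
    uri: String,
    version: Option<i64>,
    require_tombstones: bool,
}

impl TableOpener {
    pub fn new(uri: impl Into<String>) -> Self {
        Self {
            uri: uri.into(),
            version: None,
            require_tombstones: true,
        }
    }

    /// Load a specific table version instead of the latest one.
    pub fn with_version(mut self, version: i64) -> Self {
        self.version = Some(version);
        self
    }

    /// Opt out of loading tombstones (read-only use cases).
    pub fn without_tombstones(mut self) -> Self {
        self.require_tombstones = false;
        self
    }

    pub async fn open(self) -> Result<DeltaTable, OpenError> {
        // A real implementation would resolve storage from `self.uri` and
        // replay the log, honoring `self.version` and `self.require_tombstones`.
        let _ = (self.uri, self.version, self.require_tombstones);
        Ok(DeltaTable)
    }
}
```

Adding a future knob (e.g. a storage backend) then just means adding another `with_*` method, which keeps existing call sites compiling.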

@dispanser
Contributor Author

Thanks, @houqp! I'll be looking into the builder pattern. As an added bonus, we can keep the top-level entry points for backwards compatibility by having them use that config mechanism.

@houqp
Member

houqp commented Sep 23, 2021

https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/logical_plan/builder.rs is a good example of how the builder pattern works in the real world.

rtyler added a commit that referenced this issue Jan 23, 2024
# Description

This is still very much a work in progress, opening it up for visibility
and discussion.

Finally, I do hope that we can make the switch to arrow-based log
handling. Aside from the hoped-for advantages in memory footprint, I
also believe it opens us up to many future optimizations.

To make the transition, we introduce two new structs:

- `Snapshot` - a half-lazy snapshot, which only tries to get the
`Protocol` & `Metadata` actions ASAP. These drive all our planning
activities, and without them there is not much we can do.
- `EagerSnapshot` - an intermediary structure, which eagerly loads file
actions and does log replay to serve as a compatibility layer for the
current `DeltaTable` APIs (a rough sketch of this split follows below).
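
Very roughly, and only as an illustration of the lazy/eager split described above (the field names and placeholder types here are assumptions, not the PR's actual definitions):

```rust
/// Placeholder action types for the sketch.
pub struct Protocol;
pub struct Metadata;
pub struct RecordBatchPlaceholder; // stand-in for an arrow `RecordBatch`

/// Lazy view of the log: only `Protocol` and `Metadata` are materialized
/// up front, since they drive all planning.
pub struct Snapshot {
    pub protocol: Protocol,
    pub metadata: Metadata,
    // file listing / log segment is resolved on demand
}

/// Eager view: file actions are loaded and log replay is performed,
/// serving as a compatibility layer for the current `DeltaTable` APIs.
pub struct EagerSnapshot {
    pub snapshot: Snapshot,
    pub files: Vec<RecordBatchPlaceholder>,
}
```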

One conceptually larger change relates to how we view the availability
of information. Up until now, `DeltaTableState` could be initialized
empty, containing no useful information for any code to work with.
State (snapshots) now always needs to be created valid. The thing that
may not yet be initialized is the `DeltaTable`, which now only carries
the table configuration and the `LogStore`; its state / snapshot is now
optional. Consequently, all code that works against a snapshot no
longer needs to handle the case that metadata / schema etc. may not be
available.

This also has implications for the datafusion integration. We are
already working mostly against snapshots, but should abolish most
traits implemented for `DeltaTable`, as it does not provide (and never
has provided) the information that is at least required to execute a
query.

Some larger notable changes include:

* remove `DeltaTableMetadata` and always use `Metadata` action.
* arrow and parquet are now required; as such, the corresponding
features got removed. Personally, I would also argue that if you cannot
read checkpoints, you cannot read delta tables :) - so hopefully users
weren't using arrow-free versions.

### Major follow-ups:

* (pre-0.17) review integration with `log_store` and `object_store`.
Currently we mostly make use of `ObjectStore` inside the state
handling. What we really use is `head` / `list_from` / `get` - my hope
would be that we end up with a single abstraction...
* test cleanup - we are currently dealing with test flakiness and have
several approaches to scaffolding tests. Since we have the
`deltalake-test` crate now, this can be reconciled.
* ...
* do more processing on borrowed data ...
* perform file-heavy operations on arrow data
* update checkpoint writing to leverage new state handling and arrow ...
* switch to exposing URL in public APIs

## Questions

* should paths be percent-encoded when written to checkpoint?

# Related Issue(s)

supersedes: #454
supersedes: #1837
closes: #1776
closes: #425 (should also be addressed in the current implementation)
closes: #288 (multi-part checkpoints are deprecated)
related: #435

# Documentation


---------

Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>
RobinLin666 pushed a commit to RobinLin666/delta-rs that referenced this issue Feb 2, 2024