[Kernel] Support getting snapshots by timestamp (time-travel) #2662

Merged

Conversation

allisonport-db
Collaborator

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Resolves #2276

Adds support for reading the snapshot of the table at a specific timestamp using Table::getSnapshotAtTimestamp.

How was this patch tested?

Adds unit tests.

import io.delta.kernel.internal.util.Tuple2;
import static io.delta.kernel.internal.fs.Path.getName;

public final class DeltaHistoryManager {

@allisonport-db
Collaborator Author

It seemed like some of the test refactoring was complicating this PR, so I separated it out into #2663.

@@ -66,4 +66,15 @@ Snapshot getLatestSnapshot(TableClient tableClient)
*/
Snapshot getSnapshotAtVersion(TableClient tableClient, long versionId)
throws TableNotFoundException;

/**
* Get the snapshot at the given {@code timestamp}.
Collaborator

Can you be a bit more specific?

- timestamp T-1: Table Version 10 created
- timestamp T: 
- timestamp T+1: Table Version 11 created

If we query for the snapshot at timestamp T, what do we return?

Collaborator

+1. We need clarity on whether this returns the latest snapshot at or before the timestamp, or the first snapshot at or after it.

Collaborator Author

Updated the docs to be clearer.
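
To make the two candidate semantics from this thread concrete, here is a minimal sketch applied to the reviewer's scenario (version 10 committed at T-1, version 11 committed at T+1, query at T). The class `TimestampResolutionSketch`, the `Commit` type, and both method names are hypothetical and are not Kernel APIs; the sketch assumes the commits are sorted by version with strictly increasing timestamps.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalLong;

final class TimestampResolutionSketch {
    /** Hypothetical stand-in for a commit: a table version and its commit timestamp (ms). */
    static final class Commit {
        final long version;
        final long timestamp;

        Commit(long version, long timestamp) {
            this.version = version;
            this.timestamp = timestamp;
        }
    }

    /** Latest version whose commit timestamp is <= the requested timestamp, if any. */
    static OptionalLong versionBeforeOrAt(List<Commit> commits, long timestamp) {
        OptionalLong result = OptionalLong.empty();
        for (Commit c : commits) {
            if (c.timestamp > timestamp) {
                break; // commits are sorted, so no later commit can qualify
            }
            result = OptionalLong.of(c.version);
        }
        return result;
    }

    /** Earliest version whose commit timestamp is >= the requested timestamp, if any. */
    static OptionalLong versionAtOrAfter(List<Commit> commits, long timestamp) {
        for (Commit c : commits) {
            if (c.timestamp >= timestamp) {
                return OptionalLong.of(c.version);
            }
        }
        return OptionalLong.empty();
    }

    public static void main(String[] args) {
        long t = 1_000L; // arbitrary value standing in for "T"
        List<Commit> commits = Arrays.asList(new Commit(10, t - 1), new Commit(11, t + 1));
        System.out.println(versionBeforeOrAt(commits, t).getAsLong()); // prints 10
        System.out.println(versionAtOrAfter(commits, t).getAsLong());  // prints 11
    }
}
```

Under at-or-before semantics the query at T resolves to version 10; under at-or-after semantics it resolves to version 11. A timestamp earlier than the first commit has no at-or-before answer, and one later than the last commit has no at-or-after answer, so each method returns an empty `OptionalLong` in that case.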

tablePath,
providedTimestamp,
commitTimestamp);
// TODO format the timestamps?
Collaborator

I would print both the millis and a human-readable form. See how delta-spark formats timestamps (if they are ever in millis format). I'd expect you can print them in the local time zone.

Collaborator Author

I think this is a good thing to discuss when we revisit exceptions. I agree that formatting them somehow makes sense, but it's not clear to me whether they should be in the local time zone or UTC, or what we expect connectors to do with the exception. I'll add them in UTC for now.
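
As a rough illustration of printing both forms in UTC, the sketch below renders a millisecond timestamp alongside its ISO-8601 representation. The class and method names are hypothetical, and the exact message format used in the PR is not shown in this thread; `Instant.toString()` always renders in UTC.

```java
import java.time.Instant;

final class TimestampFormatSketch {
    /** Renders epoch millis followed by the ISO-8601 UTC instant in parentheses. */
    static String formatTimestamp(long millisSinceEpochUTC) {
        // String concatenation keeps the digits locale-independent.
        return millisSinceEpochUTC + " (" + Instant.ofEpochMilli(millisSinceEpochUTC) + ")";
    }

    public static void main(String[] args) {
        System.out.println(formatTimestamp(0L)); // prints 0 (1970-01-01T00:00:00Z)
    }
}
```

Including the raw millis keeps the message machine-comparable, while the UTC instant avoids any dependence on the time zone of the machine that raised the exception.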

* unix epoch
* @return an instance of {@link Snapshot}
*/
Snapshot getSnapshotAtTimestamp(TableClient tableClient, long millisSinceEpochUTC)
Collaborator

We will eventually need two kinds of time-travel semantics, one for batch and one for streaming:

  • getVersionBeforeOrAtTimestamp
  • getVersionAtOrAfterTimestamp

We should think a bit more about this public API (method name) ... so that adding the other method doesn't cause confusion

Collaborator

In Standalone's DeltaLog, the above-mentioned methods return a version instead of a snapshot.

Looking at how Flink uses these APIs:

  • I can't find getVersionBeforeOrAtTimestamp used anywhere.
  • getVersionAtOrAfterTimestamp is used to get the version, and is immediately followed by a call to getSnapshotAtVersion with the returned version.

What if we provide two APIs:
getSnapshotBeforeOrAtTimestamp
getSnapshotAtOrAfterTimestamp

Are there any cases where we need just the version?

Collaborator Author

@vkorukanti was there an ask from delta-sharing to support getting just the version in this scenario? Or is the snapshot alone sufficient?

Collaborator Author

Should we rename this method to getSnapshotBeforeOrAtTimestamp? I'm wondering if that's a bit hard to parse for the simple batch time-travel case.

Collaborator

Looking at the Delta sharing protocol, it seems like sharing needs atOrAfter. @linzhou-db Please confirm.

Collaborator Author

It also seems like this requires versionAtOrAfter to get the version (to avoid unnecessarily loading the snapshot metadata/protocol)

Collaborator

Yes, Delta Sharing does need both the snapshot and the version at or after a timestamp.

Collaborator Author

Created #2679 to revisit this API + add additional functionality

allisonport-db added a commit that referenced this pull request Feb 22, 2024
@allisonport-db allisonport-db force-pushed the support-timetravel-by-timestamp branch from 81f4afc to a93d5d5 Compare February 22, 2024 21:27
Collaborator

@vkorukanti vkorukanti left a comment

Minor comments.

while (commits.hasNext()) {
Commit newElem = commits.next();
assert(prevVersion < newElem.version); // Verify commits are ordered
if (prevTimestamp >= newElem.timestamp) {
Collaborator

Is this what delta-spark also does? If yes, should we have a disclaimer in the API docs saying that the timestamps must be valid? If the commit files are copied over (or any other operation changes the timestamps of the commit files), is the API only a best effort to return the snapshot at the given timestamp?

Collaborator Author

Spark (and Standalone) does this. I couldn't find a good test in delta-spark for this scenario, though. I think if commit files are copied over and their timestamps change, time-travel by timestamp is fully expected to be broken. I'm guessing this is aimed more at commit versions with the same timestamp (since precision is only ms).
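
For readers unfamiliar with the check in the loop above, the following is a minimal sketch of one way to force commit timestamps to be strictly increasing. It assumes the adjustment bumps an out-of-order or duplicate timestamp to one millisecond past the previous commit; the diff excerpt shows only the condition, so that rule, the class `MonotonicTimestampSketch`, and the `Commit` type are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MonotonicTimestampSketch {
    /** Hypothetical stand-in for a commit: a table version and its commit timestamp (ms). */
    static final class Commit {
        final long version;
        final long timestamp;

        Commit(long version, long timestamp) {
            this.version = version;
            this.timestamp = timestamp;
        }
    }

    /** Returns copies of the commits with timestamps adjusted to be strictly increasing. */
    static List<Commit> monotonize(List<Commit> commits) {
        List<Commit> result = new ArrayList<>();
        long prevTimestamp = 0L; // only read once result is non-empty
        for (Commit c : commits) {
            long timestamp = c.timestamp;
            if (!result.isEmpty() && prevTimestamp >= timestamp) {
                // Assumed rule: place the commit 1 ms after its predecessor.
                timestamp = prevTimestamp + 1;
            }
            result.add(new Commit(c.version, timestamp));
            prevTimestamp = timestamp;
        }
        return result;
    }

    public static void main(String[] args) {
        List<Commit> commits = Arrays.asList(
            new Commit(0, 100), new Commit(1, 100), new Commit(2, 90), new Commit(3, 200));
        for (Commit c : monotonize(commits)) {
            System.out.println(c.version + " -> " + c.timestamp);
        }
        // prints 0 -> 100, 1 -> 101, 2 -> 102, 3 -> 200 (one per line)
    }
}
```

With adjusted timestamps, two commits that land in the same millisecond still map to distinct, ordered timestamps, so a timestamp lookup has a single well-defined answer. It does not repair a log whose file timestamps were rewritten wholesale, which matches the best-effort caveat raised in this thread.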

Comment on lines +152 to +153
throw new RuntimeException(
String.format("No recreatable commits found at %s", logPath));
Collaborator

We may need to create an InvalidTableStateException to capture all these runtime exceptions.

Collaborator Author

Agreed, let's add this to the exception discussion.

@allisonport-db allisonport-db merged commit f50bd83 into delta-io:master Mar 4, 2024
8 checks passed
Development

Successfully merging this pull request may close these issues.

[Feature Request] Support getting snapshot by timestamp
4 participants