Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: adds support for detached commits #3028

Merged
merged 3 commits into from
Oct 24, 2024

Conversation

westonpace
Copy link
Contributor

@westonpace westonpace commented Oct 21, 2024

A detached commit is a commit that is not part of the regular dataset lineage. It will never show up as the latest commit and is completely separate from the linear history of the dataset.

This can be useful for:

  • Making temporary changes to a dataset, like adding a column of values while computing an index without ever truly materializing the column.
  • Situations where the dataset's lineage is managed remotely and all commits are detached commits without any linear history (blob storage will use this)

Closes #2889

@github-actions github-actions bot added enhancement New feature or request python labels Oct 21, 2024
@westonpace
Copy link
Contributor Author

My original approach was much more complicated and used UUIDs as the version. However, if we keep the version as a u64 but borrow the most significant bit to flag detached vs. normal then we end up with much fewer changes and less overall complexity.

@westonpace westonpace requested a review from wjones127 October 21, 2024 15:03
@codecov-commenter
Copy link

codecov-commenter commented Oct 21, 2024

Codecov Report

Attention: Patch coverage is 30.97826% with 127 lines in your changes missing coverage. Please review.

Project coverage is 78.26%. Comparing base (536e73d) to head (e490d01).

Files with missing lines Patch % Lines
rust/lance/src/io/commit.rs 2.98% 65 Missing ⚠️
rust/lance/src/dataset.rs 44.82% 44 Missing and 4 partials ⚠️
rust/lance-table/src/io/commit.rs 48.14% 14 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3028      +/-   ##
==========================================
- Coverage   78.38%   78.26%   -0.12%     
==========================================
  Files         240      240              
  Lines       77122    77284     +162     
  Branches    77122    77284     +162     
==========================================
+ Hits        60449    60488      +39     
- Misses      13565    13697     +132     
+ Partials     3108     3099       -9     
Flag Coverage Δ
unittests 78.26% <30.97%> (-0.12%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems reasonable. Is there any special considerations we need to make for clean up?

Comment on lines +80 to +83
// Detached versions should never show up first in a list operation which
// means it needs to come lexicographically after all attached manifest
// files and so we add the prefix `d`. There is no need to invert the
// version number since detached versions are not part of the version
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@westonpace
Copy link
Contributor Author

This seems reasonable. Is there any special considerations we need to make for clean up?

@wjones127

I'm planning on tackling cleanup later. My thinking is that cleanup will be something along the lines of:

cleanup_old_versions(
  since: ...,
  delete_unverified: bool,
  versions_to_retain: &[u64],
)

For the "detached versions are just temporary versions" case then versions_to_retain will be empty and the temporary versions will only be deleted if they are older than 7 days or delete_unverified (assuming the caller setting delete_unverified is confirming there are no ongoing temporary operations)

For the "this is a secondary database and everything is detached" case then cleanup will be triggered by a cleanup of the primary database. After the cleanup of the primary database we will scan all remaining versions (in the primary database) and collect which secondary versions are still referenced. These will be passed in as versions_to_retain.

@westonpace westonpace force-pushed the feat/detached-commit branch from bb46d90 to 78517ff Compare October 24, 2024 15:28
@westonpace
Copy link
Contributor Author

Failures are due to known issue with Ray (#3042). Merging

@westonpace westonpace merged commit 578ab5a into lancedb:main Oct 24, 2024
20 of 23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow for a "staged commit"
3 participants