-
Notifications
You must be signed in to change notification settings - Fork 265
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add table config #2820
feat: add table config #2820
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2820 +/- ##
==========================================
+ Coverage 78.77% 78.86% +0.08%
==========================================
Files 237 237
Lines 74099 74482 +383
Branches 74099 74482 +383
==========================================
+ Hits 58374 58737 +363
- Misses 12703 12721 +18
- Partials 3022 3024 +2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this. This seems to be in the direction I was thinking. I've left some guiding comments.
protos/table.proto
Outdated
@@ -85,6 +85,7 @@ message Manifest { | |||
// * 1: deletion files are present | |||
// * 2: move_stable_row_ids: row IDs are tracked and stable after move operations | |||
// (such as compaction), but not updates. | |||
// * 8: table metadata are present |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be 4?
Also, I'm thinking this should be a writer flag only, at least in practice. My use case in my head is that we will use this as configuration for various write operations (how often to auto-cleanup, compact, etc.). But reading the table properly should require interpreting these.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't 4 taken by FLAG_USE_V2_FORMAT_DEPRECATED
?: https://github.com/lancedb/lance/blob/main/rust/lance-table/src/feature_flags.rs#L18
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh okay. Well then we should add that to the proto docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to include docs for 4.
Also, I'm thinking this should be a writer flag only, at least in practice. My use case in my head is that we will use this as configuration for various write operations (how often to auto-cleanup, compact, etc.). But reading the table properly should require interpreting these.
Does anything have to change to accomodate this comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you move it to just be under the writer_feature_flags
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I think I've done this in e62bd82, but let me know if anything else needs editing to make this only apply to writers. I'm not sure if that comment in the proto file needs to change? The code in feature_flags.rs
+ existing comments in table.proto
makes it seem like the flag should still present be under reader_feature_flags
even if it only applies to writers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code in
feature_flags.rs
+ existing comments intable.proto
makes it seem like the flag should still present be underreader_feature_flags
even if it only applies to writers?
What do you mean by this? There are separate can_read_dataset
and can_write_dataset
methods that check the respective flags.
rust/lance/src/dataset.rs
Outdated
@@ -3039,6 +3089,53 @@ mod tests { | |||
assert!(fragments[0].metadata.deletion_file.is_some()); | |||
} | |||
|
|||
#[rstest] | |||
#[tokio::test] | |||
async fn test_table_metadata( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main test I would like to see is that if you have 5 concurrent writers each adding a different metadata key, that each of them can succeed and their write is reflected in the final result. Same for deletion. That would let us know that conflict resolution is working correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated lance::dataset::transaction::tests::test_conflicts
to include the new operations. Is this sufficient?
I couldn't immediately find any other existing tests that use concurrent writers to base additional testing on. Do you know of any tests that use concurrent writers iflance::dataset::transaction::tests::test_conflicts
is not sufficient? If no such tests exist yet, I would be tempted to add concurrent writer testing as a separate PR first before finishing off this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are testing the path where we detect conflicts, but not where we resolve them.
For example if you have two concurrent writers:
- Writer A: Set metadata x=1
- Writer B: Set metadata y=2
The final metadata should include both x=1
and y=2
. That's what I want tested.
Here is a test where we validate concurrent appends work right:
lance/rust/lance/src/io/commit.rs
Lines 838 to 839 in 9c361fe
#[tokio::test] | |
async fn test_concurrent_writes() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Life has been a bit hectic lately so I've only just had a chance to look at this. I've added some tests here.
34326f1
to
dc474ba
Compare
Updated to address all feedback so far. See above for responses to some individual items. In addition, I have the following comments:
|
732f30e
to
cbf43db
Compare
4055898
to
caf02f1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is "table metadata" the best name for this? It is a bit confusing because metadata is referenced in other locations (e.g. schema, version). Would a name like "config" be better?
That's a fair criticism. I'm good with calling it "config".
I have a SetMetadata operation. Should this perhaps be renamed to MergeMetadata (or MergeConfig if 1. is accepted) to make it clear we are merging with the existing table metadata?
I actually am thinking there's no good reason we shouldn't merge SetMetadata
and DeleteMetadata
into a single UpdateMetadata
transaction. Would make things a bit easier.
protos/table.proto
Outdated
@@ -85,6 +85,7 @@ message Manifest { | |||
// * 1: deletion files are present | |||
// * 2: move_stable_row_ids: row IDs are tracked and stable after move operations | |||
// (such as compaction), but not updates. | |||
// * 8: table metadata are present |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you move it to just be under the writer_feature_flags
?
protos/transaction.proto
Outdated
// An operation that sets table metadata. | ||
message SetMetadata { | ||
map<string, string> metadata = 1; | ||
} | ||
|
||
// An operation that deletes table metadata. | ||
message DeleteMetadata { | ||
repeated string metadata_keys = 1; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking maybe we should combine these?
// An operation that sets table metadata. | |
message SetMetadata { | |
map<string, string> metadata = 1; | |
} | |
// An operation that deletes table metadata. | |
message DeleteMetadata { | |
repeated string metadata_keys = 1; | |
} | |
// An operation that modifies table metadata. | |
message UpdateMetadata { | |
// Values that should be inserted or updated. | |
map<string, string> upsert_values = 1; | |
// Keys that should be removed. | |
repeated string delete_keys = 2; | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it. Should this API merge carry through to the other public table metadata methods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it has to, but if you want you could.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 5daaf16 - I decided to still keep the methods separate in dataset.rs
for now, but happy to merge if you would prefer.
Co-authored-by: Will Jones <willjones127@gmail.com>
77636dc
to
5daaf16
Compare
Applied in 5daaf16. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent work. Thanks for doing this!
Closes #2200
#2200 reference the concept of "table metadata". This PR uses the name "config" to avoid potential confusion with other uses of the word "metadata" throughout the Lance format.
This PR introduces:
table.proto
field calledconfig
.Dataset
methods:update_config
anddelete_config_keys
.config
field inManifest
with public methods for updating and deleting.UpdateConfig
along with conflict logic that returns an error if an operation mutates a key that is being upserted by another operation.FLAG_TABLE_CONFIG
.Dataset
methods, concurrent config updaters, and conflict resolution logic.