Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add table config #2820

Merged
merged 7 commits into from
Oct 14, 2024
Merged

Conversation

dsgibbons
Copy link
Contributor

@dsgibbons dsgibbons commented Sep 3, 2024

Closes #2200

#2200 reference the concept of "table metadata". This PR uses the name "config" to avoid potential confusion with other uses of the word "metadata" throughout the Lance format.

This PR introduces:

  • A new table.proto field called config.
  • New Dataset methods: update_config and delete_config_keys.
  • config field in Manifest with public methods for updating and deleting.
  • A new transaction operation UpdateConfig along with conflict logic that returns an error if an operation mutates a key that is being upserted by another operation.
  • A new writer feature flag called FLAG_TABLE_CONFIG.
  • Unit tests for new Dataset methods, concurrent config updaters, and conflict resolution logic.

@github-actions github-actions bot added the enhancement New feature or request label Sep 3, 2024
@codecov-commenter
Copy link

codecov-commenter commented Sep 3, 2024

Codecov Report

Attention: Patch coverage is 92.38329% with 31 lines in your changes missing coverage. Please review.

Project coverage is 78.86%. Comparing base (cdac5de) to head (e62bd82).

Files with missing lines Patch % Lines
rust/lance/src/dataset/transaction.rs 91.35% 15 Missing and 1 partial ⚠️
rust/lance/src/io/commit.rs 87.93% 0 Missing and 14 partials ⚠️
rust/lance/src/dataset.rs 98.03% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2820      +/-   ##
==========================================
+ Coverage   78.77%   78.86%   +0.08%     
==========================================
  Files         237      237              
  Lines       74099    74482     +383     
  Branches    74099    74482     +383     
==========================================
+ Hits        58374    58737     +363     
- Misses      12703    12721      +18     
- Partials     3022     3024       +2     
Flag Coverage Δ
unittests 78.86% <92.38%> (+0.08%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@wjones127 wjones127 self-requested a review September 3, 2024 15:07
Copy link
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this. This seems to be in the direction I was thinking. I've left some guiding comments.

@@ -85,6 +85,7 @@ message Manifest {
// * 1: deletion files are present
// * 2: move_stable_row_ids: row IDs are tracked and stable after move operations
// (such as compaction), but not updates.
// * 8: table metadata are present
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be 4?

Also, I'm thinking this should be a writer flag only, at least in practice. My use case in my head is that we will use this as configuration for various write operations (how often to auto-cleanup, compact, etc.). But reading the table properly should require interpreting these.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh okay. Well then we should add that to the proto docs.

Copy link
Contributor Author

@dsgibbons dsgibbons Sep 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to include docs for 4.

Also, I'm thinking this should be a writer flag only, at least in practice. My use case in my head is that we will use this as configuration for various write operations (how often to auto-cleanup, compact, etc.). But reading the table properly should require interpreting these.

Does anything have to change to accomodate this comment?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move it to just be under the writer_feature_flags?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I think I've done this in e62bd82, but let me know if anything else needs editing to make this only apply to writers. I'm not sure if that comment in the proto file needs to change? The code in feature_flags.rs + existing comments in table.proto makes it seem like the flag should still present be under reader_feature_flags even if it only applies to writers?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code in feature_flags.rs + existing comments in table.proto makes it seem like the flag should still present be under reader_feature_flags even if it only applies to writers?

What do you mean by this? There are separate can_read_dataset and can_write_dataset methods that check the respective flags.

@@ -3039,6 +3089,53 @@ mod tests {
assert!(fragments[0].metadata.deletion_file.is_some());
}

#[rstest]
#[tokio::test]
async fn test_table_metadata(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main test I would like to see is that if you have 5 concurrent writers each adding a different metadata key, that each of them can succeed and their write is reflected in the final result. Same for deletion. That would let us know that conflict resolution is working correctly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated lance::dataset::transaction::tests::test_conflicts to include the new operations. Is this sufficient?

I couldn't immediately find any other existing tests that use concurrent writers to base additional testing on. Do you know of any tests that use concurrent writers iflance::dataset::transaction::tests::test_conflicts is not sufficient? If no such tests exist yet, I would be tempted to add concurrent writer testing as a separate PR first before finishing off this one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are testing the path where we detect conflicts, but not where we resolve them.

For example if you have two concurrent writers:

  • Writer A: Set metadata x=1
  • Writer B: Set metadata y=2

The final metadata should include both x=1 and y=2. That's what I want tested.

Here is a test where we validate concurrent appends work right:

#[tokio::test]
async fn test_concurrent_writes() {

Copy link
Contributor Author

@dsgibbons dsgibbons Oct 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Life has been a bit hectic lately so I've only just had a chance to look at this. I've added some tests here.

@dsgibbons
Copy link
Contributor Author

dsgibbons commented Sep 15, 2024

Updated to address all feedback so far. See above for responses to some individual items. In addition, I have the following comments:

  1. Is "table metadata" the best name for this? It is a bit confusing because metadata is referenced in other locations (e.g. schema, version). Would a name like "config" be better?
  2. I have a SetMetadata operation. Should this perhaps be renamed to MergeMetadata (or MergeConfig if 1. is accepted) to make it clear we are merging with the existing table metadata?

Copy link
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "table metadata" the best name for this? It is a bit confusing because metadata is referenced in other locations (e.g. schema, version). Would a name like "config" be better?

That's a fair criticism. I'm good with calling it "config".

I have a SetMetadata operation. Should this perhaps be renamed to MergeMetadata (or MergeConfig if 1. is accepted) to make it clear we are merging with the existing table metadata?

I actually am thinking there's no good reason we shouldn't merge SetMetadata and DeleteMetadata into a single UpdateMetadata transaction. Would make things a bit easier.

@@ -85,6 +85,7 @@ message Manifest {
// * 1: deletion files are present
// * 2: move_stable_row_ids: row IDs are tracked and stable after move operations
// (such as compaction), but not updates.
// * 8: table metadata are present
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move it to just be under the writer_feature_flags?

Comment on lines 162 to 170
// An operation that sets table metadata.
message SetMetadata {
map<string, string> metadata = 1;
}

// An operation that deletes table metadata.
message DeleteMetadata {
repeated string metadata_keys = 1;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking maybe we should combine these?

Suggested change
// An operation that sets table metadata.
message SetMetadata {
map<string, string> metadata = 1;
}
// An operation that deletes table metadata.
message DeleteMetadata {
repeated string metadata_keys = 1;
}
// An operation that modifies table metadata.
message UpdateMetadata {
// Values that should be inserted or updated.
map<string, string> upsert_values = 1;
// Keys that should be removed.
repeated string delete_keys = 2;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it. Should this API merge carry through to the other public table metadata methods?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it has to, but if you want you could.

Copy link
Contributor Author

@dsgibbons dsgibbons Oct 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 5daaf16 - I decided to still keep the methods separate in dataset.rs for now, but happy to merge if you would prefer.

@dsgibbons dsgibbons force-pushed the feat/table-metadata branch from 77636dc to 5daaf16 Compare October 12, 2024 13:42
@dsgibbons
Copy link
Contributor Author

That's a fair criticism. I'm good with calling it "config".

Applied in 5daaf16.

@dsgibbons dsgibbons changed the title feat: add table_metadata field to Manifest feat: add table config Oct 12, 2024
@dsgibbons dsgibbons marked this pull request as ready for review October 12, 2024 14:07
@wjones127 wjones127 self-requested a review October 12, 2024 21:54
Copy link
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent work. Thanks for doing this!

@wjones127 wjones127 merged commit 5c8f565 into lancedb:main Oct 14, 2024
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a table metadata field to table format
3 participants