feat: add a new BlobFile API that can be used to read blob data #2983

Merged: 8 commits, Oct 15, 2024

westonpace
Contributor

No description provided.

@github-actions github-actions bot added the enhancement (New feature or request) and python labels on Oct 4, 2024
@westonpace
Contributor Author

This adds the flag load_blobs, which isn't yet configurable, but I imagine it would become a new scanner parameter. If True, blobs are loaded during the scan and returned as LargeBinary. If False, blobs are not loaded during the scan and are instead returned as descriptions.
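
To make the two read modes concrete, here is a minimal sketch in plain Rust. It is not the Lance API: `BlobDescription`, `BlobColumn`, and `scan_blob_column` are hypothetical names, and an in-memory byte slice stands in for the data file. The `position` and `size` fields are assumed from the description struct read later in this review.

```rust
/// Hypothetical stand-in for a blob description: where the bytes live and how long they are.
#[derive(Debug, Clone, PartialEq)]
struct BlobDescription {
    position: u64,
    size: u64,
}

/// What a scan of a blob column yields, depending on `load_blobs`.
#[derive(Debug, PartialEq)]
enum BlobColumn {
    /// load_blobs == true: the blob bytes themselves (LargeBinary in the PR).
    Loaded(Vec<Vec<u8>>),
    /// load_blobs == false: only the descriptions, no blob bytes are read.
    Descriptions(Vec<BlobDescription>),
}

/// Illustrative scan over one blob column. `storage` stands in for the file contents.
/// Panics if a description points outside `storage`; a real reader would return an error.
fn scan_blob_column(storage: &[u8], descs: &[BlobDescription], load_blobs: bool) -> BlobColumn {
    if load_blobs {
        let blobs = descs
            .iter()
            .map(|d| {
                let start = d.position as usize;
                let end = start + d.size as usize;
                storage[start..end].to_vec()
            })
            .collect();
        BlobColumn::Loaded(blobs)
    } else {
        BlobColumn::Descriptions(descs.to_vec())
    }
}
```

The enum makes the concern raised in the next paragraph visible: the same column produces two different output types depending on one flag.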

This all works pretty well, but it is a bit weird that the same column might have two different data types depending on how it is read. However, I also want to enable this for strings/lists at some point (to read all strings/lists as small or all strings/lists as large), so even though I find it slightly weird I am thinking it is OK. Review of the idea is welcome.

@westonpace
Contributor Author

I suppose another way we can tackle the issue is to create a virtual column __lance_blobdesc_{column_name}, which loads the blob descriptions, and then make sure that blob columns aren't included in the default (columns=None) case.
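
A rough sketch of how that alternative could behave, in plain Rust rather than the actual Lance code. `FieldInfo`, `blob_desc_column`, and `resolve_projection` are hypothetical names; only the `__lance_blobdesc_` prefix and the `columns=None` default come from the comment above.

```rust
/// Hypothetical field descriptor; `is_blob` marks blob-encoded columns.
struct FieldInfo {
    name: String,
    is_blob: bool,
}

/// Name of the virtual column that would expose a blob column's descriptions.
fn blob_desc_column(column_name: &str) -> String {
    format!("__lance_blobdesc_{}", column_name)
}

/// Resolve a projection. Explicitly requested columns are passed through unchanged;
/// `None` (the default) selects every column except blob columns.
fn resolve_projection(fields: &[FieldInfo], columns: Option<&[&str]>) -> Vec<String> {
    match columns {
        Some(cols) => cols.iter().map(|c| c.to_string()).collect(),
        None => fields
            .iter()
            .filter(|f| !f.is_blob)
            .map(|f| f.name.clone())
            .collect(),
    }
}
```

With this shape each column name always maps to a single data type: the blob column is always binary, and the virtual column is always the description struct.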

@wjones127 for second opinion

Comment on lines +548 to +559
    Ok(if self.id >= 0 {
        self.clone()
    } else {
        other.clone()
    })
}
Contributor Author

These changes are a little bit squirrelly, and the whole concept of an intersection that ignores data types is a little odd. I'm leaning towards the special column name at this point.

Contributor

@wjones127 wjones127 left a comment

One question on whether you are considering deletions in take.

Comment on lines +189 to +197
let description_and_addr = dataset
    .take_builder(row_ids, projection)?
    .with_row_address(true)
    .execute()
    .await?;
let descriptions = description_and_addr.column(0).as_struct();
let positions = descriptions.column(0).as_primitive::<UInt64Type>();
let sizes = descriptions.column(1).as_primitive::<UInt64Type>();
let row_addrs = description_and_addr.column(1).as_primitive::<UInt64Type>();
Contributor

Should we add an assert that they have the same number of rows? I forget whether take guarantees that. Consider the case where the row_id has been deleted.
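
For illustration, the check could look roughly like the sketch below. `check_take_row_count` is a hypothetical helper, not code from this PR, and it returns a plain `String` error instead of the crate's own error type.

```rust
/// Returns an error when a take produced a different number of rows than were requested,
/// for example because some row ids referred to deleted rows and were filtered out.
fn check_take_row_count(requested: usize, returned: usize) -> Result<(), String> {
    if requested == returned {
        Ok(())
    } else {
        Err(format!(
            "requested {} rows but take returned {}; some row ids may refer to deleted rows",
            requested, returned
        ))
    }
}
```

Returning an error rather than panicking keeps a deleted row id from aborting the caller, at the cost of one extra `?` at the call site.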

Contributor Author

Ah, good idea

Contributor Author

I added a test and added the assert. For now I just raise a NotImplementedError.

The current _take_rows behavior is to filter out deleted items, but I'm not sure that's ideal. It is impossible for the caller to know which IDs were deleted, which seems like it would be useful information.

We can either change the behavior to "fill with NULL" or we can make it configurable in a future PR.
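
The difference between the two behaviors can be sketched in plain Rust, with a `HashMap` standing in for the dataset. Both function names are hypothetical and neither is the real `_take_rows`.

```rust
use std::collections::HashMap;

/// Filtering behavior as described above: deleted ids are silently dropped,
/// so the output can be shorter than `ids` and no longer lines up with it.
fn take_filtering_deleted(rows: &HashMap<u64, String>, ids: &[u64]) -> Vec<String> {
    ids.iter().filter_map(|id| rows.get(id).cloned()).collect()
}

/// "Fill with NULL" alternative: one output slot per requested id,
/// with `None` marking ids that no longer exist.
fn take_filling_null(rows: &HashMap<u64, String>, ids: &[u64]) -> Vec<Option<String>> {
    ids.iter().map(|id| rows.get(id).cloned()).collect()
}
```

With the second form the caller can recover exactly which ids were deleted by looking at the positions of the `None` entries.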

@westonpace westonpace force-pushed the feat/blobs branch 3 times, most recently from 26c7fac to befe86f on October 15, 2024 14:03
@codecov-commenter

codecov-commenter commented Oct 15, 2024

Codecov Report

Attention: Patch coverage is 75.53444% with 103 lines in your changes missing coverage. Please review.

Project coverage is 78.22%. Comparing base (f60a6ce) to head (0e965ae).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance/src/dataset/blob.rs | 69.62% | 53 Missing and 12 partials ⚠️ |
| rust/lance/src/dataset/take.rs | 61.66% | 15 Missing and 8 partials ⚠️ |
| rust/lance-datafusion/src/projection.rs | 91.66% | 1 Missing and 2 partials ⚠️ |
| rust/lance-datagen/src/generator.rs | 91.42% | 3 Missing ⚠️ |
| rust/lance-core/src/datatypes/field.rs | 83.33% | 2 Missing ⚠️ |
| rust/lance/src/dataset.rs | 90.90% | 0 Missing and 2 partials ⚠️ |
| rust/lance-core/src/datatypes/schema.rs | 85.71% | 0 Missing and 1 partial ⚠️ |
| rust/lance-encoding/src/decoder.rs | 91.66% | 0 Missing and 1 partial ⚠️ |
| rust/lance-file/src/v2/reader.rs | 87.50% | 0 Missing and 1 partial ⚠️ |
| rust/lance/src/dataset/fragment.rs | 85.71% | 0 Missing and 1 partial ⚠️ |

... and 1 more

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2983      +/-   ##
==========================================
- Coverage   78.25%   78.22%   -0.03%     
==========================================
  Files         238      239       +1     
  Lines       76234    76594     +360     
  Branches    76234    76594     +360     
==========================================
+ Hits        59657    59917     +260     
- Misses      13549    13609      +60     
- Partials     3028     3068      +40     

| Flag | Coverage Δ |
|---|---|
| unittests | 78.22% <75.53%> (-0.03%) ⬇️ |

@westonpace westonpace merged commit 19d947e into lancedb:main Oct 15, 2024
22 checks passed