feat: add a new BlobFile API that can be used to read blob data #2983

westonpace · 2024-10-04T23:00:57Z

No description provided.

westonpace · 2024-10-04T23:03:02Z

This adds the load_blobs flag, which isn't configurable yet, but I imagine it will become a new scanner parameter. If True, blobs are loaded during the scan and returned as LargeBinary. If False, blobs are not loaded during the scan and are instead returned as descriptions.

This all works pretty well, but it is a bit weird that the same column might have two different data types depending on how it is read. However, I also want to enable this for strings/lists at some point (to read all strings/lists as small or all strings/lists as large), so even though I find it slightly weird I am thinking it is OK? Review of the idea is welcome.
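To make the two read modes concrete, here is a minimal sketch of the idea, not the code in this PR. The names BlobColumn, BlobDescription and scan_blobs are hypothetical, and a byte slice stands in for the data file. The only detail taken from the PR itself is that a description carries a position and a size.

```rust
/// Hypothetical description of one blob: where it lives and how long it is.
#[derive(Debug, Clone, Copy, PartialEq)]
struct BlobDescription {
    position: u64,
    size: u64,
}

/// Hypothetical stand-in for the two shapes the same column can take.
#[derive(Debug, PartialEq)]
enum BlobColumn {
    /// load_blobs = true: the bytes themselves (LargeBinary in the PR).
    Loaded(Vec<Vec<u8>>),
    /// load_blobs = false: one description per row, no blob bytes copied.
    Descriptions(Vec<BlobDescription>),
}

/// Toy scan over `file`, a byte slice standing in for the data file.
/// Panics if a description points outside of `file`.
fn scan_blobs(file: &[u8], descriptions: &[BlobDescription], load_blobs: bool) -> BlobColumn {
    if load_blobs {
        let blobs = descriptions
            .iter()
            .map(|d| {
                let start = d.position as usize;
                let end = start + d.size as usize;
                file[start..end].to_vec()
            })
            .collect();
        BlobColumn::Loaded(blobs)
    } else {
        BlobColumn::Descriptions(descriptions.to_vec())
    }
}
```

The return type changing with a boolean is exactly the "same column, two data types" oddity described above; the enum just makes it visible.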

westonpace · 2024-10-04T23:41:23Z

I suppose another way to tackle the issue is to create a virtual column __lance_blobdesc_{column_name} that loads the blob descriptions, and then make sure that blob columns aren't included in the default (columns=None) case.

@wjones127 for second opinion
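A rough sketch of how the virtual column alternative could resolve projections, assuming a toy schema type. Column, resolve_projection and blob_column_for are hypothetical names for illustration only; the one thing taken from the comment above is the __lance_blobdesc_ prefix and the rule that blob columns are left out when no columns are requested.

```rust
/// Prefix for the virtual description column, as proposed in the comment above.
const BLOB_DESC_PREFIX: &str = "__lance_blobdesc_";

/// Hypothetical, minimal schema entry.
struct Column {
    name: String,
    is_blob: bool,
}

/// Name of the virtual description column for a blob column.
fn blob_desc_column(column_name: &str) -> String {
    format!("{}{}", BLOB_DESC_PREFIX, column_name)
}

/// Resolve a projection. `None` mirrors columns=None: every column except blobs.
/// An explicit list is passed through unchanged.
fn resolve_projection(schema: &[Column], columns: Option<&[&str]>) -> Vec<String> {
    match columns {
        None => schema
            .iter()
            .filter(|c| !c.is_blob)
            .map(|c| c.name.clone())
            .collect(),
        Some(requested) => requested.iter().map(|name| name.to_string()).collect(),
    }
}

/// If `name` is the description column of a blob column in `schema`,
/// return the name of the underlying blob column.
fn blob_column_for<'a>(schema: &[Column], name: &'a str) -> Option<&'a str> {
    let base = name.strip_prefix(BLOB_DESC_PREFIX)?;
    schema
        .iter()
        .any(|c| c.is_blob && c.name == base)
        .then_some(base)
}
```

With this shape each column name maps to exactly one data type, at the cost of reserving a name prefix.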

westonpace · 2024-10-05T03:13:49Z

rust/lance-core/src/datatypes/field.rs

+        Ok(if self.id >= 0 {
+            self.clone()
+        } else {
+            other.clone()
+        })
+    }


These changes are a little bit squirrel-y, and the whole concept of intersection ignoring data types is a little odd... I'm leaning towards the special column name at this point.

wjones127

One question on whether you are considering deletions in take.

wjones127 · 2024-10-14T18:40:35Z

rust/lance/src/dataset/blob.rs

+    let description_and_addr = dataset
+        .take_builder(row_ids, projection)?
+        .with_row_address(true)
+        .execute()
+        .await?;
+    let descriptions = description_and_addr.column(0).as_struct();
+    let positions = descriptions.column(0).as_primitive::<UInt64Type>();
+    let sizes = descriptions.column(1).as_primitive::<UInt64Type>();
+    let row_addrs = description_and_addr.column(1).as_primitive::<UInt64Type>();


Should we add an assert that they have the same number of rows? I forget whether take guarantees that. Consider the case where the row_id has been deleted.

westonpace · 2024-10-14

Ah, good idea

westonpace · 2024-10-15

I added a test and the assert. For now I just raise a NotImplementedError.

The current _take_rows behavior is to filter out deleted items, but I'm not sure that's ideal. The caller has no way to know which IDs were deleted, which seems like it would be useful information.

We can either change the behavior to "fill with NULL" or we can make it configurable in a future PR.
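For reference, a small sketch of the three behaviors being discussed, using a HashMap as a stand-in for the dataset. DeletedRowPolicy and take_rows_sketch are hypothetical names, not Lance APIs. A row id missing from the map plays the role of a deleted row.

```rust
use std::collections::HashMap;

/// Hypothetical policies for row ids that no longer exist.
#[derive(Debug, Clone, Copy)]
enum DeletedRowPolicy {
    /// Silently drop deleted rows (output may be shorter than the input).
    Filter,
    /// Keep one output slot per requested id and fill deleted rows with None.
    FillNull,
    /// Refuse the whole request if any id was deleted.
    Error,
}

/// Toy take: `table` maps live row ids to values.
fn take_rows_sketch(
    table: &HashMap<u64, String>,
    row_ids: &[u64],
    policy: DeletedRowPolicy,
) -> Result<Vec<Option<String>>, String> {
    let mut out = Vec::with_capacity(row_ids.len());
    for id in row_ids {
        match (table.get(id), policy) {
            (Some(value), _) => out.push(Some(value.clone())),
            (None, DeletedRowPolicy::Filter) => {}
            (None, DeletedRowPolicy::FillNull) => out.push(None),
            (None, DeletedRowPolicy::Error) => {
                return Err(format!("row id {} was deleted", id));
            }
        }
    }
    Ok(out)
}
```

Under Filter the output can be shorter than row_ids, which is why the blob code needs the row-count assert. Under FillNull the lengths always match and the caller can see which ids were deleted.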

codecov-commenter · 2024-10-15T14:28:03Z

Codecov Report

Attention: Patch coverage is 75.53444% with 103 lines in your changes missing coverage. Please review.

Project coverage is 78.22%. Comparing base (f60a6ce) to head (0e965ae).

Files with missing lines	Patch %	Lines
rust/lance/src/dataset/blob.rs	69.62%	53 Missing and 12 partials ⚠️
rust/lance/src/dataset/take.rs	61.66%	15 Missing and 8 partials ⚠️
rust/lance-datafusion/src/projection.rs	91.66%	1 Missing and 2 partials ⚠️
rust/lance-datagen/src/generator.rs	91.42%	3 Missing ⚠️
rust/lance-core/src/datatypes/field.rs	83.33%	2 Missing ⚠️
rust/lance/src/dataset.rs	90.90%	0 Missing and 2 partials ⚠️
rust/lance-core/src/datatypes/schema.rs	85.71%	0 Missing and 1 partial ⚠️
rust/lance-encoding/src/decoder.rs	91.66%	0 Missing and 1 partial ⚠️
rust/lance-file/src/v2/reader.rs	87.50%	0 Missing and 1 partial ⚠️
rust/lance/src/dataset/fragment.rs	85.71%	0 Missing and 1 partial ⚠️
... and 1 more

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2983      +/-   ##
==========================================
- Coverage   78.25%   78.22%   -0.03%     
==========================================
  Files         238      239       +1     
  Lines       76234    76594     +360     
  Branches    76234    76594     +360     
==========================================
+ Hits        59657    59917     +260     
- Misses      13549    13609      +60     
- Partials     3028     3068      +40

Flag	Coverage Δ
unittests	`78.22% <75.53%> (-0.03%)`	⬇️


github-actions bot added the enhancement and python labels Oct 4, 2024

westonpace commented Oct 5, 2024

wjones127 approved these changes Oct 14, 2024

westonpace force-pushed the feat/blobs branch 3 times, most recently from 26c7fac to befe86f Compare October 15, 2024 14:03

westonpace added 7 commits October 15, 2024 09:30

Add a new BlobFile API that can be used to read blob data using a file-like API

76b4e2c

Make the unit test pass

90342eb

Fix a few bugs and add a test for blob seek

b8cf69d

WIP

0887e59

Add error if take tries to take deleted rows

f581543

Fix issue where schema was passed into take without correct field ids

8e4c253

Fix test error message expectation

80086f0

westonpace force-pushed the feat/blobs branch from df9bcb6 to 80086f0 Compare October 15, 2024 16:30

Fix incorrect import introduced during rebase

0e965ae

westonpace merged commit 19d947e into lancedb:main Oct 15, 2024
22 checks passed
