feat: support add all null column as metadata-only operation via sql #3504
Conversation
#[test]
fn test_new_column_sql_to_all_nulls_transform_optimizer() {
    // TODO: write a test to ensure the optimizer for all null sql gets used
If we're OK with the general approach in this PR, I'm planning to fill in this test before we merge
tiny nit: can we add a python test?
Seems like a good approach.
The approach seems fine. A few thoughts, but nothing concerning. Thanks for digging into this.
fn has_legacy_files(fragments: &[FileFragment]) -> bool {
    let has_legacy_files = fragments
        .iter()
        .map(|f| &f.metadata)
        .flat_map(|fragment_meta| fragment_meta.files.iter())
        .any(|file_meta| {
            matches!(
                LanceFileVersion::try_from_major_minor(
                    file_meta.file_major_version,
                    file_meta.file_minor_version
                ),
                Ok(LanceFileVersion::Legacy)
            )
        });

    !has_legacy_files
}
Might be easier to just use dataset.is_legacy_storage().
Thanks. I wasn't aware of that method.
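For illustration, here is a self-contained sketch of the version-check logic in the snippet above, using hypothetical stand-in types (`FileMeta`, `FileVersion`) in place of Lance's `FileFragment` and `LanceFileVersion`; the real major/minor-to-version mapping may differ. Note that this helper returns `true` when legacy files are present, matching its name, whereas the quoted snippet returns the negation.

```rust
// Hypothetical stand-ins for Lance's file metadata types; the real
// `LanceFileVersion::try_from_major_minor` has more variants.
#[derive(Debug, PartialEq)]
enum FileVersion {
    Legacy,
    V2_0,
}

struct FileMeta {
    major: u32,
    minor: u32,
}

impl FileVersion {
    // Assumption for this sketch: (0, _) is the legacy format,
    // everything else maps to 2.0.
    fn try_from_major_minor(major: u32, minor: u32) -> Result<FileVersion, String> {
        match (major, minor) {
            (0, _) => Ok(FileVersion::Legacy),
            _ => Ok(FileVersion::V2_0),
        }
    }
}

// Same shape as the quoted snippet: true if any file is legacy-format.
fn has_legacy_files(files: &[FileMeta]) -> bool {
    files.iter().any(|f| {
        matches!(
            FileVersion::try_from_major_minor(f.major, f.minor),
            Ok(FileVersion::Legacy)
        )
    })
}

fn main() {
    let mixed = [FileMeta { major: 0, minor: 1 }, FileMeta { major: 2, minor: 0 }];
    let modern = [FileMeta { major: 2, minor: 0 }];
    assert!(has_legacy_files(&mixed));
    assert!(!has_legacy_files(&modern));
    println!("ok");
}
```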
/// Optimizes a `NewColumnTransform` into a more efficient form.
pub(super) trait NewColumnTransformOptimizer: Send + Sync {
    /// Optimize the passed `NewColumnTransform` to a more efficient form.
    fn optimize(
        &self,
        dataset: &Dataset,
        transform: NewColumnTransform,
    ) -> Result<NewColumnTransform>;
}

/// A `NewColumnTransformOptimizer` that chains multiple `NewColumnTransformOptimizer`s together.
pub(super) struct ChainedNewColumnTransformOptimizer {
    optimizers: Vec<Box<dyn NewColumnTransformOptimizer>>,
}

impl ChainedNewColumnTransformOptimizer {
    pub(super) fn new(optimizers: Vec<Box<dyn NewColumnTransformOptimizer>>) -> Self {
        Self { optimizers }
    }

    pub(super) fn add_optimizer(&mut self, optimizer: Box<dyn NewColumnTransformOptimizer>) {
        self.optimizers.push(optimizer);
    }
}
I'm not opposed to this pattern but it seems slightly heavier than we need for this particular fix. Do you anticipate additional optimizers?
Yeah, I recognize that. I went back and forth on how/where to add the actual core optimization logic. My thinking was that the alternatives could have been:
- Run the optimization inline in add_columns_to_fragments, but that method is already pretty long, so it would be better if the optimization were extracted somewhere else.
- Do the optimization higher up in the call stack, but I thought there might eventually be additional context we compute just before applying the transform, so it's best to keep the optimization near where that context would be available.
- Add some ad-hoc functions instead of the trait. But my thinking was that if we eventually add more optimizations to the transform, then without some structure around where those functions get invoked, figuring out the actual state of the transformation could get messy.

I don't have concrete plans to add more optimizers, but I think it's possible we could in the future. As a hypothetical example, maybe there are additional SQL patterns that we could recognize and rewrite as a more optimal UDF. Given that this could happen, I thought it was worthwhile to put at least a bit of structure around how these optimizers are organized.
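To make the chaining pattern concrete, here is a minimal self-contained sketch. The `Transform` enum, the `SqlToAllNullsOptimizer`, and the field names are hypothetical simplifications of Lance's `NewColumnTransform` and the optimizer added in this PR, not the real API.

```rust
// Simplified stand-in for `NewColumnTransform`: either a SQL expression
// or a metadata-only "all nulls" column addition.
#[derive(Debug, Clone, PartialEq)]
enum Transform {
    Sql { name: String, expr: String },
    AllNulls { name: String },
}

trait TransformOptimizer {
    fn optimize(&self, transform: Transform) -> Transform;
}

// Runs each optimizer in order, feeding the output of one into the next.
struct ChainedOptimizer {
    optimizers: Vec<Box<dyn TransformOptimizer>>,
}

impl ChainedOptimizer {
    fn new(optimizers: Vec<Box<dyn TransformOptimizer>>) -> Self {
        Self { optimizers }
    }

    fn add_optimizer(&mut self, optimizer: Box<dyn TransformOptimizer>) {
        self.optimizers.push(optimizer);
    }

    fn optimize(&self, transform: Transform) -> Transform {
        self.optimizers
            .iter()
            .fold(transform, |t, opt| opt.optimize(t))
    }
}

// Hypothetical optimizer: a bare NULL literal becomes a metadata-only op.
struct SqlToAllNullsOptimizer;

impl TransformOptimizer for SqlToAllNullsOptimizer {
    fn optimize(&self, transform: Transform) -> Transform {
        match transform {
            Transform::Sql { name, expr } if expr.trim().eq_ignore_ascii_case("null") => {
                Transform::AllNulls { name }
            }
            other => other,
        }
    }
}

fn main() {
    let mut chain = ChainedOptimizer::new(vec![]);
    chain.add_optimizer(Box::new(SqlToAllNullsOptimizer));
    let out = chain.optimize(Transform::Sql {
        name: "new_col".into(),
        expr: "NULL".into(),
    });
    assert_eq!(out, Transform::AllNulls { name: "new_col".into() });
    println!("ok");
}
```

The `fold` makes the ordering explicit: each optimizer sees the output of the previous one, which is the "actual state of the transformation" concern mentioned above.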
// Optimize the transforms
let mut optimizer = ChainedNewColumnTransformOptimizer::new(vec![]);
if has_legacy_files {
Should this be !has_legacy_files? Or am I misunderstanding the above comment?
Yeah, there should have been a ! here. Caught this after adding tests.
Codecov Report
Attention: Patch coverage is

Additional details and impacted files:

@@ Coverage Diff @@
## main #3504 +/- ##
==========================================
+ Coverage 78.44% 78.49% +0.04%
==========================================
Files 252 253 +1
Lines 94044 94262 +218
Branches 94044 94262 +218
==========================================
+ Hits 73773 73989 +216
+ Misses 17275 17273 -2
- Partials 2996 3000 +4

Flags with carried forward coverage won't be shown.
Adds support for adding an all-null column via SQL.
If the user passes:
We'll discover that the intention is to create an all-null column, and optimize the transform to:
The motivation here is to be able to expose the capability to add an all-null column as a metadata-only operation through the LanceDB SDKs. Currently these methods only support passing SQL expressions. A different option would have been to modify the arguments to the Python table.add_column and TypeScript table.addColumn, but that seemed like more work, so I wanted to propose this solution first.
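The SQL examples in the description above were not captured in this page, so purely as an illustration: assuming the user's expression looks like `NULL` or `CAST(NULL AS <type>)` (the exact patterns the PR recognizes may differ), the recognition step could be sketched as a predicate like this. A real implementation would match on a parsed expression tree (e.g. a DataFusion `Expr`) rather than on strings.

```rust
// Hypothetical recognizer: does a SQL expression denote an all-null
// column? String matching is used only to keep the sketch small.
fn is_all_null_expr(expr: &str) -> bool {
    let e = expr.trim().to_ascii_lowercase();
    e == "null" || (e.starts_with("cast(null as ") && e.ends_with(')'))
}

fn main() {
    assert!(is_all_null_expr("NULL"));
    assert!(is_all_null_expr("CAST(NULL AS int64)"));
    assert!(!is_all_null_expr("a + 1"));
    println!("ok");
}
```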