feat: support add all null column as metadata-only operation via sql #3504

Merged
merged 5 commits into from
Mar 3, 2025

Conversation

albertlockett
Contributor

Adds support for adding all-null column via SQL.

If the user passes:

dataset.add_column(NewColumnTransform::SqlExpressions(vec![("new_col".into(), "CAST(NULL AS int)".into())]));

We'll detect that the intention is to create an all-null column, and optimize the transform to:

dataset.add_column(NewColumnTransform::AllNulls(Arc::new(
  Schema::new(vec![
    Field::new("new_col", DataType::Int32, true),
  ])
)));

The motivation here is to expose the ability to add an all-null column as a metadata-only operation through the LanceDB SDKs. Currently these methods only support passing SQL expressions. A different option would have been to modify the arguments to the Python table.add_column and TypeScript table.addColumn, but that seemed like more work, so I wanted to propose this solution first.
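For readers who want to see the idea concretely, here is a minimal, self-contained sketch of the rewrite described above. It is not the code in this PR: `SketchTransform`, `SketchType`, `parse_null_cast`, and `optimize_all_null` are hypothetical stand-ins, the expression is matched with plain string handling rather than a real SQL parser, and every type-name mapping other than `int` to `Int32` is an assumption.

```rust
// Hypothetical stand-ins for illustration; these are not Lance's real types.
#[derive(Debug, Clone, PartialEq)]
pub enum SketchType {
    Int32,
    Int64,
    Float64,
    Utf8,
}

#[derive(Debug, Clone, PartialEq)]
pub enum SketchTransform {
    /// (column name, SQL expression) pairs.
    SqlExpressions(Vec<(String, String)>),
    /// (column name, type) pairs for columns that are entirely null.
    AllNulls(Vec<(String, SketchType)>),
}

/// Returns the target type if `expr` has the shape `CAST(NULL AS <type>)`.
/// The type names accepted here are assumptions made for this sketch.
fn parse_null_cast(expr: &str) -> Option<SketchType> {
    let lower = expr.trim().to_ascii_lowercase();
    let inner = lower.strip_prefix("cast(")?.strip_suffix(')')?;
    let (value, ty) = inner.split_once(" as ")?;
    if value.trim() != "null" {
        return None;
    }
    match ty.trim() {
        "int" | "integer" => Some(SketchType::Int32),
        "bigint" => Some(SketchType::Int64),
        "double" => Some(SketchType::Float64),
        "string" | "varchar" | "text" => Some(SketchType::Utf8),
        _ => None,
    }
}

/// Rewrites the transform only when every expression is a null cast;
/// any other input is returned unchanged.
pub fn optimize_all_null(transform: SketchTransform) -> SketchTransform {
    match transform {
        SketchTransform::SqlExpressions(exprs) => {
            let fields: Option<Vec<(String, SketchType)>> = exprs
                .iter()
                .map(|(name, expr)| parse_null_cast(expr).map(|ty| (name.clone(), ty)))
                .collect();
            match fields {
                Some(fields) if !fields.is_empty() => SketchTransform::AllNulls(fields),
                _ => SketchTransform::SqlExpressions(exprs),
            }
        }
        other => other,
    }
}
```

The rewrite is all-or-nothing in this sketch: if even one expression is not a null cast, the SQL transform is kept as-is, since a mixed set of columns still needs the data files to be rewritten.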

@albertlockett albertlockett marked this pull request as draft March 3, 2025 16:11
@github-actions github-actions bot added the enhancement New feature or request label Mar 3, 2025

#[test]
fn test_new_column_sql_to_all_nulls_transform_optimizer() {
    // TODO: write a test to ensure the optimizer for all null sql gets used
Contributor Author

If we're OK with the general approach in this PR, I'm planning to fill in this test before we merge

Contributor

@chebbyChefNEQ chebbyChefNEQ left a comment

tiny nit: can we add a Python test?

Contributor

@wjones127 wjones127 left a comment

Seems like a good approach.

Contributor

@westonpace westonpace left a comment

The approach seems fine. A few thoughts but nothing concerning. Thanks for digging into this

Comment on lines 298 to 314
fn has_legacy_files(fragments: &[FileFragment]) -> bool {
    let has_legacy_files = fragments
        .iter()
        .map(|f| &f.metadata)
        .flat_map(|fragment_meta| fragment_meta.files.iter())
        .any(|file_meta| {
            matches!(
                LanceFileVersion::try_from_major_minor(
                    file_meta.file_major_version,
                    file_meta.file_minor_version
                ),
                Ok(LanceFileVersion::Legacy)
            )
        });

    !has_legacy_files
}
Contributor

Might be easier to just use dataset.is_legacy_storage().

Contributor Author

Thanks. I wasn't aware of that method

Comment on lines +13 to +36
/// Optimizes a `NewColumnTransform` into
pub(super) trait NewColumnTransformOptimizer: Send + Sync {
    /// Optimize the passed `NewColumnTransform` to a more efficient form.
    fn optimize(
        &self,
        dataset: &Dataset,
        transform: NewColumnTransform,
    ) -> Result<NewColumnTransform>;
}

/// A `NewColumnTransformOptimizer` that chains multiple `NewColumnTransformOptimizer`s together.
pub(super) struct ChainedNewColumnTransformOptimizer {
    optimizers: Vec<Box<dyn NewColumnTransformOptimizer>>,
}

impl ChainedNewColumnTransformOptimizer {
    pub(super) fn new(optimizers: Vec<Box<dyn NewColumnTransformOptimizer>>) -> Self {
        Self { optimizers }
    }

    pub(super) fn add_optimizer(&mut self, optimizer: Box<dyn NewColumnTransformOptimizer>) {
        self.optimizers.push(optimizer);
    }
}
Contributor

I'm not opposed to this pattern but it seems slightly heavier than we need for this particular fix. Do you anticipate additional optimizers?

Contributor Author

@albertlockett albertlockett Mar 3, 2025

Yeah, I recognize that. I went back and forth on how and where to add the core optimization logic.

My thinking was that the alternatives could have been:

  • Run the optimization logic inside add_columns_to_fragments, but that method is already pretty long, so it would be better to extract the optimization somewhere else.
  • Do the optimization higher up in the call stack, but I thought there might eventually be additional context we compute just before applying the transform, so it's best to keep the optimization near where that context could be available.
  • Add some ad-hoc functions instead of the trait. But if we eventually make additional optimizations to the transform, then without some structure around where those functions get invoked, figuring out the actual state of the transformation could get messy.

I don't have concrete plans to add more optimizers, but I think it's possible we could in the future. As a hypothetical example, maybe there are additional SQL patterns that we could recognize and write as a more optimal UDF. Given that this could happen, I thought it might be worthwhile to at least try to have a bit of structure around how these optimizers are organized.
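To make the trade-off in this thread easier to picture, here is a minimal sketch of how a chained optimizer of this shape could apply its members in order. The snippet quoted above does not show the chain's own `optimize`, so the `try_fold` below is an assumption about one reasonable way to write it. `SketchTransform`, `SketchDataset`, `SketchOptimizer`, `TrimOptimizer`, and `NullCastOptimizer` are all hypothetical stand-ins, and errors are plain `String`s to keep the example dependency-free.

```rust
// Hypothetical stand-ins for illustration; not the types used in the PR.
#[derive(Debug, Clone, PartialEq)]
pub enum SketchTransform {
    Sql(String),
    AllNulls,
}

pub struct SketchDataset;

pub trait SketchOptimizer: Send + Sync {
    /// Returns an equivalent transform that is cheaper to apply, or the input unchanged.
    fn optimize(
        &self,
        dataset: &SketchDataset,
        transform: SketchTransform,
    ) -> Result<SketchTransform, String>;
}

pub struct ChainedSketchOptimizer {
    optimizers: Vec<Box<dyn SketchOptimizer>>,
}

impl ChainedSketchOptimizer {
    pub fn new(optimizers: Vec<Box<dyn SketchOptimizer>>) -> Self {
        Self { optimizers }
    }

    pub fn add_optimizer(&mut self, optimizer: Box<dyn SketchOptimizer>) {
        self.optimizers.push(optimizer);
    }
}

impl SketchOptimizer for ChainedSketchOptimizer {
    /// Feeds the output of each optimizer into the next, stopping at the first error.
    fn optimize(
        &self,
        dataset: &SketchDataset,
        transform: SketchTransform,
    ) -> Result<SketchTransform, String> {
        self.optimizers
            .iter()
            .try_fold(transform, |current, opt| opt.optimize(dataset, current))
    }
}

/// Example member: strips surrounding whitespace from SQL text.
pub struct TrimOptimizer;

impl SketchOptimizer for TrimOptimizer {
    fn optimize(&self, _: &SketchDataset, t: SketchTransform) -> Result<SketchTransform, String> {
        Ok(match t {
            SketchTransform::Sql(sql) => SketchTransform::Sql(sql.trim().to_string()),
            other => other,
        })
    }
}

/// Example member: recognizes one literal null cast and rewrites it.
pub struct NullCastOptimizer;

impl SketchOptimizer for NullCastOptimizer {
    fn optimize(&self, _: &SketchDataset, t: SketchTransform) -> Result<SketchTransform, String> {
        Ok(match t {
            SketchTransform::Sql(sql) if sql.eq_ignore_ascii_case("CAST(NULL AS int)") => {
                SketchTransform::AllNulls
            }
            other => other,
        })
    }
}
```

Because the chain implements the same trait as its members, the call site only ever sees one optimizer, and adding a new pattern later means pushing another boxed member rather than editing the call site.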


// Optimize the transforms
let mut optimizer = ChainedNewColumnTransformOptimizer::new(vec![]);
if has_legacy_files {
Contributor

Should this be !has_legacy_files? Or am I misunderstanding the above comment?

Contributor Author

Yeah, there should have been a ! here. Caught this after adding tests
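A small sketch of the guard discussed in this thread, assuming (as the reply above confirms) that the SQL-to-all-nulls optimizer should only be enabled when no data file is in the legacy format. `SketchFileVersion`, `SketchFragment`, `enabled_optimizers`, and the optimizer name string are hypothetical stand-ins, not Lance types. The helper is named for exactly what it returns, so the negation appears once, at the call site.

```rust
// Hypothetical stand-ins for illustration; not Lance's real types.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SketchFileVersion {
    Legacy,
    Stable,
}

pub struct SketchFragment {
    /// Format version of each data file in the fragment.
    pub file_versions: Vec<SketchFileVersion>,
}

/// True when at least one data file in any fragment uses the legacy format.
pub fn has_legacy_files(fragments: &[SketchFragment]) -> bool {
    fragments
        .iter()
        .flat_map(|fragment| fragment.file_versions.iter())
        .any(|version| *version == SketchFileVersion::Legacy)
}

/// Names of the optimizers that would be registered for this set of fragments.
pub fn enabled_optimizers(fragments: &[SketchFragment]) -> Vec<&'static str> {
    let mut enabled = Vec::new();
    // Only enable the all-null rewrite when there are no legacy files.
    if !has_legacy_files(fragments) {
        enabled.push("sql_to_all_nulls");
    }
    enabled
}
```

A test with one modern-only dataset and one dataset that mixes in a legacy file exercises both branches of the guard, which is the kind of check that surfaced the missing `!` here.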

@github-actions github-actions bot added the python label Mar 3, 2025
@albertlockett albertlockett marked this pull request as ready for review March 3, 2025 20:15
@codecov-commenter

Codecov Report

Attention: Patch coverage is 96.33508% with 7 lines in your changes missing coverage. Please review.

Project coverage is 78.49%. Comparing base (89a33b7) to head (2454c44).
Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...ust/lance/src/dataset/schema_evolution/optimize.rs | 92.20% | 1 Missing and 5 partials ⚠️ |
| rust/lance/src/dataset/schema_evolution.rs | 99.12% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3504      +/-   ##
==========================================
+ Coverage   78.44%   78.49%   +0.04%     
==========================================
  Files         252      253       +1     
  Lines       94044    94262     +218     
  Branches    94044    94262     +218     
==========================================
+ Hits        73773    73989     +216     
+ Misses      17275    17273       -2     
- Partials     2996     3000       +4     
Flag Coverage Δ
unittests 78.49% <96.33%> (+0.04%) ⬆️


@albertlockett albertlockett merged commit dca745b into main Mar 3, 2025
30 checks passed
@albertlockett albertlockett deleted the all-null-add-col-via-sql branch March 3, 2025 20:38
Labels
enhancement New feature or request python
5 participants