feat: support add all null column as metadata-only operation via sql #3504

albertlockett · 2025-03-03T16:11:16Z

Adds support for adding all-null column via SQL.

If the user passes:

dataset.add_column(NewColumnTransform::SqlExpressions(vec!["new_col", "CAST(NULL AS int)"]));

We'll discover that the intention is to create an all-null column, and optimize the transform to:

dataset.add_column(NewColumnTransform::AllNull(Arc::new(
  Schema::new(vec![
    Field::new("new_col", DataType::Int32, true),
  ])
)));
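As a rough illustration of the rewrite described above, here is a minimal sketch that recognizes expressions of the shape `CAST(NULL AS <type>)` and swaps the SQL transform for an all-null one. Everything in it is an assumption made for illustration: `SketchType`, `SketchTransform`, `all_null_cast_type`, and `optimize` are hypothetical stand-ins, the string matching is deliberately naive, and the type-name mapping is invented. The actual PR may well inspect a parsed SQL expression rather than raw text.

```rust
/// Hypothetical stand-in for an Arrow data type (illustration only).
#[derive(Debug, Clone, PartialEq)]
enum SketchType {
    Int32,
    Int64,
    Float64,
    Utf8,
}

/// Hypothetical, simplified mirror of the transform enum discussed above.
#[derive(Debug, Clone, PartialEq)]
enum SketchTransform {
    /// Pairs of (column name, SQL expression).
    SqlExpressions(Vec<(String, String)>),
    /// Pairs of (column name, type) for columns that are entirely null.
    AllNull(Vec<(String, SketchType)>),
}

/// Returns the target type if `expr` looks like `CAST(NULL AS <type>)`.
/// This is naive text matching and the type names are assumptions.
fn all_null_cast_type(expr: &str) -> Option<SketchType> {
    let lower = expr.trim().to_ascii_lowercase();
    let inner = lower.strip_prefix("cast(")?.strip_suffix(')')?;
    let (value, ty) = inner.split_once(" as ")?;
    if value.trim() != "null" {
        return None;
    }
    match ty.trim() {
        "int" | "integer" => Some(SketchType::Int32),
        "bigint" => Some(SketchType::Int64),
        "double" => Some(SketchType::Float64),
        "string" | "varchar" => Some(SketchType::Utf8),
        _ => None,
    }
}

/// Rewrites the transform only when every expression is an all-null cast;
/// otherwise the input is returned unchanged.
fn optimize(transform: SketchTransform) -> SketchTransform {
    match transform {
        SketchTransform::SqlExpressions(exprs) => {
            let fields: Option<Vec<(String, SketchType)>> = exprs
                .iter()
                .map(|(name, expr)| all_null_cast_type(expr).map(|ty| (name.clone(), ty)))
                .collect();
            match fields {
                Some(fields) if !fields.is_empty() => SketchTransform::AllNull(fields),
                _ => SketchTransform::SqlExpressions(exprs),
            }
        }
        other => other,
    }
}
```

In this sketch a single non-matching expression keeps the whole transform as SQL, so the rewrite never changes the result for mixed inputs.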

The motivation here is to expose the ability to add an all-null column as a metadata-only operation through the LanceDB SDKs. Currently these methods only support passing SQL expressions. A different option would have been to modify the arguments to the Python table.add_column and TypeScript table.addColumn methods, but that seemed like more work, so I wanted to propose this solution first.

albertlockett · 2025-03-03T16:11:55Z

rust/lance/src/dataset/schema_evolution.rs

+
+    #[test]
+    fn test_new_column_sql_to_all_nulls_transform_optimizer() {
+        // TODO: write a test to ensure the optimizer for all null sql gets used


If we're OK with the general approach in this PR, I'm planning to fill in this test before we merge

chebbyChefNEQ

tiny nit: can we add a python test

wjones127

Seems like a good approach.

westonpace

The approach seems fine. A few thoughts but nothing concerning. Thanks for digging into this

westonpace · 2025-03-03T19:20:05Z

rust/lance/src/dataset/schema_evolution.rs

+fn has_legacy_files(fragments: &[FileFragment]) -> bool {
+    let has_legacy_files = fragments
+        .iter()
+        .map(|f| &f.metadata)
+        .flat_map(|fragment_meta| fragment_meta.files.iter())
+        .any(|file_meta| {
+            matches!(
+                LanceFileVersion::try_from_major_minor(
+                    file_meta.file_major_version,
+                    file_meta.file_minor_version
+                ),
+                Ok(LanceFileVersion::Legacy)
+            )
+        });
+
+    !has_legacy_files
+}


Might be easier to just use dataset.is_legacy_storage().

Thanks. I wasn't aware of that method

westonpace · 2025-03-03T19:24:18Z

rust/lance/src/dataset/schema_evolution/optimize.rs

+/// Optimizes a `NewColumnTransform` into
+pub(super) trait NewColumnTransformOptimizer: Send + Sync {
+    /// Optimize the passed `NewColumnTransform` to a more efficient form.
+    fn optimize(
+        &self,
+        dataset: &Dataset,
+        transform: NewColumnTransform,
+    ) -> Result<NewColumnTransform>;
+}
+
+/// A `NewColumnTransformOptimizer` that chains multiple `NewColumnTransformOptimizer`s together.
+pub(super) struct ChainedNewColumnTransformOptimizer {
+    optimizers: Vec<Box<dyn NewColumnTransformOptimizer>>,
+}
+
+impl ChainedNewColumnTransformOptimizer {
+    pub(super) fn new(optimizers: Vec<Box<dyn NewColumnTransformOptimizer>>) -> Self {
+        Self { optimizers }
+    }
+
+    pub(super) fn add_optimizer(&mut self, optimizer: Box<dyn NewColumnTransformOptimizer>) {
+        self.optimizers.push(optimizer);
+    }
+}


I'm not opposed to this pattern but it seems slightly heavier than we need for this particular fix. Do you anticipate additional optimizers?

Yeah, I recognize that. I hesitated deeply over how/where to add the actual core optimization logic.

My thinking was that the alternatives could have been:

Run the transform logic in add_columns_to_fragments, but that method is already pretty long, so it would be better if the optimization was extracted to somewhere else.

Do the optimization higher up in the call stack, but I thought there might eventually be additional context we compute just before applying the transform, so it's best to keep the optimization near where that context could be available.

Add some ad-hoc functions instead of the trait. But my thinking was that if we eventually make additional optimizations to the transform, then without some structure around where those functions get invoked, figuring out the actual state of the transformation could get messy.

I don't have concrete plans to add more optimizers, but I think it's possible we could in the future. As a hypothetical example, maybe there are additional SQL patterns that we could recognize and write as a more optimal UDF. Given that this could happen, I thought it might be worthwhile to at least try to have a bit of structure around how these optimizers are organized.
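To make the chaining idea in this thread concrete, here is a minimal self-contained sketch of how a chained optimizer could pass a transform through each registered optimizer in turn. The names `SketchTransform`, `SketchOptimizer`, `ChainedSketchOptimizer`, and `AllNullSketchOptimizer` are hypothetical stand-ins, the `dataset` argument from the real trait is omitted, and the "expression is literally NULL" rule is a toy placeholder rather than the logic in this PR.

```rust
/// Hypothetical, simplified stand-in for the transform type.
#[derive(Debug, Clone, PartialEq)]
enum SketchTransform {
    /// Pairs of (column name, SQL expression).
    SqlExpressions(Vec<(String, String)>),
    /// Names of columns that are entirely null.
    AllNull(Vec<String>),
}

/// Stand-in for the optimizer trait in the diff (no dataset argument here).
trait SketchOptimizer {
    fn optimize(&self, transform: SketchTransform) -> Result<SketchTransform, String>;
}

/// Applies each registered optimizer in order.
struct ChainedSketchOptimizer {
    optimizers: Vec<Box<dyn SketchOptimizer>>,
}

impl ChainedSketchOptimizer {
    fn new(optimizers: Vec<Box<dyn SketchOptimizer>>) -> Self {
        Self { optimizers }
    }

    fn add_optimizer(&mut self, optimizer: Box<dyn SketchOptimizer>) {
        self.optimizers.push(optimizer);
    }
}

impl SketchOptimizer for ChainedSketchOptimizer {
    fn optimize(&self, transform: SketchTransform) -> Result<SketchTransform, String> {
        // Feed the output of each optimizer into the next; stop at the first error.
        self.optimizers
            .iter()
            .try_fold(transform, |current, opt| opt.optimize(current))
    }
}

/// Toy optimizer: treats an expression that is literally `NULL` as all-null.
struct AllNullSketchOptimizer;

impl SketchOptimizer for AllNullSketchOptimizer {
    fn optimize(&self, transform: SketchTransform) -> Result<SketchTransform, String> {
        match transform {
            SketchTransform::SqlExpressions(exprs)
                if !exprs.is_empty()
                    && exprs
                        .iter()
                        .all(|(_, e)| e.trim().eq_ignore_ascii_case("NULL")) =>
            {
                let names = exprs.into_iter().map(|(name, _)| name).collect();
                Ok(SketchTransform::AllNull(names))
            }
            other => Ok(other),
        }
    }
}
```

With an empty chain the transform comes back unchanged, which is why registering optimizers conditionally (as the diff does) is a natural fit for this structure.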

westonpace · 2025-03-03T19:24:51Z

rust/lance/src/dataset/schema_evolution.rs

+
+    // Optimize the transforms
+    let mut optimizer = ChainedNewColumnTransformOptimizer::new(vec![]);
+    if has_legacy_files {


Should this be !has_legacy_files? Or am I misunderstanding the above comment?

Yeah, there should have been a ! here. Caught this after adding tests
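The condition settled on in this exchange can be sketched as follows: the optimizer is only registered when no fragment contains a legacy-format file. The types `SketchFile` and `SketchFragment` and the function `should_use_all_null_optimizer` are hypothetical stand-ins, and the `is_legacy` flag replaces the version lookup done in the real code.

```rust
/// Hypothetical stand-in for a data file's metadata.
struct SketchFile {
    /// Assumption: the real code derives this from the file's major/minor version.
    is_legacy: bool,
}

/// Hypothetical stand-in for a fragment holding several data files.
struct SketchFragment {
    files: Vec<SketchFile>,
}

/// True if any file in any fragment uses the legacy format.
fn has_legacy_files(fragments: &[SketchFragment]) -> bool {
    fragments
        .iter()
        .flat_map(|fragment| fragment.files.iter())
        .any(|file| file.is_legacy)
}

/// The gate discussed above: enable the all-null optimizer only when
/// there are no legacy files (note the negation).
fn should_use_all_null_optimizer(fragments: &[SketchFragment]) -> bool {
    !has_legacy_files(fragments)
}
```

Keeping the positive check (`has_legacy_files`) and the negated gate as separate, accurately named functions makes a missing `!` easier to spot in review.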

codecov-commenter · 2025-03-03T20:34:18Z

Codecov Report

Attention: Patch coverage is 96.33508% with 7 lines in your changes missing coverage. Please review.

Project coverage is 78.49%. Comparing base (89a33b7) to head (2454c44).
Report is 2 commits behind head on main.

Files with missing lines	Patch %	Lines
...ust/lance/src/dataset/schema_evolution/optimize.rs	92.20%	1 Missing and 5 partials ⚠️
rust/lance/src/dataset/schema_evolution.rs	99.12%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3504      +/-   ##
==========================================
+ Coverage   78.44%   78.49%   +0.04%     
==========================================
  Files         252      253       +1     
  Lines       94044    94262     +218     
  Branches    94044    94262     +218     
==========================================
+ Hits        73773    73989     +216     
+ Misses      17275    17273       -2     
- Partials     2996     3000       +4

Flag	Coverage Δ
unittests	`78.49% <96.33%> (+0.04%)`	⬆️


feat: support add all null column as metadata-only operation via sql

88443f2

albertlockett requested review from wjones127 and chebbyChefNEQ March 3, 2025 16:11

albertlockett marked this pull request as draft March 3, 2025 16:11

github-actions bot added the enhancement New feature or request label Mar 3, 2025

albertlockett commented Mar 3, 2025

chebbyChefNEQ approved these changes Mar 3, 2025

wjones127 reviewed Mar 3, 2025

westonpace reviewed Mar 3, 2025

albertlockett added 2 commits March 3, 2025 14:26

added tests for using the new optimizer

0457b7f

added test using python sdk

ee24837

github-actions bot added the python label Mar 3, 2025

albertlockett added 2 commits March 3, 2025 14:50

added license header

4c05f58

some PR feedback

2454c44

albertlockett marked this pull request as ready for review March 3, 2025 20:15

albertlockett merged commit dca745b into main Mar 3, 2025
30 checks passed

albertlockett deleted the all-null-add-col-via-sql branch March 3, 2025 20:38
