feat: merge-insert supports inserting subset of columns #3100

Merged (4 commits) on Dec 18, 2024

@wjones127 wjones127 commented Nov 6, 2024

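In #2639 we added support for updating a subset of columns. In #3041 we added support for inserting a subset of columns. This PR adds support for upserting with a subset of columns (or doing insert-if-not-exists). Columns missing from the new data keep their existing values on matched rows and are filled with nulls on inserted rows.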

Closes #2904

Example

import pyarrow as pa
import lance

table = pa.table({
    "id": range(3),
    "a": [1.0, 2.0, 3.0],
    "c": ["x", "x", "x"]
})
dataset = lance.write_dataset(table, "example")

# Upsert: when_matched_update_all + when_not_matched_insert_all
new_data = pa.table({
    "id": [2, 3],
    "c": ["y", "y"]
})
(
    dataset
    .merge_insert(on="id")
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute(new_data)
)
dataset.to_table().to_pandas()

   id    a  c
0   0  1.0  x
1   1  2.0  x
2   2  3.0  y
3   3  NaN  y

# Insert-if-not-exists: when_not_matched_insert_all
new_data = pa.table({
    "id": [3, 4],
    "c": ["z", "z"]
})
(
    dataset
    .merge_insert(on="id")
    .when_not_matched_insert_all()
    .execute(new_data)
)
dataset.to_table().to_pandas()

   id    a  c
0   0  1.0  x
1   1  2.0  x
2   2  3.0  y
3   3  NaN  y
4   4  NaN  z
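
The semantics shown in the two tables above can be sketched with a toy model in plain Python. This is not Lance's implementation: `merge_insert` here is a hypothetical stand-in that works on lists of dicts, assumes the existing rows are non-empty and that keys in the new data are unique, and uses `None` where pandas prints `NaN`.

```python
def merge_insert(rows, new_rows, on, *, update_matched=False, insert_not_matched=False):
    """Toy model of merge-insert where new_rows carry only a subset of columns."""
    columns = list(rows[0].keys())  # assumption: existing data is non-empty
    index = {row[on]: i for i, row in enumerate(rows)}
    result = [dict(row) for row in rows]
    for new in new_rows:
        key = new[on]
        if key in index:
            if update_matched:
                # Only the supplied columns change; the others keep their values.
                result[index[key]].update(new)
        elif insert_not_matched:
            # Columns missing from the new data are filled with None (null).
            result.append({col: new.get(col) for col in columns})
    return result


rows = [
    {"id": 0, "a": 1.0, "c": "x"},
    {"id": 1, "a": 2.0, "c": "x"},
    {"id": 2, "a": 3.0, "c": "x"},
]

# Upsert: update matched rows and insert unmatched ones.
upserted = merge_insert(
    rows,
    [{"id": 2, "c": "y"}, {"id": 3, "c": "y"}],
    "id",
    update_matched=True,
    insert_not_matched=True,
)

# Insert-if-not-exists: matched rows are left untouched.
final = merge_insert(
    upserted,
    [{"id": 3, "c": "z"}, {"id": 4, "c": "z"}],
    "id",
    insert_not_matched=True,
)
```

In this model, row `id=3` keeps `c="y"` after the second call because it already exists, while `id=4` is inserted with a null `a`, which mirrors the second table.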

@github-actions github-actions bot added the python label Nov 6, 2024

github-actions bot commented Nov 6, 2024

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.
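
As a rough illustration of why the original title was rejected, a title check along these lines can be sketched with a regular expression. This is not the actual "PR Title Check" action: the list of allowed types is an assumption (the common Angular-style set), and the pattern is a simplification of the Conventional Commits header format.

```python
import re

# Assumed type list; the specification itself only requires "feat" and "fix",
# and the real check may allow a different set.
TYPES = ("feat", "fix", "docs", "style", "refactor", "perf",
         "test", "build", "ci", "chore", "revert")

# <type>[(scope)][!]: <description>
TITLE_RE = re.compile(
    r"^(?P<type>" + "|".join(TYPES) + r")"
    r"(\((?P<scope>[^()\s]+)\))?"
    r"(?P<breaking>!)?"
    r": (?P<description>\S.*)$"
)


def is_conventional(title: str) -> bool:
    """Return True if the title looks like a Conventional Commits header."""
    return TITLE_RE.match(title) is not None


old_ok = is_conventional("Feat/merge insert subschema")
new_ok = is_conventional("feat: merge-insert supports inserting subset of columns")
```

Under these assumptions the branch-derived title fails (capitalized type, no `: ` separator), while the updated title passes.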

@wjones127 wjones127 changed the title from "Feat/merge insert subschema" to "feat: merge-insert supports inserting subset of columns" Nov 6, 2024
@github-actions github-actions bot added the enhancement label ("New feature or request") Nov 6, 2024
@wjones127 wjones127 force-pushed the feat/merge-insert-subschema branch from 50bcf3a to e7a79cf December 17, 2024 21:53
@codecov-commenter

Codecov Report

Attention: Patch coverage is 83.83838% with 16 lines in your changes missing coverage. Please review.

Project coverage is 78.91%. Comparing base (f2906cf) to head (e7a79cf).
Report is 2 commits behind head on main.

Files with missing lines                     | Patch % | Lines
rust/lance/src/dataset/write/merge_insert.rs | 83.83%  | 9 Missing and 7 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3100      +/-   ##
==========================================
- Coverage   78.97%   78.91%   -0.06%     
==========================================
  Files         246      246              
  Lines       86313    86560     +247     
  Branches    86313    86560     +247     
==========================================
+ Hits        68162    68311     +149     
- Misses      15328    15421      +93     
- Partials     2823     2828       +5     
Flag      | Coverage Δ
unittests | 78.91% <83.83%> (-0.06%) ⬇️


@wjones127 wjones127 marked this pull request as ready for review December 17, 2024 22:32

@westonpace westonpace left a comment

Nice work!

) -> Result<usize> {
// Batches still have _rowaddr (used elsewhere to merge with existing data)
// We need to remove it before writing to Lance files.
let num_fields = batches[0].schema().fields().len();

Not needed for this PR, but we should add a drop-columns method to the schema.
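
To make the suggestion concrete, here is a minimal sketch of what such a helper could look like, written in Python for brevity. Everything in it is hypothetical: `Field`, `Schema`, and `drop_columns` are toy stand-ins rather than Lance or Arrow types, and the type strings are purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    type: str  # illustrative type label only


@dataclass(frozen=True)
class Schema:
    fields: tuple

    def drop_columns(self, names):
        """Return a new Schema without the named fields (hypothetical helper)."""
        names = set(names)
        missing = names - {f.name for f in self.fields}
        if missing:
            # Fail loudly rather than silently ignoring unknown columns.
            raise KeyError(f"columns not in schema: {sorted(missing)}")
        return Schema(tuple(f for f in self.fields if f.name not in names))


# Batches carry an internal _rowaddr column that must not reach the file writer.
schema = Schema((
    Field("id", "int64"),
    Field("c", "string"),
    Field("_rowaddr", "uint64"),
))
write_schema = schema.drop_columns(["_rowaddr"])
```

Dropping by name, rather than by counting fields and trimming the last one, does not depend on where `_rowaddr` sits in the batch.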

@wjones127 wjones127 merged commit d038e34 into lancedb:main Dec 18, 2024
24 of 26 checks passed
@wjones127 wjones127 deleted the feat/merge-insert-subschema branch December 18, 2024 00:16
Labels: enhancement (New feature or request), python
Linked issue (may be closed by this pull request): Support inserting in merge_insert with a subset of columns
3 participants