fix: merge_insert with subcols sometimes outputs unexpected nulls #3407

Merged: 2 commits into lancedb:main on Jan 22, 2025

@wjones127 wjones127 commented Jan 22, 2025

Fixes #3406

At the root of this is a bit of a footgun with DataFusion. Prior to this change, the query plan for getting data that was supposed to be sorted by _rowaddr was:

ProjectionExec: expr=[id@0 as id, vector@1 as vector, _rowaddr@2 as _rowaddr, _rowaddr@2 >> 32 as _fragment_id], schema=[id:Int64;N, vector:FixedSizeList(Field { name: "item", data_type: Float32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 32);N, _rowaddr:UInt64;N, _fragment_id:UInt64;N]
  RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1, schema=[id:Int64;N, vector:FixedSizeList(Field { name: "item", data_type: Float32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 32);N, _rowaddr:UInt64;N]
    SortExec: expr=[_rowaddr@2 ASC], preserve_partitioning=[false], schema=[id:Int64;N, vector:FixedSizeList(Field { name: "item", data_type: Float32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 32);N, _rowaddr:UInt64;N]
      StreamingTableExec: partition_sizes=1, projection=[id, vector, _rowaddr], schema=[id:Int64;N, vector:FixedSizeList(Field { name: "item", data_type: Float32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 32);N, _rowaddr:UInt64;N]

Note the RepartitionExec after the SortExec. Because the round-robin repartitioning runs on the already-sorted output and splits it across 8 partitions, the final order was non-deterministic.
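To make the failure mode concrete, here is a minimal std-only sketch of why repartitioning after a sort loses the order. It does not use DataFusion; `round_robin`, `drain_in_partition_order`, and `is_ascending` are hypothetical helpers, and reading the partitions one after another is only one of many orders a real multi-threaded plan could produce.

```rust
/// Distribute batches across `n` partitions in round-robin order.
/// Hypothetical stand-in for what RoundRobinBatch(n) does to a batch stream.
fn round_robin(batches: Vec<Vec<u64>>, n: usize) -> Vec<Vec<Vec<u64>>> {
    let mut parts: Vec<Vec<Vec<u64>>> = vec![Vec::new(); n];
    for (i, batch) in batches.into_iter().enumerate() {
        parts[i % n].push(batch);
    }
    parts
}

/// Read the partitions one after another. This is just one possible
/// consumption order; a real executor may interleave them differently.
fn drain_in_partition_order(parts: &[Vec<Vec<u64>>]) -> Vec<u64> {
    parts.iter().flatten().flatten().copied().collect()
}

/// True if the values are in non-decreasing order.
fn is_ascending(values: &[u64]) -> bool {
    values.windows(2).all(|w| w[0] <= w[1])
}
```

Feeding it four globally sorted batches of row addresses, `[0, 1]`, `[2, 3]`, `[4, 5]`, `[6, 7]`, with `n = 2` puts the first and third batch in one partition and the second and fourth in the other, so reading the partitions back no longer yields ascending `_rowaddr` values.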

After these changes, the plan is:

SortPreservingMergeExec: [_rowaddr@2 ASC], schema=[id:Int64;N, vector:FixedSizeList(Field { name: "item", data_type: Float32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 32);N, _rowaddr:UInt64;N, _fragment_id:UInt64;N]
  SortExec: expr=[_rowaddr@2 ASC], preserve_partitioning=[true], schema=[id:Int64;N, vector:FixedSizeList(Field { name: "item", data_type: Float32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 32);N, _rowaddr:UInt64;N, _fragment_id:UInt64;N]
    ProjectionExec: expr=[id@0 as id, vector@1 as vector, _rowaddr@2 as _rowaddr, _rowaddr@2 >> 32 as _fragment_id], schema=[id:Int64;N, vector:FixedSizeList(Field { name: "item", data_type: Float32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 32);N, _rowaddr:UInt64;N, _fragment_id:UInt64;N]
      RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1, schema=[id:Int64;N, vector:FixedSizeList(Field { name: "item", data_type: Float32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 32);N, _rowaddr:UInt64;N]
        StreamingTableExec: partition_sizes=1, projection=[id, vector, _rowaddr], schema=[id:Int64;N, vector:FixedSizeList(Field { name: "item", data_type: Float32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 32);N, _rowaddr:UInt64;N]

This plan sorts each partition and then combines them with SortPreservingMergeExec, which does provide a deterministic order.
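The shape of the fixed plan (sort within each partition, then merge while preserving the sort) can be sketched with a k-way merge over a min-heap. This is an illustration under stated assumptions, not DataFusion's implementation: `sort_each_partition` and `sort_preserving_merge` are hypothetical names, and each partition is modeled as a plain `Vec<u64>` of `_rowaddr` values.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Sort every partition independently (analogous to SortExec with
/// preserve_partitioning=[true]).
fn sort_each_partition(parts: &mut [Vec<u64>]) {
    for part in parts.iter_mut() {
        part.sort_unstable();
    }
}

/// Merge already-sorted partitions into one globally sorted output.
/// The heap holds (value, partition index, offset within partition);
/// `Reverse` turns the max-heap into a min-heap.
fn sort_preserving_merge(parts: &[Vec<u64>]) -> Vec<u64> {
    let mut heap = BinaryHeap::new();
    for (p, part) in parts.iter().enumerate() {
        if let Some(&first) = part.first() {
            heap.push(Reverse((first, p, 0usize)));
        }
    }

    let mut out = Vec::with_capacity(parts.iter().map(Vec::len).sum());
    while let Some(Reverse((value, p, i))) = heap.pop() {
        out.push(value);
        // Refill the heap from the partition the value came from.
        if let Some(&next) = parts[p].get(i + 1) {
            heap.push(Reverse((next, p, i + 1)));
        }
    }
    out
}
```

Because the merge always emits the smallest head among the partitions, the result does not depend on how the rows were distributed across partitions, only on their values.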

@github-actions github-actions bot added the bug (Something isn't working) label Jan 22, 2025

codecov-commenter commented Jan 22, 2025

Codecov Report

Attention: Patch coverage is 0% with 1 line in your changes missing coverage. Please review.

Project coverage is 78.69%. Comparing base (7f60aa0) to head (6c5660b).

Files with missing lines | Patch % | Lines
rust/lance/src/dataset/write/merge_insert.rs | 0.00% | 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3407      +/-   ##
==========================================
- Coverage   78.72%   78.69%   -0.03%     
==========================================
  Files         250      250              
  Lines       90879    90879              
  Branches    90879    90879              
==========================================
- Hits        71540    71517      -23     
- Misses      16397    16417      +20     
- Partials     2942     2945       +3     
Flag | Coverage Δ
unittests | 78.69% <0.00%> (-0.03%) ⬇️


@wjones127 wjones127 marked this pull request as ready for review January 22, 2025 22:21

@westonpace westonpace left a comment

Oh man, good find!

@wjones127 wjones127 merged commit 3cb54c6 into lancedb:main Jan 22, 2025
28 of 30 checks passed
@wjones127 wjones127 deleted the fix/merge-insert branch January 22, 2025 23:49
Labels
bug (Something isn't working), python
Development

Successfully merging this pull request may close these issues.

Merge-insert error: Got updated row address that is not in the original batch
3 participants