fix: ensure the v2 file writer does not mix up the order of a column #2836

Merged
merged 3 commits into from
Sep 6, 2024


westonpace
Contributor

Sometimes the v2 writer splits a batch into multiple pages. When it does, it encodes those pages in parallel. Those encoding tasks can finish out of order, and the writer was then writing the pages out of order. This meant the order of one column could get out of sync with the order of another column.
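
The failure mode described above can be reproduced in miniature. The sketch below is not the Lance writer; it is a Python `asyncio` stand-in in which `encode_page`, the delays, and the two `write_*` helpers are all hypothetical names invented for illustration. It assumes later pages happen to finish encoding first, and contrasts collecting results in completion order with collecting them in submission order.

```python
import asyncio


async def encode_page(page_idx: int, delay: float) -> str:
    # Hypothetical stand-in for an encoding task; the delay simulates work.
    await asyncio.sleep(delay)
    return f"page-{page_idx}"


async def write_in_completion_order(delays):
    # Buggy pattern: pages are "written" as soon as each task finishes.
    tasks = [asyncio.create_task(encode_page(i, d)) for i, d in enumerate(delays)]
    written = []
    for finished in asyncio.as_completed(tasks):
        written.append(await finished)
    return written


async def write_in_submission_order(delays):
    # Fixed pattern: tasks still run concurrently, but gather() returns
    # results in the order the tasks were passed in.
    tasks = [encode_page(i, d) for i, d in enumerate(delays)]
    return list(await asyncio.gather(*tasks))


# Assumption for the demo: page 0 is the slowest to encode, page 2 the fastest.
DELAYS = [0.2, 0.1, 0.0]

buggy = asyncio.run(write_in_completion_order(DELAYS))
fixed = asyncio.run(write_in_submission_order(DELAYS))

print(buggy)
print(fixed)
```

If a second column's pages happened to finish in a different order, its rows would no longer line up with this column's, which is the kind of mismatch the PR description reports.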

@github-actions github-actions bot added bug Something isn't working python labels Sep 5, 2024
@westonpace westonpace requested a review from wjones127 September 5, 2024 18:17
@codecov-commenter

codecov-commenter commented Sep 5, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.05%. Comparing base (f30a679) to head (60cfb22).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2836   +/-   ##
=======================================
  Coverage   78.04%   78.05%           
=======================================
  Files         229      229           
  Lines       70299    70299           
  Branches    70299    70299           
=======================================
+ Hits        54866    54871    +5     
+ Misses      12339    12332    -7     
- Partials     3094     3096    +2     
Flag Coverage Δ
unittests 78.05% <100.00%> (+<0.01%) ⬆️


@broccoliSpicy
Contributor

broccoliSpicy commented Sep 5, 2024

Is there any reason not to add a page_idx field to EncodedPage and then sort on it in write_pages?

One benefit could be that we can still write pages in different columns in parallel.

@westonpace
Contributor Author

@broccoliSpicy that's pretty much what FuturesOrdered is doing. It will still run all the encode tasks in parallel; it just puts the results back in order when it collects them, before the write to disk.
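
To make the comparison in this exchange concrete, here is a rough Python sketch of both options using a thread pool. It is an illustration under assumptions, not the Lance code: the `EncodedPage` dataclass below is a hypothetical stand-in with an invented `page_idx` field, and reading futures in submission order is used as an analogue of what `FuturesOrdered` does in Rust.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass


@dataclass
class EncodedPage:
    page_idx: int  # hypothetical field, as proposed in the review comment
    data: bytes


def encode(page_idx: int, delay: float) -> EncodedPage:
    # Hypothetical encoder; the delay simulates uneven encoding cost.
    time.sleep(delay)
    return EncodedPage(page_idx, f"page-{page_idx}".encode())


# Assumption for the demo: page 0 is the slowest, page 2 the fastest.
DELAYS = [0.2, 0.1, 0.0]

with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(encode, i, d) for i, d in enumerate(DELAYS)]

    # Option A: collect in completion order, then sort on page_idx.
    sorted_pages = sorted(
        (f.result() for f in as_completed(futures)),
        key=lambda page: page.page_idx,
    )

    # Option B: read the futures in submission order; encoding still
    # overlaps, and no index field or explicit sort is needed.
    ordered_pages = [f.result() for f in futures]

print([p.page_idx for p in sorted_pages])
print([p.page_idx for p in ordered_pages])
```

Both options end with the same page order. One difference in this sketch is that option A has to wait for every page before sorting, while option B can hand page 0 to the writer as soon as it is ready.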

@westonpace westonpace merged commit ef0953d into lancedb:main Sep 6, 2024
20 of 22 checks passed
Labels
bug Something isn't working python
6 participants