-
Notifications
You must be signed in to change notification settings - Fork 265
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: add versioning and bypass broken row counts #1534
Conversation
ACTION NEEDED Lance follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error please inspect the "PR Title Check" action. |
3197e78
to
2c7d560
Compare
manifest.fragments = Arc::new(migrate_fragments(dataset, dataset.fragments()).await?); | ||
manifest.fragments = | ||
Arc::new(migrate_fragments(dataset, &manifest.fragments, recompute_stats).await?); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found another critical bug here 😞
We migrate the old fragments not the new, so this basically is rolling back the transactions changes. This means data loss in most cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we have some tests to cover this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Working on adding tests now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests have been added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused right now. This seems like it would have caused new fragments to be ignored, but my repro in #1531 does not seem to cause this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Figured it out, the data loss bug was added in a later version.
Validated this manually building off of repro in README: Repro setupInstall pylance 0.7.5 python -m venv .venv
source .venv/bin/activate
pip install pylance==0.7.5
python Write a table, and delete some rows import lance
import pyarrow as pa
tab = pa.table({'x': range(100)})
dataset = lance.write_dataset(tab, 'test')
dataset.delete("x >= 10 and x < 20")
dataset.count_rows()
Now, install pylance 0.8.0: pip install pylance==0.8.0
python Count rows at first is correct. However, once we write to the table, it becomes incorrect: import lance
import pyarrow as pa
dataset = lance.dataset('test')
dataset.count_rows() # 90
tab = pa.table({'x': range(10)})
dataset = lance.write_dataset(tab, 'test', mode='append')
dataset.count_rows() # 90
dataset.to_table().num_rows # 100 Steps in this fixUsing feature branch: import lance
dataset = lance.dataset("test")
dataset.count_rows()
import pyarrow as pa
tab = pa.table({'x': range(2)})
dataset = lance.write_dataset(tab, dataset.uri, mode="append")
dataset.count_rows()
Validating in old versionBack in pylance 0.8.0: >>> import lance
>>> dataset = lance.dataset('test')
>>> dataset.count_rows()
|
49d0087
to
febe576
Compare
python/python/lance/fragment.py
Outdated
@@ -347,7 +347,7 @@ def delete(self, predicate: str) -> FragmentMetadata | None: | |||
>>> dataset = lance.write_dataset(tab, "dataset") | |||
>>> frag = dataset.get_fragment(0) | |||
>>> frag.delete("a > 1") | |||
Fragment { id: 0, files: ..., deletion_file: Some(...), physical_rows: 3 } | |||
Fragment { id: 0, files: ..., deletion_file: Some(...), physical_rows: Some(3) } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in what condition that physical_rows = None
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the dataset was written prior to 0.8.0, this field is not filled in (because the field didn't exist then).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW that change is more relevant to #1529, which this PR is based on. So worth reviewing that one first.
e512103
to
935f8b6
Compare
935f8b6
to
edeee0f
Compare
83395f8
to
725c8a3
Compare
Adds a new feature:
WriterVersion
in the manifest.Also fixes two bugs:
physical_rows
when old version is detected.Will update the value on write as needed.
migrate_fragments
runs, we do it on the new fragments and not the old,fixing a data loss issue.
Fixes #1531
Fixes #1535