fix: add versioning and bypass broken row counts #1534

Merged
wjones127 merged 4 commits into main from wjones127/version-checking on Nov 7, 2023

Conversation

Contributor

@wjones127 wjones127 commented Nov 6, 2023

Adds a new feature: WriterVersion in the manifest.

Also fixes two bugs:

Fixes #1531
Fixes #1535
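
For readers following along, here is a minimal sketch of how a writer version recorded in the manifest could gate a stats recomputation. It is not the PR's Rust implementation: `parse_version`, `should_recompute_stats`, and the `first_trusted` cutoff are hypothetical names, and the cutoff passed in the usage lines is only a placeholder.

```python
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse a plain dotted version such as '0.8.0' into a tuple of ints.

    Pre-release suffixes are not handled; this is a sketch.
    """
    return tuple(int(part) for part in version.split("."))


def should_recompute_stats(
    writer_version: Optional[str], first_trusted: Tuple[int, ...]
) -> bool:
    """Decide whether stored row counts should be recomputed.

    A missing writer version means the manifest predates versioning,
    so its stats are treated as untrusted (an assumption of this sketch).
    """
    if writer_version is None:
        return True
    return parse_version(writer_version) < first_trusted


# Placeholder cutoff, chosen only for illustration.
cutoff = (0, 8, 1)
print(should_recompute_stats(None, cutoff))
print(should_recompute_stats("0.8.0", cutoff))
print(should_recompute_stats("0.10.0", cutoff))
```

Comparing tuples of ints rather than raw strings keeps "0.10.0" ordered after "0.8.1".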

github-actions bot commented Nov 6, 2023

ACTION NEEDED

Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@wjones127 wjones127 force-pushed the wjones127/version-checking branch from 3197e78 to 2c7d560 Compare November 7, 2023 00:01
Comment on lines -170 to +176
- manifest.fragments = Arc::new(migrate_fragments(dataset, dataset.fragments()).await?);
+ manifest.fragments =
+     Arc::new(migrate_fragments(dataset, &manifest.fragments, recompute_stats).await?);
Contributor Author

I found another critical bug here 😞

We migrate the old fragments, not the new ones, so this effectively rolls back the transaction's changes. This means data loss in most cases.
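
To make the failure mode concrete, here is a small Python model of the bug described above. It is only a sketch: `Fragment`, `migrate_fragments`, `commit_buggy`, and `commit_fixed` are hypothetical stand-ins for the Rust code in the diff, and `row_counts` is an invented lookup used to fill in missing stats.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Fragment:
    id: int
    physical_rows: Optional[int] = None


def migrate_fragments(
    fragments: List[Fragment], row_counts: Dict[int, int]
) -> List[Fragment]:
    """Fill in physical_rows for fragments that lack it."""
    return [
        f if f.physical_rows is not None
        else replace(f, physical_rows=row_counts[f.id])
        for f in fragments
    ]


def commit_buggy(old: List[Fragment], new: List[Fragment],
                 row_counts: Dict[int, int]) -> List[Fragment]:
    # Bug: migrates the fragments of the dataset as it was *before* the
    # transaction, so the transaction's fragments never reach the manifest.
    return migrate_fragments(old, row_counts)


def commit_fixed(old: List[Fragment], new: List[Fragment],
                 row_counts: Dict[int, int]) -> List[Fragment]:
    # Fix: migrates the fragments of the manifest being committed.
    return migrate_fragments(new, row_counts)


old = [Fragment(0)]                   # written before physical_rows existed
new = [Fragment(0), Fragment(1, 10)]  # after appending a 10-row fragment
counts = {0: 100}

print([f.id for f in commit_buggy(old, new, counts)])  # appended fragment is lost
print([f.id for f in commit_fixed(old, new, counts)])  # appended fragment is kept
```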

Contributor

Could we have some tests to cover this case?

Contributor Author

Working on adding tests now.

Contributor Author

Tests have been added.

Contributor Author

I'm confused right now. This seems like it would have caused new fragments to be ignored, but my repro in #1531 does not seem to cause this.

Contributor Author

Figured it out: the data loss bug was introduced in a later version.

@wjones127 wjones127 changed the title wip: add versioning and bypass broken row counts fix: add versioning and bypass broken row counts Nov 7, 2023
@wjones127
Contributor Author

Validated this manually, building off of the repro in the README:

Repro setup

Install pylance 0.7.5

python -m venv .venv
source .venv/bin/activate
pip install pylance==0.7.5
python

Write a table, and delete some rows

>>> import lance
>>> import pyarrow as pa
>>> tab = pa.table({'x': range(100)})
>>> dataset = lance.write_dataset(tab, 'test')
>>> dataset.delete("x >= 10 and x < 20")
>>> dataset.count_rows()
90

Now, install pylance 0.8.0:

pip install pylance==0.8.0
python

At first, count_rows() is correct. However, once we write to the table, it becomes incorrect:

import lance
import pyarrow as pa

dataset = lance.dataset('test')
dataset.count_rows()  # 90 (correct)

tab = pa.table({'x': range(10)})
dataset = lance.write_dataset(tab, 'test', mode='append')
dataset.count_rows()  # 90 (incorrect: should be 100 after appending 10 rows)
dataset.to_table().num_rows  # 100 (correct)
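
The comparison above can be wrapped in a small helper when checking several datasets. This is a sketch: `row_count_report` is a hypothetical name, and it relies only on the `count_rows()` and `to_table().num_rows` calls already used in the repro. The stand-in classes exist so the snippet runs without a Lance dataset on disk.

```python
from typing import Tuple


def row_count_report(dataset) -> Tuple[int, int, bool]:
    """Compare the metadata-based count with a full scan.

    Returns (count_rows, scanned_rows, counts_agree).
    """
    fast = dataset.count_rows()            # may rely on stored fragment stats
    scanned = dataset.to_table().num_rows  # reads the data itself
    return fast, scanned, fast == scanned


# Stand-ins that mimic the broken state from the repro (hypothetical classes).
class FakeTable:
    def __init__(self, num_rows: int) -> None:
        self.num_rows = num_rows


class FakeDataset:
    def count_rows(self) -> int:
        return 90

    def to_table(self) -> FakeTable:
        return FakeTable(100)


print(row_count_report(FakeDataset()))
```

With a real dataset, `row_count_report(lance.dataset('test'))` would run the same check; note that the full scan loads the whole table into memory.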

Steps in this fix

Using feature branch:

>>> import lance
>>> dataset = lance.dataset("test")
>>> dataset.count_rows()
100
>>> import pyarrow as pa
>>> tab = pa.table({'x': range(2)})
>>> dataset = lance.write_dataset(tab, dataset.uri, mode="append")
>>> dataset.count_rows()
102

Validating in old version

Back in pylance 0.8.0:

>>> import lance
>>> dataset = lance.dataset('test')
>>> dataset.count_rows()
102

@wjones127 wjones127 force-pushed the wjones127/version-checking branch from 49d0087 to febe576 Compare November 7, 2023 01:05
@wjones127 wjones127 marked this pull request as ready for review November 7, 2023 01:48
@@ -347,7 +347,7 @@ def delete(self, predicate: str) -> FragmentMetadata | None:
>>> dataset = lance.write_dataset(tab, "dataset")
>>> frag = dataset.get_fragment(0)
>>> frag.delete("a > 1")
- Fragment { id: 0, files: ..., deletion_file: Some(...), physical_rows: 3 }
+ Fragment { id: 0, files: ..., deletion_file: Some(...), physical_rows: Some(3) }
Contributor

Under what conditions is physical_rows None?

Contributor Author

When the dataset was written prior to 0.8.0, this field is not filled in (because the field didn't exist then).
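
As a rough illustration of what an optional `physical_rows` means for a reader, here is a sketch of a row count that falls back to scanning when the field is missing. `FragmentInfo`, `deleted_rows`, `count_rows`, and the `scan_physical_rows` callback are all hypothetical and not Lance's actual types or API.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class FragmentInfo:
    """Hypothetical stand-in for a fragment's metadata."""
    id: int
    physical_rows: Optional[int]  # None for fragments written before 0.8.0
    deleted_rows: int = 0


def count_rows(
    fragments: Iterable[FragmentInfo],
    scan_physical_rows: Callable[[FragmentInfo], int],
) -> int:
    """Sum live rows, scanning only fragments that lack stored stats."""
    total = 0
    for frag in fragments:
        physical = frag.physical_rows
        if physical is None:
            # Old fragment: the count must come from the data files.
            physical = scan_physical_rows(frag)
        total += physical - frag.deleted_rows
    return total


fragments = [
    FragmentInfo(id=0, physical_rows=None, deleted_rows=10),  # pre-0.8.0
    FragmentInfo(id=1, physical_rows=10),                     # newer write
]
print(count_rows(fragments, scan_physical_rows=lambda frag: 100))
```

The numbers mirror the repro: 100 rows with 10 deleted in the old fragment, plus a 10-row append.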

Contributor Author

BTW that change is more relevant to #1529, which this PR is based on. So worth reviewing that one first.

@wjones127 wjones127 changed the base branch from main to wjones127/physical-rows November 7, 2023 15:47
Base automatically changed from wjones127/physical-rows to main November 7, 2023 18:33
@wjones127 wjones127 force-pushed the wjones127/version-checking branch from e512103 to 935f8b6 Compare November 7, 2023 18:36
@wjones127 wjones127 force-pushed the wjones127/version-checking branch from 935f8b6 to edeee0f Compare November 7, 2023 18:46
@wjones127 wjones127 force-pushed the wjones127/version-checking branch from 83395f8 to 725c8a3 Compare November 7, 2023 19:01
@wjones127 wjones127 merged commit c901cbe into main Nov 7, 2023
@wjones127 wjones127 deleted the wjones127/version-checking branch November 7, 2023 19:54
3 participants