Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong count result for query with indexes #769

Closed
kuba-- opened this issue Mar 28, 2019 · 3 comments
Closed

Wrong count result for query with indexes #769

kuba-- opened this issue Mar 28, 2019 · 3 comments
Assignees
Labels
bug Something isn't working

Comments

@kuba--
Copy link
Contributor

kuba-- commented Mar 28, 2019

Repos:

pga list -l javascript -f json | head -n 5 | jq -r '.sivaFilenames[]' | pga get -i -o repos
pga list -l python -f json | head -n 5 | jq -r '.sivaFilenames[]' | pga get -i -o repos

Query:

SELECT count(*) FROM commits WHERE commit_author_email!='sjdflkjsdlfkj';

Count result: 4344

Query with index:

CREATE INDEX email_idx ON commits USING pilosa (commit_author_email);
SELECT count(*) FROM commits WHERE commit_author_email!='sjdflkjsdlfkj';

Wrong count result: 4301

@kuba-- kuba-- self-assigned this Mar 29, 2019
@kuba-- kuba-- transferred this issue from src-d/go-mysql-server Mar 30, 2019
@kuba-- kuba-- added the bug Something isn't working label Mar 30, 2019
@kuba--
Copy link
Contributor Author

kuba-- commented Mar 30, 2019

@erizocosmico - something is wrong with partition iterator (first thought), just for commits.
As you can see in the example, with indexes we get fewer rows.

For testing purpose, I rollbacked locally the commit files (commits, commit_files, commit_trees, commit_blobs) to the revision 9748ee33ec694003c72d50763e3180b6238bc673 (just before walk commits in history lazily 0c6afe85544c67b12fe177e02949e0d7f1edb55f).

And now I get for both cases 4301 rows!
So, I assume 4301 is the right value, but with new lazy commit iterators, we return too many rows. Btw., when I tested the parallel index, I got for both cases 4344.

Can you take a look?

@erizocosmico
Copy link
Contributor

I think this is actually a bug in the previous versions that we somehow just solved.

This is with the old gitbase version:

select count(*) from commits where repository_id = 'db600b30cdd2398d577eb827ce4fd3c1efa09bc1.siva' AND commit_hash = '5d24d5d884bbdd8e205c19e8d8c7d32bbdcf05fe';
+----------+
| COUNT(*) |
+----------+
|        1 |
+----------+

As we can see, that commit is in that repo.

But then, if we do this:

SELECT COUNT(*) FROM (SELECT commit_hash, repository_id FROM commits WHERE commit_author_email!='sjdflkjsdlfkj') t WHERE repository_id = 'db600b30cdd2398d577eb827ce4fd3c1efa09bc1.siva' AND commit_hash = '5d24d5d884bbdd8e205c19e8d8c7d32bbdcf05fe';
+----------+
| COUNT(*) |
+----------+
|        0 |
+----------+

We can see it's not.

With the new implementation it shows up in both.

So the fix is not to revert the walk commits lazily, but to actually fix whatever we're doing for indexing that comes out wrong.

@kuba--
Copy link
Contributor Author

kuba-- commented Apr 2, 2019

So the wrong count comes from indexes, the problem is in the iterator which is passed to the index driver, because actually it already iterates through the wrong number of commits.
The partition iterator which comes to the index driver. In other words we should fix the iterator which we pass to indexes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants