[Bug]: incomplete query result, missing id #34820

Closed
1 task done
prrs opened this issue Jul 18, 2024 · 13 comments
Labels: kind/bug, stale, triage/accepted

Comments

@prrs

prrs commented Jul 18, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.3.11
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): kafka
- SDK version(e.g. pymilvus v2.0.0rc2): 2.3.3 java
- OS(Ubuntu or CentOS): CentOS
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

For one of the collections, the error occurs very consistently. I am uploading the log trace.
incomplete_query.csv

Expected Behavior

It should fetch the result.

Steps To Reproduce

This is a production cluster. I am trying to get the data. One thing is certain: it's not a case of duplicate records, because over a period of time it complained about different record IDs:

ce8fbef9-eeb5-4c07-bf8f-ec79de3d6327
89c7f8ac-45b8-401e-864f-bdf9b838d198
ff99388f-e747-4c22-bdeb-26bcb6e67925

Milvus Log

incomplete_query.csv

Anything else?

Discord thread: https://discord.com/channels/1160323594396635310/1257950915269230634/1257950915269230634

prrs added the kind/bug and needs-triage labels Jul 18, 2024
@prrs
Author

prrs commented Jul 18, 2024

I got the data. I need to check whether I can upload it. I dropped the collection, re-ingested the data, and this time it works. The only difference I see is 2 segments.
working.data.csv

So it's definitely not the data. We should investigate what can cause this kind of issue. Please let me know if you need more details to debug it. We are seeing this issue here and there in our prod environment.

@yanliang567
Contributor

/assign @bigsheeper
a similar issue to #34021, please help to take a look

/unassign

yanliang567 added the triage/accepted label and removed the needs-triage label Jul 19, 2024
@prrs
Author

prrs commented Jul 19, 2024

@yanliang567 I have looked into the issue that you shared. I also had a word with @xiaofan-luan in a different forum, where he talked about a known issue in 2.4.x: "If there are two or more duplicate rows (key & value) within the same segment, this could happen."

But in this case, I looked into the data and there are no duplicate rows. I was also not able to see "592". When I looked into the code, it gets all the data in Result, but the specified key was missing when it fetched the field data for the primary key.

@bigsheeper
Contributor

Hi @prrs, you might try Milvus v2.4.6, as similar issues have been resolved in that version.

@prrs
Author

prrs commented Jul 19, 2024

@bigsheeper At this point we can't move away from 2.3.x, for the reasons below.

  1. We are live in prod. @xiaofan-luan mentioned that 2.4.x has some issues that are still a work in progress. So, once we get a confirmation from Milvus, we will verify this in our env before promoting it to prod.
  2. Along with point 1, we are making some changes in the index path and will be contributing them to open source. As 2.4.x stabilisation is taking some time, we decided to build this on top of 2.3.x. So, we are going to stick with 2.3.x for the next few months. James is aware of it.
  3. Can you point me to the fixes that went into 2.4.x? Do you think they have a common RCA?

@xiaofan-luan
Collaborator

> @bigsheeper At this point we can't move away from 2.3.x, for the reasons below.
>
>   1. We are live in prod. @xiaofan-luan mentioned that 2.4.x has some issues that are still a work in progress. So, once we get a confirmation from Milvus, we will verify this in our env before promoting it to prod.
>   2. Along with point 1, we are making some changes in the index path and will be contributing them to open source. As 2.4.x stabilisation is taking some time, we decided to build this on top of 2.3.x. So, we are going to stick with 2.3.x for the next few months. James is aware of it.
>   3. Can you point me to the fixes that went into 2.4.x? Do you think they have a common RCA?

This is actually a complicated fix, so it's hard to backport to 2.3.

The cause of this issue is that there are duplicate PK results in one segment.

We haven't found an easy way to backport it to 2.3, but maybe you can write a small tool to find duplicate PKs and fix them by deleting the old duplicated data. (Triggering compaction can work as well.)

@xiaofan-luan
Collaborator

@bigsheeper we need a tool to find duplicate PKs in the same segment
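
A minimal sketch of the kind of small tool suggested above, assuming the rows of a segment can be exported as (segment_id, pk, timestamp) tuples. That export format and the function names are hypothetical, not a Milvus API. The sketch only groups rows to flag primary keys that occur more than once within the same segment, and lists the older copies as deletion candidates.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed export layout (hypothetical): (segment_id, primary_key, timestamp)
Row = Tuple[int, str, int]
Key = Tuple[int, str]


def find_duplicate_pks(rows: Iterable[Row]) -> Dict[Key, List[int]]:
    """Map (segment_id, pk) to its sorted timestamps, for keys seen more than once."""
    seen: Dict[Key, List[int]] = defaultdict(list)
    for segment_id, pk, ts in rows:
        seen[(segment_id, pk)].append(ts)
    return {key: sorted(ts_list) for key, ts_list in seen.items() if len(ts_list) > 1}


def stale_copies(duplicates: Dict[Key, List[int]]) -> Dict[Key, List[int]]:
    """For each duplicated key, keep every timestamp except the newest one."""
    return {key: ts_list[:-1] for key, ts_list in duplicates.items()}


if __name__ == "__main__":
    sample = [(1, "a", 10), (1, "a", 20), (1, "b", 5), (2, "a", 7)]
    dups = find_duplicate_pks(sample)
    print(dups)
    print(stale_copies(dups))
```

The same PK appearing in two different segments is deliberately not flagged here, since the thread is about duplicates within one segment.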

@prrs
Author

prrs commented Jul 21, 2024

@bigsheeper @xiaofan-luan

I don't think this is happening because of duplicate PKs in 2.3.11, for the reasons below:

  • In my local environment, I tried various permutations with duplicate PKs and data, but the issue didn't reproduce. The scenarios were:
    -- Two rows with the same PK and data in the same sealed segment
    -- All rows with the same PK and data in the same sealed segment
    -- Multiple rows with the same PK and data in the same sealed segment
    -- All of the above scenarios with one growing segment

Also, the production collection where the issue happened had no duplicate rows; I verified this by looking at the source. I didn't have a way to dump the rows from Milvus, but it's a dual-write system, so we know exactly what went into Milvus. As I mentioned above, the issue was mitigated when I dropped the partition and inserted the same data.

Now, my question is: how can we debug this further to identify the issue and find a mitigation and a long-term solution on 2.3.x?
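
Because the system is dual-write, one way to narrow this down is to compare the IDs known from the source of truth against the IDs a query actually returns. A minimal sketch in plain Python; both ID lists are assumed to be already exported, and the helper names and sample IDs are hypothetical:

```python
from collections import Counter
from typing import Iterable, List


def duplicate_ids(source_ids: Iterable[str]) -> List[str]:
    """IDs that appear more than once in the source of truth."""
    counts = Counter(source_ids)
    return sorted(pk for pk, n in counts.items() if n > 1)


def missing_ids(source_ids: Iterable[str], returned_ids: Iterable[str]) -> List[str]:
    """IDs present in the source of truth but absent from the query result."""
    return sorted(set(source_ids) - set(returned_ids))


if __name__ == "__main__":
    source = ["id-1", "id-2", "id-3"]   # what the dual-write source says was inserted
    returned = ["id-1", "id-3"]         # what a query handed back
    print(duplicate_ids(source))
    print(missing_ids(source, returned))
```

An empty `duplicate_ids` result together with a non-empty `missing_ids` result would support the claim that the error is not explained by duplicate rows in the input.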

@bigsheeper
Contributor

@prrs From your description, it doesn't seem to be caused by duplicate primary keys. Is the "incomplete query result" error occasional, and how often does it occur?

@prrs
Author

prrs commented Jul 27, 2024

@bigsheeper it's occasional. Every time it has happened, it was in prod. We haven't plugged the backup tool into prod yet, so we are not able to get a dump. We are prioritising this so we can debug the issue better.

@xiaofan-luan
Collaborator

> @prrs From your description, it doesn't seem to be caused by duplicate primary keys. Is the "incomplete query result" error occasional, and how often does it occur?

There is another fix, "Restore the MVCC functionality", but it should already be in any version after 2.3.5.

We need a segment with that error to reproduce it.
Most likely there will still be duplicates.

sre-ci-robot pushed a commit that referenced this issue Aug 5, 2024
This PR cherry-picks the following PRs:

1. Return specific error codes when encountering incomplete requery results error. #31343
2. Retry on incomplete requery result in proxy. #31713

issue: #34820

pr: #31343, #31713

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
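
The second cherry-picked change retries inside the proxy. The same idea could be approximated on the client side, and the sketch below illustrates it under stated assumptions: `IncompleteResultError` is a hypothetical stand-in for whatever error the SDK surfaces, and `run_query` is any caller-supplied function. Neither is a Milvus API, and this is not the code from the referenced PRs.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class IncompleteResultError(Exception):
    """Hypothetical stand-in for an 'incomplete query result' error."""


def query_with_retry(run_query: Callable[[], T], attempts: int = 3, delay_s: float = 0.0) -> T:
    """Call run_query, retrying only on IncompleteResultError, up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[IncompleteResultError] = None
    for attempt in range(attempts):
        try:
            return run_query()
        except IncompleteResultError as err:
            last_error = err
            if attempt < attempts - 1:
                # Linear backoff between tries; no sleep after the final failure.
                time.sleep(delay_s * (attempt + 1))
    assert last_error is not None
    raise last_error
```

Other exceptions propagate immediately, so the wrapper does not mask unrelated failures.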
@bigsheeper
Contributor

bigsheeper commented Aug 23, 2024

Hi @prrs, this issue has been fixed. You can update to the latest version, v2.3.21, which includes the necessary fix. If you encounter any further issues, don't hesitate to reach out.


stale bot commented Nov 9, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale bot added the stale label Nov 9, 2024