[Bug]: datanode import error "invalid argument to Intn" #35662

Closed
1 task done
wangqia0309 opened this issue Aug 23, 2024 · 13 comments
Assignees
Labels
kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. stale indicates no udpates for 30 days

Comments

@wangqia0309

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:2.4.9
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):    s3
- SDK version(e.g. pymilvus v2.0.0rc2):2.4.x
- OS(Ubuntu or CentOS): ubuntu
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

When the datanode imports a parquet file from S3, the error below occurs:
[image: 27312d948d281cb120f523bdbca79bb]
We looked at the code and found that the cause is that the length of candidates is 0, but it is difficult for me to investigate further. The relevant function is:
internal/datanode/importv2/util.go#PickSegment

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

{"log":"[2024/08/23 02:16:05.409 +00:00] [ERROR] [conc/options.go:54] ["Conc pool panicked"] [panic="invalid argument to Intn"] [stack="github.com/milvus-io/milvus/pkg/util/conc.(*poolOption).antsOptions.func1\n\t/workspace/source/pkg/util/conc/options.go:54\ngithub.com/panjf2000/ants/v2.(*goWorker).run.func1.1\n\t/go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.2/worker.go:54\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:914\ngithub.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1.1\n\t/workspace/source/pkg/util/conc/pool.go:74\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:914\nmath/rand.(*Rand).Intn\n\t/usr/local/go/src/math/rand/rand.go:180\ngithub.com/milvus-io/milvus/internal/datanode/importv2.PickSegment\n\t/workspace/source/internal/datanode/importv2/util.go:121\ngithub.com/milvus-io/milvus/internal/datanode/importv2.(*ImportTask).sync\n\t/workspace/source/internal/datanode/importv2/task_import.go:218\ngithub.com/milvus-io/milvus/internal/datanode/importv2.(*ImportTask).importFile\n\t/workspace/source/internal/datanode/importv2/task_import.go:185\ngithub.com/milvus-io/milvus/internal/datanode/importv2.(*ImportTask).Execute.func1\n\t/workspace/source/internal/datanode/importv2/task_import.go:142\ngithub.com/milvus-io/milvus/internal/datanode/importv2.(*ImportTask).Execute.func2\n\t/workspace/source/internal/datanode/importv2/task_import.go:157\ngithub.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1\n\t/workspace/source/pkg/util/conc/pool.go:81\ngithub.com/panjf2000/ants/v2.(*goWorker).run.func1\n\t/go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.2/worker.go:67"]\n","stream":"stdout","time":"2024-08-23T02:16:05.409212576Z"}
{"log":"panic: invalid argument to Intn [recovered]\n","stream":"stderr","time":"2024-08-23T02:16:05.412405212Z"}
{"log":"\u0009panic: invalid argument to Intn [recovered]\n","stream":"stderr","time":"2024-08-23T02:16:05.412423674Z"}
{"log":"\u0009panic: invalid argument to Intn\n","stream":"stderr","time":"2024-08-23T02:16:05.412437793Z"}

Anything else?

No response

@wangqia0309 wangqia0309 added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 23, 2024
@bigsheeper
Contributor

/assign

@bigsheeper
Contributor

/unassign @yanliang567

@bigsheeper
Contributor

bigsheeper commented Aug 23, 2024

Hello @wangqia0309, could you please provide all datanode and datacoord logs? The logs will give us more context and help us identify the root cause more quickly. Also, how many shards does the collection have, and does the collection use a partition key?

@wangqia0309
Author

logs.zip
The logs are very large, so I compressed them into a zip file.
We did not set the number of shards, and we did not set a partition key.

@bigsheeper
Contributor

@wangqia0309 The issue seems to be related to this parquet file milvus_3b_dense/glm_3b_msa_embeds/new_online_0820/396_002145_000000.parquet. One workaround to try is to avoid importing this file for now.
Could you help by grepping this log in mixcoord?

cat mixcoord.log.log| grep 396_002145_000000.parquet | grep fileStats

@wangqia0309
Author

I only found log entries like this one:
[INFO] [datacoord/services.go:1743] ["GetImportProgress done"] [jobID=452016732656013460] [resp="status:\u003c\u003e state:Importing progress:100 collection_name:\"milvus_3b_dense_test_hash\" task_progresses:\u003cfile_name:\"[milvus_3b_dense/glm_3b_msa_embeds/new_online_0820/396_002145_000000.parquet]\" file_size:472309868 progress:100 state:\"InProgress\" \u003e start_time:\"2024-08-22T10:26:13Z\" "]\n","stream":"stdout","time":"2024-08-23T05:02:22.261878478Z"}
However, no log entry contains fileStats.

@bigsheeper
Contributor

@wangqia0309 I understand that the relevant logs may have been overwritten or lost. Could you please provide the Parquet file for us? Having the data will allow us to attempt to reproduce the problem locally and investigate further.

@bigsheeper
Contributor

Submitted a PR to prevent the panic. I'll continue investigating the root cause.

@bigsheeper
Contributor

Hi @wangqia0309 ,
Could you please clarify if there was concurrent writing to the Parquet file during the import process?
During the Milvus import process, the file is read twice. From what we observe, the file had 0 rows during the first read but more than 0 rows during the second read. This discrepancy might indicate concurrent writes.
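
To make the observation concrete, here is a minimal sketch of a consistency check between two reads of the same file. The function name `checkRowCounts`, the file name, and the row counts are all hypothetical and chosen for illustration; this is not how Milvus implements its import passes.

```go
package main

import "fmt"

// checkRowCounts is a hypothetical helper that compares the row counts
// observed in two separate reads of the same file and reports a mismatch.
func checkRowCounts(file string, firstRead, secondRead int64) error {
	if firstRead != secondRead {
		return fmt.Errorf("file %s: row count changed between reads (%d -> %d)",
			file, firstRead, secondRead)
	}
	return nil
}

func main() {
	// Illustrative values only: 0 rows on the first read, 1000 on the second.
	if err := checkRowCounts("example.parquet", 0, 1000); err != nil {
		fmt.Println(err)
	}
}
```

A mismatch like this can come from concurrent writes, but it can equally come from the reader itself reporting different counts, so the check only detects the discrepancy and does not explain it.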

@milinxiaobo
No. We write the parquet files to S3 first, and we only start the Milvus import process after all parquet files have been written.

sre-ci-robot pushed a commit that referenced this issue Aug 27, 2024
issue: #35662

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
sre-ci-robot pushed a commit that referenced this issue Aug 27, 2024
issue: #35662

pr: #35673

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
@bigsheeper
Contributor

bigsheeper commented Aug 28, 2024

Root cause: apache/arrow#43860

sre-ci-robot pushed a commit that referenced this issue Aug 29, 2024
issue: #35662

pr: #35819

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
sre-ci-robot pushed a commit that referenced this issue Aug 29, 2024
issue: #35662

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>

stale bot commented Sep 29, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

@stale stale bot added the stale indicates no udpates for 30 days label Sep 29, 2024
@bigsheeper
Contributor

This should be fixed now.
