
Complete multipart upload fails with timeout #843

Closed · evgeniiz321 opened this issue Oct 11, 2023 · 8 comments · Fixed by #931

Labels: bug Something isn't working · I3 Minimal impact · S4 Routine · U3 Regular
@evgeniiz321

test_multipart_upload at s3tests_boto3/functional/test_s3_neofs.py

We create a multipart upload, upload parts, then try to complete it and fail on timeout:

2023-10-11T00:11:51.416Z	debug	layer/multipart_upload.go:629	part details	{"reqId": "6046e93a-4bc4-4401-a4ec-7bb92968286a", "bucket": "yournamehere-cl96tp4ul4bvyi6i-2", "cid": "8E8VhzySor5D5KWSNTpwiSKuu6afVvDmVzJffGmnDf1g", "object": "mymultipart", "upload id": "08bfda57-93e3-4898-b2fc-9df9384d12e0", "part numbers": [3, 2, 4, 5, 1, 6], "oids": ["EfgAgV5k82WaagxVRSmzhEJVoujq6eejm6ofHvd8iBkE", "HerSXJEZzBGZDqMk6Esj8xahWsTMzZfbQPxxnvqCSrhn", "6rdwrex8kvV7LEPY3h47dL3YyYR88pyPNP5GjgWSEZsE", "3nzFgKZBfyBC2oyAJnihgdezcCfEsUWJUpi5xR5qLNNG", "9gnu1Esq6tgwYnKSWBf5m3XSs9XJWK8UksVPPWGfCXYb", "HU1JbiFiSxeH4narXqXpgFZZZorcyF9toUNYKpCaieF8"]}
2023-10-11T00:12:44.473Z	debug	layer/object.go:252	put object	{"reqId": "6046e93a-4bc4-4401-a4ec-7bb92968286a", "bucket": "yournamehere-cl96tp4ul4bvyi6i-2", "cid": "8E8VhzySor5D5KWSNTpwiSKuu6afVvDmVzJffGmnDf1g", "object": "mymultipart", "oid": "HS45zgdNH3et4amxMgUg3CWXwGodhTRiG4pztMnwhrVL"}
2023-10-11T00:12:51.459Z	warn	layer/multipart_upload.go:467	could not delete upload part	{"object id": "6rdwrex8kvV7LEPY3h47dL3YyYR88pyPNP5GjgWSEZsE", "bucket id": "8E8VhzySor5D5KWSNTpwiSKuu6afVvDmVzJffGmnDf1g", "error": "mark object removal via connection pool: remove object via client: delete object on client: rpc error: code = Canceled desc = context canceled"}
2023-10-11T00:12:51.460Z	warn	layer/multipart_upload.go:467	could not delete upload part	{"object id": "3nzFgKZBfyBC2oyAJnihgdezcCfEsUWJUpi5xR5qLNNG", "bucket id": "8E8VhzySor5D5KWSNTpwiSKuu6afVvDmVzJffGmnDf1g", "error": "mark object removal via connection pool: failed to collect relatives: failed to get raw object header: read object header via client: write request: rpc error: code = Canceled desc = context canceled"}
2023-10-11T00:12:51.461Z	warn	layer/multipart_upload.go:467	could not delete upload part	{"object id": "9gnu1Esq6tgwYnKSWBf5m3XSs9XJWK8UksVPPWGfCXYb", "bucket id": "8E8VhzySor5D5KWSNTpwiSKuu6afVvDmVzJffGmnDf1g", "error": "mark object removal via connection pool: failed to collect relatives: failed to get raw object header: read object header via client: write request: rpc error: code = Canceled desc = context canceled"}
2023-10-11T00:12:51.461Z	warn	layer/multipart_upload.go:467	could not delete upload part	{"object id": "HU1JbiFiSxeH4narXqXpgFZZZorcyF9toUNYKpCaieF8", "bucket id": "8E8VhzySor5D5KWSNTpwiSKuu6afVvDmVzJffGmnDf1g", "error": "mark object removal via connection pool: failed to collect relatives: failed to get raw object header: read object header via client: write request: rpc error: code = Canceled desc = context canceled"}
2023-10-11T00:12:51.462Z	error	handler/util.go:29	call method	{"status": 500, "request_id": "6046e93a-4bc4-4401-a4ec-7bb92968286a", "method": "CompleteMultipartUpload", "bucket": "yournamehere-cl96tp4ul4bvyi6i-2", "object": "mymultipart", "description": "could not complete multipart upload", "error": "failed to remove node: rpc error: code = Canceled desc = context canceled", "uploadID": "08bfda57-93e3-4898-b2fc-9df9384d12e0", "Key": "mymultipart"}

Regression

Not sure if it ever worked, but it definitely should.

Full logs - containers_logs.tar.gz

@evgeniiz321 evgeniiz321 added bug Something isn't working triage labels Oct 11, 2023
@roman-khimov roman-khimov added this to the v0.30.0 milestone Oct 11, 2023
@smallhive smallhive self-assigned this Oct 13, 2023
@smallhive (Contributor)

Yeah, I believe this error was fixed in version 0.28.0.

According to the logs, an old version was used:

2023-10-10T23:01:38.619Z	info	s3-gw/app.go:405	application started	{"name": "neofs-s3-gw", "version": "v0.27.1"}

I highly recommend updating to the latest 0.29.0. With the latest S3 gateway version the test is green.

@evgeniiz321 (Author)

Now there is an error during the delete operation:

2023-10-18T01:35:55.672Z        debug   layer/tagging.go:182    target details  {"reqId": "703395af-c63b-4561-af41-4a8b3519bdfe", "bucket": "yournamehere-b908dd3koqrlao9s-1", "cid": "DBfnRAb5ox76DCN3Pm81rVEVyNosmgTPENwueSmDjNPe", "object": "mymultipart", "oid": "DWpxgn1LKzeR5EYtUKBYXWv3JnUaVTxegcVWCPedzfqo"}
2023-10-18T01:36:05.674Z        error   handler/util.go:29      call method     {"status": 500, "request_id": "703395af-c63b-4561-af41-4a8b3519bdfe", "method": "DeleteObject", "bucket": "yournamehere-b908dd3koqrlao9s-1", "object": "mymultipart", "description": "could not delete object", "error": "mark object removal via connection pool: context deadline exceeded"}

@smallhive (Contributor)

It looks like the same problem we have with multipart in the neofs-testcases repo: too many parts. I tried replacing the 30 MB object with a 6 MB object in the test settings; the test then took ~35 s (vs. 90 s with 30 MB) and 3/3 runs were green.
A 6 MB object still exercises multipart, because it produces two parts in the current dev-env.

@smallhive (Contributor)

Still, a real user may hit the same issue. Do we need to react to this somehow, e.g. return something like ErrBusy?

Will the node continue the delete operation after the gateway's context is cancelled? The user may come back later, try the delete again, and get an "object not found" error because the node has already done its work.

@evgeniiz321 (Author)

Another option worth considering is a 202 status code, but the operation would have to be asynchronous in that case (or at least on timeout): if we hit a timeout, we could return 202 and, on the next request, check whether the object was actually deleted.

@evgeniiz321 (Author)

evgeniiz321 commented Oct 23, 2023

Found another failing test: test_versioning_bucket_multipart_upload_return_version_id.
It fails the same way as the original one, with:

2023-10-23T20:04:49.764Z        debug   layer/object.go:252     put object      {"reqId": "49883a89-0051-4a25-9782-457626593c06", "bucket": "yournamehere-840y0o5hwn7icbkr-1", "cid": "5fWDqULDzS2e9ym93xbnrwZHgt3g6Bw7L7bmJWkHX55r", "object": "bar", "oid": "AR4BV2eVbVKYp85oBNifSjfBW7ccXufDDPfV7CM1vLaJ"}
2023-10-23T20:05:08.703Z        warn    layer/multipart_upload.go:467   could not delete upload part    {"object id": "573ZfRY1E1PGh8xmZNCVeJ9TKRAJ5czTyBvV5oyGdTXu", "bucket id": "5fWDqULDzS2e9ym93xbnrwZHgt3g6Bw7L7bmJWkHX55r", "error": "mark object removal via connection pool: rpc error: code = Canceled desc = context canceled"}
2023-10-23T20:05:08.704Z        error   handler/util.go:29      call method     {"status": 500, "request_id": "49883a89-0051-4a25-9782-457626593c06", "method": "CompleteMultipartUpload", "bucket": "yournamehere-840y0o5hwn7icbkr-1", "object": "bar", "description": "could not complete multipart upload", "error": "failed to remove node: rpc error: code = Canceled desc = context canceled", "uploadID": "a1bb51c2-d199-41cf-831e-baa2078d0c8d", "Key": "bar"}

The latest dev-env from master was used.

@roman-khimov (Member)

IIUC, the problem is object reslicing during finalization, which can take an arbitrary amount of time. In that sense:

  1. Reslicing is bad. It can be avoided; at the moment we allow parts to have different sizes (but see Make it easy to seek through large objects, neofs-api#264).
  2. Having compound slices is bad as well; it should not happen. Parts of a multipart upload should become slices of the overall object; we don't need per-part compound objects.

@roman-khimov (Member)

We need to slice objects in the gateway in a way that doesn't require reslicing (a part in, a set of additional objects out). This will change somewhat after nspcc-dev/neofs-node#2729, but those changes will be minor.

smallhive added a commit that referenced this issue Feb 22, 2024
    Closes #843.

    Signed-off-by: Evgenii Baidakov <evgenii@nspcc.io>

smallhive added commits that referenced this issue Feb 28 and Feb 29, 2024 (same message).

smallhive added commits that referenced this issue Mar 14 and Mar 15, 2024
    Closes #843.

    Before, on multipartComplete all parts of the Big object were downloaded
    into memory and a new object was put to NeoFS. With these changes we
    slice the object during the part-uploading routine.

    Signed-off-by: Evgenii Baidakov <evgenii@nspcc.io>

smallhive added commits that referenced this issue Mar 15, 2024
    Closes #843.

    Before, multipartComplete read all parts of the Big object into memory,
    combined them and generated the final Big object. This step consumed time
    and memory; eventually any system will fail to load all parts into memory,
    or will time out during the process.
    Now the object-slicing process starts from the first uploaded part,
    calculating each part's hash and the whole-object hash as parts arrive
    and storing the hash state in each part's metadata in the tree service.

    Signed-off-by: Evgenii Baidakov <evgenii@nspcc.io>
roman-khimov added a commit that referenced this issue Mar 26, 2024
3 participants