Fixes for a failover scenario. #2260

Merged

fyrchik merged 5 commits into nspcc-dev:support/v0.35 on Feb 18, 2023
Conversation

@fyrchik (Contributor) commented Feb 17, 2023

In master:
- the network/cache-related changes will be replaced with the gRPC-native keepalive mechanism (see the sketch below);
- client cache separation should be approached more rigorously: it seems like a good idea, but we need to evaluate it further.
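A minimal sketch of what the gRPC-native keepalive could look like on the dial side, assuming the stock google.golang.org/grpc/keepalive package; the helper name and parameter values are illustrative, not taken from this repository:

```go
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

// dialWithKeepalive lets gRPC itself detect dead peers instead of a
// custom cache-expiry layer: the client pings an idle connection and
// tears it down when the ping goes unacknowledged.
func dialWithKeepalive(addr string) (*grpc.ClientConn, error) {
	return grpc.Dial(addr,
		grpc.WithTransportCredentials(insecure.NewCredentials()), // plain TCP for the sketch
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                10 * time.Second, // ping after 10s of inactivity
			Timeout:             5 * time.Second,  // close if the ping is unanswered
			PermitWithoutStream: true,             // ping even with no active RPCs
		}),
	)
}
```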

Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
Timeouts on the client side should not affect inter-node communication.

Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
The problem is that accidental timeout errors can make us ignore
other nodes for some time. The primary purpose of the whole ignore
mechanism is to avoid degradation in case of failover, and for that,
closing the connection and limiting the number of dials is enough
(see the sketch after this commit message).

Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
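A minimal sketch of the policy described in the two commit messages above; all names (node, reportError, the 30-second penalty) are illustrative assumptions, not identifiers from the repository:

```go
package failover

import (
	"context"
	"errors"
	"sync/atomic"
	"time"
)

// node holds hypothetical per-endpoint failover state.
type node struct {
	brokenUntil atomic.Int64 // unix nanoseconds until which the node is ignored
}

// reportError applies the policy from the commit message: an accidental
// timeout only drops the connection (the next call redials), while other
// errors put the node on the ignore list for a while.
func (n *node) reportError(err error, closeConn func()) {
	switch {
	case err == nil:
		return
	case errors.Is(err, context.DeadlineExceeded):
		closeConn() // redial later, but keep the node selectable
	default:
		n.brokenUntil.Store(time.Now().Add(30 * time.Second).UnixNano())
	}
}

// usable reports whether the node may be picked for new requests.
func (n *node) usable() bool {
	return time.Now().UnixNano() >= n.brokenUntil.Load()
}
```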
Speed up big object construction in some cases.

Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
Currently, under a mixed load, one failed PUT can lead to closing the
connection for all concurrent GETs. For PUT this does no harm: we have
many other nodes to choose from. For GET we are limited by the `REP N`
factor, so in case of failover we can close the connection to the only
node possessing an object, which fails the whole operation (see the
sketch below).

Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
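A minimal sketch of per-operation client caches in the spirit of this commit, assuming gRPC connections; the type and field names are illustrative, not the ones used in neofs-node:

```go
package cache

import (
	"sync"

	"google.golang.org/grpc"
)

// clientCache is a hypothetical pool of connections keyed by node address.
type clientCache struct {
	mu    sync.Mutex
	conns map[string]*grpc.ClientConn
}

// dropConn closes and forgets the connection to addr, affecting only
// the callers that share this particular cache.
func (c *clientCache) dropConn(addr string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if conn, ok := c.conns[addr]; ok {
		_ = conn.Close()
		delete(c.conns, addr)
	}
}

// caches separates PUT and GET traffic: a failed PUT may drop a
// connection from putCache without killing in-flight GET streams,
// which matter more because GET is bounded by the REP N factor.
type caches struct {
	putCache *clientCache
	getCache *clientCache
}
```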
codecov bot commented Feb 17, 2023

Codecov Report

Merging #2260 (893fcd1) into support/v0.35 (8508f88) will decrease coverage by 0.01%.
The diff coverage is 5.26%.

❗ Current head 893fcd1 differs from pull request most recent head cb6fe97. Consider uploading reports for the commit cb6fe97 to get more accurate results

@@                Coverage Diff                @@
##           support/v0.35    #2260      +/-   ##
=================================================
- Coverage          31.01%   31.00%   -0.01%     
=================================================
  Files                384      384              
  Lines              28453    28461       +8     
=================================================
  Hits                8824     8824              
- Misses             18889    18897       +8     
  Partials             740      740              
| Impacted Files | Coverage Δ |
| --- | --- |
| cmd/neofs-node/config.go | 0.00% <0.00%> (ø) |
| cmd/neofs-node/object.go | 0.00% <0.00%> (ø) |
| pkg/services/object/get/exec.go | 71.58% <100.00%> (ø) |


@carpawell (Member) left a comment
While that helped, do we have an issue tracking the failover problems? I am not totally sure that separating client caches solves the 20s+ object streaming/fetching issue; it still requires additional investigation.

@@ -128,7 +128,7 @@ func (exec execCtx) key() (*ecdsa.PrivateKey, error) {
 }
 
 func (exec *execCtx) canAssemble() bool {
-	return exec.svc.assembly && !exec.isRaw() && !exec.headOnly()
+	return exec.svc.assembly && !exec.isRaw() && !exec.headOnly() && !exec.isLocal()
A reviewer (Member) commented on this diff:

Just to highlight: we have `exec.svc.assembly`, and it is somehow always true.

@carpawell (Member):

A CHANGELOG entry would also be appreciated.

@fyrchik (Contributor, Author) commented Feb 17, 2023

Yes, we temporarily disabled compression. Currently, we cannot control the disk load created by the policer (well, we can, but it is suboptimal), and it iterates over each shard in order. The failover case makes things worse because, with high probability, we will also read the full object into memory to replicate it.

@fyrchik merged commit dea9713 into nspcc-dev:support/v0.35 on Feb 18, 2023
cthulhu-rider added a commit that referenced this pull request Feb 27, 2024
This reverts commit 907f427.

GET requests with TTL=1 are absolutely correct regardless of the need to
assemble the object. This TTL blocks request forwarding, so forwarding does
not happen. At the same time, assembling an object on the server spawns
new requests from scratch, which is not forwarding (see the sketch after
this commit message). The original commit does not describe the cause, and
it was never discovered.

Fixes #2447.

Signed-off-by: Leonard Lyubich <leonard@morphbits.io>
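A minimal sketch of the TTL semantics described above; the function, its parameters, and the decision order are illustrative assumptions, not the actual neofs-node request handling:

```go
package ttl

import "errors"

var errTTLExceeded = errors.New("TTL exceeded for a non-local object")

// handleGet mirrors the reasoning of the revert: TTL=1 only forbids
// *forwarding* the incoming request. Server-side assembly issues brand
// new requests from scratch, which is not forwarding, so a TTL=1 GET
// for a big object remains valid.
func handleGet(ttl uint32, storedLocally, canAssemble bool) error {
	switch {
	case storedLocally:
		return nil // serve from the local shards; TTL is irrelevant
	case ttl > 1:
		return nil // allowed to forward the request to other nodes
	case canAssemble:
		return nil // assemble via fresh child requests, not forwarding
	default:
		return errTTLExceeded
	}
}
```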