Fixes for a failover scenario. #2260

Merged

fyrchik merged 5 commits into nspcc-dev:support/v0.35 on Feb 18, 2023
Conversation

@fyrchik (Contributor) commented Feb 17, 2023

In master:
- the network/cache-related changes will be replaced with the gRPC-native keepalive mechanism (see the sketch below);
- client cache separation should be approached more rigorously: it seems like a good idea, but we need to evaluate it further.
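A minimal sketch of what the gRPC-native keepalive could look like on the dial side, assuming the stock google.golang.org/grpc/keepalive package; the helper name and parameter values are illustrative, not taken from this repository:

```go
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

// dialWithKeepalive lets gRPC itself detect dead peers instead of a
// custom cache-expiry layer: the client pings an idle connection and
// tears it down when the ping goes unacknowledged.
func dialWithKeepalive(addr string) (*grpc.ClientConn, error) {
	return grpc.Dial(addr,
		grpc.WithTransportCredentials(insecure.NewCredentials()), // plain TCP for the sketch
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                10 * time.Second, // ping after 10s of inactivity
			Timeout:             5 * time.Second,  // close if the ping is unanswered
			PermitWithoutStream: true,             // ping even with no active RPCs
		}),
	)
}
```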

Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
Timeouts on the client side should not affect inter-node communication.

Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
The problem is that accidental timeout errors can make us ignore
other nodes for some time. The primary purpose of the whole ignore
mechanism is to avoid degradation in case of failover, and for that,
closing the connection and limiting the number of dials is enough
(see the sketch after this commit message).

Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
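A minimal sketch of the policy described in the two commit messages above; all names (node, reportError, the 30-second penalty) are illustrative assumptions, not identifiers from the repository:

```go
package failover

import (
	"context"
	"errors"
	"sync/atomic"
	"time"
)

// node holds hypothetical per-endpoint failover state.
type node struct {
	brokenUntil atomic.Int64 // unix nanoseconds until which the node is ignored
}

// reportError applies the policy from the commit message: an accidental
// timeout only drops the connection (the next call redials), while other
// errors put the node on the ignore list for a while.
func (n *node) reportError(err error, closeConn func()) {
	switch {
	case err == nil:
		return
	case errors.Is(err, context.DeadlineExceeded):
		closeConn() // redial later, but keep the node selectable
	default:
		n.brokenUntil.Store(time.Now().Add(30 * time.Second).UnixNano())
	}
}

// usable reports whether the node may be picked for new requests.
func (n *node) usable() bool {
	return time.Now().UnixNano() >= n.brokenUntil.Load()
}
```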
Speed up big object construction in some cases.

Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
Currently, under a mixed load, one failed PUT can lead to closing the
connection for all concurrent GETs. For PUT this does no harm: we have
many other nodes to choose from. For GET we are limited by the `REP N`
factor, so in case of failover we can close the connection to the only
node possessing an object, which fails the whole operation (see the
sketch below).

Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
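A minimal sketch of per-operation client caches in the spirit of this commit, assuming gRPC connections; the type and field names are illustrative, not the ones used in neofs-node:

```go
package cache

import (
	"sync"

	"google.golang.org/grpc"
)

// clientCache is a hypothetical pool of connections keyed by node address.
type clientCache struct {
	mu    sync.Mutex
	conns map[string]*grpc.ClientConn
}

// dropConn closes and forgets the connection to addr, affecting only
// the callers that share this particular cache.
func (c *clientCache) dropConn(addr string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if conn, ok := c.conns[addr]; ok {
		_ = conn.Close()
		delete(c.conns, addr)
	}
}

// caches separates PUT and GET traffic: a failed PUT may drop a
// connection from putCache without killing in-flight GET streams,
// which matter more because GET is bounded by the REP N factor.
type caches struct {
	putCache *clientCache
	getCache *clientCache
}
```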
codecov bot commented Feb 17, 2023

Codecov Report

Merging #2260 (893fcd1) into support/v0.35 (8508f88) will decrease coverage by 0.01%.
The diff coverage is 5.26%.

❗ Current head 893fcd1 differs from pull request most recent head cb6fe97. Consider uploading reports for the commit cb6fe97 to get more accurate results

@@                Coverage Diff                @@
##           support/v0.35    #2260      +/-   ##
=================================================
- Coverage          31.01%   31.00%   -0.01%     
=================================================
  Files                384      384              
  Lines              28453    28461       +8     
=================================================
  Hits                8824     8824              
- Misses             18889    18897       +8     
  Partials             740      740              
| Impacted Files | Coverage Δ |
| --- | --- |
| cmd/neofs-node/config.go | 0.00% <0.00%> (ø) |
| cmd/neofs-node/object.go | 0.00% <0.00%> (ø) |
| pkg/services/object/get/exec.go | 71.58% <100.00%> (ø) |


@carpawell (Member) left a comment
While that helped, do we have an issue tracking the failover problems? I am not totally sure that separating client caches solves the 20s+ object streaming/fetching issue; it still requires additional investigation.

@@ -128,7 +128,7 @@ func (exec execCtx) key() (*ecdsa.PrivateKey, error) {
 }
 
 func (exec *execCtx) canAssemble() bool {
-	return exec.svc.assembly && !exec.isRaw() && !exec.headOnly()
+	return exec.svc.assembly && !exec.isRaw() && !exec.headOnly() && !exec.isLocal()
A reviewer (Member) commented on this diff:

Just to highlight: we have `exec.svc.assembly`, and it is somehow always true.

@carpawell (Member):

A CHANGELOG entry would also be appreciated.

@fyrchik (Contributor, Author) commented Feb 17, 2023

Yes, we temporarily disabled compression. Currently, we cannot control the disk load created by the policer (well, we can, but it is suboptimal), and it iterates over each shard in order. The failover case makes things worse because, with high probability, we will also read the full object into memory to replicate it.

@fyrchik merged commit dea9713 into nspcc-dev:support/v0.35 on Feb 18, 2023
cthulhu-rider added a commit that referenced this pull request Feb 27, 2024
This reverts commit 907f427.

GET requests with TTL=1 are absolutely correct regardless of the need to
assemble the object. This TTL blocks request forwarding, so forwarding does
not happen. At the same time, assembling an object on the server spawns
new requests from scratch, which is not forwarding (see the sketch after
this commit message). The original commit does not describe the cause, and
it was never discovered.

Fixes #2447.

Signed-off-by: Leonard Lyubich <leonard@morphbits.io>
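A minimal sketch of the TTL semantics described above; the function, its parameters, and the decision order are illustrative assumptions, not the actual neofs-node request handling:

```go
package ttl

import "errors"

var errTTLExceeded = errors.New("TTL exceeded for a non-local object")

// handleGet mirrors the reasoning of the revert: TTL=1 only forbids
// *forwarding* the incoming request. Server-side assembly issues brand
// new requests from scratch, which is not forwarding, so a TTL=1 GET
// for a big object remains valid.
func handleGet(ttl uint32, storedLocally, canAssemble bool) error {
	switch {
	case storedLocally:
		return nil // serve from the local shards; TTL is irrelevant
	case ttl > 1:
		return nil // allowed to forward the request to other nodes
	case canAssemble:
		return nil // assemble via fresh child requests, not forwarding
	default:
		return errTTLExceeded
	}
}
```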