Fixes for a failover scenario. #2260
Conversation
Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
Timeouts on the client side should not affect inter-node communication. Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
The problem is that accidental timeout errors can make us ignore other nodes for some time. The primary purpose of the whole ignore mechanism is to avoid degradation during failover. For this case, closing the connection and limiting the number of dials is enough. Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
Speed up big object construction in some cases. Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
Currently, under a mixed load one failed PUT can lead to closing the connection for all concurrent GETs. For PUT it does no harm: we have many other nodes to choose from. For GET we are limited by the `REP N` factor, so in case of failover we can close the connection with the only node possessing an object, which leads to failing the whole operation. Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
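One way to read this commit is as keeping separate connection pools per operation kind, so evicting a PUT client after a failure cannot tear down the connection a concurrent GET relies on. A hedged sketch; the struct and field names are assumptions, not the real cache types.

```go
package main

import "fmt"

// conn is a stand-in for a cached client connection (illustrative).
type conn struct{ addr string }

// opCache keeps one connection pool per operation kind, so that
// force-closing a PUT client after a failure cannot tear down the
// connection a concurrent GET relies on.
type opCache struct {
	put map[string]*conn
	get map[string]*conn
}

func newOpCache() *opCache {
	return &opCache{put: map[string]*conn{}, get: map[string]*conn{}}
}

// getConn returns the cached connection for addr in pool m,
// dialing (here: allocating) a new one on a miss.
func (c *opCache) getConn(m map[string]*conn, addr string) *conn {
	if cn, ok := m[addr]; ok {
		return cn
	}
	cn := &conn{addr: addr}
	m[addr] = cn
	return cn
}

func main() {
	c := newOpCache()
	p := c.getConn(c.put, "node1:8080")
	g := c.getConn(c.get, "node1:8080")
	// A PUT failure evicts only the PUT-side connection.
	delete(c.put, "node1:8080")
	fmt.Println(p != g)                   // the two pools hold distinct connections
	fmt.Println(c.get["node1:8080"] == g) // the GET connection survives
}
```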
Codecov Report
```
@@            Coverage Diff             @@
##          support/v0.35   #2260   +/- ##
=================================================
- Coverage         31.01%  31.00%   -0.01%
=================================================
  Files               384     384
  Lines             28453   28461       +8
=================================================
  Hits               8824    8824
- Misses            18889   18897       +8
  Partials            740     740
```
While that helped, do we have an issue tracking the failover problems? I am not totally sure that separating client caches solves the 20s+ object streaming/fetching; it still requires additional investigation.
```diff
@@ -128,7 +128,7 @@ func (exec execCtx) key() (*ecdsa.PrivateKey, error) {
 }

 func (exec *execCtx) canAssemble() bool {
-	return exec.svc.assembly && !exec.isRaw() && !exec.headOnly()
+	return exec.svc.assembly && !exec.isRaw() && !exec.headOnly() && !exec.isLocal()
```
Just to highlight: we have `exec.svc.assembly`, and it is somehow always true.
A CHANGELOG entry would also be appreciated.
Yes, we temporarily disabled compression. Currently, we cannot control disk load for the policer (well, we can, but it is suboptimal), and it iterates over each shard in order. The failover case makes things worse, because with high probability we will also read the full object into memory to replicate it.
This reverts commit 907f427. GET requests with TTL=1 are absolutely correct regardless of the need to assemble the object. This TTL blocks request forwarding - it does not happen. At the same time, assembling an object on the server spawns new requests from scratch, which is not forwarding. The original commit does not have a description of the cause, and it was never discovered. Fixes #2447. Signed-off-by: Leonard Lyubich <leonard@morphbits.io>
In master:
- `network/cache`-related changes will be replaced with the gRPC-native keepalive mechanism.
- Client cache separation should be approached more rigorously. It seems like a good idea, but we need to evaluate it further.
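For context, the gRPC-native keepalive mentioned above could be wired into the client dial roughly like this. A configuration sketch only: the parameter values are assumptions, not what master actually uses.

```go
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// dialNode opens a client connection that probes the node with
// HTTP/2 pings, so dead peers are detected without an application
// timeout ever putting the node on an ignore list.
func dialNode(addr string) (*grpc.ClientConn, error) {
	return grpc.Dial(addr,
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                10 * time.Second, // ping after this much idle time (assumed value)
			Timeout:             2 * time.Second,  // close the connection if the ping is not acked
			PermitWithoutStream: true,             // keep probing even with no active RPC
		}),
	)
}
```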