Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid retrying on IO errors when it’s unclear if the server received the request #2479

Merged
merged 4 commits into from
Oct 21, 2024

Conversation

barshaul
Copy link
Collaborator

Main Change:

ATM, on I/O errors, we would reconnect to the failed node and retry the request if there were more retries left. This approach had a critical issue: we couldn’t reliably determine if the server had already received the request before the connection was broken. Retrying in such cases could result in duplicate command execution.

Example:

  1. Client sends INCR key.
  2. Server receives the request and increments the key (e.g., key = 1).
  3. A network issue disconnects the client before the server can respond.
  4. The client reconnects and retries the INCR key.
  5. Server increments the key again (now key = 2). BAD outcome.

This PR differentiates between errors where it’s safe to retry and those where it’s not. Specifically, with multiplexed connections, if the send function returns an error, it guarantees that the server never received the data, making retries safe (see https://docs.rs/tokio/latest/tokio/sync/mpsc/struct.Sender.html#method.send). For other errors, where we can’t be certain, retries are unsafe and will not be automatically attempted. Instead, these errors will now be returned to the user, who must manually retry if they determine it’s safe.

Test Changes:

Since I/O errors are now returned to the user, tests that previously killed the server now loop to retry the request, simulating the handling of I/O errors on the user side.

Refresh Slots Change:

While testing this fix, I found that when all connections were unavailable and refresh_slots was called, it didn’t raise the expected allConnectionsUnavailable error. This has been fixed by updating random_connections function to return an Option. Now, if no connections are found, refresh_slots raises the allConnectionsUnavailable error immediately. The state then shifts to reconnecting to the initial nodes, and slot refreshes are attempted using the new connections.

Out of Scope:

A future PR could introduce a new configuration option to enable retries on connection error, allowing users to control this behavior at the client level.

@barshaul barshaul requested a review from a team as a code owner October 20, 2024 10:32
@barshaul barshaul force-pushed the dont_retry_on_io_errors branch from 13a11da to 53bfd90 Compare October 20, 2024 10:32
…the request

Signed-off-by: barshaul <barshaul@amazon.com>
Signed-off-by: barshaul <barshaul@amazon.com>
@barshaul barshaul requested a review from eifrah-aws October 21, 2024 11:54
@@ -7,7 +7,7 @@ use crate::connection::{
resp2_is_pub_sub_state_cleared, resp3_is_pub_sub_state_cleared, ConnectionAddr, ConnectionInfo,
Msg, RedisConnectionInfo,
};
#[cfg(any(feature = "tokio-comp"))]
#[cfg(feature = "tokio-comp")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch 👍🏽

@@ -83,7 +83,7 @@ macro_rules! reconnect_if_dropped {
macro_rules! reconnect_if_io_error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: macro should be renamed to reflect the change: e.g. reconnect_if_conn_dropped

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'll change it, but in a different PR we need to remove this whole file - we have no use in redis-rs's connection_manager

Signed-off-by: barshaul <barshaul@amazon.com>
Signed-off-by: barshaul <barshaul@amazon.com>
@barshaul barshaul merged commit 914fd60 into valkey-io:release-1.2 Oct 21, 2024
12 checks passed
prateek-kumar-improving pushed a commit that referenced this pull request Oct 21, 2024
…the request (#2479)

* Avoid retrying on IO errors when it’s unclear if the server received the request

Signed-off-by: barshaul <barshaul@amazon.com>
avifenesh pushed a commit to avifenesh/valkey-glide that referenced this pull request Oct 21, 2024
…the request (valkey-io#2479)

* Avoid retrying on IO errors when it’s unclear if the server received the request

Signed-off-by: barshaul <barshaul@amazon.com>
Signed-off-by: avifenesh <aviarchi1994@gmail.com>
Muhammad-awawdi-amazon pushed a commit to Muhammad-awawdi-amazon/valkey-glide that referenced this pull request Oct 22, 2024
…the request (valkey-io#2479)

* Avoid retrying on IO errors when it’s unclear if the server received the request

Signed-off-by: barshaul <barshaul@amazon.com>
Muhammad-awawdi-amazon pushed a commit to Muhammad-awawdi-amazon/valkey-glide that referenced this pull request Oct 22, 2024
…the request (valkey-io#2479)

* Avoid retrying on IO errors when it’s unclear if the server received the request

Signed-off-by: barshaul <barshaul@amazon.com>
cyip10 pushed a commit that referenced this pull request Oct 23, 2024
…the request (#2479)

* Avoid retrying on IO errors when it’s unclear if the server received the request

Signed-off-by: barshaul <barshaul@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants