Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TransportService: Improve connection stability by downgrading connections on substream inactivity #253

Closed
lexnv opened this issue Sep 25, 2024 · 0 comments · Fixed by #260
Labels
bug Something isn't working

Comments

@lexnv
Copy link
Collaborator

lexnv commented Sep 25, 2024

Context

The transport service handler intent is to downgrade connections (leading to possibly closing them), if substreams where not open in a given timeframe:

/// Close the connection if no substreams are open within this time frame.
keep_alive_timeout: Duration,
/// Pending keep-alive timeouts.
pending_keep_alive_timeouts: FuturesUnordered<BoxFuture<'static, (PeerId, ConnectionId)>>,

When a connection is established, the timeout is properly tracked:

self.pending_keep_alive_timeouts.push(Box::pin(async move {
tokio::time::sleep(keep_alive_timeout).await;
(peer, connection_id)
}));

When the timeout expires, the connection is downgraded:

while let Poll::Ready(Some((peer, connection_id))) =
self.pending_keep_alive_timeouts.poll_next_unpin(cx)
{
if let Some(context) = self.connections.get_mut(&peer) {
tracing::trace!(
target: LOG_TARGET,
?peer,
?connection_id,
"keep-alive timeout over, downgrade connection",
);
context.downgrade(&connection_id);
}
}

Issue

Opening of substreams are not taken into account for downgrading connections:

Some(event) => return Poll::Ready(Some(event.into())),

Here we should match for Some(InnerTransportEvent::SubstreamOpened and move forward the timeout future.

With the current implementation, we downgrade connections at 5 seconds intervals.

Implications

We need to double-check if the protocol name matches the substream open protocol.
Further, InnerTransportEvent::SubstreamOpened needs to be extended with a connectionID because the TransportService holds up to 2 connections (primary and secondary). We should properly advance the timeout of the connection ID.

@lexnv lexnv added the bug Something isn't working label Sep 25, 2024
lexnv added a commit that referenced this issue Oct 31, 2024
…tions on substream inactivity (#260)

This PR advances the keep-alive timeout of the transport service.

Previously, the keep-alive timeout was triggered 5 seconds after the
connection was reported to the transport service regardless of substream
activity.

- (0secs) T0: connection established; keep-alive timeout set to 5seconds
in the future
- (4secs) T1: substream A, B, C opened
- (5secs) T2: keep-alive timeout triggered and the connection is
downgraded. T1 was not taken into account, otherwise, the keep-alive
timeout should be triggered at second 9 (T1 at 4 seconds + keepalive 5
seconds)
- (6secs) T3: substreams A, B, C closed -> connection closes
- (7secs) T4: cannot open new substreams even if we expected the
connection to be kept alive for longer

In this PR:
- `KeepAliveTracker` structure to forward the keep-alive timeout of
connections.
- Connection ID is forwarded to `SubstreamOpened` events to identify
properly substream Ids. This is needed because the `ConnectionContext`
contains up to two connections (primary and secondary)

### Testing Done
- test to ensure keepalive downgrades the connection after 5 seconds
- test to ensure keepalive is forwarded on substream activity
- test to ensure a downgraded connection with dropped substreams is
closed


Closes #253.

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Co-authored-by: Dmitry Markin <dmitry@markin.tech>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
1 participant