Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

transport_service: Improve connection stability by downgrading connections on substream inactivity #260

Merged
merged 35 commits into from
Oct 31, 2024

Conversation

lexnv
Copy link
Collaborator

@lexnv lexnv commented Sep 30, 2024

This PR advances the keep-alive timeout of the transport service.

Previously, the keep-alive timeout was triggered 5 seconds after the connection was reported to the transport service regardless of substream activity.

  • (0secs) T0: connection established; keep-alive timeout set to 5seconds in the future
  • (4secs) T1: substream A, B, C opened
  • (5secs) T2: keep-alive timeout triggered and the connection is downgraded. T1 was not taken into account, otherwise, the keep-alive timeout should be triggered at second 9 (T1 at 4 seconds + keepalive 5 seconds)
  • (6secs) T3: substreams A, B, C closed -> connection closes
  • (7secs) T4: cannot open new substreams even if we expected the connection to be kept alive for longer

In this PR:

  • KeepAliveTracker structure to forward the keep-alive timeout of connections.
  • Connection ID is forwarded to SubstreamOpened events to identify properly substream Ids. This is needed because the ConnectionContext contains up to two connections (primary and secondary)

Testing Done

  • test to ensure keepalive downgrades the connection after 5 seconds
  • test to ensure keepalive is forwarded on substream activity
  • test to ensure a downgraded connection with dropped substreams is closed

Closes #253.

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
timeouts

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
@lexnv lexnv added the enhancement New feature or request label Sep 30, 2024
@lexnv lexnv self-assigned this Sep 30, 2024
@lexnv lexnv changed the title TransportService: Improve connection stability by downgrading connections on substream inactivity transport_service: Improve connection stability by downgrading connections on substream inactivity Sep 30, 2024
Base automatically changed from lexnv/improve-protocol-logs to master October 1, 2024 14:48
lexnv added 12 commits October 1, 2024 17:49
…tions

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
dropped substreams

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
dropped

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
polling

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Copy link
Collaborator Author

@lexnv lexnv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will have another pass over KeepAliveTracker. Im not entirely happy at the moment that we need a service notification to advance the polling here, will have to see how this plays with low-connectivity environments

lexnv added 4 commits October 28, 2024 15:19
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Comment on lines +217 to +218
cx.waker().wake_by_ref();
return Poll::Pending;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A simplified example in the rust playground where I've reproduced this: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=182bf56449de7827c8f83b649dbba4fb

Copy link
Collaborator

@dmitry-markin dmitry-markin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

It looks like we don't track actual data transmission when rescheduling keep alive timeouts, do we? Then, is it possible we downgrade the connections and disconnect peers if no requests are sent/received, as Notifications protocol works over long-lived substreams and doesn't open new ones?

src/protocol/connection.rs Outdated Show resolved Hide resolved
src/protocol/transport_service.rs Outdated Show resolved Hide resolved
src/protocol/transport_service.rs Outdated Show resolved Hide resolved
src/protocol/transport_service.rs Outdated Show resolved Hide resolved
src/protocol/transport_service.rs Outdated Show resolved Hide resolved
lexnv and others added 3 commits October 31, 2024 11:43
Co-authored-by: Dmitry Markin <dmitry@markin.tech>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
@lexnv
Copy link
Collaborator Author

lexnv commented Oct 31, 2024

Then, is it possible we downgrade the connections and disconnect peers if no requests are sent/received, as Notifications protocol works over long-lived substreams and doesn't open new ones?

Yep absolutely! Offhand, the easiest way may be to use BandwidthSink to expose more info, will work on this as a followup to unblock this PR, thanks!

@lexnv lexnv merged commit 314a2e9 into master Oct 31, 2024
8 checks passed
@lexnv lexnv deleted the lexnv/stable-connections branch October 31, 2024 10:21
lexnv added a commit that referenced this pull request Nov 4, 2024
## [0.8.0] - 2024-11-01

This release adds support for content provider advertisement and
discovery to Kademlia protocol implementation (see libp2p
[spec](https://github.com/libp2p/specs/blob/master/kad-dht/README.md#content-provider-advertisement-and-discovery)).
Additionally, the release includes several improvements and memory leak
fixes to enhance the stability and performance of the litep2p library.

### Added

- kad: Providers part 8: unit, e2e, and `libp2p` conformance tests
([#258](#258))
- kad: Providers part 7: better types and public API, public addresses &
known providers ([#246](#246))
- kad: Providers part 6: stop providing
([#245](#245))
- kad: Providers part 5: `GET_PROVIDERS` query
([#236](#236))
- kad: Providers part 4: refresh local providers
([#235](#235))
- kad: Providers part 3: publish provider records (start providing)
([#234](#234))

### Changed

- transport_service: Improve connection stability by downgrading
connections on substream inactivity
([#260](#260))
- transport: Abort canceled dial attempts for TCP, WebSocket and Quic
([#255](#255))
- kad/executor: Add timeout for writting frames
([#277](#277))
- kad: Avoid cloning the `KademliaMessage` and use reference for
`RoutingTable::closest`
([#233](#233))
- peer_state: Robust state machine transitions
([#251](#251))
- address_store: Improve address tracking and add eviction algorithm
([#250](#250))
- kad: Remove unused serde cfg
([#262](#262))
- req-resp: Refactor to move functionality to dedicated methods
([#244](#244))
- transport_service: Improve logs and move code from tokio::select macro
([#254](#254))

### Fixed

- tcp/websocket/quic: Fix cancel memory leak
([#272](#272))
- transport: Fix pending dials memory leak
([#271](#271))
- ping: Fix memory leak of unremoved `pending_opens`
([#274](#274))
- identify: Fix memory leak of unused `pending_opens`
([#273](#273))
- kad: Fix not retrieving local records
([#221](#221))

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Co-authored-by: Dmitry Markin <dmitry@markin.tech>
github-merge-queue bot pushed a commit to paritytech/polkadot-sdk that referenced this pull request Nov 5, 2024
This PR updates litep2p to the latest release.

- `KademliaEvent::PutRecordSucess` is renamed to fix word typo
- `KademliaEvent::GetProvidersSuccess` and
`KademliaEvent::IncomingProvider` are needed for bootnodes on DHT work
and will be utilized later


### Added

- kad: Providers part 8: unit, e2e, and `libp2p` conformance tests
([#258](paritytech/litep2p#258))
- kad: Providers part 7: better types and public API, public addresses &
known providers ([#246](paritytech/litep2p#246))
- kad: Providers part 6: stop providing
([#245](paritytech/litep2p#245))
- kad: Providers part 5: `GET_PROVIDERS` query
([#236](paritytech/litep2p#236))
- kad: Providers part 4: refresh local providers
([#235](paritytech/litep2p#235))
- kad: Providers part 3: publish provider records (start providing)
([#234](paritytech/litep2p#234))

### Changed

- transport_service: Improve connection stability by downgrading
connections on substream inactivity
([#260](paritytech/litep2p#260))
- transport: Abort canceled dial attempts for TCP, WebSocket and Quic
([#255](paritytech/litep2p#255))
- kad/executor: Add timeout for writting frames
([#277](paritytech/litep2p#277))
- kad: Avoid cloning the `KademliaMessage` and use reference for
`RoutingTable::closest`
([#233](paritytech/litep2p#233))
- peer_state: Robust state machine transitions
([#251](paritytech/litep2p#251))
- address_store: Improve address tracking and add eviction algorithm
([#250](paritytech/litep2p#250))
- kad: Remove unused serde cfg
([#262](paritytech/litep2p#262))
- req-resp: Refactor to move functionality to dedicated methods
([#244](paritytech/litep2p#244))
- transport_service: Improve logs and move code from tokio::select macro
([#254](paritytech/litep2p#254))

### Fixed

- tcp/websocket/quic: Fix cancel memory leak
([#272](paritytech/litep2p#272))
- transport: Fix pending dials memory leak
([#271](paritytech/litep2p#271))
- ping: Fix memory leak of unremoved `pending_opens`
([#274](paritytech/litep2p#274))
- identify: Fix memory leak of unused `pending_opens`
([#273](paritytech/litep2p#273))
- kad: Fix not retrieving local records
([#221](paritytech/litep2p#221))

See release changelog for more details:
https://github.com/paritytech/litep2p/releases/tag/v0.8.0

cc @paritytech/networking

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Co-authored-by: Dmitry Markin <dmitry@markin.tech>
lexnv added a commit to paritytech/polkadot-sdk that referenced this pull request Nov 15, 2024
This PR updates litep2p to the latest release.

- `KademliaEvent::PutRecordSucess` is renamed to fix word typo
- `KademliaEvent::GetProvidersSuccess` and
`KademliaEvent::IncomingProvider` are needed for bootnodes on DHT work
and will be utilized later

- kad: Providers part 8: unit, e2e, and `libp2p` conformance tests
([#258](paritytech/litep2p#258))
- kad: Providers part 7: better types and public API, public addresses &
known providers ([#246](paritytech/litep2p#246))
- kad: Providers part 6: stop providing
([#245](paritytech/litep2p#245))
- kad: Providers part 5: `GET_PROVIDERS` query
([#236](paritytech/litep2p#236))
- kad: Providers part 4: refresh local providers
([#235](paritytech/litep2p#235))
- kad: Providers part 3: publish provider records (start providing)
([#234](paritytech/litep2p#234))

- transport_service: Improve connection stability by downgrading
connections on substream inactivity
([#260](paritytech/litep2p#260))
- transport: Abort canceled dial attempts for TCP, WebSocket and Quic
([#255](paritytech/litep2p#255))
- kad/executor: Add timeout for writting frames
([#277](paritytech/litep2p#277))
- kad: Avoid cloning the `KademliaMessage` and use reference for
`RoutingTable::closest`
([#233](paritytech/litep2p#233))
- peer_state: Robust state machine transitions
([#251](paritytech/litep2p#251))
- address_store: Improve address tracking and add eviction algorithm
([#250](paritytech/litep2p#250))
- kad: Remove unused serde cfg
([#262](paritytech/litep2p#262))
- req-resp: Refactor to move functionality to dedicated methods
([#244](paritytech/litep2p#244))
- transport_service: Improve logs and move code from tokio::select macro
([#254](paritytech/litep2p#254))

- tcp/websocket/quic: Fix cancel memory leak
([#272](paritytech/litep2p#272))
- transport: Fix pending dials memory leak
([#271](paritytech/litep2p#271))
- ping: Fix memory leak of unremoved `pending_opens`
([#274](paritytech/litep2p#274))
- identify: Fix memory leak of unused `pending_opens`
([#273](paritytech/litep2p#273))
- kad: Fix not retrieving local records
([#221](paritytech/litep2p#221))

See release changelog for more details:
https://github.com/paritytech/litep2p/releases/tag/v0.8.0

cc @paritytech/networking

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Co-authored-by: Dmitry Markin <dmitry@markin.tech>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TransportService: Improve connection stability by downgrading connections on substream inactivity
2 participants