transport: Abort canceled dial attempts for TCP, WebSocket and Quic #255
Merged
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
lexnv
changed the title
transport: Abort canceled dial attempts for TCP and WebSocket
transport: Abort canceled dial attempts for TCP, WebSocket and Quic
Sep 26, 2024
This was referenced Oct 22, 2024
dmitry-markin
approved these changes
Oct 29, 2024
Nicely done)
lexnv
added a commit
that referenced
this pull request
Oct 30, 2024
This fixes a bug in the TCP and Websocket transports that was leaking memory for:
- `canceled: HashSet<ConnectionId>` - leak since the beginning of litep2p
- `cancel_futures: HashMap<ConnectionId, AbortHandle>` added in unmerged #255

The memory leak is happening in the following scenarios:
- T0: transport manager: dials K (parallelism factor = 8) addresses on TCP and WebSocket on ConnectionId=1
- T1: TCP: establishes a connection with the peer ConnectionId=1
- T2: WebSocket: establishes a connection with the peer ConnectionId=1
- T3: transport manager: receives TCP establishment event and cancels `WebSocket` dials

The issue happens when T2 finishes before T3. In this situation, the WebSocket transport no longer has a future with a corresponding ConnectionId=1. The canceling method simply inserts ConnectionId=1 into a hashset.

This leads to the hashset growing over time, without a way to clean-up stale connection IDs.

The fix relies on the changes added in #255:
- `cancel_futures` maps a connection ID to an abort handle
- the `cancel_futures` is guaranteed to contain a connection ID that corresponds to an unfinished `pending_raw_connections` future
- the cancel method just aborts the in-flight future, if it exists
- state of the `cancel_futures` is done when polling `pending_raw_connections`

### Testing Done

I used a custom-patched version of litep2p to log the number of pending dials. After a few hours, the pending dials for both TCP and WebSocket connections stabilized at just a few. (Same as #271).

```
2024-10-22 17:37:56.252 INFO tokio-runtime-worker litep2p::tcp: status pending_dials=1 pending_inbound_connections=0 pending_connections=1 pending_raw_connections=0 opened_raw=0 cancel_futures=0 pending_open=0
2024-10-22 17:38:26.252 INFO tokio-runtime-worker litep2p::tcp: status pending_dials=0 pending_inbound_connections=0 pending_connections=0 pending_raw_connections=1 opened_raw=0 cancel_futures=1 pending_open=0
2024-10-22 17:38:56.253 INFO tokio-runtime-worker litep2p::tcp: status pending_dials=0 pending_inbound_connections=0 pending_connections=0 pending_raw_connections=0 opened_raw=0 cancel_futures=0 pending_open=0
2024-10-22 17:39:26.253 INFO tokio-runtime-worker litep2p::tcp: status pending_dials=0 pending_inbound_connections=0 pending_connections=0 pending_raw_connections=0 opened_raw=0 cancel_futures=0 pending_open=0
2024-10-22 17:39:56.252 INFO tokio-runtime-worker litep2p::tcp: status pending_dials=0 pending_inbound_connections=0 pending_connections=0 pending_raw_connections=0 opened_raw=0 cancel_futures=0 pending_open=0
2024-10-22 17:40:26.252 INFO tokio-runtime-worker litep2p::tcp: status pending_dials=0 pending_inbound_connections=0 pending_connections=0 pending_raw_connections=1 opened_raw=0 cancel_futures=1 pending_open=0
2024-10-22 17:40:56.252 INFO tokio-runtime-worker litep2p::tcp: status pending_dials=0 pending_inbound_connections=0 pending_connections=0 pending_raw_connections=0 opened_raw=0 cancel_futures=0 pending_open=0
2024-10-22 17:41:26.252 INFO tokio-runtime-worker litep2p::tcp: status pending_dials=0 pending_inbound_connections=0 pending_connections=0 pending_raw_connections=0 opened_raw=0 cancel_futures=0 pending_open=0
```

Build on: #255
Closes: #270

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Co-authored-by: Dmitry Markin <dmitry@markin.tech>
lexnv
added a commit
that referenced
this pull request
Nov 4, 2024
## [0.8.0] - 2024-11-01

This release adds support for content provider advertisement and discovery to Kademlia protocol implementation (see libp2p [spec](https://github.com/libp2p/specs/blob/master/kad-dht/README.md#content-provider-advertisement-and-discovery)). Additionally, the release includes several improvements and memory leak fixes to enhance the stability and performance of the litep2p library.

### Added

- kad: Providers part 8: unit, e2e, and `libp2p` conformance tests ([#258](#258))
- kad: Providers part 7: better types and public API, public addresses & known providers ([#246](#246))
- kad: Providers part 6: stop providing ([#245](#245))
- kad: Providers part 5: `GET_PROVIDERS` query ([#236](#236))
- kad: Providers part 4: refresh local providers ([#235](#235))
- kad: Providers part 3: publish provider records (start providing) ([#234](#234))

### Changed

- transport_service: Improve connection stability by downgrading connections on substream inactivity ([#260](#260))
- transport: Abort canceled dial attempts for TCP, WebSocket and Quic ([#255](#255))
- kad/executor: Add timeout for writting frames ([#277](#277))
- kad: Avoid cloning the `KademliaMessage` and use reference for `RoutingTable::closest` ([#233](#233))
- peer_state: Robust state machine transitions ([#251](#251))
- address_store: Improve address tracking and add eviction algorithm ([#250](#250))
- kad: Remove unused serde cfg ([#262](#262))
- req-resp: Refactor to move functionality to dedicated methods ([#244](#244))
- transport_service: Improve logs and move code from tokio::select macro ([#254](#254))

### Fixed

- tcp/websocket/quic: Fix cancel memory leak ([#272](#272))
- transport: Fix pending dials memory leak ([#271](#271))
- ping: Fix memory leak of unremoved `pending_opens` ([#274](#274))
- identify: Fix memory leak of unused `pending_opens` ([#273](#273))
- kad: Fix not retrieving local records ([#221](#221))

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Co-authored-by: Dmitry Markin <dmitry@markin.tech>
github-merge-queue bot
pushed a commit
to paritytech/polkadot-sdk
that referenced
this pull request
Nov 5, 2024
This PR updates litep2p to the latest release.

- `KademliaEvent::PutRecordSucess` is renamed to fix word typo
- `KademliaEvent::GetProvidersSuccess` and `KademliaEvent::IncomingProvider` are needed for bootnodes on DHT work and will be utilized later

### Added

- kad: Providers part 8: unit, e2e, and `libp2p` conformance tests ([#258](paritytech/litep2p#258))
- kad: Providers part 7: better types and public API, public addresses & known providers ([#246](paritytech/litep2p#246))
- kad: Providers part 6: stop providing ([#245](paritytech/litep2p#245))
- kad: Providers part 5: `GET_PROVIDERS` query ([#236](paritytech/litep2p#236))
- kad: Providers part 4: refresh local providers ([#235](paritytech/litep2p#235))
- kad: Providers part 3: publish provider records (start providing) ([#234](paritytech/litep2p#234))

### Changed

- transport_service: Improve connection stability by downgrading connections on substream inactivity ([#260](paritytech/litep2p#260))
- transport: Abort canceled dial attempts for TCP, WebSocket and Quic ([#255](paritytech/litep2p#255))
- kad/executor: Add timeout for writting frames ([#277](paritytech/litep2p#277))
- kad: Avoid cloning the `KademliaMessage` and use reference for `RoutingTable::closest` ([#233](paritytech/litep2p#233))
- peer_state: Robust state machine transitions ([#251](paritytech/litep2p#251))
- address_store: Improve address tracking and add eviction algorithm ([#250](paritytech/litep2p#250))
- kad: Remove unused serde cfg ([#262](paritytech/litep2p#262))
- req-resp: Refactor to move functionality to dedicated methods ([#244](paritytech/litep2p#244))
- transport_service: Improve logs and move code from tokio::select macro ([#254](paritytech/litep2p#254))

### Fixed

- tcp/websocket/quic: Fix cancel memory leak ([#272](paritytech/litep2p#272))
- transport: Fix pending dials memory leak ([#271](paritytech/litep2p#271))
- ping: Fix memory leak of unremoved `pending_opens` ([#274](paritytech/litep2p#274))
- identify: Fix memory leak of unused `pending_opens` ([#273](paritytech/litep2p#273))
- kad: Fix not retrieving local records ([#221](paritytech/litep2p#221))

See release changelog for more details: https://github.com/paritytech/litep2p/releases/tag/v0.8.0

cc @paritytech/networking

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Co-authored-by: Dmitry Markin <dmitry@markin.tech>
lexnv
added a commit
to paritytech/polkadot-sdk
that referenced
this pull request
Nov 15, 2024
This PR updates litep2p to the latest release.

- `KademliaEvent::PutRecordSucess` is renamed to fix word typo
- `KademliaEvent::GetProvidersSuccess` and `KademliaEvent::IncomingProvider` are needed for bootnodes on DHT work and will be utilized later
- kad: Providers part 8: unit, e2e, and `libp2p` conformance tests ([#258](paritytech/litep2p#258))
- kad: Providers part 7: better types and public API, public addresses & known providers ([#246](paritytech/litep2p#246))
- kad: Providers part 6: stop providing ([#245](paritytech/litep2p#245))
- kad: Providers part 5: `GET_PROVIDERS` query ([#236](paritytech/litep2p#236))
- kad: Providers part 4: refresh local providers ([#235](paritytech/litep2p#235))
- kad: Providers part 3: publish provider records (start providing) ([#234](paritytech/litep2p#234))
- transport_service: Improve connection stability by downgrading connections on substream inactivity ([#260](paritytech/litep2p#260))
- transport: Abort canceled dial attempts for TCP, WebSocket and Quic ([#255](paritytech/litep2p#255))
- kad/executor: Add timeout for writting frames ([#277](paritytech/litep2p#277))
- kad: Avoid cloning the `KademliaMessage` and use reference for `RoutingTable::closest` ([#233](paritytech/litep2p#233))
- peer_state: Robust state machine transitions ([#251](paritytech/litep2p#251))
- address_store: Improve address tracking and add eviction algorithm ([#250](paritytech/litep2p#250))
- kad: Remove unused serde cfg ([#262](paritytech/litep2p#262))
- req-resp: Refactor to move functionality to dedicated methods ([#244](paritytech/litep2p#244))
- transport_service: Improve logs and move code from tokio::select macro ([#254](paritytech/litep2p#254))
- tcp/websocket/quic: Fix cancel memory leak ([#272](paritytech/litep2p#272))
- transport: Fix pending dials memory leak ([#271](paritytech/litep2p#271))
- ping: Fix memory leak of unremoved `pending_opens` ([#274](paritytech/litep2p#274))
- identify: Fix memory leak of unused `pending_opens` ([#273](paritytech/litep2p#273))
- kad: Fix not retrieving local records ([#221](paritytech/litep2p#221))

See release changelog for more details: https://github.com/paritytech/litep2p/releases/tag/v0.8.0

cc @paritytech/networking

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Co-authored-by: Dmitry Markin <dmitry@markin.tech>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
The `TransportManager` initiates a dialing process on multiple addresses and multiple transports (e.g. TCP and WebSocket). The first established connection is reported back from the transport layer to the `TransportManager`. Then, the `TransportManager` cancels all ongoing dialing attempts on the remaining transports.

Previously, canceling meant storing a `ConnectionId` in a HashSet. This had the downside that all ongoing dial attempts were still polled to completion.

In this PR, the futures that establish socket connections are now aborted directly.
Example
T0. Manager initiates dialing on A, B, C for TCP and D, E, F on WebSocket
T1. Established socket connection on address A from TCP (B and C are dropped)
T2. Manager cancels D, E, F from WebSocket
Before
T2 implies adding a `ConnectionId` to a HashSet:
litep2p/src/transport/tcp/mod.rs
Lines 518 to 519 in 14dc4cc
The worst case scenario:
The best case scenario:
litep2p/src/transport/tcp/mod.rs
Lines 536 to 542 in d50ec10
After
The future that handles dialing is abortable. This way, we don't have to wait for (worst case) all addresses to fail to connect, or (best case) one address to connect.
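The difference between the two cancellation strategies can be sketched with a small std-only model. This is an illustration under stated assumptions, not the code from this PR: `ConnectionId` is aliased to `usize`, `AbortFlag` is a hypothetical stand-in for the `AbortHandle` kept in `cancel_futures`, and `CancelBySet`, `CancelByAbort` and their methods are made-up names.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for litep2p's `ConnectionId`.
type ConnectionId = usize;

/// Hypothetical stand-in for an abort handle: a shared flag that the
/// dialing future would observe in order to stop early.
#[derive(Clone, Default)]
struct AbortFlag(Arc<AtomicBool>);

impl AbortFlag {
    fn abort(&self) {
        self.0.store(true, Ordering::SeqCst);
    }

    fn is_aborted(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

/// "Before" (assumed model): cancel only records the ID. The dial keeps
/// running and the ID is removed only when the dial result arrives.
#[derive(Default)]
struct CancelBySet {
    canceled: HashSet<ConnectionId>,
}

impl CancelBySet {
    fn cancel(&mut self, id: ConnectionId) {
        self.canceled.insert(id);
    }

    /// Returns `true` if the finished dial should still be reported.
    fn on_dial_finished(&mut self, id: ConnectionId) -> bool {
        !self.canceled.remove(&id)
    }
}

/// "After" (assumed model): every in-flight dial owns an entry that maps
/// its ID to an abort flag, so cancel can stop the dial right away.
#[derive(Default)]
struct CancelByAbort {
    cancel_futures: HashMap<ConnectionId, AbortFlag>,
}

impl CancelByAbort {
    /// Registers an in-flight dial and returns the flag it should watch.
    fn dial(&mut self, id: ConnectionId) -> AbortFlag {
        let flag = AbortFlag::default();
        self.cancel_futures.insert(id, flag.clone());
        flag
    }

    /// Aborts the in-flight dial if it still exists; otherwise a no-op.
    fn cancel(&mut self, id: ConnectionId) {
        if let Some(flag) = self.cancel_futures.remove(&id) {
            flag.abort();
        }
    }

    /// Drops the bookkeeping entry once the dial has resolved.
    fn on_dial_finished(&mut self, id: ConnectionId) {
        self.cancel_futures.remove(&id);
    }
}
```

In the set-based model, a cancel that arrives after the dial has already finished leaves its ID in `canceled` with nothing left to remove it. In the map-based model, the same late cancel finds no entry and does nothing, while an early cancel both removes the entry and signals the dial to stop.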