Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detect closed notification substreams instead of evicting all peers #3983

Merged
merged 8 commits into from
Apr 9, 2024

Conversation

dmitry-markin
Copy link
Contributor

This PR brings the fix paritytech/substrate#13396 to polkadot-sdk.

In the past, due to insufficient inbound slot count on polkadot & kusama, this fix led to low peer count. The situation has improved since then after changing the default ratio between --in-peers & --out-peers.

Nevertheless, it's expected that the reported total peer count with this fix is going to be lower than without it. This should be seen as the correct number of working connections reported, as opposed to also reporting already closed connections, and not as lower count of working connections with peers.

This PR also removes the peer eviction mechanism, as closed substream detection is a more granular way of detecting peers that stopped syncing with us.

The burn-in has been already performed as part of testing these changes in #3426.

@dmitry-markin dmitry-markin added the T0-node This PR/Issue is related to the topic “node”. label Apr 4, 2024
@dmitry-markin dmitry-markin requested review from bkchr and lexnv April 4, 2024 16:37
@dmitry-markin dmitry-markin self-assigned this Apr 4, 2024
@dmitry-markin dmitry-markin changed the title Detect closed notifications substream instead of evicting all peers Detect closed notification substreams instead of evicting all peers Apr 4, 2024
///
/// To prevent this from happening, define a threshold for how long `SyncingEngine` should wait
/// before it starts evicting peers.
const INITIAL_EVICTION_WAIT_PERIOD: Duration = Duration::from_secs(2 * 60);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dq: The worst that can happen now is to stall indefinitely? And we presume that previously, we evicted peers because they closed their notification substreams but we didn't notice it on time (ie before the eviction timer triggered all peers to be dropped)? 🤔

I think this is related to: #3346

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order for this to happen, all peers have to keep the notification channel open and misbehave by not providing any useful data. This seems unlikely to me.

Yes, the peer eviction kicked in only if we had all notification streams half-closed, but didn't notice it.

I think we can revisit peer eviction and may be come up with a solution tracking individual peers (what has been done at some point i the past).

Also, people complained that suddenly the number of connected peers dropped to zero due to eviction. Now it should work on an individual peer level.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we can keep the eviction as is for now and merge only the closed substream detection, and handle eviction in a separate PR.

@lexnv @bkchr what do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We just added this to work around the bug that this pr is fixing, thus I think it is fine to remove this. Especially as we already have seen issues with it.

Copy link
Contributor

@lexnv lexnv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@dmitry-markin dmitry-markin added this pull request to the merge queue Apr 9, 2024
@dmitry-markin dmitry-markin removed this pull request from the merge queue due to a manual request Apr 9, 2024
@dmitry-markin dmitry-markin enabled auto-merge April 9, 2024 12:19
@dmitry-markin dmitry-markin added this pull request to the merge queue Apr 9, 2024
Merged via the queue into master with commit a26d25d Apr 9, 2024
102 of 133 checks passed
@dmitry-markin dmitry-markin deleted the dm-detect-closed-substream branch April 9, 2024 13:03
girazoki pushed a commit to moondance-labs/polkadot-sdk that referenced this pull request Apr 23, 2024
…aritytech#3983)

This PR brings the fix
paritytech/substrate#13396 to polkadot-sdk.

In the past, due to insufficient inbound slot count on polkadot &
kusama, this fix led to low peer count. The situation has improved since
then after changing the default ratio between `--in-peers` &
`--out-peers`.

Nevertheless, it's expected that the reported total peer count with this
fix is going to be lower than without it. This should be seen as the
correct number of working connections reported, as opposed to also
reporting already closed connections, and not as lower count of working
connections with peers.

This PR also removes the peer eviction mechanism, as closed substream
detection is a more granular way of detecting peers that stopped syncing
with us.

The burn-in has been already performed as part of testing these changes
in paritytech#3426.

---------

Co-authored-by: Aaro Altonen <a.altonen@hotmail.com>
girazoki pushed a commit to moondance-labs/polkadot-sdk that referenced this pull request Apr 23, 2024
…aritytech#3983)

This PR brings the fix
paritytech/substrate#13396 to polkadot-sdk.

In the past, due to insufficient inbound slot count on polkadot &
kusama, this fix led to low peer count. The situation has improved since
then after changing the default ratio between `--in-peers` &
`--out-peers`.

Nevertheless, it's expected that the reported total peer count with this
fix is going to be lower than without it. This should be seen as the
correct number of working connections reported, as opposed to also
reporting already closed connections, and not as lower count of working
connections with peers.

This PR also removes the peer eviction mechanism, as closed substream
detection is a more granular way of detecting peers that stopped syncing
with us.

The burn-in has been already performed as part of testing these changes
in paritytech#3426.

---------

Co-authored-by: Aaro Altonen <a.altonen@hotmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T0-node This PR/Issue is related to the topic “node”.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants