two p2p feature enhancement proposal #4053

Hyung-bharvest · 2019-10-14T09:45:35Z

I would like to suggest two p2p feature inhancement.

wildcard_peer_ids

If id of a peer is in this list, the peer can try peering my node even if max_num_inbound(outbound)_peers is reached.
benefit : nodes can allow peering with most trusted list of peers(wildcard peers) even if maximum is reached by other peers.

persistent_peers_dialing_max_period(in seconds)

When exponential backoff, the dialing period will not increase more than this period
benefit : Tendermint will continuesly try dialing all persistent peers indefinitely so that even though it disconnected for a while, it tries to reconnect.

Please share opinions(agreement, disagreement, additional feature related to these features, possible vulnerabilities, etc). Also if anyone wants to add other improvement on p2p peer management, we can add those to our job list.(such as blacklist_peer_ids)

These two ideas were one of the most pain-in-the-ass problems for real validator operation experience. It will greatly improve the stability of connection among nodes.

Based on feedbacks from community, we start hacking on these two issues. We will ask a grant of 2k atom from community fund by governance proposal "after" we PR this. We expect PR to be ready in October.

mdyring · 2019-10-15T10:01:30Z

Good suggestions.

Not sure I understand the rationale for differentiating between persistent peers and "wildcard peers", is it because you don't want to provide a static IP for a peers that should always be allowed an inbound slot?

I think the default behavior should be to allow explicitly configured peers a connection to slot (similar to what is suggested in #2041).

Instead of having both a "wildcard peers" and "persistent peers", perhaps something like this would work instead:

persistent_peers are used as today to establish outbound connections
if an inbound connection arrives with a nodeid listed in persistent_peers, and there is no existing outbound connection to that nodeid, it will be allowed an inbound slot
limits on in/outbound connections should be changed to "on-demand" connections and not apply to explicitly configured peers in the config.

persistent_peers_dialing_max_period sounds sensible!

Creamers158 · 2019-10-15T10:33:12Z

Agree, think you would get something like:
persistent_peers = "e4d19c423f4b2ft7e40905b320d54155907b6b54@23.34.567.34:26656,7f40e45c3e3fc1d4dc68edc977b8e3eeb4e50724,b2ft7e40905320d541423f4b2ft@75.34.564.67:26656"

This way you can also setup a connection direction without worrying about open peer slots.

Also, when having persistent peers I can imaging that some of them need to kept private. When 'fixing' the persistent_peer make sure to look at #1705 so it wont expose private peers that are part of the persistent_peer.

Hyung-bharvest · 2019-10-15T13:52:34Z

I like both suggestion! Thanks for sharing your thoughts! So in summary,

gaiad will allow any peer having node id which is listed in persistent peers even if when max inbound(outbound) peer num is reached.
we can write id@ip(to allow and also to dial outbound) or id(to allow inbound or maybe outbound dialing from addressbook) in persistent peers to wildcard the peer.
now we dont need additional parameter wildcard peer ids in config.

It sounds neater solution. How do you think? @mdrying @Creamers158

//edit

We found out that putting different structures in one list(id@ip and id) is creating some implementation codebase complexity. So we came back to the original idea to have seperate wildcard_peer_ids, which only allow to write in id-only-format.

Another option raised is to put inbound wildcard in persistent_peers as originally allowed id@0.0.0.0:0 format. But it seems like creating new confusion and unnecessary errors.

To include every persistent_peer as a wildcard_peer is still an open question.

Hyung-bharvest · 2019-10-18T06:18:15Z

Below is our up-to-date detail specs.
Please share opinions.

wildcard_peer_ids

1) problem

When max_num_inbound_peers or max_num_outbound_peers of a node reached, the node cannot spare more slots to any peer. Therefore, in case of disconnection, very important connections can be lost indefinitely because all slots consumed by other peers.

2) objective

Allow list of peers connect to my node by inbound or outbound, regardless of max_num_inbound_peers or max_num_outbound_peers reached.
This configure is especially helpful if a node operator want to dedicate some peer slots to number of trusted peers.

3) implementation detail

add WildcardPeerIDs in config/config.go

    // Wildcard node id list for limit of MaxNumInboundPeers, MaxNumOutboundPeers
    WildCardPeerIDs string `mapstructure:"wildcard_peer_ids"`

add WildcardPeerIDs in config/toml.go

    # Wildcard node id list for limit of max_num_inbound_peers, max_num_outbound_peers
    wildcard_peer_ids = "{{ .P2P.WildCardPeerIDs }}"

add WildcardPeerIDs error check in node/node.go

    err = sw.AddWildCardPeerIDs(splitAndTrimEmpty(config.P2P.WildCardPeerIDs, ",", " "))
    if err != nil {
    	return nil, errors.Wrap(err, "could not add peer ids from wildcard_peer_ids field")
    }

add 1 to numToDial and maxAttempts when the peer is wildcard in p2p/pex/pex_reactor.go

    if r.Switch.IsWildCardPeer(try.ID){
    	numToDial += 1
    	maxAttempts += 1
    }

add WildcardPeerIDs in p2p/switch.go

    wildCardPeerIDs      map[ID]struct{}
    
    ...
    
    wildCardPeerIDs:      make(map[ID]struct{}),

add exception of WildcardPeerIDs when counting NumPeers in p2p/switch.go

    if !sw.IsWildCardPeer(peer.ID()){
    	if peer.IsOutbound() {
    		outbound++
    	} else {
    		inbound++
    	}
    }

add IsWildCardPeer and AddWildCardPeerIDs in p2p/switch.go

    func (sw *Switch) IsWildCardPeer(id ID) bool {
    	_, ok := sw.wildCardPeerIDs[id]
    	return ok
    }
    
    func (sw *Switch) AddWildCardPeerIDs(ids []string) error {
    	sw.Logger.Info("Adding wildcard peer ids", "ids", ids)
    	for _, id := range ids {
    		err := validateID(ID(id))
    		if err != nil {
    			return err
    		}
    		sw.wildCardPeerIDs[ID(id)] = struct{}{}
    	}
    	return nil
    }

add exception of WildcardPeerIDs when checking MaxNumInboundPeers in p2p/switch.go

    if !sw.IsWildCardPeer(p.NodeInfo().ID()){
    	// Ignore connection if we already have enough peers, except wildcard peer
    	_, in, _ := sw.NumPeers()
    	if in >= sw.config.MaxNumInboundPeers {
    		sw.Logger.Info(
    			"Ignoring inbound connection: already have enough inbound peers",
    			"address", p.SocketAddr(),
    			"have", in,
    			"max", sw.config.MaxNumInboundPeers,
    		)
    
    		sw.transport.Cleanup(p)
    
    		continue
    	}
    
    }

maximum_dial_period

1) problem

exponential backoff of dialing combined with short term disconnection of a persistent peer can cause maximum times of exponential backoff resulting in no-dial to this persistent peer for indefinite future.

2) objective

For persistent peers, prevent exponential backoff of dialing backoffs more than given maximum_dial_period.

3) implementation detail

- add `MaximumDialPeriod` in config/config.go

    MaximumDialPeriod time.Duration `mapstructure:"maximum_dial_period"`
    
    ...
    
    MaximumDialPeriod:       60 * time.Second,

add MaximumDialPeriod in config/toml.go

    # Maximum dial period seconds when exponential back off for persistent peers, 0 == default as exponential back off
    maximum_dial_period = "{{ .P2P.MaximumDialPeriod }}"

add MaximumDialPeriod in node/node.go

    MaximumDialPeriod: config.P2P.MaximumDialPeriod,

add isBackoffExpected in p2p/pex/pex_reactor.go

    func (r *PEXReactor) isBackoffExcepted(addr *p2p.NetAddress) bool {
    	return r.config.MaximumDialPeriod != 0 && r.Switch.IsPeerPersistent(addr)
    }
    
    ...
    
    // Maximum dial period seconds when exponential backoff for persistent peers, when value exist
    	MaximumDialPeriod time.Duration

add exception on backoffDuration for persistent peers in p2p/pex/pex_reactor.go

    func (r *PEXReactor) dialPeer(addr *p2p.NetAddress) error {
    	attempts, lastDialed := r.dialAttemptsInfo(addr)
    
    	if attempts > maxAttemptsToDial && !r.isBackoffExcepted(addr) {
    		// TODO(melekes): have a blacklist in the addrbook with peers whom we've
    		// failed to connect to. Then we can clean up attemptsToDial, which acts as
    		// a blacklist currently.
    		// https://github.com/tendermint/tendermint/issues/3572
    		r.book.MarkBad(addr)
    		return errMaxAttemptsToDial{}
    	}
    
    	// exponential backoff if it's not our first attempt to dial given address
    	if attempts > 0 {
    		jitterSeconds := time.Duration(cmn.RandFloat64() * float64(time.Second)) // 1s == (1e9 ns)
    		var backoffDuration time.Duration
    		backoffDuration = jitterSeconds + ((1 << uint(attempts)) * time.Second)
    
    		if backoffDuration > r.config.MaximumDialPeriod && r.isBackoffExcepted(addr){
    			backoffDuration = r.config.MaximumDialPeriod
    		}
    		sinceLastDialed := time.Since(lastDialed)
    		if sinceLastDialed < backoffDuration {
    			return errTooEarlyToDial{backoffDuration, lastDialed}
    		}
    	}
    
    ...

alexanderbez · 2019-10-21T19:06:47Z

Thanks @dlguddus for writing up a concise spec outline. Would you be able to write up the context/background and the proposal (it can be concise) into a quick ADR? This would happen ideally before a detailed spec (a spec may not even be necessary here) and implementation.

That being said, I'm not too privy to the mechanics of Tendermint's p2p layer and config so correct me if I'm wrong here. The goal of wildcard peer IDs seems like it should be accomplished by the already existing persistent peer IDs config? That is, even if all connection slots are filled, persistent peers always have the privilege to connect. Is this not the case, if not, why?

However, I do think the maximum_dial_period proposal makes sense 👍

Hyung-bharvest · 2019-10-22T00:08:09Z

Difference between persistent peers and wildcard:

persistent peers only deal with outbound but wildcard deal with both in/outbound
wildcard does not write ip address but only id

Because of struct difference(id@ip vs id), we think it is too complicated to put wildcard function into persistent peers.

ADR : #4072

ADR waiting for review : @tessr @melekes

Hyung-bharvest · 2019-10-22T02:26:01Z

Agree, think you would get something like:
persistent_peers = "e4d19c423f4b2ft7e40905b320d54155907b6b54@23.34.567.34:26656,7f40e45c3e3fc1d4dc68edc977b8e3eeb4e50724,b2ft7e40905320d541423f4b2ft@75.34.564.67:26656"

This way you can also setup a connection direction without worrying about open peer slots.

Also, when having persistent peers I can imaging that some of them need to kept private. When 'fixing' the persistent_peer make sure to look at #1705 so it wont expose private peers that are part of the persistent_peer.

is #1705 not resolved yet? if it still stays in bug status, we might want to include it to this proposal.
@Creamers158

tessr · 2019-11-19T15:20:34Z

I admit I don't operate a validator, and I'm still coming up to speed with lots of things here. But can you share more context on

These two ideas were one of the most pain-in-the-ass problems for real validator operation experience. It will greatly improve the stability of connection among nodes.

What exactly are the problems that you experienced? It sounds like this is a reasonable proposal but I'd like to better understand the issue. Thank you!

Hyung-bharvest · 2019-11-19T16:03:27Z

I admit I don't operate a validator, and I'm still coming up to speed with lots of things here. But can you share more context on

These two ideas were one of the most pain-in-the-ass problems for real validator operation experience. It will greatly improve the stability of connection among nodes.

What exactly are the problems that you experienced? It sounds like this is a reasonable proposal but I'd like to better understand the issue. Thank you!

In PoS, nodes have to share its ip and id to connect to other nodes. And there exists a significant risk from ddos attacks to even node compromisation.

So what we construct is a sentry architecture, having public nodes(sentry) in front and we have internal nodes to bumper the connection from public internet to the precious validator.

These internal connections should be always connected. But, because of exponential backoff and max_num_peer, a disconnected connection often never come back. This causes instability of internal(or trusted) connection quite frequently.

But I think this is just a tip of iceberg. As a validator operator, we experience many other issues too. This ADR is just a pilot project to have experience with tendermint team to acquire better approach and communication from ourselves. If we think it is efficient and worth well, we would like to dig in dipper to suggest more improvement of p2p feature in tendermint.

tessr · 2019-11-19T17:07:28Z

OK, thank you for the explanation. I hear you're coming to our dev session on Monday, and it sounds like there will be further discussion then. I'm looking forward to it!

Refs #4053 ## Commits: * Create adr-050-improved-trusted-peering.md * Modify `maximum_dial_period` Modify `maximum_dial_period` to `persistent_peers_maximum_dial_period` * Update adr-050-improved-trusted-peering.md * Update docs/architecture/adr-050-improved-trusted-peering.md Co-Authored-By: Tess Rinearson <tess.rinearson@gmail.com> * Update docs/architecture/adr-050-improved-trusted-peering.md Co-Authored-By: Tess Rinearson <tess.rinearson@gmail.com> * Update docs/architecture/adr-050-improved-trusted-peering.md Co-Authored-By: Tess Rinearson <tess.rinearson@gmail.com> * Update docs/architecture/adr-050-improved-trusted-peering.md Co-Authored-By: Tess Rinearson <tess.rinearson@gmail.com> * wildcard -> unconditional wildcard -> unconditional * Remove blank lines * fix spelling * add quotes

…od` (#4176) implementation spec of Improved Trusted Peering ADR-050 by B-Harvest - add unconditional_peer_ids and persistent_peers_max_dial_period to config - add unconditionalPeerIDs map to Switch struct default config value of persistent_peers_max_dial_period is 0s(disabled) Refs #4072, #4053

melekes added C:p2p Component: P2P pkg S:proposal Status: Proposal labels Oct 15, 2019

tessr changed the title ~~two p2p feature inhancement proposal~~ two p2p feature enhancement proposal Nov 19, 2019

Hyung-bharvest mentioned this issue Nov 20, 2019

Create adr-050-improved-trusted-peering.md #4072

Merged

5 tasks

dongsam mentioned this issue Nov 21, 2019

p2p: implement adr-050 #4176

Merged

5 tasks

brapse added this to the ABCI Updates milestone Dec 3, 2019

melekes closed this as completed Dec 4, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

two p2p feature enhancement proposal #4053

two p2p feature enhancement proposal #4053

Hyung-bharvest commented Oct 14, 2019 •

edited

Loading

mdyring commented Oct 15, 2019

Creamers158 commented Oct 15, 2019

Hyung-bharvest commented Oct 15, 2019 •

edited

Loading

Hyung-bharvest commented Oct 18, 2019 •

edited

Loading

alexanderbez commented Oct 21, 2019

Hyung-bharvest commented Oct 22, 2019 •

edited

Loading

Hyung-bharvest commented Oct 22, 2019

tessr commented Nov 19, 2019

Hyung-bharvest commented Nov 19, 2019

tessr commented Nov 19, 2019

two p2p feature enhancement proposal #4053

two p2p feature enhancement proposal #4053

Comments

Hyung-bharvest commented Oct 14, 2019 • edited Loading

mdyring commented Oct 15, 2019

Creamers158 commented Oct 15, 2019

Hyung-bharvest commented Oct 15, 2019 • edited Loading

Hyung-bharvest commented Oct 18, 2019 • edited Loading

wildcard_peer_ids

1) problem

2) objective

3) implementation detail

maximum_dial_period

1) problem

2) objective

3) implementation detail

alexanderbez commented Oct 21, 2019

Hyung-bharvest commented Oct 22, 2019 • edited Loading

Hyung-bharvest commented Oct 22, 2019

tessr commented Nov 19, 2019

Hyung-bharvest commented Nov 19, 2019

tessr commented Nov 19, 2019

Hyung-bharvest commented Oct 14, 2019 •

edited

Loading

Hyung-bharvest commented Oct 15, 2019 •

edited

Loading

Hyung-bharvest commented Oct 18, 2019 •

edited

Loading

Hyung-bharvest commented Oct 22, 2019 •

edited

Loading