Skip to content
This repository has been archived by the owner on Oct 28, 2021. It is now read-only.

Garbage collect incompatible peers in Host::run() #5624

Merged
merged 7 commits into from
Jun 18, 2019
Merged
Show file tree
Hide file tree
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,7 @@
- Added: [#5591](https://github.com/ethereum/aleth/pull/5591) Network logging bugfixes and improvements and add p2pcap log channel.
- Added: [#5588](https://github.com/ethereum/aleth/pull/5588) Testeth prints similar test suite name suggestions, when the name passed in `-t` argument is not found.
- Added: [#5593](https://github.com/ethereum/aleth/pull/5593) Dynamically updating host ENR.
- Added: [#5624](https://github.com/ethereum/aleth/pull/5624) Remove useless peers from peer list.
- Changed: [#5532](https://github.com/ethereum/aleth/pull/5532) The leveldb is upgraded to 1.22. This is breaking change on Windows and the old databases are not compatible.
- Changed: [#5559](https://github.com/ethereum/aleth/pull/5559) Update peer validation error messages.
- Changed: [#5568](https://github.com/ethereum/aleth/pull/5568) Improve rlpx handshake log messages and create new rlpx log channel.
Expand Down
11 changes: 11 additions & 0 deletions libp2p/Common.h
Original file line number Diff line number Diff line change
Expand Up @@ -110,6 +110,17 @@ enum DisconnectReason
/// @returns the string form of the given disconnection reason.
std::string reasonOf(DisconnectReason _r);

enum HandshakeFailureReason
{
NoFailure = 0,
UnknownFailure,
Timeout,
TcpError,
FrameDecryptionFailure,
InternalError,
ProtocolError
};

using CapDesc = std::pair<std::string, unsigned>;
using CapDescSet = std::set<CapDesc>;
using CapDescs = std::vector<CapDesc>;
Expand Down
46 changes: 38 additions & 8 deletions libp2p/Host.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -196,6 +196,29 @@ void Host::stopCapabilities()
}
}

std::shared_ptr<Peer> Host::peer(NodeID const& _n) const
{
RecursiveGuard l(x_sessions);
auto it = m_peers.find(_n);
if (it == m_peers.end())
{
LOG(m_logger) << "Peer " << _n << " not found";
halfalicious marked this conversation as resolved.
Show resolved Hide resolved
return nullptr;
}
return it->second;
}

void Host::handshakeFailed(NodeID const& _n, HandshakeFailureReason _r)
{
std::shared_ptr<Peer> p = peer(_n);
if (!p)
{
cerror << "Peer " << _n << " not found";
return;
}
p->m_lastHandshakeFailure = _r;
}

void Host::doneWorking()
{
// Return early if we have no capabilities since there's nothing to do. We've already stopped
Expand Down Expand Up @@ -284,6 +307,7 @@ void Host::startPeerSession(Public const& _id, RLP const& _hello,
m_peers[_id] = peer;
}
}
peer->m_lastHandshakeFailure = NoFailure;
halfalicious marked this conversation as resolved.
Show resolved Hide resolved
if (peer->isOffline())
peer->m_lastConnected = chrono::system_clock::now();
peer->endpoint.setAddress(_s->remoteEndpoint().address());
Expand Down Expand Up @@ -783,15 +807,21 @@ void Host::run(boost::system::error_code const& _ec)
unsigned reqConn = 0;
{
RecursiveGuard l(x_sessions);
for (auto const& p : m_peers)
{
halfalicious marked this conversation as resolved.
Show resolved Hide resolved
bool haveSession = havePeerSession(p.second->id);
bool required = p.second->peerType == PeerType::Required;
if (haveSession && required)
reqConn++;
else if (!haveSession && p.second->shouldReconnect() &&
(!m_netConfig.pin || required))
toConnect.push_back(p.second);
for (auto p = m_peers.cbegin(); p != m_peers.cend(); p++)
halfalicious marked this conversation as resolved.
Show resolved Hide resolved
{
bool haveSession = havePeerSession(p->second->id);
bool required = p->second->peerType == PeerType::Required;
if (haveSession && required)
reqConn++;
else if (!haveSession)
{
if (p->second->uselessPeer())
p = m_peers.erase(p);
else if (p->second->shouldReconnect() && (!m_netConfig.pin || required))
toConnect.push_back(p->second);
}
}
}
}

Expand Down
7 changes: 6 additions & 1 deletion libp2p/Host.h
Original file line number Diff line number Diff line change
Expand Up @@ -343,6 +343,11 @@ class Host: public Worker
/// Stop registered capabilities, typically done when the network is being shut down.
void stopCapabilities();

std::shared_ptr<Peer> peer(NodeID const& _n) const;

/// Set a handshake failure reason for a peer
void handshakeFailed(NodeID const& _n, HandshakeFailureReason _r);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe better onHandshakeFailed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I would make it public and declare close to startPeerSession

Copy link
Contributor Author

@halfalicious halfalicious Jun 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gumb0 : Why make this public, does it make sense to expose the concept of a handshake to consumers of Host?

Copy link
Member

@gumb0 gumb0 Jun 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well it looks to me like a callback similar to startPeerSession, the callback called by RLPXHandshake when handshake is finished. One is for success another one is for failure.

It works as private, because RLPXHandshake is a friend of Host, but idealluy we should get rid of this friend declarations at some point.

Exposing it to the clients of Host is of course not great, but the proper way to deal with it could be to create a separate interface with these callbacks only, don't expose it to Host clients, but pass it only to RLPXHandshake. That's a bit complicated change, at least it's not for this PR.

(In other words, we won't make it much worse, because startPeerSession is already public)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well it looks to me like a callback similar to startPeerSession, the callback called by RLPXHandshake when handshake is finished. One is for success another one is for failure.

It works as private, because RLPXHandshake is a friend of Host, but idealluy we should get rid of this friend declarations at some point.

Exposing it to the clients of Host is of course not great, but the proper way to deal with it could be to create a separate interface with these callbacks only, don't expose it to Host clients, but pass it only to RLPXHandshake. That's a bit complicated change, at least it's not for this PR.

(In other words, we won't make it much worse, because startPeerSession is already public)

Ah that makes sense, thank you for clarifying! 😄 I'll make the change before merging.


bytes m_restoreNetwork; ///< Set by constructor and used to set Host key and restore network peers & nodes.

std::atomic<bool> m_run{false}; ///< Whether network is running.
Expand Down Expand Up @@ -408,7 +413,7 @@ class Host: public Worker
/// logging to once every c_logActivePeersInterval seconds
std::chrono::steady_clock::time_point m_lastPeerLogMessage;

Logger m_logger{createLogger(VerbosityDebug, "net")};
mutable Logger m_logger{createLogger(VerbosityDebug, "net")};
Logger m_detailsLogger{createLogger(VerbosityTrace, "net")};
Logger m_infoLogger{createLogger(VerbosityInfo, "net")};
};
Expand Down
58 changes: 52 additions & 6 deletions libp2p/Peer.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -30,12 +30,12 @@ namespace dev

namespace p2p
{

Peer::Peer(Peer const& _original):
Node(_original),
Peer::Peer(Peer const& _original)
: Node(_original),
m_lastConnected(_original.m_lastConnected),
m_lastAttempted(_original.m_lastAttempted),
m_lastDisconnect(_original.m_lastDisconnect),
m_lastHandshakeFailure(_original.m_lastHandshakeFailure),
m_session(_original.m_session)
{
m_score = _original.m_score.load();
Expand All @@ -45,18 +45,64 @@ Peer::Peer(Peer const& _original):

bool Peer::shouldReconnect() const
{
return id && endpoint && chrono::system_clock::now() > m_lastAttempted + chrono::seconds(fallbackSeconds());
return id && endpoint &&
fallbackSeconds() != numeric_limits<unsigned>::max() &&
halfalicious marked this conversation as resolved.
Show resolved Hide resolved
chrono::system_clock::now() > m_lastAttempted + chrono::seconds(fallbackSeconds());
}


bool Peer::uselessPeer() const
{
if (peerType == PeerType::Required)
return false;

switch (m_lastHandshakeFailure)
{
case FrameDecryptionFailure:
case ProtocolError:
return true;
default:
break;
}

switch (m_lastDisconnect)
{
case BadProtocol:
case UselessPeer:
case IncompatibleProtocol:
case UnexpectedIdentity:
case UserReason:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about UserReason - this could in theory be any kind of reason specific to subprotocol, including some temporary reasons, maybe it makes sense to try to reconnect

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose to treat UserReason as terminal because we only use it in 2 cases - if status validation fails:

std::string disconnectReason;
if (peerSessionInfo->clientVersion.find("/v0.7.0/") != string::npos)
disconnectReason = "Blacklisted client version.";
else
disconnectReason = _peer.validate(
host().chain().genesisHash(), host().protocolVersion(), host().networkId());
if (!disconnectReason.empty())
{
LOG(m_logger) << "Peer " << _peer.id() << " not suitable for sync: " << disconnectReason;
m_host.capabilityHost().disconnect(_peer.id(), p2p::UserReason);
return;
}

Here's where we perform the actual validation in EthereumPeer::validate:
if (m_networkId != _hostNetworkId)
error << "Network identifier mismatch. Host network id: " << _hostNetworkId
<< ", peer network id: " << m_networkId;
else if (m_protocolVersion != _hostProtocolVersion)
error << "Protocol version mismatch. Host protocol version: " << _hostProtocolVersion
<< ", peer protocol version: " << m_protocolVersion;
else if (m_genesisHash != _hostGenesisHash)
error << "Genesis hash mismatch. Host genesis hash: " << _hostGenesisHash.abridged()
<< ", peer genesis hash: " << m_genesisHash.abridged();
else if (m_asking != Asking::State && m_asking != Asking::Nothing)
error << "Peer banned for unexpected status message.";

  • If there's a bug in our network code and Session read validation fails:

    aleth/libp2p/Session.cpp

    Lines 409 to 418 in 505aead

    else if (_length != _expected)
    {
    // with static m_data-sized buffer this shouldn't happen unless there's a regression
    // sec recommends checking anyways (instead of assert)
    LOG(m_netLoggerError)
    << "Error reading - TCP read buffer length differs from expected frame size ("
    << _length << " != " << _expected << ")";
    disconnect(UserReason);
    return false;
    }

You bring up a good point though, technically it can be any reason specific to a subprotocol and other clients can also choose to send it to us for reasons specific to their implementation. I think that rather than treating it as immediately critical we should use it in some sort of reconnection count threshold.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gumb0 I've decided to change UserReason -> UselessPeer when status message validation fails for a peer (since if that happens the peer is effectively useless to us i.e. we can't sync with it). I've also added in logic in Peer::isUseless to take the number of failed connection attempts into account when the last disconnect isn't critical.

return true;
default:
break;
}
return false;
}


unsigned Peer::fallbackSeconds() const
{
constexpr unsigned oneYearInSeconds{60 * 60 * 24 * 360};

if (peerType == PeerType::Required)
return 5;

switch (m_lastHandshakeFailure)
{
case FrameDecryptionFailure:
case ProtocolError:
return oneYearInSeconds;
default:
break;
}

switch (m_lastDisconnect)
{
case BadProtocol:
return 30 * (m_failedAttempts + 1);
case UselessPeer:
case IncompatibleProtocol:
case UnexpectedIdentity:
case UserReason:
return oneYearInSeconds;
case TooManyPeers:
return 25 * (m_failedAttempts + 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m_failedAttempts seems to affect only the value of fallbackSeconds currently. Maybe we should use it now to retry several times and then go to "critical error, disconnect" state.
(at least for some cases of failures)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gumb0 Good idea - what do you think of this being taken care of in another PR? I'd like to limit the amount of changes I make to the peer gc logic in this PR so if something ends up breaking it will be easier to debug.

Copy link
Member

@gumb0 gumb0 Jun 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok for another PR, but it seems to be the matter of only adding condition like m_failedAttempts >= 20 to uselessPeer function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional thought is that we could make all this change more conservative if we leave fallbackSeconds() as it was before (so that we do the same reconnects as before) and have just this check for failed attempts count in uselessPeer (plus the new check for handshake failures)
This way it would reconnect with the same intervals for each case as before, but stop after limited number of attempts.

But I'm fine with it if you think it's better to immediately stop in some cases.

case ClientQuit:
Expand Down
8 changes: 7 additions & 1 deletion libp2p/Peer.h
Original file line number Diff line number Diff line change
Expand Up @@ -70,6 +70,10 @@ class Peer: public Node
/// Return true if connection attempt should be made to this peer or false if
bool shouldReconnect() const;

/// A peer which should never be reconnected to - e.g. it's running on a different network, we
/// don't have any common capabilities
bool uselessPeer() const;
halfalicious marked this conversation as resolved.
Show resolved Hide resolved

/// Number of times connection has been attempted to peer.
int failedAttempts() const { return m_failedAttempts; }

Expand All @@ -80,7 +84,8 @@ class Peer: public Node
void noteSessionGood() { m_failedAttempts = 0; }

private:
/// Returns number of seconds to wait until attempting connection, based on attempted connection history.
/// Returns number of seconds to wait until attempting connection, based on attempted connection history, or
/// numeric_limits<unsigned>::max() if a connection should never be attempted.
halfalicious marked this conversation as resolved.
Show resolved Hide resolved
unsigned fallbackSeconds() const;

std::atomic<int> m_score{0}; ///< All time cumulative.
Expand All @@ -92,6 +97,7 @@ class Peer: public Node
std::chrono::system_clock::time_point m_lastAttempted;
std::atomic<unsigned> m_failedAttempts{0};
DisconnectReason m_lastDisconnect = NoDisconnect; ///< Reason for disconnect that happened last.
HandshakeFailureReason m_lastHandshakeFailure = NoFailure; ///< Reason for most recent handshake failure

/// Used by isOffline() and (todo) for peer to emit session information.
std::weak_ptr<Session> m_session;
Expand Down
Loading