-
Notifications
You must be signed in to change notification settings - Fork 215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix libp2p identify race #6573
base: develop
Are you sure you want to change the base?
Fix libp2p identify race #6573
Conversation
c72d92c
to
119c374
Compare
Codecov ReportAttention: Patch coverage is
✅ All tests successful. No failed tests found.
Additional details and impacted files@@ Coverage Diff @@
## develop #6573 +/- ##
=======================================
Coverage 79.7% 79.7%
=======================================
Files 355 355
Lines 47168 47170 +2
=======================================
+ Hits 37620 37639 +19
+ Misses 7404 7388 -16
+ Partials 2144 2143 -1 ☔ View full report in Codecov by Sentry. |
When a P2P test is set up using `mocknet.FullMeshConnected(...)` and then calls `p2p/server.New(...)`, there's a possible race due to how `libp2p` `identify` service works. Namely, when a new peer connects, an active `identify` request is initiated towards it asking in particular what protocols does the peer support, to which the peer must reply with an identify response message. Also, when `SetStreamHandler` is called, an identify response message is pushed towards the currently connected peers. In some cases, the following race is possible: 1. Peer `A` connects to peer `B`. 2. Peer `B` sends identify request to peer `A`. 3. Peer `A` sends response to the identify request from peer `A`. This response contains the list of protocols, but that list misses the protocol which is used for `Server` in p.4, b/c `Server` is not set up yet. 4. Peer `A` sets up a `Server` which uses `SetStreamHandler`, and at this point peer `A` sends pushes an identify response message to peer `B`, _without_ corresponding identify request. 5. Peer `B` receives pushed identify response from `A` which is sent in p.4, despite it being sent after the response in p.3. This may happen due to how `libp2p` handles incoming requests. Peer `B` sets the supported protocols in its `ProtoBook` for peer `A`, the list of protocols now contains the protocol specfied for the `Server` in p.4. 6. Peer `B` receives identify response from `A` which was sent in p.3, despite it being sent before p.4, due to possible reordering. This response also has a list of protocols, but it misses the protocol specified for the `Server` in p.4. Peer `B` again sets the supported protocols in its `ProtoBook` for peer `A`, but now that list misses the necessary protocol. 7. Peer `B` tries to find peers which support the protocol used for the `Server` in p.4, or connect to peer `B` using that protocol. This fails b/c `ProtoBook` entry for peer `A` contains wrong protocol list. In addition to this, there's an issue with protocol support checks which `Fetcher` does to check which peers it can retrieve data from. When a peer is freshly connected, the active identify request towards it may not be finished yet when the fetcher tries to check that peer. Although unlikely, in some cases this may cause valid peers to get ignored. This change removes the instances of use of `mocknet.FullMeshConnected(...)` where it may cause identify race, replacing it with `mocknet.FullMeshLinked(...)` followed by `mesh.ConnectAllButSelf()` after the `Server`s are set up. It also fixes fetcher peer selection mechanism so it waits for any pending identification request to finish, similar how to `Host.NewStream` does that.
119c374
to
54840f8
Compare
p2p/host.go
Outdated
libp2p.WithFxOption(fx.Invoke(func(ids identify.IDService) { | ||
identifyConn = func(c network.Conn) { | ||
ids.IdentifyConn(c) | ||
} | ||
})), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like a hack to me, especially since later down we panic if this doesn't work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should not fail unless fx
DI library that libp2p
uses is broken.
There's a panic
call further down to clarify that in case if this doesn't work.
Will try to get rid of this function variable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It now uses a variable of identity.IDService
type and returns an error instead of panicking if for whatever reason fx
fails to invoke the function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This still feels like we are digging deep through the layers of libp2p
to access functionality that isn't intended to be used outside of libp2p
. libp2p.WithFxOption
is marked as experimental and fx
itself is used for dependency injection / builder patterns in libp2p. libp2p.New
probably returns an interface instead of a concrete type such that the maintainers can make big refactorings without breaking user code.
Us interjecting the build process of whatever is behind the interface returned by lip2p.New
will probably only lead to us having to rewrite code with future updates of the library. I suggest a slightly different approach (see my other comment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The nature of libp2p
is that the DI has to be used anyway, e.g. in all the cases like this
Line 302 in 1fd07d4
func(upgrader transport.Upgrader, rcmgr network.ResourceManager) (transport.Transport, error) { |
It can also be possible to get hold on the
*BasicHost
via other DI tricks while not using WithFxOption
but IMO that would be more hacky
p2p/upgrade.go
Outdated
if bh, ok := h.(*basichost.BasicHost); ok { | ||
fh.identifyConn = func(conn network.Conn) { | ||
bh.IDService().IdentifyConn(conn) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if this cast fails? If libp2p.New
returns an interface - we shouldn't assume a specific implementation to be behind that interface or this might randomly break (silently) at some point in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some clarifying comments. *basichost.BasicHost
is expected when libp2p
's mocknet is being used, and there's actually no better way than casting host.Host
to *basichost.BasicHost
in this case to obtain IDService
, as libp2p.New()
and fx
dependency injection is not used in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe instead of accessing libp2p internals we should just use the methods that are exposed via the Host
interface. Connect
establishes the connection to the peer and calls IDService:IdentifyWait()
before returning. It would also allow us to pass a timeout if we want that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sadly there seems to be no other way, Connect
does not do this if the peer is already connected (which it is in this case), see below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if the peer is already connected - Connect
(and with it IdentifyWait
) have already been called?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If peer is already connected and is listed among connected peers, this does not mean Connect()
and IdentifyWait()
have necessarily been finished. So the race is still possible.
// Make sure that the protocol list for the peer is correct. | ||
// This is similar to what Host.NewStream does to make | ||
// sure it is possible to use one of the specified | ||
// protocols. If we don't do this, there may be a race causing | ||
// some peers to be unnecessarily ignored. | ||
host.Identify(peer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this:
// Make sure that the protocol list for the peer is correct. | |
// This is similar to what Host.NewStream does to make | |
// sure it is possible to use one of the specified | |
// protocols. If we don't do this, there may be a race causing | |
// some peers to be unnecessarily ignored. | |
host.Identify(peer) | |
pi := host.Peerstore().PeerInfo(peer) | |
if err := host.Connect(context.Background(), pi); err != nil { | |
f.logger.Debug("failed to connect to peer", | |
zap.Stringer("id", peer), | |
zap.Error(err), | |
) | |
return nil | |
} |
has basically the same effect without a) needing to intercept the building process of the libp2p
host and b) allowing us to pass a timeout to Connect
in case we want to abort early if something goes wrong.
Connect
in both implementations of Host
eventually calls IdentifyWait
which is also called by IdentifyCon
but allows passing a context in case we want to abort early.
Wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will not work b/c Connect
is noop in case if the peer is already connected:
https://github.com/libp2p/go-libp2p/blob/v0.38.1/p2p/host/basic/basic_host.go#L803
So it will not call IdentifyWait
in this case (it is invoked from dialPeer
: https://github.com/libp2p/go-libp2p/blob/v0.38.1/p2p/host/basic/basic_host.go#L826)
And if you force dial via a context option, there will be adverse side effects such as trying to establish a new connection, invoking the gater etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There may be a way of trying to create a new stream which also calls IdentifyWait
, but I do not like it either as it adds overhead of creating a new stream when it's not really needed, producing unwanted network traffic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand. For the network status to be connected doesn't this mean Connect
has already been called and it is OK that Connect
is a no-op? The first call to it will result in IdentifyWait
being called?
Again to me this feels wrong - pulling internals out of libp2p that look like they might not be intended to be used outside of the library (IdentifyConn
looks like a helper method for tests of the IDService
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The peer is listed in the addressbook before Connect
finishes.
The need to access internals is rather unfortunate but it's due to some inconvenient decisions in go-libp2p
codebase. Basically IDService
can be easily accessed if you know for sure your Host
is a *BasicHost
, but it is harder to reach if there are some Host
wrappers. So IDService
is not fully internal.
IdentifyConn
just calls IdentifyWait
(which is not only used in tests) and waits on the channel it returns
Motivation
This supersedes #6570
When a P2P test is set up using
mocknet.FullMeshConnected(...)
and then callsp2p/server.New(...)
, there's a possible race due to howlibp2p
identify
service works. Namely, when a new peer connects, an activeidentify
request is initiated towards it asking in particular what protocols does the peer support, to which the peer must reply with an identify response message. Also, whenSetStreamHandler
is called, an identify response message is pushed towards the currently connected peers. In some cases, the following race is possible:A
connects to peerB
.B
sends identify request to peerA
.A
sends response to the identify request from peerA
. This response contains the list of protocols, but that list misses the protocol which is used forServer
in p.4, b/cServer
is not set up yet.A
sets up aServer
which usesSetStreamHandler
, and at this point peerA
sends pushes an identify response message to peerB
, without corresponding identify request.B
receives pushed identify response fromA
which is sent in p.4, despite it being sent after the response in p.3. This may happen due to howlibp2p
handles incoming requests. PeerB
sets the supported protocols in itsProtoBook
for peerA
, the list of protocols now contains the protocol specfied for theServer
in p.4.B
receives identify response fromA
which was sent in p.3, despite it being sent before p.4, due to possible reordering. This response also has a list of protocols, but it misses the protocol specified for theServer
in p.4. PeerB
again sets the supported protocols in itsProtoBook
for peerA
, but now that list misses the necessary protocol.B
tries to find peers which support the protocol used for theServer
in p.4, or connect to peerB
using that protocol. This fails b/cProtoBook
entry for peerA
contains wrong protocol list.In addition to this, there's an issue with protocol support checks which
Fetcher
does to check which peers it can retrieve data from. When a peer is freshly connected, the active identify request towards it may not be finished yet when the fetcher tries to check that peer. Although unlikely, in some cases this may cause valid peers to get ignored.Description
This change removes the instances of use of
mocknet.FullMeshConnected(...)
where it may cause identify race, replacing it withmocknet.FullMeshLinked(...)
followed bymesh.ConnectAllButSelf()
after theServer
s are set up. It also fixes fetcher peer selection mechanism so it waits for any pending identification request to finish, similar how toHost.NewStream
does that.Previously, in some tests there was a check for protocol list contents in some tests, but it worked mostly by chance, and now is replaced with delayed mesh connection.
Test Plan
Make sure the tests pass.