Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix libp2p identify race #6573

Open
wants to merge 4 commits into
base: develop
Choose a base branch
from
Open

Fix libp2p identify race #6573

wants to merge 4 commits into from

Conversation

ivan4th
Copy link
Contributor

@ivan4th ivan4th commented Dec 26, 2024

Motivation

This supersedes #6570

When a P2P test is set up using mocknet.FullMeshConnected(...) and then calls p2p/server.New(...), there's a possible race due to how libp2p identify service works. Namely, when a new peer connects, an active identify request is initiated towards it asking in particular what protocols does the peer support, to which the peer must reply with an identify response message. Also, when SetStreamHandler is called, an identify response message is pushed towards the currently connected peers. In some cases, the following race is possible:

  1. Peer A connects to peer B.
  2. Peer B sends identify request to peer A.
  3. Peer A sends response to the identify request from peer A. This response contains the list of protocols, but that list misses the protocol which is used for Server in p.4, b/c Server is not set up yet.
  4. Peer A sets up a Server which uses SetStreamHandler, and at this point peer A sends pushes an identify response message to peer B, without corresponding identify request.
  5. Peer B receives pushed identify response from A which is sent in p.4, despite it being sent after the response in p.3. This may happen due to how libp2p handles incoming requests. Peer B sets the supported protocols in its ProtoBook for peer A, the list of protocols now contains the protocol specfied for the Server in p.4.
  6. Peer B receives identify response from A which was sent in p.3, despite it being sent before p.4, due to possible reordering. This response also has a list of protocols, but it misses the protocol specified for the Server in p.4. Peer B again sets the supported protocols in its ProtoBook for peer A, but now that list misses the necessary protocol.
  7. Peer B tries to find peers which support the protocol used for the Server in p.4, or connect to peer B using that protocol. This fails b/c ProtoBook entry for peer A contains wrong protocol list.

In addition to this, there's an issue with protocol support checks which Fetcher does to check which peers it can retrieve data from. When a peer is freshly connected, the active identify request towards it may not be finished yet when the fetcher tries to check that peer. Although unlikely, in some cases this may cause valid peers to get ignored.

Description

This change removes the instances of use of mocknet.FullMeshConnected(...) where it may cause identify race, replacing it with mocknet.FullMeshLinked(...) followed by mesh.ConnectAllButSelf() after the Servers are set up. It also fixes fetcher peer selection mechanism so it waits for any pending identification request to finish, similar how to Host.NewStream does that.

Previously, in some tests there was a check for protocol list contents in some tests, but it worked mostly by chance, and now is replaced with delayed mesh connection.

Test Plan

Make sure the tests pass.

Copy link

codecov bot commented Dec 26, 2024

Codecov Report

Attention: Patch coverage is 91.17647% with 3 lines in your changes missing coverage. Please review.

Project coverage is 79.7%. Comparing base (b905784) to head (ad79adc).
Report is 6 commits behind head on develop.

✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
p2p/upgrade.go 85.7% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           develop   #6573   +/-   ##
=======================================
  Coverage     79.7%   79.7%           
=======================================
  Files          355     355           
  Lines        47168   47170    +2     
=======================================
+ Hits         37620   37639   +19     
+ Misses        7404    7388   -16     
+ Partials      2144    2143    -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

When a P2P test is set up using `mocknet.FullMeshConnected(...)` and
then calls `p2p/server.New(...)`, there's a possible race due to how
`libp2p` `identify` service works. Namely, when a new peer connects,
an active `identify` request is initiated towards it asking in
particular what protocols does the peer support, to which the peer
must reply with an identify response message. Also, when
`SetStreamHandler` is called, an identify response message is pushed
towards the currently connected peers. In some cases, the following
race is possible:

1. Peer `A` connects to peer `B`.
2. Peer `B` sends identify request to peer `A`.
3. Peer `A` sends response to the identify request from peer
`A`. This response contains the list of protocols, but that list
misses the protocol which is used for `Server` in p.4, b/c `Server` is
not set up yet.
4. Peer `A` sets up a `Server` which uses `SetStreamHandler`, and at
this point peer `A` sends pushes an identify response message to peer
`B`, _without_ corresponding identify request.
5. Peer `B` receives pushed identify response from `A` which is sent
in p.4, despite it being sent after the response in p.3. This may
happen due to how `libp2p` handles incoming requests. Peer `B` sets
the supported protocols in its `ProtoBook` for peer `A`, the list of
protocols now contains the protocol specfied for the `Server` in p.4.
6. Peer `B` receives identify response from `A` which was sent in p.3,
despite it being sent before p.4, due to possible reordering. This
response also has a list of protocols, but it misses the protocol
specified for the `Server` in p.4. Peer `B` again sets the supported
protocols in its `ProtoBook` for peer `A`, but now that list misses
the necessary protocol.
7. Peer `B` tries to find peers which support the protocol used for
the `Server` in p.4, or connect to peer `B` using that protocol. This
fails b/c `ProtoBook` entry for peer `A` contains wrong protocol list.

In addition to this, there's an issue with protocol support checks
which `Fetcher` does to check which peers it can retrieve data from.
When a peer is freshly connected, the active identify request towards
it may not be finished yet when the fetcher tries to check that peer.
Although unlikely, in some cases this may cause valid peers to get
ignored.

This change removes the instances of use of
`mocknet.FullMeshConnected(...)` where it may cause identify race,
replacing it with `mocknet.FullMeshLinked(...)` followed by
`mesh.ConnectAllButSelf()` after the `Server`s are set up.
It also fixes fetcher peer selection mechanism so it waits for any
pending identification request to finish, similar how to
`Host.NewStream` does that.
@ivan4th ivan4th force-pushed the fix/p2p-identify-race branch from 119c374 to 54840f8 Compare December 26, 2024 21:06
p2p/server/server_test.go Outdated Show resolved Hide resolved
p2p/host.go Outdated
Comment on lines 302 to 306
libp2p.WithFxOption(fx.Invoke(func(ids identify.IDService) {
identifyConn = func(c network.Conn) {
ids.IdentifyConn(c)
}
})),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like a hack to me, especially since later down we panic if this doesn't work.

Copy link
Contributor Author

@ivan4th ivan4th Dec 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not fail unless fx DI library that libp2p uses is broken.
There's a panic call further down to clarify that in case if this doesn't work.
Will try to get rid of this function variable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It now uses a variable of identity.IDService type and returns an error instead of panicking if for whatever reason fx fails to invoke the function

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still feels like we are digging deep through the layers of libp2p to access functionality that isn't intended to be used outside of libp2p. libp2p.WithFxOption is marked as experimental and fx itself is used for dependency injection / builder patterns in libp2p. libp2p.New probably returns an interface instead of a concrete type such that the maintainers can make big refactorings without breaking user code.

Us interjecting the build process of whatever is behind the interface returned by lip2p.New will probably only lead to us having to rewrite code with future updates of the library. I suggest a slightly different approach (see my other comment).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The nature of libp2p is that the DI has to be used anyway, e.g. in all the cases like this

func(upgrader transport.Upgrader, rcmgr network.ResourceManager) (transport.Transport, error) {
we're using libp2p's fx DI b/c it passes the necessary arguments to the functions we provide.
It can also be possible to get hold on the *BasicHost via other DI tricks while not using WithFxOption but IMO that would be more hacky

p2p/upgrade.go Outdated
Comment on lines 142 to 146
if bh, ok := h.(*basichost.BasicHost); ok {
fh.identifyConn = func(conn network.Conn) {
bh.IDService().IdentifyConn(conn)
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if this cast fails? If libp2p.New returns an interface - we shouldn't assume a specific implementation to be behind that interface or this might randomly break (silently) at some point in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some clarifying comments. *basichost.BasicHost is expected when libp2p's mocknet is being used, and there's actually no better way than casting host.Host to *basichost.BasicHost in this case to obtain IDService, as libp2p.New() and fx dependency injection is not used in that case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe instead of accessing libp2p internals we should just use the methods that are exposed via the Host interface. Connect establishes the connection to the peer and calls IDService:IdentifyWait() before returning. It would also allow us to pass a timeout if we want that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly there seems to be no other way, Connect does not do this if the peer is already connected (which it is in this case), see below

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if the peer is already connected - Connect (and with it IdentifyWait) have already been called?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If peer is already connected and is listed among connected peers, this does not mean Connect() and IdentifyWait() have necessarily been finished. So the race is still possible.

p2p/host.go Outdated Show resolved Hide resolved
Comment on lines +303 to +308
// Make sure that the protocol list for the peer is correct.
// This is similar to what Host.NewStream does to make
// sure it is possible to use one of the specified
// protocols. If we don't do this, there may be a race causing
// some peers to be unnecessarily ignored.
host.Identify(peer)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this:

Suggested change
// Make sure that the protocol list for the peer is correct.
// This is similar to what Host.NewStream does to make
// sure it is possible to use one of the specified
// protocols. If we don't do this, there may be a race causing
// some peers to be unnecessarily ignored.
host.Identify(peer)
pi := host.Peerstore().PeerInfo(peer)
if err := host.Connect(context.Background(), pi); err != nil {
f.logger.Debug("failed to connect to peer",
zap.Stringer("id", peer),
zap.Error(err),
)
return nil
}

has basically the same effect without a) needing to intercept the building process of the libp2p host and b) allowing us to pass a timeout to Connect in case we want to abort early if something goes wrong.

Connect in both implementations of Host eventually calls IdentifyWait which is also called by IdentifyCon but allows passing a context in case we want to abort early.

Wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will not work b/c Connect is noop in case if the peer is already connected:
https://github.com/libp2p/go-libp2p/blob/v0.38.1/p2p/host/basic/basic_host.go#L803
So it will not call IdentifyWait in this case (it is invoked from dialPeer: https://github.com/libp2p/go-libp2p/blob/v0.38.1/p2p/host/basic/basic_host.go#L826)
And if you force dial via a context option, there will be adverse side effects such as trying to establish a new connection, invoking the gater etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There may be a way of trying to create a new stream which also calls IdentifyWait, but I do not like it either as it adds overhead of creating a new stream when it's not really needed, producing unwanted network traffic

Copy link
Member

@fasmat fasmat Jan 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand. For the network status to be connected doesn't this mean Connect has already been called and it is OK that Connect is a no-op? The first call to it will result in IdentifyWait being called?

Again to me this feels wrong - pulling internals out of libp2p that look like they might not be intended to be used outside of the library (IdentifyConn looks like a helper method for tests of the IDService)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The peer is listed in the addressbook before Connect finishes.
The need to access internals is rather unfortunate but it's due to some inconvenient decisions in go-libp2p codebase. Basically IDService can be easily accessed if you know for sure your Host is a *BasicHost, but it is harder to reach if there are some Host wrappers. So IDService is not fully internal.
IdentifyConn just calls IdentifyWait (which is not only used in tests) and waits on the channel it returns

p2p/server/server_test.go Outdated Show resolved Hide resolved
@fasmat fasmat self-requested a review January 2, 2025 11:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants