Fix libp2p identify race #6573

ivan4th · 2024-12-26T19:15:42Z

Motivation

This supersedes #6570

When a P2P test is set up using mocknet.FullMeshConnected(...) and then calls p2p/server.New(...), there's a possible race due to how the libp2p identify service works. Namely, when a new peer connects, an active identify request is initiated towards it, asking in particular which protocols the peer supports, to which the peer must reply with an identify response message. Also, when SetStreamHandler is called, an identify response message is pushed to the currently connected peers. In some cases, the following race is possible:

1. Peer A connects to peer B.
2. Peer B sends an identify request to peer A.
3. Peer A sends a response to the identify request from peer B. This response contains the list of protocols, but that list misses the protocol which is used for the Server in p.4, b/c the Server is not set up yet.
4. Peer A sets up a Server which uses SetStreamHandler, and at this point peer A pushes an identify response message to peer B, without a corresponding identify request.
5. Peer B receives the pushed identify response from A which was sent in p.4, before the response sent in p.3 arrives, despite the push being sent later. This may happen due to how libp2p handles incoming requests. Peer B sets the supported protocols in its ProtoBook for peer A; the list of protocols now contains the protocol specified for the Server in p.4.
6. Peer B receives the identify response from A which was sent in p.3, despite it being sent before p.4, due to possible reordering. This response also has a list of protocols, but it misses the protocol specified for the Server in p.4. Peer B again sets the supported protocols in its ProtoBook for peer A, but now that list misses the necessary protocol.
7. Peer B tries to find peers which support the protocol used for the Server in p.4, or to connect to peer A using that protocol. This fails b/c the ProtoBook entry for peer A contains the wrong protocol list.
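
The overwrite at the heart of this race, where the stale response lands after the pushed one, can be modelled without libp2p at all. The sketch below is a toy model, not libp2p code: `protoBook`, `setProtocols`, `applyInOrder`, and the protocol IDs are hypothetical names, and it assumes only what the description states, namely that each identify message replaces the stored protocol list for the peer.

```go
package main

import "fmt"

// protoBook is a toy stand-in for a peer's protocol store: the last
// identify message applied for a peer wins. All names are hypothetical.
type protoBook map[string][]string

// setProtocols replaces the stored protocol list for a peer.
func (pb protoBook) setProtocols(peer string, protos []string) {
	pb[peer] = append([]string(nil), protos...)
}

// supports reports whether the stored list for peer contains proto.
func (pb protoBook) supports(peer, proto string) bool {
	for _, p := range pb[peer] {
		if p == proto {
			return true
		}
	}
	return false
}

// applyInOrder applies identify messages in the order they are received.
func applyInOrder(pb protoBook, peer string, msgs ...[]string) {
	for _, m := range msgs {
		pb.setProtocols(peer, m)
	}
}

func main() {
	const serverProto = "/example/server/1" // hypothetical protocol ID

	response := []string{"/example/base/1"}          // sent before the Server exists
	push := []string{"/example/base/1", serverProto} // sent after SetStreamHandler

	inOrder := protoBook{}
	applyInOrder(inOrder, "A", response, push)
	fmt.Println("in order, B sees server protocol:", inOrder.supports("A", serverProto))

	reordered := protoBook{}
	applyInOrder(reordered, "A", push, response) // push overtakes the response
	fmt.Println("reordered, B sees server protocol:", reordered.supports("A", serverProto))
}
```

In this model the final state depends only on which message is applied last, which is why delivery order matters even though both messages are individually correct.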

In addition to this, there's an issue with protocol support checks which Fetcher does to check which peers it can retrieve data from. When a peer is freshly connected, the active identify request towards it may not be finished yet when the fetcher tries to check that peer. Although unlikely, in some cases this may cause valid peers to get ignored.
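
The fetcher issue can be illustrated the same way. This is a minimal sketch under the assumption that a peer's protocol list is only populated once its identify exchange completes; `peerState`, `supportsNow`, and `supportsAfterIdentify` are hypothetical names and not part of the fetcher or of libp2p.

```go
package main

import (
	"fmt"
	"sync"
)

// peerState is a toy model of what a node knows about a remote peer.
type peerState struct {
	mu         sync.Mutex
	protocols  map[string]bool
	identified chan struct{} // closed once the identify exchange has finished
}

func newPeerState() *peerState {
	return &peerState{protocols: map[string]bool{}, identified: make(chan struct{})}
}

// finishIdentify records the protocols learned from the peer and marks the
// identify exchange as complete.
func (p *peerState) finishIdentify(protos ...string) {
	p.mu.Lock()
	for _, proto := range protos {
		p.protocols[proto] = true
	}
	p.mu.Unlock()
	close(p.identified)
}

// supportsNow checks the protocol list immediately, whatever its state.
func (p *peerState) supportsNow(proto string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.protocols[proto]
}

// supportsAfterIdentify waits for the pending identify exchange first.
func (p *peerState) supportsAfterIdentify(proto string) bool {
	<-p.identified
	return p.supportsNow(proto)
}

func main() {
	const proto = "/example/fetch/1" // hypothetical protocol ID
	p := newPeerState()

	// Freshly connected: identify has not finished, so the peer looks unusable.
	fmt.Println("checked too early:", p.supportsNow(proto))

	go p.finishIdentify(proto)
	fmt.Println("checked after identify:", p.supportsAfterIdentify(proto))
}
```

The early check is not wrong about the data it sees; it is just asked too soon, which is why a valid peer can be skipped.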

Description

This change removes the uses of mocknet.FullMeshConnected(...) where it may cause the identify race, replacing them with mocknet.FullMeshLinked(...) followed by mesh.ConnectAllButSelf() after the Servers are set up. It also fixes the fetcher peer selection mechanism so that it waits for any pending identification request to finish, similar to how Host.NewStream does that.

Previously, some tests checked the protocol list contents, but that check worked mostly by chance, and it is now replaced with delayed mesh connection.
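
Why the order of setup matters can be sketched with a toy peer that records every identify message it would emit. This is only a model: `node`, `connect`, and `setStreamHandler` here are hypothetical stand-ins, and the assumption is the one from the Motivation section, i.e. connecting triggers one identify response and registering a handler on an already connected peer triggers a push.

```go
package main

import "fmt"

// node is a toy peer that records every identify message it would emit.
type node struct {
	connected bool
	protocols []string
	sent      [][]string // protocol lists carried by emitted identify messages
}

// connect models a peer connecting: the remote side asks for the protocol
// list, so one identify response with the current list is emitted.
func (n *node) connect() {
	n.connected = true
	n.sent = append(n.sent, append([]string(nil), n.protocols...))
}

// setStreamHandler registers a protocol. If the node is already connected,
// an identify push with the updated list is emitted as well.
func (n *node) setStreamHandler(proto string) {
	n.protocols = append(n.protocols, proto)
	if n.connected {
		n.sent = append(n.sent, append([]string(nil), n.protocols...))
	}
}

// connectThenServe mirrors the old setup: connect first, add the server later.
func connectThenServe(proto string) *node {
	n := &node{}
	n.connect()
	n.setStreamHandler(proto)
	return n
}

// serveThenConnect mirrors the new setup: link only, add the server, then connect.
func serveThenConnect(proto string) *node {
	n := &node{}
	n.setStreamHandler(proto)
	n.connect()
	return n
}

func main() {
	const proto = "/example/server/1" // hypothetical protocol ID
	old := connectThenServe(proto)
	fixed := serveThenConnect(proto)
	fmt.Println("connect first, messages emitted:", len(old.sent), old.sent)
	fmt.Println("serve first, messages emitted:", len(fixed.sent), fixed.sent)
}
```

With the server registered before connecting, only one message carries the protocol list, so there is nothing left to reorder.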

Test Plan

Make sure the tests pass.

codecov · 2024-12-26T19:42:45Z

Codecov Report

Attention: Patch coverage is 91.17647% with 3 lines in your changes missing coverage. Please review.

Project coverage is 79.7%. Comparing base (b905784) to head (ad79adc).
Report is 6 commits behind head on develop.

✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
p2p/upgrade.go	85.7%	2 Missing and 1 partial ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##           develop   #6573   +/-   ##
=======================================
  Coverage     79.7%   79.7%           
=======================================
  Files          355     355           
  Lines        47168   47170    +2     
=======================================
+ Hits         37620   37639   +19     
+ Misses        7404    7388   -16     
+ Partials      2144    2143    -1



p2p/server/server_test.go

fasmat · 2024-12-27T09:37:46Z

p2p/host.go

+		libp2p.WithFxOption(fx.Invoke(func(ids identify.IDService) {
+			identifyConn = func(c network.Conn) {
+				ids.IdentifyConn(c)
+			}
+		})),


This feels like a hack to me, especially since later down we panic if this doesn't work.

This should not fail unless the fx DI library that libp2p uses is broken.
There's a panic call further down to make that explicit in case this doesn't work.
Will try to get rid of this function variable.

It now uses a variable of identify.IDService type and returns an error instead of panicking if for whatever reason fx fails to invoke the function.

This still feels like we are digging deep through the layers of libp2p to access functionality that isn't intended to be used outside of libp2p. libp2p.WithFxOption is marked as experimental and fx itself is used for dependency injection / builder patterns in libp2p. libp2p.New probably returns an interface instead of a concrete type such that the maintainers can make big refactorings without breaking user code.

Interfering with the build process of whatever is behind the interface returned by libp2p.New will probably only lead to us having to rewrite code with future updates of the library. I suggest a slightly different approach (see my other comment).

The nature of libp2p is that the DI has to be used anyway, e.g. in all the cases like this

go-spacemesh/p2p/host.go

Line 302 in 1fd07d4

func(upgrader transport.Upgrader, rcmgr network.ResourceManager) (transport.Transport, error) {

we're using libp2p's fx DI b/c it passes the necessary arguments to the functions we provide.
It may also be possible to get hold of the *BasicHost via other DI tricks without using WithFxOption, but IMO that would be more hacky.

fasmat · 2024-12-27T09:39:44Z

p2p/upgrade.go

+		if bh, ok := h.(*basichost.BasicHost); ok {
+			fh.identifyConn = func(conn network.Conn) {
+				bh.IDService().IdentifyConn(conn)
+			}
+		}


What if this cast fails? If libp2p.New returns an interface - we shouldn't assume a specific implementation to be behind that interface or this might randomly break (silently) at some point in the future.

Added some clarifying comments. *basichost.BasicHost is expected when libp2p's mocknet is being used, and there's actually no better way than casting host.Host to *basichost.BasicHost in this case to obtain IDService, as libp2p.New() and fx dependency injection are not used in that case.

I believe instead of accessing libp2p internals we should just use the methods that are exposed via the Host interface. Connect establishes the connection to the peer and calls IDService:IdentifyWait() before returning. It would also allow us to pass a timeout if we want that.

Sadly there seems to be no other way: Connect does not do this if the peer is already connected (which it is in this case), see below.

But if the peer is already connected - Connect (and with it IdentifyWait) have already been called?

If a peer is already connected and is listed among the connected peers, this does not mean that Connect() and IdentifyWait() have necessarily finished. So the race is still possible.

p2p/host.go

fasmat · 2025-01-02T11:34:18Z

fetch/fetch.go

+				// Make sure that the protocol list for the peer is correct.
+				// This is similar to what Host.NewStream does to make
+				// sure it is possible to use one of the specified
+				// protocols. If we don't do this, there may be a race causing
+				// some peers to be unnecessarily ignored.
+				host.Identify(peer)


I believe this:

Suggested change

				// Make sure that the protocol list for the peer is correct.
				// This is similar to what Host.NewStream does to make
				// sure it is possible to use one of the specified
				// protocols. If we don't do this, there may be a race causing
				// some peers to be unnecessarily ignored.
-				host.Identify(peer)
+				pi := host.Peerstore().PeerInfo(peer)
+				if err := host.Connect(context.Background(), pi); err != nil {
+					f.logger.Debug("failed to connect to peer",
+						zap.Stringer("id", peer),
+						zap.Error(err),
+					)
+					return nil
+				}

has basically the same effect while a) not needing to intercept the building process of the libp2p host and b) allowing us to pass a timeout to Connect in case we want to abort early if something goes wrong.

Connect in both implementations of Host eventually calls IdentifyWait, which is also called by IdentifyConn, but Connect allows passing a context in case we want to abort early.

Wdyt?

This will not work b/c Connect is a no-op if the peer is already connected:
https://github.com/libp2p/go-libp2p/blob/v0.38.1/p2p/host/basic/basic_host.go#L803
So it will not call IdentifyWait in this case (it is invoked from dialPeer: https://github.com/libp2p/go-libp2p/blob/v0.38.1/p2p/host/basic/basic_host.go#L826).
And if you force a dial via a context option, there will be adverse side effects such as trying to establish a new connection, invoking the gater, etc.

There may be a way of trying to create a new stream which also calls IdentifyWait, but I do not like it either as it adds overhead of creating a new stream when it's not really needed, producing unwanted network traffic
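
The control flow behind the "Connect is a no-op" argument can be reduced to a few lines. This is a toy model and not the libp2p implementation: `toyHost`, `markConnected`, and the counter are hypothetical, and the only assumption taken from the comment above is that the identify wait happens inside the dial path, which is skipped when the peer is already connected.

```go
package main

import "fmt"

// toyHost models only the control flow discussed above.
type toyHost struct {
	connected     map[string]bool
	identifyWaits int // how many times the identify wait path was taken
}

// markConnected models the peer showing up as connected through some other
// path (for example an inbound connection) before identify has finished.
func (h *toyHost) markConnected(peer string) { h.connected[peer] = true }

// dialPeer establishes a connection and then waits for identify.
func (h *toyHost) dialPeer(peer string) {
	h.connected[peer] = true
	h.identifyWaits++
}

// connect returns early when the peer is already connected, so the
// identify wait inside dialPeer is never reached in that case.
func (h *toyHost) connect(peer string) {
	if h.connected[peer] {
		return
	}
	h.dialPeer(peer)
}

func main() {
	h := &toyHost{connected: map[string]bool{}}

	h.markConnected("A")
	h.connect("A")
	fmt.Println("identify waits for already connected peer:", h.identifyWaits)

	h.connect("B")
	fmt.Println("identify waits after dialing a new peer:", h.identifyWaits)
}
```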

I'm not sure I understand. For the network status to be connected doesn't this mean Connect has already been called and it is OK that Connect is a no-op? The first call to it will result in IdentifyWait being called?

Again to me this feels wrong - pulling internals out of libp2p that look like they might not be intended to be used outside of the library (IdentifyConn looks like a helper method for tests of the IDService)

The peer is listed in the addressbook before Connect finishes.
The need to access internals is rather unfortunate but it's due to some inconvenient decisions in go-libp2p codebase. Basically IDService can be easily accessed if you know for sure your Host is a *BasicHost, but it is harder to reach if there are some Host wrappers. So IDService is not fully internal.
IdentifyConn just calls IdentifyWait (which is not only used in tests) and waits on the channel it returns

p2p/server/server_test.go

ivan4th requested review from dshulyak, fasmat, poszu, acud, jellonek and kacpersaw as code owners December 26, 2024 19:15

ivan4th force-pushed the fix/p2p-identify-race branch from c72d92c to 119c374 Compare December 26, 2024 19:19

ivan4th mentioned this pull request Dec 26, 2024

Ensure server is started before handling requests #6570

Closed

4 tasks

ivan4th force-pushed the fix/p2p-identify-race branch from 119c374 to 54840f8 Compare December 26, 2024 21:06

poszu approved these changes Dec 27, 2024

fasmat requested changes Dec 27, 2024

ivan4th added 2 commits December 27, 2024 19:02

Fix timesync test

27f347d

Address comments

2f4c4a3

fasmat reviewed Jan 2, 2025

fasmat self-requested a review January 2, 2025 11:38

Fix typo in a comment

ad79adc
