
Rewrite network protocol/service to use channels #1340

Merged
bkchr merged 20 commits into paritytech:master from gterzian:rewrite_protocol_to_use_channels on Feb 6, 2019

Conversation

gterzian
Contributor

@gterzian gterzian commented Dec 31, 2018

Fix #1326

Here is a work-in-progress proof of concept for a slightly different approach to concurrency: running components in their own threads, each based on its own "event loop" that mainly consists of receiving and handling messages sequentially.

The main benefit is that, since each component runs in its own dedicated thread, there is no need for locking of any kind. Also, you're essentially building your own "runtime", and can enable components to run independently of each other, logically "in parallel". A minimal sketch of the pattern follows.
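To make the idea concrete, here is a minimal sketch of such a component event loop, using std::sync::mpsc and hypothetical, simplified message and state types (the real types in this PR are richer):

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

// Hypothetical, simplified message type; illustration only.
enum ProtocolMsg {
	Tick,
	PropagateExtrinsics,
	// ... the other messages handled by the protocol event loop
}

struct Protocol {
	// Protocol state: never shared, so no locks needed.
}

impl Protocol {
	fn handle(&mut self, msg: ProtocolMsg) {
		match msg {
			ProtocolMsg::Tick => { /* periodic maintenance */ }
			ProtocolMsg::PropagateExtrinsics => { /* gossip extrinsics */ }
		}
	}
}

fn spawn_protocol() -> Sender<ProtocolMsg> {
	let (sender, receiver) = channel();
	thread::spawn(move || {
		let mut protocol = Protocol {};
		// The event loop: receive and handle messages sequentially.
		while let Ok(msg) = receiver.recv() {
			protocol.handle(msg);
		}
	});
	sender
}

fn main() {
	let protocol = spawn_protocol();
	let _ = protocol.send(ProtocolMsg::Tick);
}
```

The sender is cheaply cloneable, so any component that holds one can talk to the protocol without ever touching its state directly.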


Some examples:

  • Protocol now runs in its own thread, handling ProtocolMsg, including the periodic Tick and PropagateExtrinsics messages. Protocol itself is no longer shared, nor are its methods called, outside of its own thread. The result is this:

[screenshot]

  • The network Service now contains senders to both Protocol and the NetworkService (libp2p); instead of calling methods on those, it sends messages to be handled on their respective event loops (see the sketch after the screenshot below).

[screenshot]
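As a rough sketch of that shape (assumed names and placeholder types, not the exact PR code), the service just owns senders:

```rust
use std::sync::mpsc::Sender;

// Placeholder types for illustration.
type Hash = [u8; 32];
struct Header;

enum ProtocolMsg { BlockImported(Hash, Header) }
enum NetworkMsg { ReportPeer(usize) }

// The network Service owns senders rather than the components themselves.
struct Service {
	protocol_sender: Sender<ProtocolMsg>,
	network_sender: Sender<NetworkMsg>,
}

impl Service {
	fn on_block_imported(&self, hash: Hash, header: Header) {
		// Previously a locked method call; now a non-blocking send,
		// handled later on the protocol's own event loop.
		let _ = self.protocol_sender.send(ProtocolMsg::BlockImported(hash, header));
	}
}
```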

  • The libp2p Service is still polled by the tokio runtime. However, it is no longer shared with other components via SyncIo. Instead, it receives messages from Protocol (and others) via a channel, and those messages are handled inside poll. So while this still requires locking Service before a poll, that seems to be the only locking left (previously it was locked at each call to SyncIo). The handling of ServiceEvents, enqueued on a stream by the network service, also still happens on the tokio runtime; however, it now only consists of sending messages to Protocol and, in the case of reporting a peer, back to the network service. This can be seen as a "bridge" between the tokio runtime/libp2p and the protocol (see the sketch after the note below).

[screenshots]

Note that net_sync above is essentially a wrapper around the network service itself. So previously, not only was Protocol "running" on the same thread as the network service, the network service was also shared with Protocol through net_sync (which implements SyncIo). Now the network service and Protocol run independently of each other: Protocol can send messages to the network service, and the network service communicates back via incoming NetworkServiceEvents. The handling of NetworkServiceEvents, still on the tokio runtime, then acts as a thin layer of glue between the two (which removes the need for the network service to depend on anything from the protocol).
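That "bridge" can be sketched like this (hypothetical, simplified event and message types; the real ServiceEvent carries more data):

```rust
use std::sync::mpsc::Sender;

// Hypothetical, simplified types; illustration only.
enum ServiceEvent {
	Connected(usize),
	Disconnected(usize),
	Message(usize, Vec<u8>),
}

enum ProtocolMsg {
	PeerConnected(usize),
	PeerDisconnected(usize),
	CustomMessage(usize, Vec<u8>),
}

// Runs on the tokio runtime: translate each libp2p event into a message
// for the protocol's event loop. This thin layer is the only coupling
// between the two components.
fn bridge(event: ServiceEvent, protocol_sender: &Sender<ProtocolMsg>) {
	let msg = match event {
		ServiceEvent::Connected(peer) => ProtocolMsg::PeerConnected(peer),
		ServiceEvent::Disconnected(peer) => ProtocolMsg::PeerDisconnected(peer),
		ServiceEvent::Message(peer, data) => ProtocolMsg::CustomMessage(peer, data),
	};
	let _ = protocol_sender.send(msg);
}
```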

  • The NetworkLink, shared with the import queue, is now also just a wrapper around a network sender and a protocol sender. So instead of calling methods, which required locking, it sends messages to be handled on the respective event loops of Protocol or the network service.
    [screenshots]

Note that the two sends are non-blocking, meaning the import queue can go back to work right after (which would be especially relevant if the import queue were an independently running component, as is done in #1327), and both the network and the protocol will handle these messages independently. A sketch of the idea follows.
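A hedged sketch of that wrapper (assumed names and placeholder message variants):

```rust
use std::sync::mpsc::Sender;

// Placeholder types and message variants for illustration.
type Hash = [u8; 32];

enum ProtocolMsg { BlockImported(Hash) }
enum NetworkMsg { AnnounceBlock(Hash) }

// Shared with the import queue: just a pair of senders, no locks.
struct NetworkLink {
	protocol_sender: Sender<ProtocolMsg>,
	network_sender: Sender<NetworkMsg>,
}

impl NetworkLink {
	fn block_imported(&self, hash: Hash) {
		// Both sends return immediately; protocol and the network
		// service handle the messages later, independently.
		let _ = self.protocol_sender.send(ProtocolMsg::BlockImported(hash));
		let _ = self.network_sender.send(NetworkMsg::AnnounceBlock(hash));
	}
}
```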


As a general note, this not only removes locks (making the logic of Protocol plain single-threaded, sequential code), it also removes dependencies between the various "components" in your system, in several ways:

  • At the level of the "generic" signature of structs:

[screenshot]

Service itself has nothing to do with Specialization or ExHashT; these are related to Protocol. Yet because Service contained a Protocol, they had to be included in Service's signature.

  • At the level of the "parallelism" of the various components.

[screenshot]

remove_reserved_peer previously required locking the network service (meaning it couldn't be polled in the meantime), and the method call on Protocol also required various locks, some of which could block the import queue, while others would block the network service (each method call on SyncIo locked the network service, again preventing it from being polled in the meantime).

When is on_peer_disconnected called now, since it's not done in remove_reserved_peer anymore?

When the network service receives the RemoveReservedPeer message, it will enqueue a new DisconnectNode event containing info about nodes to be disconnected.

[screenshot]

This event will later be handled, resulting in a message sent to Protocol:

[screenshots]

This results in Protocol removing nodes from its own state, and the network removing reserved peers, each happening completely without blocking the other. This setup could also further free up the import queue, since the shared dependency/lock on ChainSync is removed. A sketch of the flow follows.
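Here is a hedged reconstruction of that flow (assumed names and simplified types; the real code differs in detail):

```rust
use std::collections::VecDeque;

// Hypothetical, simplified types; illustration only.
type PeerId = String;
enum NetworkMsg { RemoveReservedPeer(PeerId) }
enum ServiceEvent { DisconnectNode(usize) }

struct NetworkService {
	events: VecDeque<ServiceEvent>,
	// ... reserved-peer bookkeeping, connections, etc.
}

impl NetworkService {
	// Called from inside poll(), so no extra locking is needed.
	fn handle_network_msg(&mut self, msg: NetworkMsg) {
		match msg {
			NetworkMsg::RemoveReservedPeer(peer_id) => {
				if let Some(node_index) = self.remove_reserved_peer(&peer_id) {
					// Enqueue an event; when it is handled later on the
					// tokio runtime, it becomes a PeerDisconnected message
					// sent to protocol, which then updates its own state.
					self.events.push_back(ServiceEvent::DisconnectNode(node_index));
				}
			}
		}
	}

	fn remove_reserved_peer(&mut self, _peer: &PeerId) -> Option<usize> {
		// Internal, synchronous state transition (elided).
		None
	}
}
```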


Finally, if you absolutely must block one component while waiting for an "answer" from another one, it can be implemented like this:

[screenshot]
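In other words, a minimal sketch of the pattern, assuming a request message that carries a reply sender (names are illustrative):

```rust
use std::sync::mpsc::{channel, Sender};

// Illustrative types only.
struct BlockchainInfo { best_number: u64 }

enum ProtocolMsg {
	GetBlockchainInfo { reply: Sender<BlockchainInfo> },
}

fn query_protocol(protocol_sender: &Sender<ProtocolMsg>) -> Option<BlockchainInfo> {
	let (reply, answer) = channel();
	protocol_sender.send(ProtocolMsg::GetBlockchainInfo { reply }).ok()?;
	// Blocks only this component, and only until protocol handles the
	// message on its own event loop and sends the answer back.
	answer.recv().ok()
}
```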


The tests are going to have to be rewritten for them to pass; however, I first wanted to get your point of view on this approach.

Note that when I mention "parallelism", I'm thinking less in terms of performance and more in terms of logic (obviously, the actual parallelism will be limited by the physical reality of the machine the code is running on). So it's less about boosting performance and more about ensuring components execute independently. It might bring some performance benefit as well, but that's not the goal.

cc @andresilva @tomusdrw This is a concrete example of some of the stuff I mentioned in relation to "actors"...

@parity-cla-bot

It looks like @gterzian signed our Contributor License Agreement. 👍

Many thanks,

Parity Technologies CLA Bot

@gterzian gterzian added the A3-in_progress Pull request is in progress. No review needed at this stage. label Dec 31, 2018
@gterzian gterzian force-pushed the rewrite_protocol_to_use_channels branch from 03c3504 to 3f82e52 on January 2, 2019 11:00
@gterzian gterzian force-pushed the rewrite_protocol_to_use_channels branch 10 times, most recently from 0f956fd to 7f434bc on January 11, 2019 05:53
@gterzian gterzian added A0-please_review Pull request needs code review. and removed A3-in_progress Pull request is in progress. No review needed at this stage. labels Jan 11, 2019
@gterzian gterzian force-pushed the rewrite_protocol_to_use_channels branch 5 times, most recently from d0c9630 to f9f4198 on January 11, 2019 06:15
@gterzian
Contributor Author

gterzian commented Jan 11, 2019

OK, this one is now ready for review. @tomaka?

I've added one commit that introduces a basic back-pressure mechanism between libp2p and the protocol (since the protocol operations now run in their own thread, events from libp2p could otherwise, in theory, pile up in the unbounded channel used to communicate with the protocol). A sketch of the general idea follows.
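For illustration, here is one way such back pressure can be sketched with a bounded channel (not necessarily the exact mechanism in this commit; names are illustrative):

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

// Illustrative message type.
enum ProtocolMsg { CustomMessage(usize, Vec<u8>) }

// A bounded channel from libp2p to protocol; the capacity is an
// arbitrary illustrative bound.
fn bounded_protocol_channel() -> (SyncSender<ProtocolMsg>, Receiver<ProtocolMsg>) {
	sync_channel(1024)
}

fn forward_event(sender: &SyncSender<ProtocolMsg>, peer: usize, data: Vec<u8>) {
	match sender.try_send(ProtocolMsg::CustomMessage(peer, data)) {
		Ok(()) => {}
		Err(TrySendError::Full(_msg)) => {
			// Channel full: stop draining libp2p events until the
			// protocol thread catches up (e.g. return NotReady from
			// poll and retry later) instead of letting events pile up.
		}
		Err(TrySendError::Disconnected(_)) => { /* protocol shut down */ }
	}
}
```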

Please note that in a few files with "lots" of changes, because my editor was set to indent with spaces, I ended up running cargo fmt with hard_tabs = true in order to remove the spaces. This introduced a few extra changes, which I hope is OK.

@gterzian gterzian force-pushed the rewrite_protocol_to_use_channels branch 4 times, most recently from 029c255 to d153086 on January 11, 2019 06:57
@gterzian gterzian changed the title [WIP] Rewrite network protocol/service to use channels Rewrite network protocol/service to use channels Jan 11, 2019
@tomaka
Contributor

tomaka commented Jan 11, 2019

After reading the diff, I think I misunderstood your description.
To me, the whole fact that we're using channels and separate threads is a good thing, but it should be an implementation detail hidden deep in the code, not exposed in the API at all.

I don't really see the point of exposing a channel in the API of network-libp2p, considering that it's network that manages the thread. It could simply be the network crate that polls the channel in its thread and passes the messages on to network-libp2p; in fact, this is already more or less what is happening.

@gterzian
Contributor Author

gterzian commented Jan 11, 2019

using channels and separate threads is a good thing, but should be an implementation detail hidden deep in the code, and not exposed in the API at all.

@tomaka I can re-introduce the SyncIo and make network use that instead of a channel directly.

We could also move the equivalent of SyncIo to network-libp2p (what I removed was actually in network), and completely hide the use of NetworkMsg from network.

Would that be better?

I don't really see the point of exposing a channel in the API of network-libp2p, considering that it's network that manages the thread.

network manages the thread that receives ProtocolMsg, sent by network-libp2p from here.

network-libp2p receives NetworkMsg while running on tokio. Currently, those messages are received non-blockingly and handled as part of poll.

So network "owns" a sender to send NetworkMsg (which could be wrapped inside SyncIo), and a receiver to receive and handle ProtocolMsg, in its own thread.

network-libp2p is the mirror of that: it owns a sender to send ProtocolMsg, and a receiver to receive NetworkMsg in "its own thread", actually a tokio runtime.

The nice thing is that since the ProtocolMsg are sent from here, in response to incoming ServiceEvents, network-libp2p doesn't need to send ProtocolMsg directly.

This use of a stream of events is really tied to network-libp2p running on tokio. Yet on the other side, we could indeed make network use a trait from network-libp2p, with the only implementation sending NetworkMsg, which would equally hide the messaging from network.

@tomaka
Contributor

tomaka commented Jan 11, 2019

What I'm suggesting is to instead move handle_protocol_messages in the network crate and call it from here.

network-libp2p doesn't own any thread or event loop.
The API surface of libp2p's NetworkService is poll() and send_custom_message(), plus a few other methods. This makes its internal state very coherent, easy to figure out, and easy to test (although it doesn't have any tests at the moment).

For example, if deny_unreserved_peers() returns a node ID, then you know we were connected to it. There is no "later" or "was maybe connected". We were connected to it, and the only way we can move to the disconnected state is by calling one of the methods of the service (like poll()). The API of NetworkService is a state machine, and its state can only be updated by calling methods; it doesn't change over time on its own.

Sure, internally libp2p uses sub-tasks so that the network connections are actually asynchronous, and internally libp2p is quite similar to an actor model. However, having a simple synchronous API exposed is a very nice property, in my opinion.

On the other hand, network already has a lot of internal asynchronicity to handle, and I'm not sure about introducing a spaghetti of events into network-libp2p when it looks to me like this belongs in network.

@gterzian
Contributor Author

gterzian commented Jan 11, 2019

For example if deny_unreserved_peers() returns a node ID, then you know we were connected to it. There is no "later" or "was maybe connected". We were connected to it, and the only way we can move to the disconnected state is by calling one of the methods of the service.

In this proposal, you get exactly the same behavior from the perspective of libp2p; the async-ness is only introduced with regard to sending the message, not handling it.

When libp2p handles a NetworkMsg::DenyUnreservedPeers message, it will call its own deny_unreserved_peers method in an absolutely synchronous way.

The question is whether network cares about whether the node is connected when it sends the message. It appears to me that it doesn't. If libp2p disconnects a node, it will enqueue an event back for network to handle, and disconnect the peer internally if necessary.

This particular case actually requires the "node id" of a disconnected node to be communicated back to network, but most operations, like sending a custom message over the network, do not require any immediate response. What benefit do we get from waiting on a lock to be able to call send_custom_message?

However having a simple synchronous API exposed is a very nice property in my opinion.

If we're using threads, perhaps we might as well go for as much parallelism as the integrity of the system can tolerate, and only do things synchronously when necessary?

If a component absolutely needs to do a state transition in a "sync" way, that can be dealt with via a reply channel sent along with the original message (example).

Also, is the current "synchronous API" really that simple? What about the various concurrent components competing to acquire locks?

What I'm suggesting is to instead move handle_protocol_messages in the network crate and call it from here.

I actually did handle the messages at the place you suggest in an earlier version of this PR, and I moved the message handling to poll for a couple of reasons:

  1. Handling the messages in stream::poll_fn().for_each instead of just in poll requires locking for each method call, hence it can't be done in parallel with poll, hence it might as well be done together (any method call inside poll doesn't require locking, since the whole thing is locked in order to do a poll). A sketch of this follows the list.
  2. If you do the message handling in stream::poll_fn().for_each, you end up with several clones like let network_service2 = network_service.clone();. Now we're actually able to remove the let network_service2 = network_service.clone(); that was previously used by SyncIo, since all we need is one handle to do network_service.lock().poll() (see the previous code).
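A sketch of how that handling inside poll can work, mirroring the NetworkChan shape quoted in the review below (a Sender paired with an Arc<AtomicTask> from futures 0.1; the poll-side handler here is illustrative):

```rust
use std::sync::mpsc::{Receiver, Sender, TryRecvError};
use std::sync::Arc;

use futures::task::AtomicTask; // futures 0.1
use futures::{Async, Poll};

struct NetworkChan<T> {
	sender: Sender<T>,
	task_notify: Arc<AtomicTask>,
}

impl<T> NetworkChan<T> {
	fn send(&self, msg: T) {
		let _ = self.sender.send(msg);
		// Wake the tokio task so the message is handled in its poll().
		self.task_notify.notify();
	}
}

// Inside the service's poll(): drain pending messages without blocking.
fn poll_messages<T>(
	receiver: &Receiver<T>,
	task_notify: &AtomicTask,
	mut handle: impl FnMut(T),
) -> Poll<(), ()> {
	loop {
		match receiver.try_recv() {
			// Handled inside poll, so no extra locking is needed.
			Ok(msg) => handle(msg),
			Err(TryRecvError::Empty) => {
				// Register before returning, so a concurrent send wakes us.
				task_notify.register();
				return Ok(Async::NotReady);
			}
			Err(TryRecvError::Disconnected) => return Ok(Async::Ready(())),
		}
	}
}
```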

self.handler.on_block_imported(&mut NetSyncIo::new(&self.network, self.protocol_id), hash, header)
let _ = self
	.protocol_sender
	.send(ProtocolMsg::BlockImported(hash, header.clone()));
Contributor

I wonder if these functions should be changed to take the header by value explicitly, so the clone isn't hidden.

impl NetworkChan {
	/// Create a new network chan.
	pub fn new(sender: Sender<NetworkMsg>, task_notify: Arc<AtomicTask>) -> Self {
		Self {
Contributor

style: using Self like this is weird.

protocol.on_clogged_peer(&mut net_sync, node_index,
	messages.iter().map(|d| d.as_ref()));
debug!(target: "sync", "{} clogging messages:", messages.len());
for msg_bytes in messages.iter().take(5) {
Contributor

why take 5?

Contributor Author

This code is from master at

for msg_bytes in clogging_messages.take(5) {

@tomaka would know the reason for it...

Contributor

I didn't put this. I imagine someone did because it was spamming the logs too much. It's a bit stupid, as the whole point of dumping all the messages was to see the frequency of each one.

@gterzian gterzian force-pushed the rewrite_protocol_to_use_channels branch from 7ed111d to 9f63fbe on February 6, 2019 06:15
@gterzian
Contributor Author

gterzian commented Feb 6, 2019

@rphmeier Thanks for the review, I think all your comments have been addressed...

@bkchr bkchr merged commit 64cde6f into paritytech:master Feb 6, 2019
@gterzian gterzian deleted the rewrite_protocol_to_use_channels branch February 6, 2019 12:32
@andresilva
Contributor

FWIW I tested this on polkadot by syncing the alexander chain from scratch; it worked fine. 👍

@gterzian
Contributor Author

gterzian commented Feb 6, 2019

@andresilva thank you!

MTDK1 pushed a commit to bdevux/substrate that referenced this pull request Apr 12, 2019
* rewrite network protocol/service to use channels

* remove use of unwrap

* re-introduce with_spec

* remove unnecessary mut

* remove unused param

* improve with_spec, add with_gossip

* rename job to task

* style: re-add comma

* remove extra string allocs

* rename use of channel

* turn TODO into FIXME

* remove mut in match

* remove Self in new

* pass headers by value to network service

* remove network sender from service

* remove TODO

* better expect

* rationalize use of network sender in ondemand
@gterzian gterzian mentioned this pull request May 8, 2019