I/O timeouts during bench test #2744

Closed
roman-khimov opened this issue Oct 14, 2022 · 1 comment

@roman-khimov (Member):
As load appears on the test network, I/O timeouts start to happen:

2022-10-14T06:21:19.568Z WARN peer disconnected {"addr": "172.200.0.3:20335", "error": "write tcp 172.200.0.6:41398->172.200.0.3:20335: i/o timeout", "peerCount": 2}

This leads to disconnects/reconnects and isn't particularly good for performance. While having them is almost acceptable (the system is under stress and uses 2s blocks), it can still be a sign of some messages requiring the reader thread to do more work than it ought to. We need to check the message handlers and see if anything can be improved there to minimize the chances of these timeouts/disconnects.
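
To make the failure mode concrete, here's a minimal sketch (not neo-go's actual server code; the 2-second deadline and all names are assumptions) of how a deadlined write fails when the remote peer's reader goroutine is busy inside a handler:

```go
// Minimal sketch, not neo-go's actual server code: names and the 2s
// write deadline are assumptions used only to illustrate the failure mode.
package p2psketch

import (
	"net"
	"time"
)

// writeMsg gives up if the peer doesn't drain its socket in time. Once
// the remote kernel buffer is full (its reader is stuck in a handler),
// the deadline fires and we get exactly the error from the log above.
func writeMsg(c net.Conn, payload []byte) error {
	if err := c.SetWriteDeadline(time.Now().Add(2 * time.Second)); err != nil {
		return err
	}
	_, err := c.Write(payload) // "write tcp ...: i/o timeout" originates here
	return err
}

// readLoop shows the hazard on the receiving side: handle runs on the
// reader goroutine, so any slow handler (DB lookups, blocked channel
// sends) stops us from reading and eventually stalls peers writing to us.
func readLoop(c net.Conn, handle func([]byte)) {
	buf := make([]byte, 64*1024)
	for {
		n, err := c.Read(buf)
		if err != nil {
			return
		}
		handle(buf[:n]) // expensive work here delays the next Read
	}
}
```
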

roman-khimov added the network (P2P layer) and performance (More of something per second) labels on Oct 14, 2022
roman-khimov added this to the v0.99.5 milestone on Oct 14, 2022
roman-khimov self-assigned this on Oct 14, 2022
roman-khimov added a commit that referenced this issue Oct 14, 2022
Until the consensus process starts for a new block and really needs some
transactions, we can spare some cycles by not delivering transactions to it.
In tests this doesn't affect TPS, but it makes block delays a bit more stable.
Related to #2744; I think it may also cause timeouts during transaction
processing (waiting on the consensus process channel while it does something
dBFT-related).
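
A hedged sketch of the gating idea from this commit (type and method names are illustrative, not neo-go's real consensus API): transactions are forwarded to the consensus process only after it signals that it needs them:

```go
// Hedged sketch of the commit's idea (illustrative names, not neo-go's
// real API): transactions are only forwarded to the consensus process
// once it has actually started collecting them for a block.
package consensussketch

import "sync/atomic"

type Tx struct{ /* transaction fields elided */ }

type Service struct {
	started atomic.Bool // set when dBFT begins collecting txs for a block
	txCh    chan *Tx    // consumed by the consensus goroutine
}

// Start is called when the consensus process begins a new round.
func (s *Service) Start() { s.started.Store(true) }

// OnTransaction is invoked from the network layer for every relayed tx.
// While consensus is idle we skip the delivery (the mempool still holds
// the tx), sparing both the channel send and the consensus goroutine.
func (s *Service) OnTransaction(t *Tx) {
	if !s.started.Load() {
		return // consensus doesn't need txs yet; avoid blocking on txCh
	}
	select {
	case s.txCh <- t:
	default: // never block the caller (it may be the P2P reader thread)
	}
}
```
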
roman-khimov added a commit that referenced this issue Oct 17, 2022
It makes sense in general (further narrowing down the time window when
transactions are processed by the consensus thread) and it improves block
times a little too, especially in the 7+2 scenario.

Related to #2744.
roman-khimov added a commit that referenced this issue Oct 21, 2022
This allows transaction processing to scale naturally if we have some peer
that is sending a lot of them while others are mostly silent. It can also help
somewhat in the event that we have 50 peers that all send transactions. The
4+1 scenario benefits a lot from it, while 7+2 slows down a little. Delayed
scenarios don't care.

Surprisingly, this also makes disconnects (#2744) much rarer; the 4-node
scenario almost never sees them now. Most probably this is a case where peers
affect each other a lot: a single-threaded transaction receiver can be slow
enough to trigger a timeout in the getdata handler of its peer (because it
tries to push a number of replies).
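
What per-peer scaling along these lines can look like, as a minimal sketch (illustrative types, not the actual neo-go structures); the key property is that the read loop never blocks on transaction processing:

```go
// Hedged sketch (illustrative, not neo-go's actual structures) of
// per-peer transaction queues: each peer drains its own queue in its
// own goroutine, so a peer flooding us with txs no longer serializes
// everyone behind one single-threaded receiver.
package p2psketch

type tx struct{ /* fields elided */ }

type peer struct {
	txQueue chan *tx
}

func newPeer(process func(*tx)) *peer {
	p := &peer{txQueue: make(chan *tx, 512)} // per-peer buffer
	go func() {
		for t := range p.txQueue { // drains independently of other peers
			process(t) // verification/mempool work happens here
		}
	}()
	return p
}

// enqueueTx is called from the peer's reader; it never blocks, so a
// slow verifier can't stall the read loop and trigger write timeouts
// on the other side.
func (p *peer) enqueueTx(t *tx) {
	select {
	case p.txQueue <- t:
	default: // queue full: drop, the tx can be re-requested later
	}
}
```
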
roman-khimov added a commit that referenced this issue Oct 21, 2022
When a block is being spread through the network, we can get a lot of invs
with the same hash. Some more stale nodes may also announce the previous or
some earlier block. We can avoid a full DB lookup for them and minimize inv
handling time (timeouts in the inv handler had happened in #2744).

It doesn't affect tests, it just makes the node a little less likely to spend
a considerable amount of time in the inv handler.
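
A hedged sketch of this shortcut (illustrative types): a tiny ring of recently seen block hashes that an inv handler could consult before doing a full DB lookup:

```go
// Hedged sketch of the "skip the DB for recently seen blocks" idea
// (types are illustrative): keep the last few block hashes in a tiny
// ring and answer repeated invs from memory instead of hitting the DB.
package p2psketch

import "sync"

type Hash [32]byte

type recentBlocks struct {
	mu   sync.Mutex
	ring [8]Hash // a handful of the most recent block hashes
	next int
}

func (r *recentBlocks) add(h Hash) {
	r.mu.Lock()
	r.ring[r.next] = h
	r.next = (r.next + 1) % len(r.ring)
	r.mu.Unlock()
}

// seen reports whether h is one of the recent blocks; a hit means the
// inv handler can skip the full database lookup entirely.
func (r *recentBlocks) seen(h Hash) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	for _, e := range r.ring {
		if e == h {
			return true
		}
	}
	return false
}
```
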
@roman-khimov (Member, Author):

Timeouts were mostly happening in the getdata (more often) and inv handlers (#2750 plus a specific timeout-tracking addition). After #2755 they're very rare in the 4+1 scenario, and even when they happen, they're less problematic. It seems it was one node affecting the other: while some peer was processing transactions, another node couldn't push more of them to it.

#2757 and #2759 also make timeouts less likely to happen, because we now have one Write() call instead of 32 most of the time (and less stress on internal queues), and a specific potentially problematic Inv case can be handled more easily. #2758 can prevent some traffic from happening, which is also beneficial.
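
To illustrate the one-Write() point, a hedged sketch (illustrative API; the 2-second deadline is an assumption) of batching already-encoded getdata replies into a single write:

```go
// Hedged sketch of batching getdata replies (illustrative API): instead
// of one conn.Write per transaction (up to 32 per getdata), serialize
// all replies into one buffer and issue a single write, reducing both
// syscalls and the time the deadline clock runs against us.
package p2psketch

import (
	"bytes"
	"net"
	"time"
)

func sendBatched(c net.Conn, replies [][]byte) error {
	var buf bytes.Buffer
	for _, msg := range replies {
		buf.Write(msg) // each msg is an already-encoded P2P message
	}
	if err := c.SetWriteDeadline(time.Now().Add(2 * time.Second)); err != nil {
		return err
	}
	_, err := c.Write(buf.Bytes()) // one Write instead of len(replies)
	return err
}
```
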

Things that were also tried, but proved less successful:

  • reducing the number of lock/unlock calls in the Inventory handler, taking txInLock (and/or the mempool lock) only once per inv (a set of 32 hashes) --- this in fact degrades performance; various combinations were tried (on top of different other changes), but they all work worse
  • adding more cache to the P2P message queue, but that's largely obsolete after Improve P2P transaction handling #2755; we no longer have 32 distinct messages to push into the queue in the GetData handler
  • re-adding the peer shuffling removed by Rework broadcast logic #2741; as expected, it's absolutely irrelevant with the new iteratePeersWithSendMsg logic
  • adding a global cache tracker with per-peer data inside to track known hashes and only request transactions once in the normal scenario (a bit similar to C#'s pendingKnownHashes, but with a circular buffer and re-requests from other peers if we get no real transaction from the first announcer; see the sketch after this list) --- surprisingly, it horribly breaks performance; our current dumb code works faster (and is more reliable)
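
For reference, a hedged sketch of the circular-buffer tracker from the last item (illustrative names; as noted, this approach degraded performance and was dropped):

```go
// Hedged sketch of the per-peer known-hash tracker that was tried and
// rejected: a fixed circular buffer of hashes a peer announced, used to
// request each transaction only once in the normal scenario.
package p2psketch

type hash [32]byte

type knownHashes struct {
	buf  []hash
	next int
}

func newKnownHashes(n int) *knownHashes { return &knownHashes{buf: make([]hash, n)} }

// add records h, overwriting the oldest entry once the buffer wraps,
// so memory stays bounded no matter how many invs a peer sends.
func (k *knownHashes) add(h hash) {
	k.buf[k.next] = h
	k.next = (k.next + 1) % len(k.buf)
}

// contains would be checked before requesting a transaction, so each
// tx is only requested from the first peer that announced it.
func (k *knownHashes) contains(h hash) bool {
	for _, e := range k.buf {
		if e == h {
			return true
		}
	}
	return false
}
```
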

I think that's enough for this task; we have a better networking subsystem now and timeouts are less likely to happen.

AnnaShaleva pushed a commit that referenced this issue Oct 24, 2022 (same message as the Oct 14 commit above).
AnnaShaleva pushed a commit that referenced this issue Oct 26, 2022 (same message as the Oct 21 commit above).