-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Investigate subscriber too slow
errors in pubsub
#759
Comments
We should set Tracer in pubsub and log |
The buffer size of 20 here may be too small for large networks. |
Investigating slow subscriber error: * increase internal message queue buffer by 5X * increase pubsub internal buffer size by 2x Relates to #759
Depend on go-f3 with bumped internal buffer sizes. Relates to: filecoin-project/go-f3#759
With buffer sizes bumped to 512 both for internal buffer size and pubsub subscriber buffer size seem to have fixed the issue at network scale of 40%. The true fix is to make the processing loop faster, or internally buffer messages in a more sophisticated way inside GPBFT (and indeally hook it up to validation/internal state to drop messages early if they are not processed yet , were valid when buffered, and became useless) |
There is a flurry of these messages every time we re-bootstrap a passive testing network. I believe the cause could also be that the gpbft runner gets paused when manifest changes and resumes again close to bootstrap of the new manifest. The runner is where the consume loop of topic subscription lives. Meaning, when paused no messages from the previous network topic gets consumed, and hence the "queue full" error. If on archioz we also only observe this error during rebootstrap then I believe this is a non-issue. Though will not hurt to bump the limits anyway. |
Increase the pubsub and internal message queue buffer sizes to 128. This is to allow a larger headroom for buffered messages if the internal message handling loop is too slow, or in a case where there is a slight missalignment across nodes on bootstrap. Relates to #759
Increase the pubsub and internal message queue buffer sizes to 128. This is to allow a larger headroom for buffered messages if the internal message handling loop is too slow, or in a case where there is a slight missalignment across nodes on bootstrap. Relates to #759
Observed subscriber too slow error flurry on |
There was a bump in CPU/allocation around that time. The bump was abrupt: I cannot say for sure it was caused by split store. It does however overlap with BLS sig validation, which is expected at the end of an instance. It does not seem to happen consistently though. Increase of buffer sizes in the latest release should certainly help. @Kubuxu also suggested adding artificial jitter in message broadcast which would also help but we should be careful with the jitter boundary relative to F3 instance alignment. |
@masih was ida at the time running with the buffer patch or not? |
No but archioz was |
OK looks like archioz is running the commit that doesn't have the buffer bump. And it did observer the subscriber too slow. Interestingly, a lot more than ida did. |
resolved by #779 |
Observed in passive testing on mainnet for
filecoin/16
network:This indicates that the processing loop is taking too long to process received messages.
ida
metrics at the corresponding time interval in which this occured: https://grafana.f3.eng.filoz.org/d/edsu1k5s7gtfkb/f3-passive-testing?orgId=1&var-network=mainnet&var-instance=ida.f3.eng.filoz.org:80&from=1732789053994&to=1732789928949&tab=transformThe text was updated successfully, but these errors were encountered: