Investigate subscriber too slow errors in pubsub #759

Closed · Tracked by #773
masih opened this issue Nov 28, 2024 · 11 comments · Fixed by #762

masih (Member) commented Nov 28, 2024

Observed in passive testing on mainnet for filecoin/16 network:

{"level":"info","ts":"2024-11-28T21:17:37.595+1100","logger":"pubsub","caller":"go-libp2p-pubsub@v0.11.0/pubsub.go:981","msg":"Can't deliver message to subscription for topic /f3/granite/0.0.2/filecoin/16; subscriber too slow"}
{"level":"info","ts":"2024-11-28T21:17:37.595+1100","logger":"pubsub","caller":"go-libp2p-pubsub@v0.11.0/pubsub.go:981","msg":"Can't deliver message to subscription for topic /f3/granite/0.0.2/filecoin/16; subscriber too slow"}
{"level":"info","ts":"2024-11-28T21:17:37.595+1100","logger":"pubsub","caller":"go-libp2p-pubsub@v0.11.0/pubsub.go:981","msg":"Can't deliver message to subscription for topic /f3/granite/0.0.2/filecoin/16; subscriber too slow"}

This indicates that the processing loop is taking too long to process received messages.
ida metrics for the corresponding time interval in which this occurred: https://grafana.f3.eng.filoz.org/d/edsu1k5s7gtfkb/f3-passive-testing?orgId=1&var-network=mainnet&var-instance=ida.f3.eng.filoz.org:80&from=1732789053994&to=1732789928949&tab=transform

masih (Member Author) commented Nov 28, 2024

We should set a Tracer in pubsub and log UndeliverableMessage; dropped messages end up in that tracer call.
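
A minimal sketch of how that could be wired up with go-libp2p-pubsub's RawTracer, which receives an UndeliverableMessage callback whenever a subscription buffer is full. The no-op tracer below is illustrative, not the go-f3 implementation:

```go
package example

import (
	"context"
	"log"

	pubsub "github.com/libp2p/go-libp2p-pubsub"
	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/peer"
	"github.com/libp2p/go-libp2p/core/protocol"
)

// undeliverableLogger only cares about messages that pubsub drops because the
// subscriber is too slow; every other RawTracer callback is a no-op.
type undeliverableLogger struct{}

func (undeliverableLogger) UndeliverableMessage(msg *pubsub.Message) {
	log.Printf("undeliverable message on topic %s from %s", msg.GetTopic(), msg.ReceivedFrom)
}

func (undeliverableLogger) AddPeer(peer.ID, protocol.ID)          {}
func (undeliverableLogger) RemovePeer(peer.ID)                    {}
func (undeliverableLogger) Join(string)                           {}
func (undeliverableLogger) Leave(string)                          {}
func (undeliverableLogger) Graft(peer.ID, string)                 {}
func (undeliverableLogger) Prune(peer.ID, string)                 {}
func (undeliverableLogger) ValidateMessage(*pubsub.Message)       {}
func (undeliverableLogger) DeliverMessage(*pubsub.Message)        {}
func (undeliverableLogger) RejectMessage(*pubsub.Message, string) {}
func (undeliverableLogger) DuplicateMessage(*pubsub.Message)      {}
func (undeliverableLogger) ThrottlePeer(peer.ID)                  {}
func (undeliverableLogger) RecvRPC(*pubsub.RPC)                   {}
func (undeliverableLogger) SendRPC(*pubsub.RPC, peer.ID)          {}
func (undeliverableLogger) DropRPC(*pubsub.RPC, peer.ID)          {}

// newTracedPubSub constructs a gossipsub instance with the tracer attached.
func newTracedPubSub(ctx context.Context, h host.Host) (*pubsub.PubSub, error) {
	return pubsub.NewGossipSub(ctx, h, pubsub.WithRawTracer(undeliverableLogger{}))
}
```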

masih (Member Author) commented Nov 28, 2024

The buffer size of 20 here may be too small for large networks.
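
The pubsub side of that buffer can be overridden per subscription. A hedged sketch; the bufSize value is just an assumed tuning knob:

```go
package example

import (
	pubsub "github.com/libp2p/go-libp2p-pubsub"
)

// subscribeWithHeadroom joins a topic and subscribes with a larger output
// buffer than the library default, so a briefly slow consume loop does not
// immediately trigger "subscriber too slow" drops.
func subscribeWithHeadroom(ps *pubsub.PubSub, topicName string, bufSize int) (*pubsub.Subscription, error) {
	topic, err := ps.Join(topicName)
	if err != nil {
		return nil, err
	}
	return topic.Subscribe(pubsub.WithBufferSize(bufSize))
}
```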

@masih masih self-assigned this Nov 28, 2024
@masih masih added the P1 label Nov 28, 2024
@masih masih moved this from Todo to In progress in F3 Nov 28, 2024
masih added a commit that referenced this issue Nov 28, 2024
Investigating slow subscriber error:
* increase internal message queue buffer by 5X
* increase pubsub internal buffer size by 2x

Relates to #759
masih added a commit to filecoin-project/lotus that referenced this issue Nov 28, 2024
Depend on go-f3 with bumped internal buffer sizes.

Relates to: filecoin-project/go-f3#759
masih (Member Author) commented Nov 28, 2024

Bumping both the internal buffer size and the pubsub subscriber buffer size to 512 seems to have fixed the issue at a network scale of 40%.

The true fix is to make the processing loop faster, or to buffer messages internally in a more sophisticated way inside GPBFT (and ideally hook that up to validation/internal state so we can drop messages early if they have not been processed yet, were valid when buffered, but have since become useless).
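
A rough sketch of what such smarter internal buffering could look like; the message shape and the staleness rule below are hypothetical stand-ins, not the GPBFT types:

```go
package example

// queuedMsg is a hypothetical stand-in for a validated GPBFT message waiting
// for the processing loop.
type queuedMsg struct {
	Instance uint64
	Payload  []byte
}

// boundedQueue keeps at most max messages and sheds work that has become
// useless (here: messages from instances older than the current one) before
// it ever drops anything still relevant.
type boundedQueue struct {
	msgs []queuedMsg
	max  int
}

func (q *boundedQueue) push(m queuedMsg, currentInstance uint64) {
	if len(q.msgs) >= q.max {
		// First evict messages that were valid when buffered but are now stale.
		kept := q.msgs[:0]
		for _, old := range q.msgs {
			if old.Instance >= currentInstance {
				kept = append(kept, old)
			}
		}
		q.msgs = kept
		if len(q.msgs) >= q.max {
			q.msgs = q.msgs[1:] // still full: drop the oldest
		}
	}
	q.msgs = append(q.msgs, m)
}
```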

masih (Member Author) commented Dec 2, 2024

There is a flurry of these messages every time we re-bootstrap a passive testing network.

I believe the cause could also be that the gpbft runner gets paused when the manifest changes and only resumes close to the bootstrap of the new manifest. The runner is where the consume loop of the topic subscription lives, meaning that while it is paused no messages from the previous network topic get consumed, hence the "queue full" error.
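
For context, the log line comes from pubsub's non-blocking hand-off into the subscription channel: when the consuming runner is paused and the channel fills up, new messages are dropped rather than blocking the router. Roughly, as an illustrative pattern rather than the go-libp2p-pubsub source:

```go
package example

import "log"

// deliver illustrates the drop-on-full behaviour behind the "subscriber too
// slow" message: the router never blocks on a subscriber, so once the
// subscription channel is full the message is simply discarded.
func deliver(ch chan []byte, msg []byte, topic string) {
	select {
	case ch <- msg:
	default:
		log.Printf("Can't deliver message to subscription for topic %s; subscriber too slow", topic)
	}
}
```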

If on archioz we also only observe this error during re-bootstrap, then I believe this is a non-issue. It will not hurt to bump the limits anyway, though.

masih added a commit that referenced this issue Dec 3, 2024
Increase the pubsub and internal message queue buffer sizes to 128. This
is to allow larger headroom for buffered messages if the internal
message handling loop is too slow, or in a case where there is a slight
misalignment across nodes on bootstrap.

Relates to #759
github-merge-queue bot pushed a commit that referenced this issue Dec 3, 2024
Increase the pubsub and internal message queue buffer sizes to 128. This
is to allow larger headroom for buffered messages if the internal
message handling loop is too slow, or in a case where there is a slight
misalignment across nodes on bootstrap.

Relates to #759
@masih masih closed this as completed in #762 Dec 3, 2024
@github-project-automation github-project-automation bot moved this from In progress to Done in F3 Dec 3, 2024
@masih masih reopened this Dec 6, 2024
masih (Member Author) commented Dec 6, 2024

Observed a flurry of subscriber too slow errors on ida just before instance 4 started in network filecoin/40.

masih (Member Author) commented Dec 6, 2024

There was a bump in CPU/allocation around that time. The bump was abrupt: I cannot say for sure it was caused by the split store. It does, however, overlap with BLS signature validation, which is expected at the end of an instance.

It does not seem to happen consistently, though.

The increase of buffer sizes in the latest release should certainly help.

@Kubuxu also suggested adding artificial jitter to message broadcast, which would also help, but we should be careful with the jitter bound relative to F3 instance alignment.
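
A minimal sketch of the jitter idea; maxJitter is an assumed knob and would have to stay small relative to GPBFT phase and instance timing:

```go
package example

import (
	"context"
	"math/rand"
	"time"

	pubsub "github.com/libp2p/go-libp2p-pubsub"
)

// publishWithJitter waits a random duration in [0, maxJitter) before
// publishing, so bursts of broadcasts from many nodes do not all hit every
// subscriber at the same instant.
func publishWithJitter(ctx context.Context, topic *pubsub.Topic, payload []byte, maxJitter time.Duration) error {
	if maxJitter <= 0 {
		return topic.Publish(ctx, payload)
	}
	delay := time.Duration(rand.Int63n(int64(maxJitter)))
	select {
	case <-time.After(delay):
	case <-ctx.Done():
		return ctx.Err()
	}
	return topic.Publish(ctx, payload)
}
```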

Kubuxu (Contributor) commented Dec 6, 2024

@masih was ida at the time running with the buffer patch or not?

masih (Member Author) commented Dec 6, 2024

running with the buffer patch or not?

No, but archioz was.

masih (Member Author) commented Dec 6, 2024

OK, it looks like archioz is running the commit that doesn't have the buffer bump, and it did observe the subscriber too slow errors.

Interestingly, a lot more than ida did.

masih (Member Author) commented Dec 6, 2024

New observation: on both ida and archioz, when this happens the committee fetch time is at its maximum of 2-3 seconds.

It seems likely that a slow committee fetch can cause this, which brings me to ask why get-committee times are shooting up.


masih (Member Author) commented Dec 12, 2024

Resolved by #779.

@masih masih closed this as completed Dec 12, 2024