Investigate subscriber too slow errors in pubsub #759

Closed · Tracked by #773
masih opened this issue Nov 28, 2024 · 11 comments · Fixed by #762

masih (Member) commented Nov 28, 2024

Observed in passive testing on mainnet for filecoin/16 network:

{"level":"info","ts":"2024-11-28T21:17:37.595+1100","logger":"pubsub","caller":"go-libp2p-pubsub@v0.11.0/pubsub.go:981","msg":"Can't deliver message to subscription for topic /f3/granite/0.0.2/filecoin/16; subscriber too slow"}
{"level":"info","ts":"2024-11-28T21:17:37.595+1100","logger":"pubsub","caller":"go-libp2p-pubsub@v0.11.0/pubsub.go:981","msg":"Can't deliver message to subscription for topic /f3/granite/0.0.2/filecoin/16; subscriber too slow"}
{"level":"info","ts":"2024-11-28T21:17:37.595+1100","logger":"pubsub","caller":"go-libp2p-pubsub@v0.11.0/pubsub.go:981","msg":"Can't deliver message to subscription for topic /f3/granite/0.0.2/filecoin/16; subscriber too slow"}

This indicates that the processing loop is taking too long to process received messages.
ida metrics for the corresponding time interval in which this occurred: https://grafana.f3.eng.filoz.org/d/edsu1k5s7gtfkb/f3-passive-testing?orgId=1&var-network=mainnet&var-instance=ida.f3.eng.filoz.org:80&from=1732789053994&to=1732789928949&tab=transform

masih (Member Author) commented Nov 28, 2024

We should set a Tracer in pubsub and log UndeliverableMessage; dropped messages end up in that tracer call.
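
A minimal sketch of how that could be wired up with go-libp2p-pubsub's RawTracer, which receives an UndeliverableMessage callback whenever a subscription buffer is full. The no-op tracer below is illustrative, not the go-f3 implementation:

```go
package example

import (
	"context"
	"log"

	pubsub "github.com/libp2p/go-libp2p-pubsub"
	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/peer"
	"github.com/libp2p/go-libp2p/core/protocol"
)

// undeliverableLogger only cares about messages that pubsub drops because the
// subscriber is too slow; every other RawTracer callback is a no-op.
type undeliverableLogger struct{}

func (undeliverableLogger) UndeliverableMessage(msg *pubsub.Message) {
	log.Printf("undeliverable message on topic %s from %s", msg.GetTopic(), msg.ReceivedFrom)
}

func (undeliverableLogger) AddPeer(peer.ID, protocol.ID)          {}
func (undeliverableLogger) RemovePeer(peer.ID)                    {}
func (undeliverableLogger) Join(string)                           {}
func (undeliverableLogger) Leave(string)                          {}
func (undeliverableLogger) Graft(peer.ID, string)                 {}
func (undeliverableLogger) Prune(peer.ID, string)                 {}
func (undeliverableLogger) ValidateMessage(*pubsub.Message)       {}
func (undeliverableLogger) DeliverMessage(*pubsub.Message)        {}
func (undeliverableLogger) RejectMessage(*pubsub.Message, string) {}
func (undeliverableLogger) DuplicateMessage(*pubsub.Message)      {}
func (undeliverableLogger) ThrottlePeer(peer.ID)                  {}
func (undeliverableLogger) RecvRPC(*pubsub.RPC)                   {}
func (undeliverableLogger) SendRPC(*pubsub.RPC, peer.ID)          {}
func (undeliverableLogger) DropRPC(*pubsub.RPC, peer.ID)          {}

// newTracedPubSub constructs a gossipsub instance with the tracer attached.
func newTracedPubSub(ctx context.Context, h host.Host) (*pubsub.PubSub, error) {
	return pubsub.NewGossipSub(ctx, h, pubsub.WithRawTracer(undeliverableLogger{}))
}
```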

masih (Member Author) commented Nov 28, 2024

The buffer size of 20 here may be too small for large networks.
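
The pubsub side of that buffer can be overridden per subscription. A hedged sketch; the bufSize value is just an assumed tuning knob:

```go
package example

import (
	pubsub "github.com/libp2p/go-libp2p-pubsub"
)

// subscribeWithHeadroom joins a topic and subscribes with a larger output
// buffer than the library default, so a briefly slow consume loop does not
// immediately trigger "subscriber too slow" drops.
func subscribeWithHeadroom(ps *pubsub.PubSub, topicName string, bufSize int) (*pubsub.Subscription, error) {
	topic, err := ps.Join(topicName)
	if err != nil {
		return nil, err
	}
	return topic.Subscribe(pubsub.WithBufferSize(bufSize))
}
```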

@masih masih self-assigned this Nov 28, 2024
@masih masih added the P1 label Nov 28, 2024
@masih masih moved this from Todo to In progress in F3 Nov 28, 2024
masih added a commit that referenced this issue Nov 28, 2024
Investigating slow subscriber error:
* increase internal message queue buffer by 5X
* increase pubsub internal buffer size by 2x

Relates to #759
masih added a commit to filecoin-project/lotus that referenced this issue Nov 28, 2024
Depend on go-f3 with bumped internal buffer sizes.

Relates to: filecoin-project/go-f3#759
masih (Member Author) commented Nov 28, 2024

Bumping both the internal buffer size and the pubsub subscriber buffer size to 512 seems to have fixed the issue at a network scale of 40%.

The true fix is to make the processing loop faster, or to buffer messages internally in a more sophisticated way inside GPBFT (and ideally hook that up to validation/internal state so we can drop messages early if they have not been processed yet, were valid when buffered, but have since become useless).
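
A rough sketch of what such smarter internal buffering could look like; the message shape and the staleness rule below are hypothetical stand-ins, not the GPBFT types:

```go
package example

// queuedMsg is a hypothetical stand-in for a validated GPBFT message waiting
// for the processing loop.
type queuedMsg struct {
	Instance uint64
	Payload  []byte
}

// boundedQueue keeps at most max messages and sheds work that has become
// useless (here: messages from instances older than the current one) before
// it ever drops anything still relevant.
type boundedQueue struct {
	msgs []queuedMsg
	max  int
}

func (q *boundedQueue) push(m queuedMsg, currentInstance uint64) {
	if len(q.msgs) >= q.max {
		// First evict messages that were valid when buffered but are now stale.
		kept := q.msgs[:0]
		for _, old := range q.msgs {
			if old.Instance >= currentInstance {
				kept = append(kept, old)
			}
		}
		q.msgs = kept
		if len(q.msgs) >= q.max {
			q.msgs = q.msgs[1:] // still full: drop the oldest
		}
	}
	q.msgs = append(q.msgs, m)
}
```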

masih (Member Author) commented Dec 2, 2024

There is a flurry of these messages every time we re-bootstrap a passive testing network.

I believe the cause could also be that the gpbft runner gets paused when the manifest changes and only resumes close to the bootstrap of the new manifest. The runner is where the consume loop of the topic subscription lives, meaning that while it is paused no messages from the previous network topic get consumed, hence the "queue full" error.
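
For context, the log line comes from pubsub's non-blocking hand-off into the subscription channel: when the consuming runner is paused and the channel fills up, new messages are dropped rather than blocking the router. Roughly, as an illustrative pattern rather than the go-libp2p-pubsub source:

```go
package example

import "log"

// deliver illustrates the drop-on-full behaviour behind the "subscriber too
// slow" message: the router never blocks on a subscriber, so once the
// subscription channel is full the message is simply discarded.
func deliver(ch chan []byte, msg []byte, topic string) {
	select {
	case ch <- msg:
	default:
		log.Printf("Can't deliver message to subscription for topic %s; subscriber too slow", topic)
	}
}
```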

If on archioz we also only observe this error during re-bootstrap, then I believe this is a non-issue. It will not hurt to bump the limits anyway, though.

masih added a commit that referenced this issue Dec 3, 2024
Increase the pubsub and internal message queue buffer sizes to 128. This
is to allow larger headroom for buffered messages if the internal
message handling loop is too slow, or in a case where there is a slight
misalignment across nodes on bootstrap.

Relates to #759
github-merge-queue bot pushed a commit that referenced this issue Dec 3, 2024
Increase the pubsub and internal message queue buffer sizes to 128. This
is to allow larger headroom for buffered messages if the internal
message handling loop is too slow, or in a case where there is a slight
misalignment across nodes on bootstrap.

Relates to #759
@masih masih closed this as completed in #762 Dec 3, 2024
@github-project-automation github-project-automation bot moved this from In progress to Done in F3 Dec 3, 2024
@masih masih reopened this Dec 6, 2024
masih (Member Author) commented Dec 6, 2024

Observed a flurry of subscriber too slow errors on ida just before instance 4 started in network filecoin/40.

masih (Member Author) commented Dec 6, 2024

There was a bump in CPU/allocation around that time. The bump was abrupt: I cannot say for sure it was caused by the split store. It does, however, overlap with BLS signature validation, which is expected at the end of an instance.

It does not seem to happen consistently, though.

The increase of buffer sizes in the latest release should certainly help.

@Kubuxu also suggested adding artificial jitter to message broadcast, which would also help, but we should be careful with the jitter bound relative to F3 instance alignment.
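
A minimal sketch of the jitter idea; maxJitter is an assumed knob and would have to stay small relative to GPBFT phase and instance timing:

```go
package example

import (
	"context"
	"math/rand"
	"time"

	pubsub "github.com/libp2p/go-libp2p-pubsub"
)

// publishWithJitter waits a random duration in [0, maxJitter) before
// publishing, so bursts of broadcasts from many nodes do not all hit every
// subscriber at the same instant.
func publishWithJitter(ctx context.Context, topic *pubsub.Topic, payload []byte, maxJitter time.Duration) error {
	if maxJitter <= 0 {
		return topic.Publish(ctx, payload)
	}
	delay := time.Duration(rand.Int63n(int64(maxJitter)))
	select {
	case <-time.After(delay):
	case <-ctx.Done():
		return ctx.Err()
	}
	return topic.Publish(ctx, payload)
}
```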

Kubuxu (Contributor) commented Dec 6, 2024

@masih was ida at the time running with the buffer patch or not?

masih (Member Author) commented Dec 6, 2024

running with the buffer patch or not?

No, but archioz was.

masih (Member Author) commented Dec 6, 2024

OK, it looks like archioz is running the commit that doesn't have the buffer bump, and it did observe the subscriber too slow errors.

Interestingly, a lot more than ida did.

masih (Member Author) commented Dec 6, 2024

New observation: on both ida and archioz, when this happens the committee fetch time is at its maximum of 2-3 seconds.

It seems likely that a slow committee fetch can cause this, which brings me to ask why get-committee times are shooting up.


masih (Member Author) commented Dec 12, 2024

Resolved by #779.

@masih masih closed this as completed Dec 12, 2024