Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: measure bitswap ttfb from after we get candidates back #432

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

rvagg
Copy link
Member

@rvagg rvagg commented Sep 21, 2023

Fixes: #333

This changes the start-time from when we start asking the indexer for bitswap candidates to when we get our first candidate back. So the "Bitswap TTFB" is measured from when we start attempting to talk to peers. This roughly matches the other retrievers that start their timing from when we start retrieving from each peer (we do them in sequence, so we do a per-peer ttfb from when we start the individual attempts).

BUT I'm having reservations about this now that I consider it further. Bitswap inherently comes with a greater discovery time penalty that we don't measure anywhere else (it's difficult to measure). If we move the timing as per this PR, we lose account of that delay; maybe we want to see that in the ttfb measurement for bitswap?

Either way, it's difficult to compare protocols; we just have to acknowledge that.

Feedback from @hannahhoward & @willscott would be good before proceeding.

@LexLuthr
Copy link

I have an SP serving retrievals via booster-bitswap. He set it up in proxy mode behind the boostd node. When we try to fetch something using Lassie and specifying the protocol, the retrievals work.
But when Bot attempts to retrieve from this same SP, it gets a timeout. This was a bit perplexing for me till I realised that TTFB timeouts are simply too low for Bitswap.
This is just one SP. I know there are 70+ nodes with booster-bitswap serving retrievals. Maybe, the way we check them is simply incorrect and that is why only "Filcollins" is serving retrievals over all 3 protocols.

@rvagg
Copy link
Member Author

rvagg commented Sep 21, 2023

I'm suspecing that RetrievalBot might use a timeout of 10s: https://github.com/data-preservation-programs/RetrievalBot/blob/main/.env.retrievalworker#L4 @xinaxu am I understanding that correctly?

Lassie has an overall timeout by default on the commandline of 20s - start fetching something or it'll time out.
Even that's always felt a little short for me.

We're recording ttfb for Saturn atm, this is all peers, not just Filecoin of course, but ttfb for bitswap is the fastest across protocols, even with the extra delay for discovery this PR is looking at removing:

Screenshot 2023-09-21 at 7 53 27 pm

It could be that we're that fast because these are Saturn nodes that maintain a large number of peers already-connected, and there's a lot of repeat content. But if booster-bitswap is taking longer than 10 seconds then that seems like a problem to me.

It's also the case that Lassie is stuck on a boxo commit that has now been reverted (long story): ipfs/boxo#435 which has to do with initial peer discovery + connection, without this particular change we get a certain % where we know the peer(s) to connect to and tell bitswap but a race condition makes it fall through the cracks and we get what looks like a timeout. If we don't get a "proper" fix to this in boxo by the time we need to move on our boxo dependency then we may have to fork bitswap to maintain this! But looking at RetrievalBot's bitswap code I suspect this might not be the same problem we're addressing since it seems to be more direct & 1:1 whereas we use the full peer discovery mechanism and just short-circuit it with indexer results.

@rvagg
Copy link
Member Author

rvagg commented Sep 21, 2023

Maybe, the way we check them is simply incorrect and that is why only "Filcollins" is serving retrievals over all 3 protocols.

btw I'm not sure we know about this, yet; we currently don't have insight into who's serving via Rhea/Saturn on Bitswap. That's what #345 was for but it's only just released and hasn't been deployed to Saturn yet. Once that's turned on we should have insight into Filecoin SPs giving us data via Bitswap. Hopefully by next week we should be able to make some more interesting statements about this and even get stuff onto the dashboard.

@@ -300,6 +315,7 @@ func (br *bitswapRetrieval) RetrieveFromAsyncCandidates(ayncCandidates types.Inb
TotalPayment: big.Zero(),
NumPayments: 0,
AskPrice: big.Zero(),
TimeToFirstByte: time.Duration(ttfb.Load()),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should preserve this change even if this PR is closed, it's the only retriever to not return this value and there's no good reason not to

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TTFB for bitswap shouldn't include candidate collection
2 participants