DHT Query Performance #88
At a first glance, a potentially big chunk of query slowdown comes from the need to resolve the key to verify the record in the response. Specifically, every time we get a record in This suggests a possible optimization: piggyback the key for the record in the query response. Note that Ed25519 keys will make this point moot, as the key will be extractable from the peer id. Nonetheless, we will have to deal with RSA keys for years to come, so it's worthwhile to optimize this case. |
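A minimal sketch of the piggybacking idea, not the DHT's actual wire format. `Response`, `VerifyResponse` and `NewSignedResponse` are hypothetical names, and a peer ID is modeled here as the plain SHA-256 of the encoded public key, which simplifies the real encoding. The point is only that a key carried in the response can be checked against the peer ID and then used to verify the record, with no second lookup.

```go
package main

import (
	"bytes"
	"crypto"
	"crypto/rand"
	"crypto/rsa"
	"crypto/sha256"
	"crypto/x509"
	"errors"
)

// Response is a hypothetical query response that carries the signer's
// public key next to the record.
type Response struct {
	Value     []byte // record payload
	Signature []byte // signature over SHA-256(Value)
	PubKey    []byte // PKIX-encoded RSA public key (the piggybacked part)
}

// VerifyResponse checks that the piggybacked key hashes to the expected
// peer ID (assumed to be SHA-256(PubKey)) and that it signed the record.
func VerifyResponse(peerID []byte, r Response) error {
	sum := sha256.Sum256(r.PubKey)
	if !bytes.Equal(sum[:], peerID) {
		return errors.New("piggybacked key does not match peer ID")
	}
	parsed, err := x509.ParsePKIXPublicKey(r.PubKey)
	if err != nil {
		return err
	}
	pub, ok := parsed.(*rsa.PublicKey)
	if !ok {
		return errors.New("not an RSA public key")
	}
	digest := sha256.Sum256(r.Value)
	return rsa.VerifyPKCS1v15(pub, crypto.SHA256, digest[:], r.Signature)
}

// NewSignedResponse generates a fresh RSA key and signs value with it.
// It exists only to produce input for VerifyResponse in this sketch.
func NewSignedResponse(value []byte) ([]byte, Response, error) {
	priv, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		return nil, Response{}, err
	}
	pubBytes, err := x509.MarshalPKIXPublicKey(&priv.PublicKey)
	if err != nil {
		return nil, Response{}, err
	}
	digest := sha256.Sum256(value)
	sig, err := rsa.SignPKCS1v15(rand.Reader, priv, crypto.SHA256, digest[:])
	if err != nil {
		return nil, Response{}, err
	}
	id := sha256.Sum256(pubBytes)
	return id[:], Response{Value: value, Signature: sig, PubKey: pubBytes}, nil
}
```

A receiver would call `VerifyResponse(expectedPeerID, resp)` and fall back to a separate key lookup only if the response carries no key.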
Another possible issue is the big lock in The lock is held for the duration of each request: send+receive response, which means that we can only have a single outstanding request per peer at a time. |
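To illustrate the difference between holding the lock for a full round trip and holding it only for the write, here is a rough sketch of a pipelined sender. It is not the code from #92; `pipelinedSender`, `Request` and `Deliver` are hypothetical names, and it assumes the remote peer answers requests on a stream in the order they were written.

```go
package main

import "sync"

// pipelinedSender allows several outstanding requests to one peer.
// The mutex is held only while a request is written, not while
// waiting for its response.
type pipelinedSender struct {
	mu      sync.Mutex       // serializes writes and keeps the queue in write order
	send    func(req string) // writes one request to the stream
	pending chan chan string // reply slots, in the order requests were written
}

func newPipelinedSender(send func(string), depth int) *pipelinedSender {
	return &pipelinedSender{send: send, pending: make(chan chan string, depth)}
}

// Request writes req and returns a channel that later yields its reply.
// If depth requests are already outstanding it blocks, which acts as
// backpressure.
func (p *pipelinedSender) Request(req string) <-chan string {
	slot := make(chan string, 1)
	p.mu.Lock()
	p.pending <- slot
	p.send(req)
	p.mu.Unlock()
	return slot
}

// Deliver is called by the single reader goroutine for each reply read
// off the stream; it hands the reply to the oldest outstanding request.
func (p *pipelinedSender) Deliver(resp string) {
	slot := <-p.pending
	slot <- resp
}
```

With this shape, a slow response delays only the callers waiting on that stream's replies, and new requests can still be written while earlier ones are in flight.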
One other related thing I noticed was that it appears we are sending out provider messages to the same peers for the same keys multiple times. This should probably get investigated too at some point. |
That definitely sounds like it needs to be investigated. |
Some analysis of the results with pipelining (#92) suggests that we have a 75ms response time in
|
Analysis of event log data from the SOL indicates a very long-tailed distribution in our response time for all request types. Most of the requests (80%) take a short time -- under 10ms, but there is a very long tail that reaches to tens of seconds. Processed data and plots from the analysis are pinned at |
@vyzo @whyrusleeping @Stebalien I'm interested in helping out on this, what's the current plan? |
|
Ok thanks for the update @Stebalien, sounds like you guys have this covered :) |
I'm not sure how often this occurs in practice, but I noticed that if GetPublicKey can't extract the public key, get it from the peerstore or get it from the node itself, then it will call GetValue(). GetValue() will in turn request 16 values from the closest peers. However the public key is immutable, so if I understand correctly it should only need one value, right? Would it make sense to call |
@dirkmc Yeah, it might help to set that down to one value. We will need to expose the I'm curious how often that occurs in practice. Maybe throw a log in there and see how often it gets hit? |
It looks like GetValues is already exposed: https://github.com/libp2p/go-libp2p-routing/blob/master/routing.go#L58 |
When trying this out today I found that my node rarely receives a value from more than 8 or 9 nodes for a particular key before the one minute timeout is reached. If I reduce the number of required values to one (eg for retrieving a public key) it usually resolves within about 10 seconds. When calling One complication is that the |
@dirkmc great findings on the number of values received. We should look into different ways of getting that 8 or 9 as high as we can. As for the |
Ah, it counts invalid records: https://github.com/libp2p/go-libp2p-kad-dht/blob/master/routing.go#L208 So that's not for arbitrary failures, but specifically for when we receive a value that is not valid. In the case of a public key, this could mean they sent us back the wrong key. So you're right, this can't be dropped to 1. |
Ah yes you're right, it's only invalid records. That seems like a pretty rare case, maybe the answer in that case is just to return an error from It seems like the successful set of values is reached relatively fast, maybe in the first 20 seconds, and then it waits for the rest of the dial attempts to time out. Is there a way to fail fast for dial errors, rather than waiting for timeouts? For example if it's a connection refused error that should return pretty quickly, right? |
Connection refused returns pretty quickly, but most failed dials on the internet don't hit that. Most firewalls on the internet just drop packets, as opposed to denying them. If a firewall drops your SYN packet, you're SOL until the timeout. |
That makes sense. Peers should only get into my kbucket if either I have been able to ping them, or a node that I can reach has been able to ping them. So in practice I would imagine the reason for not being able to reach a peer in the kbucket should only infrequently be because of firewalls, right? |
Firewalls, or if they've gone offline. |
I'm going to dig into the network code a little more tomorrow. |
One simple approach might be to track the distribution of response times,
fit a (exponentially decaying) distribution to it, and then set the timeout
to be something like the 95th percentile response time.
Thoughts?
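One way this could look, as a sketch only: the maximum-likelihood fit of an exponential distribution is just the sample mean, and its q-th quantile is `-mean * ln(1-q)`, so the 95th percentile comes out at roughly three times the mean. `RTTTracker` is a hypothetical name. Given the very long tail reported earlier in the thread, an exponential fit may underestimate the tail, so an empirical percentile could be a better choice in practice.

```go
package main

import (
	"math"
	"time"
)

// RTTTracker accumulates observed response times so that a timeout can
// be derived from them.
type RTTTracker struct {
	sum time.Duration
	n   int
}

// Observe records one response time.
func (t *RTTTracker) Observe(d time.Duration) {
	t.sum += d
	t.n++
}

// Timeout returns the q-th quantile of an exponential distribution
// fitted to the observations, or fallback if there is nothing to fit
// or q is outside (0, 1).
func (t *RTTTracker) Timeout(q float64, fallback time.Duration) time.Duration {
	if t.n == 0 || q <= 0 || q >= 1 {
		return fallback
	}
	mean := float64(t.sum) / float64(t.n)
	return time.Duration(-mean * math.Log(1-q))
}
```

For example, with a mean response time of 100ms, `Timeout(0.95, time.Minute)` gives a timeout just under 300ms.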
On Feb 12, 2018 5:50 PM, "dirkmc" wrote:
I'm going to dig into the network code a little more tomorrow.
One way of improving the response rate would be to keep the kbuckets more
up to date somehow.
Another way that occurs to me might be to keep a record of the ping time
to a peer, and use some multiple of that time as the timeout, rather than
assuming one minute for every node. eg if I have a 200ms ping to a node,
the next time I ping it I can probably assume it will respond within a
second or two.
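The ping-based idea could be sketched like this. `DialTimeout`, the multiplier and the clamp bounds are illustrative assumptions, not existing libp2p settings.

```go
package main

import "time"

// DialTimeout derives a per-peer timeout from the last measured ping
// time. With no measurement it falls back to max, the global timeout.
func DialTimeout(lastRTT time.Duration, multiple int, min, max time.Duration) time.Duration {
	if lastRTT <= 0 {
		return max
	}
	t := lastRTT * time.Duration(multiple)
	if t < min {
		return min
	}
	if t > max {
		return max
	}
	return t
}
```

With a 200ms ping and a multiple of 5, `DialTimeout(200*time.Millisecond, 5, 500*time.Millisecond, time.Minute)` gives one second instead of the default minute.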
|
@dirkmc Thinking through things, I actually think that we remove peers from our routing table once we are no longer connected to them (see #45). While this might technically be a bug, it makes the problem a little simpler. Pretty much every peer we get handed right now via DHT queries should be there; it's just a matter of actually connecting to them. |
that's very interesting indeed; thanks @dirkmc. |
It seems like the reason for the successful
Once the rate limited dialer has hit its limit of 160 concurrent dials, it blocks until a success or error response (eg a 60 second dial timeout) before trying the next dial in its queue. 60 seconds after a request begins, a whole bunch of these time out at once, freeing up slots in the dialer. The dialer now dials a bunch of new addresses, sometimes for a peer where a previous address failed. When one of these is successful we get the successful connect after 60 seconds mentioned above. Quick fixes:
Other ideas:
|
Part of this is because we were remembering (indefinitely) and gossiping bad peer addresses. This will be partially fixed in the next release (once a large fraction of the network has upgraded) but we should put some work into fixing this issue in general. I've filed libp2p/libp2p#27 to discuss.
I assume these dials are failing quickly so I don't think handling that error would help.
I believe the real trick will be introducing finer grained timeouts. Currently, we have a global dial timeout (that includes, e.g., security transport negotiation) but we really need a shorter timeout at the TCP level. Issue: libp2p/go-tcp-transport#22 Note: Many of these issues will be fixed when we get QUIC support. As QUIC connections don't take up file descriptors, we'll be able to spam as many concurrent dials as we want and cancel all but the first that succeeds. |
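The "dial everything and cancel all but the first that succeeds" pattern can be sketched with a shared context. This is an illustration, not the libp2p dialer: `DialFirst`, `DialFunc` and `simulatedDial` are hypothetical, and the simulated dial stands in for a real transport, with unknown addresses behaving like a firewall that silently drops packets.

```go
package main

import (
	"context"
	"errors"
	"time"
)

// DialFunc attempts one connection and must return when ctx is cancelled.
type DialFunc func(ctx context.Context, addr string) error

// DialFirst dials every address concurrently and returns the first one
// that connects. Returning cancels all dials still in flight.
func DialFirst(addrs []string, timeout time.Duration, dial DialFunc) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	type result struct {
		addr string
		err  error
	}
	results := make(chan result, len(addrs)) // buffered so losers never block
	for _, a := range addrs {
		go func(a string) { results <- result{a, dial(ctx, a)} }(a)
	}
	for range addrs {
		if r := <-results; r.err == nil {
			return r.addr, nil
		}
	}
	return "", errors.New("all dials failed")
}

// simulatedDial connects to known addresses after the given latency in
// milliseconds. Unknown addresses never answer until ctx is cancelled.
func simulatedDial(latencyMs map[string]int) DialFunc {
	return func(ctx context.Context, addr string) error {
		ms, ok := latencyMs[addr]
		if !ok {
			<-ctx.Done()
			return ctx.Err()
		}
		select {
		case <-time.After(time.Duration(ms) * time.Millisecond):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```

An unreachable address costs nothing extra here because the winner's return cancels it. With TCP each in-flight dial still holds a file descriptor, which is why this approach is tied to QUIC above.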
Some thoughts, as this problem really bites me in my project. As I understand, when a DHT query is done, we accumulate values until we get Another point that's worth looking into IMHO is how fast and how well values are propagated when doing a publish. In my real-world test, full of NATs, I have many DHT queries (IPNS) that don't resolve at all for a long time. But after some time (read: many minutes) queries resolve decently. I have the feeling that values get replicated enough to reach the |
@MichaelMure this PR will allow clients to specify the timeout duration, and to specify the number of values to collect: ipfs/kubo#4733 |
@dirkmc this PR allows setting the level of confirmation requested for a query, but my point is different. Most of the time, if you can't get to this level, you still want to use whatever information you got. Why should a query fail completely when we got only N-1 values instead of N? In bad network conditions, answering with whatever we got when the timeout hits would make the difference between "might be degraded" and "not working at all". |
@MichaelMure actually the query does return with the best record it has found so far when it hits the timeout:
Note this PR has now been merged and will go out with the next release |
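The behaviour described in this comment (collect up to N values, but return the best one found so far when the timeout hits) can be sketched roughly as follows. This is not the code from the PR: `Record` and `CollectBest` are hypothetical, and "best" is assumed to mean the highest sequence number.

```go
package main

import "time"

// Record is a simplified stand-in for a DHT record.
type Record struct {
	Seq   uint64 // higher means newer (an assumption of this sketch)
	Value string
}

// CollectBest reads records until it has seen want of them, the channel
// closes, or the timeout fires. It returns the best record seen so far,
// how many were seen, and whether any record was found at all.
func CollectBest(records <-chan Record, want int, timeout time.Duration) (Record, int, bool) {
	var best Record
	got := 0
	deadline := time.After(timeout)
	for got < want {
		select {
		case r, open := <-records:
			if !open {
				return best, got, got > 0
			}
			if got == 0 || r.Seq > best.Seq {
				best = r
			}
			got++
		case <-deadline:
			return best, got, got > 0
		}
	}
	return best, got, got > 0
}
```

With this shape the query only fails outright when no peer returned a value, which matches the failure case described later in the thread.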
@dirkmc oh indeed, I missed a code path. As it was also working that way even before your PR, I wonder how a query can completely fail... |
Yes, you're right it was working that way before already. My PR just allows you to set a specific timeout and a specific number of values to try to collect. The query fails to resolve an IPNS name if for example it is unable to find any peers with a value for the target key. |
Just a quick update here, the current primary cause of DHT slowness at this point appears to be poor connectivity. We are working to address this by adding in better relay support throughout the network (and other longer term solutions) |
@whyrusleeping have you guys done some research that indicates that poor connectivity is definitely the cause? I'm wondering how much is caused by some of the things @Stebalien mentioned above, eg gossiping bad peer addresses and ephemeral nodes |
@dirkmc we have a crawler which you can use to collect network connectivity diagnostic data from your own network vantage point. It seems that addrsplosion is under control and the majority of dials in the dht are timing out at about 1min. |
@vyzo I commented above about why timeouts take exactly one minute, and @Stebalien suggested that this issue should be alleviated by providing a QUIC transport because "As QUIC connections don't take up file descriptors, we'll be able to spam as many concurrent dials as we want and cancel all but the first that succeeds." Has there been any progress on adding a QUIC transport? |
@dirkmc QUIC is still some ways out... |
@dirkmc For this issue, adding more concurrent dials won't fix it. Many of the nodes we are failing to dial have fewer addresses than the concurrency limit; it's just that they don't have any addresses that work. |
Sorry if this is a silly comment but perhaps we should be scoring the addresses that work and gossiping those? One way to do this is to rely on newcomers. When a new node joins the DHT, each node that it contacts during its initial join operations can ping it again some fixed number of minutes later. This helps ensure accurate mixing and mapping without adding much additional network burden. Perhaps preferring node addresses recently gossiped via this mechanism would help spread out the work of pruning and sorting live node addresses? |
We do a bit of that. Addresses that are used to connect to a peer are
communicated to them. If a peer is told about an address of theirs multiple
times, they will keep it and tell others about it. The main issue here is
that no such address exists for many peers.
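As a rough illustration of the mechanism described here, a node could count how often each of its own addresses is reported back to it and advertise only those reported at least a threshold number of times. `ObservedAddrs` and the threshold are assumptions made for this sketch, not the actual identify-service implementation.

```go
package main

import "sort"

// ObservedAddrs counts how many times remote peers have reported each
// of our own addresses back to us.
type ObservedAddrs struct {
	threshold int
	counts    map[string]int
}

func NewObservedAddrs(threshold int) *ObservedAddrs {
	return &ObservedAddrs{threshold: threshold, counts: make(map[string]int)}
}

// Record notes that some peer connected to us via addr and told us so.
func (o *ObservedAddrs) Record(addr string) {
	o.counts[addr]++
}

// Advertised returns, in sorted order, the addresses reported at least
// threshold times. Only these would be passed on to other peers.
func (o *ObservedAddrs) Advertised() []string {
	var out []string
	for addr, n := range o.counts {
		if n >= o.threshold {
			out = append(out, addr)
		}
	}
	sort.Strings(out)
	return out
}
```

Counting distinct reporting peers rather than raw reports would be a natural variant. Either way, a peer with no working address never accumulates any reports, which is the problem noted above.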
|
I see, I forgot we're in the world of TCP. Another stupid question: if a
routable address doesn't exist for a peer, how does an invalid address end
up in another node's list of addresses to try?
|
Not a stupid question, the answer is "it depends". Sometimes the answer is "I don't know", sometimes the answer is "the node thinks that they can be dialed there, but they really can't", sometimes the answer is "Other nodes have told the node that they see them at that address (for outbound dials from the node)". |
@ajbouh that's the million dollar question: why exactly are these nodes unreachable? It would be great to have some solid statistics on what percentage of nodes are in each category that @whyrusleeping mentioned |
@whyrusleeping currently, if (for example) the first nodes to be dialled have 160 TCP addresses between them that all time out, that clogs up all of the available file descriptors allocated to the dialer until the timeout occurs (after one minute). QUIC should help alleviate this problem because, as I understand it, it doesn't use up file descriptors, so we can dial out to as many peers as we want and the dialer doesn't get blocked up with slow connections. |
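The clogging effect can be reproduced with a small model of a slot-limited dialer. This is a sketch under assumptions, not the real rate-limited dialer: `LimitedDialer`, `DialFunc` and `blackholeDial` are hypothetical names, and the slot channel stands in for the file-descriptor budget.

```go
package main

import (
	"context"
	"time"
)

// DialFunc attempts one connection and must return when ctx is done.
type DialFunc func(ctx context.Context, addr string) error

// LimitedDialer allows at most cap(slots) dials in flight. A dial to an
// address that never answers holds its slot for the full timeout.
type LimitedDialer struct {
	slots   chan struct{}
	timeout time.Duration
	dial    DialFunc
}

func NewLimitedDialer(limit int, timeout time.Duration, dial DialFunc) *LimitedDialer {
	return &LimitedDialer{slots: make(chan struct{}, limit), timeout: timeout, dial: dial}
}

// Dial waits for a free slot, then dials with the per-dial timeout.
func (d *LimitedDialer) Dial(addr string) error {
	d.slots <- struct{}{} // blocks while every slot is busy
	defer func() { <-d.slots }()
	ctx, cancel := context.WithTimeout(context.Background(), d.timeout)
	defer cancel()
	return d.dial(ctx, addr)
}

// blackholeDial simulates a network where addresses in dead silently
// drop packets and all other addresses connect immediately.
func blackholeDial(dead map[string]bool) DialFunc {
	return func(ctx context.Context, addr string) error {
		if !dead[addr] {
			return nil
		}
		<-ctx.Done()
		return ctx.Err()
	}
}
```

With a limit of 160 and a one-minute timeout, 160 dead addresses at the front of the queue delay every later dial by a minute. Shortening `timeout` or raising the limit are the two knobs this model exposes.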
Given the ambiguity around the statistics here, the connectivity issues are
starting to make more sense to me.
Alas, my coding time is overcommitted at the moment, so I must wait until
I'm working on something that's completely blocked by these performance
issues before I can really dive in.
In the meantime, I'm really glad people are thinking hard about this!
|
This issue is now old enough that it contains a lot of historical information that's not relevant. The main DHT query logic has been re-written. Note: The fact that we serialize all queries to a specific peer through a single stream is still a performance bottleneck, but the real performance bottleneck was always the DHT query logic itself. |
DHT queries are slow, on the order of several seconds, which is a performance problem.
This issue is here to discuss performance and propose ways to measure and improve it.