Skip to content
This repository has been archived by the owner on May 26, 2022. It is now read-only.

New Dialer #243

Merged
merged 33 commits into from
Apr 1, 2021
Merged

New Dialer #243

merged 33 commits into from
Apr 1, 2021

Conversation

Stebalien
Copy link
Member

@Stebalien Stebalien commented Mar 19, 2021

The direct connection rework introduced two bugs in the dialer and exposed a variety of issues.

  1. Prefer non-transient connections over transient connections when opening new streams. Otherwise, we could miss a usable connection.
  2. Do not use the first dial's context to derive the "dial once" context when dialing. Otherwise, canceling this context can cancel subsequent (joined) dial attempts.

2 is actually pretty common. If I try to make two requests to a peer a the same time, then decide to cancel the first, the second request will get canceled.


Requirements to fix the situation:

  1. Add tests. We need a test for both of these cases. We need a test for the first case. We had a test for the second case and ignored failing CI.
  2. Figure out the correct approach to 2. Here, I'm just reverting the current behavior but this revert will break direct connection support.

A better implementation of 2 is to have new dials "join" an active dial by sending an address filter function to the active dial over a channel. The channel would then:

  1. Go back through previously "filtered" addresses and see if any of them are allowed by this new function. If so, add them on to the end of the "to dial" channel.
  2. Whenever a connection attempt succeeds, the active dial worker would look through the address filter functions sent by all "joined" dials and send the new connection back to any caller where the filter function passes.

This is my preferred approach.

We could also consider approaches where:

  1. We could make a new context but copy the options over. This is still wrong, but at least it's less broken.
  2. If our options differ with an active dial, we could wait for that active dial to finish. This is slow, but the simplest and least wrong approach.

But given that we now have multiple people working on this project, I'd much prefer to go with the correct approach.


The new dialer was implemented in #250 .

@Stebalien Stebalien changed the title [WIP] Fix two bugs introduced in direct connection rework Fix two bugs introduced in direct connection rework Mar 19, 2021
@vyzo vyzo self-requested a review March 19, 2021 19:53
@vyzo
Copy link
Contributor

vyzo commented Mar 19, 2021

We also need to figure out a mechanism to pass down context options to the transport where applicable.

@vyzo
Copy link
Contributor

vyzo commented Mar 19, 2021

For instance, in libp2p/go-libp2p-quic-transport#194 we do make use of the SimultaneousConnect context option to have one side punch a hole, which is quite important for a successful hole punch.

@Stebalien
Copy link
Member Author

We also need to figure out a mechanism to pass down context options to the transport where applicable.

That will depend on the option. The simplest is to just let the first to dial win. That should work for the "active listen" case and is probably our best option.

Unfortunately, that doesn't work so well for the direct connection case because the two different dials have two different requirements.

swarm.go Outdated Show resolved Hide resolved
@Stebalien Stebalien force-pushed the fix/direct-connect branch from 808a8e5 to ddfb5b3 Compare March 19, 2021 20:19
Given two relay connections, prefer the one that's non-transient.
Otherwise, a transient connection could prevent us from opening streams
even if we have a second non-transient connection through a better relay.
Otherwise, canceling one dial request will cancel all "joined" dial requests.
@Stebalien Stebalien force-pushed the fix/direct-connect branch from ddfb5b3 to 14d544b Compare March 19, 2021 20:20
@vyzo vyzo changed the title Fix two bugs introduced in direct connection rework New Dialer Apr 1, 2021
@vyzo
Copy link
Contributor

vyzo commented Apr 1, 2021

The test failure was the flaky connection gating test; see #251.

@vyzo vyzo merged commit 68ae307 into master Apr 1, 2021
@vyzo vyzo deleted the fix/direct-connect branch April 1, 2021 20:41
@Stebalien Stebalien mentioned this pull request May 11, 2021
27 tasks
@aschmahmann aschmahmann mentioned this pull request May 14, 2021
71 tasks
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants