Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(quic): Trigger Quic as Transport wakeup on dial #3306

Merged
merged 7 commits into from
Jan 11, 2023

Conversation

mxinden
Copy link
Member

@mxinden mxinden commented Jan 5, 2023

Description

Scenario: rust-libp2p node A dials rust-libp2p node B. B listens on a QUIC address. A dials B via the libp2p-quic Transport wrapped in a libp2p-dns Transport.

Note that libp2p-dns in itself is not relevant here. Only the fact that libp2p-dns delays a dial is relevant, i.e. that it first does other async stuff (DNS lookup) before creating the QUIC dial. In fact, dialing an IP address through the DNS Transport where no DNS resolution is needed triggers the below just fine.

  1. A calls Swarm::dial which creates a libp2p-dns dial.
  2. That dial is spawned onto the connection Pool, thus starting the DNS resolution.
  3. A continuously calls Swarm::poll.
  4. libp2p-quic Transport::poll is called, finding no dialers in self.dialer given that the spawned dial is still only resolving the DNS address.
  5. On the spawned connection task:
    1. The DNS resolution finishes.
    2. Thus calling Transport::dial on libp1p-quic (note that the DNS dial has a clone of the QUIC Transport wrapped in an Arc<Mutex<_>>).
    3. That adds a dialer to self.dialer. Note that there are no listeners, i.e. Swarm::listen_on was never called.
    4. DialerState::new_dial is called which adds a message to self.pending_dials and wakes self.waker. Given that on the last Transport::poll there was no self.dialer, that waker is empty.

Result: The message is stuck in the DialerState::pending_dials. The message is never send to the endpoint driver. The dial never succeeds.

This commit fixes the above, waking the <Quic as Transport>:poll method.

Notes

Steps to reproduce:

Links to any relevant issues

Open Questions

Change checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • A changelog entry has been made in the appropriate crates

Scenario: rust-libp2p node A dials rust-libp2p node B. B listens on a QUIC
address. A dials B via the `libp2p-quic` `Transport` wrapped in a `libp2p-dns`
`Transport`.

Note that `libp2p-dns` in itself is not relevant here. Only the fact that
`libp2p-dns` delays a dial is relevant, i.e. that it first does other async
stuff (DNS lookup) before creating the QUIC dial. In fact, dialing an IP address
through the DNS `Transport` where no DNS resolution is needed triggers the below
just fine.

1. A calls `Swarm::dial` which creates a `libp2p-dns` dial.
2. That dial is spawned onto the connection `Pool`, thus starting the DNS
   resolution.
3. A continuously calls `Swarm::poll`.
4. `libp2p-quic` `Transport::poll` is called, finding no dialers in
   `self.dialer` given that the spawned dial is still only resolving the DNS
   address.
5. On the spawned connection task:
  1. The DNS resolution finishes.
  2. Thus calling `Transport::dial` on `libp1p-quic` (note that the DNS dial has
     a clone of the QUIC `Transport` wrapped in an `Arc<Mutex<_>>`).
  3. That adds a dialer to `self.dialer`. Note that there are no listeners, i.e.
     `Swarm::listen_on` was never called.
  4. `DialerState::new_dial` is called which adds a message to
     `self.pending_dials` and wakes `self.waker`. Given that on the last
     `Transport::poll` there was no `self.dialer`, that waker is empty.

Result: The message is stuck in the `DialerState::pending_dials`. The message is
never send to the endpoint driver. The dial never succeeds.

This commit includes a hot-fix. It demonstrates the above issue more than it
represents a solid fix.
Copy link
Contributor

@elenaf9 elenaf9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed bug description @mxinden, very much appreciated!

Looking through the code again, we basically always want to wake our Transport::poll if a new dial happens.
I think we can remove the waker from DialerState and instead in GenTransport::dial simply always call dial_waker.wake(), independently of whether we then use a listener/ existing dialer or create a new dialer. GenTransport::poll will poll all of them again anyway.

Only if long-term we switch to a more fine-grained polling logic (i.e. only poll something again if this structure triggered the waker) this would have to be reconsidered. But for now I think having a single waker is cleaner.

@mxinden
Copy link
Member Author

mxinden commented Jan 6, 2023

Looking through the code again, we basically always want to wake our Transport::poll if a new dial happens. I think we can remove the waker from DialerState and instead in GenTransport::dial simply always call dial_waker.wake(), independently of whether we then use a listener/ existing dialer or create a new dialer. GenTransport::poll will poll all of them again anyway.

Sounds good to me. See recent changes.

@elenaf9 thanks for the review. Would you mind taking another look?

Copy link
Contributor

@elenaf9 elenaf9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more comment, and clippy is complaining. Otherwise this looks good to me.
Thanks @mxinden!

/// See https://github.com/libp2p/rust-libp2p/pull/3306 for context.
#[cfg(feature = "async-std")]
#[async_std::test]
async fn wrapped_with_dns() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙏

Comment on lines +193 to +198
// Note that the dial is spawned on a different task than the transport allowing the transport
// task to poll the transport once and then suspend, waiting for the wakeup from the dial.
let dial = async_std::task::spawn({
let dial = b_transport.dial(a_addr).unwrap();
async { dial.await.unwrap().0 }
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: isn't it sufficient that a new task is already spawned for polling the transport future below?
We could just create the dial-future here, and then drive it in the future::join below. If we already spawn a task for the dial here, couldn't it happen that the dial resolves before the executor starts polling the transport task below (since with QUIC, the endpoint runs in a separate background task and thus things "run" even if the transport is not being polled)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we already spawn a task for the dial here, couldn't it happen that the dial resolves before the executor starts polling the transport task below (since with QUIC, the endpoint runs in a separate background task and thus things "run" even if the transport is not being polled)?

Increased the Delay to 100ms, thus I don't think the above is probable. Will keep it as is. @elenaf9 please comment in case you feel strongly about this.

async fn wrapped_with_dns() {
let _ = env_logger::try_init();

struct FakeDns(Arc<Mutex<Boxed<(PeerId, StreamMuxerBox)>>>);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call this DialDelay maybe?

// Simulate DNS lookup. Giving the `Transport::poll` the chance to return
// `Poll::Pending` and thus suspending its task, waiting for a wakeup from the dial
// on the inner transport below.
Delay::new(Duration::from_millis(1)).await;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit short. Setting this to 100 wouldn't increase the test run time much because they all run in parallel and would make me feel easier that this actually makes a difference :)

@mxinden mxinden changed the title fix(quic): Trigger <Quic as Transport> wakeup on dial fix(quic): Trigger Quic as Transport wakeup on dial Jan 11, 2023
@mergify mergify bot merged commit 7665e74 into libp2p:master Jan 11, 2023
@thomaseizinger
Copy link
Contributor

Ever since this was merged, the libp2p-quic job hangs forever: https://github.com/libp2p/rust-libp2p/actions/runs/3893847657/jobs/6647025747

@thomaseizinger
Copy link
Contributor

I'd recommend we do two things:

  1. Revert this PR to unblock others
  2. Add libp2p-quic to the list of required jobs so this doesn't happen again: https://github.com/libp2p/github-mgmt/blob/573764714d656f03f35e665f6afd157614486e26/github/libp2p.yml#L4292C2-L4338

thomaseizinger added a commit that referenced this pull request Jan 13, 2023
@thomaseizinger
Copy link
Contributor

Revert PR submitted here: #3322

@thomaseizinger
Copy link
Contributor

Adding libp2p-quic and libp2p-webrtc to required jobs here: libp2p/github-mgmt#104

@elenaf9
Copy link
Contributor

elenaf9 commented Jan 13, 2023

Sorry, this was my bad. The issue is in the above solution I proposed i.e. to only have a single waker for the transport instead of having the waker in the DialState. I wrongly assumed to following:

Only if long-term we switch to a more fine-grained polling logic (i.e. only poll something again if this structure triggered the waker) this would have to be reconsidered. But for now I think having a single waker is cleaner.

It's true that we ourselves don't implement fine-grained polling. What I missed however is that we use a SelectAll for polling our Listeners. And the SelectAll does such fine-grained polling so that it only polls a listener if the listener registered a waker.
With this PR's change the waker was not registered from within the Listener's Stream implementation anymore, causing the listener to not be polled again.

Will do a PR with the fix. I'll try to think of an alternative solution to handle our wakers in a nice way, or else fall back to the original fix from @mxinden.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants