-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(quic): Trigger Quic as Transport wakeup on dial #3306
Conversation
Scenario: rust-libp2p node A dials rust-libp2p node B. B listens on a QUIC address. A dials B via the `libp2p-quic` `Transport` wrapped in a `libp2p-dns` `Transport`. Note that `libp2p-dns` in itself is not relevant here. Only the fact that `libp2p-dns` delays a dial is relevant, i.e. that it first does other async stuff (DNS lookup) before creating the QUIC dial. In fact, dialing an IP address through the DNS `Transport` where no DNS resolution is needed triggers the below just fine. 1. A calls `Swarm::dial` which creates a `libp2p-dns` dial. 2. That dial is spawned onto the connection `Pool`, thus starting the DNS resolution. 3. A continuously calls `Swarm::poll`. 4. `libp2p-quic` `Transport::poll` is called, finding no dialers in `self.dialer` given that the spawned dial is still only resolving the DNS address. 5. On the spawned connection task: 1. The DNS resolution finishes. 2. Thus calling `Transport::dial` on `libp1p-quic` (note that the DNS dial has a clone of the QUIC `Transport` wrapped in an `Arc<Mutex<_>>`). 3. That adds a dialer to `self.dialer`. Note that there are no listeners, i.e. `Swarm::listen_on` was never called. 4. `DialerState::new_dial` is called which adds a message to `self.pending_dials` and wakes `self.waker`. Given that on the last `Transport::poll` there was no `self.dialer`, that waker is empty. Result: The message is stuck in the `DialerState::pending_dials`. The message is never send to the endpoint driver. The dial never succeeds. This commit includes a hot-fix. It demonstrates the above issue more than it represents a solid fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed bug description @mxinden, very much appreciated!
Looking through the code again, we basically always want to wake our Transport::poll
if a new dial happens.
I think we can remove the waker
from DialerState
and instead in GenTransport::dial
simply always call dial_waker.wake()
, independently of whether we then use a listener/ existing dialer or create a new dialer. GenTransport::poll
will poll all of them again anyway.
Only if long-term we switch to a more fine-grained polling logic (i.e. only poll something again if this structure triggered the waker) this would have to be reconsidered. But for now I think having a single waker is cleaner.
Sounds good to me. See recent changes. @elenaf9 thanks for the review. Would you mind taking another look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more comment, and clippy is complaining. Otherwise this looks good to me.
Thanks @mxinden!
/// See https://github.com/libp2p/rust-libp2p/pull/3306 for context. | ||
#[cfg(feature = "async-std")] | ||
#[async_std::test] | ||
async fn wrapped_with_dns() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🙏
// Note that the dial is spawned on a different task than the transport allowing the transport | ||
// task to poll the transport once and then suspend, waiting for the wakeup from the dial. | ||
let dial = async_std::task::spawn({ | ||
let dial = b_transport.dial(a_addr).unwrap(); | ||
async { dial.await.unwrap().0 } | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: isn't it sufficient that a new task is already spawned for polling the transport future below?
We could just create the dial-future here, and then drive it in the future::join
below. If we already spawn a task for the dial here, couldn't it happen that the dial resolves before the executor starts polling the transport task below (since with QUIC, the endpoint runs in a separate background task and thus things "run" even if the transport is not being polled)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we already spawn a task for the dial here, couldn't it happen that the dial resolves before the executor starts polling the transport task below (since with QUIC, the endpoint runs in a separate background task and thus things "run" even if the transport is not being polled)?
Increased the Delay
to 100ms
, thus I don't think the above is probable. Will keep it as is. @elenaf9 please comment in case you feel strongly about this.
transports/quic/tests/smoke.rs
Outdated
async fn wrapped_with_dns() { | ||
let _ = env_logger::try_init(); | ||
|
||
struct FakeDns(Arc<Mutex<Boxed<(PeerId, StreamMuxerBox)>>>); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Call this DialDelay
maybe?
transports/quic/tests/smoke.rs
Outdated
// Simulate DNS lookup. Giving the `Transport::poll` the chance to return | ||
// `Poll::Pending` and thus suspending its task, waiting for a wakeup from the dial | ||
// on the inner transport below. | ||
Delay::new(Duration::from_millis(1)).await; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit short. Setting this to 100 wouldn't increase the test run time much because they all run in parallel and would make me feel easier that this actually makes a difference :)
<Quic as Transport>
wakeup on dial
Ever since this was merged, the |
I'd recommend we do two things:
|
This reverts commit 7665e74.
Revert PR submitted here: #3322 |
Adding |
Sorry, this was my bad. The issue is in the above solution I proposed i.e. to only have a single waker for the transport instead of having the waker in the
It's true that we ourselves don't implement fine-grained polling. What I missed however is that we use a Will do a PR with the fix. I'll try to think of an alternative solution to handle our wakers in a nice way, or else fall back to the original fix from @mxinden. |
Description
Scenario: rust-libp2p node A dials rust-libp2p node B. B listens on a QUIC address. A dials B via the
libp2p-quic
Transport
wrapped in alibp2p-dns
Transport
.Note that
libp2p-dns
in itself is not relevant here. Only the fact thatlibp2p-dns
delays a dial is relevant, i.e. that it first does other async stuff (DNS lookup) before creating the QUIC dial. In fact, dialing an IP address through the DNSTransport
where no DNS resolution is needed triggers the below just fine.Swarm::dial
which creates alibp2p-dns
dial.Pool
, thus starting the DNS resolution.Swarm::poll
.libp2p-quic
Transport::poll
is called, finding no dialers inself.dialer
given that the spawned dial is still only resolving the DNS address.Transport::dial
onlibp1p-quic
(note that the DNS dial has a clone of the QUICTransport
wrapped in anArc<Mutex<_>>
).self.dialer
. Note that there are no listeners, i.e.Swarm::listen_on
was never called.DialerState::new_dial
is called which adds a message toself.pending_dials
and wakesself.waker
. Given that on the lastTransport::poll
there was noself.dialer
, that waker is empty.Result: The message is stuck in the
DialerState::pending_dials
. The message is never send to the endpoint driver. The dial never succeeds.This commit fixes the above, waking the
<Quic as Transport>:poll
method.Notes
Steps to reproduce:
Server
cargo run
Client
RUST_LOG=debug cargo run -- direct --address /ip4/127.0.0.1/udp/4001/quic
Links to any relevant issues
Open Questions
Change checklist