Share and reuse wakers #15
Conversation
Using an Arc instead prevents an allocation on every clone of the waker. Instead, we only need the allocation when converting from a RefWaker to Internals. We still end up with the same number of atomic ref count changes: previously we had to increment the count on the Arc<Shared>, but that is no longer necessary.
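As a rough illustration of the idea (not the PR's actual code; the PR keeps a raw pointer to `Shared` rather than an `Arc`, and the names below are simplified), cloning an `Arc`-backed waker only bumps a reference count instead of allocating:

```rust
use std::sync::Arc;
use std::task::{Wake, Waker};

// Simplified stand-ins for the shared executor state and a per-task waker.
struct Shared { /* task queues, wake sets, ... */ }

struct InternalWaker {
    shared: Arc<Shared>,
    task: usize,
}

impl Wake for InternalWaker {
    fn wake(self: Arc<Self>) {
        // Mark `self.task` as ready in `self.shared`.
        let _ = (&self.shared, self.task);
    }
}

// The one allocation happens here; every `Waker::clone` afterwards is
// just an atomic increment on the `Arc`.
fn waker_for(shared: &Arc<Shared>, task: usize) -> Waker {
    Waker::from(Arc::new(InternalWaker { shared: shared.clone(), task }))
}
```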
This allows wakers to be reused when upgrading a RefWaker to an InternalWaker. This is probably sub-optimal, but it serves as a starting point for a better approach. This change adds some new locking overhead, and because we store weak pointers to the child wakers, they end up having to be reallocated often. Future changes will store them inline and pass pointers around.
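A rough sketch of the weak-pointer variant described here (illustrative names only, not the PR's types): because only `Weak` references are kept, the waker has to be re-allocated whenever the last strong reference to it has gone away.

```rust
use std::sync::{Arc, Mutex, Weak};

struct TaskWaker { task: usize }

// Keeps only Weak references, so a slot has to be re-populated (and the
// waker re-allocated) once all strong references have been dropped.
#[derive(Default)]
struct WeakCache {
    slots: Mutex<Vec<Weak<TaskWaker>>>,
}

impl WeakCache {
    fn get(&self, task: usize) -> Arc<TaskWaker> {
        let mut slots = self.slots.lock().unwrap();
        if slots.len() <= task {
            slots.resize_with(task + 1, Weak::new);
        }
        if let Some(strong) = slots[task].upgrade() {
            return strong;
        }
        let strong = Arc::new(TaskWaker { task });
        slots[task] = Arc::downgrade(&strong);
        strong
    }
}
```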
This avoids reallocating, so wakers are only allocated once per task id. The current waker tests pass Miri, and the full test suite passes too, but the code still needs to be cleaned up and probably has some unsoundness around the waker vector resizing.
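The once-per-task-id behaviour can be pictured as the cache holding strong references instead (again just a sketch with made-up names, not the PR's actual representation):

```rust
use std::sync::{Arc, Mutex};
use std::task::Waker;

// Holding the Waker itself (a strong reference) means each slot is
// allocated at most once for the lifetime of the cache.
#[derive(Default)]
struct WakerCache {
    slots: Mutex<Vec<Option<Waker>>>,
}

impl WakerCache {
    fn get_or_insert(&self, task: usize, make: impl FnOnce() -> Waker) -> Waker {
        let mut slots = self.slots.lock().unwrap();
        if slots.len() <= task {
            slots.resize(task + 1, None);
        }
        slots[task].get_or_insert_with(make).clone()
    }
}
```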
…ed when they need to be
Also start adding more safety comments
I did some more thread scaling measurements using 20..=40 threads. Here are the raw results: https://gist.github.com/eholk/665a69623189c6e57c4768ce42d870d3 Below is a graph of the performance with this PR applied. It seems like things are relatively flat up through 31 threads, but starting at 32 the performance gets steadily worse. Criterion didn't find any of the changes after 27 threads to be significant, but if the baseline performance stays flat as the thread count scales up, it'd be pretty easy to see why we start seeing regressions at larger numbers of threads.
Cheers! I'll look this over in a day or two. The reported performance scaling does look good to me. One note from a brief skim: the current solution uses two separate types for the waker, one wrapping a reference and another wrapping a reference after it's been cloned to maintain the refcount. In order for the Waker::will_wake optimization to work, the pointer and vtable have to be equal (see the Waker::will_wake documentation). Naively it looks like this should be possible by wrapping some uniform waker implementation so it isn't dropped for the duration of the poll. Do you think it is?
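For context on why the pointer and vtable equality matters, a typical caller-side use of `will_wake` looks something like the sketch below (illustrative, not unicycle code). If two different wrapper types produce different vtables, the check never succeeds and the waker is re-cloned on every poll.

```rust
use std::task::{Context, Waker};

// Only re-clone the waker when the stored one would not wake the same
// task; `will_wake` compares the data pointer and the vtable.
fn register(slot: &mut Option<Waker>, cx: &Context<'_>) {
    match slot {
        Some(existing) if existing.will_wake(cx.waker()) => {}
        _ => *slot = Some(cx.waker().clone()),
    }
}
```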
This lock does not need to be held very long, so an RwLock is much heavier than we need.
Ah, thanks for pointing out the RwLock being heavier than we need. I just pushed another change that uses a lighter-weight lock. The more I think about it though, I think we don't even need the lock at all. That's only there to protect the outer vector of wakers.
I spent some more time looking at performance and locking today. I also tried it without any synchronization at all. I suspect it's safe, but I haven't been able to convince myself that there's no way to end up modifying the vector from multiple threads. That said, it made basically no difference in the performance (at least as far as scaling to more threads is concerned), so I'm inclined to stick with the locking version that we know is safe. I'm going to look at merging the representation of the different wakers now and hopefully be able to post something new this afternoon.
Alright, I just pushed a version that unifies the two modes of wakers into a single implementation.
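One way to picture the two modes being unified here is a Cow-style enum over the shared state. This is only an illustration of the idea, not how the PR actually represents it:

```rust
use std::sync::Arc;

// Either borrow the shared state for the duration of a poll, or own a
// reference-counted handle that can outlive the poll.
enum WakerMode<'a, S> {
    Ref(&'a S),
    Owned(Arc<S>),
}

impl<'a, S> WakerMode<'a, S> {
    fn shared(&self) -> &S {
        match self {
            WakerMode::Ref(s) => s,
            WakerMode::Owned(s) => s,
        }
    }
}
```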
So I've had a bit more time to look through this now! It seems like the current impl still keeps the separate reference-style wakers around. But since the shared wakers are now cheap to clone, is that extra path still needed?
Yeah, I think you're right. Alternatively, we could get rid of the stack wakers entirely. I tried this earlier this week and that had around a 15% slowdown if I remember right, so I abandoned that pretty quickly. Anyway, I don't think there's a clear best option here, so I'm happy to either keep the stack wakers or remove them.
Okay, I went ahead and tried out removing the stack wakers. For fun, I also tried using an enum for the two waker modes.
Looks great! I'd want to try out the pre-allocating variant at some point in the future.
Awesome, thanks!
This PR implements Option 2 from #14, and also builds upon #13.
It works by adding a list of `InternalWaker` objects to the `Shared` structure. Each of these internal wakers has a raw pointer to the `Shared` and a task index that it wakes. We don't use these objects as a waker directly, but instead wrap a pointer to them in `InternalWakerRef`. Creating an `InternalWakerRef` increments the ref count for the `Shared` state, and when the `InternalWakerRef` is dropped we decrement the `Shared` state ref count. This ensures that the `Shared` state will live as long as there are any references to a waker outstanding (see the `long_lived_waker` test in wakers.rs).

I added some tests to exercise tricky cases I could come up with, and I made sure these pass running under Miri.
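To make the shape of this concrete, here is a heavily simplified sketch of a waker vtable whose clone and drop adjust a ref count on the shared state. This is not the PR's actual code: the names are approximate, the wake functions are stubbed out, and a real implementation needs proper cleanup (and an acquire fence) when the count reaches zero.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::task::{RawWaker, RawWakerVTable, Waker};

struct Shared {
    refs: AtomicUsize,
    // task queues, wake sets, the list of internal wakers, ...
}

// One of these exists per task; the Waker's data pointer points at it.
struct InternalWaker {
    shared: *const Shared,
    task: usize,
}

static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, wake, wake_by_ref, drop_waker);

unsafe fn clone(data: *const ()) -> RawWaker {
    let iw = &*(data as *const InternalWaker);
    (*iw.shared).refs.fetch_add(1, Ordering::Relaxed);
    RawWaker::new(data, &VTABLE)
}

unsafe fn wake_by_ref(data: *const ()) {
    let iw = &*(data as *const InternalWaker);
    // Mark `iw.task` as ready in `*iw.shared`.
    let _ = iw.task;
}

unsafe fn wake(data: *const ()) {
    wake_by_ref(data);
    drop_waker(data);
}

unsafe fn drop_waker(data: *const ()) {
    let iw = &*(data as *const InternalWaker);
    // Release this waker's reference; a real implementation would also
    // free the shared state once the count hits zero.
    (*iw.shared).refs.fetch_sub(1, Ordering::Release);
}

// Handing out a waker for a task takes a new reference on the shared state,
// mirroring what creating an `InternalWakerRef` does in the description above.
unsafe fn waker_for(iw: *const InternalWaker) -> Waker {
    (*(*iw).shared).refs.fetch_add(1, Ordering::Relaxed);
    Waker::from_raw(RawWaker::new(iw as *const (), &VTABLE))
}
```

Because every clone shares the same data pointer and vtable, `Waker::will_wake` keeps working across clones of the same task's waker.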
Benchmarks are a little mixed, but mostly good. See my results here: https://gist.github.com/eholk/f1ec4dd1a6376b1a046be6b058d93835 (I disabled the futures-rs variant to focus on comparing unicycle's current main branch with these changes)
Generally I saw small speed improvements of 3-7%. There were two significant slowdowns though, in the thread scaling tests once we got to 50 and 100 threads. I ran these tests on a 5950X with 32 logical cores, so it seems noteworthy that the slowdown started once the number of threads exceeded the number of cores. If that's really the cause, maybe the regression is acceptable, since most async code probably uses a thread-per-core model.
As far as what causes the slowdown at higher thread counts, I'm guessing it has to do with the `RwLock` I used in `InternalWakers`. I'm guessing there's another way we could do this, but I wanted to stick with something simple at first.

Anyway, I'd appreciate any feedback you have on this PR. Thanks!