Share and reuse wakers #15

Merged: 22 commits merged into udoprog:main on Apr 16, 2022
Conversation

@eholk (Collaborator) commented Apr 9, 2022

This PR implements Option 2 from #14, and also builds upon #13.

It works by adding a list of InternalWaker objects to the Shared structure. Each of these internal wakers has a raw pointer to the Shared and the index of the task it wakes. We don't use these objects as wakers directly; instead we wrap a pointer to them in an InternalWakerRef. Creating an InternalWakerRef increments the ref count on the Shared state, and dropping it decrements the count again. This ensures that the Shared state lives as long as any waker references are outstanding (see the long_lived_waker test in wakers.rs).
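
As a rough sketch of the ownership scheme described above (illustrative only, not the PR's actual code; the field layout, atomic orderings, and constructor are simplified assumptions):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Stand-in for unicycle's shared state; the real type also holds the wake
// set, the task storage, and so on.
struct Shared {
    ref_count: AtomicUsize,
}

// One of these is stored per task index inside the Shared structure.
struct InternalWaker {
    shared: *const Shared,
    index: usize,
}

// Handed out as waker data; holding one keeps the Shared state alive.
struct InternalWakerRef(*const InternalWaker);

impl InternalWakerRef {
    // Safety: `waker` must point to a live InternalWaker whose `shared`
    // pointer is valid.
    unsafe fn new(waker: *const InternalWaker) -> Self {
        // Creating a reference bumps the Shared ref count...
        (*(*waker).shared).ref_count.fetch_add(1, Ordering::Relaxed);
        InternalWakerRef(waker)
    }
}

impl Drop for InternalWakerRef {
    fn drop(&mut self) {
        // ...and dropping it releases the count again, so Shared stays alive
        // as long as any waker reference is outstanding.
        unsafe {
            (*(*self.0).shared).ref_count.fetch_sub(1, Ordering::Release);
        }
    }
}
```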

I added some tests to exercise the tricky cases I could come up with, and I made sure these pass when running under Miri.

Benchmarks are a little mixed, but mostly good. See my results here: https://gist.github.com/eholk/f1ec4dd1a6376b1a046be6b058d93835 (I disabled the futures-rs variant to focus on comparing unicycle's current main branch with these changes)

Generally I saw small speed improvements of 3-7%. There were two significant slowdowns though, in the thread scaling tests at 50 and 100 threads. I ran these tests on a 5950X with 32 logical cores, so it seems noteworthy that the slowdown started once the number of threads exceeded the number of cores. If that's the cause, maybe the regression is acceptable, since most async code probably uses a thread-per-core model.

As far as what causes the slowdown at higher thread counts, my guess is that it has to do with the RwLock I used in InternalWakers. There's probably another way we could do this, but I wanted to stick with something simple at first.

Anyway, I'd appreciate any feedback you have on this PR. Thanks!

eholk added 18 commits March 31, 2022 17:17
Using an Arc instead prevents an allocation on every clone of the waker.
Instead, we only need the allocation when converting from a RefWaker
to Internals. We still end up with the same number of atomic ref count
changes, since before we had to increment the count on the Arc<Shared>,
but that is no longer necessary.
This allows wakers to be reused when upgrading a RefWaker to an
InternalWaker.

This is probably sub-optimal, but it serves as a starting point for a
better one. This change adds some new locking overhead, and because we
store weak pointers to the child wakers, they end up having to be
reallocated often. Future changes will store them inline and pass
pointers around.
This avoids reallocating, so wakers are only allocated once per task id.

The current waker tests pass Miri, and the full test suite passes too,
but the code still needs to be cleaned up and probably has some unsoundness
around the waker vector resizing.
Also start adding more safety comments
@eholk (Collaborator, Author) commented Apr 9, 2022

I did some more thread scaling measurements using 20..=40 threads. Here are the raw results: https://gist.github.com/eholk/665a69623189c6e57c4768ce42d870d3

Below is a graph of the performance with this PR applied:

[Graph: thread scaling performance with this PR applied (20..=40 threads)]

It seems like things are relatively flat up through 31 threads, but starting at 32 the performance gets steadily worse. Criterion didn't find any of the changes after 27 threads to be significant, but if the baseline performance stays flat as the thread count scales up, it'd be easy to see why we start seeing regressions at larger thread counts.

@udoprog (Owner) commented Apr 9, 2022

Cheers! I'll look this over in a day or two. The reported performance scaling does look good to me.

One note from the brief skim:

The current solution uses two separate types for the waker: one wrapping a reference, and another wrapping a reference after it's been cloned to maintain the refcount.

In order for the Waker::will_wake optimization to work, the pointer and vtable have to be equal (see the PartialEq derive). Upholding this allows implementations that need to defer waking to avoid extra clones.

Naively it looks like this should be possible by wrapping some uniform waker implementation to avoid dropping it for the duration of the poll. Do you think it is?
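
For reference, Waker::will_wake is a standard library method; the typical use is to skip a clone when a stored waker would already wake the same task, which only works when the data pointer and vtable compare equal:

```rust
use std::task::{Context, Waker};

// Common pattern in futures that stash a waker: only clone when the stored
// waker would not wake the same task as the current one.
fn update_waker(slot: &mut Option<Waker>, cx: &Context<'_>) {
    let needs_update = match slot {
        // Same data pointer and vtable: keep the stored waker, skip the clone.
        Some(existing) => !existing.will_wake(cx.waker()),
        None => true,
    };
    if needs_update {
        *slot = Some(cx.waker().clone());
    }
}
```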

This lock does not need to be held very long, so RwLock is much heavier
than we need.
@eholk (Collaborator, Author) commented Apr 9, 2022

Ah, thanks for pointing out the will_wake optimization; I wasn't aware of it! I think it should be possible to preserve this behavior. One easy option is to get rid of RefWaker entirely and only use InternalWakerRef. Another is that it might be possible to store both wakers as an enum and hope the compiler can pack the tag into some padding bits so the whole thing fits in a pointer; that would make the pointers differ, though, so it'd probably still defeat the will_wake optimization. I'll think about this some more and play around with it early next week.

I just pushed another change that uses a Mutex instead of an RwLock. I realized the RwLock is overkill given that we don't actually need to hold the lock very long. I haven't done real scientific measurements yet, but it looks like the change improved performance.

The more I think about it though, I think we don't even need the lock at all. That's only there to protect the outer Vec in PinVec, but modifications there should only be happening on the thread that owns the FuturesUnordered. I'll explore this idea some more next week too.
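
As a minimal sketch of the locking pattern being discussed (hypothetical names and storage; the real code guards the outer Vec inside PinVec), the point is that the lock only covers the short grow-and-index step, never the wake path itself:

```rust
use std::sync::Mutex;

// Hypothetical per-task waker slot table. The Mutex is only held long enough
// to grow the vector; waking an already-registered task never takes it.
struct WakerSlots {
    slots: Mutex<Vec<u64>>, // stand-in for the real per-task waker storage
}

impl WakerSlots {
    fn ensure_slot(&self, index: usize) {
        let mut slots = self.slots.lock().unwrap();
        if slots.len() <= index {
            slots.resize(index + 1, 0);
        }
        // The guard drops here, so the critical section stays tiny.
    }
}
```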

@eholk (Collaborator, Author) commented Apr 11, 2022

I spent some more time looking at performance and locking today. Switching from the RwLock to a Mutex doesn't seem to have changed much, although I'm inclined to keep the Mutex because it's simpler.

I also tried it without any synchronization. I suspect it's safe, but I haven't been able to convince myself that there's no way to try to modify the vector from multiple threads. That said, it made basically no difference in the performance (at least as far as scaling to more threads is concerned), so I'm inclined to stick with the locking version that we know is safe.

I'm going to look at merging the representation of the different wakers now and hopefully be able to post something new this afternoon.

@eholk (Collaborator, Author) commented Apr 11, 2022

Alright, I just pushed a version that unifies the two modes of wakers using a needs_drop flag to tell them apart. This means there's just one vtable, and also less duplicated code.
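
A rough sketch of what a single vtable with a needs_drop flag can look like (illustrative only; the PR's actual wake path and ref counting are elided, and the field layout is an assumption):

```rust
use std::task::{RawWaker, RawWakerVTable};

// Hypothetical unified waker data: one type and one vtable cover both the
// borrowed (stack) case and the cloned, ref-counted case.
struct UnifiedWaker {
    needs_drop: bool,
    // shared: *const Shared, index: usize, ...
}

static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, wake, wake_by_ref, drop_waker);

unsafe fn clone(data: *const ()) -> RawWaker {
    // The real code would hand out the shared, ref-counted waker here; this
    // sketch just returns the same pointer with the same vtable.
    RawWaker::new(data, &VTABLE)
}

unsafe fn wake(data: *const ()) {
    wake_by_ref(data);
    drop_waker(data);
}

unsafe fn wake_by_ref(_data: *const ()) {
    // Mark the task as woken in the shared wake set (elided).
}

unsafe fn drop_waker(data: *const ()) {
    // Only cloned wakers own a reference count that must be released; the
    // borrowed stack waker has needs_drop set to false, making drop a no-op.
    if (*(data as *const UnifiedWaker)).needs_drop {
        // Release the Shared ref count here (elided).
    }
}
```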

@udoprog (Owner) commented Apr 15, 2022

So I've had a bit more time to look through this now!

It seems like the current impl still has the will_wake problem, since the vtable and pointer have to be equal in order for will_wake to return true. When we call by ref we start out with a pointer on the stack, and once that is cloned it transitions to a pointer into the collection of wakers. But I'm also now realizing that this is an outstanding issue with the implementation being replaced (since a new pointer is allocated for every clone!). To make will_wake work, I think we'd have to ensure up front that a shared, pre-allocated waker is used.

But since the will_wake optimization is already broken on main as noted above, this PR still provides an improvement by sharing wakers. So I'll continue the review looking just at how the sharing improvement works:

Is needs_drop strictly necessary? Wouldn't it be possible to wrap the stack waker in a ManuallyDrop like before and assume that whenever clone is called it should be fetched from the shared waker set and the refcount increased?
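
For context, the ManuallyDrop pattern being referred to (standard library types; the surrounding shape is a guess, not unicycle's actual code) looks roughly like this:

```rust
use std::mem::ManuallyDrop;
use std::task::Waker;

// The borrowed waker lives on the stack only for the duration of the poll.
// Wrapping it in ManuallyDrop means its Drop (which would release a ref
// count) never runs; only clones become owned, ref-counted wakers.
fn with_borrowed_waker<R>(waker: Waker, f: impl FnOnce(&Waker) -> R) -> R {
    let waker = ManuallyDrop::new(waker);
    f(&waker)
}
```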

@eholk (Collaborator, Author) commented Apr 15, 2022

> Is needs_drop strictly necessary? Wouldn't it be possible to wrap the stack waker in a ManuallyDrop like before and assume that whenever clone is called it should be fetched from the shared waker set and the refcount increased?

Yeah, I think you're right. Basically, clone would then just take the else branch and call get_shared_waker unconditionally. The tradeoff is that although we'd save some space in the InternalWaker struct (which is potentially significant because there may be a lot of them), every clone would now have to acquire the lock on all_wakers.wakers instead of just bumping the ref count for the current waker.

Alternatively, we could get rid of the stack wakers. I tried this earlier this week and it had around a 15% slowdown if I remember right, so I abandoned that pretty quickly.

Anyway, I don't think there's a clear best option here, so I'm happy to either keep needs_drop or get rid of it and have clone call get_shared_waker unconditionally.

@eholk (Collaborator, Author) commented Apr 15, 2022

Okay, I went ahead and tried removing the needs_drop flag, and I think it's the clear winner. Performance seems to be about the same, and even a little better in the high thread count cases. The needs_drop flag was also taking up an additional 8 bytes in the InternalWaker struct, which is kind of a lot.

For fun, I also tried using an enum for InternalWaker in hopes that the compiler could do some layout optimization to pack the discriminant into the unused values of the shared field, but it looks like the compiler's not smart enough to do that yet.
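
A quick illustration of why that layout optimization doesn't happen (not from the PR; the variant names are made up): rustc treats raw pointers as having no invalid bit patterns, so there's no niche to hide the discriminant in, and the enum ends up a word larger than a bare pointer:

```rust
use std::mem::size_of;

// Two waker flavors that each carry a raw pointer to the shared state.
enum WakerKind {
    Borrowed(*const ()),
    Cloned(*const ()),
}

fn main() {
    // On a 64-bit target this prints 8 then 16: unlike Option<&T>, which gets
    // a niche because references are never null, the discriminant here needs
    // its own (alignment-padded) word.
    println!("{}", size_of::<*const ()>());
    println!("{}", size_of::<WakerKind>());
}
```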

@udoprog (Owner) commented Apr 16, 2022

Looks great!

I'd want to try out the pre-allocating variant at some point in the future to make will_wake work. I.e. something that allocates waker storage once futures are added to the collection. For now this is a clear improvement. Thank you!

udoprog merged commit 13cf17e into udoprog:main on Apr 16, 2022
@eholk (Collaborator, Author) commented Apr 18, 2022

Awesome, thanks!
