Share and reuse wakers #15

Merged: 22 commits merged into udoprog:main on Apr 16, 2022
Conversation

@eholk (Collaborator) commented Apr 9, 2022

This PR implements Option 2 from #14, and also builds upon #13.

It works by adding a list of InternalWaker objects to the Shared structure. Each of these internal wakers has a raw pointer to the Shared and the index of the task it wakes. We don't use these objects as wakers directly; instead we wrap a pointer to them in an InternalWakerRef. Creating an InternalWakerRef increments the ref count on the Shared state, and dropping it decrements the count again. This ensures that the Shared state lives as long as any waker references are outstanding (see the long_lived_waker test in wakers.rs).
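
As a rough sketch of the ownership scheme described above (illustrative only, not the PR's actual code; the field layout, atomic orderings, and constructor are simplified assumptions):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Stand-in for unicycle's shared state; the real type also holds the wake
// set, the task storage, and so on.
struct Shared {
    ref_count: AtomicUsize,
}

// One of these is stored per task index inside the Shared structure.
struct InternalWaker {
    shared: *const Shared,
    index: usize,
}

// Handed out as waker data; holding one keeps the Shared state alive.
struct InternalWakerRef(*const InternalWaker);

impl InternalWakerRef {
    // Safety: `waker` must point to a live InternalWaker whose `shared`
    // pointer is valid.
    unsafe fn new(waker: *const InternalWaker) -> Self {
        // Creating a reference bumps the Shared ref count...
        (*(*waker).shared).ref_count.fetch_add(1, Ordering::Relaxed);
        InternalWakerRef(waker)
    }
}

impl Drop for InternalWakerRef {
    fn drop(&mut self) {
        // ...and dropping it releases the count again, so Shared stays alive
        // as long as any waker reference is outstanding.
        unsafe {
            (*(*self.0).shared).ref_count.fetch_sub(1, Ordering::Release);
        }
    }
}
```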

I added some tests to exercise the tricky cases I could come up with, and I made sure these pass when running under Miri.

Benchmarks are a little mixed, but mostly good. See my results here: https://gist.github.com/eholk/f1ec4dd1a6376b1a046be6b058d93835 (I disabled the futures-rs variant to focus on comparing unicycle's current main branch with these changes)

Generally I saw small speed improvements of 3-7%. There were two significant slowdowns though, in the thread scaling tests at 50 and 100 threads. I ran these tests on a 5950X with 32 logical cores, so it seems noteworthy that the slowdown started once the number of threads exceeded the number of cores. If that's the cause, maybe the regression is acceptable, since most async code probably uses a thread-per-core model.

As far as what causes the slowdown at higher thread counts, my guess is that it has to do with the RwLock I used in InternalWakers. There's probably another way we could do this, but I wanted to stick with something simple at first.

Anyway, I'd appreciate any feedback you have on this PR. Thanks!

eholk added 18 commits March 31, 2022 17:17
Using an Arc instead prevents an allocation on every clone of the waker.
Instead, we only need the allocation when converting from a RefWaker
to Internals. We still end up with the same number of atomic ref count
changes, since before we had to increment the count on the Arc<Shared>,
but that is no longer necessary.
This allows wakers to be reused when upgrading a RefWaker to an
InternalWaker.

This is probably sub-optimal, but it serves as a starting point for a
better one. This change adds some new locking overhead, and because we
store weak pointers to the child wakers, they end up having to be
reallocated often. Future changes will store them inline and pass
pointers around.
This avoids reallocating, so wakers are only allocated once per task id.

The current waker tests pass Miri, and the full test suite passes too,
but the code still needs to be cleaned up and probably has some unsoundness
around the waker vector resizing.
Also start adding more safety comments
@eholk (Collaborator, Author) commented Apr 9, 2022

I did some more thread scaling measurements using 20..=40 threads. Here are the raw results: https://gist.github.com/eholk/665a69623189c6e57c4768ce42d870d3

Below is a graph of the performance with this PR applied:

[Graph: thread scaling performance with this PR applied (20..=40 threads)]

It seems like things are relatively flat up through 31 threads, but starting at 32 the performance gets steadily worse. Criterion didn't find any of the changes after 27 threads to be significant, but if the baseline performance stays flat as the thread count scales up, it'd be easy to see why we start seeing regressions at larger thread counts.

@udoprog (Owner) commented Apr 9, 2022

Cheers! I'll look this over in a day or two. The reported performance scaling does look good to me.

One note from the brief skim:

The current solution uses two separate types for the waker: one wrapping a reference, and another wrapping a reference after it's been cloned to maintain the refcount.

In order for the Waker::will_wake optimization to work, the pointer and vtable have to be equal (see the PartialEq derive). Upholding this allows implementations that need to defer waking to avoid extra clones.

Naively it looks like this should be possible by wrapping some uniform waker implementation to avoid dropping it for the duration of the poll. Do you think it is?
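
For reference, Waker::will_wake is a standard library method; the typical use is to skip a clone when a stored waker would already wake the same task, which only works when the data pointer and vtable compare equal:

```rust
use std::task::{Context, Waker};

// Common pattern in futures that stash a waker: only clone when the stored
// waker would not wake the same task as the current one.
fn update_waker(slot: &mut Option<Waker>, cx: &Context<'_>) {
    let needs_update = match slot {
        // Same data pointer and vtable: keep the stored waker, skip the clone.
        Some(existing) => !existing.will_wake(cx.waker()),
        None => true,
    };
    if needs_update {
        *slot = Some(cx.waker().clone());
    }
}
```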

This lock does not need to be held very long, so RwLock is much heavier
than we need.
@eholk (Collaborator, Author) commented Apr 9, 2022

Ah, thanks for pointing out the will_wake optimization; I wasn't aware of it! I think it should be possible to preserve this behavior. One easy option is to get rid of RefWaker entirely and only use InternalWakerRef. Another is that it might be possible to store both wakers as an enum and hope the compiler can pack the tag into some padding bits so the whole thing fits in a pointer; that would make the pointers differ, though, so it'd probably still defeat the will_wake optimization. I'll think about this some more and play around with it early next week.

I just pushed another change that uses a Mutex instead of an RwLock. I realized the RwLock is overkill given that we don't actually need to hold the lock very long. I haven't done real scientific measurements yet, but it looks like the change improved performance.

The more I think about it though, I think we don't even need the lock at all. That's only there to protect the outer Vec in PinVec, but modifications there should only be happening on the thread that owns the FuturesUnordered. I'll explore this idea some more next week too.
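
As a minimal sketch of the locking pattern being discussed (hypothetical names and storage; the real code guards the outer Vec inside PinVec), the point is that the lock only covers the short grow-and-index step, never the wake path itself:

```rust
use std::sync::Mutex;

// Hypothetical per-task waker slot table. The Mutex is only held long enough
// to grow the vector; waking an already-registered task never takes it.
struct WakerSlots {
    slots: Mutex<Vec<u64>>, // stand-in for the real per-task waker storage
}

impl WakerSlots {
    fn ensure_slot(&self, index: usize) {
        let mut slots = self.slots.lock().unwrap();
        if slots.len() <= index {
            slots.resize(index + 1, 0);
        }
        // The guard drops here, so the critical section stays tiny.
    }
}
```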

@eholk (Collaborator, Author) commented Apr 11, 2022

I spent some more time looking at performance and locking today. Switching from the RwLock to a Mutex doesn't seem to have changed much, although I'm inclined to keep the Mutex because it's simpler.

I also tried it without any synchronization. I suspect it's safe, but I haven't been able to convince myself that there's no way to try to modify the vector from multiple threads. That said, it made basically no difference in the performance (at least as far as scaling to more threads is concerned), so I'm inclined to stick with the locking version that we know is safe.

I'm going to look at merging the representation of the different wakers now and hopefully be able to post something new this afternoon.

@eholk (Collaborator, Author) commented Apr 11, 2022

Alright, I just pushed a version that unifies the two modes of wakers using a needs_drop flag to tell them apart. This means there's just one vtable, and also less duplicated code.
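
A rough sketch of what a single vtable with a needs_drop flag can look like (illustrative only; the PR's actual wake path and ref counting are elided, and the field layout is an assumption):

```rust
use std::task::{RawWaker, RawWakerVTable};

// Hypothetical unified waker data: one type and one vtable cover both the
// borrowed (stack) case and the cloned, ref-counted case.
struct UnifiedWaker {
    needs_drop: bool,
    // shared: *const Shared, index: usize, ...
}

static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, wake, wake_by_ref, drop_waker);

unsafe fn clone(data: *const ()) -> RawWaker {
    // The real code would hand out the shared, ref-counted waker here; this
    // sketch just returns the same pointer with the same vtable.
    RawWaker::new(data, &VTABLE)
}

unsafe fn wake(data: *const ()) {
    wake_by_ref(data);
    drop_waker(data);
}

unsafe fn wake_by_ref(_data: *const ()) {
    // Mark the task as woken in the shared wake set (elided).
}

unsafe fn drop_waker(data: *const ()) {
    // Only cloned wakers own a reference count that must be released; the
    // borrowed stack waker has needs_drop set to false, making drop a no-op.
    if (*(data as *const UnifiedWaker)).needs_drop {
        // Release the Shared ref count here (elided).
    }
}
```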

@udoprog (Owner) commented Apr 15, 2022

So I've had a bit more time to look through this now!

It seems like the current impl still has the will_wake problem, since the vtable and pointer have to be equal in order for will_wake to return true. When we call by ref we start out with a pointer on the stack, and once that is cloned it transitions to a pointer into the collection of wakers. But I'm also now realizing that this is an outstanding issue with the implementation being replaced (since a new pointer is allocated for every clone!). To make will_wake work, I think we'd have to ensure up front that a shared, pre-allocated waker is used.

But since the will_wake optimization is already broken on main as noted above, this PR still provides an improvement by sharing wakers. So I'll continue the review looking just at how the sharing improvement works:

Is needs_drop strictly necessary? Wouldn't it be possible to wrap the stack waker in a ManuallyDrop like before and assume that whenever clone is called it should be fetched from the shared waker set and the refcount increased?
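
For context, the ManuallyDrop pattern being referred to (standard library types; the surrounding shape is a guess, not unicycle's actual code) looks roughly like this:

```rust
use std::mem::ManuallyDrop;
use std::task::Waker;

// The borrowed waker lives on the stack only for the duration of the poll.
// Wrapping it in ManuallyDrop means its Drop (which would release a ref
// count) never runs; only clones become owned, ref-counted wakers.
fn with_borrowed_waker<R>(waker: Waker, f: impl FnOnce(&Waker) -> R) -> R {
    let waker = ManuallyDrop::new(waker);
    f(&waker)
}
```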

@eholk (Collaborator, Author) commented Apr 15, 2022

> Is needs_drop strictly necessary? Wouldn't it be possible to wrap the stack waker in a ManuallyDrop like before and assume that whenever clone is called it should be fetched from the shared waker set and the refcount increased?

Yeah, I think you're right. Basically, clone would then just take the else branch and call get_shared_waker unconditionally. The tradeoff is that although we'd save some space in the InternalWaker struct (which is potentially significant because there may be a lot of them), every clone would now have to acquire the lock on all_wakers.wakers instead of just bumping the ref count for the current waker.

Alternatively, we could get rid of the stack wakers. I tried this earlier this week and it had around a 15% slowdown if I remember right, so I abandoned that pretty quickly.

Anyway, I don't think there's a clear best option here, so I'm happy to either keep needs_drop or get rid of it and have clone call get_shared_waker unconditionally.

@eholk (Collaborator, Author) commented Apr 15, 2022

Okay, I went ahead and tried removing the needs_drop flag, and I think it's the clear winner. Performance seems to be about the same, and even a little better in the high thread count cases. The needs_drop flag was also taking up an additional 8 bytes in the InternalWaker struct, which is kind of a lot.

For fun, I also tried using an enum for InternalWaker in hopes that the compiler could do some layout optimization to pack the discriminant into the unused values of the shared field, but it looks like the compiler's not smart enough to do that yet.
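
A quick illustration of why that layout optimization doesn't happen (not from the PR; the variant names are made up): rustc treats raw pointers as having no invalid bit patterns, so there's no niche to hide the discriminant in, and the enum ends up a word larger than a bare pointer:

```rust
use std::mem::size_of;

// Two waker flavors that each carry a raw pointer to the shared state.
enum WakerKind {
    Borrowed(*const ()),
    Cloned(*const ()),
}

fn main() {
    // On a 64-bit target this prints 8 then 16: unlike Option<&T>, which gets
    // a niche because references are never null, the discriminant here needs
    // its own (alignment-padded) word.
    println!("{}", size_of::<*const ()>());
    println!("{}", size_of::<WakerKind>());
}
```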

@udoprog (Owner) commented Apr 16, 2022

Looks great!

I'd want to try out the pre-allocating variant at some point in the future to make will_wake work. I.e. something that allocates waker storage once futures are added to the collection. For now this is a clear improvement. Thank you!

udoprog merged commit 13cf17e into udoprog:main on Apr 16, 2022
@eholk (Collaborator, Author) commented Apr 18, 2022

Awesome, thanks!
