-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Reduce allocations when async methods yield #13105
Conversation
@@ -575,10 +446,10 @@ public void SetResult(TResult result) | |||
{ | |||
// Get the currently stored task, which will be non-null if get_Task has already been accessed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've gone back and forth on whether to explicitly zero out the StateMachine field here (and in SetException). On the one hand, it would potentially release any references held by the state machine object earlier than they'd otherwise be released. On the other hand, it would have a cost of zero'ing out the memory, and in most cases, the task object itself is about to get dropped, so it wouldn't actually help in most cases. Open to other opinions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you zero it out, that could impede any "interesting" state captured by the state machine in the debugger, right? Is that worth keeping around? Esp if an Exception happens, couldn't the state machine state could help?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could, potentially.
You keep weakening my AysncLocal is evil generalization 😶 |
Mwahahaha |
As an aside...
Fire-and-forget functions allocate if you catch assign the return Task. Just a gap worth thinking about. e.g. maybe a lighter Task handle; that doesn't allocate until a property is inspected? So if you did var task0 = ThingAsync();
var task1 = ThingAsync();
var task2 = ThingAsync();
var task3 = ThingAsync();
var task4 = ThingAsync();
var task5 = ThingAsync();
// WhenAll inspects and allocates;
// however may have been time for some of them to be regular pre-completed Tasks
await Task.WhenAll(task0, task1, task2, task3, task4); Extra indirection might slow things down though. Just a thought... |
@gregg-miskelly, I've been validating debugger behavior with these changes. I think I've fixed the issues I've found, with one exception. It looks like the debugger's async step-out might not be implemented as it was designed to be. The builders expose a SetNotificationForWaitCompletion method that the debugger is meant to use as part of the step-out implementation, but from observed behavior it looks like the debugger might be using "builder.Task.SetNotificationForWaitCompletion(bool)" instead of "builder.SetNotificationForWaitCompletion(bool)". Can you confirm? And if so, can we fix the debugger to use the method on the builder instead? If we don't make that change, this PR will break some step-out situations, and to fix that, I'd need to add non-trivial allocation overhead for a common non-debugger scenario. |
Believe me when I tell you that we are VERY EAGER to get our hands on this. 5 out of 10 of our top allocators are async machinery related. |
👍 |
@gregg-miskelly, thanks for the offline conversation. I fixed the async call stack issue by renaming back the helper type to the known AsyncMethodBuilderCore name the debugger cares about. Other than the change the debugger should make from using |
What is status of this PR? Are we blocked on @gregg-miskelly review? |
Yes, Gregg has been busy on some important work and I'm waiting for verification from him that allowing the debugger to prefer |
c3e6831
to
63d1932
Compare
@gregg-miskelly, have you been able to double-check this? |
@r-ramesh could you take a look since I keep failing to get to this? |
@@ -721,10 +699,6 @@ internal void SetNotificationForWaitCompletion(bool enabled) | |||
[MethodImpl(MethodImplOptions.AggressiveInlining)] // method looks long, but for a given TResult it results in a relatively small amount of asm | |||
private Task<TResult> GetTaskForResult(TResult result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs to change to internal static
to not clash with #13907
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nvm, merge will catch it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed the conflict
The first time a Task-based method yields, today there are four allocations: - The Task returned from the method - The state machine object boxed to the heap - An Action delegate that'll be passed to awaiters - A MoveNextRunner that stores state machine and the ExecutionContext, and has the method that the Action actually references For a simple async method, e.g. ```C# static async Task DoWorkAsync() { await Task.Yield(); } ``` when it yields the first time, we allocate four objects equaling 232 bytes (64-bit). This PR changes the scheme to use fewer allocations and less memory. With the new version, there are only two allocations: - A type derived from Task - An Action delegate that'll be passed to awaiters This doesn't obviate the need for the state machine, but rather than boxing the object normally, we simply store the state machine onto the Task-derived type, with the state machine strongly-typed in a property on the type. Further, the captured ExecutionContext is stored onto that same object, rather than requiring a separate MoveNextRunner to be allocated, and the delegate can point to that Task-derived type. This also makes the builder types thinner, and since the builders are stored in the state machine, that in turn makes the allocation smaller. With this new scheme and that same example from earlier, rather than costing 4 allocations and 232 bytes, it costs 2 allocations and 176 bytes. It also helps further in another common case. Previously the Task and state machine object would only be allocated once, but the Action and MoveNextRunner would be allocated and then could only be reused for subsequent awaits if the current ExecutionContext was the default. If, however, the current ExecutionContext was not the default, every await would end up allocating another Action and MoveNextRunner, for 2 allocations and 56 bytes on each await. With the new design, those are eliminated, such that even if a non-default ExecutionContext is in play, and even if it changes on between awaits, the original allocations are still used. There's also a small debugging benefit to this change: the resulting Task object now also contains the state machine data, which means if you have a reference to the Task, you can easily in the debugger see the state associated with the async method. Previously you would need to use a tool like sos to find the async state machine object that referenced the relevant task. One hopefully minor downside to the change is that the Task object returned from an async method is now larger than it used to be, with all of the state machine's state on it. Generally this won't matter, as you await a Task and then drop it, so the extra memory pressure doesn't exist for longer than it used to. However, if you happen to hold on to that task for a prolonged period of time, you'll now be keeping alive a larger object than you previously were, including any objects lifted "local" variables in the async method referenced. There is also a very corner case change in behavior: we no longer call SetStateMachine on the builder object. This was always infrastructure code and never meant to be used by end-user code directly. The implementation in .NET Native already doesn't call it.
63d1932
to
d56eff4
Compare
I'm very interested in this change, as this is a major area of allocations for us; and its far more allocations for anyone that uses AsyncLocal Alas I can't get dotMemory to record when build coreclr for source atm, which is I think a different issue that's being picked up elsewhere. As an alternative; using, using dotTrace timeline everything is dwarfed by the ETW allocations
So... bearing that in mind... |
Apologise for the format and non-interpreted results; will have something better in the morning. coreclr.dll allocations are the ETW allocations For roughly the same time span (75 secs) the before allocation tree from System.Net.WebSockets.ManagedWebSocket+ReceiveAsyncPrivate looks like the following
|
For roughly the same time span (75 secs) the after allocation tree from System.Net.WebSockets.ManagedWebSocket+ReceiveAsyncPrivate looks like the following
|
Its down 500MB; but that might just be what was going on at the time But I thought I'd give a preview in case @stephentoub case see if the paths are doing what he expected. Will look deeper in the morning. |
I'll look at what happens if AsyncLocal is used, as that moves everything off the default context and currently allocations skyrocket |
For reference to the AsyncLocal and non-default context You can see the allocations of |
I can't tell from the shared output: how many objects were there contributing to that? I'd guestimate you would see a 50-60 byte savings per async call that completes asynchronously and that uses the default context (and potentially much, much more savings in a non-default context). |
@r-ramesh, @gregg-miskelly, I would like to merge this. Can I go ahead and do so? |
@stephentoub Sorry for the delay, let me take a quick look at the debugger implementation. I will keep you updated. |
Looks good to me. |
Thanks, @r-ramesh. |
The first time a Task-based async method yields, today there are four allocations:
For a simple async method, e.g.
when it yields the first time, we allocate four objects equaling 232 bytes (64-bit).
This PR changes the scheme to use fewer allocations and less memory. With the new version, there are only two allocations:
This doesn't obviate the need for the state machine, but rather than boxing the object normally, we simply store the state machine onto the Task-derived type, which itself implements IAsyncStateMachine. Further, the captured ExecutionContext is stored onto that same object, rather than requiring a separate MoveNextRunner to be allocated, and the delegate can point to that Task-derived type. With this new scheme and that same example from earlier, rather than costing 4 allocations and 232 bytes, it costs 2 allocations and 176 bytes, so 50% fewer allocations and 25% less allocated memory.
It also helps further in another common case. Previously the Task and state machine object would only be allocated once, but the Action and MoveNextRunner would be allocated and then could only be reused for subsequent awaits if the current ExecutionContext was the default. If, however, the current ExecutionContext was not the default, every await would end up allocating another Action and MoveNextRunner, for 2 allocations and 56 bytes on each await. With the new design, those are eliminated, such that even if a non-default ExecutionContext is in play, and even if it changes in between awaits, the original allocations are still used.
There's also a small debugging benefit to this change: the resulting Task object now also contains the state machine data, which means if you have a reference to the Task, you can easily in the debugger see the state associated with the async method. Previously you would need to use a tool like sos to find the async state machine object that referenced the relevant task.
One hopefully minor downside to the change is that the Task object returned from an async method is now larger than it used to be, with all of the state machine's state on it. Generally this won't matter, as you await a Task and then drop it, so the extra memory pressure doesn't exist for longer than it used to. However, if you happen to hold on to that task for a prolonged period of time, you'll now be keeping alive a larger object than you previously were.
There is also a very corner case change in behavior, which shouldn't break any real code, but does actually break one corefx test; there's an AsyncValueTaskMethodBuilder test I wrote, as part of trying to get to 100% code coverage, that explicitly passes the wrong state machine object to the builder's SetStateMachine method, and this change causes one of its asserts to fail (in an expected manner).
cc: @kouvel, @benaadams, @davidfowl, @ericeil, @MadsTorgersen