Improve thread pool worker thread's spinning for work #13921
Conversation
CC @sdmaclea
--*/
DWORD
PALAPI
PAL_WaitForSingleObjectPrioritized(IN HANDLE hHandle,
PAL_
Why use PAL_ in the name here? I am not seeing we are doing that in other places.
The PAL tries to mimic the Windows API. For functions that are not in that category I believe the convention is to prefix with PAL_.
There are a bunch of functions that follow this convention, for example https://github.com/dotnet/coreclr/blob/master/src/pal/inc/pal.h#L5393 and others in pal.h
--*/
DWORD
PALAPI
PAL_WaitForSingleObjectPrioritized(IN HANDLE hHandle,
PAL_WaitForSingleObjectPrioritized
Is it possible we can use WaitForSingleObject with an extra parameter instead of adding one more PAL method?
I wanted to keep the WaitForSingleObject signature in sync with the Windows API.
Added back the cache line paddings from UnfairSemaphore, which I missed. Didn't see much change in the spin perf tests, but it probably helps in other more realistic cases.
                {
                    --newCounts.signalCount;
                    --newCounts.waiterCount;
                }
Is it possible these counts go negative if multiple threads come here at the same time? #Closed
Never mind, I see we are doing
Counts counts = m_counts.VolatileLoad();
which I think should be a copy of the counts.
In reply to: 138468374
The signal count is checked before decrementing. The waiter count can't be zero regardless of the volatile load, because the caller on the same thread would have registered a waiter before calling this function.
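The pattern being discussed can be sketched outside the CLR. Below is a hypothetical, simplified stand-in for CLRLifoSemaphore's counts (two 16-bit fields packed into one 32-bit word; the real code packs more fields into a wider word): each loop iteration works on a local copy, and the compare-exchange only publishes the new value if no other thread changed the counts in the meantime, so the counts can never go negative.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical simplified layout, not the actual CLRLifoSemaphore type.
struct Counts
{
    uint16_t signalCount;
    uint16_t waiterCount;

    uint32_t AsInt() const
    {
        return (uint32_t)signalCount | ((uint32_t)waiterCount << 16);
    }
    static Counts FromInt(uint32_t v)
    {
        return { (uint16_t)(v & 0xffff), (uint16_t)(v >> 16) };
    }
};

std::atomic<uint32_t> g_counts{ Counts{ 1, 1 }.AsInt() };

// Each iteration mutates a local copy; if another thread raced us, the
// compare-exchange fails, 'old' is refreshed, and we retry with the
// latest value, so the decrements can't drive the counts negative.
bool TryAcquire()
{
    uint32_t old = g_counts.load(std::memory_order_acquire);
    while (true)
    {
        Counts counts = Counts::FromInt(old); // local copy of the counts
        if (counts.signalCount == 0)
            return false; // nothing to take; don't decrement below zero
        Counts newCounts = counts;
        --newCounts.signalCount;
        --newCounts.waiterCount;
        if (g_counts.compare_exchange_weak(old, newCounts.AsInt()))
            return true; // our copy was still current; update published
        // CAS failed: 'old' now holds the refreshed value; retry
    }
}
```

With the initial counts of one signal and one waiter, the first acquire succeeds and the second observes a zero signal count and bails out instead of underflowing.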
src/vm/synch.cpp
                // Maximum number of spinners reached, register as a waiter instead
                --newCounts.spinnerCount;
                ++newCounts.waiterCount;
                _ASSERTE(newCounts.waiterCount != 0);
_ASSERTE(newCounts.waiterCount != 0);
This looks like a weird assert. Is it really possible we can get 0?
This is just an overflow check; waiterCount is uint16, so it'll become 0 on overflow. Currently, CLRLifoSemaphore is only used by the thread pool, which limits the number of worker threads to 32K, so uint16 is sufficient and overflow is currently protected by asserts. For more general use uint16 may not be sufficient, and currently that is not supported.
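The wraparound the assert guards against is easy to demonstrate in isolation (a standalone sketch, not CLR code): incrementing a uint16_t past its maximum is well-defined unsigned arithmetic and silently yields 0, which is exactly the state the assert rejects.

```cpp
#include <cstdint>

// Simulates registering one more waiter; with a 16-bit count, the
// increment past 0xFFFF wraps to 0 rather than trapping, so only an
// assert (or a wider type) can catch the overflow.
uint16_t IncrementWaiterCount(uint16_t waiterCount)
{
    ++waiterCount; // unsigned wraparound: 0xFFFF + 1 == 0
    return waiterCount;
}
```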
small comment will help here :-)
Done
LGTM
@kouvel Thanks, I'll try to run it on arm64 in the morning.
            "Corrupted waiting list on local CSynchData @ %p\n",
            this);

    pwtlnNewNode->ptrNext.ptr = NULL;
Nit: couldn't we always assign this to pwtlnCurrFirst before the if rather than assigning it to null here and then to that value in the else?
It could be done either way; both the prev and next need to be assigned on all paths.
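The two orderings are indeed equivalent. A minimal standalone sketch (hypothetical Node type, not the PAL's actual wait-list structures) showing both styles of a head insert:

```cpp
#include <cstddef>

// Hypothetical simplified node, standing in for the PAL's wait-list nodes.
struct Node
{
    Node* prev = nullptr;
    Node* next = nullptr;
};

// Style A (as in the patch): initialize next to null, overwrite in the branch.
void PushFrontA(Node*& head, Node* n)
{
    n->prev = nullptr;
    n->next = nullptr;
    if (head != nullptr)
    {
        n->next = head;
        head->prev = n;
    }
    head = n;
}

// Style B (the nit): assign next unconditionally from the current head;
// when the list is empty, head is already null, so the result is identical.
void PushFrontB(Node*& head, Node* n)
{
    n->prev = nullptr;
    n->next = head;
    if (head != nullptr)
        head->prev = n;
    head = n;
}
```

Either way, every insert writes both prev and next on every path; the difference is purely stylistic.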
            {
                --newCounts.countOfWaitersSignaledToWake;
            }
Is it possible / likely that newCounts == counts? I'm wondering if there's any value in checking whether they're the same before doing the interlocked operation.
In this case, checking before worries me a bit regarding countOfWaitersSignaledToWake. Checking and exiting before the interlocked operation would be speculative: if it happens to read an old value and sees a zero, missing a write in Release, the count could be greater than it should be, and that could result in a waiter not being woken.
Thought about it some more, I don't think it would be wrong for the reason I gave above. If new and old counts are the same and it incorrectly sees a zero for countOfWaitersSignaledToWake, it must be because another thread decremented that count to zero and yet another thread incremented it back to nonzero. The thread that incremented the count to nonzero will wake another thread so it's ok for this thread to miss the decrement, the next woken thread will take care of that.
I don't see an issue with checking/breaking early here at the moment, but I prefer to err on the safe side and not do so to avoid any potential problems that I may have missed.
It's not very likely for the counts to be the same, because typically every woken thread has to decrement countOfWaitersSignaledToWake. If a waiter timed out in-between and another thread acquired the semaphore, it could cause the counts to be the same. Waiters waking up also wouldn't typically be on the hot path; there could be significant delays between bursts of work items that cause threads to wait before waking up, for instance, but the delay would mask any gain that this would give.
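The optimization under discussion can be sketched generically (hypothetical names, not the actual synch.cpp code): an update loop that skips the interlocked operation when the computed new value equals the observed old value. The concern raised in this thread is that the early exit trusts a possibly-stale read, so the conservative version simply omits the break and always performs the CAS.

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint32_t> g_value{ 0 };

// Example transform: decrement, clamped at zero.
uint32_t DecrementIfNonzero(uint32_t v)
{
    return v != 0 ? v - 1 : 0;
}

// The 'newValue == oldValue' break avoids an interlocked RMW on the shared
// cache line when nothing would change. Whether that speculative read is
// safe depends on the surrounding protocol, which is the point debated above.
void UpdateWithEarlyExit(uint32_t (*computeNew)(uint32_t))
{
    uint32_t oldValue = g_value.load(std::memory_order_acquire);
    while (true)
    {
        uint32_t newValue = computeNew(oldValue);
        if (newValue == oldValue)
            return; // speculative early exit: skips the interlocked op
        if (g_value.compare_exchange_weak(oldValue, newValue))
            return;
        // CAS failed: 'oldValue' has been refreshed; retry
    }
}

// Helper for exercising the loop: reset, update once, return the result.
uint32_t RunOnce(uint32_t start)
{
    g_value.store(start, std::memory_order_relaxed);
    UpdateWithEarlyExit(DecrementIfNonzero);
    return g_value.load(std::memory_order_relaxed);
}
```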
Ok. Thanks.
If everything's passing and throughput looks good on various platforms and various workloads, looks good.
Thanks @sdmaclea, much appreciated. I was seeing the best performance on Linux with a spin count of 200 on my 6-core machine, but that seemed a bit high to me and I settled for 140, which was showing as equal to or slightly better than baseline. As I currently don't have access to a processor like yours, would you also be able to try different spin counts? It's using the same config variable, COMPlus_ThreadPool_UnfairSemaphoreSpinLimit. I just want to verify whether the proc count multiplier has any value there.
@dotnet-bot test Tizen armel Cross Debug Build
I also suspected it was hurting. I was tempted to use 200 instead of 50 * numProc, on the intuition that 50 was picked on a 4-core processor. Starting testing now.
@kouvel In general the baseline plaintext performance on tip has regressed since I was looking at it regularly. I have not figured out the root cause. So with that in mind, doing a differential between UnfairSemaphore and CLRLifoSemaphore, I see about a 6% regression. I assume the root cause would be the busy spinning in the spin wait.

Regarding the UnfairSemaphoreSpinLimit: I see a broad peak between 48 and 96 (2x because I'm on Linux). So I am happy with the current default of 70. I will do some more experimentation tonight or more likely tomorrow.
I think that might be hex. So I am still reasonably happy with the default of 0x46.
Thanks @sdmaclea, yes the config values are interpreted as hex. Is the 6% regression from baseline using the default of 0x46? Is there a regression from baseline even when compared with the highest point in the peak?
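As a quick check of the hex interpretation confusion above, here is a standalone sketch (hypothetical function name, not the CLR's actual config-reading code) of parsing a COMPlus_* value as hexadecimal, the way the runtime treats these settings: a configured "46" means 0x46 == 70 decimal, matching the "default of 70" and "default of 0x46" both being mentioned in this thread.

```cpp
#include <cstdlib>

// COMPlus_* numeric values are read as hexadecimal, so "46" -> 70,
// "69" -> 105, and "d2" -> 210 decimal.
unsigned long ParseComPlusValue(const char* text)
{
    return std::strtoul(text, nullptr, 16);
}
```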
The 6% regression was using the highest peak when sweeping through the spin limit values.
@sdmaclea, if I remember correctly your processor is 40+ cores, can you confirm? If that's correct, then since your spin counts roughly match my estimates, I'm leaning towards believing that a proc count multiplier is not necessary and that it's just about finding the right spin count. Perhaps I should just multiply the spin count by 3 instead of 2 for Unixes, to give a default spin count of 210 (0xd2) that is somewhere in the middle of your peak range?
I see. If you get a chance, could you please try a spin count of 0x69 (on Linux currently it's * 2 = 210 iterations)? Just hoping that there is a spin count where there is no regression from baseline; otherwise something else I'm missing must be involved.
I am testing on a platform with 48 cores.

The spin count for UnfairSemaphore was not actually constant. Playing with it in a spreadsheet, for 48 cores: 223 gives the same average spins as the old 50. But since the new spin count is constant, all the spinners end at roughly the same time. With the old method there was a spinner tail: the last spinner spun for 2400 spins, and half the spinners finished after 100 spins.

I am actually thinking we might be better off with the old method. If we want the same average spin for 48 cores, the limit would have to drop to ~2.9. However, I am actually thinking 10 looks nice: half the spinners finish after 37 spins, 75% after 100, and the last spinner can spin for 23K spins. Seems like it would be better for bursty work.

In any case it might be worth playing with the shape of the spin limit curve.
I did not see an improvement. This may also be interesting: with UnfairSemaphoreSpinLimit == 74, for 48 cores the average is roughly the same, but half the cores do not spin at all, while many of the remaining spin longer. Intuitively I like this better.
Ubuntu arm64 Debug Cross Build
@dotnet-bot test Ubuntu arm64 Debug Cross Build
@tarekgh, could you please take a quick look at the last three commits and see if they are ok?
LGTM.
Thanks!
@dotnet-bot test Windows_NT arm64 Cross Debug Build
@dotnet-bot test Ubuntu arm64 Cross Debug Build
@dotnet-bot test Windows_NT Arm64 Checked
@dotnet-bot help
Welcome to the dotnet/coreclr Repository

The following is a list of valid commands on this PR. To invoke a command, comment the indicated phrase on the PR.

Have a nice day!
@dotnet-bot test Windows_NT arm64 Cross Checked Build and Test
@dotnet-bot test Windows_NT arm64 Cross Debug Build
@dotnet-bot test Ubuntu arm64 Cross Debug Build
@dotnet-bot test Windows_NT arm64 Checked

@jashook Is there a reason we can't fix the trigger word to match like all the other jobs? It causes a lot of confusion for me and others. Opened #14091
Ah thanks
@dotnet-bot test Ubuntu arm64 Checked
@dotnet-bot test Windows_NT arm64 Checked
@dotnet-bot test Windows_NT arm64 Checked
@dotnet-bot test Ubuntu arm64 Checked
@dotnet-bot test Windows_NT arm64 Checked
Issues in the Ubuntu arm64 test run look unrelated to this change and are different from the last run.
Closes https://github.com/dotnet/coreclr/issues/5928
Replaced UnfairSemaphore with a new implementation in CLRLifoSemaphore.
Perf results
For the test case in https://github.com/dotnet/coreclr/issues/5928, CPU time spent in UnfairSemaphore::Wait was halved. CPU time % spent in UnfairSemaphore::Wait relative to time spent in WorkerThreadStart reduced from about 88% to 78%.
Updated spin perf code here: #13670
Numbers on Linux were similar with a slightly different spread and no regressions.
I also tried the plaintext benchmark from https://github.com/aspnet/benchmarks on Windows (couldn't get it to build on Linux at the time). No noticeable change to throughput or latency, and the CPU time spent waiting decreased a little, from ~2% in UnfairSemaphore::Wait to ~0.5% in CLRLifoSemaphore::Wait.