This repository has been archived by the owner on Nov 1, 2020. It is now read-only.

WIP: Move portable thread pool and timer implementation to shared partition #6880

Closed
wants to merge 8 commits

Conversation

filipnavara
Member

@marek-safar This introduces the ThreadPool implementation, under the FeaturePortableThreadPool flag, into the shared partition. It should be possible to use it as-is in Mono, but I haven't tried to build it yet.

@filipnavara force-pushed the threadpoolportable branch 3 times, most recently from 624a2c3 to 049724e on January 24, 2019 13:53
@filipnavara
Member Author

Uh, not quite there yet... I will need to move the GetCpuUtilization native function too.

@filipnavara
Member Author

filipnavara commented Jan 24, 2019

@jkotas / @stephentoub / @jkoritzinsky Do I assume correctly that it should be possible to move CoreLibNative_GetCpuUtilization into CoreFX as SystemNative_GetCpuUtilization?

(Alternatively I can reduce the footprint by reusing existing SystemNative_GetTimestampResolution and SystemNative_GetTimestamp, binding only the additional getrusage and moving the code to managed.)

@jkotas
Member

jkotas commented Jan 24, 2019

it should be possible to move CoreLibNative_GetCpuUtilization into CoreFX as SystemNative_GetCpuUtilization

Yes, that should be fine.

It should be fine to do this for pretty much anything in CoreRT CoreLib.Native to support sharing. The CoreRT CoreLib.Native may go away eventually.

@filipnavara
Member Author

Thanks!

It should be fine to do for pretty much anything to CoreRT CoreLib.Native to support sharing.

Not planning to work on that until the Environment move from CoreFX is finished. There is quite a lot of overlap.

@filipnavara
Member Author

The remaining build failures are because of the missing CoreFX addition to System.Native.

@filipnavara changed the title from "WIP: Move portable thread pool implementation to shared partition" to "Move portable thread pool and timer implementation to shared partition" on Jan 29, 2019
@@ -1110,6 +1110,24 @@
<Compile Include="$(MSBuildThisFileDirectory)System\Security\SecureString.Unix.cs" />
<Compile Include="$(MSBuildThisFileDirectory)System\TimeZoneInfo.Unix.cs" />
</ItemGroup>
<ItemGroup Condition="'$(FeaturePortableThreadPool)' == 'true'">
<Compile Include="$(MSBuildThisFileDirectory)System\Threading\ThreadPool.Portable.cs" />
<Compile Include="$(MSBuildThisFileDirectory)System\Threading\ClrThreadPool.cs" />
Contributor

The naming here is a little confusing; is ClrThreadPool equal to ThreadPool.Portable?

Member Author

@filipnavara Jan 29, 2019

I kept the original naming of the files. There's ThreadPool.Portable.cs which implements the ThreadPool methods. The other files are the internal implementation classes, which are created and called from the ThreadPool.Portable.cs code.

Member Author

I assume the naming distinguishes an implementation inside the CLR from the Win32 Thread Pool API (https://docs.microsoft.com/en-us/windows/desktop/procthread/thread-pool-api).

Contributor

I'd prefer to follow the existing naming convention: ThreadPool.cs for portable code, ThreadPool.CoreCLR.cs for the CoreCLR runtime-specific implementation.

Member Author

@filipnavara Jan 29, 2019

There is already a ThreadPool.cs shared between all implementations. Then there are three implementations of the actual thread pool (CoreCLR with unmanaged code and the PAL, CoreRT/Portable, CoreRT/Windows).

This moves one of the implementations (CoreRT/Portable) under a feature flag to the shared partition. This implementation is a managed reimplementation of what CoreCLR does and it's currently used by CoreRT on Unix.

Member Author

@filipnavara Jan 29, 2019

The idea was to use the CoreRT implementation in Mono by simply adding

    <FeaturePortableThreadPool>true</FeaturePortableThreadPool>
    <FeaturePortableTimer>true</FeaturePortableTimer>

to Mono's System.Private.CoreLib.csproj. It could eventually be used in CoreCLR too, but that will require a lot of performance testing and changes to unmanaged code that are currently not feasible.

Member

If it helps, it would be ok to create a PortableThreadPool subdirectory and move the portable thread pool implementation there.

Member

@stephentoub Feb 6, 2019

If it helps, it would be ok to create a PortableThreadPool subdirectory and move the portable thread pool implementation there.

And rename ClrThreadPool.*.cs to PortableThreadPool.*.cs, presumably, or something along those lines?

@filipnavara
Member Author

I forgot to test Mono builds. This still has to be done before merge.

@filipnavara changed the title from "Move portable thread pool and timer implementation to shared partition" to "WIP: Move portable thread pool and timer implementation to shared partition" on Jan 29, 2019
@filipnavara
Member Author

filipnavara commented Jan 29, 2019

  • Still have to investigate the test failures that I must have somehow introduced with the last couple of commits.
  • Move HighPerformanceCounter and AppContextConfigHelper to shared.

Attempted the Mono build. These errors still have to be fixed:

  • System/Threading/ClrThreadPool.WorkerThread.cs(6,16): error CS0234: The type or namespace name 'LowLevelLinq' does not exist in the namespace 'Internal' (are you missing an assembly reference?)
  • System/Threading/ThreadPool.Portable.cs(105,17): error CS0246: The type or namespace name 'LowLevelLock' could not be found (are you missing a using directive or an assembly reference?)
  • System/Threading/ClrThreadPool.cs(37,26): error CS0246: The type or namespace name 'LowLevelLock' could not be found (are you missing a using directive or an assembly reference?)
  • System/Threading/ClrThreadPool.cs(64,17): error CS0246: The type or namespace name 'LowLevelLock' could not be found (are you missing a using directive or an assembly reference?)
  • System/Threading/ClrThreadPool.WaitThread.cs(19,17): error CS0246: The type or namespace name 'LowLevelLock' could not be found (are you missing a using directive or an assembly reference?)
  • System/Threading/ClrThreadPool.GateThread.cs(22,28): error CS0246: The type or namespace name 'LowLevelLock' could not be found (are you missing a using directive or an assembly reference?)
  • System/Threading/ClrThreadPool.WorkerThread.cs(21,28): error CS0246: The type or namespace name 'LowLevelLifoSemaphore' could not be found (are you missing a using directive or an assembly reference?)

@filipnavara force-pushed the threadpoolportable branch 2 times, most recently from 208cdd2 to 0b029f7 on January 29, 2019 11:27
@filipnavara
Member Author

filipnavara commented Jan 29, 2019

I still haven't figured out an adequate short-term solution for LowLevelLifoSemaphore. I think I will follow how Semaphore is structured: move the Windows implementation to shared and keep the Unix implementation runtime-specific.

That means Mono will have to implement it, but it's only four methods and it can likely share the backend implementation of Semaphore. @marek-safar does that make sense?

(Long term, there's a possibility of moving the WaitSubsystem and related implementations to the shared partition as a feature, but that is a bigger project and may not make sense if Mono chooses to reuse its existing code.)

@marek-safar
Contributor

Does it have to be LowLevelLifoSemaphore? Could the code use Semaphore instead?

@filipnavara
Member Author

filipnavara commented Jan 29, 2019

Does it have to be LowLevelLifoSemaphore? Could the code use Semaphore instead?

I don't understand the code well enough to answer that.

Short version: Semaphore is mostly FIFO, but that is not guaranteed; LowLevelLifoSemaphore is guaranteed LIFO. There are some other small semantic differences.

Long version:

  • LowLevelLifoSemaphore guarantees the LIFO order, Semaphore does not. In fact, the Windows implementation of Semaphore relies on a Win32 API whose documentation specifically mentions that the order is not guaranteed (but it's closer to FIFO). The Unix implementation in CoreRT uses the same basic code for LowLevelLifoSemaphore and Semaphore.
  • The Wait method on LowLevelLifoSemaphore is non-interruptible; on the regular Semaphore it is interruptible.
  • The Wait implementation in LowLevelLifoSemaphore on CoreRT/Unix sets prioritize: true, which results in different semantics with regard to the wait queue order (a cursory look suggests a LIFO vs. FIFO difference, but I could be wrong).

@filipnavara
Member Author

filipnavara commented Jan 29, 2019

A regular Semaphore would work, but it would have different [and likely inferior] performance characteristics. The LIFO nature makes sure that the most recently used worker threads are reused first. The interruptibility is a correctness concern: someone could keep a reference to a RuntimeThread belonging to the thread pool and call Interrupt() on it.

It is certainly possible to have a fully managed implementation of LowLevelLifoSemaphore, which is what CoreRT does on Unix. However, that implementation reuses existing code and has a large dependency graph that would not make sense to move to the shared partition.

I'll see if I can make a simple implementation for use in Mono, but I would probably still keep it Mono-specific for now.

@filipnavara
Member Author

Here's more information on why the Semaphore is implemented the way it is: dotnet/coreclr#13921.

@filipnavara
Member Author

filipnavara commented Jan 29, 2019

I'll do some benchmarks, but it looks like SemaphoreSlim may be a good enough approximation for now to have something working. It does all the spin-waiting, but the pessimistic wait case is FIFO rather than LIFO, which is not ideal (it prevents cold threads from timing out).

<Compile Include="System\Threading\Interlocked.cs" />
<Compile Include="System\Threading\LockHolder.cs" />
<Compile Include="System\Threading\LowLevelLock.cs" />
<Compile Include="System\Threading\LowLevelMonitor.cs" />
Member

@jkotas Jan 29, 2019

IIRC, people working on this found it convenient to be able to debug the WaitSubsystem on Windows. That's why LowLevelMonitor was implemented on both Windows and Unix.

cc @kouvel @jkoritzinsky

Member Author

Interesting. I thought the point was to support debugging ThreadPool.Portable on Windows (the only usage outside of the WaitSubsystem). That is still supported after this change. The whole WaitSubsystem was already guarded by the same Unix-only conditions.

I was more worried about introducing a performance bottleneck by replacing usages of LowLevelLock with Lock, but I haven't gotten to run the benchmarks yet to verify the impact.

If it turns out to be a problem I can revert the last few commits and take a different approach, bringing some implementation of LowLevelLock (and LowLevelSemaphore) to Mono.

Member

It would be useful to keep the ability to use the portable thread pool on Windows as well, at least under a build-time flag, for debugging and perf comparisons. There shouldn't be too many dependencies other than the few that were already identified here.

Member Author

It would be useful to keep the ability to use the portable thread pool on Windows as well, at least under a build-time flag, for debugging and perf comparisons.

Definitely. I use that myself.

#if CORERT
private readonly Lock _waitThreadLock = new Lock();
#else
private object _waitThreadLock = new object();
Member

You can create class LowLevelLock : object with a couple of methods on it to make these ifdefs unnecessary.
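For illustration, a minimal sketch of the kind of wrapper being suggested, for the non-CoreRT side (the method names mirror CoreRT's Lock; this is an assumption, not code from the PR):

    using System.Threading;

    // On runtimes without CoreRT's Lock, a trivial lock type whose instances are
    // locked via their own monitor; the field declaration and the call sites can
    // then stay identical on both sides of the former #if.
    internal sealed class LowLevelLock
    {
        public void Acquire() => Monitor.Enter(this);
        public void Release() => Monitor.Exit(this);
    }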

Member Author

I considered that as an option. It would likely result in some counterpart like [LowLevel]LockHolder that can be used in using (new LockHolder(foo)). There are only a few places, so #ifs seemed like an option for now, but I am going to revisit the options before dropping the "WIP" tag.
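A LockHolder counterpart over such a wrapper could be as small as the following sketch (hypothetical shape; CoreRT's actual LockHolder.cs, listed in the project file excerpt above, may differ):

    using System;

    // Disposable holder so call sites can write: using (new LockHolder(_waitThreadLock)) { ... }
    internal readonly struct LockHolder : IDisposable
    {
        private readonly LowLevelLock _lock; // the wrapper type sketched above

        public LockHolder(LowLevelLock l)
        {
            _lock = l;
            l.Acquire();
        }

        public void Dispose() => _lock.Release();
    }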

Member Author

@filipnavara Jan 29, 2019

I'm specifically going to evaluate the performance of LowLevelLock (only used in two places; has the Windows PInvoke-heavy implementation that is not used anywhere else) vs. Lock (used to implement Monitor, so it should be pretty optimized) vs. lock (obj) (portable, but goes through an extra hop through ObjectHeader/SyncTable; the impact in the specific places might be negligible) to see whether there is any merit in preferring one over the others.
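For what it's worth, a micro-comparison of uncontended acquire/release cost can be set up with BenchmarkDotNet along these lines (a sketch only; LowLevelLock and CoreRT's Lock are CoreLib-internal, so public primitives stand in for them here):

    using System.Threading;
    using BenchmarkDotNet.Attributes;
    using BenchmarkDotNet.Running;

    public class UncontendedLockBenchmarks
    {
        private readonly object _monitorLock = new object();
        private SpinLock _spinLock = new SpinLock(enableThreadOwnerTracking: false);
        private int _counter;

        [Benchmark(Baseline = true)]
        public int MonitorLock()
        {
            lock (_monitorLock)
            {
                return ++_counter;
            }
        }

        [Benchmark]
        public int SpinLockEnterExit()
        {
            bool taken = false;
            _spinLock.Enter(ref taken);
            int result = ++_counter;
            if (taken)
                _spinLock.Exit();
            return result;
        }

        public static void Main() => BenchmarkRunner.Run<UncontendedLockBenchmarks>();
    }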

Member

Implementation-wise, Lock in CoreRT is very roughly based on the old implementation of Monitor with some major differences. It is quite likely that currently it is not as good as Monitor's current implementation in CoreCLR. Eventually that implementation would also have to be ported to CoreRT. In any case, see my other comment for my suggestions.

Member

LowLevelLock (only used in two places; has the Windows PInvoke-heavy implementation that is not used anywhere else)

LowLevelLock does a (tiny) bit of spin-waiting, and where it is used the locks are for the most part uncontended, so it's not really p/invoke-heavy in practice (if it turns out to be, the spin-waiting strategy could be improved to fix that).

Member Author

@filipnavara Jan 31, 2019

The motivation behind this change was twofold:

  • Avoid bringing LowLevelLock/LowLevelMonitor to the shared CoreLib (and all the dependent code).
  • Use the same primitive as the lock (...) pattern. The premise was that this is what user code gets to use, so it should be the better-optimized one.

The problem is that it violates the requirement that the lock be non-interruptible. Another problem is that I grossly underestimated the dependencies of LowLevelLifoSemaphore, so in my revised plan I would bring those dependencies to shared code anyway and this change would become unnecessary.

@filipnavara
Member Author

Benchmarks of LowLevelLock vs. Lock vs. Monitor ended up being unmeasurable with BenchmarkDotNet. The overhead was so small that for all practical purposes it didn't make a difference which one is used.

Here's a draft implementation of LowLevelLifoSemaphore for Mono: https://gist.github.com/filipnavara/fea0c72ee8a7aafeb56dc54d5b6ae941

The implementation is based on the code from https://github.com/dotnet/coreclr/blob/a28b25aacdcd2adb0fdfa70bd869f53ba6565976/src/vm/synch.cpp#L578-L981. It doesn't implement the spinning (as that is currently not used by the CoreRT code), and as the low-level primitive for waking up threads it relies on the portable Monitor, which may not be ideal.

I'm still contemplating moving the bulk of this new code into CoreRT (and in effect into the shared partition) and adding the spinning optimization from CoreCLR. Then it would be reduced to four methods in Mono (Initialize, Dispose, Wake(int releaseCount) and WaitForWake(int timeoutMs)). CoreRT would continue to rely on the WaitSubsystem (Unix) and I/O completion ports (Windows).
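A rough sketch of the four-method surface described above (the parameter lists are guesses based on this comment; the bodies are placeholders that each runtime would fill in with its own primitives):

    using System;

    // The shared, managed thread pool code would call only these members.
    internal sealed class LowLevelLifoSemaphore : IDisposable
    {
        public void Initialize(int initialSignalCount, int maximumSignalCount)
            => throw new NotImplementedException(); // e.g. create an I/O completion port on Windows

        public void Dispose()
            => throw new NotImplementedException(); // release the OS handle or native state

        public void Wake(int releaseCount)
            => throw new NotImplementedException(); // release up to releaseCount waiters, most recent first

        public bool WaitForWake(int timeoutMs)
            => throw new NotImplementedException(); // block until woken or until the timeout elapses
    }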

@filipnavara
Member Author

filipnavara commented Jan 30, 2019

Yet another alternative would be to

  • revert last two commits messing with the locks
  • keep using LowLevelMonitor and move it to shared partition
  • reimplement LowLevelLifoSemaphore using LowLevelMonitor for Unix

That way there would be no Mono-specific code at all. The downside is that there would be more native code that has to be transferred to CoreFX (the interop code for wrapping pthread_mutex_t/pthread_cond_t).

@kouvel
Member

kouvel commented Jan 30, 2019

Sorry I haven't looked at this yet, I'll get back soon

@filipnavara
Member Author

No rush. I was reading through the code and I am only now getting to the point where I understand certain decisions behind the implementation. Sorry for some of the comments being messy; they follow my thought process.

Btw, @kouvel, I used some of your benchmarks from another PR (thread pool sustained/burst loads), but I had to heavily modify them to run under BenchmarkDotNet. The original benchmarks were producing such inconsistent results on my machine that they were unusable (the difference between two consecutive runs was often close to 100%). That is possibly due to tiered JIT compilation not being properly accounted for in the initial warm-up phase. If you have any pointers to some good benchmarks for ThreadPool/Timer I would be happy to use them.

@kouvel
Member

kouvel commented Jan 31, 2019

Common to all of LowLevelMonitor, LowLevelLock, and LowLevelLifoSemaphore is that all of their waits are non-interruptible with Thread.Interrupt. For example, a thread pool thread waiting for work should not be interruptible. Monitor and Semaphore use waits that are interruptible, and at the moment the interruptibility is not configurable. In CoreRT, due to their use in the wait subsystem, it was not possible to use Lock or Monitor further because they in turn rely on the wait subsystem for their waits and there would be a circular dependency.

LowLevelLock is just meant to be a simple uninterruptible lock. CoreCLR's PAL uses critical sections for uninterruptible locks, but it may be unnecessarily expensive to call into native code just for a simple lock. I would lean towards porting LowLevelMonitor and LowLevelLock as-is. Perf-wise, LowLevelLock is probably decent for the purposes it is used for at the moment.

LowLevelLifoSemaphore:

  • Thread pool worker threads wait on this. LIFO release order is used to keep hot threads hot and cold threads cold, such that some cold threads may remain unused and would eventually get a chance to time out from the wait and exit. Using a FIFO release order would cause all threads to remain recently used as long as there is a steady-enough stream of incoming work, even if the amount of incoming work does not warrant having that many threads. Once the thread count increases, it would have trouble decreasing.
  • LIFO order doesn't necessarily need to be guaranteed, but as long as it's reasonably LIFO for the most part, that's enough.
  • CoreCLR's thread pool had UnfairSemaphore and still has retired threads, both of which attempted to keep hot threads hot. It didn't work very well because it relied on a large amount of spin-waiting in order to do so, and that became problematic in many cases. CLRLifoSemaphore replaced UnfairSemaphore, and the retired threads are not strictly necessary anymore (the portable thread pool implementation in CoreRT does not have retired threads).
  • CLRLifoSemaphore in CoreCLR is a better version of LowLevelLifoSemaphore that is more suitable for the thread pool. For waiting, on Windows it uses an I/O completion port directly (uninterruptible), similarly to LowLevelLifoSemaphore, and on Unixes it uses PAL_WaitForSingleObjectPrioritized, which is similar to the prioritized wait in CoreRT's wait subsystem, and passes false for the alertable parameter to make it uninterruptible.
  • Perf would probably be important in bursty workloads; calling into native code with a p/invoke to use CLRLifoSemaphore directly probably would not be good.
    • For now, LowLevelLifoSemaphore could be ported as-is, with the Unix side of the wait portion doing something similar to what CLRLifoSemaphore does.
    • Ideally, CLRLifoSemaphore would also be ported to C# and would use LowLevelLifoSemaphore for just the wait portion; that would allow staying in managed code for the most part in thread pool workloads that come in short frequent bursts (which is very common). A much simplified sketch of that split follows this list.
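For illustration, one way that split could look, built on the Wake/WaitForWake surface sketched earlier in this thread (hypothetical; the real CLRLifoSemaphore packs its signal, waiter, and spinner counts into one interlocked word and recomputes the remaining timeout across retries, all of which is omitted here):

    using System;
    using System.Threading;

    internal sealed class ManagedLifoSemaphoreSketch
    {
        private int _signalCount;   // releases not yet consumed
        private int _waiterCount;   // threads blocked (or about to block) in the low-level wait
        private readonly LowLevelLifoSemaphore _kernelWait; // LIFO, uninterruptible; counts its own releases

        public ManagedLifoSemaphoreSketch(LowLevelLifoSemaphore kernelWait) => _kernelWait = kernelWait;

        private bool TryTakeSignal()
        {
            int count;
            while ((count = Volatile.Read(ref _signalCount)) > 0)
            {
                if (Interlocked.CompareExchange(ref _signalCount, count - 1, count) == count)
                    return true;
            }
            return false;
        }

        public bool Wait(int timeoutMs, int spinCount)
        {
            // Fast path: a short spin entirely in managed code, which is the common
            // case for work that arrives in short frequent bursts.
            for (int i = 0; i < spinCount; i++)
            {
                if (TryTakeSignal())
                    return true;
                Thread.SpinWait(8);
            }

            // Slow path: register as a waiter and block in the low-level wait.
            Interlocked.Increment(ref _waiterCount);
            try
            {
                while (true)
                {
                    if (TryTakeSignal())
                        return true;
                    if (!_kernelWait.WaitForWake(timeoutMs)) // simplification: timeout is not recomputed
                        return false;
                }
            }
            finally
            {
                Interlocked.Decrement(ref _waiterCount);
            }
        }

        public void Release(int count)
        {
            Interlocked.Add(ref _signalCount, count);
            // Only wake threads that are (or are about to be) blocked in the low-level wait;
            // spinners pick up the signal without ever leaving managed code.
            int waiters = Volatile.Read(ref _waiterCount);
            if (waiters > 0)
                _kernelWait.Wake(Math.Min(count, waiters));
        }
    }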

I used some of your benchmarks from another PR (thread pool sustained/burst loads)

At the moment, tiered compilation has to be disabled (COMPlus_TieredCompilation=0) for those tests to be somewhat representative (or for any perf tests you run). I'm in the process of adding another config flag that should help with getting more representative numbers.

These tests have to be ported to BDN at some point. I have not done so yet for several reasons that I won't get into, but one is that the harness I use is more useful in many ways for build-to-build comparisons, with finer control over iterations and managing measurement errors. It would be nice to port the tests to BDN at some point to make it easier for others to run and for tracking purposes. If you have ported some of the tests, please put up a PR if you can! :)

pointers to some good benchmarks for ThreadPool/Timer

@benaadams' task tests are good ones: https://github.com/benaadams/ThreadPoolTaskTesting. They should also be ported to BDN into the performance repo at some point.

@kouvel
Member

kouvel commented Jan 31, 2019

I haven't looked at the code changes yet; I should be able to next week, hopefully.

@filipnavara
Member Author

filipnavara commented Jan 31, 2019

Common to all of LowLevelMonitor, LowLevelLock, and LowLevelLifoSemaphore is that all of their waits are non-interruptible with Thread.Interrupt.

I suspected that would be the reason for it.

In CoreRT, due to their use in the wait subsystem, it was not possible to use Lock or Monitor further because they in turn rely on the wait subsystem for their waits and there would be a circular dependency.

I haven't run into this in my tests and I don't see the code paths where it would actually create the circular dependency, but I may have just been lucky.

Thanks, the rest more or less confirmed what I figured from the code in the past few days.

Accordingly I currently lean towards the following plan:

  • Port CLRLifoSemaphore to C# (already started with it anyway)
  • Move native POSIX mutex/condition helpers to System.Native in CoreFX
  • Reimplement the Unix part of LowLevelLifoSemaphore in terms of LowLevelMonitor (or the POSIX conditions) to break dependency on WaitSubsystem
  • Move LowLevelLock, LowLevelMonitor and LowLevelLifoSemaphore (assuming the name doesn't change) to shared partition along with the portable thread pool

I'll extract the good bits of this PR into a separate PR and submit any bigger changes separately. Once I get to the point where all dependencies are resolved I can get back to this PR and keep it as simple as possible.

@benaadams' task tests are good ones: https://github.com/benaadams/ThreadPoolTaskTesting.

Great, I will take a look!

@davidfowl
Member

Does this potentially bring us closer to a mostly managed thread pool implementation? Would it be possible to try this out in .NET Core as well?

@filipnavara
Member Author

filipnavara commented Jan 31, 2019

Does this potentially bring us closer to a mostly managed thread pool implementation? Would it be possible to try this out in .NET Core as well?

Yes and eventually yes.

The goal is primarily to bring this to Mono. Actually bringing it to CoreCLR is out of scope for me, but if someone else is interested I do have a version of the code that builds on top of CoreCLR. I will be happy to share that once all the problems from this PR are resolved. It will need a lot of benchmarks and tweaking before it would be ready as a replacement though.

@kouvel
Member

kouvel commented Jan 31, 2019

Reimplement the Unix part of LowLevelLifoSemaphore in terms of LowLevelMonitor (or the POSIX conditions) to break dependency on WaitSubsystem

LowLevelMonitor is mostly FIFO; implementing LowLevelLifoSemaphore with LowLevelMonitor would not give the LIFO behavior. In CoreCLR's PAL it was not too difficult to modify the equivalent of WaitForSingleObject into a version that registers the waiter in the opposite order to get a LIFO release order of waiters. Is it possible to do something like that in Mono? It may mean that the code cannot be entirely in CoreFX or may need some layer that allows help from the runtime; I'm less worried about that than about not getting the LIFO behavior.

@kouvel
Member

kouvel commented Jan 31, 2019

I haven't run into this in my tests and I don't see the code paths where it would actually create the circular dependency, but I may have just been lucky.

From their uses in the thread pool I don't think there is a circular dependency issue. Those types were initially created for the wait subsystem (not for the thread pool), and they were used in the portable thread pool implementation because they were handy. I was just trying to convey the reason for their existence; I don't think the circular dependency issue is of concern for this change.

@filipnavara
Member Author

LowLevelMonitor is mostly FIFO; implementing LowLevelLifoSemaphore with LowLevelMonitor would not give the LIFO behavior.

I didn't mean to use a single one. I was thinking about one per thread and waking up specific threads on Release in LIFO order. (Otherwise I would basically be reimplementing SemaphoreSlim.)

Is it possible to do something like that in Mono?

Last time I checked it seemed it would not be easy. I will check again before writing some code :)
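For illustration, one way the per-thread wait objects could give LIFO wakeups over the plain portable Monitor (a hypothetical sketch; a complete version would also have to unregister timed-out waiters instead of leaving them on the stack):

    using System.Collections.Generic;
    using System.Threading;

    internal sealed class PerThreadLifoSemaphoreSketch
    {
        private sealed class Waiter { public bool Signaled; }

        private readonly object _lock = new object();
        private readonly Stack<Waiter> _waiters = new Stack<Waiter>(); // most recent waiter is woken first
        private int _signalCount;

        public bool Wait(int timeoutMs)
        {
            Waiter waiter;
            lock (_lock)
            {
                if (_signalCount > 0) { _signalCount--; return true; }
                waiter = new Waiter();
                _waiters.Push(waiter);
            }

            lock (waiter)
            {
                if (waiter.Signaled || Monitor.Wait(waiter, timeoutMs))
                    return true;
                return waiter.Signaled; // timed out; a signal may still have raced in
            }
        }

        public void Release(int count)
        {
            var toWake = new List<Waiter>();
            lock (_lock)
            {
                while (count > 0 && _waiters.Count > 0)
                {
                    toWake.Add(_waiters.Pop()); // LIFO: most recently pushed waiter first
                    count--;
                }
                _signalCount += count; // leftover releases satisfy future waiters
            }

            foreach (Waiter w in toWake)
            {
                lock (w)
                {
                    w.Signaled = true;
                    Monitor.Pulse(w);
                }
            }
        }
    }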

@kouvel
Member

kouvel commented Jan 31, 2019

I didn't mean to use a single one. I was thinking about one per thread and waking up specific threads on Release in LIFO order.

Yea, that is possible, and we considered something like that in case nothing else was available. I think it would be easier to find a lower-level primitive that offers the same behavior. I/O completion ports use LIFO release order for good reason; perhaps there is also a low-level primitive on Unixes that offers LIFO waits, and if so it may do well for this purpose.

@filipnavara
Member Author

Closing for now. Will reopen once I sort out all the dependencies.
