Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check out source generation performance #7

Open
martinothamar opened this issue May 28, 2021 · 20 comments
Open

Check out source generation performance #7

martinothamar opened this issue May 28, 2021 · 20 comments

Comments

@martinothamar
Copy link
Owner

See how fast source generation performance is here, if it should be improved etc.

@Tornhoof
Copy link
Contributor

Not directly related to src gen performance, but performance of the generated source.
Currently the src gen generates large switch tables for e.g. the Send method,
something like:

        public global::System.Threading.Tasks.ValueTask<TResponse> Send<TResponse>(
            global::Mediator.IRequest<TResponse> request,
            global::System.Threading.CancellationToken cancellationToken = default
        )
        {
            switch (request)
            {
                case global::MyQuery r:
                {
                    var task = Send(r, cancellationToken);
                    return global::System.Runtime.CompilerServices.Unsafe.As<global::System.Threading.Tasks.ValueTask<global::MyResponse>, global::System.Threading.Tasks.ValueTask<TResponse>>(ref task);
                }

This scales rather badly for a lot of Requests (I tried it with a solution with 2500 Requests):

There is a concept for statically typed dictionaries, similar to https://github.com/asynkron/protoactor-dotnet/blob/dev/src/Proto.Actor/Utils/TypedDictionary.cs
The idea is to use generic types in a static class to generate consecutive numbers to index an array with the value. From a hashing point of view, this is a Perfect Minimal Hash (all input values are known)
I put a simplified version into a demo repo:
https://github.com/Tornhoof/StaticTypeDictionaryBenchmark

A basic benchmark looks like this, where SwitchRequestN is the switch/case concept and StaticSwitchRequest is the static type dictionary concept.

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.22000
Intel Core i7-10700 CPU 2.90GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK=6.0.300
  [Host]   : .NET 6.0.5 (6.0.522.21309), X64 RyuJIT
  ShortRun : .NET 6.0.5 (6.0.522.21309), X64 RyuJIT

Job=ShortRun  IterationCount=3  LaunchCount=1  
WarmupCount=3  
Method Mean Error StdDev
SwitchRequest1 177.936 ns 2.7435 ns 0.1504 ns
SwitchRequest100 537.292 ns 63.2024 ns 3.4643 ns
SwitchRequest200 868.002 ns 6.9180 ns 0.3792 ns
SwitchRequest300 1,213.257 ns 12.0224 ns 0.6590 ns
SwitchRequest400 1,570.105 ns 505.7319 ns 27.7209 ns
SwitchRequest500 1,954.081 ns 333.3473 ns 18.2719 ns
SwitchRequest600 2,296.974 ns 380.4903 ns 20.8560 ns
SwitchRequest700 2,614.459 ns 375.7176 ns 20.5944 ns
SwitchRequest800 3,001.314 ns 55.9098 ns 3.0646 ns
SwitchRequest900 3,542.332 ns 81.5554 ns 4.4703 ns
SwitchRequest1000 3,964.454 ns 55.4060 ns 3.0370 ns
StaticSwitchRequest1 4.188 ns 0.2662 ns 0.0146 ns
StaticSwitchRequest100 4.837 ns 1.7584 ns 0.0964 ns
StaticSwitchRequest200 4.529 ns 0.0561 ns 0.0031 ns
StaticSwitchRequest300 4.963 ns 0.1827 ns 0.0100 ns
StaticSwitchRequest400 4.530 ns 0.0132 ns 0.0007 ns
StaticSwitchRequest500 4.658 ns 0.3519 ns 0.0193 ns
StaticSwitchRequest600 4.533 ns 0.6353 ns 0.0348 ns
StaticSwitchRequest700 4.593 ns 0.3906 ns 0.0214 ns
StaticSwitchRequest800 4.511 ns 0.0095 ns 0.0005 ns
StaticSwitchRequest900 4.511 ns 0.0425 ns 0.0023 ns
StaticSwitchRequest1000 4.511 ns 0.0292 ns 0.0016 ns

As you see here, the later/deeper the Request is in the switch case the slower the call actually gets, the static version is a constant speed with a constant lookup into an array.

Bottom line: consider getting rid of the large switches as we know all valid cases and with that knowledge can create perfect hashes.

@martinothamar
Copy link
Owner Author

Cool approach! I didn't know switch statements scaled so poorly, but I guess it makes sense since all we have is IRequest<>... has to be a runtime check.

So since we don't have the concrete T for the request, will be able to use that approach? Since in this piece

	public static int Switch<T>(T t) where T : class
	{
		var r = TypeIndex<IRequest>.Get<T>();
		return r?.Value ?? 0;
	}

T will be IRequest<TResponse> - so

        public static TValue Get<TKey>() => s_Values[TypeKey<TKey>.Id];

So TKey here will collide where requests have the same response type?
Am I right in assuming we have to use the runtime type (i.e. request.GetType()) for the IRequest<TResponse> case?

@Tornhoof
Copy link
Contributor

We need to know the concrete T for the static type dict trick to work, that is true.
We only know the concrete TResponse at that point. So what would be possible is to split the large switch into a lot of smaller switches in different generic method for each known TResponse. The largest one would be Unit, but it is often the case that there are not many different concrete IRequest implementations per concrete TResponse. I looked through my code bases and in ~50% of my requests the TResponse is exactly used once, i.e. making it a 1:1 mapping, the other 50% vary between 2 and 10.

The switch case also uses IL isinstanceof which is sufficiently expensive to be slow:
See https://sharplab.io/#v2:EYLgHgbALANALiAhgZwLYB8ACAmAjAWAChMAGAAk1wDoAlAVwDs4BLVAUyoGEB7VAB2YAbNgCcAyqIBuzAMZtkAbiJFMAZgrYynMgG8iZA2QDalWoxbsuvAcPFTZ8qgFk2cABbcAJgEl+ggBSm9EysHDz8QqISItJyyM6uHj5+APJ8LNwM8QCCAOa5IvLIzJJs3gyCzAxVuQCUALr6hmpkVXBkTv7eNGwAjnTy7SK1uk2GhsgA7sxwMm7+w2Pjo4TLa2QyKGxkACJkYCBL68t6q8fnBpgA7PtUAGqIggNKZxfjAL5H65vI2wCi+0OrzeBlOIOO11uDyebBe4MMn2BCK+FBuJDhyOBSxM1GCFjC1kidhiDniLncXl8fACQXMoSsEVs0VijnJSSpgjSGSyVDyBSKJTKFSqNQaSxabQ62C6PX6gzIwxWKNaADMFq1kLsyJ5asqwW9IZ57o9nsrEcc2IJfqr1cxNQC2LqkYZ9RdIRxoabnQZzWtIeilu9xktsbSQpZwjYovY4gkKclqYFcXSI4SmTHWYlKal0sxMjl8oVkMVSuVKtUGHVGoQAJASpgdVQyvoDZBDJ01041mvMFVkBZUADirgAKgBPPhsfwjAC8M7IcAnbG4ap2tQ7naI3ZrhT7kkQIm1ZHnu4oyfDBMZ0ZJsYAqllECqONlkAAebotwYwXYAPgWbD7YYXm3Q1jRhYCa0Rbte37EQh1HJdp2PedF0nFd/D+dct03Wtu1PfdD22E8ALPMwLwZKNiRZeJ72QR9nzfD85Tbb8/j/U8gOwusbg9E1YWwqDuLIANayDAwiERFR1DaUQVUQOQyCY1s4CIV1WgbT1th0XJXAURFJOIdQcC1EBFNlZTVPFaSNL448fzIKAXgMlpjIBUylMGSzgXrdpNLssgAFYnKAA===

for different implementations.

The fastest is something like this, similar to how you already use Unsafe.As:

    [System.Runtime.CompilerServices.MethodImpl(System.Runtime.CompilerServices.MethodImplOptions.AggressiveInlining)]
	public int M3(IRequest r)
	{
		if (r.GetType() == typeof(D))
		{
			ref var d = ref System.Runtime.CompilerServices.Unsafe.As<IRequest, D>(ref r);
			return d.Value;
		}
		if (r.GetType() == typeof(E))
		{
			ref var e = ref System.Runtime.CompilerServices.Unsafe.As<IRequest, E>(ref r);
			return e.Value;
		}
		return 0;
	}   

It is important that the GetType() is repeated each time and not stored in a variable because otherwise a lot of optimizations don't hit. AggressiveInlining is also required, I don't really know why.
A preliminary benchmark for that version puts it roughly 3-4 times faster than the switch/case

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.22000
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=7.0.100-preview.3.22167.1
  [Host]   : .NET 6.0.5 (6.0.522.21309), X64 RyuJIT
  ShortRun : .NET 6.0.5 (6.0.522.21309), X64 RyuJIT

Job=ShortRun  IterationCount=3  LaunchCount=1  
WarmupCount=3  
Method Mean Error StdDev Code Size
SwitchRequest1 101.6320 ns 2.2133 ns 0.1213 ns 65,974 B
SwitchRequest1000 2,095.2218 ns 498.7225 ns 27.3367 ns 65,974 B
UnsafeSwitchRequest1 0.3906 ns 0.0038 ns 0.0002 ns 41,382 B
UnsafeSwitchRequest1000 609.4602 ns 98.1695 ns 5.3810 ns 41,388 B

See the large difference in code size, combining that approach with the approach to split the methods by TResponse should keep the individual methods small and increase the performance a lot for large systems. Large methods always have the problem that many optimizations in the JIT stop after having e.g. 512 locals (which in our case would be 500 ifs more or less).

In general it would be possible to write a minimal perfect hash algorithm which maps all request types to an [0-n] array in all cases, but unfortunately there is no static type information available which can be used for something like that, so it would need to be done at runtmie on e.g. the type guid and that will be a rather slow init process, minimal perfect hash finding is kinda like bruteforce.

@martinothamar
Copy link
Owner Author

Yup, sounds like a really good solution 👍 I'm pretty busy for the next couple of days but I'll try to look into this as soon as possible

@Tornhoof
Copy link
Contributor

Tornhoof commented Jul 1, 2022

I took your benchmark branch (with the 1000 extra handlers) and tried to implement several different concepts to optimize the switch/case for the requests, but never achieved anything worthwhile, maybe around 10% after some nasty unsafe code, nothing even remotely like the changes from the sharplap example above.

Which leads me back to to the TypeDictionary, you've used that for notification in your branch. We still have the problem that we don't really have a proper compile time identifier for each request.

So why not add one via source generation?
For each Request Type deriving from IRequest, add a source generated partial type implementing some id interface with an Id and then use that Id to access the handler array directly, as we can control which id to put in to the generated partial type, we can simply use it as a linear accessor.
something like.

	public sealed partial class Request0 : IRequestIdentifier
	{
		int IRequestIdentifier.RequestId => 0;
	}

This is not the ideal solution, but probably the most performant one.
There are some interesting type changes ahead in C#11 (maybe 12), where it is possible to scope certain types by file. maybe this can help in the future to hide the generated code from the user (better than just explicit interfaces)

@Tornhoof
Copy link
Contributor

Tornhoof commented Jul 5, 2022

I asked Jared Parsons for advice on twitter, https://twitter.com/torn_hoof/status/1543631166686863366 he basically confirmed that there are no constant identifiers we could use out of the box. he suggested to check for type full name first and then do the isinstance check, but that doesn't make a big difference. I'll have some minimal perfect hashing code around somewhere, time to try that one out.

@Tornhoof
Copy link
Contributor

Tornhoof commented Jul 6, 2022

I tried out minimal perfect hashes,
there are conceptionally two major ways of creating the data for minimal perfect hashes via the full type name

First way

You take a known hash function with a seed and a displacement/intermediate map to map all inputs via the displacement map to the output, this might mean running the hash twice. See http://stevehanov.ca/blog/index.php?id=119 for a detailed explanation. The hash method needs to have a decent distribution, otherwise calculation of the displacement/intermediate table (via modifying the seed for a specific input value) takes too long.
I too XXHash32 (integrated into .NET by now) for as a hash method.
This works, but each hash calculation takes ~32ns on my system. Which makes it slower than the c# dictionary with the Type as a lookup. Dictionary Type lookup is 35ns on my system, the hashed version took between 45 and 90ns.
The source is basically this:
https://github.com/Tornhoof/StaticTypeDictionaryBenchmark/blob/main/PerfectMinimalHashGenerator.cs
this includes the full code.

Second way

Instead of using a fixed hash function, the hash function is build based on the input data.
For example: The test Requests, basically Namespace.Request1-100, so that method basically takes the last few characters to calculate the hash and then the hash method is very very simple, somehting like typeName[^2]+typeName[^1] + 5 and then look it up to the linear index.
This is very fast because the hash function is very easy. This is basically was gperf does (I ran gperf on the input data for that). See http://www.dre.vanderbilt.edu/~schmidt/PDF/gperf.pdf for that.
The lookup takes ~4ns on my system or ~6ns if gperf outputs a switch table.
The sources for the gperf output converted to c# are:
https://github.com/Tornhoof/StaticTypeDictionaryBenchmark/blob/main/GPerf.cs
and
https://github.com/Tornhoof/StaticTypeDictionaryBenchmark/blob/main/GPerf2.cs

Conclusion

The first way is too slow, but the gen code is trivial (~100 lines of code).
The second way is fast, but the gen code is very complex (gperf is several thousand lines of code).

The idea works, but writing the required code to get the second way implemented is way out of scope.
gperf is also not designed for many keys, we're talking hundreds not thousands.

@martinothamar
Copy link
Owner Author

Hey, sorry for the slow answer...
Thanks for looking into all of this. I've been busy with different stuff for a while now but I hope to get back to this during summer.
This was surprisingly hard 😄 my initial testing didn't get me anywhere. As you also found, optimizing the branching way that we have now will never be ideal perfwise. Gperf sounded really interesting, but also complex as you note.

I've come to the conclusion that if we want the IRequest<TResponse>-path to be optimized (close to 0 overhead) we have to generate some identifier in a source generator (there is no way for us to attach data to System.Type). That is what you suggest above with IRequestIdentifier right? Then we can read it both at source gen time (to correct array indices) and when we receive the IRequest<TResponse>.
This is a breaking change though and all requests etc have to be made partial...
As an alternative, maybe we could add the identifier as an attribute using IL weaving instead?

I need to read up on the papers/articles you linked above, and I want to do some more testing/experimentation before deciding anything.
Again thanks for all the help on this

@Tornhoof
Copy link
Contributor

Tornhoof commented Jul 7, 2022

Yeah that's my suggestion with IRequestIdentifier.
There are ways to use Attributes, e.g. GuidAttribute, primarily used for COM, this shows up as Type.GUID then, but even switching on something that integrated into the Type System is a lot slower than having our own identifier. Doing switching on our own custom attributes is very slow as attribute lookup at runtime via the type system is not fast, most likely slower than a Dictionary<Type, Handler>
There are ways to improve comparing complex identifiers, like if-ladders where only parts of the id are compared until you hit the correct one. I wrote something like that for SpanJson (https://github.com/Tornhoof/SpanJson/wiki/Technology-and-Internals#2-nested-if-approach), but this does not scale well above ~100 values.

So the preferred way is still some indexable id and I don't see any way around srcgen for that at the moment.

@irunjic
Copy link

irunjic commented Dec 5, 2022

NET 8 will have "Frozen" collections that optimize access based on input data (not sure if it's an implementation of perfect hashing).
dotnet/runtime#77799

Would that simplify the generated code?

@martinothamar
Copy link
Owner Author

Very cool, seems like there are a lot of different optimizations especially for string keys. I don't think it can compete with source generated indices, though there is the complication with the source generated indices - the running process (where Mediator is generated) can refer to multiple projects defining their own set of indices, which can overlap. So when building the running project these indices will have to be computed in some way based on the containing assembly/project so that collisions are avoided. So basically I'm not sure, I haven't done any groundwork on this yet unfortunately but we definitely should explore both and benchmark. Thanks for bringing it up!

@Tornhoof
Copy link
Contributor

I recently took a long look at Source Generators and static abstract interfaces for a different project and here is my summary.
The original idea of static abstract interfaces for a linear index won't work (anymore) for the normal Send<TResponse>(IRequest<TResponse> ...) api, as MS changed the behavior around them after I did my original POC. It is not possible for abstract static interfaces to be used as type args like that.
They changed it to conform to https://en.wikipedia.org/wiki/Curiously_recurring_template_pattern and specifically CSharpLang issue 5955 has the reasons and a lot of comments why it limits the usage too much, (I agree with that sentiment).

Anyway, the only thing we can do is limit the possible type checks for the different IRequest types based on TResponse, hoping that they are usually different, i.e. not all are string.

The other solution, I'm actually considering (for said different project) is having different interfaces

        ValueTask Send<TCommand>(TCommand command,
            CancellationToken cancellationToken = default) where TCommand : class, ICommand;

        ValueTask<TResponse> Send<TRequest, TResponse>(TRequest request,
            CancellationToken cancellationToken = default) where TRequest : class, IRequest<TResponse>;

with the above code, the typed dictionary approach would work.

Note: for the source generated indices to work with multiple projects, don't generate them, but use the TypedDictionary approach to generate numbers, something like this:

    public static class TypeIndex
    {
        private static int _typeIndex = -1;

        private static class TypeKey<TRequest> where TRequest : class
        {
            // ReSharper disable once StaticMemberInGenericType
            internal static readonly int Id = Interlocked.Increment(ref _typeIndex);
        }

        public static int GetOrAdd<TRequest>() where TRequest : class => TypeKey<TRequest>.Id;
    }
    public interface ITypedIndex
    {
        static abstract int TypeIndex { get; }
    }
// Example
    public sealed class Request1 : IRequest<string>, ITypedIndex
    {
        public string Value { get; }

        public Request1(string value)
        {
            Value = value;
        }

        public static int TypeIndex => Interfaces.TypeIndex.GetOrAdd<Request1>();
    }

the static abstract interface only exists here as a marker interface so the static property needs to be used, in the most basic interface implementation it is not used, but could be, then one indirection via the typed dictionaries could be removed.

The upside of this approach in general is that an incremental source generator will probably work better as not everything is in one source generated mediator implementation.

@martinothamar
Copy link
Owner Author

Hey, thanks for the update!

Anyway, the only thing we can do is limit the possible type checks for the different IRequest types based on TResponse, hoping that they are usually different, i.e. not all are string.

Is this because TResponse is generic? JIT only applies monomorphization the the type parameter is a value type right, so the branches won't be eliminated even then? From some quick tests here switch/if'ing over the incoming request, regardless of if its a generic type parameter or not, the latency scales with the number of messages

The other solution, I'm actually considering (for said different project) is having different interfaces

Yeah this is tempting, I even tried it out a bit in the earliest versions of this library, but opted for the better ergonomics...
I've also ran into these limitations around static abstract members in other projects..
Another set of APIs I'll have to think about is the Send(object message) ones. One use-case I've seen for this is integration with messaging bus'es, like an Azure Service Bus receiver for example which deserializes it into an instance and publishes as notification. So will have to have a fallback to some slower path for those APIs still

Note: for the source generated indices to work with multiple projects, don't generate them, but use the TypedDictionary approach

Yeah this was the first problem I was anticipating. Example code looks promising, looking forward to experiment in this direction.

And as you suggest I should rewrite the source generation in a way that more of the work can be done in this new source generator, making everything more incremental and less hacky.

Unfortunately I haven't had as much time to spend doing open source work this last year or so as I'd like, so things have been moving slow. But I have managed to set aside a lot more time during summer, so I expect to get more done on this and other issues that have come up requiring breaking changes. So I think the 3.0 release will be pretty big. As soon as I have time to scope out the next release I'll add some information and context to the README and start work on this, which is probably gonna be when we get closer to May/June

@Tornhoof
Copy link
Contributor

I think, for .NET 8+, @mgravell inadvertently solved this problem here too, with his experiments in DapperLib/Dapper#1909

He's trying to rework the dapper API for aot without large refactoring of the User's source code. His current approach uses interceptors to replace specific code snippets at compile time. These interceptors are code generated.

See dotnet/csharplang#7009 for more information.

The Idea is basically the following:

  • identify each Mediator call via src gen
  • create interceptor location impl via src gen for each invocation
  • Cast and call the fully typed Send<TRequest, TResponse> Mediator impl.
  • See my previous post about how we then can get rid of the slow Switch/Case

As far as I know, the current impl for interceptors do not need to annotate the original location, but works ad-hoc.

@martinothamar
Copy link
Owner Author

Yeah it looks like it simplifies a lot. I think as long as we include the attribute in source gen, we can use older frameworks than .NET 8 as well, people just have to use the .NET 8 SDK? We support .NET Standard now but it is tempting to just say .NET 6+ as that is what is "in support" currently.

If we go this route, I have the follow thoughts:

  • Do source generation per project (this solves other issues as well) - so in project A we will generate an implementation class which has direct handlers for all the messages defined in project A
  • Since we now control the call site as well, when we encounter mediator.Send(new Req1()), we can just find the concrete implementation of IMediator for the project containing Req1, cast and dispatch directly

In this design we could still have the current packages (Mediator.Abstractions and Mediator.SourceGenerator), they would just be added to all projects. So in fact maybe just have 1 NuGet package which contain both the abstractions and includes the analyzer/sourcegenerator?

@martinothamar
Copy link
Owner Author

Well.. still need to think about AddMediator and DI... but that will be my goal atleast, to separate sourcegen and still having low overhead dispatch

@Tornhoof
Copy link
Contributor

I think you're correct with your assumptions about simply using the .NET 8 SDK for older versions.

I think the approach to generate per project should work and is a lot more incremental than doing one big type.

@Dreamescaper
Copy link

I can see you're combining your Provider with a CompilationProvider.
That causes Parse method, along with all the assemblies scanning, to be invoked on every Compilation update - which is basically on every keystroke.

I have encountered the same issue in my Source Generator, and workarounded it by providing a custom EqualityComparer, which ignores the Compilation part.
I'm not really sure how bad this workaround is, to be fair, but it seems to work fine.

I have created a discussion in Roslyn repo, maybe someone suggests anything useful:
dotnet/roslyn#74001

@martinothamar
Copy link
Owner Author

Interesting! If all that runs on every keystroke you'd think you'd notice it when developing stuff. It takes about 15ms for 1 pass on one of my machines. I have been neglecting this part of the code some, though @TimothyMakkison brought some good improvements in #113

@Dreamescaper
Copy link

As far as I know, source generators are async and do not affect typing, so you'd only notice an increased CPU load. But I did check it - it does execute on every input.

#113 made things a bit better that your model is cacheble now, i.e. the output of Parse method. context.RegisterSourceOutput is only invoked if that model has any changes.
But the Compilation is never cacheable, it is updated all the time, therefore Parse method is executed all the time as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants