Optimization for large loops. #74171
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch

Issue Details

Description

```csharp
int[] x = new int[N];
for (int i = 0; i < x.Length; i++) x[i]++;
```

Consider the code above. With the current JITting, this code gets assembled to N iterations of the loop. However, given how simple the operation is, there is potential for two things (among others) to speed up this loop's performance: loop unrolling & vectorization.

Consider just the loop part, where `N = 10000` and `x = int[N]`:

```csharp
for (int i = 0; i < N; i++) x[i]++;
```

The following assembly is generated by RyuJIT:

```asm
push rbp
mov rbp, rsp
xor eax, eax
test rdi, rdi
je SHORT G_M31042_IG05
cmp dword ptr [rdi+8], 0x2710
jl SHORT G_M31042_IG05
G_M31042_IG03:
movsxd rsi, eax
lea rsi, bword ptr [rdi+4*rsi+16]
inc dword ptr [rsi]
inc eax
cmp eax, 0x2710
jl SHORT G_M31042_IG03
jmp SHORT G_M31042_IG06
G_M31042_IG05:
cmp eax, dword ptr [rdi+8]
jae SHORT G_M31042_IG07
movsxd rsi, eax
lea rsi, bword ptr [rdi+4*rsi+16]
inc dword ptr [rsi]
inc eax
cmp eax, 0x2710
jl SHORT G_M31042_IG05
G_M31042_IG06:
pop rbp
ret
G_M31042_IG07:
call [CORINFO_HELP_RNGCHKFAIL]
int3
```

Same code in C++:

```cpp
for (int i = 0; i < N; i++) x[i]++;
```

Assembly generated by GCC:

```asm
movdqa xmm1, XMMWORD PTR .LC0[rip]
lea rax, [rdi+40000]
.L2:
movdqu xmm0, XMMWORD PTR [rdi]
add rdi, 16
paddd xmm0, xmm1
movups XMMWORD PTR [rdi-16], xmm0
cmp rax, rdi
jne .L2
ret
.LC0:
.long 1
.long 1
.long 1
.long 1
```

If I use `x.Length` instead of `N`... C# code:

```csharp
int length = x.Length;
for (int i = 0; i < length; i++) x[i]++;
```

Generated assembly:

```asm
push rbp
mov rbp, rsp
mov eax, dword ptr [rdi+8]
xor esi, esi
test eax, eax
jle SHORT G_M31042_IG06
test eax, eax
jl SHORT G_M31042_IG05
G_M31042_IG03:
movsxd rdx, esi
lea rdx, bword ptr [rdi+4*rdx+16]
inc dword ptr [rdx]
inc esi
cmp esi, eax
jl SHORT G_M31042_IG03
jmp SHORT G_M31042_IG06
G_M31042_IG05:
movsxd rdx, esi
lea rdx, bword ptr [rdi+4*rdx+16]
inc dword ptr [rdx]
inc esi
cmp esi, eax
jl SHORT G_M31042_IG05
G_M31042_IG06:
pop rbp
ret
```

Meanwhile, C++:

```cpp
int length = sizeof(x) / sizeof(x[0]);
for (int i = 0; i < length; i++) x[i]++;
```

Generated assembly:

```asm
movq xmm1, QWORD PTR .LC0[rip]
movq xmm0, QWORD PTR [rdi]
paddd xmm0, xmm1
movq QWORD PTR [rdi], xmm0
ret
.LC0:
.long 1
.long 1
```

Analysis

Now, I understand that the .NET Runtime aims to be safer (ensuring that the index isn't out of range, etc.), but it seems that there is a clear performance winner. In both cases, the code is just as safe. We should be generating performant assembly for such loops. They're used everywhere, in almost every project. Optimized loops would mean a massive speed boost for all applications.
cc @BruceForstall @dotnet/jit-contrib.
The loops in your examples are already pretty efficient as is (well, there are some things we might improve), especially for the case where you use x.Length. We have existing feature requests for partial unrolling and vectorization, but we're not really interested in them at this point - most of the BCL APIs are already vectorized by hand. Autovectorization is usually pretty fragile even in C++ compilers, and it easily stops working in anything more complicated than incrementing all elements.
@TheBlackPlague Thanks for the issue. We know there are many classical compiler loop optimizations that we are missing, and we're trying to implement the most impactful ones (to "real-world" C# code) as resources allow. As EgorBo states, partial unrolling and auto-vectorization haven't met those criteria yet. I'm going to go ahead and close this, since we're well aware of opportunities for improvement in this area.
@BruceForstall & @EgorBo, I would like to discuss this further if possible. I will agree to the final decision made by you guys, as you're the team behind this project.

Can we collectively agree that C# can be a valuable language for machine learning? I know there are languages like Python that have a lot of ML frameworks, but I imagine that Microsoft does intend to make C# a language that has potential for both model inference & training. Can we collectively agree that machine learning in C# is "real-world" C# code?

While it's super convenient to do training on the GPU, some models are written to be used on the CPU. For that reason, having access to SIMD is very important, as inference would be slow without it. We've all seen and heard that neural network models work with floats (float32/float64) to maintain their high accuracy; however, in some instances where complete accuracy isn't necessary, experienced ML researchers will use quantization. See: Medium Article on Quantization in Deep Learning

The rough idea is to convert floating point values to values like integers, shorts, and even sbytes. One good reason is that integer operations are usually faster on the CPU. Another good reason is the size of the data. On AVX2 (not going to talk about AVX-512 right now as it isn't supported in .NET yet), the vector register size is 256 bits (or, for convenience: 32 bytes). Thus, in parallel, one can operate on 8 integers, 16 shorts (int16), or 32 sbytes (int8). The cost is roughly the same as doing the operation once on that type.

The point is that yes, one can write their own vectorization code, but not every ML researcher will know about vectorization and how to use it properly. Moreover, not everyone will know about each architecture's (AVX or SSE) specific instructions to optimize their algorithms (or pipelines) further. Other than that, many more cases, such as games in C#, could use vectorization.
In certain games, immense amounts of calculations need to be done at the same time. We talk about how Unity should switch over to .NET Core and give users access to the new APIs, but why would they when .NET Core's JIT compiler performs worse than whatever they've built? Not to mention, not every aspiring game engine developer would want to work in both C# and C++. Furthermore, not every new game developer will know about vectorization.

No offense, but I believe it's incredibly incorrect to think the only "real-world" use-case of C# is for databases and APIs. C# is a beautiful language with the best syntax I've seen in a language - precise, concise, and expressive. If you look outside the bubble of databases and APIs, you'll see people working tirelessly to create marvelous things in this language, only to at times be beaten by C++'s performance. We should be working towards making C# as fast as C++, and if not C++ due to safety requirements, at least Rust should be a feasible target.

I understand that, due to the way the JIT is supposed to run (portable, with minimal runtime overhead), it's hard to do the multiple passes that compilers like GCC do. It's why I applaud everyone who has worked on RyuJIT - it's amazing. However, with NativeAOT compilation right around the corner, isn't this something that can happen there, for those that require said performance improvements? I urge you to consider this.
@TheBlackPlague Thank you for sharing your thoughts. Let me be clear: we want to continue to improve RyuJIT and .NET code generation in general to provide better and better performance, and to make continuous improvements in the core libraries as well as in C# libraries like ML.NET that also improve .NET performance. If you haven't seen them before, check out Stephen Toub's per-release performance improvement summary blog posts, e.g., https://devblogs.microsoft.com/dotnet/performance-improvements-in-net-6/.

We'll soon be starting planning for the work we'll choose to invest in for .NET 8. Loop, vectorization, and vector intrinsics improvements will be top-of-mind in those discussions. (You might be interested to see the start of AVX-512 support appearing: #73262.)

So far, with respect to vectorization, we've focused on providing the solid cross-platform underpinnings that allow us to generate these instructions, and have optimized core libraries routines to use them. We expect performance-sensitive shared library writers will generally have more incentive to optimize their code in this way than general app developers. To make it easier for all, we have
@BruceForstall Thank you very much for your response. I believe .NET 8 would be the right target for this. Knowing that it'll be one of the top things you consider for it is enough to satisfy me. Are these discussions going to be internal, or is there some place where a fellow like myself can tune in and possibly participate? Apologies if this is a bit off-topic.
Nobody on this team is saying this.
As mentioned above, we continue to invest heavily in enabling cross-platform vectorized operations, including most recently introducing Vector128 and Vector256, which allow you to write code for multiple architectures at the same time.
Thank you for responding. I apologize if that's not what was meant, but that's what I inferred from the text. If not auto-vectorization, can there be consideration of adding vectorized addition methods to Span?

The API I'm proposing:

```csharp
void Add<T>(Span<T> a, Span<T> b, Span<T> result) {
    // Code to unroll loop for variable size span and perform vectorized addition. Possibly intrinsic and filled in by the JIT?
}
```

These APIs would again only work for the types I mentioned. All in all, it would be a great addition to the Span API and give us all we need for simple auto-vectorization support. After all, auto-vectorization won't cover everything.
Our planning discussions are typically internal. We've thought about ways to make them external, but haven't done so. However, our plans are published as GitHub issues and/or projects.