Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Striding loop performance #938

Closed
jackmott opened this issue Feb 7, 2016 · 13 comments · Fixed by #16650
Closed

Striding loop performance #938

jackmott opened this issue Feb 7, 2016 · 13 comments · Fixed by #16650
Labels
Area-Compiler-Optimization The F# optimizer, release code gen etc. Feature Improvement
Milestone

Comments

@jackmott
Copy link
Contributor

jackmott commented Feb 7, 2016

System.numerics.vectors exposes a SIMD enhanced Vector classes. Using VS2015 Update 1, latest versions of .NET framework and F# and System.numerics.vectors the performance of System.Numerics is worse than not using it at all, for instance:

     let sumVectorLoop =
            let mutable total = Vector<int>.Zero
            for i in 0  .. COUNT/8-1 do
                total <- total + vecArray.[i]
            total

Is slower than the same operation on an array of integers:

     let sumsLoop =
            let mutable total = 0;
            for i in 0 .. COUNT - 1 do
                total <- total + numsArray.[i]
            total

I have confirmed that Vector.isHardwareAccelerated reports as true. I have confirmed that equivalent code in C# runs ~2x faster for the Vector approach. Interestingly, using Array.reduce on the vector array is faster than the imperative loop, which is the opposite of working with an array of ints, suggesting something may be amiss:

let sumVectorReduce =
        Array.reduce (fun a e -> a + e)  vecArray
@dsyme
Copy link
Contributor

dsyme commented Feb 7, 2016

Could you please give the exact code in C#? There must be a difference in F# and C# code generation which is tripping up the optimizations applied by the RyuJIT.

@dsyme dsyme added Bug Need More Info Impact-High (Internal MS Team use only) Describes an issue with extreme impact on existing code. and removed Bug labels Feb 7, 2016
@dsyme
Copy link
Contributor

dsyme commented Feb 7, 2016

BTW does anyone know the right people to contact on the RyuJIT team? They may well want to fix this there rather than fiddling with F# codegen, as whatever needs fixing may lead to more general and more robust loop optimizations.

@jackmott
Copy link
Contributor Author

jackmott commented Feb 7, 2016

Here you go

 const int COUNT = 10000000;
            int[] ints = new int[COUNT];
            for (int i = 0; i < COUNT-1;i++)
            {
                ints[i] = i;
            }

            Vector<int>[] vectors = new Vector<int>[COUNT / 8];
            for (int i = 0; i < COUNT / 8 - 1; i = i + 8)
            {
                vectors[i] = new Vector<int>(ints, i);
            }

            Stopwatch sw = new Stopwatch();
            sw.Start();
            Vector<int> sum = Vector<int>.Zero;

            for (int i = 0; i < COUNT/8-1;i++)
            {
                sum = sum + vectors[i];
            }

            sw.Stop();
            Console.WriteLine("Sum vector:" + sum + " time:" + sw.ElapsedMilliseconds);


            sw.Restart();
            int sumi = 0;
            for (int i = 0; i < COUNT-1; i++)
            {
                sumi = sumi + ints[i];
            }
            Console.WriteLine("Sum integers:" + sum + " time:" + sw.ElapsedMilliseconds);
            Console.ReadLine();

@jackmott
Copy link
Contributor Author

jackmott commented Feb 8, 2016

Here is the IL of each vector summing loop as reported by ILSpy:

C#:

IL_004e: call valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<!0> valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<int32>::get_Zero()
        IL_0053: stloc.2
        IL_0054: ldc.i4.0
        IL_0055: stloc.s i
        IL_0057: br.s IL_006e
        // loop start (head: IL_006e)
            IL_0059: ldloc.2
            IL_005a: ldloc.1
            IL_005b: ldloc.s i
            IL_005d: ldelem.any valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<int32>
            IL_0062: call valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<!0> valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<int32>::op_Addition(valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<!0>, valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<!0>)
            IL_0067: stloc.2
            IL_0068: ldloc.s i
            IL_006a: ldc.i4.1
            IL_006b: add
            IL_006c: stloc.s i

            IL_006e: ldloc.s i
            IL_0070: ldc.i4 1250000
            IL_0075: blt.s IL_0059
        // end loop

F#

IL_006a: ldloc.3
        IL_006b: stloc.1
        IL_006c: nop
        IL_006d: call valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<!0> valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<int32>::get_Zero()
        IL_0072: stloc.s total
        IL_0074: ldc.i4.0
        IL_0075: stloc.s i
        IL_0077: ldc.i4 10000000
        IL_007c: ldc.i4.8
        IL_007d: div
        IL_007e: stloc.2
        IL_007f: ldloc.2
        IL_0080: ldloc.s i
        IL_0082: blt.s IL_00a2
        // loop start (head: IL_0084)
            IL_0084: ldloc.s total
            IL_0086: ldloc.1
            IL_0087: ldloc.s i
            IL_0089: ldelem.any valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<int32>
            IL_008e: call valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<!0> valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<int32>::op_Addition(valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<!0>, valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<!0>)
            IL_0093: stloc.s total
            IL_0095: ldloc.s i
            IL_0097: ldc.i4.1
            IL_0098: add
            IL_0099: stloc.s i
            IL_009b: ldloc.s i
            IL_009d: ldloc.2
            IL_009e: ldc.i4.1
            IL_009f: add
            IL_00a0: bne.un.s IL_0084
        // end loop

@mellinoe
Copy link

mellinoe commented Feb 8, 2016

@CarolEidt works on the SIMD feature on RyuJIT. Carol, do you recognize any differences in the C#/F# IL that would lead to significantly worse performance?

On the usage side, I did want to point out that we don't usually recommend folks actually store their data in an array of Vector<int>'s. The idiomatic way to use the library would be to keep everything stored in your original int[] array, and construct Vector items directly from that. So your vectorized loop would be something like:

for (int i = 0; i < COUNT / Vector<int>.Count; i += Vector<int>.Count)
{
    sum = sum + new Vector<int>(ints, i);
}

Another thing to note is that your scalar and vector passes aren't necessarily computing the same thing. To get the sum of a single array's values efficiently, we'd need to something like a HorizontalAdd method (which would compute the sum of an individual Vector<T>'s elements. Right now you are left with a Vector<int> at the end of your vector computation, which you'd need to manually add together to get the same result as the scalar version.

@jackmott
Copy link
Contributor Author

jackmott commented Feb 8, 2016

The idiomatic approach seems to make use of an enumerator for the loop which kills performance:

  let sumVectorLoop =
        let mutable total = Vector<int>.Zero
        for i in 0  .. 8 .. COUNT-1 do
            total <- total + new Vector<int>(numsArray,i)
        total    

if I get rid of the .. 8 .. in the for loop and do that by hand, or use a for loop that counts by one and then multiply the index, a do while loop is generated instead and performance is in line with C#:

 let sumVectorLoop =
        let mutable total = Vector<int>.Zero
        for i in 0  .. COUNT/8-1 do
            total <- total + new Vector<int>(numsArray,i*8)
        total    

If there are good reasons why that style of for loop becomes an enumerator and this is just reflecting my lack of practice with the language feel free to close. Thanks for the help everyone.

@dsyme dsyme added Impact-Medium (Internal MS Team use only) Describes an issue with moderate impact on existing code. and removed Impact-High (Internal MS Team use only) Describes an issue with extreme impact on existing code. labels Feb 9, 2016
@dsyme
Copy link
Contributor

dsyme commented Feb 9, 2016

@jackmott There is no fundamentally good reason why loops like that use an enumerator, except that the necessary optimizations just aren't done in the F# compiler for striding loops. It would be awesome to have this fixed.

@manofstick
Copy link
Contributor

@dsyme,

Ha, beat me to it by a few minutes :-)

In DetectAndOptimizeForExpression there is a check that is only optimizing for step of 1 or -1.

And then an appropriate change to mkFastForLoop would be required.

@dsyme
Copy link
Contributor

dsyme commented Feb 9, 2016

@manofstick And then real care near MinInt and MaxInt. (If necessary a dynamic check on entry to the loop could be used in those cases)

@dsyme dsyme changed the title System.numerics.vectors performance Striding loop performance Jul 5, 2016
@dsyme
Copy link
Contributor

dsyme commented Jul 7, 2016

I started to take a look at this, and it's not easy.

One problem is that the F# "FastIntegerLoop" TAST construct can't represent striding loops. It could be extended, but this has to be done with care since the construct can (and does) occur in optimization information and the representations of inlined functions. Ideally care should be taken that DLLs that generate this new construct be consumable by down-level F# compilers, but that's hard to arrange.

Another problem is that "F#-style loops" for x in n .. step .. m are currently generated using an "bne" branch-not-equals instruction at the end condition. This is done because m might be MaxInt. But this won't work for striding loops - a less-than operation is needed. But a less-than operation doesn't work when m is above MaxInt - step since a wrap-around occurs.

Perhaps we could just sacrifice semantics for striding loops near the maxint condition - though whatever we do parity with C# is really needed. Perhaps I need to look more closely at C# code generation for these cases

@dsyme dsyme added Feature Request and removed Need More Info Impact-Medium (Internal MS Team use only) Describes an issue with moderate impact on existing code. labels Apr 29, 2017
@cartermp cartermp added this to the Unknown milestone Aug 25, 2018
@markpattison
Copy link
Contributor

I hit this issue today. Below is a complete (.NET Core 3.1 + BenchmarkDotNet) console program illustrating the problem.


open BenchmarkDotNet.Attributes
open BenchmarkDotNet.Running

type ForLoopComparison ()=
    let n = 1000000

    [<Benchmark>]
    member this.ForTo() =
        let mutable acc = 0

        for i = 1 to (n / 2) do
            acc <- acc + (2 * i)
        acc

    [<Benchmark>]
    member this.ForIn() =
        let mutable acc = 0

        for i in 1 .. (n / 2) do
            acc <- acc + (2 * i)
        acc

    [<Benchmark>]
    member this.ForInStep() =
        let mutable acc = 0

        for i in 2 .. 2 .. n do
            acc <- acc + i
        acc

[<EntryPoint>]
let main argv =
    BenchmarkRunner.Run<ForLoopComparison>() |> ignore
    0

The three methods give identical results. Benchmarked times on my PC are 235µs for the first two methods but 2728µs for the last one.

@cartermp
Copy link
Contributor

cartermp commented Mar 9, 2020

@dsyme given the renewed focus on slicing (and its syntax) for .NET 5, what do you think of revisiting this? How common do you feel this scenario is for numeric programming in general?

@dsyme
Copy link
Contributor

dsyme commented Mar 23, 2020

@dsyme given the renewed focus on slicing (and its syntax) for .NET 5, what do you think of revisiting this? How common do you feel this scenario is for numeric programming in general?

Yes, we should fix this, definitely.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-Compiler-Optimization The F# optimizer, release code gen etc. Feature Improvement
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

8 participants