Striding loop performance #938

jackmott · 2016-02-07T15:23:11Z

System.numerics.vectors exposes a SIMD enhanced Vector classes. Using VS2015 Update 1, latest versions of .NET framework and F# and System.numerics.vectors the performance of System.Numerics is worse than not using it at all, for instance:

     let sumVectorLoop =
            let mutable total = Vector<int>.Zero
            for i in 0  .. COUNT/8-1 do
                total <- total + vecArray.[i]
            total

Is slower than the same operation on an array of integers:

     let sumsLoop =
            let mutable total = 0;
            for i in 0 .. COUNT - 1 do
                total <- total + numsArray.[i]
            total

I have confirmed that Vector.isHardwareAccelerated reports as true. I have confirmed that equivalent code in C# runs ~2x faster for the Vector approach. Interestingly, using Array.reduce on the vector array is faster than the imperative loop, which is the opposite of working with an array of ints, suggesting something may be amiss:

let sumVectorReduce =
        Array.reduce (fun a e -> a + e)  vecArray

The text was updated successfully, but these errors were encountered:

dsyme · 2016-02-07T22:27:51Z

Could you please give the exact code in C#? There must be a difference in F# and C# code generation which is tripping up the optimizations applied by the RyuJIT.

dsyme · 2016-02-07T22:30:16Z

BTW does anyone know the right people to contact on the RyuJIT team? They may well want to fix this there rather than fiddling with F# codegen, as whatever needs fixing may lead to more general and more robust loop optimizations.

jackmott · 2016-02-07T22:31:37Z

Here you go

 const int COUNT = 10000000;
            int[] ints = new int[COUNT];
            for (int i = 0; i < COUNT-1;i++)
            {
                ints[i] = i;
            }

            Vector<int>[] vectors = new Vector<int>[COUNT / 8];
            for (int i = 0; i < COUNT / 8 - 1; i = i + 8)
            {
                vectors[i] = new Vector<int>(ints, i);
            }

            Stopwatch sw = new Stopwatch();
            sw.Start();
            Vector<int> sum = Vector<int>.Zero;

            for (int i = 0; i < COUNT/8-1;i++)
            {
                sum = sum + vectors[i];
            }

            sw.Stop();
            Console.WriteLine("Sum vector:" + sum + " time:" + sw.ElapsedMilliseconds);


            sw.Restart();
            int sumi = 0;
            for (int i = 0; i < COUNT-1; i++)
            {
                sumi = sumi + ints[i];
            }
            Console.WriteLine("Sum integers:" + sum + " time:" + sw.ElapsedMilliseconds);
            Console.ReadLine();

jackmott · 2016-02-08T01:47:01Z

Here is the IL of each vector summing loop as reported by ILSpy:

C#:

IL_004e: call valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<!0> valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<int32>::get_Zero()
        IL_0053: stloc.2
        IL_0054: ldc.i4.0
        IL_0055: stloc.s i
        IL_0057: br.s IL_006e
        // loop start (head: IL_006e)
            IL_0059: ldloc.2
            IL_005a: ldloc.1
            IL_005b: ldloc.s i
            IL_005d: ldelem.any valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<int32>
            IL_0062: call valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<!0> valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<int32>::op_Addition(valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<!0>, valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<!0>)
            IL_0067: stloc.2
            IL_0068: ldloc.s i
            IL_006a: ldc.i4.1
            IL_006b: add
            IL_006c: stloc.s i

            IL_006e: ldloc.s i
            IL_0070: ldc.i4 1250000
            IL_0075: blt.s IL_0059
        // end loop

F#

IL_006a: ldloc.3
        IL_006b: stloc.1
        IL_006c: nop
        IL_006d: call valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<!0> valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<int32>::get_Zero()
        IL_0072: stloc.s total
        IL_0074: ldc.i4.0
        IL_0075: stloc.s i
        IL_0077: ldc.i4 10000000
        IL_007c: ldc.i4.8
        IL_007d: div
        IL_007e: stloc.2
        IL_007f: ldloc.2
        IL_0080: ldloc.s i
        IL_0082: blt.s IL_00a2
        // loop start (head: IL_0084)
            IL_0084: ldloc.s total
            IL_0086: ldloc.1
            IL_0087: ldloc.s i
            IL_0089: ldelem.any valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<int32>
            IL_008e: call valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<!0> valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<int32>::op_Addition(valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<!0>, valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<!0>)
            IL_0093: stloc.s total
            IL_0095: ldloc.s i
            IL_0097: ldc.i4.1
            IL_0098: add
            IL_0099: stloc.s i
            IL_009b: ldloc.s i
            IL_009d: ldloc.2
            IL_009e: ldc.i4.1
            IL_009f: add
            IL_00a0: bne.un.s IL_0084
        // end loop

mellinoe · 2016-02-08T18:52:51Z

@CarolEidt works on the SIMD feature on RyuJIT. Carol, do you recognize any differences in the C#/F# IL that would lead to significantly worse performance?

On the usage side, I did want to point out that we don't usually recommend folks actually store their data in an array of Vector<int>'s. The idiomatic way to use the library would be to keep everything stored in your original int[] array, and construct Vector items directly from that. So your vectorized loop would be something like:

for (int i = 0; i < COUNT / Vector<int>.Count; i += Vector<int>.Count)
{
    sum = sum + new Vector<int>(ints, i);
}

Another thing to note is that your scalar and vector passes aren't necessarily computing the same thing. To get the sum of a single array's values efficiently, we'd need to something like a HorizontalAdd method (which would compute the sum of an individual Vector<T>'s elements. Right now you are left with a Vector<int> at the end of your vector computation, which you'd need to manually add together to get the same result as the scalar version.

jackmott · 2016-02-08T20:57:31Z

The idiomatic approach seems to make use of an enumerator for the loop which kills performance:

  let sumVectorLoop =
        let mutable total = Vector<int>.Zero
        for i in 0  .. 8 .. COUNT-1 do
            total <- total + new Vector<int>(numsArray,i)
        total

if I get rid of the .. 8 .. in the for loop and do that by hand, or use a for loop that counts by one and then multiply the index, a do while loop is generated instead and performance is in line with C#:

 let sumVectorLoop =
        let mutable total = Vector<int>.Zero
        for i in 0  .. COUNT/8-1 do
            total <- total + new Vector<int>(numsArray,i*8)
        total

If there are good reasons why that style of for loop becomes an enumerator and this is just reflecting my lack of practice with the language feel free to close. Thanks for the help everyone.

dsyme · 2016-02-09T01:08:53Z

@jackmott There is no fundamentally good reason why loops like that use an enumerator, except that the necessary optimizations just aren't done in the F# compiler for striding loops. It would be awesome to have this fixed.

manofstick · 2016-02-09T01:15:24Z

@dsyme,

Ha, beat me to it by a few minutes :-)

In DetectAndOptimizeForExpression there is a check that is only optimizing for step of 1 or -1.

And then an appropriate change to mkFastForLoop would be required.

dsyme · 2016-02-09T09:21:39Z

@manofstick And then real care near MinInt and MaxInt. (If necessary a dynamic check on entry to the loop could be used in those cases)

dsyme · 2016-07-07T13:27:43Z

I started to take a look at this, and it's not easy.

One problem is that the F# "FastIntegerLoop" TAST construct can't represent striding loops. It could be extended, but this has to be done with care since the construct can (and does) occur in optimization information and the representations of inlined functions. Ideally care should be taken that DLLs that generate this new construct be consumable by down-level F# compilers, but that's hard to arrange.

Another problem is that "F#-style loops" for x in n .. step .. m are currently generated using an "bne" branch-not-equals instruction at the end condition. This is done because m might be MaxInt. But this won't work for striding loops - a less-than operation is needed. But a less-than operation doesn't work when m is above MaxInt - step since a wrap-around occurs.

Perhaps we could just sacrifice semantics for striding loops near the maxint condition - though whatever we do parity with C# is really needed. Perhaps I need to look more closely at C# code generation for these cases

markpattison · 2020-03-07T22:08:48Z

I hit this issue today. Below is a complete (.NET Core 3.1 + BenchmarkDotNet) console program illustrating the problem.


open BenchmarkDotNet.Attributes
open BenchmarkDotNet.Running

type ForLoopComparison ()=
    let n = 1000000

    [<Benchmark>]
    member this.ForTo() =
        let mutable acc = 0

        for i = 1 to (n / 2) do
            acc <- acc + (2 * i)
        acc

    [<Benchmark>]
    member this.ForIn() =
        let mutable acc = 0

        for i in 1 .. (n / 2) do
            acc <- acc + (2 * i)
        acc

    [<Benchmark>]
    member this.ForInStep() =
        let mutable acc = 0

        for i in 2 .. 2 .. n do
            acc <- acc + i
        acc

[<EntryPoint>]
let main argv =
    BenchmarkRunner.Run<ForLoopComparison>() |> ignore
    0

The three methods give identical results. Benchmarked times on my PC are 235µs for the first two methods but 2728µs for the last one.

cartermp · 2020-03-09T13:29:29Z

@dsyme given the renewed focus on slicing (and its syntax) for .NET 5, what do you think of revisiting this? How common do you feel this scenario is for numeric programming in general?

dsyme · 2020-03-23T14:23:44Z

@dsyme given the renewed focus on slicing (and its syntax) for .NET 5, what do you think of revisiting this? How common do you feel this scenario is for numeric programming in general?

Yes, we should fix this, definitely.

dsyme added Bug Need More Info Impact-High (Internal MS Team use only) Describes an issue with extreme impact on existing code. and removed Bug labels Feb 7, 2016

dsyme added Impact-Medium (Internal MS Team use only) Describes an issue with moderate impact on existing code. and removed Impact-High (Internal MS Team use only) Describes an issue with extreme impact on existing code. labels Feb 9, 2016

dsyme added the good first issue label Jul 5, 2016

dsyme changed the title ~~System.numerics.vectors performance~~ Striding loop performance Jul 5, 2016

dsyme added Feature Request and removed Need More Info Impact-Medium (Internal MS Team use only) Describes an issue with moderate impact on existing code. labels Apr 29, 2017

dsyme added Tenet-Performance and removed good first issue labels Jun 20, 2017

dsyme added Feature Improvement and removed Feature Request labels Aug 25, 2017

cartermp added this to the Unknown milestone Aug 25, 2018

cartermp modified the milestones: Feature requests and improvements, Backlog Oct 28, 2020

KevinRansom removed the Tenet-Performance label Jun 30, 2021

dsyme added Area-Compiler-Optimization The F# optimizer, release code gen etc. and removed Area-Compiler labels Mar 31, 2022

vzarytovskii added this to F# Compiler and Tooling Jun 17, 2022

vzarytovskii moved this to Not Planned in F# Compiler and Tooling Jun 17, 2022

albert-du mentioned this issue Jul 26, 2022

Optimize integer for loop code gen #13573

Closed

KevinRansom closed this as completed Jan 4, 2024

github-project-automation bot moved this from New to Done in F# Compiler and Tooling Jan 4, 2024

vzarytovskii reopened this Jan 4, 2024

github-project-automation bot moved this from Done to In Progress in F# Compiler and Tooling Jan 4, 2024

brianrourkeboll mentioned this issue Feb 20, 2024

Better integral range lowering: start..finish, start..step..finish #16650

Merged

3 tasks

vzarytovskii closed this as completed in #16650 Mar 6, 2024

github-project-automation bot moved this from In Progress to Done in F# Compiler and Tooling Mar 6, 2024

brianrourkeboll mentioned this issue Jul 1, 2024

Add dedicated method for ranges translation in CE fsharp/fslang-suggestions#1116

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Striding loop performance #938

Striding loop performance #938

jackmott commented Feb 7, 2016

dsyme commented Feb 7, 2016

dsyme commented Feb 7, 2016

jackmott commented Feb 7, 2016

jackmott commented Feb 8, 2016

mellinoe commented Feb 8, 2016

jackmott commented Feb 8, 2016

dsyme commented Feb 9, 2016

manofstick commented Feb 9, 2016

dsyme commented Feb 9, 2016

dsyme commented Jul 7, 2016

markpattison commented Mar 7, 2020

cartermp commented Mar 9, 2020

dsyme commented Mar 23, 2020

Striding loop performance #938

Striding loop performance #938

Comments

jackmott commented Feb 7, 2016

dsyme commented Feb 7, 2016

dsyme commented Feb 7, 2016

jackmott commented Feb 7, 2016

jackmott commented Feb 8, 2016

mellinoe commented Feb 8, 2016

jackmott commented Feb 8, 2016

dsyme commented Feb 9, 2016

manofstick commented Feb 9, 2016

dsyme commented Feb 9, 2016

dsyme commented Jul 7, 2016

markpattison commented Mar 7, 2020

cartermp commented Mar 9, 2020

dsyme commented Mar 23, 2020