Adding the new fastest mandelbrot implementation to benchmarks-game. #14287
Conversation
Also FYI. @mellinoe
@dotnet-bot test Windows_NT x64 perf
Thanks for updating. You should remove mandelbrot-4 in this change; we generally don't want to keep more than two variants of each benchmark, to avoid cluttering the reporting that tracks these. Once this gets merged, you'll also want to port it to the release/1.1.0 and release/2.0.0 branches, like #14094 and #14095. Since you're just touching files in these directories, simple cherry-picks should work. /cc @jorive
There is an issue with the benchmark games in that they also include JIT time, so vectorizing isn't as fast as it should be because the JIT startup for Vectors is longer :-/ I was hoping #14244 would alleviate this.
Will fix up the build errors shortly.
@JosephTremoulet, any particular reason why
@dotnet-bot test Windows_NT x64 perf
LGTM. It would be good to verify that an InnerIterationCount of 7 is still reasonable (our rule of thumb has been to try to make the duration reported by run-xunit-performance roughly 1000ns).
@JosephTremoulet I think you meant to say 1000ms, right?
Whoops! Yes, ms, good catch, thanks.
@JosephTremoulet, what is the correct way to validate that? Still not quite sure how to properly navigate BenchView.
+1
That is, I can see the numbers for the jobs, etc. I just don't know how to properly compare them for improvement/etc, since Mandelbrot 4 and Mandelbrot 7 are "separate" scenarios. I see 5619.01ms for Mandelbrot 4 and 6196.77ms for Mandelbrot 7 (duration), which seems backwards since 7 is measurably faster.
What we've been doing to validate the iteration count is just run the benchmark via run-xunit-performance.cmd locally. The point is just to make sure it's neither too fast-running to produce good measurements and profiles nor so long-running that it wastes lab resources. The different variants of each test had their iteration counts set independently, to make sure we're getting usable measurements for each. Yes, this means that comparing them to each other in BenchView is meaningless, but the goal there is to have usable measurements and look independently at their improvements/regressions over time, not to rank them against each other (which, of course, is what happens over at BenchmarksGame).
@JosephTremoulet, what hardware are the actual jobs run against? If you have an 8-core, 16-thread machine (or higher), this can seriously skew the results compared to a 4-core, 8-thread machine.
Not sure... @jorive?
Haswell machines with: 4 cores, 8 threads, 3.6GHz
7 seems to be fine for the iteration count.
Am I good to merge?
Yes. And please port to release/1.1.0 and release/2.0.0 afterward.
Port #14287 to release/2.0.0
FYI. @JosephTremoulet, @AndyAyersMS, @ViktorHofer, @danmosemsft
New fastest implementation. At this point, I'm pretty certain the only reason it isn't even faster is that the Q6600 processor the official benches use likely doesn't support the optimization where movups is as fast as movaps (it looks like that optimization was added in the subsequent microarchitecture).