Epic: ResizeProcessor performance improvements (Memory & CPU) #733

antonfirsov · 2018-10-10T23:06:53Z

I'm opening this issue to present and track my plan regarding the improvements we should implement in order to have a highly optimized ResizeProcessor for 1.0.

Goals

If we implement all the tasks, I expect that:

The memory usage of ResizeProcessor will drop dramatically (by a factor of 20x - 50x for typical images).
The single-threaded execution time of (non-companding) resize operations will be about 70% of System.Drawing's resize time, while our quality is better. (Current status: 120-130%.)

Tasks

(1) Implement new data types for SOA representation of Vector4 (or other 4 channel) buffers that could be used as SOA counterparts of AOS types IMemoryOwner<Vector4>, Buffer2D<Vector4>, and Span<Vector4>. Something like class BufferOf4Channels<T>, class BufferOf4Channels2D<T> and ref struct BufferSegmentOf4Channels<T>.
(2) Implement bulk packing methods in PixelOperations<TPixel> for the BufferSegmentOf4Channels<float>. Rgba32 packing/unpacking should be optimized the same way it's done for Span<Vector4> packing.
(3) Integrate the SRgb companding (Compress/Expand) operations into the APIs defined in (2).
(4) Optional CPU optimization. Optimize the implementation of (3) for Rgba32, using lookup tables.
(5) Replace all Vector4 AOS buffers with SOA counterparts in ResizeProcessor
(6) Memory optimization. Implement "Optimize memory consumption of ResizeProcessor" (#642), preferably once (5) is fully implemented. Parallelization should be dropped.
(7) CPU optimization. Implement vectorized convolution in ResizeKernel, using Vector4 by default, and Vector<float> if AVX2 is detected (Vector<float>.Count == 8)
(8) Optional CPU optimization. Vectorized Premultiply and UnPremultiply (both Vector4 and AVX2 variants)
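
To make tasks (1), (5) and (7) more concrete, here is a minimal sketch of the idea, not the planned implementation. The class `SoaSketch` and its methods `ToPlanes` and `Convolve` are hypothetical names invented for illustration and are not part of ImageSharp. The sketch assumes a runtime where `Vector<float>` can be constructed from a `ReadOnlySpan<float>`. It splits an AOS `Vector4` buffer into four per-channel float planes, then computes one kernel tap sum over a single plane with `Vector<float>`.

```csharp
using System;
using System.Numerics;

// Hypothetical illustration only: none of these members exist in ImageSharp.
public static class SoaSketch
{
    // Splits an AOS buffer (X, Y, Z, W interleaved per pixel) into four
    // SOA planes: planes[0] holds every X, planes[1] every Y, and so on.
    public static float[][] ToPlanes(ReadOnlySpan<Vector4> aos)
    {
        var planes = new float[4][];
        for (int c = 0; c < 4; c++)
        {
            planes[c] = new float[aos.Length];
        }

        for (int i = 0; i < aos.Length; i++)
        {
            planes[0][i] = aos[i].X;
            planes[1][i] = aos[i].Y;
            planes[2][i] = aos[i].Z;
            planes[3][i] = aos[i].W;
        }

        return planes;
    }

    // Dot product of kernel weights with a window of one channel plane.
    // The window must contain at least weights.Length values.
    public static float Convolve(ReadOnlySpan<float> weights, ReadOnlySpan<float> window)
    {
        int lanes = Vector<float>.Count; // 4 with SSE, 8 with AVX2
        Vector<float> acc = Vector<float>.Zero;
        int i = 0;

        // Vectorized part: multiply and accumulate `lanes` values at a time.
        for (; i <= weights.Length - lanes; i += lanes)
        {
            acc += new Vector<float>(weights.Slice(i, lanes))
                 * new Vector<float>(window.Slice(i, lanes));
        }

        // Horizontal sum of the accumulator lanes.
        float sum = Vector.Dot(acc, Vector<float>.One);

        // Scalar tail for the remaining values.
        for (; i < weights.Length; i++)
        {
            sum += weights[i] * window[i];
        }

        return sum;
    }
}
```

With an AOS layout the same tap sum multiplies one `Vector4` pixel by one scalar weight per step, so it cannot use lanes wider than four floats. The plane layout is what would let the AVX2 path process eight values of a single channel at once.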

Outlook

~~If we manage to implement these points, the bottleneck would be the Rgba32 <-> 4 x float unpacking/repacking operation. If we can optimize it, we can reach even more superior performance.~~ Update: Done in #742.

Alternatively, we can try implementing fixed-point math using Vector<ushort>.
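
A rough sketch of what fixed-point math with `Vector<ushort>` could look like for a vertical pass is shown here. Everything in it is an assumption made for illustration: the class `FixedPointSketch`, the method `WeightedSum`, the Q8 weight format (256 represents 1.0), and the restriction to non-negative weights that sum to 256. Under that restriction the accumulator stays below 65536 and fits a `ushort` lane; kernels with negative lobes would need a different scheme.

```csharp
using System;
using System.Numerics;

// Hypothetical illustration only: not an ImageSharp API.
public static class FixedPointSketch
{
    // rows[k][x] holds one 8-bit channel value (0..255) stored as ushort.
    // weightsQ8[k] is the Q8 weight of row k. The weights are assumed to be
    // non-negative and to sum to 256, so the accumulator never exceeds
    // 256 * 255 + 128 = 65408 and cannot overflow a ushort lane.
    public static void WeightedSum(ushort[][] rows, ushort[] weightsQ8, ushort[] destination)
    {
        int width = destination.Length;
        int lanes = Vector<ushort>.Count;
        var half = new Vector<ushort>((ushort)128);  // rounding term, 0.5 in Q8
        var scale = new Vector<ushort>((ushort)256); // 1.0 in Q8
        int x = 0;

        // Vectorized part: `lanes` output values per iteration.
        for (; x <= width - lanes; x += lanes)
        {
            Vector<ushort> acc = half;
            for (int k = 0; k < rows.Length; k++)
            {
                acc += new Vector<ushort>(rows[k], x) * new Vector<ushort>(weightsQ8[k]);
            }

            // Dividing by 256 converts from Q8 back to 0..255.
            (acc / scale).CopyTo(destination, x);
        }

        // Scalar tail for the remaining columns.
        for (; x < width; x++)
        {
            int acc = 128;
            for (int k = 0; k < rows.Length; k++)
            {
                acc += rows[k][x] * weightsQ8[k];
            }

            destination[x] = (ushort)(acc >> 8);
        }
    }
}
```

The division is used only to keep the sketch short and portable; a real implementation would presumably use a shift.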

As always, community help is welcome!

antonfirsov · 2018-10-10T23:49:34Z

ResizeProcessor's current CPU sampling profile:

antonfirsov · 2018-11-04T19:34:33Z

I'm closing this because it turned out that conversion to SOA buffers is so inefficient without shuffling intrinsics that the whole effort is not worthwhile for now.

We should replan this work once the experiments with .NET Core 3.0 and the new intrinsics have started.

I hope it's still possible to close the performance gap with optimizations like #768.

JimBobSquarePants · 2018-11-04T21:29:43Z

Agreed. Let's wait until then and reevaluate based on our current state and what tools we have available.

antonfirsov mentioned this issue Oct 10, 2018: Optimize memory consumption of ResizeProcessor #642 (Closed)

antonfirsov mentioned this issue Oct 20, 2018: Clean up and optimize byte<->float and Rgba32 <-> Vector4 conversion #742 (Merged)

JimBobSquarePants mentioned this issue Oct 24, 2018: Image resize performance when compared to other libraries #748 (Closed)

antonfirsov closed this as completed Nov 4, 2018
