Epic: ResizeProcessor performance improvements (Memory & CPU) #733

Closed · 8 tasks · 3 comments
antonfirsov (Member) commented Oct 10, 2018

I'm opening this issue to present and track my plan for the improvements we should implement to get a highly optimized ResizeProcessor for 1.0.

Goals

If we implement all the tasks, I expect that:

  • The memory usage of ResizeProcessor will drop dramatically (reduced by a factor of 20x-50x for typical images).
  • The single-threaded execution time of (non-companding) resize operations will be about 70% of System.Drawing's resize time, while our quality is better. (Current status: 120-130%.)

Tasks

  • (1) Implement new data types for SOA representation of Vector4 (or other 4-channel) buffers that could serve as SOA counterparts of the AOS types IMemoryOwner<Vector4>, Buffer2D<Vector4>, and Span<Vector4>. Something like class BufferOf4Channels<T>, class BufferOf4Channels2D<T>, and ref struct BufferSegmentOf4Channels<T> (see the layout sketch after this list).

  • (2) Implement bulk packing methods in PixelOperations<TPixel> for BufferSegmentOf4Channels<float>. Rgba32 packing/unpacking should be optimized the same way it's done for Span<Vector4> packing.

  • (3) Integrate the sRGB companding (Compress/Expand) operations into the APIs defined in (2).

  • (4) Optional CPU optimization. Optimize the implementation of (3) for Rgba32 using lookup tables (see the lookup-table sketch after this list).

  • (5) Replace all Vector4 AOS buffers with SOA counterparts in ResizeProcessor

  • (6) Memory optimization. Implement Optimize memory consumption of ResizeProcessor #642, preferably when (5) is fully implemented. Parallelization should be dropped.

  • (7) CPU optimization. Implement vectorized convolution in ResizeKernel, using Vector4 by default, and Vector<float> if AVX2 is detected (Vector<float>.Count == 8). (A floating-point accumulation sketch is shown under Outlook below.)

  • (8) Optional CPU optimization. Vectorized Premultiply and UnPremultiply (both Vector4 and AVX2 variants); see the alpha sketch after this list.
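
To make task (1) more concrete, below is a minimal sketch of what a planar (SOA) 4-channel buffer could look like, assuming one pooled allocation per channel. The class name is taken from the list above, but the internals are only an illustration, not a committed design:

```csharp
using System;
using System.Buffers;

// Hypothetical sketch for task (1): instead of one interleaved Vector4
// buffer (AOS), the four channels live in separate contiguous planes (SOA)
// so each channel can be processed sequentially and independently.
public sealed class BufferOf4Channels<T> : IDisposable
    where T : struct
{
    private readonly IMemoryOwner<T> x, y, z, w;

    public BufferOf4Channels(MemoryPool<T> pool, int length)
    {
        this.Length = length;
        this.x = pool.Rent(length);
        this.y = pool.Rent(length);
        this.z = pool.Rent(length);
        this.w = pool.Rent(length);
    }

    public int Length { get; }

    // Per-channel views: the SOA counterpart of a Span<Vector4> segment.
    public Span<T> X => this.x.Memory.Span.Slice(0, this.Length);
    public Span<T> Y => this.y.Memory.Span.Slice(0, this.Length);
    public Span<T> Z => this.z.Memory.Span.Slice(0, this.Length);
    public Span<T> W => this.w.Memory.Span.Slice(0, this.Length);

    public void Dispose()
    {
        this.x.Dispose();
        this.y.Dispose();
        this.z.Dispose();
        this.w.Dispose();
    }
}
```

Keeping each channel contiguous would let the convolution in (7) stream over a single float plane instead of striding through interleaved Vector4 data.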
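
For task (4), the lookup-table idea boils down to precomputing the sRGB expansion for all 256 possible Rgba32 channel values, so the hot path never calls Math.Pow. The snippet below is a hedged illustration with hypothetical names, not the existing SRgbCompanding code:

```csharp
using System;

// Illustrative lookup table for task (4): an Rgba32 channel can only take
// 256 values, so sRGB "expand" (gamma -> linear) becomes an array lookup.
internal static class SrgbLutSketch
{
    private static readonly float[] ExpandTable = BuildExpandTable();

    // Usage (hypothetical): float linearR = SrgbLutSketch.Expand(rgba.R);
    public static float Expand(byte value) => ExpandTable[value];

    private static float[] BuildExpandTable()
    {
        var table = new float[256];
        for (int i = 0; i < 256; i++)
        {
            float s = i / 255F;

            // Standard piecewise sRGB EOTF.
            table[i] = s <= 0.04045F
                ? s / 12.92F
                : (float)Math.Pow((s + 0.055) / 1.055, 2.4);
        }

        return table;
    }
}
```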
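
Task (8) is mostly a matter of lifting the existing per-pixel alpha math into bulk, vectorized form. As a baseline, the per-pixel version looks roughly like this (illustrative names, not the current PixelOperations code):

```csharp
using System.Numerics;

// Sketch for task (8): per-pixel Premultiply/UnPremultiply on Vector4.
// A bulk variant would apply the same math over whole spans, and an AVX2
// path could process two pixels per 256-bit register.
internal static class AlphaSketch
{
    public static Vector4 Premultiply(Vector4 source)
    {
        float a = source.W;
        Vector4 result = source * a; // scale all channels by alpha
        result.W = a;                // keep the original alpha
        return result;
    }

    public static Vector4 UnPremultiply(Vector4 source)
    {
        float a = source.W;
        if (a == 0)
        {
            return source;           // avoid dividing by zero for fully transparent pixels
        }

        Vector4 result = source / a;
        result.W = a;
        return result;
    }
}
```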

Outlook

If we manage to implement these points, the remaining bottleneck would be the Rgba32 <-> 4 x float unpacking/repacking operation. If we can optimize that too, we can reach even better performance. Update: Done in #742.

Alternatively, we can try implementing fixed-point math using Vector<ushort>.
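
To put task (7) and this alternative side by side, here is a rough sketch of the floating-point accumulation path using Vector<float>; the Vector<ushort> fixed-point variant would replace the float multiply-adds with widened integer arithmetic. The method name and shape are illustrative, not the actual ResizeKernel API:

```csharp
using System;
using System.Numerics;

// Sketch for task (7): one destination sample is the dot product of the
// kernel weights with a window of source values. Vector<float> processes
// Vector<float>.Count values per step (8 on AVX2-capable hardware).
internal static class ConvolutionSketch
{
    // Assumes window.Length >= weights.Length.
    public static float Convolve(ReadOnlySpan<float> window, ReadOnlySpan<float> weights)
    {
        int width = Vector<float>.Count;
        var acc = Vector<float>.Zero;
        int i = 0;

        // Vectorized multiply-add over full vector-width chunks.
        for (; i <= weights.Length - width; i += width)
        {
            var w = new Vector<float>(weights.Slice(i, width));
            var s = new Vector<float>(window.Slice(i, width));
            acc += w * s;
        }

        // Horizontal sum of the accumulator.
        float result = Vector.Dot(acc, Vector<float>.One);

        // Scalar tail for kernel lengths that are not a multiple of the vector width.
        for (; i < weights.Length; i++)
        {
            result += weights[i] * window[i];
        }

        return result;
    }
}
```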

As always, community help is welcome!

antonfirsov (Member, Author) commented

ResizeProcessor's current CPU sampling profile:

[image: CPU sampling profile of ResizeProcessor]

antonfirsov (Member, Author) commented

I'm closing this because it turned out that conversion to SOA buffers is so ineffective without shuffling intrinsics that it makes the whole approach not worthwhile for now.

We should revisit this plan when the experiments with .NET Core 3.0 and the new hardware intrinsics begin.

I hope it's still possible to close the performance gap with optimizations like #768.

JimBobSquarePants (Member) commented

Agreed. Let's wait until then and re-evaluate based on our current state and what tools we have available.
