I'm opening this issue to present and track my plan regarding the improvements we should implement in order to have a highly optimized ResizeProcessor for 1.0.
Goals
If we implement all the tasks, I expect that:
The memory usage of ResizeProcessor will drop dramatically (by a factor of 20x to 50x for typical images).
The single-threaded execution time of (non-companding) resize operations will be about 70% of System.Drawing's resize time. (While our quality is better.) (Current status: 120-130%)
Tasks
(1) Implement new data types for SOA representation of Vector4 (or other 4 channel) buffers that could be used as SOA counterparts of AOS types IMemoryOwner<Vector4>, Buffer2D<Vector4>, and Span<Vector4>. Something like class BufferOf4Channels<T>, class BufferOf4Channels2D<T> and ref struct BufferSegmentOf4Channels<T>.
(2) Implement bulk packing methods in PixelOperations<TPixel> for the BufferSegmentOf4Channels<float>. Rgba32 packing/unpacking should be optimized the same way it's done for Span<Vector4> packing.
(3) Integrate the SRgb companding (Compress/Expand) operations into the APIs defined in (2).
(4) Optional CPU optimization. Optimize the implementation of (3) for Rgba32, using lookup tables.
(5) Replace all Vector4 AOS buffers with SOA counterparts in ResizeProcessor
(7) CPU optimization. Implement vectorized convolution in ResizeKernel, using Vector4 by default, and Vector<float> if AVX2 is detected (Vector<float>.Count == 8)
(8) Optional CPU optimization. Vectorized Premultiply and UnPremultiply (both Vector4 and AVX2 variants)
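To make tasks (1) and (7) more concrete, here is a minimal sketch of what an SOA buffer and a vectorized kernel sum over one channel plane could look like. This is only an illustration under assumptions: the non-generic `BufferOf4Channels` class, its `X`/`Y`/`Z`/`W` members, and `SoaConvolution.Convolve` are hypothetical names, not existing ImageSharp APIs, and the sketch uses whatever `Vector<float>` width the runtime reports instead of switching between `Vector4` and the AVX2 path.

```csharp
using System;
using System.Numerics;

// Hypothetical SOA container: four separate channel planes
// instead of a single array of Vector4 (AOS).
public sealed class BufferOf4Channels
{
    public float[] X { get; }
    public float[] Y { get; }
    public float[] Z { get; }
    public float[] W { get; }
    public int Length { get; }

    public BufferOf4Channels(int length)
    {
        Length = length;
        X = new float[length];
        Y = new float[length];
        Z = new float[length];
        W = new float[length];
    }
}

public static class SoaConvolution
{
    // Weighted sum of weights.Length consecutive values of one channel plane,
    // starting at index 'start'. Hypothetical helper, not ImageSharp's ResizeKernel.
    public static float Convolve(float[] channel, int start, float[] weights)
    {
        if (start < 0 || start + weights.Length > channel.Length)
        {
            throw new ArgumentOutOfRangeException(nameof(start));
        }

        int n = weights.Length;
        int width = Vector<float>.Count; // 8 when 256-bit vectors are available
        Vector<float> acc = Vector<float>.Zero;
        int i = 0;

        // Process full vector-width blocks.
        for (; i <= n - width; i += width)
        {
            var c = new Vector<float>(channel, start + i);
            var w = new Vector<float>(weights, i);
            acc += c * w;
        }

        // Horizontal sum of the accumulator lanes.
        float sum = Vector.Dot(acc, Vector<float>.One);

        // Scalar tail for the remaining elements.
        for (; i < n; i++)
        {
            sum += channel[start + i] * weights[i];
        }

        return sum;
    }
}
```

With this layout each channel plane is contiguous in memory, so the same `Convolve` call can be run on `X`, `Y`, `Z`, and `W` in turn with the same weights.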
Outlook
If we manage to implement these points, the bottleneck would be the Rgba32 <-> 4 x float unpacking/repacking operation. If we can optimize it, we can reach even better performance. Update: Done in #742.
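For reference, the unpacking/repacking step mentioned above can be sketched in scalar form as follows. This is an unoptimized illustration only: `PackedRgba` and `SoaPacking` are hypothetical stand-ins defined here so the example is self-contained, not the actual `Rgba32` or `PixelOperations<TPixel>` types, and it does not reflect what was done in #742.

```csharp
using System;

// Hypothetical stand-in for a packed 8-bit-per-channel pixel.
public struct PackedRgba
{
    public byte R, G, B, A;

    public PackedRgba(byte r, byte g, byte b, byte a)
    {
        R = r; G = g; B = b; A = a;
    }
}

public static class SoaPacking
{
    // Unpack packed pixels into four float planes scaled to [0, 1].
    public static void Unpack(PackedRgba[] source, float[] r, float[] g, float[] b, float[] a)
    {
        const float Scale = 1f / 255f;
        for (int i = 0; i < source.Length; i++)
        {
            r[i] = source[i].R * Scale;
            g[i] = source[i].G * Scale;
            b[i] = source[i].B * Scale;
            a[i] = source[i].A * Scale;
        }
    }

    // Repack four float planes into packed pixels, clamping and rounding.
    public static void Pack(float[] r, float[] g, float[] b, float[] a, PackedRgba[] destination)
    {
        for (int i = 0; i < destination.Length; i++)
        {
            destination[i] = new PackedRgba(ToByte(r[i]), ToByte(g[i]), ToByte(b[i]), ToByte(a[i]));
        }
    }

    private static byte ToByte(float value)
    {
        return (byte)MathF.Round(Math.Clamp(value, 0f, 1f) * 255f);
    }
}
```

Every pixel goes through this conversion twice per resize (once in, once out), which is why it becomes the dominant cost once the convolution itself is vectorized.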
Alternatively, we can try implementing fixed-point math using Vector<ushort>.
As always, community help is welcome!
I'm closing this because it turned out that conversion to SOA buffers is so inefficient without shuffling intrinsics that the whole plan is not worth pursuing for now.
We should replan this work when the experiments with .NET Core 3.0 and the new intrinsics are started.
I hope it's still possible to close the performance gap with optimizations like #768.
(6) Memory optimization. Implement #642 ("Optimize memory consumption of ResizeProcessor"), preferably when (5) is fully implemented. Parallelization should be dropped.