Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvements around fixed #2418

Merged
merged 9 commits into from
Mar 30, 2023
Merged

Conversation

gfoidl
Copy link
Contributor

@gfoidl gfoidl commented Mar 26, 2023

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

  • avoid length-check in pinning spans
  • don't use a dummy fixed sized buffer to set the size of the Block8x8 struct
  • in Block8x8F.Load avoid unnecessary copying of data -- read it directly from the source

gfoidl added 3 commits March 26, 2023 15:10
Only where it seems profitable. E.g. not when a UnmanagedMemoryStream is constructed of that pointer.
Copy link
Contributor Author

@gfoidl gfoidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notes for review.

@@ -167,7 +167,7 @@ public static float Compress(float channel)
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static unsafe void CompandAvx2(Span<Vector4> vectors, float[] table)
{
fixed (float* tablePointer = &table[0])
fixed (float* tablePointer = &MemoryMarshal.GetArrayDataReference(table))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This avoids the bound-checks.
See sharplab for example.

@@ -644,7 +644,7 @@ public static class HwIntrinsics
ReadOnlySpan<byte> source,
Span<float> dest)
{
fixed (byte* sourceBase = source)
fixed (byte* sourceBase = &MemoryMarshal.GetReference(source))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This avoids the length-check.

fixed (byte* sourceBase = source)

is compiled to (via pattern matching)

fixed (byte* ptr = &span.GetPinnableReference())

and Span<T>.GetPinnableReference does a length-check which we don't need here, so we can omit it.

See sharplab for example.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do such (out-of-hot-loop) length checks have a measurable impact on perf? The new code feels bloated, I'm not in favor of such changes if there are no tangible gains.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this method in HwIntrinsics I don't think it's really measurable in (micro-) benchmarks. But it reduces the machine code a bit and the length-check is just not needed at all here.

I'm more on the eager side to squeeze out unnecessary instructions.
But when you want me this change to revert I'm fine with that too.

The new code feels bloated

The C# code is bloated. That's the obvious downside.

Copy link
Member

@antonfirsov antonfirsov Mar 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We typically call these methods for entire image scanlines (->100's or 1000's of elements in the span). I would assume having an extra bounds check before the loop negligible, unless there are benchmarks proving it otherwise.

The C# code is bloated.

When you turn it into semi-C, yes. But you don't always have to; we need to make informed, individual decisions regarding perf vs simplicity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK - I'll revert.

@@ -15,25 +15,14 @@ namespace SixLabors.ImageSharp.Formats.Jpeg.Components;
/// 8x8 matrix of <see cref="short"/> coefficients.
/// </summary>
// ReSharper disable once InconsistentNaming
[StructLayout(LayoutKind.Explicit)]
internal unsafe partial struct Block8x8
[StructLayout(LayoutKind.Explicit, Size = 2 * Size)]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unused fixed sized buffer to specify the size of the struct isn't neede, as the size can be set directly.

Codegen for creating the struct is also a bit better -- only zeroing the memory, instead of stack-cookie + zeroing.
Though that won't matter much -- the reduction of C# code is the motivation for this change.

Comment on lines -106 to -107
Block8x8F result = default;
result.LoadFrom(data);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going via instance method needs to copy two times, which can be avoided.

In sharplab it's hard to see (no method names, just the address of the call), so here from the JIT directly:

JIT assembly
; Assembly listing for method Foo:Default(System.Span`1[float]):Block8x8F
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 4 single block inlinees; 0 inlinees without PGO data

G_M000_IG01:                ;; offset=0000H
       56                   push     rsi
       4881EC00010000       sub      rsp, 256
       488BF1               mov      rsi, rcx

G_M000_IG02:                ;; offset=000BH
       488B12               mov      rdx, bword ptr [rdx]
       488D0C24             lea      rcx, bword ptr [rsp]
       41B800010000         mov      r8d, 256
       E80330AF5F           call     CORINFO_HELP_MEMCPY
       488BCE               mov      rcx, rsi
       488D1424             lea      rdx, bword ptr [rsp]
       41B800010000         mov      r8d, 256
       E8F12FAF5F           call     CORINFO_HELP_MEMCPY
       488BC6               mov      rax, rsi

G_M000_IG03:                ;; offset=0032H
       4881C400010000       add      rsp, 256
       5E                   pop      rsi
       C3                   ret

; Total bytes of code 59

; Assembly listing for method Foo:PR(System.Span`1[float]):Block8x8F_PR
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 3 single block inlinees; 0 inlinees without PGO data

G_M000_IG01:                ;; offset=0000H
       56                   push     rsi
       488BF1               mov      rsi, rcx

G_M000_IG02:                ;; offset=0004H
       488B12               mov      rdx, bword ptr [rdx]
       488BCE               mov      rcx, rsi
       41B800010000         mov      r8d, 256
       E80B2FAF5F           call     CORINFO_HELP_MEMCPY
       488BC6               mov      rax, rsi

G_M000_IG03:                ;; offset=0018H
       5E                   pop      rsi
       C3                   ret

; Total bytes of code 26

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering: is it the same gain when you compare use cases on call sites, ie.

Block8x8F b = Block8x8F.Load(data)`;

after the PR change vs. typical usage on main:

Block8x8F b = default;
b.LoadFrom(data);

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends what's done with the instance of Block8x8F. In this example Default2 shows no difference, as the JIT is smart enough to figure out what's going on. Namely: load the data onto the stack, then operate with the instance method of that type and use data on the stack.
But as soon the Block8x8F is passed around a copy needs to be made. Thus for me the instance LoadFrom has the potential for a perf-trap, and should be removed.

Besides that from an API point of view having an instance method to load (or better "fill") the block with data is a bit strange for an value-type*. Here a static method seems the right fit, like int.Parse, Vector128.Load, etc.
As it's an internal type this change seems fine from this view.

* for reference types this is pretty much OK

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides that from an API point of view having an instance method to load (or better "fill") the block with data is a bit strange

Note that Block8x8F is really huge. If I remember correctly, in early Jpeg code there were cases when existing on-stack blocks were being reused by repopulating it. If it's no longer the case in the current code, I'm happy with the change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checked it -- all usages are safe. Block8x8F is either passed by reference, and only copied where needed (e.g. constructors).

existing on-stack blocks were being reused by repopulating

When I understand it right, then this is a simple example of it.

Both variants work and produce the same machine-code.
(It's because the compilers hoists the loop-local block variable, and as it's a value-type the stack space needs to be reserved anyway).
Even if there are two blocks read then it's the same machine code.

I'll see if the C# code here can be simplified.

@@ -25,7 +25,7 @@ public static unsafe bool CheckNonOpaque(Span<Bgra32> row)
ReadOnlySpan<byte> rowBytes = MemoryMarshal.AsBytes(row);
int i = 0;
int length = (row.Length * 4) - 3;
fixed (byte* src = rowBytes)
fixed (byte* src = &MemoryMarshal.GetReference(rowBytes))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ups, I've forgotten this one. Couldn't see them in the local diffs.

@JimBobSquarePants
Copy link
Member

Is this ready for another review?

@gfoidl
Copy link
Contributor Author

gfoidl commented Mar 29, 2023

Is this ready for another review?

Yes, please.

Copy link
Member

@JimBobSquarePants JimBobSquarePants left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks!

@JimBobSquarePants JimBobSquarePants merged commit b6736b0 into SixLabors:main Mar 30, 2023
@gfoidl gfoidl deleted the fixed-improvements branch March 30, 2023 09:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants