Neon #78

Merged: 22 commits merged into ejmahler:master on Oct 18, 2021

Conversation

@HEnquist (Contributor) commented Sep 24, 2021

Now that all the needed intrinsics are available, it's time for some Neon!
This is basically a direct translation of the SSE code.
Running on a Cortex-A72 (a Raspberry Pi 4), I get a speedup of about 50% for f32 and none for f64. The Neon unit of the A72 can only execute a single 128-bit operation at a time, but it can do two f64 operations in parallel, meaning there isn't really any advantage to Neon for f64 here. More advanced cores should do better.
To build this, you need a compiler that has this merged: rust-lang/rust#89145
The reason is explained here: rust-lang/stdarch#1220
Once the latest nightly can be used, I'll add a CI job.
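
For readers unfamiliar with the intrinsics involved, here is a minimal illustrative sketch (not the actual RustFFT code; the function names are made up) of how directly an SSE operation translates to its Neon counterpart:

```rust
// Illustrative sketch only, not the actual RustFFT code: the same
// "add two packed Complex<f32> values" operation written with SSE
// intrinsics and with their Neon counterparts.
#[cfg(target_arch = "x86_64")]
unsafe fn add_two_complex_sse(a: *const f32, b: *const f32, out: *mut f32) {
    use core::arch::x86_64::*;
    // One 128-bit register holds two interleaved Complex<f32> values.
    let va = _mm_loadu_ps(a);
    let vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}

#[cfg(target_arch = "aarch64")]
unsafe fn add_two_complex_neon(a: *const f32, b: *const f32, out: *mut f32) {
    use core::arch::aarch64::*;
    // The Neon version is close to a one-to-one translation:
    // 128-bit unaligned load, vector add, 128-bit store.
    let va = vld1q_f32(a);
    let vb = vld1q_f32(b);
    vst1q_f32(out, vaddq_f32(va, vb));
}
```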

@HEnquist (Contributor, Author)

[Benchmark chart "neon_p2comp": Neon compared to the scalar version, on a Raspberry Pi 4.]

@ejmahler (Owner) commented Sep 25, 2021 via email

@HEnquist (Contributor, Author)

Yes, I would also like to see how this performs on a more powerful CPU. I'm hoping to find something I can borrow, for example a Mac with the M1 chip.

The need for nightly to use Neon will probably remain for some time. There is a tracking issue here: rust-lang/rust#48556

The assembly contains the expected instructions, but I haven't compared it to the SSE version. I'll do that and add the results here!

@HEnquist (Contributor, Author)

I tried with the float32 4-point butterfly.

SSE:

_ZN7rustfft3sse15sse_butterflies25SseF32Butterfly4$LT$T$GT$22perform_fft_contiguous17h49da046bbff55ab4E:
	.cfi_startproc
	movups	(%rsi), %xmm0
	movups	16(%rsi), %xmm1
	movaps	%xmm0, %xmm2
	addps	%xmm1, %xmm2
	subps	%xmm1, %xmm0
	shufps	$180, %xmm0, %xmm0
	xorps	(%rdi), %xmm0
	movaps	%xmm2, %xmm1
	movlhps	%xmm0, %xmm1
	movhlps	%xmm2, %xmm0
	movaps	%xmm1, %xmm2
	addps	%xmm0, %xmm2
	subps	%xmm0, %xmm1
	movups	%xmm2, (%rcx)
	movups	%xmm1, 16(%rcx)
	retq
.Lfunc_end139:

Neon:

_ZN7rustfft4neon16neon_butterflies26NeonF32Butterfly4$LT$T$GT$22perform_fft_contiguous17h955e89a7bbfe2845E:
	.cfi_startproc
	ldp	q0, q1, [x1]
	ldr	d2, [x0, #16]
	fadd	v3.4s, v0.4s, v1.4s
	fsub	v0.4s, v0.4s, v1.4s
	ext	v1.16b, v0.16b, v0.16b, #8
	rev64	v1.2s, v1.2s
	eor	v1.8b, v1.8b, v2.8b
	mov	v2.16b, v3.16b
	mov	v2.d[1], v0.d[0]
	mov	v0.d[1], v1.d[0]
	ext	v0.16b, v0.16b, v3.16b, #8
	ext	v0.16b, v3.16b, v0.16b, #8
	fadd	v1.4s, v2.4s, v0.4s
	fsub	v0.4s, v2.4s, v0.4s
	stp	q1, q0, [x3]
	ret
.Lfunc_end178:

And just for fun, the scalar x86_64:

_ZN7rustfft9algorithm11butterflies19Butterfly4$LT$T$GT$22perform_fft_contiguous17h6c658b75f901ae2cE:
	.cfi_startproc
	movsd	(%rsi), %xmm2
	movsd	8(%rsi), %xmm1
	movsd	16(%rsi), %xmm3
	movsd	24(%rsi), %xmm4
	movaps	%xmm2, %xmm0
	addps	%xmm3, %xmm0
	subps	%xmm3, %xmm2
	movlhps	%xmm2, %xmm0
	movaps	%xmm1, %xmm2
	addps	%xmm4, %xmm2
	subps	%xmm4, %xmm1
	movaps	%xmm1, %xmm3
	shufps	$85, %xmm1, %xmm3
	movaps	.LCPI94_0(%rip), %xmm4
	testb	%dil, %dil
	je	.LBB94_1
	xorps	%xmm4, %xmm3
	jmp	.LBB94_3
.LBB94_1:
	xorps	%xmm4, %xmm1
.LBB94_3:
	movlhps	%xmm3, %xmm1
	shufps	$36, %xmm1, %xmm2
	movaps	%xmm0, %xmm1
	addps	%xmm2, %xmm1
	subps	%xmm2, %xmm0
	movups	%xmm1, (%rcx)
	movups	%xmm0, 16(%rcx)
	retq
.Lfunc_end94:

@HEnquist (Contributor, Author)

I realized I could run this on a trial Amazon EC2 VM, with the Graviton2 CPU. It's supposedly based on the Cortex A76 core, which is a big upgrade from the A72. It does indeed perform quite a bit better:
[Benchmark chart "rustfft_comp_aws": benchmark results on the Graviton2 instance.]

@ejmahler (Owner) commented Sep 27, 2021 via email

@HEnquist (Contributor, Author)

It seems to work the same on ARM: it exits with a SIGILL when I tell rustc to enable, for example, v8.2a on a CPU that only supports v8-a.
There is the is_aarch64_feature_detected macro to check features at runtime. The complex floating-point math needs the "fcma" feature: https://github.com/rust-lang/stdarch/blob/master/crates/std_detect/src/detect/arch/aarch64.rs#L138
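
A minimal sketch of the runtime check being described (the helper function is hypothetical, not part of this PR; the detection macro was nightly-only at the time of this conversation but has since been stabilized):

```rust
// Hypothetical helper, not part of this PR: runtime detection of the
// optional FCMLA/FCADD ("fcma") instructions on aarch64. Plain Neon is
// mandatory on aarch64 and needs no check.
#[cfg(target_arch = "aarch64")]
fn has_complex_math_instructions() -> bool {
    std::arch::is_aarch64_feature_detected!("fcma")
}
```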

@ejmahler (Owner)

Fascinating. Well, at any rate I wouldn't want to add required support for it if it's that new. Once it stabilizes, I can see doing something like Rader's algorithm and AVX2, where there's a fallback that doesn't require it.

@HEnquist (Contributor, Author)

The fixes were merged, so the normal nightly compiler can now be used. The current state is that the Neon code is completely disabled on anything that isn't aarch64, and the stable compiler can be used as usual. On aarch64 it is also disabled by default and compiles on stable. But when the neon feature is enabled, the Neon code is compiled in and requires the latest nightly compiler.

Would you be OK with releasing a version that has nightly-only code hidden behind a feature like that? I noticed there are quite a few crates that do that, for example rand: https://crates.io/crates/rand
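
A minimal sketch of the gating being described, assuming an illustrative module layout rather than the actual RustFFT source:

```rust
// Illustrative module layout, not the actual RustFFT source: the
// nightly-only Neon code is compiled only when targeting aarch64 AND the
// crate's "neon" cargo feature is enabled; all other configurations keep
// building on the stable compiler.
#[cfg(all(target_arch = "aarch64", feature = "neon"))]
mod neon {
    // nightly-only core::arch::aarch64 intrinsics would live here
}

fn pick_implementation() {
    #[cfg(all(target_arch = "aarch64", feature = "neon"))]
    {
        // prefer the Neon butterflies
    }
    #[cfg(not(all(target_arch = "aarch64", feature = "neon")))]
    {
        // fall back to the scalar (or SSE) code paths
    }
}
```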

@HEnquist (Contributor, Author)

It seems I was more than a little confused about the VCMUL/VCMLA etc. instructions (I blame the messy ARM documentation!).
I'm still a little confused, but I think that so far they have only been included as an optional extra in the Cortex-M55, which is meant for embedded applications. Not even the big fancy Cortex-X1 has them. So probably not something we should be waiting for! I should probably remove the comment about them.

@ejmahler (Owner) commented Sep 28, 2021 via email

@HEnquist (Contributor, Author) commented Oct 3, 2021

I saw that the Neon interleaved load and store instructions got added to the stdarch library. These are quite useful, and the first results were promising. But then I seem to have hit some bug in rustc, see rust-lang/stdarch#1227. I'm not sure if stdarch was the right place to file the issue; hopefully someone can give some advice.
I can run cargo test, check, etc. just fine, but the benches crash hard.

I'm assuming that these instructions will make it into the nightly Rust builds quite soon, but I have no idea how easy it will be to get the benches running again. I'm leaning towards not holding this PR back until all of that has been sorted out.

This is the branch that uses the new intrinsics:
https://github.com/HEnquist/RustFFT/tree/vldx
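
A minimal sketch of what the interleaved intrinsics make possible, using a hypothetical helper rather than code from the linked branch:

```rust
// Hypothetical helper, not the code in the linked branch: what
// vld2q_f32/vst2q_f32 provide for interleaved Complex<f32> data.
#[cfg(target_arch = "aarch64")]
unsafe fn scale_four_complex(buffer: *mut f32, gain: f32) {
    use core::arch::aarch64::*;
    // vld2q_f32 reads 8 consecutive f32 values and de-interleaves them:
    // field .0 holds the four real parts, field .1 the four imaginary parts.
    let parts: float32x4x2_t = vld2q_f32(buffer);
    let re = vmulq_n_f32(parts.0, gain);
    let im = vmulq_n_f32(parts.1, gain);
    // vst2q_f32 re-interleaves the two vectors and stores re,im,re,im,...
    vst2q_f32(buffer, float32x4x2_t(re, im));
}
```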

@ejmahler (Owner) left a comment

This looks great. I requested some minor changes, and once they're in I'd be happy to merge.

@ejmahler (Owner)

I updated the name of the feature.

I'm satisfied with this PR at this point. I noticed that it's still marked as draft. Do you think it's ready? If so, mark it ready and I'll merge. If you think there's still work to do, no rush.

The one remaining review item I have centers around the pattern of writing let input_packed = read_complex_to_array!(input, {0, 2, 4, 6, 8, 10});

I noticed that in other places in this PR, there's an intrinsic to load/store data in an interleaved way. Do you think that could be applied here? It doesn't need to be done as part of this PR, but it could be a future optimization.

@HEnquist marked this pull request as ready for review on October 18, 2021 at 19:38
@ejmahler merged commit 3dd012c into ejmahler:master on Oct 18, 2021
@HEnquist (Contributor, Author)

Very nice! Thanks for merging :)
The interleaved loading and storing intrinsics aren't included in the nightly Rust builds yet. That's probably a good thing, since using them triggers a bug that crashes rustc.
I have a nearly complete implementation using them. I was thinking of submitting it as a separate PR once that bug is fixed and the normal nightly releases can be used. The bench results I managed to get from it looked quite promising, but I won't know for sure until I can compile the benches without crashing.

@HEnquist (Contributor, Author)

Neon on aarch64 will be available in stable rustc from version 1.61!
That should be released in May, but it is already available in nightly.
Unfortunately the bug that crashes rustc when using the interleaved load/store instructions is still there, so implementing that will have to wait (I'm guessing it would give a 5-10% speedup).
Does this seem like a reasonable plan?

  • I submit a PR to rename neon-nightly to just neon as soon as I can, with the neon feature disabled by default.
  • This gets published to crates.io.
  • When rustc finally works with the interleaved load/store, I submit another PR implementing that.
  • I wait 6 months after that, then make a PR to enable neon by default.

@ejmahler (Owner)

I think that's a good plan. And we can just document that if you want to enable the neon feature, you need rustc 1.61 or newer.

I'm thinking about how to document and test this long-term, once we enable it by default. We could say "rustfft 6.2 requires rustc 1.6x if you're on aarch64 with the 'neon' feature enabled, or rustc 1.37 in all other configurations". Or would it be less confusing to just require 1.6x across the board? I don't know of any other features from recent Rust versions that I want to use, but from a user-experience perspective a single required version may be easier for people to wrap their heads around than multiple ones.

We'll need to update our testing script to specifically test rustc 1.61 stable with neon enabled.

@HEnquist (Contributor, Author)

If we want to keep the requirement at 1.37 for everything except aarch64+neon, we could add a simple build script that checks this. Something like this: https://github.com/HEnquist/camilladsp/blob/next100/build.rs
That makes it possible to give a clear error message instead of failing with a long list of strange compiler errors.
But there can still be some confusion, so just requiring 1.61 across the board might be the better way. Not an easy choice.
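
A minimal sketch of such a build script (illustrative, not the linked camilladsp script; the exact message and the rough version parsing are assumptions):

```rust
// build.rs sketch (illustrative, not the linked camilladsp script):
// give a readable error when the "neon" feature is enabled on a compiler
// that is too old, instead of a wall of unrelated compile errors.
use std::env;
use std::process::Command;

fn main() {
    // Cargo sets CARGO_FEATURE_<NAME> for every enabled cargo feature.
    if env::var_os("CARGO_FEATURE_NEON").is_none() {
        return; // neon not enabled, nothing to check
    }
    let rustc = env::var("RUSTC").unwrap_or_else(|_| "rustc".into());
    let output = Command::new(rustc)
        .arg("--version")
        .output()
        .expect("failed to run rustc");
    let version = String::from_utf8_lossy(&output.stdout);
    // Very rough parse: "rustc 1.61.0 (...)" -> the minor version number.
    let minor: u32 = version
        .split('.')
        .nth(1)
        .and_then(|s| s.parse().ok())
        .unwrap_or(0);
    if minor < 61 {
        panic!(
            "the `neon` feature requires rustc 1.61 or newer, found: {}",
            version.trim()
        );
    }
}
```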
