Neon #78

Merged: 22 commits merged into ejmahler:master on Oct 18, 2021

Conversation

@HEnquist (Contributor) commented Sep 24, 2021

Now that all the needed intrinsics are available, it's time for some Neon!
This is basically a direct translation of the SSE code.
Running on a Cortex-A72 (a Raspberry Pi 4), I get a speedup of about 50% for f32 and none for f64. The Neon unit of the A72 can only execute a single 128-bit operation at a time, but it can do two f64 operations in parallel, meaning there isn't really any advantage to Neon for f64 here. More advanced cores should do better.
To build this, you need a compiler that has this merged: rust-lang/rust#89145
The reason is explained here: rust-lang/stdarch#1220
Once the latest nightly can be used, I'll add a CI job.
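
For readers unfamiliar with the intrinsics involved, here is a minimal illustrative sketch (not the actual RustFFT code; the function names are made up) of how directly an SSE operation translates to its Neon counterpart:

```rust
// Illustrative sketch only, not the actual RustFFT code: the same
// "add two packed Complex<f32> values" operation written with SSE
// intrinsics and with their Neon counterparts.
#[cfg(target_arch = "x86_64")]
unsafe fn add_two_complex_sse(a: *const f32, b: *const f32, out: *mut f32) {
    use core::arch::x86_64::*;
    // One 128-bit register holds two interleaved Complex<f32> values.
    let va = _mm_loadu_ps(a);
    let vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}

#[cfg(target_arch = "aarch64")]
unsafe fn add_two_complex_neon(a: *const f32, b: *const f32, out: *mut f32) {
    use core::arch::aarch64::*;
    // The Neon version is close to a one-to-one translation:
    // 128-bit unaligned load, vector add, 128-bit store.
    let va = vld1q_f32(a);
    let vb = vld1q_f32(b);
    vst1q_f32(out, vaddq_f32(va, vb));
}
```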

@HEnquist (Contributor, Author)

[Benchmark chart "neon_p2comp": Neon compared to the scalar version, on a Raspberry Pi 4.]

@ejmahler (Owner) commented Sep 25, 2021 via email

@HEnquist (Contributor, Author)

Yes, I would also like to see how this performs on a more powerful CPU. I'm hoping to find something I can borrow, for example a Mac with the M1 chip.

The need for nightly to use Neon will probably remain for some time. There is a tracking issue here: rust-lang/rust#48556

The assembly contains the expected instructions, but I haven't compared it to the SSE version. I'll do that and add the results here!

@HEnquist (Contributor, Author)

I tried with the float32 4-point butterfly.

SSE:

_ZN7rustfft3sse15sse_butterflies25SseF32Butterfly4$LT$T$GT$22perform_fft_contiguous17h49da046bbff55ab4E:
	.cfi_startproc
	movups	(%rsi), %xmm0
	movups	16(%rsi), %xmm1
	movaps	%xmm0, %xmm2
	addps	%xmm1, %xmm2
	subps	%xmm1, %xmm0
	shufps	$180, %xmm0, %xmm0
	xorps	(%rdi), %xmm0
	movaps	%xmm2, %xmm1
	movlhps	%xmm0, %xmm1
	movhlps	%xmm2, %xmm0
	movaps	%xmm1, %xmm2
	addps	%xmm0, %xmm2
	subps	%xmm0, %xmm1
	movups	%xmm2, (%rcx)
	movups	%xmm1, 16(%rcx)
	retq
.Lfunc_end139:

Neon:

_ZN7rustfft4neon16neon_butterflies26NeonF32Butterfly4$LT$T$GT$22perform_fft_contiguous17h955e89a7bbfe2845E:
	.cfi_startproc
	ldp	q0, q1, [x1]
	ldr	d2, [x0, #16]
	fadd	v3.4s, v0.4s, v1.4s
	fsub	v0.4s, v0.4s, v1.4s
	ext	v1.16b, v0.16b, v0.16b, #8
	rev64	v1.2s, v1.2s
	eor	v1.8b, v1.8b, v2.8b
	mov	v2.16b, v3.16b
	mov	v2.d[1], v0.d[0]
	mov	v0.d[1], v1.d[0]
	ext	v0.16b, v0.16b, v3.16b, #8
	ext	v0.16b, v3.16b, v0.16b, #8
	fadd	v1.4s, v2.4s, v0.4s
	fsub	v0.4s, v2.4s, v0.4s
	stp	q1, q0, [x3]
	ret
.Lfunc_end178:

And just for fun, the scalar x86_64:

_ZN7rustfft9algorithm11butterflies19Butterfly4$LT$T$GT$22perform_fft_contiguous17h6c658b75f901ae2cE:
	.cfi_startproc
	movsd	(%rsi), %xmm2
	movsd	8(%rsi), %xmm1
	movsd	16(%rsi), %xmm3
	movsd	24(%rsi), %xmm4
	movaps	%xmm2, %xmm0
	addps	%xmm3, %xmm0
	subps	%xmm3, %xmm2
	movlhps	%xmm2, %xmm0
	movaps	%xmm1, %xmm2
	addps	%xmm4, %xmm2
	subps	%xmm4, %xmm1
	movaps	%xmm1, %xmm3
	shufps	$85, %xmm1, %xmm3
	movaps	.LCPI94_0(%rip), %xmm4
	testb	%dil, %dil
	je	.LBB94_1
	xorps	%xmm4, %xmm3
	jmp	.LBB94_3
.LBB94_1:
	xorps	%xmm4, %xmm1
.LBB94_3:
	movlhps	%xmm3, %xmm1
	shufps	$36, %xmm1, %xmm2
	movaps	%xmm0, %xmm1
	addps	%xmm2, %xmm1
	subps	%xmm2, %xmm0
	movups	%xmm1, (%rcx)
	movups	%xmm0, 16(%rcx)
	retq
.Lfunc_end94:

@HEnquist (Contributor, Author)

I realized I could run this on a trial Amazon EC2 VM, with the Graviton2 CPU. It's supposedly based on the Cortex A76 core, which is a big upgrade from the A72. It does indeed perform quite a bit better:
[Benchmark chart "rustfft_comp_aws": benchmark results on the Graviton2 instance.]

@ejmahler (Owner) commented Sep 27, 2021 via email

@HEnquist (Contributor, Author)

It seems to work the same on ARM: it exits with a SIGILL when I tell rustc to enable, for example, v8.2a on a CPU that only supports v8-a.
There is the is_aarch64_feature_detected macro to check features at runtime. The complex floating-point math needs the "fcma" feature: https://github.com/rust-lang/stdarch/blob/master/crates/std_detect/src/detect/arch/aarch64.rs#L138
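
A minimal sketch of the runtime check being described (the helper function is hypothetical, not part of this PR; the detection macro was nightly-only at the time of this conversation but has since been stabilized):

```rust
// Hypothetical helper, not part of this PR: runtime detection of the
// optional FCMLA/FCADD ("fcma") instructions on aarch64. Plain Neon is
// mandatory on aarch64 and needs no check.
#[cfg(target_arch = "aarch64")]
fn has_complex_math_instructions() -> bool {
    std::arch::is_aarch64_feature_detected!("fcma")
}
```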

@ejmahler (Owner)

Fascinating. Well, at any rate I wouldn't want to add required support for it if it's that new. Once it stabilizes, I can see doing something like Rader's algorithm and AVX2, where there's a fallback that doesn't require it.

@HEnquist (Contributor, Author)

The fixes were merged, so the normal nightly compiler can now be used. The current state is that the Neon code is completely disabled on anything that isn't aarch64, and the stable compiler can be used as usual. On aarch64 it is also disabled by default and compiles on stable. But when the neon feature is enabled, the Neon code is compiled in and requires the latest nightly compiler.

Would you be OK with releasing a version that has nightly-only code hidden behind a feature like that? I noticed there are quite a few crates that do that, for example rand: https://crates.io/crates/rand
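
A minimal sketch of the gating being described, assuming an illustrative module layout rather than the actual RustFFT source:

```rust
// Illustrative module layout, not the actual RustFFT source: the
// nightly-only Neon code is compiled only when targeting aarch64 AND the
// crate's "neon" cargo feature is enabled; all other configurations keep
// building on the stable compiler.
#[cfg(all(target_arch = "aarch64", feature = "neon"))]
mod neon {
    // nightly-only core::arch::aarch64 intrinsics would live here
}

fn pick_implementation() {
    #[cfg(all(target_arch = "aarch64", feature = "neon"))]
    {
        // prefer the Neon butterflies
    }
    #[cfg(not(all(target_arch = "aarch64", feature = "neon")))]
    {
        // fall back to the scalar (or SSE) code paths
    }
}
```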

@HEnquist (Contributor, Author)

It seems I was more than a little confused about the VCMUL/VCMLA etc. instructions (I blame the messy ARM documentation!).
I'm still a little confused, but I think that so far they have only been included as an optional extra in the Cortex-M55, which is meant for embedded applications. Not even the big fancy Cortex-X1 has them. So probably not something we should be waiting for! I should probably remove the comment about them.

@ejmahler (Owner) commented Sep 28, 2021 via email

@HEnquist (Contributor, Author) commented Oct 3, 2021

I saw that the Neon interleaved load and store instructions got added to the stdarch library. These are quite useful, and the first results were promising. But then I seem to have hit some bug in rustc, see rust-lang/stdarch#1227. I'm not sure if stdarch was the right place to file the issue; hopefully someone can give some advice.
I can run cargo test, check, etc. just fine, but the benches crash hard.

I'm assuming that these instructions will make it into the nightly Rust builds quite soon, but I have no idea how easy it will be to get the benches running again. I'm leaning towards not holding this PR back until all of that has been sorted out.

This is the branch that uses the new intrinsics:
https://github.com/HEnquist/RustFFT/tree/vldx
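
A minimal sketch of what the interleaved intrinsics make possible, using a hypothetical helper rather than code from the linked branch:

```rust
// Hypothetical helper, not the code in the linked branch: what
// vld2q_f32/vst2q_f32 provide for interleaved Complex<f32> data.
#[cfg(target_arch = "aarch64")]
unsafe fn scale_four_complex(buffer: *mut f32, gain: f32) {
    use core::arch::aarch64::*;
    // vld2q_f32 reads 8 consecutive f32 values and de-interleaves them:
    // field .0 holds the four real parts, field .1 the four imaginary parts.
    let parts: float32x4x2_t = vld2q_f32(buffer);
    let re = vmulq_n_f32(parts.0, gain);
    let im = vmulq_n_f32(parts.1, gain);
    // vst2q_f32 re-interleaves the two vectors and stores re,im,re,im,...
    vst2q_f32(buffer, float32x4x2_t(re, im));
}
```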

@ejmahler (Owner) left a comment

This looks great. I requested some minor changes, and once they're in I'd be happy to merge.

@ejmahler (Owner)

I updated the name of the feature.

I'm satisfied with this PR at this point. I noticed that it's still marked as draft. Do you think it's ready? If so, mark it ready and I'll merge. If you think there's still work to do, no rush.

The one remaining review item I have centers around the pattern of writing let input_packed = read_complex_to_array!(input, {0, 2, 4, 6, 8, 10});

I noticed that in other places in this PR, there's an intrinsic to load/store data in an interleaved way. Do you think that could be applied here? It doesn't need to be done as part of this PR, but it could be a future optimization.

@HEnquist marked this pull request as ready for review on October 18, 2021 at 19:38
@ejmahler merged commit 3dd012c into ejmahler:master on Oct 18, 2021
@HEnquist (Contributor, Author)

Very nice! Thanks for merging :)
The interleaved loading and storing intrinsics aren't included in the nightly Rust builds yet. That's probably a good thing, since using them triggers a bug that crashes rustc.
I have a nearly complete implementation using them. I was thinking of submitting it as a separate PR once that bug is fixed and the normal nightly releases can be used. The bench results I managed to get from it looked quite promising, but I won't know for sure until I can compile the benches without crashing.

@HEnquist (Contributor, Author)

Neon on aarch64 will be available in stable rustc from version 1.61!
That should be released in May, but it is already available in nightly.
Unfortunately the bug that crashes rustc when using the interleaved load/store instructions is still there, so implementing that will have to wait (I'm guessing it would give a 5-10% speedup).
Does this seem like a reasonable plan?

  • I submit a PR to rename neon-nightly to just neon as soon as I can, with the neon feature disabled by default.
  • This gets published to crates.io.
  • When rustc finally works with the interleaved load/store, I submit another PR implementing that.
  • I wait 6 months after that, then make a PR to enable neon by default.

@ejmahler (Owner)

I think that's a good plan. And we can just document that if you want to enable the neon feature, you need rustc 1.61 or newer.

I'm thinking about how to document and test this long-term, once we enable it by default. We could say "rustfft 6.2 requires rustc 1.6x if you're on aarch64 with the 'neon' feature enabled, or rustc 1.37 in all other configurations". Or would it be less confusing to just require 1.6x across the board? I don't know of any other features from recent Rust versions that I want to use, but from a user-experience perspective a single required version may be easier for people to wrap their heads around than multiple ones.

We'll need to update our testing script to specifically test rustc 1.61 stable with neon enabled.

@HEnquist (Contributor, Author)

If we want to keep the requirement at 1.37 for everything except aarch64+neon, we could add a simple build script that checks this. Something like this: https://github.com/HEnquist/camilladsp/blob/next100/build.rs
That makes it possible to give a clear error message instead of failing with a long list of strange compiler errors.
But there can still be some confusion, so just requiring 1.61 across the board might be the better way. Not an easy choice.
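
A minimal sketch of such a build script (illustrative, not the linked camilladsp script; the exact message and the rough version parsing are assumptions):

```rust
// build.rs sketch (illustrative, not the linked camilladsp script):
// give a readable error when the "neon" feature is enabled on a compiler
// that is too old, instead of a wall of unrelated compile errors.
use std::env;
use std::process::Command;

fn main() {
    // Cargo sets CARGO_FEATURE_<NAME> for every enabled cargo feature.
    if env::var_os("CARGO_FEATURE_NEON").is_none() {
        return; // neon not enabled, nothing to check
    }
    let rustc = env::var("RUSTC").unwrap_or_else(|_| "rustc".into());
    let output = Command::new(rustc)
        .arg("--version")
        .output()
        .expect("failed to run rustc");
    let version = String::from_utf8_lossy(&output.stdout);
    // Very rough parse: "rustc 1.61.0 (...)" -> the minor version number.
    let minor: u32 = version
        .split('.')
        .nth(1)
        .and_then(|s| s.parse().ok())
        .unwrap_or(0);
    if minor < 61 {
        panic!(
            "the `neon` feature requires rustc 1.61 or newer, found: {}",
            version.trim()
        );
    }
}
```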
