
ARM considerations for VirtualMachine #2771

Closed
Scooletz opened this issue Feb 9, 2021 · 1 comment

Comments

@Scooletz
Contributor

Scooletz commented Feb 9, 2021

I revisited the SIMD-optimized code for VirtualMachine and I failed to find a way to improve it greatly for ARM. A few observations that sum up my reasoning:

  1. the current condition is Vector<byte>.Count == 32, which fails on ARM, where vectors are 128 bits long (Count == 16)
  2. there are 4 bitwise VM opcodes, starting from https://ethervm.io/#16
  3. I tried to find another match in System.Runtime.Intrinsics.Arm but failed to find any

and this brought me to a stalemate. What I could do now is refactor VirtualMachine so that it supports Vector<byte>.Count == 16. This would require at least some testing or reviewing the generated code. I could use https://github.com/EgorBo/Disasmo for ARM. Unfortunately, due to dotnet/runtime#41518 it would fail, because:

However, certain JIT settings, such as the ISAs used, are largely dependent on the machine being run against and so you cannot currently check codegen for something like X86.Avx2 if your underlying hardware doesn't support it. Nor can you check for something like Arm.Dp from a x64 machine.

This would mean that without a physical ARM machine I could not inspect the generated code. I dug a bit to find out more about the intrinsics and their actual gains in the following resources:

  1. https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics?page=57
  2. https://static.docs.arm.com/swog307215/a/Arm_Cortex-A76_Software_Optimization_Guide.pdf
  3. https://docs.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics.arm.advsimd.and?view=net-5.0#System_Runtime_Intrinsi[…]Vector128_System_Byte__

Later I moved on to compare the latency and throughput of the vectorized and non-vectorized code for a specific bitwise case:

  • ASIMD logical (AND, BIC, EOR, MOV, MVN, ORN, ORR, NOT): Latency: 2, Throughput: 2, Pipeline: V
  • Logical, shift, no flagset (AND, BIC, EON, EOR, ORN, ORR): Latency: 1, Pipeline: I

where:

  • Latency is defined as the minimum latency seen by an operation dependent on an instruction in the described group.
  • Execution Throughput is defined as the maximum throughput (in instructions per cycle) of the specified instruction group that can be achieved across the entire Cortex-A76 microarchitecture.

Having these numbers, I can't see a clear win scenario for the simple bitwise operations. Currently they are unrolled as long (64-bit) operations, and moving them to vectors seems not to be beneficial at all.

@Scooletz
Contributor Author

Scooletz commented Feb 9, 2021

FYI, as per our conversation, @tkstanczak

@Scooletz Scooletz closed this as completed Feb 9, 2021