I revisited the SIMD-optimized code for `VirtualMachine` and failed to find a way to improve it significantly for ARM. A few observations that sum up my reasoning:
The current condition is `Vector<byte>.Count == 32`, which fails on ARM, where `Vector<byte>` is only 128 bits (16 bytes) wide.
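For illustration, here is a minimal sketch of what a width-aware fast path could look like. The `Word256.And` helper and its shape are my own assumptions for this example, not the actual `VirtualMachine` code:

```csharp
using System;
using System.Numerics;

static class Word256
{
    // Hypothetical helper: bitwise AND of two 32-byte (256-bit) words, in place.
    public static void And(Span<byte> a, ReadOnlySpan<byte> b)
    {
        if (Vector<byte>.Count == 32)
        {
            // x64 with AVX2: one 256-bit operation covers the whole word.
            (new Vector<byte>(a) & new Vector<byte>(b)).CopyTo(a);
        }
        else if (Vector<byte>.Count == 16)
        {
            // ARM NEON (128-bit vectors): process the word as two 128-bit halves.
            (new Vector<byte>(a) & new Vector<byte>(b)).CopyTo(a);
            (new Vector<byte>(a.Slice(16)) & new Vector<byte>(b.Slice(16))).CopyTo(a.Slice(16));
        }
        else
        {
            // Fallback: the existing unrolled 64-bit path (sketched at the end of this comment).
            throw new NotSupportedException("Unexpected vector width");
        }
    }
}
```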
That mismatch brought me to a stalemate. What I could do now is refactor `VirtualMachine` so that it also supports `Vector<byte>.Count == 16`. This would require at least some testing or inspection of the generated code. I could use https://github.com/EgorBo/Disasmo for ARM; unfortunately, due to dotnet/runtime#41518, it would fail because:
> However, certain JIT settings, such as the ISAs used, are largely dependent on the machine being run against and so you cannot currently check codegen for something like X86.Avx2 if your underlying hardware doesn't support it. Nor can you check for something like Arm.Dp from a x64 machine.
This means that without physical ARM hardware I could not look at the generated code. I dug a bit to find out more about intrinsics and their actual gains in the following:
Later I moved on to compare the output and the overhead of the vectorized and non-vectorized code for a specific bitwise case:
| Instruction group | Instructions | Latency | Execution Throughput | Pipelines |
| --- | --- | --- | --- | --- |
| ASIMD logical | AND, BIC, EOR, MOV, MVN, ORN, ORR, NOT | 2 | 2 | V |
| Logical, shift, no flagset | AND, BIC, EON, EOR, ORN, ORR | 1 | | I |
where:

- **Latency** is defined as the minimum latency seen by an operation dependent on an instruction in the described group.
- **Execution Throughput** is defined as the maximum throughput (in instructions per cycle) of the specified instruction group that can be achieved in the entirety of the Cortex-A76 microarchitecture.
Having these numbers, I can't see a clear win scenario for simple bitwise operations. Currently they are unrolled as `long` operations, and moving them to vectors does not seem beneficial at all.
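To make the comparison concrete, here is a rough sketch of the scalar side (my own illustration, assuming `ulong`-based unrolling rather than the exact existing code): four independent 64-bit ANDs, which on Cortex-A76 are latency-1 instructions on the integer (I) pipelines and need no moves through SIMD registers, whereas the NEON version above costs two latency-2 ASIMD ANDs plus the vector loads/stores around them:

```csharp
using System;
using System.Runtime.InteropServices;

static class Word256Scalar
{
    // Hypothetical scalar fallback: a 32-byte AND done as four unrolled 64-bit ANDs.
    // The four operations are independent of each other, so they can largely issue
    // in parallel on the Cortex-A76 integer pipelines.
    public static void And(Span<byte> a, ReadOnlySpan<byte> b)
    {
        Span<ulong> x = MemoryMarshal.Cast<byte, ulong>(a);
        ReadOnlySpan<ulong> y = MemoryMarshal.Cast<byte, ulong>(b);
        x[0] &= y[0];
        x[1] &= y[1];
        x[2] &= y[2];
        x[3] &= y[3];
    }
}
```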