Move simd out of unstable #12524
Conversation
Are these Travis failures legit? Seems weird...
@cmr Thank you very much! SIMD register width has doubled (128-bit vs 256-bit) with the new AVX-capable CPUs, so I guess there will be a need to create new data types. Is it possible to have a self-expanding data type, independent of CPU capabilities (SSE vs AVX), at the language level too? This would avoid a lot of boilerplate code on the application side and let applications benefit from new SIMD hardware capabilities with a simple recompilation for the target hardware. Here is every programmer's dream (warning: C code incoming!):

/* f32exp_t : Expandable data type definition at language level */
#if defined(AVX_CAPABLE)
#include <immintrin.h>
typedef __m256 f32exp_t;
#define F32EXP_WIDTH 8
#elif defined(SSE_CAPABLE)
#include <xmmintrin.h>
typedef __m128 f32exp_t;
#define F32EXP_WIDTH 4
#elif defined(XXX_CAPABLE)
/* Other CPU arch SIMD type with smaller width */
#else
/* Old/unknown hardware fallback */
typedef float f32exp_t;
#define F32EXP_WIDTH 1
#endif
/* SIMD operator overloads */
/* Arithmetic addition operator */
/* If AVX_CAPABLE -> _mm256_add_ps -> vaddps */
/* If SSE_CAPABLE -> _mm_add_ps -> addps */
/* If NONE_CAPABLE -> usual add operator */
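/* One way the "+" overload sketched above might be spelled in plain C
   (which has no operator overloading). A sketch only, assuming the
   hypothetical AVX_CAPABLE/SSE_CAPABLE definitions above. */
static inline f32exp_t f32exp_add(f32exp_t a, f32exp_t b)
{
#if defined(AVX_CAPABLE)
    return _mm256_add_ps(a, b);   /* vaddps */
#elif defined(SSE_CAPABLE)
    return _mm_add_ps(a, b);      /* addps */
#else
    return a + b;                 /* plain scalar add */
#endif
}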
/* f32exp_t : Expandable data type at language level */
/* Now back to reality at application source code level */
void init(f32exp_t *input)
{
for (int i=0; i<F32EXP_WIDTH; i++)
{
/* f32exp init code */
}
}
void process(f32exp_t *input1, f32exp_t *input2, f32exp_t *output, int arraySize)
{
for (int i=0; i<arraySize; i++)
{
/* Smart lang/compiler using suitable SIMD arithmetic operator overload */
output[i] = input1[i] + input2[i];
}
}

Woot, longest comment in my life, thanks for reading ;P
@siavashserver Yes, constantly improving hardware is a problem. I don't think the right solution is to keep making more and more types; @cartazio had some opinions on this. I think we're going to end up using a very different API for SIMD, but maybe not. As it stands, though, we could indeed make a type that is "the widest vector type" for the platform. I'm not sure it'd be super useful, especially in public APIs (feature flags shouldn't affect ABI, I don't think).
@cmr Absolutely, it's just going to raise a hell of a lot of data types in the future (especially with the upcoming AVX-512). I like the widest-vector-type idea: applications would then only need to know about the widest available f32/f64/int vector type at compile time and pack their data into SIMD vectors accordingly, as in the sketch below. Nobody really wants to miss the extra horsepower of newer hardware by limiting themselves to fixed vector sizes. I will be glad to find out more about @cartazio's opinion on this too.
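Packing against a width macro like the hypothetical F32EXP_WIDTH from the earlier comment might look like the following sketch; it reuses the (equally hypothetical) f32exp_add helper and falls back to scalar code for sizes that aren't a multiple of the vector width.

/* A sketch of width-oblivious packing, reusing the hypothetical f32exp_t,
   F32EXP_WIDTH, and f32exp_add definitions from the comment above. */
#include <stddef.h>
#include <string.h>

void add_arrays(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
    for (; i + F32EXP_WIDTH <= n; i += F32EXP_WIDTH) {
        f32exp_t va, vb, vo;
        memcpy(&va, a + i, sizeof va);   /* unaligned-safe vector load */
        memcpy(&vb, b + i, sizeof vb);
        vo = f32exp_add(va, vb);
        memcpy(out + i, &vo, sizeof vo);
    }
    for (; i < n; i++)                   /* scalar remainder */
        out[i] = a[i] + b[i];
}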
Thanks for pinging me! Bit buried this week, but some quick thoughts; ping me early next week if you want me to elaborate.

Automatic vectorization is completely unrelated to manual SIMD. Trying to mix the two will probably make automatic vectorization less effective.

Having a magic "largest target SIMD size" is a bad idea. Why? Because any serious/interesting use of SIMD is all about the shuffles! And here's a fun fact: depending on the width of the SIMD, you have very, very different shuffles. I also don't know offhand of any nice "size-oblivious" way of describing those shuffles. (Seriously, the cost model for how to even do this is pretty non-uniform between CPU generations.) If you want to be "future proof", unroll your inner loops so you have like 16-32-64 wee read-and-update operations, and pray LLVM vectorization can do something "ok" with them.

Either you're using autovectorization in the compiler tooling, or you're using explicit SIMD. In the latter case, you write a different version for every CPU micro-arch variant to leverage the variations in load/store/compute latency for instructions, and you actually dispatch on which micro-arch you're running on at runtime (because this is like 5 instructions out of 50-300 SIMD ops, so it's really trivial overhead). And I've not seen this in any compiler as an automatic thing.

Oh, and that's even ignoring alignment. Well-written SIMD code actually has different cases for when the data is aligned or not, and the amount of alignment depends on the SIMD size AND the CPU micro-arch variant (which I've not seen supported). Depending on the use case, it's worthwhile to write a whole bunch of variants that handle misaligned reads and writes (even though you now have to write potentially 6 or more different variants!). I've not seen any automatic schemes for handling varied alignment that I've been happy with. LLVM (I believe) sometimes does a "case split" into aligned vs unaligned versions, but I don't think it does in general, (a) because the number of cases blows up and it then decides the code bloat is too expensive (kinda like with inlining), or (b) something else.

Basically, while there's definitely room for better type-safe tooling for SIMD, I've never seen a good design for a general-purpose language that does more than just give you good type-safe primops (though I'd love to see one; I suspect it's a PhD-sized project). Doing anything "magic" is very questionable in this case.

I'm glossing over a lot here, but the only real performance fun for SIMD is when you're writing software using shuffles. Shuffles don't "scale" with changing the size of the underlying vector, and the cost model of shuffles changes with the CPU micro-arch plus the SIMD vector size. It's pretty fun to read up on the changes in shuffle ops between AVX 128, 256, and 512; they're very, very different.
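To make the "shuffles don't scale" point concrete, here is a minimal sketch (assuming an x86 compiler with AVX enabled, e.g. -mavx) of reversing a vector of floats: one shuffle suffices at 128 bits, while at 256 bits the same shuffle acts only within each 128-bit lane and a cross-lane permute has to finish the job.

/* A minimal sketch of why shuffles don't scale with vector width:
   reversing a vector of floats. Assumes AVX is enabled at compile time. */
#include <immintrin.h>

/* 128-bit SSE: one shuffle reverses all four lanes. */
static inline __m128 reverse4(__m128 v)
{
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3));
}

/* 256-bit AVX: vshufps shuffles within each 128-bit lane only, so a
   full reverse needs a second, cross-lane permute as well. */
static inline __m256 reverse8(__m256 v)
{
    __m256 t = _mm256_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3)); /* reverse within each lane */
    return _mm256_permute2f128_ps(t, t, 1);                      /* swap the two lanes */
}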
I hope that helps. I'm happy to opine more in a week or so when I'm a bit less busy.
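For the alignment point above, a minimal sketch of one aligned/unaligned split, assuming SSE: peel scalar iterations until the pointer reaches 16-byte alignment, run the hot loop with aligned loads and stores, then finish the remainder in scalar code.

/* A minimal sketch of an aligned/unaligned case split, assuming SSE. */
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>

void scale(float *data, size_t n, float k)
{
    size_t i = 0;
    /* scalar peel until data + i is 16-byte aligned */
    while (i < n && ((uintptr_t)(data + i) & 15) != 0) {
        data[i] *= k;
        i++;
    }
    __m128 vk = _mm_set1_ps(k);
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_load_ps(data + i);           /* movaps: requires alignment */
        _mm_store_ps(data + i, _mm_mul_ps(v, vk));
    }
    for (; i < n; i++)                              /* scalar tail */
        data[i] *= k;
}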
@cartazio: I don't think unrolling the loops will help LLVM. The …
Huh, really? I must be out of date; I thought SLP was enabled by default as of 3.4 at -O2, and that loop-vectorize is only on at -O3?
The …
Ahh, OK. So the point being: if you want autovectorization, write scalar code. If you want to take advantage of CPU-specific features, work against those. I'm not sure there's any way to sanely "auto widen" that will actually provide any robust perf guarantee. There are a lot of non-obvious details in how those lane doublings work. And there's also the fact that a naive widening could actually (hypothetically) mess with pipelining of memory loads (though I could be wrong here).
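The runtime micro-arch dispatch @cartazio describes really is only a handful of instructions; here is a minimal sketch assuming GCC or Clang (the two kernels are hypothetical and would each be built with the matching -m flags, e.g. -mavx for the AVX one).

/* A minimal sketch of runtime micro-arch dispatch, assuming GCC/Clang.
   add_f32_sse and add_f32_avx are hypothetical kernels defined elsewhere. */
#include <stddef.h>

void add_f32_sse(const float *a, const float *b, float *out, size_t n);
void add_f32_avx(const float *a, const float *b, float *out, size_t n);

typedef void (*add_f32_fn)(const float *, const float *, float *, size_t);

static add_f32_fn pick_add_f32(void)
{
    __builtin_cpu_init();                 /* initialize CPU feature detection */
    if (__builtin_cpu_supports("avx"))    /* a few instructions, once, versus */
        return add_f32_avx;               /* hundreds of SIMD ops per call    */
    return add_f32_sse;
}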
At this point, nobody knows the "correct" way to expose SIMD. There …
Closing due to inactivity, but it would be nice to see …
(I'll update this after separate-libc lands)
r=me once #14115 lands
It's still marked experimental. Also add documentation.
Closing in favor of #14331