Move simd out of unstable #12524

Closed · wants to merge 2 commits into from

Conversation

@emberian (Member)

It's still marked experimental. Also add documentation.

@alexcrichton (Member)

Are these Travis failures legit? Seems weird.

@siavashserver

@cmr Thank you very much!

The SIMD register width has been doubled (128-bit vs 256-bit) with the new AVX-capable CPUs, so I guess there will be a need to create new data types like f32x8, f64x4, etc.

Is it possible to have a self-expanding data type, independent of CPU capabilities (SSE vs AVX), at the language level too? That would avoid a lot of boilerplate code on the application side and would let applications benefit from new SIMD hardware capabilities with a simple recompilation for the target hardware.

Here is every programmer's dream (warning: C code incoming!):

/* f32exp_t : expandable data type definition at the language level */

#ifdef AVX_CAPABLE
    #include <immintrin.h>
    typedef __m256 f32exp_t;
    #define F32EXP_WIDTH 8
#elif defined(SSE_CAPABLE)
    #include <xmmintrin.h>
    typedef __m128 f32exp_t;
    #define F32EXP_WIDTH 4
#elif defined(XXX_CAPABLE)
    /* other CPU arch SIMD type with smaller width */
#else
    /* old/unknown hardware fallback */
    typedef float f32exp_t;
    #define F32EXP_WIDTH 1
#endif

/* SIMD operator overloads (a concrete sketch of these follows after the example below) */
/* Arithmetic addition operator */
/* If AVX_CAPABLE  -> _mm256_add_ps -> vaddps */
/* If SSE_CAPABLE  -> _mm_add_ps    -> addps  */
/* If NONE_CAPABLE -> usual add operator */

/* Now back to reality at application source code level */
void init(f32exp_t *input)
{
    for (int i=0; i<F32EXP_WIDTH; i++)
    {
        /* f32exp init code */
    }
}

void process(f32exp_t *input1, f32exp_t *input2, f32exp_t *output, int arraySize)
{
    for (int i=0; i<arraySize; i++)
    {
         /* Smart lang/compiler using suitable SIMD arithmetic operator overload */
         output[i] = input1[i] + input2[i];
    }
}
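
A minimal sketch of the "SIMD operator overloads" idea from the comments above, assuming plain C with no operator overloading: f32exp_add is a hypothetical helper that reuses the same AVX_CAPABLE/SSE_CAPABLE macros, not code anyone has proposed for the library.

/* Hypothetical helper: one addition "overload" per target */
static inline f32exp_t f32exp_add(f32exp_t a, f32exp_t b)
{
#ifdef AVX_CAPABLE
    return _mm256_add_ps(a, b);   /* vaddps: 8 floats at once */
#elif defined(SSE_CAPABLE)
    return _mm_add_ps(a, b);      /* addps: 4 floats at once */
#else
    return a + b;                 /* scalar fallback (f32exp_t == float) */
#endif
}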

Woot, longest comment of my life. Thanks for reading ;P

@emberian (Member, Author)

@siavashserver Yes, constantly improving hardware is a problem. I don't think the right solution is to keep making more and more types. @cartazio had some opinions on this. I think we're going to end up using a very different API for SIMD, but maybe not.

As it stands, though, we could indeed make a type that is "the widest vector type" for the platform. Not sure it'd be super useful, especially in public APIs (feature flags shouldn't affect ABI, I don't think).

@siavashserver

@cmr Absolutely, it's just going to raise a hell of a lot of data types in the future (especially with the upcoming AVX-512). I like the widest-vector-type idea: applications would then only need to know about the widest available f32/f64/int vector type at compile time and pack their data into SIMD vectors accordingly.

Nobody really wants to miss out on the extra horsepower of newer hardware by limiting themselves to fixed vector sizes (f32x4, f32x8, f32x16, etc.) or by manually writing separate versions for each hardware generation over and over again.

I'd be glad to find out more about @cartazio's opinion on this too.

@cartazio (Contributor)

Thanks for pinging me! I'm a bit buried this week, but here are some quick thoughts. Ping me early next week if you want me to elaborate.

Automatic vectorization is completely unrelated to manual SIMD. Trying to mix the two will probably make automatic vectorization less effective.

Having a magic "largest target SIMD size" is a bad idea. Why? Because any serious/interesting use of SIMD is all about the shuffles! And here's a fun fact: depending on the SIMD width, you have very, very different shuffles. I also don't know offhand of any nice "size-oblivious" way of describing those shuffles. (Seriously, the cost model for how to even do this is pretty non-uniform between CPU generations.)

If you want to be "future-proof", unroll your inner loops so you have something like 16/32/64 small read-and-update operations, and pray that LLVM's vectorization can do something "ok" with them.

Either you're using autovectorization in the compiler tooling, or you're using explicit SIMD. In the latter case, you write a different version for every CPU microarch variant to exploit the variations in load/store/compute latency for instructions, and you dispatch at runtime on which microarch you're actually running on (this is something like 5 instructions out of 50-300 SIMD ops, so the overhead is trivial). (And I've not seen this done automatically in any compiler.)
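
A minimal sketch of that runtime-dispatch pattern, assuming GCC/Clang's __builtin_cpu_init/__builtin_cpu_supports builtins; add_avx, add_sse, and add_scalar are hypothetical per-microarch kernels compiled separately with the appropriate target flags, not code from this PR.

#include <stddef.h>

/* Hypothetical kernels, defined elsewhere (e.g. compiled with -mavx / -msse). */
void add_avx(const float *a, const float *b, float *out, size_t n);
void add_sse(const float *a, const float *b, float *out, size_t n);
void add_scalar(const float *a, const float *b, float *out, size_t n);

typedef void (*add_fn)(const float *, const float *, float *, size_t);

static add_fn pick_add(void)
{
    __builtin_cpu_init();                 /* GCC/Clang CPU-feature detection */
    if (__builtin_cpu_supports("avx"))
        return add_avx;                   /* 256-bit path */
    if (__builtin_cpu_supports("sse"))
        return add_sse;                   /* 128-bit path */
    return add_scalar;                    /* portable fallback */
}

void add(const float *a, const float *b, float *out, size_t n)
{
    static add_fn fn;                     /* resolved once; a handful of instructions of overhead */
    if (!fn)
        fn = pick_add();
    fn(a, b, out, n);
}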

Oh, and that's even ignoring alignment. Well-written SIMD code actually has different cases for when the data is aligned or not, and the amount of alignment depends on the SIMD size AND the CPU microarch variant (which I've not seen supported).

Depending on the use case, it's worthwhile to write a whole bunch of variants that handle misaligned reads and writes (even though you now have to write potentially 6 or more different variants!). I've not seen any automatic schemes for handling varied alignment that I've been happy with. LLVM (I believe) sometimes does a "case split" for certain aligned vs unaligned versions, but I don't think it does in general, (a) because the number of cases blows up and it then decides the code bloat is too expensive (kind of like with inlining), or (b) for some other reason.
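
A hypothetical illustration of the aligned/unaligned case split with SSE intrinsics; the function is only a sketch of the idea (one aligned path, one unaligned path, one scalar tail), not code from this discussion.

#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>

void add4(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
    int aligned = ((uintptr_t)a % 16 == 0) && ((uintptr_t)b % 16 == 0)
               && ((uintptr_t)out % 16 == 0);
    if (aligned) {
        for (; i + 4 <= n; i += 4) {
            __m128 va = _mm_load_ps(a + i);    /* aligned loads */
            __m128 vb = _mm_load_ps(b + i);
            _mm_store_ps(out + i, _mm_add_ps(va, vb));
        }
    } else {
        for (; i + 4 <= n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);   /* unaligned loads */
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
        }
    }
    for (; i < n; i++)                         /* scalar tail */
        out[i] = a[i] + b[i];
}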

Basically, while there's definitely room for better type-safe tooling for SIMD, I've never seen any good designs for a general-purpose language that do more than just give you good type-safe primops (though I'd love to see one; I suspect it's a PhD-sized project). Doing anything "magic" is very questionable here.

I'm glossing over a lot here. But basically the only real performance fun for SIMD is when you're writing software using shuffles. Shuffles don't "scale" with changes to the size of the underlying vector, and the cost model of shuffles changes with the CPU microarch plus the SIMD vector size. It's pretty fun to read up on the changes in shuffle ops between AVX 128, 256, and 512; they're very, very different.
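
A hypothetical illustration of why shuffles don't "scale": reversing a vector of floats is a single shuffle at 128 bits, but at 256 bits AVX's shuffle operates within 128-bit lanes, so a cross-lane permute is needed first (compile with AVX enabled, e.g. -mavx).

#include <immintrin.h>

__m128 reverse4(__m128 v)
{
    /* [a0 a1 a2 a3] -> [a3 a2 a1 a0]: one shuffle */
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3));
}

__m256 reverse8(__m256 v)
{
    /* swap the two 128-bit halves, then reverse within each lane */
    __m256 swapped = _mm256_permute2f128_ps(v, v, 0x01);
    return _mm256_shuffle_ps(swapped, swapped, _MM_SHUFFLE(0, 1, 2, 3));
}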

@cartazio (Contributor)

I hope that helps. I'm happy to opine more in a week or so, when I'm a bit less busy.

@thestinger (Contributor)

@cartazio: I don't think unrolling the loops will help LLVM. The loop-vectorize pass is far more mature and capable than the slp-vectorize pass for scalar code, and it includes unrolling as part of vectorization. It will likely hurt a lot more than it helps, although there's a planned loop rerolling pass to undo premature unrolling so vectorization and LLVM's loop unrolling can do a better, platform-specific job.

@cartazio (Contributor)

Huh, really? I must be out of date. I thought SLP was enabled by default with 3.4 at -O2, and that loop-vectorize was only enabled at -O3? I'll have to take some time to look into that more.

@thestinger (Contributor)

The loop-vectorize pass is enabled at -O2 and -O3, while slp-vectorize is only enabled at -O3. Vectorizing loops is often worthwhile, so the heuristics are less conservative; scalar vectorization has to be much more conservative.

@cartazio (Contributor)

Ahh, OK.

So the point being: if you want autovectorization, write scalar code. If you want to take advantage of CPU-specific features, work against those directly. I'm not sure there's any way to sanely "auto-widen" that will actually provide a robust perf guarantee. There are a lot of non-obvious details in how those lane doublings work, and there's also the fact that a naive widening could (hypothetically) mess with the pipelining of memory loads (though I could be wrong here).
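
A hypothetical example of the "write scalar code" advice: a plain counted loop over restrict pointers is the kind of code LLVM's loop vectorizer handles well, and clang's -Rpass=loop-vectorize remarks are one way to check what it actually did.

#include <stddef.h>

/* Plain scalar code, no intrinsics; the loop vectorizer can widen this itself. */
void saxpy(float a, const float *restrict x, float *restrict y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}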

@nikomatsakis (Contributor)

At this point, nobody knows the "correct" way to expose SIMD. There are use cases that demand precise access to intrinsics with fixed and known vector widths, and which must be recoded for each new architecture, and others that can easily be written in a generic way. Probably we'll ultimately have to expose both.

@alexcrichton (Member)

Closing due to inactivity, but it would be nice to see unstable go away!

@emberian (Member, Author)

(I'll update this after separate-libc lands)


@emberian reopened this May 14, 2014
@alexcrichton (Member)

r=me once #14115 lands

emberian added 2 commits May 16, 2014 09:52
It's still marked experimental. Also add documentation.
@alexcrichton (Member)

Closing in favor of #14331

@brson mentioned this pull request May 23, 2014
bors added a commit that referenced this pull request May 24, 2014