Enable 64-bit vectorization #8452
Conversation
Sweet!
Awesome! cc @JuliaBackports I wonder if this is related to the performance difference reported in JuliaLang/LinearAlgebra.jl#141?
Amazing! |
#endif
};
Type* jl_array_llvmt =
    StructType::create(getGlobalContext(),
Should be jl_LLVMContext.
Thanks. In earlier lines, I see jl_LLVMContext is used sometimes, and getGlobalContext is used other times. What's the difference? Or is it just house-cleaning that hasn't been done yet?
Currently there's no functional difference because we only use one context. I imagine with the threading work progressing we may at some point have more than one, so jl_LLVMContext is preferred.
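For concreteness, here is a minimal sketch (not the actual patch) of what the suggested change amounts to: creating the two-field array struct type against jl_LLVMContext instead of the process-wide getGlobalContext(). The helper name and the exact field types are illustrative assumptions.

```cpp
// Illustrative sketch only -- field list and helper name are assumptions.
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Type.h"
using namespace llvm;

extern LLVMContext &jl_LLVMContext;  // assumed to be defined once by codegen

static StructType *make_jl_array_llvmt() {
    Type *elems[] = {
        Type::getInt8PtrTy(jl_LLVMContext),  // data pointer
        Type::getInt64Ty(jl_LLVMContext)     // length (64-bit target assumed)
    };
    return StructType::create(jl_LLVMContext, elems, "jl_array_t");
}
```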
🍰
I can't solve the conflict that arises when cherry-picking this PR to the backport branch. If it is wanted, someone with a better understanding of the C++ files needs to do it.
I'll take a look.
Pushed into the backport branch.
Yes, the title is correct despite the innocent appearance of the diff. The patch represents jl_array_t more accurately in LLVM. I implemented only the data and length fields since there was no clear benefit to representing the rest of the fields.

Julia's pointer casting was confusing stride computations in LLVM's vectorizer, so it mistook unit-stride loads/stores as high-cost gather/scatter. I had been wondering why the vectorizer was pricing a vector load at 51 clock cycles! I don't know why the problem did not exist for Float32, but it seemed some LLVM transform was pushing casts around differently when the size of the array element type matched sizeof(jl_value_t*). So maybe the title is wrong for 32-bit targets :-)

Though we could try to get LLVM fixed, I think it's better for the code generator to "say what it means" and generate code closer to what Clang generates, since Clang is a prime client of LLVM.
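To make the "say what it means" point concrete, here is a hypothetical C++ analogue (not taken from the patch or from Clang's output) of the access pattern the patch exposes: an array modeled as a {data, length} struct whose data pointer carries the element type, so a simple loop presents unit-stride loads and stores that the vectorizer can cost cheaply rather than treating them as gather/scatter.

```cpp
#include <cstddef>

// Hypothetical stand-in for the {data, length} view of jl_array_t.
struct jl_array64 {
    double *data;
    size_t  length;
};

// Contiguous, unit-stride accesses: the kind of loop LLVM's loop
// vectorizer handles cheaply once the element type is visible.
void scale(jl_array64 &a, double c) {
    for (size_t i = 0; i < a.length; ++i)
        a.data[i] *= c;
}
```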
With LLVM 3.3, the 64-bit versions of the benchmarks in test/perf/simd ran 1.6x to 3.9x faster for me with the patch than without, using an Intel 4th Generation Core i7 processor ("Haswell").