Missing dot(ComplexF32, ComplexF32) for LBT4 + MKL #56
Perhaps this is because …
This is possibly unrelated and unhelpful, so feel free to ignore, but I remember the dotc functions were particularly nasty to work with, because they are the only BLAS functions that return a complex number, for which there is no unified ABI.
That is exactly the problem, and it is why we use the CBLAS functions. But it turns out MKL does not have the `64_` suffixes for ILP64.
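As context for that ABI problem, here is a minimal sketch of the two calling conventions being contrasted, assuming an LP64 BLAS loaded through LBT; the library name and symbol spellings below are illustrative assumptions, not LBT's actual wrapper code:

```julia
# A minimal sketch, not LBT's actual code. Assumes an LP64 BLAS is
# reachable as "libblastrampoline", exporting a gfortran-style
# "cdotc_" and the CBLAS "cblas_cdotc_sub".
const libblas = "libblastrampoline"

function cdotc_fortran(x::Vector{ComplexF32}, y::Vector{ComplexF32})
    n = length(x)
    # Fortran cdotc returns a ComplexF32 by value; whether that value
    # comes back in registers or via a hidden pointer argument differs
    # between compilers (gfortran/ifort/f2c), so this declaration is
    # only correct for some BLAS builds -- the "no unified ABI" problem.
    ccall((:cdotc_, libblas), ComplexF32,
          (Ref{Cint}, Ptr{ComplexF32}, Ref{Cint}, Ptr{ComplexF32}, Ref{Cint}),
          n, x, 1, y, 1)
end

function cdotc_cblas(x::Vector{ComplexF32}, y::Vector{ComplexF32})
    n = length(x)
    out = Ref{ComplexF32}()
    # The CBLAS "_sub" variants write the result through an output
    # pointer instead, sidestepping the complex-return ABI entirely.
    ccall((:cblas_cdotc_sub, libblas), Cvoid,
          (Cint, Ptr{ComplexF32}, Cint, Ptr{ComplexF32}, Cint, Ref{ComplexF32}),
          n, x, 1, y, 1, out)
    return out[]
end
```

Under an ILP64 build, both symbols would additionally need the `64_` suffix, which is exactly what MKL's CBLAS exports were missing.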
Maybe we should just implement …
Or perhaps MKL has other APIs for … There's this, but it's all C++:
@chriselrod Would you know if a native Julia `dot` would be competitive?
It depends. In terms of multithreaded performance, no, Base Julia will not be competitive, but I also don't think multithreaded dot products are especially compelling. For single-threaded performance, LLVM doesn't always make the best decisions by default either. It does, however, do well for Complex:

```julia
julia> using LinearAlgebra, BenchmarkTools
julia> function dotnative(x,y)
s = zero(Base.promote_eltype(x, y))
@fastmath for i in eachindex(x,y)
s += x[i]'*y[i]
end
return s
end
dotnative (generic function with 1 method)
julia> x = rand(Float32, 512);
julia> y = rand(Float32, 512);
julia> @btime dot($x, $y)
34.995 ns (0 allocations: 0 bytes)
128.82773f0
julia> @btime dotnative($x, $y)
48.371 ns (0 allocations: 0 bytes)
128.82771f0
julia> x = rand(Complex{Float32}, 512);
julia> y = rand(Complex{Float32}, 512);
julia> @btime dot($x, $y)
283.276 ns (0 allocations: 0 bytes)
254.56534f0 + 11.200975f0im
julia> @btime dotnative($x, $y)
136.343 ns (0 allocations: 0 bytes)
254.56543f0 + 11.200975f0im
julia> versioninfo()
Julia Version 1.8.0-DEV.1268
Commit 955d427135* (2022-01-10 15:37 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin21.2.0)
CPU: Apple M1
```

Similarly, for CPUs with AVX512, LLVM prefers to use only 256-bit vectors, assuming some combination of downclocking, its poor handling of unvectorized remainders, and chips having only a single 512-bit FMA unit would be problematic. Chips with two such units will therefore do much better at power-of-2 sizes when starting Julia with …:

```julia
julia> using LinearAlgebra, BenchmarkTools
julia> function dotnative(x,y)
s = zero(Base.promote_eltype(x, y))
@fastmath for i in eachindex(x,y)
s += x[i]'*y[i]
end
return s
end
dotnative (generic function with 1 method)
julia> x = rand(Float32, 512);
julia> y = rand(Float32, 512);
julia> @btime dot($x, $y)
26.785 ns (0 allocations: 0 bytes)
136.5527f0
julia> @btime dotnative($x, $y)
25.224 ns (0 allocations: 0 bytes)
136.5527f0
julia> x = rand(Complex{Float32}, 512);
julia> y = rand(Complex{Float32}, 512);
julia> @btime dot($x, $y)
68.608 ns (0 allocations: 0 bytes)
255.64641f0 + 1.9186249f0im
julia> @btime dotnative($x, $y)
84.649 ns (0 allocations: 0 bytes)
255.64641f0 + 1.918611f0im
julia> versioninfo()
Julia Version 1.8.0-DEV.1370
Commit 816c6a2627* (2022-01-21 20:05 UTC)
Platform Info:
OS: Linux (x86_64-redhat-linux)
CPU: Intel(R) Core(TM) i9-9940X CPU @ 3.30GHz
```

With …:

```julia
julia> using LinearAlgebra, BenchmarkTools
julia> function dotnative(x,y)
s = zero(Base.promote_eltype(x, y))
@fastmath for i in eachindex(x,y)
s += x[i]'*y[i]
end
return s
end
dotnative (generic function with 1 method)
julia> x = rand(Float32, 512);
julia> y = rand(Float32, 512);
julia> @btime dot($x, $y)
27.055 ns (0 allocations: 0 bytes)
136.53f0
julia> @btime dotnative($x, $y)
21.177 ns (0 allocations: 0 bytes)
136.53f0
julia> x = rand(Complex{Float32}, 512);
julia> y = rand(Complex{Float32}, 512);
julia> @btime dot($x, $y)
68.548 ns (0 allocations: 0 bytes)
255.58414f0 - 6.4648895f0im
julia> @btime dotnative($x, $y)
56.924 ns (0 allocations: 0 bytes)
255.58414f0 - 6.4648867f0im
julia> versioninfo()
Julia Version 1.8.0-DEV.1370
Commit 816c6a2627* (2022-01-21 20:05 UTC)
Platform Info:
OS: Linux (x86_64-redhat-linux)
CPU: Intel(R) Core(TM) i9-9940X CPU @ 3.30GHz
```
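As an aside (not from the thread): one way to verify which vector width LLVM chose on a given machine is to inspect the generated IR of the inner loop, e.g. `<8 x float>` versus `<16 x float>` operations:

```julia
# Inspect the vectorization of dotnative as defined above; on AVX512
# hardware, <8 x float> in the hot loop means 256-bit vectors were
# chosen, <16 x float> means 512-bit.
using InteractiveUtils  # for @code_llvm
x = rand(Float32, 512); y = rand(Float32, 512);
@code_llvm debuginfo=:none dotnative(x, y)
```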
Getting a little more creative with the definitions, we can improve performance at odd sizes:

```julia
julia> function dotnative_fastrem(x,y)
T = Base.promote_eltype(x, y)
s = zero(T)
N = length(x);
@assert length(y) == N "vectors should have equal length"
@inbounds @fastmath begin
N32 = N & -32
for i in 1:N32
s += x[i]'*y[i]
end
Nrem32 = N & 31
if Nrem32 ≥ 16
for i in 1:16
s += x[i+N32]' * y[i+N32]
end
N32 += 16
end
Nrem16 = Nrem32 & 15
if Nrem16 ≥ 8 # only take the 8-wide chunk if at least 8 elements actually remain
for i in 1:8
s += x[i+N32]' * y[i+N32]
end
N32 += 8
end
for i in N32+1:N
s += x[i]' * y[i]
end
end
return s
end
dotnative_fastrem (generic function with 1 method)
julia> function dotnative(x,y)
s = zero(Base.promote_eltype(x, y))
@fastmath for i in eachindex(x,y)
s += x[i]'*y[i]
end
return s
end
dotnative (generic function with 1 method)
julia> x = rand(Complex{Float32}, 511);
julia> y = rand(Complex{Float32}, 511);
julia> @btime dotnative($x,$y)
128.278 ns (0 allocations: 0 bytes)
257.9638f0 + 6.247218f0im
julia> @btime dotnative_fastrem($x,$y)
77.057 ns (0 allocations: 0 bytes)
257.96378f0 + 6.247218f0im
```

This was just my first attempt. We can probably do better, but it's already a substantial improvement on this arch (Skylake-X with …). For comparison, the fastest LoopVectorization version:

```julia
julia> using LoopVectorization, VectorizationBase
julia> function cdot_swizzle(ca::AbstractVector{Complex{T}}, cb::AbstractVector{Complex{T}}) where {T}
a = reinterpret(T, ca)
b = reinterpret(T, cb)
reim = Vec(zero(T),zero(T)) # needs VectorizationBase
@turbo for i ∈ eachindex(a)
reim = vfmsubadd(vmovsldup(a[i]), b[i], vfmsubadd(vmovshdup(a[i]), vpermilps177(b[i]), reim))
end
Complex(reim(1), reim(2))
end
cdot_swizzle (generic function with 1 method)
julia> x = rand(Complex{Float32}, 511);
julia> y = rand(Complex{Float32}, 511);
julia> @btime cdot_swizzle($x,$y)
62.163 ns (0 allocations: 0 bytes)
255.22119f0 - 13.409902f0im
julia> @btime dotnative($x,$y)
128.648 ns (0 allocations: 0 bytes)
255.22118f0 - 13.409902f0im
julia> @btime dotnative_fastrem($x,$y)
77.479 ns (0 allocations: 0 bytes)
255.22119f0 - 13.409902f0im
```
Thanks, that is pretty exhaustive! I feel that even in the trivial case, the difference is small enough that we may be OK going with a native implementation.
Intel said they will have CBLAS functions with `64_` suffixes in the next release, but in the meantime they recommended that we provide wrappers of our own for the CBLAS dot functions in LBT. They suggested something like this. @staticfloat, would this work?
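Intel's actual snippet is not preserved above; purely as a hedged sketch of the kind of check involved (the probing approach and names here are my assumptions, not Intel's recommendation):

```julia
using Libdl
using LinearAlgebra: BLAS

# Hypothetical probe: does a loaded BLAS export the ILP64-suffixed
# CBLAS symbol that LBT would forward `dot` to? MKL v2022 did not,
# which is what this issue is about.
function has_suffixed_cdotc(lib::AbstractString)
    handle = Libdl.dlopen(lib; throw_error = false)
    handle === nothing && return false
    return Libdl.dlsym_e(handle, :cblas_cdotc_sub64_) != C_NULL
end

# Check every library LBT currently has loaded:
for libinfo in BLAS.get_config().loaded_libs
    println(libinfo.libname, " => ", has_suffixed_cdotc(libinfo.libname))
end
```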
Should we tag and release 4.2 and pull it into Julia? After that, I can revive the MKL PR.
There is already …
This should allow us to use CBLAS symbols in MKL v2022. Verified that this fixes JuliaLinearAlgebra/libblastrampoline#56
Using the `vs/lp64` branch of MKL.jl.