Missing dot(ComplexF32, ComplexF32) for LBT4 + MKL #56
Perhaps this is because …
This is possibly unrelated and unhelpful, so feel free to ignore, but I remember the dotc functions were particularly nasty to work with, because they are the only BLAS functions that return a complex number, for which there is no unified ABI.
That is exactly the problem, and it is why we use the CBLAS functions. But it turns out MKL does not have the `64_` suffixes for ILP64.
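As context for that ABI problem, here is a minimal sketch of the two calling conventions being contrasted, assuming an LP64 BLAS loaded through LBT; the library name and symbol spellings below are illustrative assumptions, not LBT's actual wrapper code:

```julia
# A minimal sketch, not LBT's actual code. Assumes an LP64 BLAS is
# reachable as "libblastrampoline", exporting a gfortran-style
# "cdotc_" and the CBLAS "cblas_cdotc_sub".
const libblas = "libblastrampoline"

function cdotc_fortran(x::Vector{ComplexF32}, y::Vector{ComplexF32})
    n = length(x)
    # Fortran cdotc returns a ComplexF32 by value; whether that value
    # comes back in registers or via a hidden pointer argument differs
    # between compilers (gfortran/ifort/f2c), so this declaration is
    # only correct for some BLAS builds -- the "no unified ABI" problem.
    ccall((:cdotc_, libblas), ComplexF32,
          (Ref{Cint}, Ptr{ComplexF32}, Ref{Cint}, Ptr{ComplexF32}, Ref{Cint}),
          n, x, 1, y, 1)
end

function cdotc_cblas(x::Vector{ComplexF32}, y::Vector{ComplexF32})
    n = length(x)
    out = Ref{ComplexF32}()
    # The CBLAS "_sub" variants write the result through an output
    # pointer instead, sidestepping the complex-return ABI entirely.
    ccall((:cblas_cdotc_sub, libblas), Cvoid,
          (Cint, Ptr{ComplexF32}, Cint, Ptr{ComplexF32}, Cint, Ref{ComplexF32}),
          n, x, 1, y, 1, out)
    return out[]
end
```

Under an ILP64 build, both symbols would additionally need the `64_` suffix, which is exactly what MKL's CBLAS exports were missing.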
Maybe we should just implement …
Or perhaps MKL has other APIs for … There's this, but it's all C++:
@chriselrod Would you know if a native Julia `dot` would be competitive?
It depends. In terms of multithreaded performance, no, Base Julia will not be competitive, but I also don't think multithreaded dot products are especially compelling. For single-threaded performance, LLVM doesn't always make the best decisions by default either. It does, however, do well for Complex:

```julia
julia> using LinearAlgebra, BenchmarkTools
julia> function dotnative(x,y)
s = zero(Base.promote_eltype(x, y))
@fastmath for i in eachindex(x,y)
s += x[i]'*y[i]
end
return s
end
dotnative (generic function with 1 method)
julia> x = rand(Float32, 512);
julia> y = rand(Float32, 512);
julia> @btime dot($x, $y)
34.995 ns (0 allocations: 0 bytes)
128.82773f0
julia> @btime dotnative($x, $y)
48.371 ns (0 allocations: 0 bytes)
128.82771f0
julia> x = rand(Complex{Float32}, 512);
julia> y = rand(Complex{Float32}, 512);
julia> @btime dot($x, $y)
283.276 ns (0 allocations: 0 bytes)
254.56534f0 + 11.200975f0im
julia> @btime dotnative($x, $y)
136.343 ns (0 allocations: 0 bytes)
254.56543f0 + 11.200975f0im
julia> versioninfo()
Julia Version 1.8.0-DEV.1268
Commit 955d427135* (2022-01-10 15:37 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin21.2.0)
CPU: Apple M1
```

Similarly, for CPUs with AVX512, LLVM prefers to use only 256-bit vectors, assuming some combination of downclocking, its poor handling of unvectorized remainders, and chips having only a single 512-bit FMA unit would be problematic. Chips with two such units will therefore do much better at power-of-2 sizes when starting Julia with …:

```julia
julia> using LinearAlgebra, BenchmarkTools
julia> function dotnative(x,y)
s = zero(Base.promote_eltype(x, y))
@fastmath for i in eachindex(x,y)
s += x[i]'*y[i]
end
return s
end
dotnative (generic function with 1 method)
julia> x = rand(Float32, 512);
julia> y = rand(Float32, 512);
julia> @btime dot($x, $y)
26.785 ns (0 allocations: 0 bytes)
136.5527f0
julia> @btime dotnative($x, $y)
25.224 ns (0 allocations: 0 bytes)
136.5527f0
julia> x = rand(Complex{Float32}, 512);
julia> y = rand(Complex{Float32}, 512);
julia> @btime dot($x, $y)
68.608 ns (0 allocations: 0 bytes)
255.64641f0 + 1.9186249f0im
julia> @btime dotnative($x, $y)
84.649 ns (0 allocations: 0 bytes)
255.64641f0 + 1.918611f0im
julia> versioninfo()
Julia Version 1.8.0-DEV.1370
Commit 816c6a2627* (2022-01-21 20:05 UTC)
Platform Info:
OS: Linux (x86_64-redhat-linux)
CPU: Intel(R) Core(TM) i9-9940X CPU @ 3.30GHz
```

With …:

```julia
julia> using LinearAlgebra, BenchmarkTools
julia> function dotnative(x,y)
s = zero(Base.promote_eltype(x, y))
@fastmath for i in eachindex(x,y)
s += x[i]'*y[i]
end
return s
end
dotnative (generic function with 1 method)
julia> x = rand(Float32, 512);
julia> y = rand(Float32, 512);
julia> @btime dot($x, $y)
27.055 ns (0 allocations: 0 bytes)
136.53f0
julia> @btime dotnative($x, $y)
21.177 ns (0 allocations: 0 bytes)
136.53f0
julia> x = rand(Complex{Float32}, 512);
julia> y = rand(Complex{Float32}, 512);
julia> @btime dot($x, $y)
68.548 ns (0 allocations: 0 bytes)
255.58414f0 - 6.4648895f0im
julia> @btime dotnative($x, $y)
56.924 ns (0 allocations: 0 bytes)
255.58414f0 - 6.4648867f0im
julia> versioninfo()
Julia Version 1.8.0-DEV.1370
Commit 816c6a2627* (2022-01-21 20:05 UTC)
Platform Info:
OS: Linux (x86_64-redhat-linux)
CPU: Intel(R) Core(TM) i9-9940X CPU @ 3.30GHz
```
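As an aside (not from the thread): one way to verify which vector width LLVM chose on a given machine is to inspect the generated IR of the inner loop, e.g. `<8 x float>` versus `<16 x float>` operations:

```julia
# Inspect the vectorization of dotnative as defined above; on AVX512
# hardware, <8 x float> in the hot loop means 256-bit vectors were
# chosen, <16 x float> means 512-bit.
using InteractiveUtils  # for @code_llvm
x = rand(Float32, 512); y = rand(Float32, 512);
@code_llvm debuginfo=:none dotnative(x, y)
```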
Getting a little more creative with the definitions, we can improve performance at odd sizes:

```julia
julia> function dotnative_fastrem(x,y)
T = Base.promote_eltype(x, y)
s = zero(T)
N = length(x);
@assert length(y) == N "vectors should have equal length"
@inbounds @fastmath begin
N32 = N & -32
for i in 1:N32
s += x[i]'*y[i]
end
Nrem32 = N & 31
if Nrem32 ≥ 16
for i in 1:16
s += x[i+N32]' * y[i+N32]
end
N32 += 16
end
Nrem16 = Nrem32 & 15
if Nrem16 ≥ 8 # only take the 8-wide chunk if at least 8 elements actually remain
for i in 1:8
s += x[i+N32]' * y[i+N32]
end
N32 += 8
end
for i in N32+1:N
s += x[i]' * y[i]
end
end
return s
end
dotnative_fastrem (generic function with 1 method)
julia> function dotnative(x,y)
s = zero(Base.promote_eltype(x, y))
@fastmath for i in eachindex(x,y)
s += x[i]'*y[i]
end
return s
end
dotnative (generic function with 1 method)
julia> x = rand(Complex{Float32}, 511);
julia> y = rand(Complex{Float32}, 511);
julia> @btime dotnative($x,$y)
128.278 ns (0 allocations: 0 bytes)
257.9638f0 + 6.247218f0im
julia> @btime dotnative_fastrem($x,$y)
77.057 ns (0 allocations: 0 bytes)
257.96378f0 + 6.247218f0im
```

This was just my first attempt. We can probably do better, but it's already a substantial improvement on this arch (Skylake-X with …). For comparison, the fastest LoopVectorization version:

```julia
julia> using LoopVectorization, VectorizationBase
julia> function cdot_swizzle(ca::AbstractVector{Complex{T}}, cb::AbstractVector{Complex{T}}) where {T}
a = reinterpret(T, ca)
b = reinterpret(T, cb)
reim = Vec(zero(T),zero(T)) # needs VectorizationBase
@turbo for i ∈ eachindex(a)
reim = vfmsubadd(vmovsldup(a[i]), b[i], vfmsubadd(vmovshdup(a[i]), vpermilps177(b[i]), reim))
end
Complex(reim(1), reim(2))
end
cdot_swizzle (generic function with 1 method)
julia> x = rand(Complex{Float32}, 511);
julia> y = rand(Complex{Float32}, 511);
julia> @btime cdot_swizzle($x,$y)
62.163 ns (0 allocations: 0 bytes)
255.22119f0 - 13.409902f0im
julia> @btime dotnative($x,$y)
128.648 ns (0 allocations: 0 bytes)
255.22118f0 - 13.409902f0im
julia> @btime dotnative_fastrem($x,$y)
77.479 ns (0 allocations: 0 bytes)
255.22119f0 - 13.409902f0im
```
Thanks, that is pretty exhaustive! I feel that even in the trivial case, the difference is small enough that we may be OK going with a native implementation.
Intel said they will have CBLAS functions with `64_` suffixes in the next release, but in the meantime they recommended that we provide wrappers of our own for the CBLAS dot functions in LBT. They suggested something like this. @staticfloat, would this work?
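Intel's actual snippet is not preserved above; purely as a hedged sketch of the kind of check involved (the probing approach and names here are my assumptions, not Intel's recommendation):

```julia
using Libdl
using LinearAlgebra: BLAS

# Hypothetical probe: does a loaded BLAS export the ILP64-suffixed
# CBLAS symbol that LBT would forward `dot` to? MKL v2022 did not,
# which is what this issue is about.
function has_suffixed_cdotc(lib::AbstractString)
    handle = Libdl.dlopen(lib; throw_error = false)
    handle === nothing && return false
    return Libdl.dlsym_e(handle, :cblas_cdotc_sub64_) != C_NULL
end

# Check every library LBT currently has loaded:
for libinfo in BLAS.get_config().loaded_libs
    println(libinfo.libname, " => ", has_suffixed_cdotc(libinfo.libname))
end
```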
Should we tag and release 4.2 and pull it into Julia? After that, I can revive the MKL PR.
There is already …
This should allow us to use CBLAS symbols in MKL v2022. Verified that this fixes JuliaLinearAlgebra/libblastrampoline#56
Using the `vs/lp64` branch of MKL.jl.