Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BLAS.scal! only supports StridedVectors #141

Closed
jiahao opened this issue Sep 19, 2014 · 15 comments
Closed

BLAS.scal! only supports StridedVectors #141

jiahao opened this issue Sep 19, 2014 · 15 comments
Labels
bug Something isn't working

Comments

@jiahao
Copy link
Member

jiahao commented Sep 19, 2014

As reported in julia-dev, this call to LinAlg.BLAS.scal! works but produces unexpected behavior:

julia> A=ones(4,4);B = sub(A,1:2,3:4); BLAS.scal!(length(B),10.0,B,1); A #should have A[1:2,3:4] .== 10
4x4 Array{Float64,2}:
 1.0  1.0  10.0  1.0
 1.0  1.0  10.0  1.0
 1.0  1.0  10.0  1.0
 1.0  1.0  10.0  1.0

It looks like the _scal BLAS function doesn't really support strided matrices; it's only meant to be used with strided vectors. The issue here is that the function handles only one stride in one dimension and so blithely ignores all the stride information wrapped into StridedArray, and only uses the stride passed by the argument incx.

Would there be any reason to pass scal! a StridedVector and a value of incx not equal to its stride?

If not, there are two reasonable solutions:

  1. Change DX::Union(Ptr{$elty},StridedArray{$elty}) to remove StridedArray entirely.
  2. Change DX::Union(Ptr{$elty},StridedArray{$elty}) to DX::Union(Ptr{$elty},StridedVector{$elty}) and remove incx.

Otherwise yet a third possibility is Option 2, but make incx default to the stride of DX.

@jiahao jiahao added bug Something isn't working linear algebra labels Sep 19, 2014
@nwh
Copy link

nwh commented Sep 19, 2014

It looks like dot! and axpy! will have the same issue because they accept StridedArrays as well.

As for option 2: we can't get a stride from a Ptr so incx is a required parameter.

It seems difficult to nicely mix Ptr with specialized array types in the same method.

@andreasnoack
Copy link
Member

It used to be that there functions were defined for Union(Ptr,Array) and hence you could be sure that the memory layout was contiguous. I needed one of these functions for a non-contiguous array view an extended the definition to Union(Ptr,StridedArray). This didn't restrict the old use and it allowed my use case, but it opened up for the problem you described.

As @timholy noted, most users shouldn't use the BLAS functions because most of the BLAS functionality is provided through other safe Julia function, .e.g dot, scale!,*, \. However, I think it would be a good idea to separate out the pointer versions and do checks of the input arguments in the non-pointer versions to avoid as many accidents as possible.

@Jutho
Copy link
Contributor

Jutho commented Sep 19, 2014

Restricting the argument of scal! to StridedVector would make it no longer accept a standard Array, which is in the end what happens when you call scale! on an Array{T,N}. However, there is a simple workaround to update

        BLAS.scal!(length(X), s, X, 1)

to

        BLAS.scal!(length(X), s, vec(X), 1)

on line 14 of linalg/dense.jl.

Another possibility is to change the argument to Union(StridedVector,Array) .

@StefanKarpinski
Copy link
Member

Is BLAS really going to be that much faster than Julia here? Would it be better just to write generic Julia code that does this scaling operation? In the dense and sparse cases, it ought to be quite fast.

@Jutho
Copy link
Contributor

Jutho commented Sep 19, 2014

I find this for large vectors

julia> A=randn(10^9);

julia> @time Base.LinAlg.generic_scale!(A,3);
elapsed time: 1.196689975 seconds (80 bytes allocated)

julia> @time Base.LinAlg.generic_scale!(A,3);
elapsed time: 0.885319205 seconds (80 bytes allocated)

julia> @time Base.LinAlg.generic_scale!(A,3);
elapsed time: 0.915600821 seconds (80 bytes allocated)

julia> @time Base.LinAlg.generic_scale!(A,3);
elapsed time: 0.920074259 seconds (80 bytes allocated)

julia> @time Base.LinAlg.scale!(A,3);
elapsed time: 0.877416718 seconds (80 bytes allocated)

julia> @time Base.LinAlg.scale!(A,3);
elapsed time: 0.855078435 seconds (80 bytes allocated)

and this for small vectors

julia> A=randn(10);
julia> @time for i=1:1000000;Base.LinAlg.generic_scale!(A,3);end
elapsed time: 0.025317606 seconds (0 bytes allocated)

julia> @time for i=1:1000000;Base.LinAlg.generic_scale!(A,3);end
elapsed time: 0.025013646 seconds (0 bytes allocated)

julia> @time for i=1:1000000;Base.LinAlg.generic_scale!(A,3);end
elapsed time: 0.024841576 seconds (0 bytes allocated)

julia> @time for i=1:1000000;Base.LinAlg.scale!(A,3);end
elapsed time: 0.025048104 seconds (0 bytes allocated)

julia> @time for i=1:1000000;Base.LinAlg.scale!(A,3);end
elapsed time: 0.026964141 seconds (0 bytes allocated)

julia> @time for i=1:1000000;Base.LinAlg.scale!(A,3);end
elapsed time: 0.026892926 seconds (0 bytes allocated)

Seems like there is indeed no point in using BLAS for this.

@nwh
Copy link

nwh commented Sep 23, 2014

How about the following solution:

  • change primary scal! method back to DX::Union(Ptr{$elty},Array{$elty})
  • add scal! method that takes DX::Union(StridedVector{$elty},Array{$elty}) and passes a Ptr to primary scal!
  • add scal method that performs a copy.

Features:

  • works nicely with dense arrays
  • secondary methods work nicely with StridedVectors

Possible issue:

  • will there be a case when stride(A,1) != 1 where A is an Array?

It would look like this:

## scal
for (fname, elty) in ((:dscal_,:Float64),
                      (:sscal_,:Float32),
                      (:zscal_,:Complex128),
                      (:cscal_,:Complex64))
    @eval begin
        # SUBROUTINE DSCAL(N,DA,DX,INCX)
        function scal!(n::Integer, DA::$elty, DX::Union(Ptr{$elty},Array{$elty}), incx::Integer)
            ccall(($(string(fname)),libblas), Void,
                  (Ptr{BlasInt}, Ptr{$elty}, Ptr{$elty}, Ptr{BlasInt}),
                  &n, &DA, DX, &incx)
            DX
        end
        function scal!(DA::$elty, DX::Union(StridedVector{$elty},Array{$elty}))
            scal!(length(DX), DA, pointer(DX), stride(DX,1))
        end
        function scal(DA::$elty, DX::Union(StridedVector{$elty},Array{$elty}))
            scal!(DA,copy(DX))
        end
    end
end

@simonster
Copy link
Member

@Jutho For medium-sized to large arrays there is a non-negligible speed difference for me:

julia> A=randn(10^9);

julia> @time Base.LinAlg.generic_scale!(A,3);
elapsed time: 0.870733492 seconds (80 bytes allocated)

julia> @time Base.LinAlg.generic_scale!(A,3);
elapsed time: 0.853314168 seconds (80 bytes allocated)

julia> @time Base.LinAlg.generic_scale!(A,3);
elapsed time: 0.856629254 seconds (80 bytes allocated)

julia> @time Base.LinAlg.scale!(A,3);
elapsed time: 0.50168307 seconds (80 bytes allocated)

julia> @time Base.LinAlg.scale!(A,3);
elapsed time: 0.517095924 seconds (80 bytes allocated)

julia> A=randn(2^14);

julia> @time for i=1:100000;Base.LinAlg.generic_scale!(A,3);end
elapsed time: 0.76773478 seconds (0 bytes allocated)

julia> @time for i=1:100000;Base.LinAlg.generic_scale!(A,3);end
elapsed time: 0.771294171 seconds (0 bytes allocated)

julia> @time for i=1:100000;Base.LinAlg.generic_scale!(A,3);end
elapsed time: 0.76905031 seconds (0 bytes allocated)

julia> @time for i=1:100000;Base.LinAlg.scale!(A,3);end
elapsed time: 0.332148663 seconds (0 bytes allocated)

julia> @time for i=1:100000;Base.LinAlg.scale!(A,3);end
elapsed time: 0.32993218 seconds (0 bytes allocated)

julia> @time for i=1:100000;Base.LinAlg.scale!(A,3);end
elapsed time: 0.333357126 seconds (0 bytes allocated)

It's smaller for the large array if I disable_threaded_libs() but still there.

@simonster
Copy link
Member

Unfortunately JuliaLang/julia#8452 doesn't seem to have made a difference here. What does help is defining:

function generic_scale!(X::AbstractArray, s::Number)
    for i = 1:length(X)
        @inbounds X[i] = X[i]*s
    end
    X
end

Apparently we can generate more efficient code if we know that the input and output are the same array pointer, although there's still a gap:

julia> A=randn(10^9);

julia> @time Base.LinAlg.generic_scale!(A,3);
elapsed time: 0.758208847 seconds (80 bytes allocated)

julia> @time Base.LinAlg.generic_scale!(A,3);
elapsed time: 0.757586562 seconds (80 bytes allocated)

julia> A=randn(2^14);

julia> @time for i=1:100000;Base.LinAlg.generic_scale!(A,3);end
elapsed time: 0.420178572 seconds (0 bytes allocated)

julia> @time for i=1:100000;Base.LinAlg.generic_scale!(A,3);end
elapsed time: 0.421258876 seconds (0 bytes allocated)

For a fairer comparison, with disable_threaded_libs():

julia> A=randn(10^9);

julia> @time Base.LinAlg.scale!(A,3);
elapsed time: 0.690413984 seconds (80 bytes allocated)

julia> @time Base.LinAlg.scale!(A,3);
elapsed time: 0.701861179 seconds (80 bytes allocated)

julia> A=randn(2^14);

julia> @time for i=1:100000;Base.LinAlg.scale!(A,3);end
elapsed time: 0.325523797 seconds (0 bytes allocated)

julia> @time for i=1:100000;Base.LinAlg.scale!(A,3);end
elapsed time: 0.31559956 seconds (0 bytes allocated)

The difference is smaller, but there still is one. It might be interesting to try this with LLVM SVN.

@JeffBezanson
Copy link
Member

We should add that definition. We're handicapping ourselves against BLAS here.

@andreasnoack
Copy link
Member

The difference between the generic version and BLAS varies much between machines. On my MacBook

julia> @time Base.LinAlg.generic_scale!(A,3);
elapsed time: 0.966372654 seconds (80 bytes allocated)

julia> @time Base.LinAlg.scale!(A,3);
elapsed time: 0.777787976 seconds (80 bytes allocated)

and on julia.mit.edu

julia> @time Base.LinAlg.generic_scale!(A,3);
elapsed time: 2.947719677 seconds (80 bytes allocated)

julia> @time Base.LinAlg.scale!(A,3);
elapsed time: 0.785620924 seconds (80 bytes allocated)

@nwh
Copy link

nwh commented Sep 26, 2014

@andreasnoack How would the solution I proposed above affect your use case for StridedArray? I'd be happy to make the changes and submit a pull request. Cheers!

@andreasnoack
Copy link
Member

Your proposal is good. Please open a pull request. I guess that it is only necessary to define the "second" method for StridedVector as the "first" method covers Array, right?

@nwh
Copy link

nwh commented Sep 26, 2014

The second and third methods:

        function scal!(DA::$elty, DX::Union(StridedVector{$elty},Array{$elty}))
            scal!(length(DX), DA, pointer(DX), stride(DX,1))
        end
        function scal(DA::$elty, DX::Union(StridedVector{$elty},Array{$elty}))
            scal!(DA,copy(DX))
        end

compute the length and stride from the inputs. Array is present in case somebody wants to scale all elements of an entire (possibly multi-dimensional) contiguous array.

Question: is it always true that stride(A,1) == 1 when A is an Array?

@andreasnoack
Copy link
Member

Yes.

@nwh
Copy link

nwh commented Nov 25, 2014

@jiahao this issue can be closed, a fix was merged: JuliaLang/julia#9141

@KristofferC KristofferC transferred this issue from JuliaLang/julia Nov 26, 2024
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

7 participants