
More fall backs to parent working and AbstractArray2 #120

Merged: 11 commits into master on Feb 17, 2021
Conversation

Tokazama
Member

@Tokazama Tokazama commented Feb 9, 2021

Originally this was supposed to be a simple proposal for AbstractArray2, but I inevitably found a bunch of little things that needed to be fixed along the way.

  1. Reorganizing files: pulled stuff out of "stridelayout.jl", and "dimensions.jl" into "axes.jl" and "size.jl", b/c it was all kind of mixed up. I'm completely fine with doing something else, but I just wanted to get all like methods next to each other so they are easier to track.
  2. Just using the parent trait inheritance pattern on size, axes, and indexing methods.
  3. AbstractArray2: Although we've designed things to work outside of a type hierarchy, we still need to manually redirect dispatch from Base to ArrayInterface in a lot of cases. This makes that easier.

I'm completely open to suggestions, so if you think I should change how stuff is named, organized, or just dropping some feature, feel free to be vocal.
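As a rough illustration of item 3, the redirect pattern might look like the following minimal sketch (the type names and the `interface_size` helper here are hypothetical stand-ins, not ArrayInterface's actual definitions):

```julia
# Hypothetical sketch: one abstract type that forwards Base queries to the
# trait-based implementation, so wrapper types only opt in once by subtyping.
abstract type MyAbstractArray2{T,N} <: AbstractArray{T,N} end

# stand-in for ArrayInterface.size, which would consult traits / the parent
interface_size(A) = size(parent(A))

# the one-time redirect from Base to the trait implementation
Base.size(A::MyAbstractArray2) = interface_size(A)

struct Wrapper{T,N,P<:AbstractArray{T,N}} <: MyAbstractArray2{T,N}
    parent::P
end
Base.parent(w::Wrapper) = w.parent
Base.getindex(w::Wrapper, i::Int...) = parent(w)[i...]

size(Wrapper(ones(2, 3)))  # (2, 3), routed through interface_size
```

Any new wrapper that subtypes the abstract type gets the Base methods for free, instead of each package writing its own `Base.size`/`Base.axes`/`Base.strides` forwarders.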

@codecov

codecov bot commented Feb 9, 2021

Codecov Report

Merging #120 (a06a1f5) into master (ee80672) will increase coverage by 0.04%.
The diff coverage is 81.03%.


@@            Coverage Diff             @@
##           master     #120      +/-   ##
==========================================
+ Coverage   85.86%   85.90%   +0.04%     
==========================================
  Files           8       10       +2     
  Lines        1528     1618      +90     
==========================================
+ Hits         1312     1390      +78     
- Misses        216      228      +12     
Impacted Files           Coverage   Δ
src/ranges.jl            91.48%     <47.05%> (ø)
src/indexing.jl          88.55%     <54.54%> (-0.48%) ⬇️
src/axes.jl              66.66%     <66.66%> (ø)
src/size.jl              80.48%     <80.48%> (ø)
src/ArrayInterface.jl    85.58%     <88.00%> (+0.72%) ⬆️
src/dimensions.jl        95.23%     <92.06%> (+3.16%) ⬆️
src/stridelayout.jl      87.14%     <96.72%> (+5.36%) ⬆️
src/static.jl            86.41%     <100.00%> (+0.52%) ⬆️
... and 3 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@Tokazama
Member Author

I'm trying to improve coverage still, but having problems with ReshapedArray b/c it drops all static size information and is also used by ReinterpretArray.

@chriselrod
Collaborator

chriselrod commented Feb 10, 2021

I'm trying to improve coverage still, but having problems with ReshapedArray b/c it drops all static size information and is also used by ReinterpretArray.

The ReinterpretArray case is particularly important, as it's essential for efficiently going from an array of structs to a struct of arrays.
I'm planning to add this as an optimization to LoopVectorization soon.

E.g. if you have an array of structs, you can do a reshaped-reinterpret

julia> using VectorizationBase

julia> colors = [(R = rand(Float32), G = rand(Float32), B = rand(Float32)) for _ in 1:200];

julia> colormat = reinterpret(reshape, Float32, colors)
3×200 reinterpret(reshape, Float32, ::Vector{NamedTuple{(:R, :G, :B), Tuple{Float32, Float32, Float32}}}) with eltype Float32:
 0.533352  0.27818   0.185949  0.178599  0.942     0.52459    0.433129  0.973397  0.119951  0.320551    0.266196  0.555827  0.211765   0.547991  0.423878  0.614966  0.292452   0.650146   0.412988
 0.103244  0.26153   0.471125  0.162027  0.629155  0.0509689  0.717035  0.853053  0.811974  0.199296     0.544779  0.968282  0.213631   0.969433  0.33376   0.722512  0.287656   0.40756    0.319273
 0.335378  0.912518  0.8327    0.896587  0.424958  0.964567   0.346463  0.883965  0.82452   0.807947     0.984961  0.765952  0.0311273  0.902841  0.635449  0.420161  0.0574269  0.0528734  0.335193

And then using the Unroll type (LoopVectorization will eventually automate this):

help?> VectorizationBase.Unroll
  Unroll{AU,F,N,AV,W,M,X}(i::I)

    •  AU: Unrolled axis

    •  F: Factor, step size per unroll. If AU == AV, F == W means successive loads. 1 would mean offset by 1, e.g. x[1:8], x[2:9], and x[3:10].

    •  N: How many times is it unrolled

    •  AV: Vectorized axis # 0 means not vectorized, some sort of reduction

    •  W: vector width

    •  M: bitmask indicating whether each factor is masked

    •  X: stride between loads of vectors along axis AV.

    •  i::I - index

julia> vload(stridedpointer(colormat), VectorizationBase.Unroll{1,1,3,2,16}((1,1)))
3 x Vec{16, Float32}
Vec{16, Float32}<0.5333525f0, 0.27818024f0, 0.18594885f0, 0.17859852f0, 0.94200015f0, 0.52458966f0, 0.4331286f0, 0.97339725f0, 0.11995125f0, 0.32055116f0, 0.44533563f0, 0.83760893f0, 0.045946956f0, 0.59923065f0, 0.99825144f0, 0.013801932f0>
Vec{16, Float32}<0.103244424f0, 0.26153028f0, 0.47112453f0, 0.16202688f0, 0.62915516f0, 0.050968885f0, 0.7170353f0, 0.85305285f0, 0.81197417f0, 0.19929564f0, 0.7722609f0, 0.026382446f0, 0.9953768f0, 0.03576827f0, 0.44099808f0, 0.33057177f0>
Vec{16, Float32}<0.33537757f0, 0.9125179f0, 0.8326999f0, 0.89658654f0, 0.42495847f0, 0.96456707f0, 0.34646308f0, 0.88396525f0, 0.82452047f0, 0.8079473f0, 0.67117286f0, 0.53255033f0, 0.27501118f0, 0.8738842f0, 0.6906295f0, 0.44903874f0>

You can load vectors of colors; the three vectors above are R, G, and B. The reason the static strides are essential for making this efficient is that they let you avoid slow gather instructions, and instead just do 3 consecutive loads, and then permute the data into the correct order:

# julia> @code_native syntax=:intel debuginfo=:none vload(stridedpointer(colormat), VectorizationBase.Unroll{1,1,3,2,16}((1,1)))
        .text
        mov     rax, qword ptr [rdx]
        mov     rcx, qword ptr [rdx + 8]
        lea     rcx, [rcx + 2*rcx]
        shl     rcx, 2
        lea     rax, [rcx + 4*rax]
        mov     rcx, qword ptr [rsi]
        vmovups zmm0, zmmword ptr [rcx + rax - 16]
        vmovups zmm1, zmmword ptr [rcx + rax + 48]
        vmovups zmm2, zmmword ptr [rcx + rax + 112]
        movabs  rax, offset .rodata
        vmovaps zmm3, zmmword ptr [rax]
        mov     rax, rdi
        vpermi2ps       zmm3, zmm0, zmm1
        movabs  rcx, 139882633240320
        vmovaps zmm4, zmmword ptr [rcx]
        vpermi2ps       zmm4, zmm3, zmm2
        movabs  rcx, 139882633240384
        vmovaps zmm3, zmmword ptr [rcx]
        movabs  rcx, 139882633240448
        vmovaps zmm5, zmmword ptr [rcx]
        vpermi2ps       zmm3, zmm1, zmm0
        vpermi2ps       zmm5, zmm3, zmm2
        movabs  rcx, offset .rodata.cst32
        vbroadcasti64x4 zmm3, ymmword ptr [rcx] # zmm3 = mem[0,1,2,3,0,1,2,3]
        vpermd  zmm2, zmm3, zmm2
        movabs  rcx, 139882633240512
        vmovaps zmm3, zmmword ptr [rcx]
        vpermi2ps       zmm3, zmm0, zmm1
        mov     cl, -32
        kmovd   k1, ecx
        vmovapd zmm3 {k1}, zmm2
        vmovaps zmmword ptr [rdi], zmm4
        vmovaps zmmword ptr [rdi + 64], zmm5
        vmovapd zmmword ptr [rdi + 128], zmm3
        vzeroupper
        ret
        nop     word ptr cs:[rax + rax]

Note the performance gain static sizing gives us:

julia> @benchmark vload(stridedpointer($colormat), VectorizationBase.Unroll{1,1,3,2,16}((1,1)))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.460 ns (0.00% GC)
  median time:      2.955 ns (0.00% GC)
  mean time:        2.984 ns (0.00% GC)
  maximum time:     20.861 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> colormatrix = copy(colormat); typeof(colormatrix)
Matrix{Float32} (alias for Array{Float32, 2})

julia> @benchmark vload(stridedpointer($colormatrix), VectorizationBase.Unroll{1,1,3,2,16}((1,1)))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     10.010 ns (0.00% GC)
  median time:      10.189 ns (0.00% GC)
  mean time:        10.417 ns (0.00% GC)
  maximum time:     35.157 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

The copy/Matrix{Float32} lost this information, forcing it to use gathers.
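The information loss is visible in plain Julia, without VectorizationBase (array contents here are random, so only the types and sizes matter):

```julia
# The reinterpreted view knows its leading dimension from the element types
# alone; the copy collapses to a plain Matrix and forgets it.
colors = [(R = rand(Float32), G = rand(Float32), B = rand(Float32)) for _ in 1:8]

cm = reinterpret(reshape, Float32, colors)
size(cm)        # (3, 8): the 3 is derivable from NamedTuple{(:R,:G,:B),...}

cmcopy = copy(cm)
typeof(cmcopy)  # Matrix{Float32}: nothing in the type records the static size
```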

@chriselrod
Collaborator

I think using abstract types as an easy opt-in is fine, because we can't make Base methods use our traits.

Splitting things into more files is fine too, but I'll have to take a deeper look at this PR later.

Base.axes(A::AbstractArray2, dim) = ArrayInterface.axes(A, dim)

Base.strides(A::AbstractArray2) = ArrayInterface.strides(A)
Base.strides(A::AbstractArray2, dim) = ArrayInterface.strides(A, dim)
Collaborator
Should the size and stride methods convert to Int to avoid potential type instabilities/problems in generic code unaware of ArrayInterface?

Member Author

These only exist to pass things forward to the ArrayInterface implementation. But if you look at the actual implementation...

axes(a, dim) = axes(a, to_dims(a, dim))
function axes(a::A, dim::Integer) where {A}
    if parent_type(A) <: A
        return Base.axes(a, Int(dim))
    else
        return axes(parent(a), to_parent_dims(A, dim))
    end
end

...StaticInt is never passed to a method in Base but persists internally, preventing premature conversion of StaticInt to an Int and relying on constant propagation for inference.
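A stripped-down sketch of that parent-trait fallback (all names here are hypothetical; the real method also remaps `dim` via `to_parent_dims`):

```julia
# Recurse toward the parent until a type is its own parent, and only then
# fall back to Base — so StaticInt survives through every wrapper layer.
parent_type(::Type{T}) where {T} = T          # default: not a wrapper

struct MyWrapper{T,N,P<:AbstractArray{T,N}} <: AbstractArray{T,N}
    parent::P
end
Base.parent(w::MyWrapper) = w.parent
parent_type(::Type{MyWrapper{T,N,P}}) where {T,N,P} = P

function my_axes(a::A, dim::Integer) where {A}
    if parent_type(A) === A
        return Base.axes(a, Int(dim))         # leaf type: ask Base
    else
        return my_axes(parent(a), dim)        # otherwise defer to the parent
    end
end

my_axes(MyWrapper(ones(3, 4)), 2)  # Base.OneTo(4), taken from the wrapped Matrix
```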

Collaborator

@chriselrod Feb 10, 2021

If I define a statically sized AbstractArray2 using the ArrayInterface, then Base.size will suddenly return StaticInts as well.

Member Author

I thought it made sense because StaticInt is part of the public interface and we want to propagate static information, avoiding a dependence on constant propagation when dealing with deeply nested types. I'm sure we will run into some situations where methods explicitly require Int, but I also assumed that the end goal was for StaticInt to become standard.

Collaborator

My thinking was that users might assume that size returns an NTuple{N,Int}, and end up inadvertently writing code like myprod that doesn't handle heterogeneous tuples well:

julia> using ArrayInterface: StaticInt

julia> sz = (StaticInt(4), 5);

julia> @code_warntype prod(sz)
MethodInstance for prod(::Tuple{StaticInt{4}, Int64})
  from prod(x::Tuple{Any, Vararg{Any}}) in Base at tuple.jl:480
Arguments
  #self#::Core.Const(prod)
  x::Tuple{StaticInt{4}, Int64}
Body::Int64
1 ─ %1 = Core._apply_iterate(Base.iterate, Base.:*, x)::Int64
└──      return %1


julia> function myprod(x)
           p = 1
           for a ∈ x
               p *= a
           end
           p
       end
myprod (generic function with 1 method)

julia> @code_warntype myprod(sz)
MethodInstance for myprod(::Tuple{StaticInt{4}, Int64})
  from myprod(x) in Main at REPL[10]:1
Arguments
  #self#::Core.Const(myprod)
  x::Tuple{StaticInt{4}, Int64}
Locals
  @_3::Union{Nothing, Tuple{Union{StaticInt{4}, Int64}, Int64}}
  p::Int64
  a::Union{StaticInt{4}, Int64}
Body::Int64
1 ─       (p = 1)
│   %2  = x::Tuple{StaticInt{4}, Int64}
│         (@_3 = Base.iterate(%2))
│   %4  = (@_3::Core.Const((static(4), 2)) === nothing)::Core.Const(false)
│   %5  = Base.not_int(%4)::Core.Const(true)
└──       goto #4 if not %5
2 ─ %7  = @_3::Union{Tuple{StaticInt{4}, Int64}, Tuple{Int64, Int64}}
│         (a = Core.getfield(%7, 1))
│   %9  = Core.getfield(%7, 2)::Int64
│         (p = p * a)
│         (@_3 = Base.iterate(%2, %9))
│   %12 = (@_3 === nothing)::Bool
│   %13 = Base.not_int(%12)::Bool
└──       goto #4 if not %13
3 ─       goto #2
4 ─       return p


julia> @btime myprod($(Ref(sz))[])
  14.331 ns (1 allocation: 16 bytes)
20

julia> @btime prod($(Ref(sz))[])
  0.989 ns (0 allocations: 0 bytes)
20

I don't want to accidentally cause bad performance in other libraries. I have no idea how prevalent code like this is though, and would hope authors would fix it.

I also figure that if code is set up to take advantage of static size information, it could probably also take the step to use ArrayInterface.size & co instead. E.g., many of my libraries have using ArrayInterface: size, strides.

But if code is written in ignorance of these, is it more likely that they'd benefit from the forced constant prop, or that they'd be hurt by the need for constant prop to ensure type stability?

As is, I have definitions like here:

@inline Base.size(A::AbstractStrideArray) = map(Int, size(A))
@inline Base.strides(A::AbstractStrideArray) = map(Int, strides(A))

@inline create_axis(s, ::Zero) = CloseOpen(s)
@inline create_axis(s, ::One) = One():s
@inline create_axis(s, o) = CloseOpen(s, s+o)

@inline ArrayInterface.axes(A::AbstractStrideArray) = map(create_axis, size(A), offsets(A))
@inline Base.axes(A::AbstractStrideArray) = axes(A)

@inline ArrayInterface.offsets(A::PtrArray) = getfield(getfield(A, :ptr), :offsets)
@inline ArrayInterface.static_length(A::AbstractStrideArray) = prod(size(A))

# type stable, because index known at compile time
@inline type_stable_select(t::NTuple, ::StaticInt{N}) where {N} = t[N]
@inline type_stable_select(t::Tuple, ::StaticInt{N}) where {N} = t[N]
# type stable, because tuple is homogeneous
@inline type_stable_select(t::NTuple, i::Integer) = t[i]
# make the tuple homogeneous before indexing
@inline type_stable_select(t::Tuple, i::Integer) = map(Int, t)[i]

@inline function ArrayInterface.axes(A::AbstractStrideVector, i::Integer)
    if i == 1
        o = type_stable_select(offsets(A), i)
        s = type_stable_select(size(A), i)
        return create_axis(s, o)
    else
        return One():1
    end
end
@inline function ArrayInterface.axes(A::AbstractStrideArray, i::Integer)
    o = type_stable_select(offsets(A), i)
    s = type_stable_select(size(A), i)
    create_axis(s, o)
end
@inline Base.axes(A::AbstractStrideArray, i::Integer) = axes(A, i)

Meant to try and avoid placing too many performance foot-guns about.

Member Author

Thanks for the examples Chris. I think it nicely illustrates the problems with returning StaticInt in the wild.

It will undoubtedly cause problems introducing StaticInt into the public space, although StaticArrays already does this and returns SOneTo for its axes. Perhaps another way of approaching this is asking ourselves if we really do plan for something like StaticInt to be passed around in the future. If we do then at what point do we start pushing people to make that adjustment? Having to move back and forth between static and dynamic space so that users don't have to learn about using things like NTuple{N} vs Tuple{Vararg{Any,N}} seems like a mistake to me, but so does country music and I'm not going to take that away from people either.

@timholy, you had some valuable input when we made StaticInt. Any opinions?

Member Author

I made it so that strides and size will return Int types. Returning statically sized axes seems to work for StaticArrays so I left that in. I think this also makes it so that new subtypes of AbstractArray2 can just define axes and axes_types and ArrayInterface.size and ArrayInterface.strides can catch any static info.

Implemented a generated version of Base.size_to_strides so that we don't
have to rely on base definitions. Makes it so we don't rely on
Base.stride to reconstruct strides. Does NOT change any test results.

Some dimension-specific methods have been using fxn(A, dim) =
fxn(A)[dim]. This doesn't work b/c technically there are infinite
dimensions past ndims(A), but they are of size 1. So the general pattern
for a static dim method is:

- put in type domain
`fxn(x, dim) = fxn(typeof(x), dim)`
- ensure proper dims becomes `StaticInt` or `Int`
`fxn(::Type{T}, dim) where {T} = fxn(T, to_dims(T, dim))`
- execute function
```
function fxn(::Type{T}, dim::Integer) where {T}
    if ndims(T) < dim
        return ndim_plus_default
    else
        return fxn(T)[dim]
    end
end
```
When executing on instances it's a similar pattern, but we don't need to
worry about putting everything in the type domain.
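A concrete instance of the pattern, using plain Base types (`my_size` is a hypothetical name, not a method in the PR):

```julia
# Trailing dimensions report size 1 instead of erroring when the size tuple
# is indexed out of bounds — the trap `fxn(A)[dim]` falls into.
function my_size(A::AbstractArray, dim::Integer)
    if ndims(A) < dim
        return 1                  # every dimension past ndims(A) has size 1
    else
        return size(A)[dim]
    end
end

my_size(ones(2, 3), 2)  # 3
my_size(ones(2, 3), 5)  # 1, whereas size(A)[5] would throw a BoundsError
```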
Realized that this method basically checks if strides can still be
derived after indexing, which is what we need for
defines_strides(::SubArray).
@Tokazama
Member Author

@chriselrod , I think this PR is in a pretty good state for a serious review now. If people ultimately decide AbstractArray2 is a terrible idea it will still probably be useful to include some form of the Wrapper type in the tests. It helped uncover a lot of bugs where traits weren't propagating properly between layers.

Collaborator

@chriselrod left a comment

Looks good.

end
end

function _reshaped_axis_type(::Type{R}, dim::StaticInt) where {T,N,S,A,R<:ReinterpretArray{T,N,S,A}}
Collaborator

I don't know at what point inference gives up and decides it isn't worth specializing, but I've run into problems with pass-throughs like dim::StaticInt instead of ::StaticInt{dim} to force specialization.

Member Author

Do you recall if it was sufficient to just add the static parameter like this

function _reshaped_axis_type(::Type{R}, dim::StaticInt{D}) where {T,N,S,A,R<:ReinterpretArray{T,N,S,A},D}

Or do we need to call StaticInt(D) in between each pass?

Collaborator

I don't recall, but suspect it is enough.

Collaborator

We can handle that later. I'm making a PR to add support for BitArrays, and it'd be nice to avoid merge conflicts.
If we want to address things like dim::StaticInt (which should be failing @inferred tests if it was causing problems), we can do that in another PR.

I saw problems when these went through a chain of calls. I don't know what the heuristics are, but it's probably fine below a certain depth.
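The distinction can be sketched with a toy static integer (`ToyStatic` is hypothetical, independent of ArrayInterface's StaticInt): putting the value in a method type parameter guarantees a distinct specialization per value, which a bare `dim::StaticInt` argument may not get through a deep chain of pass-through calls.

```julia
struct ToyStatic{N} end                 # toy stand-in for StaticInt
ToyStatic(N::Int) = ToyStatic{N}()

# `::ToyStatic{D} where {D}` exposes the value to the compiler directly;
# `dim::ToyStatic` relies on Julia choosing to specialize the pass-through.
unwrap(::ToyStatic{D}) where {D} = D
passthrough(dim::ToyStatic) = unwrap(dim)

passthrough(ToyStatic(3))  # 3
```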

@chriselrod
Collaborator

chriselrod commented Feb 17, 2021

Actually, checking out this PR, AbstractArray2 doesn't agree with LoopVectorization:

julia> using ArrayInterface: static

julia> using ArrayInterface, VectorizationBase, LoopVectorization

julia> struct NamedDimsWrapper{L,T,N,P<:AbstractArray{T,N}} <: ArrayInterface.AbstractArray2{T,N}
           parent::P
           NamedDimsWrapper{L}(p) where {L} = new{L,eltype(p),ndims(p),typeof(p)}(p)
       end

julia> ArrayInterface.parent_type(::Type{T}) where {P,T<:NamedDimsWrapper{<:Any,<:Any,<:Any,P}} = P

julia> ArrayInterface.has_dimnames(::Type{T}) where {T<:NamedDimsWrapper} = true

julia> ArrayInterface.dimnames(::Type{T}) where {L,T<:NamedDimsWrapper{L}} = static(Val(L))

julia> function ArrayInterface.dimnames(::Type{T}, dim) where {L,T<:NamedDimsWrapper{L}}
           if ndims(T) < dim
               return static(:_)
           else
               return static(L[dim])
           end
       end

julia> ArrayInterface.has_dimnames(::Type{T}) where {T<:NamedDimsWrapper} = true

julia> Base.parent(x::NamedDimsWrapper) = x.parent

julia> d = (static(:x), static(:y))
(static(:x), static(:y))

julia> x = NamedDimsWrapper{d}(ones(2,2));

julia> ArrayInterface.device(x)
ArrayInterface.CPUPointer()

julia> pointer(x)
ERROR: conversion to pointer not defined for NamedDimsWrapper{(static(:x), static(:y)), Float64, 2, Matrix{Float64}}
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] unsafe_convert(#unused#::Type{Ptr{Float64}}, a::NamedDimsWrapper{(static(:x), static(:y)), Float64, 2, Matrix{Float64}})
   @ Base ./pointer.jl:67
 [3] pointer(x::NamedDimsWrapper{(static(:x), static(:y)), Float64, 2, Matrix{Float64}})
   @ Base ./abstractarray.jl:1116
 [4] top-level scope
   @ REPL[13]:1

julia> stridedpointer(x)
ERROR: conversion to pointer not defined for NamedDimsWrapper{(static(:x), static(:y)), Float64, 2, Matrix{Float64}}
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] unsafe_convert(#unused#::Type{Ptr{Float64}}, a::NamedDimsWrapper{(static(:x), static(:y)), Float64, 2, Matrix{Float64}})
   @ Base ./pointer.jl:67
 [3] pointer(x::NamedDimsWrapper{(static(:x), static(:y)), Float64, 2, Matrix{Float64}})
   @ Base ./abstractarray.jl:1116
 [4] memory_reference
   @ ~/.julia/dev/VectorizationBase/src/strided_pointers/stridedpointers.jl:36 [inlined]
 [5] memory_reference
   @ ~/.julia/dev/VectorizationBase/src/strided_pointers/stridedpointers.jl:35 [inlined]
 [6] stridedpointer(A::NamedDimsWrapper{(static(:x), static(:y)), Float64, 2, Matrix{Float64}})
   @ VectorizationBase ~/.julia/dev/VectorizationBase/src/strided_pointers/stridedpointers.jl:71
 [7] top-level scope
   @ REPL[14]:1

julia> LoopVectorization.check_args(x)
true

ArrayInterface.device shouldn't return CPUPointer unless pointer and the other methods to make stridedpointer work are actually defined correctly.

@chriselrod
Collaborator

chriselrod commented Feb 17, 2021

Maybe we should get rid of:

function device(::Type{T}) where {T<:AbstractArray}
    P = parent_type(T)
    T === P ? CPUIndex() : device(P)
end

and just return CPUIndex().

This problem isn't specific to AbstractArray2, but to the interface methods more generally:
They should be consistent in either looking to the parent_type, or not doing so.
Doing so is convenient. Not doing so is safe. The current behavior is neither and just does not work.
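The two fallback styles can be contrasted in a toy sketch (all `Toy`-prefixed names are hypothetical): the "convenient" version recurses to the parent and so claims CPUPointer for anything wrapping an Array, whether or not the wrapper actually forwards `pointer`; the "safe" version only claims it for types that explicitly opted in.

```julia
abstract type ToyDevice end
struct ToyCPUPointer <: ToyDevice end
struct ToyCPUIndex   <: ToyDevice end

toy_parent_type(::Type{T}) where {T} = T
toy_parent_type(::Type{SubArray{T,N,P,I,L}}) where {T,N,P,I,L} = P

# "convenient": recurse to the parent — wrappers inherit CPUPointer even if
# they never defined `pointer`/`unsafe_convert` (the bug described above)
toy_device(::Type{<:Array}) = ToyCPUPointer()
function toy_device(::Type{T}) where {T}
    P = toy_parent_type(T)
    T === P ? ToyCPUIndex() : toy_device(P)
end

# "safe": no recursion — unknown types get the conservative answer
toy_device_safe(::Type{<:Array}) = ToyCPUPointer()
toy_device_safe(::Type) = ToyCPUIndex()

v = view(ones(4), 1:2)
toy_device(typeof(v))       # ToyCPUPointer(): inherited, possibly wrongly
toy_device_safe(typeof(v))  # ToyCPUIndex(): safe, possibly pessimistic
```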

I'm good with merging this though, and making this consistent in another PR.

@chriselrod chriselrod merged commit baccd56 into master Feb 17, 2021
@Tokazama Tokazama deleted the abstract-type branch February 17, 2021 15:11
@Tokazama Tokazama mentioned this pull request Feb 17, 2021