Use NNlib.bias_act! #2327

Merged: 4 commits, Nov 8, 2024
1 change: 1 addition & 0 deletions NEWS.md
@@ -11,6 +11,7 @@ See also [github's page](https://github.com/FluxML/Flux.jl/releases) for a compl
 * The `Flux.Optimise` module has been deprecated in favor of the Optimisers.jl package.
 Now Flux re-exports the optimisers from Optimisers.jl. Most users will be uneffected by this change.
 The module is still available for now, but will be removed in a future release.
+* Most Flux layers will [re-use memory via `NNlib.bias_act!`](https://github.com/FluxML/Flux.jl/pull/2327), when possible.

 ## v0.14.22
 * Data movement between devices is now provided by [MLDataDevices.jl](https://github.com/LuxDL/MLDataDevices.jl).
5 changes: 2 additions & 3 deletions src/layers/basic.jl
@@ -186,9 +186,8 @@ end

 function (a::Dense)(x::AbstractVecOrMat)
   _size_check(a, x, 1 => size(a.weight, 2))
-  σ = NNlib.fast_act(a.σ, x)  # replaces tanh => tanh_fast, etc
   xT = _match_eltype(a, x)  # fixes Float64 input, etc.
-  return σ.(a.weight * xT .+ a.bias)
+  return NNlib.bias_act!(a.σ, a.weight * xT, a.bias)  # does σ.(W*x .+ b), with fast paths
 end

 function (a::Dense)(x::AbstractArray)
@@ -466,7 +465,7 @@ function (a::Bilinear)(x::AbstractMatrix, y::AbstractMatrix)
   Z = reshape(Wyx, (d_z, :))

   # @einsum out[o,s] := σ(Z[o,i] + b[o])
-  σ.(Z .+ b)
+  NNlib.bias_act!(σ, Z, b)  # σ.(Z .+ b)
 end

 (a::Bilinear)(x::AbstractVecOrMat) = a(x, x)
9 changes: 3 additions & 6 deletions src/layers/conv.jl
@@ -196,10 +196,9 @@ ChainRulesCore.@non_differentiable conv_dims(::Any, ::Any)

 function (c::Conv)(x::AbstractArray)
   _conv_size_check(c, x)
-  σ = NNlib.fast_act(c.σ, x)
   cdims = conv_dims(c, x)
   xT = _match_eltype(c, x)
-  σ.(conv(xT, c.weight, cdims) .+ conv_reshape_bias(c))
+  NNlib.bias_act!(c.σ, conv(xT, c.weight, cdims), conv_reshape_bias(c))

Member

GPUCompiler doesn't like this when c.σ === sigmoid and a bias is set, https://buildkite.com/julialang/flux-dot-jl/builds/4240#018a62b9-4aa7-4a4a-80fe-661494ca9939/351-799. It's not clear to me why Dense would be fine given it uses the same machinery.

Member Author

Thanks for digging. Error is on

broadcast!(::ComposedFunction{typeof(sigmoid_fast), typeof(+)}, ::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, ::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, ::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})

where ComposedFunction comes from here:

https://github.com/FluxML/NNlib.jl/blob/1b30040fabadd41efa0d9dde5841b90f9f85cf2d/src/bias_act.jl#L32-L33

Agree it's odd that Dense doesn't hit the same.
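
The `ComposedFunction{typeof(sigmoid_fast), typeof(+)}` in that signature is the type Julia produces when an activation is composed with `+`. Below is a minimal sketch in plain Julia of what such a fused, in-place broadcast does. It assumes a hypothetical `my_sigmoid` standing in for `sigmoid_fast` and is not NNlib's actual code:

```julia
# Hypothetical stand-in for NNlib.sigmoid_fast (an assumption, not NNlib's definition).
my_sigmoid(x) = inv(one(x) + exp(-x))

x = Float32[0 1; -1 2; 3 -2]   # plays the role of W*x (3×2)
b = Float32[0.5, -0.5, 0.0]    # bias, one entry per row

expected = my_sigmoid.(x .+ b) # out-of-place reference result

fused = my_sigmoid ∘ (+)       # a ComposedFunction{typeof(my_sigmoid), typeof(+)}
broadcast!(fused, x, x, b)     # overwrites x with my_sigmoid.(x .+ b)

@assert x ≈ expected
```

Because the destination and the first argument are the same array, no new output array is allocated. On a GPU, the whole composed function has to compile into a single kernel, which is where the reported error shows up.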

Member

I can replicate this issue with just CUDA.jl and NNlib, so we should consider adding some GPU tests for bias_act! on the NNlib side. Interestingly enough normal sigmoid works just fine, so something is strange with sigmoid_fast in particular.

Member (ToucheSir, Sep 6, 2023)

Have a theory now based on more testing. sigmoid_fast also works if one removes the @inline. I think what's happening is that with the @inline, it's being inlined into the body of ComposedFunction too early and preventing ComposedFunction itself from being inlined because its body is now too complex.

Edit: confirmed with Cthulhu. Not sure what the best course of action here would be. Do we rely heavily on the @inline for CPU perf?

Member

Could always override fast_act for GPU arrays. Uglier but preserves CPU performance if there is some gain there.

Member Author

> Could always override fast_act for GPU arrays

Good point. Allowing this is precisely why fast_act takes a second argument.
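
A rough sketch of how a second array argument makes such an override possible through dispatch. Every name here (`pick_act`, `slow_tanh`, `quick_tanh`, `FakeDeviceArray`) is hypothetical and only mimics the pattern. It is not the NNlib or CUDA.jl API:

```julia
# Hypothetical activations; quick_tanh stands in for a faster approximation.
slow_tanh(x) = tanh(x)
quick_tanh(x) = tanh(x)

# Toy array type playing the role of a GPU array (an assumption for illustration).
struct FakeDeviceArray{T,N} <: AbstractArray{T,N}
    data::Array{T,N}
end
Base.size(a::FakeDeviceArray) = size(a.data)
Base.getindex(a::FakeDeviceArray, i::Int...) = a.data[i...]

# Default: leave the activation unchanged.
pick_act(f, ::AbstractArray) = f
# For ordinary arrays, swap in the fast variant.
pick_act(::typeof(slow_tanh), ::AbstractArray) = quick_tanh
# For the device array type, opt out and keep the plain version.
pick_act(::typeof(slow_tanh), ::FakeDeviceArray) = slow_tanh

cpu_choice = pick_act(slow_tanh, rand(Float32, 3))
dev_choice = pick_act(slow_tanh, FakeDeviceArray(rand(Float32, 3)))
```

The most specific method wins, so the device-array method takes priority without touching the CPU path.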

Member

Unfortunately, it looks like this error still persists :(

Member Author

Rebased to see how it worked with Enzyme etc, but still didn't get around to fixing this error.

Can save a lot of memory but haven't seen much of a speedup out of it.
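
To illustrate where the memory saving comes from, here is a small sketch in plain Julia comparing an out-of-place broadcast with one that overwrites its input. The names `my_relu`, `out_of_place` and `in_place!` are hypothetical, and this is not NNlib's implementation:

```julia
my_relu(x) = max(x, zero(x))                    # hypothetical local activation

out_of_place(x, b) = my_relu.(x .+ b)           # allocates a new result array
in_place!(x, b) = (x .= my_relu.(x .+ b); x)    # reuses x's memory

x = randn(Float32, 512, 512)
b = randn(Float32, 512)

# Warm up both functions so that compilation is not counted.
out_of_place(copy(x), b)
in_place!(copy(x), b)

y = copy(x)
bytes_new = @allocated out_of_place(x, b)
bytes_reuse = @allocated in_place!(y, b)

println("out-of-place: ", bytes_new, " bytes, in-place: ", bytes_reuse, " bytes")
```

Overwriting is only safe when the input is a temporary nobody else holds, as with the result of `W * x` inside a layer.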

Member

is the error solved?

Member Author

GPU tests currently pass.

Attempting to explicitly trigger this, by testing some gradients with CUDA and sigmoid, I see no errors & no wrong answers.

julia> using Flux, CUDA

julia> mlp = Chain(Flux.flatten, Dense(28^2 => 32, sigmoid), Dense(32 => 10));

julia> img = rand32(28, 28, 1, 128);

julia> lenet = Chain(  # from the model zoo
           Conv((5, 5), 1=>6, sigmoid),
           MaxPool((2, 2)),
           Conv((5, 5), 6=>16, sigmoid),
           MaxPool((2, 2)),
           Flux.flatten,
           Dense(256 => 120, sigmoid),
           Dense(120 => 84, sigmoid), 
           Dense(84 => 10),
       );

julia> Flux.gradient((m,x) -> sum(abs2, m(x)), mlp, img)[1].layers[2].bias[1:3]
3-element Vector{Float32}:
 41.608467
 20.979347
  2.015152

julia> Flux.gradient((m,x) -> sum(abs2, m(x)), lenet, img)[1].layers[1].bias
6-element Vector{Float32}:
  0.9354934
 -1.4983172
 -0.6205859
 -0.6315984
  0.6592647
  1.2965859

julia> Flux.gradient((m,x) -> sum(abs2, m(x)), mlp |> cu, img |> cu)[1].layers[2].bias[1:3]
3-element CuArray{Float32, 1, CUDA.DeviceMemory}:
 41.60848
 20.979351
  2.015153

julia> Flux.gradient((m,x) -> sum(abs2, m(x)), lenet |> cu, img |> cu)[1].layers[1].bias
6-element CuArray{Float32, 1, CUDA.DeviceMemory}:
  0.93553036
 -1.498424
 -0.6206611
 -0.63131595
  0.6591014
  1.2970955

julia> @eval Flux begin  # core of this: https://github.com/FluxML/Flux.jl/pull/2327

       function (a::Dense)(x::AbstractVecOrMat)
         _size_check(a, x, 1 => size(a.weight, 2))
         xT = _match_eltype(a, x)  # fixes Float64 input, etc.
         NNlib.bias_act!(a.σ, a.weight * xT, a.bias)  # does σ.(W*x .+ b), with fast paths
       end

       function (c::Conv)(x::AbstractArray)
         _conv_size_check(c, x)
         cdims = conv_dims(c, x)
         xT = _match_eltype(c, x)
         NNlib.bias_act!(c.σ, conv(xT, c.weight, cdims), conv_reshape_bias(c))
       end

       end

julia> Flux.gradient((m,x) -> sum(abs2, m(x)), mlp, img)[1].layers[2].bias[1:3]
3-element Vector{Float32}:
 41.608467
 20.979347
  2.015152

julia> Flux.gradient((m,x) -> sum(abs2, m(x)), lenet, img)[1].layers[1].bias
6-element Vector{Float32}:
  0.9354934
 -1.4983172
 -0.6205859
 -0.6315984
  0.6592647
  1.2965859

julia> Flux.gradient((m,x) -> sum(abs2, m(x)), mlp |> cu, img |> cu)[1].layers[2].bias[1:3]
3-element CuArray{Float32, 1, CUDA.DeviceMemory}:
 41.60848
 20.979351
  2.015153

julia> Flux.gradient((m,x) -> sum(abs2, m(x)), lenet |> cu, img |> cu)[1].layers[1].bias
6-element CuArray{Float32, 1, CUDA.DeviceMemory}:
  0.93553036
 -1.498424
 -0.6206611
 -0.63131595
  0.6591014
  1.2970955
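
The CPU and GPU numbers above can also be compared programmatically rather than by eye. A small sketch using the values printed in the session, where the tolerances are an assumption chosen for illustration:

```julia
# Gradients copied from the REPL session above.
cpu_mlp = Float32[41.608467, 20.979347, 2.015152]
gpu_mlp = Float32[41.60848, 20.979351, 2.015153]

cpu_lenet = Float32[0.9354934, -1.4983172, -0.6205859, -0.6315984, 0.6592647, 1.2965859]
gpu_lenet = Float32[0.93553036, -1.498424, -0.6206611, -0.63131595, 0.6591014, 1.2970955]

# Relative tolerances are assumptions; the conv model differs more between devices.
mlp_ok = isapprox(cpu_mlp, gpu_mlp; rtol=1f-5)
lenet_ok = isapprox(cpu_lenet, gpu_lenet; rtol=1f-3)

println("mlp agrees: ", mlp_ok, ", lenet agrees: ", lenet_ok)
```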

 end

 _channels_in(l::Conv) = size(l.weight, ndims(l.weight)-1) * l.groups
@@ -350,10 +349,9 @@ ChainRulesCore.@non_differentiable conv_transpose_dims(::Any, ::Any)

 function (c::ConvTranspose)(x::AbstractArray)
   _conv_size_check(c, x)
-  σ = NNlib.fast_act(c.σ, x)
   cdims = conv_transpose_dims(c, x)
   xT = _match_eltype(c, x)
-  σ.(∇conv_data(xT, c.weight, cdims) .+ conv_reshape_bias(c))
+  NNlib.bias_act!(c.σ, ∇conv_data(xT, c.weight, cdims), conv_reshape_bias(c))
 end

 function Base.show(io::IO, l::ConvTranspose)
@@ -493,10 +491,9 @@ ChainRulesCore.@non_differentiable crosscor_dims(::Any, ::Any)

 function (c::CrossCor)(x::AbstractArray)
   _conv_size_check(c, x)
-  σ = NNlib.fast_act(c.σ, x)
   cdims = crosscor_dims(c, x)
   xT = _match_eltype(c, x)
-  σ.(crosscor(xT, c.weight, cdims) .+ conv_reshape_bias(c))
+  NNlib.bias_act!(c.σ, crosscor(xT, c.weight, cdims), conv_reshape_bias(c))
 end

 function Base.show(io::IO, l::CrossCor)
2 changes: 1 addition & 1 deletion src/layers/normalise.jl
@@ -245,7 +245,7 @@ function _norm_layer_forward(
   β = reshape(l.β, affine_shape)

   scale = γ ./ sqrt.(σ² .+ eps)
-  bias = -scale .* μ .+ β
+  bias = .-scale .* μ .+ β
   l.λ.(scale .* x .+ bias)
 end
CarloLucibello marked this conversation as resolved.