Use NNlib.bias_act! #2327

Merged: 4 commits, Nov 8, 2024

mcabbott commented Sep 4, 2023

Uses FluxML/NNlib.jl#457 to speed things up and save memory: up to half the memory for a forward pass. The largest savings in the gradient will be for large batch sizes, and for activation functions like identity, relu, and tanh, whose input need not be stored.
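To illustrate where the forward-pass saving comes from, here is a minimal sketch in plain Julia. It assumes nothing about NNlib's internals: `bias_act_sketch!` and `relu_sketch` are hypothetical stand-ins, and the only point is that the freshly allocated result of `W * v` can be overwritten instead of allocating a second array for `σ.(W * v .+ b)`.

```julia
# Hypothetical stand-ins for illustration; this is not NNlib's implementation.
relu_sketch(v) = max(v, zero(v))

# Overwrite `x` with σ.(x .+ b) using one fused, in-place broadcast.
function bias_act_sketch!(σ, x::AbstractArray, b::AbstractVector)
    x .= σ.(x .+ b)
    return x
end

W = randn(Float32, 8, 4)
v = randn(Float32, 4, 16)
b = randn(Float32, 8)

# Old pattern: W * v allocates one array, the broadcast allocates a second.
expected = relu_sketch.(W * v .+ b)

# New pattern: W * v is the only large allocation, and it is reused for the output.
y = W * v
out = bias_act_sketch!(relu_sketch, y, b)
```

Here `out` is the same array as `y`, so this is only safe when nothing else needs the pre-activation values.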

julia> lenet = Chain(  # from the model zoo
           Conv((5, 5), 1=>6, relu),
           MaxPool((2, 2)),
           Conv((5, 5), 6=>16, relu),
           MaxPool((2, 2)),
           Flux.flatten,
           Dense(256 => 120, relu),
           Dense(120 => 84, relu), 
           Dense(84 => 10),
       );

julia> img = rand32(28, 28, 1, 128);

julia> @btime $lenet($img);
  min 867.875 μs, mean 1.434 ms (160 allocations, 5.60 MiB)  # before
  min 831.500 μs, mean 1.100 ms (149 allocations, 3.31 MiB)  # after

julia> @btime gradient(m -> sum(abs2, m($img)), $lenet);
  min 7.128 ms, mean 10.280 ms (567 allocations, 14.19 MiB)  # before
  min 6.296 ms, mean 6.930 ms (546 allocations, 9.61 MiB)  # after
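
The remark that some activations do not need their input stored can be made concrete: for tanh and relu the derivative can be written purely in terms of the output. The sketch below uses hypothetical helper names (`dtanh_from_output`, `drelu_from_output`) and only standard calculus identities; it is not taken from NNlib.

```julia
# Derivatives expressed through the output y = σ(x), so x can be discarded.
# Helper names are hypothetical, chosen for this illustration only.
dtanh_from_output(y) = 1 - y^2                     # d/dx tanh(x) = 1 - tanh(x)^2
drelu_from_output(y) = y > 0 ? one(y) : zero(y)    # d/dx relu(x) = 1 where the output is positive

x = collect(range(-2f0, 2f0; length=9))

y_tanh = tanh.(x)
y_relu = max.(x, 0f0)

# Reference derivatives computed from the input, for comparison.
ref_tanh = 1 ./ cosh.(x) .^ 2
ref_relu = Float32.(x .> 0)

grad_tanh = dtanh_from_output.(y_tanh)
grad_relu = drelu_from_output.(y_relu)
```

An activation whose derivative cannot be recovered from its output would still need its input kept for the backward pass, which is why the saving depends on the activation.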

Closes #2151 which I forgot about.

Edit: now also tested with Enzyme, for which there is no special code. It is able to understand the mutation, and benefits slightly. (Why it's slower than Zygote here I don't know; that's EnzymeAD/Enzyme.jl#2069, which is an orthogonal question.)

julia> @btime $lenet($img);
  min 655.583 μs, mean 1.107 ms (160 allocations, 5.60 MiB)  # before
  min 628.458 μs, mean 836.427 μs (149 allocations, 3.31 MiB)  # after

julia> @btime Flux.gradient((m,x) -> sum(abs2, m(x)), $lenet, $img);  # Zygote, as above, different computer
  min 4.979 ms, mean 6.300 ms (558 allocations, 14.18 MiB)  # before
  min 4.759 ms, mean 5.683 ms (541 allocations, 9.61 MiB)  # after

julia> @btime Enzyme.gradient(Reverse, (m,x) -> sum(abs2, m(x)), $lenet, $img);
  min 8.347 ms, mean 9.752 ms (538 allocations, 15.42 MiB)  # before
  min 7.365 ms, mean 8.791 ms (518 allocations, 10.83 MiB)  # after

  cdims = conv_dims(c, x)
  xT = _match_eltype(c, x)
- σ.(conv(xT, c.weight, cdims) .+ conv_reshape_bias(c))
+ NNlib.bias_act!(c.σ, conv(xT, c.weight, cdims), conv_reshape_bias(c))

Member

GPUCompiler doesn't like this when c.σ === sigmoid and a bias is set; see https://buildkite.com/julialang/flux-dot-jl/builds/4240#018a62b9-4aa7-4a4a-80fe-661494ca9939/351-799. It's not clear to me why Dense would be fine, given that it uses the same machinery.

Member Author

Thanks for digging. Error is on

broadcast!(::ComposedFunction{typeof(sigmoid_fast), typeof(+)}, ::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, ::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, ::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})

where ComposedFunction comes from here:

https://github.com/FluxML/NNlib.jl/blob/1b30040fabadd41efa0d9dde5841b90f9f85cf2d/src/bias_act.jl#L32-L33

Agree it's odd that Dense doesn't hit the same.
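
For readers unfamiliar with the call shape in that error, the sketch below reproduces it on the CPU with plain arrays. `sigmoid_sketch` is a hypothetical stand-in for the real activation, and this is only an illustration of broadcasting a `ComposedFunction` in place, not a copy of the linked NNlib lines.

```julia
# Hypothetical stand-in for the activation; not NNlib's sigmoid_fast.
sigmoid_sketch(v) = 1 / (1 + exp(-v))

# Composing with `+` gives one callable that adds the bias and then applies the activation.
f = sigmoid_sketch ∘ (+)

x = randn(Float32, 3, 4)
b = randn(Float32, 3)
expected = sigmoid_sketch.(x .+ b)

# Same shape as the failing call: destination, then the two inputs, with the destination reused.
broadcast!(f, x, x, b)
```

On the CPU this simply writes `sigmoid_sketch(x + b)` elementwise into `x`; the failure discussed here was specific to GPU compilation.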

Member

I can replicate this issue with just CUDA.jl and NNlib, so we should consider adding some GPU tests for bias_act! on the NNlib side. Interestingly enough, normal sigmoid works just fine, so something is strange with sigmoid_fast in particular.

Member

ToucheSir, Sep 6, 2023

Have a theory now based on more testing. sigmoid_fast also works if one removes the @inline. I think what's happening is that with the @inline, it's being inlined into the body of ComposedFunction too early and preventing ComposedFunction itself from being inlined because its body is now too complex.

Edit: confirmed with Cthulhu. Not sure what the best course of action here would be. Do we rely heavily on the @inline for CPU perf?

Member

Could always override fast_act for GPU arrays. Uglier but preserves CPU performance if there is some gain there.

Member Author

> Could always override fast_act for GPU arrays

Good point. Allowing this is precisely why fast_act takes a second argument.
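
A rough sketch of that idea follows: a function that picks an activation can dispatch on the array type passed as its second argument, so one array type can opt out of the replacement. Every name below (`pick_act`, `act`, `act_variant`, `FakeGPUArray`) is hypothetical and stands in for the real functions and GPU array types; it is not NNlib's code.

```julia
# All names are hypothetical stand-ins for illustration.
act(x) = 1 / (1 + exp(-x))
act_variant(x) = (t = exp(-abs(x)); x ≥ 0 ? 1 / (1 + t) : t / (1 + t))

# Default: leave the function alone. For `act`, swap in the variant.
pick_act(f, ::AbstractArray) = f
pick_act(::typeof(act), ::AbstractArray) = act_variant

# A toy array type playing the role of a GPU array.
struct FakeGPUArray{T,N} <: AbstractArray{T,N}
    data::Array{T,N}
end
Base.size(a::FakeGPUArray) = size(a.data)
Base.getindex(a::FakeGPUArray, i::Int...) = a.data[i...]

# Opt out of the replacement for this array type only.
pick_act(::typeof(act), ::FakeGPUArray) = act

cpu = rand(Float32, 3)
gpu = FakeGPUArray(cpu)

chosen_cpu = pick_act(act, cpu)
chosen_gpu = pick_act(act, gpu)
```

The override is the more specific method, so it wins for the toy GPU type while ordinary arrays keep the replacement.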

Member

Unfortunately, it looks like this error still persists :(

Member Author

Rebased to see how it worked with Enzyme etc, but still didn't get around to fixing this error.

Can save a lot of memory but haven't seen much of a speedup out of it.

Member

Is the error solved?

Member Author

GPU tests currently pass.

Attempting to explicitly trigger this, by testing some gradients with CUDA and sigmoid, I see no errors & no wrong answers.

julia> using Flux, CUDA

julia> mlp = Chain(Flux.flatten, Dense(28^2 => 32, sigmoid), Dense(32 => 10));

julia> img = rand32(28, 28, 1, 128);

julia> lenet = Chain(  # from the model zoo
           Conv((5, 5), 1=>6, sigmoid),
           MaxPool((2, 2)),
           Conv((5, 5), 6=>16, sigmoid),
           MaxPool((2, 2)),
           Flux.flatten,
           Dense(256 => 120, sigmoid),
           Dense(120 => 84, sigmoid), 
           Dense(84 => 10),
       );

julia> Flux.gradient((m,x) -> sum(abs2, m(x)), mlp, img)[1].layers[2].bias[1:3]
3-element Vector{Float32}:
 41.608467
 20.979347
  2.015152

julia> Flux.gradient((m,x) -> sum(abs2, m(x)), lenet, img)[1].layers[1].bias
6-element Vector{Float32}:
  0.9354934
 -1.4983172
 -0.6205859
 -0.6315984
  0.6592647
  1.2965859

julia> Flux.gradient((m,x) -> sum(abs2, m(x)), mlp |> cu, img |> cu)[1].layers[2].bias[1:3]
3-element CuArray{Float32, 1, CUDA.DeviceMemory}:
 41.60848
 20.979351
  2.015153

julia> Flux.gradient((m,x) -> sum(abs2, m(x)), lenet |> cu, img |> cu)[1].layers[1].bias
6-element CuArray{Float32, 1, CUDA.DeviceMemory}:
  0.93553036
 -1.498424
 -0.6206611
 -0.63131595
  0.6591014
  1.2970955

julia> @eval Flux begin  # core of this: https://github.com/FluxML/Flux.jl/pull/2327

           function (a::Dense)(x::AbstractVecOrMat)
               _size_check(a, x, 1 => size(a.weight, 2))
               xT = _match_eltype(a, x)  # fixes Float64 input, etc.
               NNlib.bias_act!(a.σ, a.weight * xT, a.bias)  # does σ.(W*x .+ b), with fast paths
           end

           function (c::Conv)(x::AbstractArray)
               _conv_size_check(c, x)
               cdims = conv_dims(c, x)
               xT = _match_eltype(c, x)
               NNlib.bias_act!(c.σ, conv(xT, c.weight, cdims), conv_reshape_bias(c))
           end

       end

julia> Flux.gradient((m,x) -> sum(abs2, m(x)), mlp, img)[1].layers[2].bias[1:3]
3-element Vector{Float32}:
 41.608467
 20.979347
  2.015152

julia> Flux.gradient((m,x) -> sum(abs2, m(x)), lenet, img)[1].layers[1].bias
6-element Vector{Float32}:
  0.9354934
 -1.4983172
 -0.6205859
 -0.6315984
  0.6592647
  1.2965859

julia> Flux.gradient((m,x) -> sum(abs2, m(x)), mlp |> cu, img |> cu)[1].layers[2].bias[1:3]
3-element CuArray{Float32, 1, CUDA.DeviceMemory}:
 41.60848
 20.979351
  2.015153

julia> Flux.gradient((m,x) -> sum(abs2, m(x)), lenet |> cu, img |> cu)[1].layers[1].bias
6-element CuArray{Float32, 1, CUDA.DeviceMemory}:
  0.93553036
 -1.498424
 -0.6206611
 -0.63131595
  0.6591014
  1.2970955
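
One way to put a number on "no wrong answers" is to compare the printed CPU and GPU gradients directly. The sketch below copies the values from the output above; `max_rel_diff` is a hypothetical helper, and the printed digits are rounded, so the result is only approximate.

```julia
# Values copied from the printed gradients above (CPU vs. GPU runs).
cpu_mlp   = Float32[41.608467, 20.979347, 2.015152]
gpu_mlp   = Float32[41.60848, 20.979351, 2.015153]
cpu_lenet = Float32[0.9354934, -1.4983172, -0.6205859, -0.6315984, 0.6592647, 1.2965859]
gpu_lenet = Float32[0.93553036, -1.498424, -0.6206611, -0.63131595, 0.6591014, 1.2970955]

# Hypothetical helper: largest elementwise relative difference.
max_rel_diff(a, b) = maximum(abs.(a .- b) ./ abs.(b))

mlp_diff = max_rel_diff(cpu_mlp, gpu_mlp)
lenet_diff = max_rel_diff(cpu_lenet, gpu_lenet)
```

On these values the Dense-layer gradients agree to better than 1e-5 relative difference and the Conv-layer gradients to better than 1e-3. That is the level of agreement usually expected between Float32 results computed on different hardware, though this comparison cannot say where the difference comes from.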


codecov bot commented Nov 5, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 60.37%. Comparing base (c86580b) to head (31fd7cf).
Report is 1 commit behind head on master.

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #2327       +/-   ##
===========================================
+ Coverage   33.54%   60.37%   +26.82%     
===========================================
  Files          31       31               
  Lines        1911     1938       +27     
===========================================
+ Hits          641     1170      +529     
+ Misses       1270      768      -502     


mcabbott added this to the v0.15 milestone on Nov 6, 2024
Co-authored-by: Carlo Lucibello <carlo.lucibello@gmail.com>

mcabbott commented Nov 8, 2024

Let's do this. If it's a disaster for some reason on 0.15 we can easily revert.

mcabbott merged commit af1e5fc into FluxML:master on Nov 8, 2024
19 of 21 checks passed
mcabbott deleted the bias_act branch on November 8, 2024 at 03:48